Fix a Kubernetes (EKS) Datalore installation - SingleInstanceTypeAgentPool - Failed to create instance when refilling pool

Valerii · October 17, 2022, 8:14am

Hello.

I’ve installed Datalore into an EKS (AWS K8s) cluster with Helm, following these instructions:

https://www.jetbrains.com/help/datalore/install-datalore-enterprise.html
https://www.jetbrains.com/help/datalore/helm-specific-instructions.html

We’ve used the latest Helm chart version, 0.2.4.

I’ve used the following values.yaml file (sensitive values replaced with x’s):
replicaCount: 1

dataloreVersion: ""

serverImage:
  repository: jetbrains/datalore-server
  pullPolicy: IfNotPresent
postgresImage:
  repository: jetbrains/datalore-postgres
  pullPolicy: IfNotPresent
databaseCommandImage:
  repository: jetbrains/datalore-database-command
  pullPolicy: IfNotPresent

imagePullSecrets: []
nameOverride: ""
fullnameOverride: ""

serviceAccount:
  # Specifies whether a service account should be created
  create: true
  # Annotations to add to the service account
  annotations:
    eks.amazonaws.com/role-arn: "arn:aws:iam::xxxxxxxxxxxx:role/datalore"
  # The name of the service account to use.
  # If not set and create is true, a name is generated using the fullname template
  name: "datalore"

podAnnotations: {}

dataloreSecurityContext:
  runAsUser: 5000
postgresSecurityContext:
  runAsUser: 999

securityContext:
  fsGroup: 5000

service:
  type: ClusterIP
  port: 8080

computationPort: 4060

nodePorts:
  enabled: false
  httpPort: 30090
  agentsManagerPort: 30091
  computationPort: 30092
  httpInternalPort: 30093

ingress:
  enabled: true
  className: ""
  annotations:
    kubernetes.io/ingress.class: nginx
    kubernetes.io/tls-acme: "true"
    nginx.ingress.kubernetes.io/ssl-redirect: "true"
    nginx.ingress.kubernetes.io/force-ssl-redirect: "true"
  hosts:
    - host: datalore.xxxxxx.com
      paths:
        - path: /
          pathType: ImplementationSpecific
  tls:
  - secretName: datalore.xxxxxx.com-tls
    hosts:
    - datalore.xxxxxx.com

dataloreResources: {}
postgresResources: {}

volumes: []
#  - name: storage
#    emptyDir: { }
#  - name: postgresql-data
#    emptyDir: { }

volumeClaimTemplates:
- metadata:
    name: storage
  spec:
    accessModes: [ "ReadWriteOnce" ]
    storageClassName: "gp2"
    resources:
      requests:
        storage: 100Gi
- metadata:
    name: postgresql-data
  spec:
    accessModes: [ "ReadWriteOnce" ]
    storageClassName: "gp2"
    resources:
      requests:
        storage: 100Gi

nodeSelector: {}

tolerations: []

affinity: {}

dbRootPassword: "xxxxxxxxxxxxxxxxxxxx"
internalDatabase: true

sqlServerHost: ""
sqlServerPort: ""

agentsConfig: {}
connectionChecker: {}
introspection: {}
namespacesLoader: {}
plansConfig: []
logbackConfig: ""
customEnvs: {}
dataloreEnv:
  DATALORE_PUBLIC_URL: "datalore.xxxxxx.com"
  MAIL_ENABLED: "true"
  FORCE_EMAIL_VERIFICATION: "true"
  MAIL_SENDER_EMAIL: "xxxxxx@xxxxxx.com"
  MAIL_SENDER_NAME: "xxx Datalore service"
  MAIL_SENDER_USERNAME: "XXXXXXXXXXXXXXXXXXX"
  MAIL_SENDER_PASSWORD: "XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX"
  MAIL_SMTP_SERVER: "email-smtp.us-east-1.amazonaws.com"
  MAIL_SMTP_PORT: "587"

The pod starts OK, but I can see errors in the logs and it seems that Datalore is not operational.
Logs:

10:32:40.408 ERROR [Datalore EDT Manager] j.d.n.s.c.a.i.SingleInstanceTypeAgentPool - Failed to create instance when refilling pool
java.lang.RuntimeException: io.kubernetes.client.openapi.ApiException:
at jetbrains.datalore.notebook.server.computation.agentsManager.impl.instanceManager.kubernetes.KubernetesInstanceManager.doCreate(KubernetesInstanceManager.java:6)
at jetbrains.datalore.notebook.server.computation.agentsManager.impl.instanceManager.core.BaseInstanceManager.doCreate(BaseInstanceManager.kt:8)
at jetbrains.datalore.notebook.server.computation.agentsManager.impl.instanceManager.core.BaseInstanceManager.create(BaseInstanceManager.kt:7)
at jetbrains.datalore.notebook.server.computation.agentsManager.impl.l.createInstances(l.java:8)
at jetbrains.datalore.notebook.server.computation.agentsManager.impl.SingleInstanceTypeAgentPool.d(SingleInstanceTypeAgentPool.java:37)
at jetbrains.ocelot.base.edt.server.a.lambda$wrap$1(a.java:46)
at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Unknown Source)
at java.base/java.util.concurrent.FutureTask.run(Unknown Source)
at java.base/java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(Unknown Source)
at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
at java.base/java.lang.Thread.run(Unknown Source)
Caused by: io.kubernetes.client.openapi.ApiException:
at io.kubernetes.client.openapi.ApiClient.handleResponse(ApiClient.java:974)
at io.kubernetes.client.openapi.ApiClient.execute(ApiClient.java:886)
at io.kubernetes.client.openapi.apis.CoreV1Api.createNamespacedPodWithHttpInfo(CoreV1Api.java:9907)
at io.kubernetes.client.openapi.apis.CoreV1Api.createNamespacedPod(CoreV1Api.java:9873)
at jetbrains.datalore.notebook.server.computation.agentsManager.impl.instanceManager.kubernetes.KubernetesInstanceManager.doCreate(KubernetesInstanceManager.java:45)
… 11 common frames omitted
10:32:50.420 ERROR [Datalore EDT Manager] j.d.n.s.c.a.i.SingleInstanceTypeAgentPool - Failed to create instance when refilling pool
java.lang.RuntimeException: io.kubernetes.client.openapi.ApiException:
at jetbrains.datalore.notebook.server.computation.agentsManager.impl.instanceManager.kubernetes.KubernetesInstanceManager.doCreate(KubernetesInstanceManager.java:6)
at jetbrains.datalore.notebook.server.computation.agentsManager.impl.instanceManager.core.BaseInstanceManager.doCreate(BaseInstanceManager.kt:8)
at jetbrains.datalore.notebook.server.computation.agentsManager.impl.instanceManager.core.BaseInstanceManager.create(BaseInstanceManager.kt:7)
at jetbrains.datalore.notebook.server.computation.agentsManager.impl.l.createInstances(l.java:8)
at jetbrains.datalore.notebook.server.computation.agentsManager.impl.SingleInstanceTypeAgentPool.d(SingleInstanceTypeAgentPool.java:37)
at jetbrains.ocelot.base.edt.server.a.lambda$wrap$1(a.java:46)
at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Unknown Source)
at java.base/java.util.concurrent.FutureTask.run(Unknown Source)
at java.base/java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(Unknown Source)
at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
at java.base/java.lang.Thread.run(Unknown Source)
Caused by: io.kubernetes.client.openapi.ApiException:
at io.kubernetes.client.openapi.ApiClient.handleResponse(ApiClient.java:974)
at io.kubernetes.client.openapi.ApiClient.execute(ApiClient.java:886)
at io.kubernetes.client.openapi.apis.CoreV1Api.createNamespacedPodWithHttpInfo(CoreV1Api.java:9907)
at io.kubernetes.client.openapi.apis.CoreV1Api.createNamespacedPod(CoreV1Api.java:9873)
at jetbrains.datalore.notebook.server.computation.agentsManager.impl.instanceManager.kubernetes.KubernetesInstanceManager.doCreate(KubernetesInstanceManager.java:45)
… 11 common frames omitted

Please advise on how to fix it.

Thank you.

igro · October 17, 2022, 3:05pm

Hello @Valerii,

You might have installed Datalore in a non-default namespace, in which case some additional configuration is required:

- specify the namespace in the agentsConfig field
- add the DATABASES_K8S_NAMESPACE environment variable to dataloreEnv

Please see the documentation for more details: Helm-specific instructions | Datalore (https://www.jetbrains.com/help/datalore/helm-specific-instructions.html)
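
As a rough illustration of the two changes above, the values.yaml fragment below assumes the release lives in a namespace called datalore (a hypothetical name) and that the namespace key sits directly under k8s in agentsConfig. Neither detail is stated in this thread, so check both against the Helm-specific instructions. Note also that the chart template uses agentsConfig in place of its built-in default whenever the value is non-empty, so the rest of the instance definition has to be supplied as well; it is left out here.

```yaml
# Sketch only: "datalore" is a hypothetical namespace name.
agentsConfig:
  k8s:
    # Assumed key placement; the namespace in which agent pods are created.
    namespace: datalore
    instances:
      - id: k8s-datalore-agent
        label: "K8s Local"
        description: "Local K8s instance"
        # Remaining instance fields are omitted here; carry them over
        # from the chart's default agents config.

dataloreEnv:
  # Keep the existing entries (DATALORE_PUBLIC_URL, MAIL_*, ...) and add:
  DATABASES_K8S_NAMESPACE: "datalore"
```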

Hope this helps!

Best regards,
Igor

Valerii · October 17, 2022, 4:51pm

Thank you, I will try that.

I can see that the fix that makes the agents config use .Release.Namespace has been in the Helm chart on main since September 9: "By default, use release namespace in agents configmap" (JetBrains/datalore-configs@e9106e4 on GitHub).

But it wasn’t there in the Helm chart I’m using, and that is the latest one available:

$ helm repo add datalore https://jetbrains.github.io/datalore-configs/charts
$ helm repo update
$ helm search repo datalore

NAME             	CHART VERSION	APP VERSION 	DESCRIPTION
datalore/datalore	0.2.4        	2022.2.3    	           
datalore/hub     	0.2.4        	2022.2.15039

JetBrains/datalore-configs/blob/2022.2.3/charts/datalore/templates/agents-config-configmap.yaml

apiVersion: v1
kind: ConfigMap
metadata:
  name: {{ include "datalore.fullname" . }}-agents-config
  labels:
    {{- include "datalore.labels" . | nindent 4 }}
data:
  agents-config.yaml: |-
{{- if .Values.agentsConfig }}
{{- tpl (toYaml .Values.agentsConfig) . | nindent 4 }}
{{- else }}
    k8s:
      instances:
        - id: k8s-datalore-agent
          label: "K8s Local"
          description: "Local K8s instance"
          features:
            - "1 CPU cores"
            - "2 GB RAM"
          minAllowed: 1

This file has been truncated.

The same goes for DATABASES_K8S_NAMESPACE.
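
One possible way to avoid hard-coding the namespace on the 0.2.4 chart: the template excerpt above renders agentsConfig through tpl, so a template expression placed in values.yaml should be expanded at install time. The sketch below relies on that and on the assumption that the namespace key belongs directly under k8s. It has not been verified against the chart, and the excerpt does not show whether dataloreEnv values are also passed through tpl, so DATABASES_K8S_NAMESPACE is not templated here.

```yaml
agentsConfig:
  k8s:
    # Expanded by tpl in agents-config-configmap.yaml (assumed key placement).
    namespace: "{{ .Release.Namespace }}"
    instances:
      - id: k8s-datalore-agent
        # Remaining instance fields omitted; copy them from the chart default.
```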

So I replaced all files in the Helm chart folder with the ones from the main branch, except for Chart.yaml, because pods using an image with the tag 0.2.5-SNAPSHOT fail with an ImagePullBackOff error.
After helm upgrade, a new pod was created (k8s-datalore-agent-a0odvmdwvhzzo8hu) and there are no errors in the logs, which seems to be a good sign.

Thank you!

igro · October 17, 2022, 5:39pm

Yes, this change will be included in the upcoming 2022.3 release. Please stay tuned!