Server does not destroy idle agents

Agents that are started when a notebook is opened are not removed when the notebook is closed.
The cluster ends up with many agents that need to be removed manually.

Server log:

ERROR [Datalore EDT Manager] j.d.n.s.c.a.i.ExtendedInstanceManagerImpl Agent docker-swarm-base-LL8ulkHr should be destroyed but still running!

Environment:
Datalore 2023.2
Docker swarm 23.0.1

Hello @sympe,

I’m sorry to disappoint you, but the Docker Swarm environment is not officially supported by Datalore, so unfortunately I won’t be able to help you.

Thank you!

Best regards,
Igor Medovolkin
QA Engineer in Datalore

I have the same problem. Datalore Enterprise 2023.2 on Google Cloud Kubernetes.

I created an extra node pool with n2-standard-4 machines, then added the following configuration:

      - id: agent-4
        label: "Agent 4"
        description: "K8s instance"
        features:
          - "4 CPU cores"
          - "16 GB RAM"
        initialPoolSize: 0
        numCPUs: 2
        cpuMemoryText: "8 GB"
        numGPUs: 0
        gpuMemoryText: ""
        default: false
        yaml:
          apiVersion: v1
          kind: Pod
          metadata:
            name: k8s-agent
            labels:
              podType: dataloreKubernetesAgent
          spec:
            enableServiceLinks: false
            containers:
              - name: agent
                image: jetbrains/datalore-agent:2023.2  
                securityContext:
                  privileged: true
                env:
                  - name: MAX_HEAP_SIZE
                    value: 512m
                volumeMounts:
                  - mountPath: /etc/datalore/logback-config
                    name: logback-config
                resources:
                  limits:
                    cpu: "4"
                    memory: "16384Mi"
                  requests:
                    cpu: "2400m"
                    memory: "8192Mi"
            volumes:
              - name: logback-config
                configMap:
                  name: datalore-logback-config

The machine starts up as expected, but then Kubernetes complains that it can’t scale down:

Pod is blocking scale down because it's not backed by a controller

And it recommends adding this annotation:

Set annotation 'cluster-autoscaler.kubernetes.io/safe-to-evict': 'true' for pod or define a controller, such as a deployment or ReplicaSet, for the pod 
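If I read that correctly, the annotation would go under the pod’s metadata in the agent configuration above. A rough sketch (untested, reusing the pod definition from my config):

metadata:
  name: k8s-agent
  labels:
    podType: dataloreKubernetesAgent
  annotations:
    # Assumption: marking the agent pod as safe to evict should let the
    # cluster autoscaler scale the node pool back down once the pod is gone.
    cluster-autoscaler.kubernetes.io/safe-to-evict: "true"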

Maybe it would be a good idea to add this to the Helm chart?