Docker and GPU agent

Hi,

I am having trouble creating a GPU agent for a Datalore Enterprise Docker installation.
This is my agents-config.yaml:

docker:
  network: datalore-agents-network
  dataloreHost: datalore
  instances:
    - id: basic-agent
      default: true
      label: "docker-base"
      description: "docker-base"
      image: docker.io/jetbrains/datalore-agent:2023.1
    - id: gpu-agent
      label: "gpu-agent"
      image: docker.io/jetbrains/datalore-agent:evaluator-gpu-v0.3.0-no-fuse
      deviceRequests:
        - capabilities: [ [ "gpu" ] ]

The GPU in use is a GeForce GTX 1080 Ti.

What image should I use for the GPU agent configuration?
Am I missing some configuration declarations?
At the moment, whenever I update agents-config.yaml, I can see “gpu-agent” when creating a new notebook, but it is grayed out and cannot be selected.

When I tried to frankenstein something together into the default instance like this:

docker:
  network: datalore-agents-network
  dataloreHost: datalore
  instances:
    - id: basic-agent
      default: true
      label: "docker-base"
      description: "docker-base"
      image: docker.io/jetbrains/datalore-agent:2023.1
      deviceRequests:
        - capabilities: [ [ "gpu" ] ]

I got this error:

21:40:27.092 ERROR [Datalore EDT Manager] j.d.n.s.c.agentsManager.impl.AgentPool   Failed to create instance when refilling pool
com.github.dockerjava.api.exception.InternalServerErrorException: Status 500: {"message":"could not select device driver \"nvidia\" with capabilities: [[gpu]]"}
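
For anyone hitting the same error: the "Status 500" message comes from the Docker daemon rather than from Datalore, so it can be checked outside Datalore. Below is a minimal sketch, assuming a Linux host with Docker; the function names are made up for illustration. It relies on the assumption that Docker Engine only offers its "nvidia" device driver when it can find nvidia-container-runtime-hook (shipped with nvidia-container-toolkit) on the daemon's PATH.

```shell
#!/usr/bin/env bash
# Hypothetical helper names; not part of Docker or Datalore.

# Prints "present" if the hook binary that Docker's GPU support is assumed
# to rely on is on PATH, otherwise "missing". Note that this checks the
# current shell's PATH, which may differ from the Docker daemon's PATH.
nvidia_hook_status() {
  if command -v nvidia-container-runtime-hook >/dev/null 2>&1; then
    echo "present"
  else
    echo "missing"
  fi
}

# Starts a throwaway container with a GPU request similar to the one that
# deviceRequests with capabilities [["gpu"]] produces. It needs a running
# Docker daemon, so it is only defined here, not called.
gpu_smoke_test() {
  docker run --rm --gpus all ubuntu nvidia-smi
}

echo "nvidia-container-runtime-hook: $(nvidia_hook_status)"
```

If the status is "missing", or gpu_smoke_test fails with the same "could not select device driver" message, the problem is most likely on the Docker host and not in agents-config.yaml.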

I’m also following this topic.

I enabled the GPU environment variable for the container start command as follows:
additionalOptions: "-e NVIDIA_VISIBLE_DEVICES=all"

I also added deviceRequests, so my configuration looks like this:

docker:
  network: datalore-agents-network
  dataloreHost: datalore
  instances:
    - id: basic-agent
      label: "docker-base"
      description: "docker-base"
      default: true
      image: docker.io/jetbrains/datalore-agent:2023.2
      deviceRequests:
        - driver: "xxx"
          deviceIds: ["xxx"]
          capabilities: [ [ "gpu" ] ]
external:
  instances:
    - id: "xxx"
      label: "xxx"
      description: "xxx"
      image: jetbrains/datalore-agent:2023.3
      command: "podman"
      additionalOptions: "-e NVIDIA_VISIBLE_DEVICES=all"

but I still get an empty array of GPUs.

Hi,

Unfortunately, with the agents config you provided, the GPU machine did not even show up in the list of available machines.

I tried bumping the jetbrains/datalore-server and agent image versions to 2023.3, and the GPU machine became selectable. This is my agents-config.yaml:

docker:
  network: datalore-agents-network
  dataloreHost: datalore
  instances:
    - id: basic-agent
      default: true
      label: "docker-base"
      description: "docker-base"
      image: docker.io/jetbrains/datalore-agent:2023.3
    - id: gpu-agent
      default: false
      label: "gpu-agent"
      description: "gpu-agent"
      image: docker.io/jetbrains/datalore-agent:2023.3
      command: "docker"
      additionalOptions: "--gpus all"
      deviceRequests:
        - driver: "gpu-agent"
          capabilities: [ [ "gpu" ] ]

But the gpu-agent machine is still unable to start, and the Datalore server gives the following error:

15:35:20.705 ERROR [Datalore EDT Manager] j.d.n.s.c.e.b.PlanStateComputationResourcesMonitor Unexpected token release for 3NTtszpfcHg4SBDszWmdXT: this token is not registered in ResourcesMonitor
15:35:23.864 INFO  [Computation edt_1_1] j.d.n.s.c.c.ComputationIdProviderImpl    Start computation with id bJVDMZUGKjVVty17EdqnyT for session NOTEBOOK session DaBmX3Ck7Blib4H28lYWW9 owned by Qy5l6doKhZV2KOqlTpRIbT [with source Qy5l6doKhZV2KOqlTpRIbT/LwOzLdXc2XpJcwamSF1AL5] by user Qy5l6doKhZV2KOqlTpRIbT
15:35:23.869 INFO  [Computation edt_1_1] j.d.n.s.c.c.ComputationIdProviderImpl    Started computation with id bJVDMZUGKjVVty17EdqnyT for session NOTEBOOK session DaBmX3Ck7Blib4H28lYWW9 owned by Qy5l6doKhZV2KOqlTpRIbT [with source Qy5l6doKhZV2KOqlTpRIbT/LwOzLdXc2XpJcwamSF1AL5] by user Qy5l6doKhZV2KOqlTpRIbT
15:35:24.755 WARN  [Datalore EDT Manager] j.d.n.s.c.a.impl.pool.AgentPool          Failed to create single instance
com.github.dockerjava.api.exception.InternalServerErrorException: Status 500: {"message":"could not select device driver \"gpu-agent\" with capabilities: [[gpu]]"}

	at com.github.dockerjava.core.DefaultInvocationBuilder.execute(DefaultInvocationBuilder.java:247)
	at com.github.dockerjava.core.DefaultInvocationBuilder.post(DefaultInvocationBuilder.java:102)
	at com.github.dockerjava.core.exec.StartContainerCmdExec.execute(StartContainerCmdExec.java:31)
	at com.github.dockerjava.core.exec.StartContainerCmdExec.execute(StartContainerCmdExec.java:13)
	at com.github.dockerjava.core.exec.AbstrSyncDockerCmdExec.exec(AbstrSyncDockerCmdExec.java:21)
	at com.github.dockerjava.core.command.AbstrDockerCmd.exec(AbstrDockerCmd.java:35)
	at com.github.dockerjava.core.command.StartContainerCmdImpl.exec(StartContainerCmdImpl.java:43)
	at jetbrains.datalore.notebook.server.computation.agentsManager.impl.instanceManager.docker.DockerInstanceManager.doCreate(DockerInstanceManager.kt:69)
	at jetbrains.datalore.notebook.server.computation.agentsManager.impl.instanceManager.core.BaseInstanceManager.doCreate(BaseInstanceManager.kt:5)
	at jetbrains.datalore.notebook.server.computation.agentsManager.impl.instanceManager.core.BaseInstanceManager.create(BaseInstanceManager.kt:3)
	at jetbrains.datalore.notebook.server.computation.agentsManager.impl.SingleInstanceTypeInstanceManagerImpl.createInstances(SingleInstanceTypeInstanceManagerImpl.kt:13)
	at jetbrains.datalore.notebook.server.computation.agentsManager.impl.pool.AgentPool.a(AgentPool.kt:80)
	at jetbrains.datalore.notebook.server.computation.agentsManager.impl.pool.AgentPool.g(AgentPool.kt:110)
	at jetbrains.datalore.notebook.server.computation.agentsManager.impl.pool.AgentPool.pull(AgentPool.kt:119)
	at jetbrains.datalore.notebook.server.computation.agentsManager.impl.InstanceProviderImpl.a(InstanceProviderImpl.kt:2)
	at jetbrains.datalore.notebook.server.computation.agentsManager.impl.InstanceProviderImpl.supplyInstance(InstanceProviderImpl.kt:65)
	at jetbrains.datalore.notebook.server.computation.agentsManager.impl.AgentsManagerImpl$requestAgent$1$1.apply(AgentsManagerImpl.kt:7)
	at jetbrains.datalore.notebook.server.computation.agentsManager.impl.AgentsManagerImpl$requestAgent$1$1.apply(AgentsManagerImpl.kt:8)
	at jetbrains.datalore.base.common.async.Asyncs.lambda$select$4(Asyncs.java:99)
	at jetbrains.datalore.base.common.async.SimpleAsync.lambda$success$0(SimpleAsync.java:20)
	at jetbrains.datalore.base.common.observable.event.Listeners.fire(Listeners.java:18)
	at jetbrains.datalore.base.common.async.SimpleAsync.success(SimpleAsync.java:11)
	at jetbrains.datalore.base.common.async.ThreadSafeAsync.success(ThreadSafeAsync.java:23)
	at jetbrains.datalore.base.common.async.SimpleAsync.lambda$success$0(SimpleAsync.java:20)
	at jetbrains.datalore.base.common.observable.event.Listeners.fire(Listeners.java:18)
	at jetbrains.datalore.base.common.async.SimpleAsync.success(SimpleAsync.java:11)
	at jetbrains.datalore.base.common.async.ThreadSafeAsync.success(ThreadSafeAsync.java:23)
	at jetbrains.datalore.base.common.async.Asyncs.lambda$map$3(Asyncs.java:50)
	at jetbrains.datalore.base.common.async.SimpleAsync.lambda$success$0(SimpleAsync.java:20)
	at jetbrains.datalore.base.common.observable.event.Listeners.fire(Listeners.java:18)
	at jetbrains.datalore.base.common.async.SimpleAsync.success(SimpleAsync.java:11)
	at jetbrains.datalore.base.common.async.ThreadSafeAsync.success(ThreadSafeAsync.java:23)
	at jetbrains.datalore.base.common.async.Asyncs.lambda$map$3(Asyncs.java:50)
	at jetbrains.datalore.base.common.async.SimpleAsync.lambda$success$0(SimpleAsync.java:20)
	at jetbrains.datalore.base.common.observable.event.Listeners.fire(Listeners.java:18)
	at jetbrains.datalore.base.common.async.SimpleAsync.success(SimpleAsync.java:11)
	at jetbrains.datalore.base.common.async.ThreadSafeAsync.success(ThreadSafeAsync.java:23)
	at jetbrains.datalore.base.common.async.SimpleAsync.lambda$success$0(SimpleAsync.java:20)
	at jetbrains.datalore.base.common.observable.event.Listeners.fire(Listeners.java:18)
	at jetbrains.datalore.base.common.async.SimpleAsync.success(SimpleAsync.java:11)
	at jetbrains.datalore.base.common.async.ThreadSafeAsync.success(ThreadSafeAsync.java:23)
	at jetbrains.datalore.base.common.edt.RunnableWithAsync$SafeAsync.success(RunnableWithAsync.java:12)
	at jetbrains.datalore.base.common.edt.RunnableWithAsync.lambda$successFromPlain$1(RunnableWithAsync.java:30)
	at jetbrains.datalore.base.common.edt.RunnableWithAsync.run(RunnableWithAsync.java:19)
	at jetbrains.datalore.base.jvm.edt.ExecutorEdtManager$ExecutorEdt.lambda$scheduleRunnableWithAsync$0(ExecutorEdtManager.java:29)
	at jetbrains.datalore.base.jvm.edt.ExecutorEdtManager$ExecutorEdt.lambda$wrap$1(ExecutorEdtManager.java:46)
	at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:539)
	at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
	at java.base/java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:304)
	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136)
	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)
	at java.base/java.lang.Thread.run(Thread.java:833)
15:35:24.756 ERROR [Datalore EDT Manager] j.d.n.s.c.a.impl.InstanceProviderImpl    Failed to serve agent request for Qy5l6doKhZV2KOqlTpRIbT/LwOzLdXc2XpJcwamSF1AL5, computation id bJVDMZUGKjVVty17EdqnyT
com.github.dockerjava.api.exception.InternalServerErrorException: Status 500: {"message":"could not select device driver \"gpu-agent\" with capabilities: [[gpu]]"}

	... (same stack trace as above)
15:35:24.771 WARN  [Datalore EDT Manager] j.d.n.s.c.c.ComputationControllerImpl    Computation creation failed for: NOTEBOOK session DaBmX3Ck7Blib4H28lYWW9 owned by Qy5l6doKhZV2KOqlTpRIbT [with source Qy5l6doKhZV2KOqlTpRIbT/LwOzLdXc2XpJcwamSF1AL5]

Does this ring any bells?

Kind regards,
Anders
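
One detail worth noting in the log above: Docker echoes back the driver value it was asked for, here "gpu-agent", which is the instance id from the config rather than a device driver the Docker daemon knows about. A small sketch for pulling that value out of a Datalore server log follows; the function name is hypothetical and the pattern assumes the message format shown above.

```shell
#!/usr/bin/env bash
# Reads log lines on stdin and prints each distinct driver name that
# Docker failed to select (hypothetical helper, not a Datalore tool).
rejected_drivers() {
  sed -n 's/.*could not select device driver \\"\([^\\"]*\)\\" with capabilities.*/\1/p' | sort -u
}

# Example with the message from the log above; prints: gpu-agent
printf '%s\n' 'Status 500: {"message":"could not select device driver \"gpu-agent\" with capabilities: [[gpu]]"}' \
  | rejected_drivers
```

If the printed name is something other than a driver installed on the host, the driver field in deviceRequests is a likely culprit.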

Hi @andersm9,

I posted that to say I’m having the same issue and following the topic, not as a solution to the issue.

Thanks!

It’s also a bit weird that the container is aware of the GPU but Datalore can’t see it.

I fixed my problem.
I had to install nvidia-container-toolkit.

I googled: "datalore Failed to start the machine: Status 500: {"message":"could not select device driver \"\" with capabilities: [[gpu]]"}"

The solution came from this GitHub issue: could not select device driver "" with capabilities: [[gpu]] (Issue #1034, NVIDIA/nvidia-docker).
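
For reference, installing the toolkit on the Docker host might look roughly like the sketch below. It assumes a Debian/Ubuntu host where the NVIDIA driver is installed and NVIDIA's apt repository is already configured; other distributions use a different package manager. The run and setup_nvidia_toolkit_for_docker helpers are hypothetical names, and the script only prints the commands unless DRY_RUN=0 is set.

```shell
#!/usr/bin/env bash
# Dry-run by default: prints each command. Set DRY_RUN=0 to execute them.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

# Assumption: Debian/Ubuntu with NVIDIA's apt repository already set up.
setup_nvidia_toolkit_for_docker() {
  run sudo apt-get install -y nvidia-container-toolkit || return 1
  # Registers the NVIDIA runtime in Docker's daemon configuration.
  run sudo nvidia-ctk runtime configure --runtime=docker || return 1
  # Docker has to be restarted to pick up the change.
  run sudo systemctl restart docker || return 1
  # Smoke test: should list the GPU if everything is wired up.
  run docker run --rm --gpus all ubuntu nvidia-smi || return 1
}

setup_nvidia_toolkit_for_docker
```

Once the smoke test lists the GPU, a deviceRequests entry with only capabilities: [ [ "gpu" ] ] should no longer trigger the "could not select device driver" error.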

My current agents-config.yaml:

docker:
  network: datalore-agents-network
  dataloreHost: datalore
  instances:
    - id: basic-agent
      default: true
      label: "docker-base"
      description: "docker-base"
      image: docker.io/jetbrains/datalore-agent:2023.3
    - id: gpu-agent
      default: false
      label: "gpu-agent"
      description: "gpu-agent"
      image: docker.io/jetbrains/datalore-agent:2023.3
      command: "docker"
      additionalOptions: "--gpus all"
      deviceRequests:
        #- driver: "gpu-agent"
        - capabilities: [ [ "gpu" ] ]

"Seems to work" is the name of the game at the moment.
The docs should list nvidia-container-toolkit as a dependency.

Anders

nvidia-container-toolkit also helped me after you mentioned it.
Here is what I did:

  • Installed nvidia-container-toolkit.
  • Generated the nvidia.yaml file with it (see the Installation Guide in the container-toolkit 1.13.1 documentation).
  • Since I’m using podman, I had to add --device nvidia.com/gpu=all --group-add keep-groups to the additionalOptions in agents-config.yaml.
  • I also had to chmod 666 the device files in /dev/dri (the path depends on the device).
  • Restarted the Datalore server with the new agents-config.yaml.
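
The podman steps above can be sketched as a script. This is only an illustration under assumptions: nvidia-container-toolkit is already installed, /etc/cdi/nvidia.yaml is used as the output path for the generated spec, and /dev/dri is the device directory (as noted above, that path may differ). The run and setup_podman_gpu helpers are hypothetical names, and nothing is executed unless DRY_RUN=0 is set.

```shell
#!/usr/bin/env bash
# Dry-run by default: prints each command. Set DRY_RUN=0 to execute them.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

# $1: directory holding the GPU device files (defaults to /dev/dri).
setup_podman_gpu() {
  local dev_dir="${1:-/dev/dri}"
  local f
  # Generate the CDI spec so that nvidia.com/gpu=all can be resolved.
  run sudo nvidia-ctk cdi generate --output=/etc/cdi/nvidia.yaml || return 1
  # Loosen permissions on the device files, as described in the list above.
  for f in "$dev_dir"/*; do
    [ -e "$f" ] || continue   # the glob matched nothing
    [ -d "$f" ] && continue   # skip subdirectories such as by-path
    run sudo chmod 666 "$f" || return 1
  done
  # Smoke test with the same options that go into additionalOptions.
  run podman run --rm --device nvidia.com/gpu=all --group-add keep-groups \
    ubuntu nvidia-smi -L || return 1
}

setup_podman_gpu
```

Mode 666 makes the device files readable and writable by every user on the host, so it may be worth checking whether group membership alone is enough before relaxing permissions that far.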