Hi,
I am having trouble creating a GPU agent for a Datalore Enterprise Docker installation.
I have this as my agents-config.yaml:
docker:
  network: datalore-agents-network
  dataloreHost: datalore
  instances:
    - id: basic-agent
      default: true
      label: "docker-base"
      description: "docker-base"
      image: docker.io/jetbrains/datalore-agent:2023.1
    - id: gpu-agent
      label: "gpu-agent"
      image: docker.io/jetbrains/datalore-agent:evaluator-gpu-v0.3.0-no-fuse
      deviceRequests:
        - capabilities: [ [ "gpu" ] ]
The GPU in use is a GeForce GTX 1080 Ti.
What image should I use for the GPU agent configuration?
Am I missing some configuration declarations?
At the moment, whenever I update agents-config.yaml, I can see the "gpu-agent" when creating a new notebook, but it is grayed out and unselectable.
When I tried to frankenstein something together into the default instance like this:
docker:
  network: datalore-agents-network
  dataloreHost: datalore
  instances:
    - id: basic-agent
      default: true
      label: "docker-base"
      description: "docker-base"
      image: docker.io/jetbrains/datalore-agent:2023.1
      deviceRequests:
        - capabilities: [ [ "gpu" ] ]
I got this error:
21:40:27.092 ERROR [Datalore EDT Manager] j.d.n.s.c.agentsManager.impl.AgentPool Failed to create instance when refilling pool
com.github.dockerjava.api.exception.InternalServerErrorException: Status 500: {"message":"could not select device driver \"nvidia\" with capabilities: [[gpu]]"}
I'm also following this topic.
I enabled the GPU environment variable for the container start command as follows:
  additionalOptions: "-e NVIDIA_VISIBLE_DEVICES=all"
I also added deviceRequests, so my configuration looks like this:
docker:
  network: datalore-agents-network
  dataloreHost: datalore
  instances:
    - id: basic-agent
      label: "docker-base"
      description: "docker-base"
      default: true
      image: docker.io/jetbrains/datalore-agent:2023.2
      deviceRequests:
        - driver: "xxx"
          deviceIds: ["xxx"]
          capabilities: [ [ "gpu" ] ]
external:
  instances:
    - id: "xxx"
      label: "xxx"
      description: "xxx"
      image: jetbrains/datalore-agent:2023.3
      command: "podman"
      additionalOptions: "-e NVIDIA_VISIBLE_DEVICES=all"
But I still get an empty array of GPUs.
Hi,
Unfortunately, with the agents config you provided, the GPU machine did not even show up in the list of available machines.
I tried bumping the jetbrains/datalore-server and agent image versions to 2023.3, and the GPU machine became selectable. This is my agents-config.yaml:
docker:
  network: datalore-agents-network
  dataloreHost: datalore
  instances:
    - id: basic-agent
      default: true
      label: "docker-base"
      description: "docker-base"
      image: docker.io/jetbrains/datalore-agent:2023.3
    - id: gpu-agent
      default: false
      label: "gpu-agent"
      description: "gpu-agent"
      image: docker.io/jetbrains/datalore-agent:2023.3
      command: "docker"
      additionalOptions: "--gpus all"
      deviceRequests:
        - driver: "gpu-agent"
          capabilities: [ [ "gpu" ] ]
But the gpu-agent machine is still unable to start; the Datalore server gives the following error:
15:35:20.705 ERROR [Datalore EDT Manager] j.d.n.s.c.e.b.PlanStateComputationResourcesMonitor Unexpected token release for 3NTtszpfcHg4SBDszWmdXT: this token is not registered in ResourcesMonitor
15:35:23.864 INFO [Computation edt_1_1] j.d.n.s.c.c.ComputationIdProviderImpl Start computation with id bJVDMZUGKjVVty17EdqnyT for session NOTEBOOK session DaBmX3Ck7Blib4H28lYWW9 owned by Qy5l6doKhZV2KOqlTpRIbT [with source Qy5l6doKhZV2KOqlTpRIbT/LwOzLdXc2XpJcwamSF1AL5] by user Qy5l6doKhZV2KOqlTpRIbT
15:35:23.869 INFO [Computation edt_1_1] j.d.n.s.c.c.ComputationIdProviderImpl Started computation with id bJVDMZUGKjVVty17EdqnyT for session NOTEBOOK session DaBmX3Ck7Blib4H28lYWW9 owned by Qy5l6doKhZV2KOqlTpRIbT [with source Qy5l6doKhZV2KOqlTpRIbT/LwOzLdXc2XpJcwamSF1AL5] by user Qy5l6doKhZV2KOqlTpRIbT
15:35:24.755 WARN [Datalore EDT Manager] j.d.n.s.c.a.impl.pool.AgentPool Failed to create single instance
com.github.dockerjava.api.exception.InternalServerErrorException: Status 500: {"message":"could not select device driver \"gpu-agent\" with capabilities: [[gpu]]"}
at com.github.dockerjava.core.DefaultInvocationBuilder.execute(DefaultInvocationBuilder.java:247)
at com.github.dockerjava.core.DefaultInvocationBuilder.post(DefaultInvocationBuilder.java:102)
at com.github.dockerjava.core.exec.StartContainerCmdExec.execute(StartContainerCmdExec.java:31)
at com.github.dockerjava.core.exec.StartContainerCmdExec.execute(StartContainerCmdExec.java:13)
at com.github.dockerjava.core.exec.AbstrSyncDockerCmdExec.exec(AbstrSyncDockerCmdExec.java:21)
at com.github.dockerjava.core.command.AbstrDockerCmd.exec(AbstrDockerCmd.java:35)
at com.github.dockerjava.core.command.StartContainerCmdImpl.exec(StartContainerCmdImpl.java:43)
at jetbrains.datalore.notebook.server.computation.agentsManager.impl.instanceManager.docker.DockerInstanceManager.doCreate(DockerInstanceManager.kt:69)
at jetbrains.datalore.notebook.server.computation.agentsManager.impl.instanceManager.core.BaseInstanceManager.doCreate(BaseInstanceManager.kt:5)
at jetbrains.datalore.notebook.server.computation.agentsManager.impl.instanceManager.core.BaseInstanceManager.create(BaseInstanceManager.kt:3)
at jetbrains.datalore.notebook.server.computation.agentsManager.impl.SingleInstanceTypeInstanceManagerImpl.createInstances(SingleInstanceTypeInstanceManagerImpl.kt:13)
at jetbrains.datalore.notebook.server.computation.agentsManager.impl.pool.AgentPool.a(AgentPool.kt:80)
at jetbrains.datalore.notebook.server.computation.agentsManager.impl.pool.AgentPool.g(AgentPool.kt:110)
at jetbrains.datalore.notebook.server.computation.agentsManager.impl.pool.AgentPool.pull(AgentPool.kt:119)
at jetbrains.datalore.notebook.server.computation.agentsManager.impl.InstanceProviderImpl.a(InstanceProviderImpl.kt:2)
at jetbrains.datalore.notebook.server.computation.agentsManager.impl.InstanceProviderImpl.supplyInstance(InstanceProviderImpl.kt:65)
at jetbrains.datalore.notebook.server.computation.agentsManager.impl.AgentsManagerImpl$requestAgent$1$1.apply(AgentsManagerImpl.kt:7)
at jetbrains.datalore.notebook.server.computation.agentsManager.impl.AgentsManagerImpl$requestAgent$1$1.apply(AgentsManagerImpl.kt:8)
at jetbrains.datalore.base.common.async.Asyncs.lambda$select$4(Asyncs.java:99)
at jetbrains.datalore.base.common.async.SimpleAsync.lambda$success$0(SimpleAsync.java:20)
at jetbrains.datalore.base.common.observable.event.Listeners.fire(Listeners.java:18)
at jetbrains.datalore.base.common.async.SimpleAsync.success(SimpleAsync.java:11)
at jetbrains.datalore.base.common.async.ThreadSafeAsync.success(ThreadSafeAsync.java:23)
at jetbrains.datalore.base.common.async.SimpleAsync.lambda$success$0(SimpleAsync.java:20)
at jetbrains.datalore.base.common.observable.event.Listeners.fire(Listeners.java:18)
at jetbrains.datalore.base.common.async.SimpleAsync.success(SimpleAsync.java:11)
at jetbrains.datalore.base.common.async.ThreadSafeAsync.success(ThreadSafeAsync.java:23)
at jetbrains.datalore.base.common.async.Asyncs.lambda$map$3(Asyncs.java:50)
at jetbrains.datalore.base.common.async.SimpleAsync.lambda$success$0(SimpleAsync.java:20)
at jetbrains.datalore.base.common.observable.event.Listeners.fire(Listeners.java:18)
at jetbrains.datalore.base.common.async.SimpleAsync.success(SimpleAsync.java:11)
at jetbrains.datalore.base.common.async.ThreadSafeAsync.success(ThreadSafeAsync.java:23)
at jetbrains.datalore.base.common.async.Asyncs.lambda$map$3(Asyncs.java:50)
at jetbrains.datalore.base.common.async.SimpleAsync.lambda$success$0(SimpleAsync.java:20)
at jetbrains.datalore.base.common.observable.event.Listeners.fire(Listeners.java:18)
at jetbrains.datalore.base.common.async.SimpleAsync.success(SimpleAsync.java:11)
at jetbrains.datalore.base.common.async.ThreadSafeAsync.success(ThreadSafeAsync.java:23)
at jetbrains.datalore.base.common.async.SimpleAsync.lambda$success$0(SimpleAsync.java:20)
at jetbrains.datalore.base.common.observable.event.Listeners.fire(Listeners.java:18)
at jetbrains.datalore.base.common.async.SimpleAsync.success(SimpleAsync.java:11)
at jetbrains.datalore.base.common.async.ThreadSafeAsync.success(ThreadSafeAsync.java:23)
at jetbrains.datalore.base.common.edt.RunnableWithAsync$SafeAsync.success(RunnableWithAsync.java:12)
at jetbrains.datalore.base.common.edt.RunnableWithAsync.lambda$successFromPlain$1(RunnableWithAsync.java:30)
at jetbrains.datalore.base.common.edt.RunnableWithAsync.run(RunnableWithAsync.java:19)
at jetbrains.datalore.base.jvm.edt.ExecutorEdtManager$ExecutorEdt.lambda$scheduleRunnableWithAsync$0(ExecutorEdtManager.java:29)
at jetbrains.datalore.base.jvm.edt.ExecutorEdtManager$ExecutorEdt.lambda$wrap$1(ExecutorEdtManager.java:46)
at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:539)
at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
at java.base/java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:304)
at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136)
at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)
at java.base/java.lang.Thread.run(Thread.java:833)
15:35:24.756 ERROR [Datalore EDT Manager] j.d.n.s.c.a.impl.InstanceProviderImpl Failed to serve agent request for Qy5l6doKhZV2KOqlTpRIbT/LwOzLdXc2XpJcwamSF1AL5, computation id bJVDMZUGKjVVty17EdqnyT
com.github.dockerjava.api.exception.InternalServerErrorException: Status 500: {"message":"could not select device driver \"gpu-agent\" with capabilities: [[gpu]]"}
[stack trace identical to the one above]
15:35:24.771 WARN [Datalore EDT Manager] j.d.n.s.c.c.ComputationControllerImpl Computation creation failed for: NOTEBOOK session DaBmX3Ck7Blib4H28lYWW9 owned by Qy5l6doKhZV2KOqlTpRIbT [with source Qy5l6doKhZV2KOqlTpRIbT/LwOzLdXc2XpJcwamSF1AL5]
Does this ring any bells?
Kind regards,
Anders
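In the log above, the driver named in the Docker error ("gpu-agent") is the same string that was set under driver: in deviceRequests, which suggests Docker treats that field as a device driver name rather than as the agent id. A minimal Python sketch for pulling the requested driver out of such a log line follows; the function name requested_driver is hypothetical and the pattern is only matched against the message format seen in this thread.

```python
import re
from typing import Optional

# Matches the Docker daemon message seen in the logs in this thread.
# The quotes around the driver name may be JSON-escaped (\"name\") or plain.
_DRIVER_ERROR = re.compile(
    r'could not select device driver \\?"(?P<driver>[^"\\]*)\\?" with capabilities'
)


def requested_driver(log_line: str) -> Optional[str]:
    """Return the driver name Docker failed to select, or None if no match."""
    match = _DRIVER_ERROR.search(log_line)
    return match.group("driver") if match else None


if __name__ == "__main__":
    line = (
        'Status 500: {"message":"could not select device driver '
        '\\"gpu-agent\\" with capabilities: [[gpu]]"}'
    )
    print(requested_driver(line))
```

If the value returned is something you typed into agents-config.yaml yourself, the driver: entry is a likely suspect; an empty or "nvidia" value points more towards the host setup.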
Hi @andersm9 ,
I posted to say that I'm having the same issue and following the topic, not to offer a solution.
Thanks
It's also a bit weird that the container is aware of the GPU but Datalore can't see it.
I fixed my problem.
I had to install nvidia-container-toolkit.
I googled: "datalore Failed to start the machine: Status 500: {"message":"could not select device driver "" with capabilities: [[gpu]]"}"
The solution came from this GitHub issue: could not select device driver "" with capabilities: [[gpu]] (Issue #1034, NVIDIA/nvidia-docker).
Current agents-config.yaml:
docker:
  network: datalore-agents-network
  dataloreHost: datalore
  instances:
    - id: basic-agent
      default: true
      label: "docker-base"
      description: "docker-base"
      image: docker.io/jetbrains/datalore-agent:2023.3
    - id: gpu-agent
      default: false
      label: "gpu-agent"
      description: "gpu-agent"
      image: docker.io/jetbrains/datalore-agent:2023.3
      command: "docker"
      additionalOptions: "--gpus all"
      deviceRequests:
        #- driver: "gpu-agent"
        - capabilities: [ [ "gpu" ] ]
"Seems to work" is the name of the game at the moment.
The docs should list nvidia-container-toolkit as a dependency.
Anders
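For anyone checking a host before editing agents-config.yaml, here is a minimal Python sketch that looks for the NVIDIA hook binary on PATH. It rests on an assumption not confirmed in this thread: that Docker only offers its GPU device driver when nvidia-container-runtime-hook (shipped with nvidia-container-toolkit) is installed. The function names are hypothetical.

```python
import shutil
from typing import Callable, Optional

# Assumption: this is the binary Docker looks for before enabling "--gpus";
# it is installed by the nvidia-container-toolkit package.
HOOK_BINARY = "nvidia-container-runtime-hook"


def gpu_hook_path(
    which: Callable[[str], Optional[str]] = shutil.which,
) -> Optional[str]:
    """Return the path of the NVIDIA hook binary, or None if it is not on PATH."""
    return which(HOOK_BINARY)


def describe_host(which: Callable[[str], Optional[str]] = shutil.which) -> str:
    """Summarise whether the host looks ready for GPU containers."""
    path = gpu_hook_path(which)
    if path is None:
        return "nvidia-container-toolkit not found: install it and restart Docker"
    return f"NVIDIA hook found at {path}"


if __name__ == "__main__":
    print(describe_host())
```

The which parameter is injectable only so the check can be exercised without a GPU host; run it on the machine where the Docker daemon itself runs.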
nvidia-container-toolkit helped me as well after you mentioned it.
So what I did:
- Installed nvidia-container-toolkit.
- Generated the nvidia.yaml file with it (Installation Guide, container-toolkit 1.13.1 documentation).
- Since I'm using podman, I had to add --device nvidia.com/gpu=all --group-add keep-groups to the additionalOptions in agents-config.yaml.
- I also had to chmod 666 the device files in /dev/dri (the path depends on the device).
- Restarted the Datalore server with the new agents-config.yaml.
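The podman steps above can be sanity-checked with a small Python sketch: it looks for the generated nvidia.yaml and rebuilds the additionalOptions string quoted in the list. The directories /etc/cdi and /var/run/cdi are an assumption about where the spec file usually lives, and both function names are hypothetical.

```python
from pathlib import Path
from typing import Iterable, List

# Assumption: these are the usual CDI spec directories; adjust them if your
# nvidia.yaml was generated somewhere else.
DEFAULT_CDI_DIRS = ("/etc/cdi", "/var/run/cdi")
SPEC_SUFFIXES = {".yaml", ".yml", ".json"}


def find_cdi_specs(dirs: Iterable[str] = DEFAULT_CDI_DIRS) -> List[Path]:
    """List spec files (*.yaml, *.yml, *.json) found in the given directories."""
    specs: List[Path] = []
    for directory in dirs:
        root = Path(directory)
        if not root.is_dir():
            continue  # a missing directory just means no specs there
        specs.extend(
            p for p in sorted(root.iterdir())
            if p.is_file() and p.suffix in SPEC_SUFFIXES
        )
    return specs


def podman_gpu_options(device: str = "nvidia.com/gpu=all") -> str:
    """Build the additionalOptions value used in the steps above."""
    return f"--device {device} --group-add keep-groups"


if __name__ == "__main__":
    found = find_cdi_specs()
    print("CDI specs:", [str(p) for p in found] or "none found")
    print("additionalOptions:", podman_gpu_options())
```

An empty result from find_cdi_specs() would suggest the nvidia.yaml generation step has not run yet, or wrote its output to a different location.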