Dead loop when iterating through a PyTorch dataloader

jony101 · February 16, 2021, 5:56pm

Hi,

I have a small CNN model that works fine on my PC's CPU.
I needed more computational power, so I moved all my data and code to Datalore.

Unfortunately, when I initialize my model in the Datalore Sheet, execution only gets as far as the lines that iterate over the PyTorch dataloader and then enters a dead loop.
This means that it can’t pull image samples for training, the model gets stuck, and I must restart the kernel.
This bug does not occur when I’m using my own PC with the same code for training.
I also tried to iterate over the dataloader outside of the model, using the code:

images, labels = next(iter(train_dataloader))

and this line of code reproduces the bug described above.

I have searched the web for a solution but haven't found anything.

An answer would be greatly appreciated.
Thanks!

igro · February 18, 2021, 4:07pm

Hello Jonathan,

Sorry, I missed this topic on the forum and replied to the question by email.

It might be useful for other users so I will post the answer here as well.

In this case the issue is caused by slow WebDAV performance. As a temporary workaround, you can copy your data to the computational agent (machine). To do so, execute the following commands:

Create a temporary folder:

!mkdir /tmp/data

Copy the contents of your data folder from Workspace Files into it (this might take a while):

!cp -r /data/workspace_files/data/. /tmp/data/

Ensure the files were copied (list directory):

!ls /tmp/data

Use /tmp/data in place of /data/workspace_files/data
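
If you would rather do the copy from Python than with shell commands, something like the following might work. It is only a minimal sketch: copy_to_local is a hypothetical helper name (not part of Datalore or PyTorch), and it assumes Python 3.8+ for the dirs_exist_ok argument of shutil.copytree.

```python
import shutil
from pathlib import Path


def copy_to_local(src: str, dst: str) -> int:
    """Copy the contents of src into dst and return the number of files now in dst.

    Hypothetical helper for illustration only.
    """
    src_path, dst_path = Path(src), Path(dst)
    if not src_path.is_dir():
        raise FileNotFoundError(f"source directory not found: {src_path}")
    # dirs_exist_ok=True (Python 3.8+) allows dst to exist already,
    # so the call still works after dst has been created beforehand.
    shutil.copytree(src_path, dst_path, dirs_exist_ok=True)
    # Count regular files in dst as a quick sanity check of the copy.
    return sum(1 for p in dst_path.rglob("*") if p.is_file())


# Example call in a notebook cell (paths taken from this thread):
# n_files = copy_to_local("/data/workspace_files/data", "/tmp/data")
# print(n_files, "files available under /tmp/data")
```

After that, point your dataset at /tmp/data when building the dataloader.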

Please note that all data stored on the machine will be lost once the computation is stopped.

We are still investigating ways to improve WebDAV performance.
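
If you want to check whether slow file reads are what stalls your dataloader, a rough timing sketch like the one below may help. time_file_reads is a hypothetical helper, not a Datalore or PyTorch function, and the numbers are only indicative, since files that were read recently may be served from a cache.

```python
import time
from pathlib import Path
from typing import Tuple


def time_file_reads(root: str, max_files: int = 50) -> Tuple[int, float]:
    """Read up to max_files files under root; return (files_read, seconds).

    Hypothetical helper for illustration only.
    """
    # The directory listing happens before the timer starts,
    # so only the file reads themselves are measured.
    files = [p for p in sorted(Path(root).rglob("*")) if p.is_file()][:max_files]
    start = time.perf_counter()
    for path in files:
        path.read_bytes()
    return len(files), time.perf_counter() - start


# Example comparison in a notebook cell (paths taken from this thread):
# for folder in ("/data/workspace_files/data", "/tmp/data"):
#     n, secs = time_file_reads(folder)
#     print(f"{folder}: {n} files in {secs:.2f}s")
```

If the Workspace Files folder is much slower than the local copy for the same files, storage access is the likely bottleneck.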