Issues Running PySpark on Remote Hadoop/Yarn Cluster

Hello all,

I’ve been loving working with Datalore for the last few days, but I’m having a lot of trouble getting it to work with PySpark. Specifically, I’m running a Hadoop cluster on the same private network as the Datalore host node, and I need the datalore-agent to submit jobs to the remote YARN cluster. Synchronous communication with the YARN master works as expected. However, whenever a `collect()` happens (either explicitly or implicitly), the Spark driver opens an ephemeral port inside the docker-agent container to receive the data. That port can’t be reached from the other hosts on the network, so the kernel hangs indefinitely until it’s killed.

Has anyone figured out a way around this? I tried adding a port mapping to the agent-config, but it didn’t result in any port bindings being opened, and it would also cause collisions when multiple agents are running. Is there any documentation on running Spark in Datalore against a remote YARN cluster?
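For reference, here’s a minimal sketch of the kind of session setup I’ve been experimenting with. The host address and port numbers are placeholders, and I’m only guessing that pinning `spark.driver.port` / `spark.blockManager.port` and setting `spark.driver.host` / `spark.driver.bindAddress` is the right direction; I haven’t been able to make it work with the agent container’s networking:

```python
from pyspark.sql import SparkSession

# Placeholder values -- these are just the settings I *think* need to be
# pinned so that executors on the cluster can reach the driver inside
# the docker-agent container.
spark = (
    SparkSession.builder
    .appName("datalore-yarn-test")
    .master("yarn")
    # Address the cluster should use to reach the driver
    # (the host node's private-network IP, not the container's internal address).
    .config("spark.driver.host", "10.0.0.5")
    # Bind on all interfaces inside the container.
    .config("spark.driver.bindAddress", "0.0.0.0")
    # Pin the driver and block-manager ports instead of letting Spark pick
    # ephemeral ones, so they could in principle be mapped/exposed predictably.
    .config("spark.driver.port", "40000")
    .config("spark.blockManager.port", "40001")
    .getOrCreate()
)

# A trivial collect() like this is where the kernel currently hangs for me.
df = spark.range(10)
print(df.collect())
```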

Any help is appreciated!

Thanks,
Sean
