RDMA Hadoop/Spark not working with Slurm submission scripts #276

Open
casty8 opened this issue Mar 16, 2018 · 3 comments

casty8 commented Mar 16, 2018

I have configured RDMA Hadoop and Spark myself on an InfiniBand cluster and it works, but when I try to use the submission script magpie.sbatch-srun-spark-with-yarn-and-hdfs (just testing Hadoop for now), it allocates the nodes in Slurm correctly but does not run properly. The ResourceManager appears in the jps output but never starts, showing an InfiniBand error in resourcemanager.out while its .log file shows no errors, and the NodeManager .log files show a connection problem to the ResourceManager node.

It seems these scripts are not ready for this RDMA version of Hadoop and Spark, because I can make everything work fine on my own with the conf files provided in the Hadoop guide that I followed. Any suggestions?

I would really appreciate any help you can provide.


chu11 commented Mar 16, 2018

I have never tested with RDMA Hadoop, so I don't know whether Magpie works with it. Any number of changes in RDMA Hadoop could make it incompatible with Magpie, since Magpie assumes the Hadoop scripts work in a certain way, that the patches apply cleanly, that the same configuration and tool options exist, and so on.

Without any knowledge of your situation, here's a guess on the problem.

Magpie assumes the node's hostname as configured in Slurm (e.g. a nodelist such as "foo[1-10]") is the hostname to use for network communication, i.e. the Hadoop NodeManager listens on a host and port such as foo1:1234 and connects to the DataNode at foo2:5678.

If your cluster is not set up like this, then perhaps the InfiniBand portion of RDMA Hadoop is confused, because it is trying to connect to the host/IP that Magpie configured for it, which is not the host/IP it wants.
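
As a quick way to check that assumption (just a sketch using standard Slurm commands, not something Magpie provides), you could compare the names Slurm puts in the nodelist against what each allocated node calls itself:

    # expand the Slurm nodelist for the job
    scontrol show hostnames "$SLURM_JOB_NODELIST"

    # print the hostname each allocated node reports for itself
    srun hostname

If the names from srun hostname don't match the expanded nodelist, or they resolve to a different interface than the one RDMA Hadoop wants to use, that mismatch would line up with the connection errors you're seeing.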

Other than that, I think good old-fashioned log/conf file debugging is the way to go, and I'm glad to help. The script magpie-gather-config-files-and-logs-script.sh is a good way to gather the conf/log files to begin debugging with.


casty8 commented Mar 21, 2018

I'm trying to launch the job with the script that you recommended, but I'm getting errors like these in the slurm-jobid.out file:
magpie-output-config-files-script.sh: 10: [: 0: unexpected operator
magpie-gather-config-files-and-logs-script.sh: 29: [: y: unexpected operator

It also shows this one:
Magpie Internal Error: Magpie_get_networkedhdfspath called without HDFS networked path set used

I have tried modifying some of the config files provided by Magpie, adding options from my own files, but it still doesn't work.


chu11 commented Mar 22, 2018

I'm unsure about your setup, but something fairly basic seems to be wrong. It's unclear what it could be.

For

 magpie-output-config-files-script.sh: 10: [: 0: unexpected operator

the error is this line

if [ "${MAGPIE_CLUSTER_NODERANK}" == "0" ]                                                               

The environment variable MAGPIE_CLUSTER_NODERANK isn't defined, leading to the script error. That variable is normally set by Magpie in magpie/exports/magpie-exports-submission-type, so some earlier error is preventing it from being generated. This is a pretty core part of Magpie, and it probably means your setup is unusual in some way that keeps Magpie from calculating your job's node rank.
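
If you want to confirm whether the variable is being set at all, a hypothetical debug line added near that test would show it in your job output:

    # hypothetical debug line; prints the value, or "<unset>" if the variable was never exported
    echo "DEBUG: MAGPIE_CLUSTER_NODERANK='${MAGPIE_CLUSTER_NODERANK:-<unset>}'"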

Perhaps you can try a simple test: if you run a job, can you output the environment variables SLURM_NODEID, SLURM_NNODES, SLURM_JOB_NODELIST, SLURM_JOB_NAME, and SLURM_JOB_ID on each node of your allocation? Magpie needs these, and I believe at the moment it simply assumes Slurm always provides them.
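
A minimal sketch of such a test (assuming a small two-node allocation on the same partition; adjust the sbatch options for your site):

    #!/bin/bash
    #SBATCH --nodes=2
    #SBATCH --time=00:05:00
    # print the Slurm variables Magpie relies on, once per node
    srun --ntasks-per-node=1 bash -c 'echo "$(hostname): NODEID=$SLURM_NODEID NNODES=$SLURM_NNODES NODELIST=$SLURM_JOB_NODELIST NAME=$SLURM_JOB_NAME ID=$SLURM_JOB_ID"'

If any of those come back empty on some node, that would explain why Magpie can't compute the node rank.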
