
Code running for distributed graph training #7543

Open
onepiecewiley opened this issue Jul 19, 2024 · 3 comments

Comments

@onepiecewiley

I want to run the distributed GraphSAGE code in the examples/distributed directory, but I don't have physical machines, so I used VMware to build three virtual machines as nodes for distributed training. I followed the README to deploy the environment, set up NFS, etc., but running the code on node0 (the main node) fails with the following error:
```
(fordgl) wiley@wiley-virtual-machine:/home/ubuntu/workspace$ python /home/ubuntu/workspace/dgl/tools/launch.py \
    --workspace /home/ubuntu/workspace/dgl/examples/distributed/graphsage/ \
    --num_trainers 1 --num_samplers 0 --num_servers 1 \
    --part_config data/reddit.json --ip_config ip_config.txt \
    "python3 node_classification.py --graph_name reddit --ip_config ip_config.txt --num_epochs 30 --batch_size 1000"
The number of OMP threads per trainer is set to 2
/home/ubuntu/workspace/dgl/tools/launch.py:148: DeprecationWarning: setDaemon() is deprecated, set the daemon attribute instead
  thread.setDaemon(True)
Traceback (most recent call last):
  File "node_classification.py", line 5, in <module>
    import dgl
ModuleNotFoundError: No module named 'dgl'
Called process error Command 'ssh -o StrictHostKeyChecking=no -p 22 192.168.85.128 'cd /home/ubuntu/workspace/dgl/examples/distributed/graphsage/; (export DGL_ROLE=server DGL_NUM_SAMPLER=0 OMP_NUM_THREADS=1 DGL_NUM_CLIENT=3 DGL_CONF_PATH=data/reddit.json DGL_IP_CONFIG=ip_config.txt DGL_NUM_SERVER=1 DGL_GRAPH_FORMAT=csc DGL_SERVER_ID=0; python3 node_classification.py --graph_name reddit --ip_config ip_config.txt --num_epochs 30 --batch_size 1000)'' returned non-zero exit status 1.
[same ModuleNotFoundError and exit status 1 for the server processes on 192.168.85.130 (DGL_SERVER_ID=1) and 192.168.85.131 (DGL_SERVER_ID=2)]
/usr/bin/python3: Error while finding module specification for 'torch.distributed.run' (ModuleNotFoundError: No module named 'torch')
Called process error Command 'ssh -o StrictHostKeyChecking=no -p 22 192.168.85.128 'cd /home/ubuntu/workspace/dgl/examples/distributed/graphsage/; (export DGL_DIST_MODE=distributed DGL_ROLE=client DGL_NUM_SAMPLER=0 DGL_NUM_CLIENT=3 DGL_CONF_PATH=data/reddit.json DGL_IP_CONFIG=ip_config.txt DGL_NUM_SERVER=1 DGL_GRAPH_FORMAT=csc OMP_NUM_THREADS=2 DGL_GROUP_ID=0 ; python3 -m torch.distributed.run --nproc_per_node=1 --nnodes=3 --node_rank=0 --master_addr=192.168.85.128 --master_port=1234 node_classification.py --graph_name reddit --ip_config ip_config.txt --num_epochs 30 --batch_size 1000)'' returned non-zero exit status 1.
[same torch.distributed.run failure for the client processes on 192.168.85.130 (--node_rank=1) and 192.168.85.131 (--node_rank=2)]
cleanup process runs
Task failed
```
I did some preliminary investigation. The error messages say the dgl and torch packages cannot be found: after the launch command runs, the nodes use the Python interpreter in /usr/bin instead of the conda environment named fordgl that I created (which has all the packages installed). I set the environment variables, but it didn't help; every time the error appears, the /usr/bin interpreter is used rather than the one in the fordgl environment. How can I solve this problem?
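The symptom is consistent with how `ssh host command` behaves: launch.py runs each remote command in a non-interactive shell, and Ubuntu's stock `~/.bashrc` returns early for non-interactive shells before the `conda init` block (which puts the env's `bin` directory on `PATH`) is ever reached, so the lookup falls back to `/usr/bin/python3`. A minimal local sketch of the difference, using a throwaway rc file in place of the real `~/.bashrc`:

```shell
# Stand-in for the conda init block that `conda init` appends to ~/.bashrc.
echo 'export MARKER=from_bashrc' > /tmp/fake_bashrc

# An interactive shell sources the rc file, so the variable is visible:
bash --rcfile /tmp/fake_bashrc -ic 'echo "interactive:$MARKER"'

# A plain non-interactive shell (what `ssh host cmd` effectively gives
# launch.py) never reads the rc file, so the variable is empty:
bash -c 'echo "noninteractive:$MARKER"'
```

This is why the remote processes find `/usr/bin/python3` even though the fordgl env works fine in your interactive sessions on the same machines.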

@onepiecewiley
Author

Does the launch.py script use the Python interpreter in /usr/bin by default? How can I work around this?

@Rhett-Ying
Collaborator

The conda env is not activated when launching distributed training. If you want to use a conda env, you could try specifying the conda Python explicitly, like `python launch.py … "conda_python node_classification.py xxx"`. I am not sure if that works.
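Concretely, the suggestion amounts to replacing the bare `python3` inside the quoted training command with the absolute path of the fordgl env's interpreter, so the SSH-launched processes use it directly without any activation. The path below is only a guess at a default miniconda layout; find the real one with `conda env list` (or `conda run -n fordgl which python`) on each node. The sketch builds and prints the command rather than executing it, since launch.py and the partitioned data only exist on the cluster:

```shell
# Hypothetical interpreter path -- adjust to where the fordgl env
# actually lives on your nodes (it must be the same path on all of them).
CONDA_PY=/home/wiley/miniconda3/envs/fordgl/bin/python

# Same launch invocation as before, but the wrapped command now names
# the conda interpreter explicitly instead of the bare `python3`.
LAUNCH_CMD="python /home/ubuntu/workspace/dgl/tools/launch.py \
  --workspace /home/ubuntu/workspace/dgl/examples/distributed/graphsage/ \
  --num_trainers 1 --num_samplers 0 --num_servers 1 \
  --part_config data/reddit.json --ip_config ip_config.txt \
  \"$CONDA_PY node_classification.py --graph_name reddit --ip_config ip_config.txt --num_epochs 30 --batch_size 1000\""

echo "$LAUNCH_CMD"   # run the printed command on node0 once the path is correct
```

Whether launch.py also propagates this interpreter into the `torch.distributed.run` invocation it builds for the trainer processes is worth checking in the resulting log; if the trainers still fail with the /usr/bin interpreter, a cruder fallback is putting the env's `bin` directory on `PATH` for SSH sessions on every node (e.g. via `/etc/environment`).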


This issue has been automatically marked as stale due to lack of activity. It will be closed if no further activity occurs. Thank you.
