[Core] Torch distributed runs do not get cancelled with sky cancel #3742

Closed
romilbhardwaj opened this issue Jul 10, 2024 · 4 comments

@romilbhardwaj
Collaborator

Distributed training runs launched with torch.distributed.run or torchrun do not seem to get killed when we run sky cancel.

As a workaround, users must add pkill -f -9 <script_name> to their run section so that stale processes from previous runs are killed.
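For setups where the cleanup is easier to do from Python than from the shell, the pkill workaround can be sketched roughly as below. This is a minimal, Linux-only sketch that reads /proc; the helper names `find_pids_by_cmdline` and `kill_stale` are made up for illustration and are not part of SkyPilot. Unlike `pkill -f`, which matches a regular expression, it does a plain substring match on the command line.

```python
import os
import signal
from typing import List


def find_pids_by_cmdline(pattern: str) -> List[int]:
    """Return PIDs whose full command line contains `pattern` (Linux /proc only)."""
    matches = []
    for entry in os.listdir("/proc"):
        if not entry.isdigit():
            continue  # not a process directory
        pid = int(entry)
        if pid == os.getpid():
            continue  # never match the cleanup script itself
        try:
            with open(f"/proc/{pid}/cmdline", "rb") as f:
                # Arguments are NUL-separated; join them with spaces.
                cmdline = f.read().replace(b"\0", b" ").decode(errors="replace")
        except OSError:
            continue  # process already exited or is not readable
        if pattern in cmdline:
            matches.append(pid)
    return matches


def kill_stale(pattern: str) -> List[int]:
    """Send SIGKILL to every matching process; return the PIDs that were signalled."""
    killed = []
    for pid in find_pids_by_cmdline(pattern):
        try:
            os.kill(pid, signal.SIGKILL)
            killed.append(pid)
        except (ProcessLookupError, PermissionError):
            pass  # gone already, or owned by another user
    return killed
```

Calling `kill_stale("<script_name>")` at the start of a run would then play the same role as the pkill line, with the same caveat: anything whose command line contains the pattern is killed, so the pattern should be specific.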

Example YAML - https://gist.github.com/romilbhardwaj/2759fd92d0678254db3c326c5d7549fe

sky launch -c test nemo.yaml
sky cancel test
# The task is cancelled in SkyPilot, but `ssh test nvidia-smi` still shows the GPUs in use and the Python training processes running
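
The thread does not say why the training processes survive. One common way to get this symptom, shown here purely as an illustration and not as SkyPilot's actual cancel logic, is that only the launcher process is signalled while the workers it spawned keep running. In the sketch below, `LAUNCHER` is a hypothetical stand-in for torchrun, and `is_running` and `worker_survives` are illustrative names. It is Linux-only because it reads /proc.

```python
import os
import signal
import subprocess
import sys
import time

# Stand-in for a launcher such as torchrun: spawn one worker, report its PID, then wait.
LAUNCHER = """
import subprocess, sys, time
worker = subprocess.Popen([sys.executable, "-c", "import time; time.sleep(60)"])
print(worker.pid, flush=True)
time.sleep(60)
"""


def is_running(pid: int) -> bool:
    """True if `pid` exists and is not a zombie (Linux /proc only)."""
    try:
        with open(f"/proc/{pid}/stat") as f:
            # Format is "pid (comm) state ..."; comm may contain spaces,
            # so split on the last ')'.
            state = f.read().rsplit(")", 1)[1].split()[0]
    except (OSError, IndexError):
        return False
    return state != "Z"


def worker_survives(kill_group: bool) -> bool:
    """Kill the launcher (or its whole process group); report if the worker is still running."""
    # start_new_session=True makes the launcher a process-group leader,
    # so its process group ID equals its PID and the worker inherits that group.
    launcher = subprocess.Popen(
        [sys.executable, "-c", LAUNCHER],
        stdout=subprocess.PIPE, text=True, start_new_session=True)
    worker_pid = int(launcher.stdout.readline())
    if kill_group:
        os.killpg(launcher.pid, signal.SIGKILL)  # launcher and worker
    else:
        os.kill(launcher.pid, signal.SIGKILL)    # launcher only
    launcher.wait()
    launcher.stdout.close()
    # When the group was killed, allow a moment for the worker to exit.
    deadline = time.monotonic() + 5
    while kill_group and is_running(worker_pid) and time.monotonic() < deadline:
        time.sleep(0.05)
    survived = is_running(worker_pid)
    if survived:
        try:
            os.kill(worker_pid, signal.SIGKILL)  # clean up the orphaned worker
        except ProcessLookupError:
            pass
    return survived
```

With these definitions, `worker_survives(kill_group=False)` leaves the worker behind, while `worker_survives(kill_group=True)` does not. Workers that move themselves into a different session or process group would escape a group kill as well, so this only covers the simplest case.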
@romilbhardwaj
Collaborator Author

The same happens with distributed DeepSpeed jobs launched with deepspeed <script>.

@romilbhardwaj
Collaborator Author

Workaround for deepspeed: run sky exec <cluster_name> --num-nodes <n> -- pkill -9 -i -f deepspeed

@Michaelvll Michaelvll added the P0 label Aug 29, 2024
@landscapepainter landscapepainter self-assigned this Aug 30, 2024
@landscapepainter
Collaborator

landscapepainter commented Aug 30, 2024

Cancelling does not seem to work for any job running inside a container, even a simple one with no distributed training. The example YAML provided above also runs its task inside a container.

I ran a very simple test to check whether a simple job running inside a container gets cancelled. On the current master branch, the job's process is not terminated.

Repro:
cancel_test.yaml:

workdir: ~/cancel_test

resources:
  image_id: docker:ubuntu:20.04

run:
  python3 cancel.py

~/cancel_test/cancel.py:

import time

time.sleep(7200)
print('completed!')

$ sky launch cancel_test.yaml -c mycluster --cloud gcp -y
$ sky cancel mycluster 1
$ ssh mycluster
$ ps aux | grep cancel
root        6612  0.0  0.0  11492  7844 ?        SN   05:14   0:00 python3 cancel.py

The process with PID 6612 should have been terminated, but it is still running. cc @romilbhardwaj @Michaelvll
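
To tell "no signal ever reached the process" apart from "a signal arrived but the process kept going", the repro script could be instrumented along these lines. This is only a diagnostic sketch: the handler and function names are hypothetical, the choice of SIGTERM, SIGINT and SIGHUP is a guess at what a cancel might send, and SIGKILL cannot be caught at all.

```python
import os
import signal
import sys
import time

received = []  # names of the signals seen so far


def log_and_exit(signum, frame):
    """Record the signal, log it to stderr, then exit with the usual 128+N status."""
    name = signal.Signals(signum).name
    received.append(name)
    print(f"cancel.py: received {name}", file=sys.stderr, flush=True)
    raise SystemExit(128 + signum)


def install_handlers():
    # Assumption: a cancel would arrive as one of these catchable signals.
    for sig in (signal.SIGTERM, signal.SIGINT, signal.SIGHUP):
        signal.signal(sig, log_and_exit)


def run(duration: float) -> None:
    """Same behaviour as the original cancel.py, plus signal logging."""
    install_handlers()
    time.sleep(duration)
    print("completed!")
```

Replacing the body of cancel.py with `run(7200)` keeps the repro the same. If the process is still listed by ps after sky cancel and nothing was logged, then none of these signals was delivered to it.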

@landscapepainter
Collaborator

Closing, as this is resolved by #3919.
