[Core] Torch distributed runs do not get cancelled with sky cancel
#3742
The same happens with distributed DeepSpeed jobs launched with |
Workaround for DeepSpeed: run |
Even without a distributed training job, cancelling a simple job that runs inside a container does not seem to work in general. The example YAML provided above also runs its task inside a container. I ran a very simple test to check whether a simple job running inside a container gets cancelled, and the current master branch fails to terminate the job's process. Repro:
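The repro snippet itself did not survive extraction, but a test of this shape could look like the following. This is a hypothetical sketch, not the original repro: the base image and the long-running command are placeholders I chose; the only assumption is that the task runs inside a Docker container, as described above.

```yaml
# Hypothetical minimal repro sketch: a long-running job inside a container.
# After `sky launch -c test task.yaml` and then `sky cancel test 1`,
# the issue is that the sleeping process may survive on the remote VM.
resources:
  image_id: docker:ubuntu:20.04  # run the task inside a container

run: |
  echo "job started"
  sleep 1000000
```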
This process with PID |
Closing as this is resolved by #3919.
Distributed training runs launched with `torch.distributed.run` or `torchrun` do not seem to get killed when we run `sky cancel`.

As a workaround, users must add `pkill -f -9 <script_name>` in their `run` section to make sure stale processes from previous runs are killed.

Example YAML - https://gist.github.com/romilbhardwaj/2759fd92d0678254db3c326c5d7549fe