[Core] Torch distributed runs do not get cancelled with `sky cancel` #3742

romilbhardwaj · 2024-07-10T01:34:57Z

Distributed training runs launched with `torch.distributed.run` or `torchrun` do not seem to get killed when we run `sky cancel`.

As a workaround, users must add `pkill -f -9 <script_name>` to their run section to make sure stale processes from previous runs are killed.
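
For illustration, here is a rough Python equivalent of that cleanup step. It is only a sketch: it scans a Linux-style `/proc` tree for command lines containing a pattern and sends SIGKILL, which approximates `pkill -f -9` (a plain substring match here, whereas `pkill -f` matches a regular expression). The names `find_stale_pids`, `kill_stale`, and the `proc_root` parameter are hypothetical and not part of SkyPilot.

```python
import os
import signal
from pathlib import Path


def find_stale_pids(pattern, proc_root="/proc"):
    """Return sorted PIDs whose full command line contains `pattern`.

    Substring match only; `pkill -f` itself matches a regular expression.
    """
    pids = []
    for entry in Path(proc_root).iterdir():
        if not entry.name.isdigit():
            continue  # not a process directory
        pid = int(entry.name)
        if pid == os.getpid():
            continue  # never match the cleanup script itself
        try:
            raw = (entry / "cmdline").read_bytes()
        except OSError:
            continue  # process exited or is not readable
        # /proc/<pid>/cmdline separates arguments with NUL bytes.
        cmdline = raw.replace(b"\x00", b" ").decode(errors="replace").strip()
        if pattern in cmdline:
            pids.append(pid)
    return sorted(pids)


def kill_stale(pattern, proc_root="/proc"):
    """Send SIGKILL to every matching process; return the PIDs signalled."""
    killed = []
    for pid in find_stale_pids(pattern, proc_root):
        try:
            os.kill(pid, signal.SIGKILL)
            killed.append(pid)
        except (ProcessLookupError, PermissionError):
            pass  # already gone, or owned by another user
    return killed
```

Calling `kill_stale("<script_name>")` before starting a new run would play the same role as the `pkill` line in the run section.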

Example YAML - https://gist.github.com/romilbhardwaj/2759fd92d0678254db3c326c5d7549fe

sky launch -c test nemo.yaml
sky cancel test
# The task gets cancelled in SkyPilot, but `ssh test nvidia-smi` still shows the GPUs in use and the Python training processes still running

romilbhardwaj · 2024-08-05T21:24:36Z

The same happens with distributed DeepSpeed jobs launched with `deepspeed <script>`.

romilbhardwaj · 2024-08-06T05:52:49Z

Workaround for DeepSpeed: run `sky exec <cluster_name> --num-nodes <n> -- pkill -9 -i -f deepspeed`.
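
If this cleanup needs to be scripted, a minimal sketch could wrap the same command in Python. It assumes the `sky` CLI is installed and on `PATH`; the helper names `build_cleanup_command` and `run_cleanup` are hypothetical, and the flags are exactly those from the workaround above.

```python
import shlex
import subprocess


def build_cleanup_command(cluster_name, num_nodes, pattern="deepspeed"):
    """Build the `sky exec ... -- pkill ...` argument list from the workaround."""
    return [
        "sky", "exec", cluster_name,
        "--num-nodes", str(num_nodes),
        "--",  # everything after this is the command run on the cluster
        "pkill", "-9", "-i", "-f", pattern,
    ]


def run_cleanup(cluster_name, num_nodes, pattern="deepspeed"):
    """Run the cleanup; assumes the `sky` CLI is installed and on PATH."""
    cmd = build_cleanup_command(cluster_name, num_nodes, pattern)
    print("Running:", shlex.join(cmd))
    # check=True is not used: pkill itself returns 1 when no process matched.
    return subprocess.run(cmd).returncode
```

For example, `run_cleanup("mycluster", 2)` would issue the command for a two-node cluster named `mycluster`.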

landscapepainter · 2024-08-30T06:04:16Z

Even without any distributed training job, cancelling a simple job running within a container does not seem to work in general. The example YAML provided above also runs the task within a container.

I ran a very simple test to see whether a simple job run within a container gets cancelled, and the current master branch fails to terminate the job's process.

Repro:
cancel_test.yaml:

workdir: ~/cancel_test

resources:
  image_id: docker:ubuntu:20.04

run:
  python3 cancel.py

~/cancel_test/cancel.py:

import time

time.sleep(7200)
print('completed!')

$ sky launch cancel_test.yaml -c mycluster --cloud gcp -y
$ sky cancel mycluster 1
$ ssh mycluster
$ ps aux | grep cancel
root        6612  0.0  0.0  11492  7844 ?        SN   05:14   0:00 python3 cancel.py

The process with PID 6612 should have been terminated, but it is still running. cc @romilbhardwaj @Michaelvll
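
The `ps aux | grep cancel` check can also be scripted when verifying a fix. A minimal sketch, assuming a POSIX system and that the script runs in the same PID namespace as the job; `is_running` is a hypothetical helper name. Note that a zombie process that has not been reaped still counts as existing.

```python
import os


def is_running(pid):
    """Return True if a process with this PID exists (POSIX only)."""
    try:
        os.kill(pid, 0)  # signal 0 only checks existence; nothing is delivered
    except ProcessLookupError:
        return False
    except PermissionError:
        return True  # exists, but owned by another user
    return True
```

After `sky cancel`, `is_running(6612)` returning `True` on the node would correspond to the stale process shown in the `ps` output.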

landscapepainter · 2024-09-11T05:26:01Z

Closing as it's resolved with #3919

Michaelvll added the P0 label Aug 29, 2024

landscapepainter self-assigned this Aug 30, 2024

This was referenced Aug 31, 2024

[Core] 'sky cancel' failing to terminate first job started with 'sky launch' #3898

Closed

[Core] 'sky cancel' failing to terminate jobs ran within docker container #3899

Closed

landscapepainter closed this as completed Sep 11, 2024
