[Core] 'sky cancel' failing to terminate first job started with 'sky launch' #3898

Closed
landscapepainter opened this issue Aug 31, 2024 · 0 comments · Fixed by #3919

landscapepainter commented Aug 31, 2024

On the current master branch, sky cancel fails to terminate the job created by sky launch, which has job ID 1. A second job started afterwards with sky exec (job ID 2) is terminated successfully by sky cancel. So sky cancel currently behaves differently for job 1 than for the jobs run after it.

Repro:
cancel_test.yaml:

workdir: ~/cancel_test

run:
  python3 cancel.py

~/cancel_test/cancel.py:

import time

time.sleep(7200)
print('completed!')

$ sky launch cancel_test.yaml -c mycluster --cloud gcp -y
$ sky cancel mycluster 1
$ ssh mycluster
$ ps aux | grep cancel
root        6612  0.0  0.0  11492  7844 ?        SN   05:14   0:00 python3 cancel.py
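
The `ps aux` line above shows the `python3 cancel.py` process still alive after `sky cancel mycluster 1`. As a minimal sketch of how that manual check could be scripted, the helper below parses `ps aux` text and returns the PIDs whose command line still contains a given string. The function name `leftover_pids` and the sample input are illustrative assumptions, not part of SkyPilot.

```python
from typing import List


def leftover_pids(ps_output: str, needle: str) -> List[int]:
    """Return PIDs from `ps aux` text whose command contains `needle`.

    Hypothetical helper for this repro; not a SkyPilot API.
    """
    pids = []
    for line in ps_output.splitlines():
        # `ps aux` prints 11 columns; the last one is the full command line.
        fields = line.split(None, 10)
        if len(fields) < 11 or not fields[1].isdigit():
            continue  # header row or malformed line
        command = fields[10]
        # Skip the grep process itself if the output came from `ps aux | grep ...`.
        if needle in command and not command.startswith("grep"):
            pids.append(int(fields[1]))
    return pids


# Sample line copied from the repro output above.
sample = "root        6612  0.0  0.0  11492  7844 ?        SN   05:14   0:00 python3 cancel.py"
print(leftover_pids(sample, "cancel.py"))
```

An empty list after `sky cancel` would indicate the job process was terminated; a non-empty list reproduces the bug described here.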

This may be the root cause of #3742

Version & Commit info:

  • sky -v: PLEASE_FILL_IN
  • sky -c: PLEASE_FILL_IN
@landscapepainter landscapepainter changed the title [Core] sky cancel failing to terminate first job started with sky launch [Core] 'sky cancel' failing to terminate first job started with 'sky launch' Aug 31, 2024
@landscapepainter landscapepainter self-assigned this Aug 31, 2024
@Michaelvll Michaelvll added the P0 label Sep 10, 2024