-
Notifications
You must be signed in to change notification settings - Fork 470
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[Core] Update subprocess_daemon.py to terminate all child processes from ray job when 'sky cancel' is ran #3919
Changes from 3 commits
bdfd2f4
31444a3
fe1418e
95eb570
b69cde1
4e1e7f3
5c34c72
7643b3a
8320572
f7b92c3
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -1,17 +1,35 @@ | ||
"""Sky subprocess daemon. | ||
|
||
Wait for parent_pid to exit, then SIGTERM (or SIGKILL if needed) the child | ||
processes of proc_pid. | ||
""" | ||
|
||
import argparse | ||
import os | ||
import sys | ||
import time | ||
|
||
import psutil | ||
|
||
if __name__ == '__main__': | ||
|
||
def daemonize(): | ||
"""Separates the process from its parent process with double-forking.""" | ||
# First fork | ||
if os.fork() > 0: | ||
# original process terminates. | ||
sys.exit() | ||
|
||
# Continues to run from first forked child process. | ||
# Detach from parent environment | ||
os.setsid() | ||
|
||
# Second fork | ||
if os.fork() > 0: | ||
# The first forked child process terminates. | ||
sys.exit() | ||
# Continues to run from second forked child process. | ||
|
||
|
||
if __name__ == '__main__': | ||
daemonize() | ||
parser = argparse.ArgumentParser() | ||
parser.add_argument('--parent-pid', type=int, required=True) | ||
parser.add_argument('--proc-pid', type=int, required=True) | ||
|
@@ -28,29 +46,33 @@ | |
if process is None: | ||
sys.exit() | ||
|
||
children = [] | ||
if parent_process is not None: | ||
# Wait for either parent or target process to exit. | ||
while process.is_running() and parent_process.is_running(): | ||
try: | ||
# if process is terminated by the time reaching this line, | ||
# it returns an empty list. | ||
tmp_children = process.children(recursive=True) | ||
if tmp_children: | ||
children = tmp_children | ||
children.append(process) | ||
Michaelvll marked this conversation as resolved.
Show resolved
Hide resolved
|
||
except psutil.NoSuchProcess: | ||
pass | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Why do we move this in the while loop? Is it to make sure that even if the If that is the purpose, we should mention that in the comment. (style nit: comment should be more about There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. The reason you mentioned is exactly correct on why I moved this into Thanks for the reminder advice! The comment is updated at 4e1e7f3 |
||
time.sleep(1) | ||
|
||
try: | ||
children = process.children(recursive=True) | ||
children.append(process) | ||
except psutil.NoSuchProcess: | ||
sys.exit() | ||
|
||
for pid in children: | ||
for child in children: | ||
try: | ||
pid.terminate() | ||
child.terminate() | ||
except psutil.NoSuchProcess: | ||
pass | ||
continue | ||
|
||
# Wait 30s for the processes to exit gracefully. | ||
time.sleep(30) | ||
|
||
# SIGKILL if they're still running. | ||
for pid in children: | ||
for child in children: | ||
try: | ||
pid.kill() | ||
child.kill() | ||
except psutil.NoSuchProcess: | ||
pass | ||
continue |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It seems we have already set
start_new_session=True
when we start this daemon, which should already dosetsid
for this process. Do we know why that does not work?skypilot/sky/skylet/log_lib.py
Lines 211 to 219 in 2e545b8
Reference: https://docs.python.org/3/library/subprocess.html
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So the
start_new_session=True
flag certainly does start a new session by runningsetsid
. See the three comparison below:ray
job with the flag,start_new_session=True
, and without the daemonization forsubprocess_daemon.py
I added(what we currently have atmaster
branch):As you can see from above, the
SID
, session id, is different for the process runningsubprocess_daemon.py
andray::task
(4536
and3527
). And this is due to thestart_new_session=True
flag. And you can see that process runningsubprocess_daemon.py
is a child process of theray::task
process, which kills the daemon process whenray
'sstop_job
API is ran.ray
job with the flag,start_new_session=True
, and daemonization forsubprocess_daemon.py
I added(what we currnetly have at this PR):As you can see from above, the
SID
is different(4526
and3516
) for the process runningsubprocess_daemon.py
andray::task
. Also, the parent process id,PPID
, of the process runningsubprocess_daemon.py
is notray::task
anymore with the daemonization added from this PR. As thesubdaemon_process.py
process is completely detached from it's original parent process,ray::task
, thesubprocess_daemon.py
process does not get killed anymore whenray
'sstop_job
API is ran.ray
job without the flag,start_new_session=True
, but with daemonization forsubprocess_daemon.py
I added(just to show session group is separated withstart_new_session=True
flag):As you can see from above, the
SID
is identical(3502
) for bothray::task
andsubprocess_deamon.py
as I removed thestart_new_session=True
flag, so the flag is indeed creating a new session when it's launchingsubprocess_daemon.py
as a process.In my understanding, creating a process with a separate session group doesn't necessarily detaches the process(
subprocess_daemon.py
) from it's parent process(ray::task
) in terms of process hierarchy. So the additional daemonization is necessary.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the comprehensive investigation for this @landscapepainter ! This is very useful! I am wondering if it's possible that we can add some additional arguments to the subprocess call in python to achieve this detachment. That might be cleaner than having the current handling in the
subprocess_daemon.py
because the fork can cause two processes (though one will be exited immediately) to be generated every time we start this demon process?There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@Michaelvll I did some extensive search on this before I implemented current solution, and did more research in case I missed any.
The observed reason why
ray
'sstop_job
api killssubprocess_daemon.py
is because the process is a child process ofray::task
within the process hierarchy although the session group and process group are different, which implies the possible solution is to runsubprocess_daemon.py
in a separate process hierarchy fromray::task
process. Unfortunately, there isn't an option forsubprocess.Popen()
to completely detach the child process from the process hierarchy.Another approach that seems feasible is to update the way how we create ray job scripts from
RayCodeGen
and submit jobs to perhaps run thesubprocess_daemon.py
separately from ray job. But this seems to include much larger refactoring task since now skypilot needs access to the process IDs generated within remoteray::task
to pass as an argument to and runsubprocess_daemon.py
outside of the job script.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That makes sense. If we
daemonize
it in this file already, do we still need to addstart_new_session
in ourPopen
call?Also, please leave a comment at the top of this file or this function to say why we daemonize this process, e.g., to make sure
ray job stop
does not kill this process, so this process can handle any leaked processes from user program.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
start_new_session
is redundant at this point as you mentioned, and confirmed it works fine without it. Removed the option, and also updated the docstring.