Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

0.3.5: processes of the "w" Windows launcher executables can't be killed anymore #175

Closed
bastimeyer opened this issue Jul 22, 2022 · 31 comments
Assignees

Comments

@bastimeyer
Copy link

Describe the bug

After the launcher files were updated in 8d2acb5 (0.3.5), processes spawned from the "w" launchers can't be killed anymore. I am facing this issue when trying to spawn and kill the processes from a NodeJS context. Calling childprocess.kill() via NodeJS's child_process module returns success, but the python process keeps running.

This is a major problem for me, because my NodeJS application (GUI application via NW.js / Electron) is running a python CLI application built with the distlib launcher executables, and it needs to use the "w" launchers, because otherwise a console window pops up, which is bad for the user experience.

0.3.3 is the most recent release which is working fine.

0.3.4 had the immediate 3221225477 / 0xC0000005 exit code error, and I already had to ignore this version when building standalone installers of my python application. Fortunately, pip had reverted back to distlib 0.3.3 until recently, so installing via pip was no problem either, but now it has merged the distlib 0.3.5 release, which is once again bad, and the issue needs to be fixed.
Related (old 0.3.4 issue): takluyver/pynsist#243 (comment)

Apparently, these are the recent launcher changes:
https://bitbucket.org/vinay.sajip/simple_launcher/commits/

To Reproduce

I've created a reproduction repo (already did this for 0.3.4). Please see the build config and build+run scripts for the 0.3.5 issue:
https://github.com/bastimeyer/distlib-bug/tree/53f520efec1a6173b7ca037d26c89a6dec8ca633

The resulting GH action logs:
https://github.com/bastimeyer/distlib-bug/actions/runs/2718462897

0.3.5-t64 (success): https://github.com/bastimeyer/distlib-bug/runs/7467911880?check_suite_focus=true#step:7:1
0.3.5-w64 (failure): https://github.com/bastimeyer/distlib-bug/runs/7467912062?check_suite_focus=true#step:7:1

0.3.3-t64 (success): https://github.com/bastimeyer/distlib-bug/runs/7467911132?check_suite_focus=true#step:7:1
0.3.3-w64 (success): https://github.com/bastimeyer/distlib-bug/runs/7467911332?check_suite_focus=true#step:7:1

Environment

  • OS, including version: Windows (any)
  • Version of this library: 0.3.5
@bastimeyer
Copy link
Author

Pinging @vsajip, since he doesn't have the repo on his watch list and probably wasn't notified.
takluyver/pynsist#243 (comment)

@vsajip
Copy link
Collaborator

vsajip commented Jul 25, 2022

Whoops, not sure why I wasn't watching - I've rectified that. Thanks for the tip.

@vsajip vsajip self-assigned this Jul 25, 2022
@vsajip
Copy link
Collaborator

vsajip commented Jul 31, 2022

It's a bit tricky to see what's going on. With the w64.exe that is part of distlib 0.3.5, I can create two launcher test executables from the same source script - one with a shebang referencing c:/python38/python.exe and the other with a shebang referencing the interpreter from a venv created from c:/python38/python.exe. The test executable using the venv's shebang appears not to respond to a subprocess.Popen.kill(), but the other test executable does respond correctly and shuts down. In the failing case, the test executable itself is closed, but the Python interpreter it spawned isn't. For the moment, mystified 😕

@vsajip
Copy link
Collaborator

vsajip commented Aug 1, 2022

OK ... it looks as if in the venv case, the Python interpreter is re-invoking the script using the base interpreter - I will put together a test case for this.

@vsajip
Copy link
Collaborator

vsajip commented Aug 2, 2022

Do you experience the same with your GUI executables, @cbrnr ?

@cbrnr
Copy link

cbrnr commented Aug 2, 2022

I will check. What are the steps to reproduce this problem?

@vsajip
Copy link
Collaborator

vsajip commented Aug 2, 2022

Thanks. Just launch a GUI executable (built using a recent pip which has vendored distlib 0.3.5) using e.g. subprocess.Popen() and then try to kill it using subprocess.Popen.kill().

@vsajip
Copy link
Collaborator

vsajip commented Aug 2, 2022

I seem to have got closer to finding the cause of this problem. I have set up a test repository with some GitHub test steps. The summary is:

  • On Python 3.6, all launchers behave as expected.
  • On Python 3.7 and later, windowed executables don't terminate when subprocess.Popen.kill() is called on them.

The same launchers (from distlib 0.3.5) were used for all tests. Here is an example test run.

What changed is that from Python 3.7.2 onwards,

venv on Windows no longer copies the original binaries, but creates redirector scripts named python.exe and pythonw.exe instead.

So, it seems that the pythonw.exe redirector script isn't behaving as expected.

I'd appreciate a second opinion from @eryksun and/or @mbikovitsky, as they were involved in the recent changes to the launcher code during work to fix this pip issue. It may be that this issue needs to be raised against the CPython project itself, rather than here.

@mbikovitsky
Copy link

I think there are two related issues here.

The first one is in the distlib launcher:

  1. The "windowed" version of the launcher calls clear_app_starting_state after launching the child process but before assigning it to the job object.
  2. Inside of clear_app_starting_state there is a call to WaitForInputIdle without a timeout.
  3. In the case of running inside a virtualenv, the child process we're waiting for is Python's venvlauncher. That's the pythonw.exe inside the virtualenv that is responsible for finding and launching the actual Python interpreter.
  4. This launcher, AFAICT, never becomes "idle". Therefore, the WaitForInputIdle call in the parent never returns, the child is never assigned to a job, and so when the parent is killed the child lives on.

That last part is a little bit of speculation on my part, but I have some circumstantial evidence:

  1. When the distlib launcher is killed, the child process does not belong to a job. This can be observed by launching the tests under WinDbg with windbg.exe -g -o python run_launcher_tests.py -w, which will break when the distlib launcher is terminated. Then, looking at the process tree in Process Hacker or Process Explorer, we can see that the venvlauncher is not part of a job.
  2. When the distlib launcher is killed, it's blocked inside WaitForInputIdle. Again, this can be observed in the same way using WinDbg. Looking around the call site, this WaitForInputIdle is exactly the one inside clear_app_starting_state.
  3. The clear_app_starting_state performs more work in order to clear the "idle" state than the similar code in the venvlauncher. In the comments there's an empirical explanation of why this is necessary. This might explain why WaitForInputIdle never returns.

If my reasoning is correct, then it will also explain why launching the Python interpreter directly, without going through the venvlauncher, works (run_launcher_tests.py -s "" -w). In that case the WaitForInputIdle returns when the Python test script shows the message box (the message box is probably enough to trigger the "idle" state).

I think the immediate fix is to move the call to AssignProcessToJobObject immediately after the process creation. Unfortunately I can't easily test this right now, so until then it's just a theory.

The other issue, perhaps more suitable for the CPython project, is the divergence between the distlib and virtualenv launchers. They perform different tasks, but face similar problems: proxying the "idle" state, forwarding standard I/O, etc. Perhaps there should be some collaboration to avoid duplication of work (and bugs).

@vsajip
Copy link
Collaborator

vsajip commented Aug 3, 2022

In that case the WaitForInputIdle returns when the Python test script shows the message box (the message box is probably enough to trigger the "idle" state).

That suggests that expected behaviour (with respect to WaitForInputIdle) depends on the specifics of the script being run. Arguably, pythonw.exe should clear the idle state reliably, independently of what script it's running. Would you agree?

The other issue, perhaps more suitable for the CPython project, is the divergence between the distlib and virtualenv launchers.

Sure, the simple_launcher was adapted from my work on the Python launcher (py.exe / pyw.exe), and this was well before venvlauncher was introduced. There has been some divergence, over the years, but normally (as you might expect for a volunteer project) some reported issue is the trigger for looking at divergences and deciding what to do about them.

@mbikovitsky
Copy link

Arguably, pythonw.exe should clear the idle state reliably, independently of what script it's running. Would you agree?

No, I think the current behaviour of the interpreter is correct. If the interpreter were to always clear the "idle state" on startup, it might introduce subtle bugs in existing code. For instance, consider the following setup:

  1. Script a.py creates a message box on startup and waits for the user to click "OK".
  2. Script b.py launches script a.py, calls WaitForInputIdle to know when it's ready to accept input, searches for the message box it created, and programmatically clicks the "OK" button.

This works right now because WaitForInputIdle returns when a.py has created the message box and is ready to accept input. If the Python interpreter were to clear the idle state itself, we would have a race condition: b.py would start examining the child process while it's still initializing.

In fact, this is the exact use case that the documentation for WaitForInputIdle describes.

That suggests that expected behaviour (with respect to WaitForInputIdle) depends on the specifics of the script being run.

Yep.

This means that both the distlib and venv launchers should proxy this state. I think the current solution in clear_app_starting_state is fine, except that maybe a timeout should be added to the wait. If WaitForInputIdle times out, don't set the idle state. And the same thing should be done in the venv launcher as well.

@vsajip
Copy link
Collaborator

vsajip commented Aug 3, 2022

I think the immediate fix is to move the call to AssignProcessToJobObject immediately after the process creation.

This worked in an initial test on my system. Good call 👏

@vsajip
Copy link
Collaborator

vsajip commented Aug 3, 2022

it might introduce subtle bugs in existing code. For instance, consider the following setup

Good point.

@vsajip vsajip closed this as completed in 52de225 Aug 3, 2022
@vsajip
Copy link
Collaborator

vsajip commented Aug 3, 2022

I've rebuilt all the launchers with the change that @mbikovitsky suggested, and the above commit adds them to distlib. Local tests seem to work. Please try with the updated launchers and once I receive confirmation that the change fixes the problem, I'll see about cutting a release. Reopening for now, will close when the release is done.

@vsajip vsajip reopened this Aug 3, 2022
@bastimeyer
Copy link
Author

Thanks, looks like it's working fine now. 👍

I've added 52de225 to my reproduction repo:

I've also done a production-build test and built my python application using that distlib commit and killing the process of the "...w.exe" executable now works.

@eryksun
Copy link

eryksun commented Aug 4, 2022

Sure, the simple_launcher was adapted from my work on the Python launcher (py.exe / pyw.exe), and this was well before venvlauncher was introduced. There has been some divergence, over the years, but normally (as you might expect for a volunteer project) some reported issue is the trigger for looking at divergences and deciding what to do about them.

Python 3.11 uses a new implementation of the py launcher that was developed by Steve Dower. It's implemented in "PC/launcher2.c". It's mostly complete, but still ironing out corner cases and niche features (e.g. the "/usr/bin/env" search is just getting implemented).

The venv launchers are still implemented in "PC/launcher.c". It would be nice if the script launcher could bypass the environment launcher. That requires coordinated development, or maybe implementing a common launcher for scripts and virtual environments.

@vsajip
Copy link
Collaborator

vsajip commented Aug 4, 2022

Python 3.11 uses a new implementation of the py launcher that was developed by Steve Dower. It's implemented in "PC/launcher2.c"

Oh, right - I had no idea. Do you have any information on the reason for the re-implementation? I've seen no discussion on discuss.python.org ...

@vsajip
Copy link
Collaborator

vsajip commented Aug 4, 2022

I just took a quick look at PC/launcher2.c - it doesn't seem to implement any of the functionality for proxying idle state, etc. I'll try to start a discussion on discuss.python.org.

@vsajip
Copy link
Collaborator

vsajip commented Aug 4, 2022

Thanks, looks like it's working fine now.

Thanks for the quick feedback. I think I will implement one more pending change - that timeout in the WaitForInputIdle that @mbikovitsky mentioned. When that's ready, I'll ask your indulgence regarding re-testing the updated launchers.

@vsajip
Copy link
Collaborator

vsajip commented Aug 4, 2022

I propose to change clear_app_starting_state() to the following:

#define INPUT_IDLE_TIMEOUT 1000

static void
clear_app_starting_state() {
    MSG msg;
    HWND hwnd;
    DWORD rc;

    PostMessageW(0, 0, 0, 0);
    GetMessageW(&msg, 0, 0, 0);
    /*
     * Proxy the child's input idle event.
     * See https://github.com/pypa/distlib/issues/175#issuecomment-1204452974
     */
    rc = WaitForInputIdle(child_process_info.hProcess, INPUT_IDLE_TIMEOUT);
    if (rc == 0) {
        /*
         * Signal the process input idle event by creating a window and pumping
         * sent messages. The window class isn't important, so just use the
         * system "STATIC" class.
         */
        hwnd = CreateWindowExW(0, L"STATIC", L"PyLauncher", 0, 0, 0, 0, 0,
                               HWND_MESSAGE, NULL, NULL, NULL);
        /* Process all sent messages and signal input idle. */
        PeekMessageW(&msg, hwnd, 0, 0, 0);
        DestroyWindow(hwnd);
    }
}

Does this seem reasonable? I tried it like this and local tests seem to work. I'm also assuming that one second is reasonable as a timeout.

@eryksun
Copy link

eryksun commented Aug 4, 2022

There's no need to pass a finite timeout to WaitForInputIdle(). The child could take an arbitrary amount of time before it creates a window and starts processing window messages. If the child never goes idle, then the wait will return WAIT_FAILED, and the launcher will move on to proxy the exit status. It just has to ensure that the child is assigned to the job object beforehand.

@vsajip
Copy link
Collaborator

vsajip commented Aug 4, 2022

If the child never goes idle, then the wait will return WAIT_FAILED

If the timeout is INFINITE, when does the wait return? Only when the child exits?

@eryksun
Copy link

eryksun commented Aug 4, 2022

If the timeout is INFINITE, when does the wait return? Only when the child exits?

If the child exits, that's the "child never goes idle" case, which returns WAIT_FAILED. Otherwise, WaitForInputIdle(hProcess, INFINITE) returns when the input idle event of the child process is signaled. In that case (i.e. rc == 0), the script launcher signals its own input idle event by creating a hidden window and peeking messages. That way anything that calls WaitForInputIdle() on the launcher is indirectly checking the input idle state of the child process.

@vsajip
Copy link
Collaborator

vsajip commented Aug 4, 2022

If no finite timeout needs to be passed to WaitForInputIdle(), then I guess the current launchers in the distlib repo are good to go, as long as there are no other adverse reports.

@vsajip
Copy link
Collaborator

vsajip commented Aug 6, 2022

I've made an additional change to the launchers, implementing the improved control key/message handling as suggested by @eryksun. Please check and confirm that the updated launchers in this commit still work in your use cases.

@bastimeyer
Copy link
Author

Please check and confirm that the updated launchers in this commit still work in your use cases.

Built my python app with distlib @ 37df85a (current master), and everything's working as expected.

@vsajip
Copy link
Collaborator

vsajip commented Aug 6, 2022

Thanks for the feedback. That's good to know - I'll aim to cut a release before too long.

@eryksun
Copy link

eryksun commented Aug 6, 2022

I've made an additional change to the launchers, implementing the improved control key/message handling as suggested by @eryksun. Please check and confirm that the updated launchers in this commit still work in your use cases.

Note that this change only affects console scripts. The console script launcher doesn't load "user32.dll" (e.g. it uses no shell32 functions such as SHGetFolderPathW), so when the session is ending for logoff or shutdown, the system sends it CTRL_LOGOFF_EVENT or CTRL_SHUTDOWN_EVENT. Also, when the attached console (classic or a tab in Terminal) is closing, the system sends CTRL_CLOSE_EVENT.

The GUI script launcher loads "user32.dll" (for WaitForInputIdle, CreateWindowExW, DestroyWindow, PeekMessageW, GetMessageW, and PostMessageW), so the system doesn't send it CTRL_LOGOFF_EVENT or CTRL_SHUTDOWN_EVENT. To support gracefully exiting of Python scripts at logoff or shutdown, the GUI launcher would have to create a hidden window, pump messages, and handle WM_QUERYENDSESSION and WM_ENDSESSION. When the session is ending, remove Python from the job object and exit. The need to create a hidden window means the WaitForInputIdle() call would need a reasonable timeout such as 10 seconds. After WaitForInputIdle() returns due to success or timeout, create the hidden window to watch for the session ending. This has the secondary effect of signalling the launcher's input idle event.

@vsajip
Copy link
Collaborator

vsajip commented Aug 8, 2022

I'm a little short of time at the moment. Let's see if not doing the above leads to any reported problems. If anyone wants to work up a PR for the above comment, please feel free.

@pradyunsg
Copy link
Member

@vsajip Wanna close this out, now that 0.3.6 has been released?

@vsajip
Copy link
Collaborator

vsajip commented Aug 26, 2022

Sure.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

6 participants