100% CPU load after cancel #695
Comments
Infinite call recursion in anyio/src/anyio/_backends/_asyncio.py Line 469 in e0529a3
|
Without SIGINT:

#!/usr/bin/env python3
from anyio import CancelScope, create_task_group, run, sleep


async def shield_task() -> None:
    with CancelScope(shield=True):
        await sleep(60)


async def task() -> None:
    async with create_task_group() as tg:
        tg.start_soon(shield_task)


async def main() -> None:
    async with create_task_group() as tg:
        tg.start_soon(task)
        tg.cancel_scope.cancel()


run(main)
|
I can repro this too. But infinite call recursion? How do you figure that? |
Yeah, not infinite call recursion. The top level cancel scope continuously retries cancellation because it only sees that its immediate child task ( |
You can see the same behavior by modifying the

async def main() -> None:
    async with create_task_group() as tg:
        tg.start_soon(task)
        await wait_all_tasks_blocked()
        tg.cancel_scope.cancel()
|
The trick is, I suppose, how to make it figure out that it shouldn't try to cancel the middle task which is waiting on the task which is in a shielded scope. |
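To make the symptom concrete, here is a minimal sketch in plain asyncio of why a callback that keeps rescheduling itself pins the CPU. It is only a simplified model of the "continuously retries cancellation" behavior described above, not AnyIO's actual code; `count_retries` and `retry` are hypothetical names made up for this illustration.

```python
import asyncio


def count_retries(duration: float = 0.05) -> int:
    """Count how often a self-rescheduling callback runs within `duration` seconds."""

    async def main() -> int:
        loop = asyncio.get_running_loop()
        calls = 0
        stop = False

        def retry() -> None:
            # Stand-in for a "try to cancel again" callback: it does no useful
            # work and immediately schedules itself for the next loop iteration.
            nonlocal calls
            calls += 1
            if not stop:
                loop.call_soon(retry)

        loop.call_soon(retry)
        # The loop never gets to block in the selector while retry() keeps
        # rescheduling itself, so this "idle" sleep is spent spinning.
        await asyncio.sleep(duration)
        stop = True
        return calls

    return asyncio.run(main())


retries = count_retries()
print(f"callback ran {retries} times in 0.05 s")
```

An event loop with nothing to do would sleep in the selector for those 50 ms; here it goes around the loop over and over instead, which is what shows up as 100% CPU.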
Has anyone found a fix for this? We ran into it and it's killing our services. Currently looking at moving to Trio to avoid it |
Can you describe your use case where it's doing this? |
It's in a fairly complex Starlette app so it's hard to point to one thing, but it seems to be happening for us with HTTPX cancellations (among potentially other things) inside of other cancel scopes. We noticed services in our cluster starting to get pinned to 100% CPU usage, and investigated. After a lot of digging, we realized that even after all open requests closed, there were still tasks in the event loop that should have already been cancelled. They weren't things we expected to use much/any CPU. It was, in all the cases we reproduced, HTTP calls. However, that's the main thing that the service was doing in our reproduction, so it's possible other codepaths which get cancelled/timed out would have similar issues. It mainly seems to happen when the event loop is overloaded, so it's possibly some kind of race condition with cancellation (in our specific case) that triggers us getting into this state, but the actual end state looks like it's the same as this (an orphaned cancelling task that consumes all of the CPU). |
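For anyone trying to confirm the same end state, a rough sketch of how leftover tasks can be listed from inside the running loop with `asyncio.all_tasks()`. The helper `pending_task_names` and the task name `leftover-http-call` are hypothetical and only serve the demo; this assumes the asyncio backend and is not how the investigation above was actually done.

```python
import asyncio


def pending_task_names() -> list[str]:
    """Return the names of all unfinished tasks except the calling one."""
    current = asyncio.current_task()
    return sorted(
        t.get_name()
        for t in asyncio.all_tasks()
        if t is not current and not t.done()
    )


async def demo() -> list[str]:
    # Stand-in for a task that should already be gone (hypothetical name).
    worker = asyncio.create_task(asyncio.sleep(60), name="leftover-http-call")
    await asyncio.sleep(0)  # give the worker a chance to start
    names = pending_task_names()
    # Clean up so the demo exits promptly.
    worker.cancel()
    await asyncio.gather(worker, return_exceptions=True)
    return names


leftover = asyncio.run(demo())
print(leftover)
```

Calling something like this after all requests have finished, and seeing tasks that should have been cancelled long ago, would match the state described above.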
I'll try to get a fix for this into the next release, but I have to say it's a pretty tricky one to fix. |
My best attempt at fixing this involved shielding the part of |
Regrettably I wasn't able to devise a fix for this in time for the v4.4.0 release which I had to put out as it had a veritable ton of fixes that needed to get released. I believe that this needs to be fixed in tandem with #698. |
We ended up moving to Trio to avoid this, so this is no longer urgent for us. |
Is there any progress on solving this problem? Is this planned for v4.5.0, or will it be delayed? |
This is the issue holding back the 4.5.0 release. I've made progress locally by wrapping the critical part of |
I've started working on a joint fix for this and #698, as they apparently cannot be solved separately without breaking existing tests. The complexity is mind boggling, however, so please be understanding as I work through this :) |
I can report some progress: all task group tests now pass on Python 3.9. 4 tests still fail on 3.8 due to lack of cancel messages, but I'm trying to work around the problem. |
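For context on "cancel messages": since Python 3.9, `Task.cancel()` accepts a `msg` argument that is attached to the resulting `CancelledError`, and Python 3.8 has no equivalent. A minimal sketch of that mechanism in plain asyncio follows; the message text is made up for illustration and says nothing about what AnyIO actually puts there.

```python
import asyncio


async def worker() -> tuple:
    try:
        await asyncio.sleep(60)
    except asyncio.CancelledError as exc:
        # On Python 3.9+, the message passed to cancel() is available here.
        return exc.args
    return ()


async def main() -> tuple:
    task = asyncio.create_task(worker())
    await asyncio.sleep(0)  # let the worker reach its sleep()
    # Hypothetical message; requires Python 3.9 or newer.
    task.cancel(msg="cancelled by scope 1234")
    return await task


cancel_args = asyncio.run(main())
print(cancel_args)
```

On 3.8, `cancel()` takes no such argument, so the cancelled task has no way of telling from the exception who requested the cancellation.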
I believe I've exhausted every available avenue for making this work on Python 3.8. I could work around the problem if it was just about Python code, but asyncio's C API is rigid and unyielding and resists all my attempts to make this work right. Given the impending EOL date of Python 3.8 in October, I'm going to save myself the trouble and just make Python 3.9 the minimum required version. |
Ok, I finally have a working fix. Now it's just a matter of getting all these PRs reviewed and merged. |
I decided to release v4.5.0 without this fix, as it requires dropping support for Python 3.8. I will release v4.5.1 as soon as it gets merged though. |
Things to check first
I have searched the existing issues and didn't find my bug already reported there
I have checked that my bug is still present in the latest release
AnyIO version
4.3.0
Python version
3.9.2, 3.12.1
What happened?
After Ctrl+C the program uses 100% CPU.
Looks like the problem is in the following (since call_soon works without problems):
anyio/src/anyio/_backends/_asyncio.py, line 231 in e0529a3
https://github.com/python/cpython/blob/72dbea28cd3fce6fc457aaec2107a8e453073297/Lib/asyncio/base_events.py#L871
How can we reproduce the bug?
Ctrl+C