

Fix debugger funceval deadlock #72179

Merged

merged 5 commits on Jul 22, 2022
Conversation

noahfalk
Member

Fixes #60565

Visual Studio devs reported that debugger funcevals were deadlocking because the
PinnedHeapHandleTableCrst was held while threads were suspended. This change
refactors that code path so that the AllocateHandles() operation where the lock is held
gets split into potentially multiple iterations and the GC allocation where the debugger
could suspend is extracted outside the region where the critical section is held. There is
another path into AllocateHandles that takes a different lock and theoretically it could
generate a similar deadlock, but I didn't do all the refactoring necessary to prevent
that path.
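The shape of the fix described above can be sketched as a two-phase pattern: try to satisfy the request from cached slots under the lock, allocate outside the lock, then re-take the lock to record the new storage. A minimal illustrative sketch follows; `HandleTable`, `TryGetPreallocated`, `AddBucket`, and the use of `std::mutex` are all hypothetical stand-ins, not the actual CoreCLR code:

```cpp
// Illustrative sketch only -- not the actual CoreCLR implementation.
#include <mutex>
#include <vector>

class HandleTable {
    std::mutex m_lock;        // stands in for PinnedHeapHandleTableCrst
    std::vector<int*> m_free; // preallocated handle slots
public:
    // Phase 1: under the lock, try to satisfy the request from cached slots.
    bool TryGetPreallocated(int*& out) {
        std::lock_guard<std::mutex> g(m_lock);
        if (m_free.empty()) return false;
        out = m_free.back();
        m_free.pop_back();
        return true;
    }
    // Phase 3: under the lock again, record the newly allocated storage.
    void AddBucket(std::vector<int*>&& slots) {
        std::lock_guard<std::mutex> g(m_lock);
        for (int* s : slots) m_free.push_back(s);
    }
};

int* AllocateHandle(HandleTable& table) {
    int* h = nullptr;
    while (!table.TryGetPreallocated(h)) {
        // Phase 2: allocation happens with NO lock held, so a
        // debugger-induced suspension here cannot deadlock on m_lock.
        std::vector<int*> fresh;
        fresh.push_back(new int(0));
        table.AddBucket(std::move(fresh));
    }
    return h;
}
```

Because the allocation in `AllocateHandle` happens with no lock held, a debugger suspending all threads at that point no longer finds the table's lock held hostage.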

I'm still testing to confirm this works as intended, but please take a look
@davidwr @hoyosjs @mikem8361 @cshung

@ghost

ghost commented Jul 14, 2022

Tagging subscribers to this area: @tommcdon
See info in area-owners.md if you want to be subscribed.


Author: noahfalk
Assignees: -
Labels:

area-Diagnostics-coreclr

Milestone: -

@davidwr

davidwr commented Jul 14, 2022

Sorry, I think you mistyped my username XD

Maybe you meant @davidwrighton

@jkotas
Member

jkotas commented Jul 14, 2022

The tests are hitting intermittent crashes and asserts, e.g.:

Assert failure(PID 32290 [0x00007e22], Thread: 32296 [0x7e28]): result == NULL
    File: /__w/1/s/src/coreclr/vm/appdomain.cpp Line: 972

I think that the logic in this PR is close to impossible to follow. Can this be refactored so that it is easier to follow? For example, can we have a method that looks like this?

{
    take the lock
    use preallocated handles if we have enough
    release the lock
}

allocate enough handles

{
     take the lock
     add the new handles for bookkeeping
     release the lock
}

and still resolve the handle from the cache on the 2nd attempt.
Refactored using a NewHolder and initializing the outsideLockWork
at the top of the function to reduce the logic complexity and make
it more robust to overlooked return points.
@noahfalk
Member Author

@jkotas - yeah, I saw the test failure log earlier and recognized the case I missed after seeing my assert failure. I also refactored the logic to hopefully make it simpler for the reader. Let me know if you think I was successful or if you have other suggestions.

For example, can we have a method that looks like this? ...

Unfortunately two passes isn't guaranteed to be enough; we need a retry loop. It is possible that thread A does the first step, then thread B does both steps, and then when thread A returns to complete step 2, it discovers the internal state has changed and the preallocated bucket is no longer the correct size.

Another alternative if folks prefer is that I can probably write a lock-free version. The downside is that it adds more complexity in terms of potential thread interleavings and memory ordering issues.
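The interleaving described above is why a loop is needed: the size decided under the lock can be stale by the time the allocation is committed. A minimal sketch of such a retry loop, with hypothetical names standing in for the real table state (not the actual CoreCLR code):

```cpp
// Hypothetical sketch of the retry concern -- names are not the real
// CoreCLR identifiers.
#include <mutex>
#include <cstddef>

struct Table {
    std::mutex lock;
    size_t nextBucketSize = 1; // internal state another thread may advance
    size_t freeSlots = 0;
};

void AllocateHandles(Table& t, size_t want) {
    for (;;) {
        size_t need = 0;
        {
            std::lock_guard<std::mutex> g(t.lock);
            if (t.freeSlots >= want) {   // fast path: enough cached slots
                t.freeSlots -= want;
                return;
            }
            need = t.nextBucketSize;     // decide how much to allocate
        }
        // Allocation happens here with the lock released; meanwhile another
        // thread may run both phases and change nextBucketSize under us.
        size_t allocated = need;         // stands in for the GC allocation
        {
            std::lock_guard<std::mutex> g(t.lock);
            if (allocated == t.nextBucketSize) { // state unchanged: commit
                t.freeSlots += allocated;
                t.nextBucketSize *= 2;
            }
            // else: state moved on under us -- loop and try again rather
            // than threading a mis-sized bucket into the list.
        }
    }
}
```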

@jkotas
Member

jkotas commented Jul 14, 2022

the preallocated bucket is no longer the correct size.

The array that was allocated outside the lock should still be sufficient to fulfill the request on the current thread. I do not think we need any retry loops.

@noahfalk
Member Author

The array that was allocated outside the lock should still be sufficient to fulfill the request on the current thread. I do not think we need any retry loops.

I can certainly thread them onto the linked list, but what are you thinking should happen afterwards? As a concrete example, two threads could concurrently each ask for 1 handle at a point where the NextBucketSize is 4000, and we'd thread both of those large arrays into the list. Currently there is a special case for allocations of size 1 where it searches beyond the head node, but for any other request size, 3999 of the newly added array slots would be unusable. Were you thinking that we'd couple this with changes to the searching algorithm that try harder to reuse previously allocated nodes, or would we just count on re-entrant operations being rare and accept the loss of efficiency if it happened?

@noahfalk
Member Author

Is the core of your concern the performance/correctness/efficiency of the algorithm with the loop, or just that the logic is hard for a reader to understand?

@jkotas
Member

jkotas commented Jul 15, 2022

Yes, this change has a risk of pathological performance issues. This risk exists for both designs (abandoning the redundant allocation, or adding the redundant allocation to the list). The problem is that these allocations go to the pinned heap. The pinned heap is used rarely, and it cannot be compacted. That means that once you create a fragment on this heap, there is a good chance it won't be reused for anything. We ran into similar problems when we explored use of the pinned heap for string literals.

The core of my concern is that this is turning something simple (allocating a piece of managed memory under a lock) into something pretty complex, hard to reason about, and a potential source of pathological performance issues. Allocating managed memory under a lock is done all the time in C#, and I am sure there are a number of places where it is done in the VM. The solution implemented here won't scale as a fix.

@noahfalk noahfalk marked this pull request as draft July 19, 2022 08:40
@noahfalk
Member Author

@jkotas - I haven't tested this updated version beyond confirming that it builds, but feel free to take a look to see if it aligns with what we discussed.

I'll ping again and convert back to a real PR once I've had a chance to better validate that it works as intended.

@jkotas
Member

jkotas commented Jul 19, 2022

Yes, this looks good. Thanks!

@noahfalk
Member Author

Failure is known issue: #70969

@noahfalk noahfalk marked this pull request as ready for review July 21, 2022 23:21
@noahfalk
Member Author

OK, this is looking good again. In the old code the PinnedHeapHandleTable was synchronized by one of two different locks: either the AppDomainHandleTable lock or the GlobalStringLiteralMap lock. In the new code the AppDomainHandleTable lock is renamed to the PinnedHeapHandleTable lock, and this lock always synchronizes the PinnedHeapHandleTable code. In the string literal code path the GlobalStringLiteralMap lock is taken first and the PinnedHeapHandleTable lock second, but the PinnedHeapHandleTable no longer relies on that outer GlobalStringLiteralMap lock for correctness.

In terms of testing, I can verify under a debugger that I can suspend at the GC allocation point and the PinnedHeapHandleTable lock isn't held. This doesn't, of course, prevent other locks from being held, so at best it is a partial fix for the issue. Nobody had a known repro, so I wasn't able to verify anything more specific. I also confirmed that the race cases on the InterlockedExchange paths worked as intended by forcing them with a native debugger.
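For illustration, the kind of InterlockedExchange-style race mentioned here can be forced by freezing one thread between the out-of-lock allocation and the publish. A hedged sketch using C++ `std::atomic` compare-exchange (hypothetical names; not the CoreCLR code, which uses the Win32-style Interlocked primitives):

```cpp
// Illustrative only: a compare-exchange publish whose race window sits
// between the allocation and the exchange. Not the actual CoreCLR code.
#include <atomic>

struct Bucket { int data = 0; };

std::atomic<Bucket*> g_bucket{nullptr};

Bucket* GetOrCreateBucket() {
    Bucket* cur = g_bucket.load(std::memory_order_acquire);
    if (cur != nullptr)
        return cur;
    Bucket* fresh = new Bucket(); // allocated outside any lock
    Bucket* expected = nullptr;
    if (g_bucket.compare_exchange_strong(expected, fresh,
                                         std::memory_order_acq_rel)) {
        return fresh;             // this thread won the race
    }
    delete fresh;                 // another thread won; discard ours
    return expected;              // use the winner's bucket
}
```

Freezing a thread (in a native debugger) right after `new Bucket()` lets another thread win the exchange, exercising the loser path deterministically.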

@jkotas
Member

jkotas commented Jul 21, 2022

In the string literal code path the GlobalStringLiteralMap is taken first

I suspect that this lock is also exposed to the debugger deadlock.

@noahfalk noahfalk requested a review from janvorli July 22, 2022 02:32
@noahfalk
Member Author

I suspect that this lock is also exposed to the debugger deadlock.

Yeah I expect so as well.

@janvorli
Member
LGTM, thank you!

@noahfalk
Member Author

Merging, failure is known issue: #70969

@noahfalk noahfalk merged commit 463efbb into dotnet:main Jul 22, 2022
@ghost ghost locked as resolved and limited conversation to collaborators Aug 22, 2022
Successfully merging this pull request may close these issues.

m_UnresolvedClassLock is held in an unsafe way for Func Evals
4 participants