Add ee_alloc_context #104849

Open · wants to merge 7 commits into main
Conversation

@noahfalk (Member) commented Jul 13, 2024

This change is some preparatory refactoring for the randomized allocation sampling feature. We need to add more state to the allocation context, but we don't want to make a breaking change to the GC interface. The new state only needs to be visible to the EE, but we want it physically near the existing alloc context state for good cache locality. To accomplish this, we created a new ee_alloc_context struct which contains an instance of gc_alloc_context within it.

A new field, ee_alloc_context::combined_limit, should be used by fast allocation helpers to determine when to go down the slow path. Most of the time combined_limit has the same value as alloc_limit, but periodically we need to emit an allocation sampling event on an object that is somewhere in the middle of an allocation context (AC). Using combined_limit rather than alloc_limit as the slow-path trigger allows us to keep all the sampling event logic in the slow path.
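As a rough illustration of the layout being described (a sketch only; member names other than combined_limit are assumptions, not necessarily what this PR uses):

```cpp
// Sketch of the shape implied by the description above, not the literal PR code.
// gc_alloc_context itself is unchanged, so the GC interface is not broken; the new
// EE-only state simply sits adjacent to it for cache locality.
struct ee_alloc_context
{
    // EE-only limit that the fast allocation helpers compare against. Usually equal
    // to gc_context.alloc_limit, but pulled in closer when the next sampled object
    // should land partway through this allocation context.
    uint8_t*         combined_limit;

    // The existing GC-owned allocation context (illustrative member name).
    gc_alloc_context gc_context;
};

// Fast-path pseudocode: only the comparison target changes, so all sampling event
// logic can stay on the slow path.
//
//   uint8_t* result = ctx->gc_context.alloc_ptr;
//   if (result + size <= ctx->combined_limit)        // was: gc_context.alloc_limit
//   {
//       ctx->gc_context.alloc_ptr = result + size;   // bump-allocate as before
//       return (Object*)result;
//   }
//   return AllocSlow(ctx, size);                     // sampling event emitted here
```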

There is another PR for NativeAOT making the same change: #104851

Contributor

Tagging subscribers to this area: @mangod9
See info in area-owners.md if you want to be subscribed.

@noahfalk (Member, Author) commented:

This is ready for review now.
All failures are known pre-existing failures according to build analysis. I believe I've applied your feedback as well @jkotas

@noahfalk marked this pull request as draft July 15, 2024 05:51
@noahfalk (Member, Author) commented:

I converted back to draft because testing on NativeAOT revealed a race condition that I believe affects the CoreCLR work as well. I am investigating.

@noahfalk (Member, Author) commented:

All issues I know of are resolved and build analysis is green, so this is ready for review once more. Thanks!

@jkotas (Member) commented Jul 16, 2024

Is it that hard to implement it in a way that it does not introduce cache contention from thread enumerations running in parallel? I think it is preferable to go with designs that do not have performance issues by construction.

@noahfalk
Copy link
Member Author

> Is it that hard to implement it in a way that it does not introduce cache contention from thread enumerations running in parallel? I think it is preferable to go with designs that do not have performance issues by construction.

I can implement it using the RestartEE approach I had earlier if you prefer that? @Maoni0 did not believe it was worth prioritizing time to change how the GC currently does that enumeration.

@noahfalk (Member, Author) commented:

@jkotas Following up on perf, I attempted to stress the threading by running gcperfsim with 400 threads and the server GC config, allocating 500 GB of objects. I used two different configs: one with a 50 MB heap size and small objects, and the other with a 2 GB heap size allowing a 20% LOH allocation mix to introduce more BGCs. My machine has 20 hardware threads. Median execution time in scenario 1 was 34.5 sec for both the PR and the baseline. Scenario 2 was 38.4 sec for the baseline and 38.5 sec for the PR. Noise on the machine causes runs to vary by up to 1 second, so the 0.1 sec difference isn't statistically significant. It's possible that with a larger sample size and a low-noise environment we could discern some cost from the noise, but it certainly doesn't jump out. I also took CPU sampling traces, which showed GCEnumAllocContexts consistently consumed 0.3% inclusive time before and after the change.

Also fyi I lost internet service at my house. I'm trying to get it restored but for now I only have GH access from my cell phone.

@noahfalk (Member, Author) commented:

@jkotas - anything remaining before you can sign off? I believe everything has been resolved aside from still waiting to hear from @elinor-fung or @lambdageek.

@noahfalk (Member, Author) commented:

I did some manual testing of the debugging scenario. In the normal DAC scenario I found and fixed one bug in the new commit. The cDAC appeared to be working correctly. My test was running windbg + SOS with !threads and !DumpHeap to exercise reading gc_alloc_context. I confirmed, using a debugger on the debugger, that the cDAC case was executing down the cDAC branch of the DAC code as intended.

@elinor-fung (Member) commented:

cDAC changes look good to me.

@jkotas (Member) commented Jul 29, 2024

> perf

To assess perf impact for changes like this, we typically try to craft a microbenchmark that shows the worst regression or best improvement (one example: #96150 (comment)). Unless you introduce a very bad bug, the impact will be lost in the noise from all the other things going on in a general-purpose allocation throughput microbenchmark.

I am not worried about the perf impact too much with the latest version of the change, so I do not think it is necessary to do that.

@jkotas (Member) commented Jul 29, 2024

/azp run runtime-coreclr gcstress-extra


Azure Pipelines successfully started running 1 pipeline(s).

@jkotas (Member) commented Jul 29, 2024

GC stress crash at:

      Assert failure(PID 2208 [0x000008a0], Thread: 9372 [0x249c]): Consistency check failed: AV in clr at this callstack:
      ------
      CORECLR! Thread::CooperativeCleanup + 0xFF (0x6fff547e)
      CORECLR! Thread::DetachThread + 0x124 (0x6fff69ec)
      CORECLR! TlsDestructionMonitor::~TlsDestructionMonitor + 0x8C (0x7043d0d8)
      CORECLR! _dyn_tls_dtor + 0x8A (0x7045116a)

It looks like a regression introduced by this change (Thread::CooperativeCleanup is modified by this change).

noahfalk and others added 7 commits July 30, 2024 01:08
This change is some preparatory refactoring for the randomized allocation sampling feature. We need to add more state onto allocation context but we don't want to do a breaking change of the GC interface. The new state only needs to be visible to the EE but we want it physically near the existing alloc context state for good cache locality. To accomplish this we created a new ee_alloc_context struct which contains an instance of gc_alloc_context within it.

The new ee_alloc_context.combined_limit field should be used by fast allocation helpers to determine when to go down the slow path. Most of the time combined_limit has the same value as alloc_limit, but periodically we need to emit an allocation sampling event on an object that is somewhere in the middle of an AC. Using combined_limit rather than alloc_limit as the slow path trigger allows us to keep all the sampling event logic in the slow path.
combined_limit is now synchronized in GcEnumAllocContexts instead of RestartEE. This requires the GC to be constrained in how it updates alloc_ptr and alloc_limit. No GC behavior changed in practice, but the constraints are now part of the EE<->GC contract so that we can rely on them in the EE code.
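A minimal sketch of what that synchronization point could look like, reusing the illustrative ee_alloc_context layout sketched in the description above (the helper name and the sampling-budget parameter are assumptions, not the PR's actual API):

```cpp
// Sketch only: recompute the EE-only limit for one allocation context from the
// GcEnumAllocContexts callback, after the GC has updated alloc_ptr/alloc_limit under
// the constraints described above.
void UpdateCombinedLimit(ee_alloc_context* ctx, size_t bytesUntilNextSample)
{
    uint8_t* samplingLimit = ctx->gc_context.alloc_ptr + bytesUntilNextSample;

    // combined_limit never exceeds what the GC handed out; it is only pulled in so
    // that the allocation that crosses the sampling point takes the slow path.
    ctx->combined_limit = (samplingLimit < ctx->gc_context.alloc_limit)
                              ? samplingLimit
                              : ctx->gc_context.alloc_limit;
}
```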
Co-authored-by: Jan Kotas <jkotas@microsoft.com>
Co-authored-by: Jan Kotas <jkotas@microsoft.com>
The code of GetAllocContext() was constructing a PTR_gc_alloc_context, which does a host->target pointer conversion. Those conversions work by doing a lookup in a dictionary of blocks of memory that we have previously marshalled, and the pointer being converted is expected to be the start of a memory block. In this case we had never previously marshalled the gc_alloc_context on its own; we had only marshalled the m_pRuntimeThreadLocals block, which includes the gc_alloc_context inside of it at a non-zero offset. This caused the host->target pointer conversion to fail, which in turn meant commands like !threads in SOS would fail.

The fix is pretty trivial. We don't need to do a host->target conversion here at all because the calling code in the DAC is going to immediately convert right back to a host pointer. We can avoid the conversion in both directions by eliminating the cast and returning the host pointer directly.
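A hedged before/after sketch of the shape of that fix (member names and the exact conversion call are illustrative, not the literal diff):

```cpp
// Before (illustrative): building a PTR_gc_alloc_context requires a host->target
// address lookup, which fails because only the enclosing m_pRuntimeThreadLocals block
// was ever marshalled as a DAC instance, not the embedded alloc context by itself.
//
//   return PTR_gc_alloc_context(PTR_HOST_TO_TADDR(/* embedded gc_alloc_context */));
//
// After (illustrative): the calling DAC code immediately converts back to a host
// pointer anyway, so return the host pointer to the embedded member directly and
// skip both conversions.
//
//   return &m_pRuntimeThreadLocals->alloc_context;   // member name is illustrative
```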
Somehow a test reached Thread::CooperativeCleanup() with m_pRuntimeThreadLocals==NULL. Looking at the code, I expected that would mean GetThread()==NULL or that ThreadState contains TS_Dead, but neither of those conditions was true, so it is unclear how it executed to that state. The callstack was:
>	coreclr.dll!Thread::CooperativeCleanup() Line 2762	C++
 	coreclr.dll!Thread::DetachThread(int fDLLThreadDetach) Line 936	C++
 	coreclr.dll!TlsDestructionMonitor::~TlsDestructionMonitor() Line 1745	C++
 	coreclr.dll!__dyn_tls_dtor(void * __formal, const unsigned long dwReason, void * __formal) Line 122	C++
 	ntdll.dll!_LdrxCallInitRoutine@16()	Unknown
 	ntdll.dll!LdrpCallInitRoutine()	Unknown
 	ntdll.dll!LdrpCallTlsInitializers()	Unknown
 	ntdll.dll!LdrShutdownThread()	Unknown
 	ntdll.dll!RtlExitUserThread()	Unknown

Regardless, the previous code wouldn't have hit that issue because it obtained the pointer through t_runtime_thread_locals rather than m_pRuntimeThreadLocals. I restored using t_runtime_thread_locals in the Cleanup routine. Out of caution I also searched for any other places that were previously accessing the alloc_context through the thread local address and ensured they don't switch to use m_pRuntimeThreadLocals either.
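A hedged sketch of the shape of that fix, using the illustrative member names from the earlier sketches (not the literal diff):

```cpp
// In Thread::CooperativeCleanup, read the allocation context through the thread-local
// again (the pre-change behavior), since m_pRuntimeThreadLocals was observed to be
// NULL at this point during TLS destruction:
//
//   // before (this PR, crashed under GC stress):
//   // gc_alloc_context* ac = &m_pRuntimeThreadLocals->alloc_context.gc_context;
//
//   // after (restored):
//   gc_alloc_context* ac = &t_runtime_thread_locals.alloc_context.gc_context;
```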
@noahfalk (Member, Author) commented:

I've updated for the feedback so far, plus the test failure that showed up in the GC stress run. Looking at the time, I think I may need to call it quits on this for .NET 9 and instead treat it as .NET 10 work on a slower cadence. At this point I don't think I have enough time available to do a staged checkin, revalidate testing after the ongoing changes, and still keep time in reserve to respond to potential post-checkin issues.
