
Randomized allocation sampling #104955

Open — wants to merge 12 commits into base: main
Conversation

noahfalk (Member)
This PR is intended to supersede #100356. Currently it includes 5 commits, the first 4 of which are the same changes found in #104849 and #104851. Once those PRs are merged, this PR will be rebased to remove that portion of the changes. The interesting commit is the last one, which adds the final code necessary to enable the feature on both CoreCLR and NativeAOT. The PR is currently in draft mode because I have yet to bring over the functional tests and validate that functionality and perf are operating as expected.

This feature allows profilers to do allocation profiling based on randomized samples. It has better theoretical and empirically observed accuracy than our current allocation profiling approaches while also maintaining low performance overhead. It is designed for use in production profiling scenarios. For more information about usage and implementation, see the included doc, docs/design/features/RandomizedAllocationSampling.md.

noahfalk and others added 3 commits July 14, 2024 23:17
This change is some preparatory refactoring for the randomized allocation sampling feature. We need to add more state onto allocation context but we don't want to do a breaking change of the GC interface. The new state only needs to be visible to the EE but we want it physically near the existing alloc context state for good cache locality. To accomplish this we created a new ee_alloc_context struct which contains an instance of gc_alloc_context within it.

The new ee_alloc_context.combined_limit field should be used by fast allocation helpers to determine when to go down the slow path. Most of the time combined_limit has the same value as alloc_limit, but periodically we need to emit an allocation sampling event on an object that is somewhere in the middle of an AC. Using combined_limit rather than alloc_limit as the slow path trigger allows us to keep all the sampling event logic in the slow path.
combined_limit is now synchronized in GcEnumAllocContexts instead of RestartEE.
This requires that the GC be constrained in how it updates alloc_ptr and alloc_limit. No GC behavior changed in practice, but the constraints are now part of the EE<->GC contract so that we can rely on them in the EE code.
Co-authored-by: Jan Kotas <jkotas@microsoft.com>
@dotnet-issue-labeler bot added the needs-area-label label Jul 16, 2024
@noahfalk noahfalk self-assigned this Jul 16, 2024
@noahfalk noahfalk added the area-VM-coreclr, community-contribution, and area-NativeAOT-coreclr labels and removed the needs-area-label label Jul 16, 2024
noahfalk and others added 6 commits July 18, 2024 01:01
Co-authored-by: Jan Kotas <jkotas@microsoft.com>
The code of GetAllocContext() was constructing a PTR_gc_alloc_context, which does a host->target pointer conversion. Those conversions work by doing a lookup in a dictionary of blocks of memory that we have previously marshalled, and the pointer being converted is expected to be the start of a memory block. In this case we had never previously marshalled the gc_alloc_context on its own; we had only marshalled the m_pRuntimeThreadLocals block, which includes the gc_alloc_context inside of it at a non-zero offset. This caused the host->target pointer conversion to fail, which in turn meant commands like !threads in SOS would fail.

The fix is pretty trivial. We don't need to do a host->target conversion here at all because the calling code in the DAC is going to immediately convert right back to a host pointer. We can avoid the conversion in both directions by eliminating the cast and returning the host pointer directly.
This change is some preparatory refactoring for the randomized allocation sampling feature. We need to add more state onto allocation context but we don't want to do a breaking change of the GC interface. The new state only needs to be visible to the EE but we want it physically near the existing alloc context state for good cache locality. To accomplish this we created a new ee_alloc_context struct which contains an instance of gc_alloc_context within it.

The new ee_alloc_context::combined_limit should be used by fast allocation helpers to determine when to go down the slow path. Most of the time combined_limit has the same value as alloc_limit, but periodically we need to emit an allocation sampling event on an object that is somewhere in the middle of an AC. Using combined_limit rather than alloc_limit as the slow path trigger allows us to keep all the sampling event logic in the slow path.
- removed unnecessary UpdateCombinedLimit() in thread detach
- updated comment for workaround on 96081
- swapped to updating combined_limit inside GcEnumAllocContexts() instead of in RestartEE()
Co-authored-by: Jan Kotas <jkotas@microsoft.com>
This feature allows profilers to do allocation profiling based on randomized samples. It has better theoretical and empirically observed accuracy than our current allocation profiling approaches while also maintaining low performance overhead. It is designed for use in production profiling scenarios. For more information about usage and implementation, see the included doc, docs/design/features/RandomizedAllocationSampling.md.
@noahfalk noahfalk marked this pull request as ready for review July 20, 2024 00:20
noahfalk (Member Author) commented Jul 20, 2024

Functional testing found and fixed an off-by-one error in the RNG code but otherwise things looked fine. I also resynced this PR on top of the latest changes in #104849 and #104851. The last commit, now number 10, remains the interesting one.

I also did some performance testing using GCPerfSim as an allocation benchmark. My default configuration was 4 threads, workstation mode, and 500GB of allocations, entirely small objects with no survival. It is intended to put maximum stress on the allocation code paths. GCPerfSim command line: `-tc 4 -tagb 500 -tlgb 0.05 -lohar 0 -lohsr 100000-2000000 -sohsi 0 -lohsi 0 -pohsi 0 -sohpi 0 -lohpi 0 -sohfi 0 -lohfi 0 -pohfi 0 -allocType reference -testKind time`. I also ran a few variations that added modest amounts of survival and LOH allocation that are a bit more realistic, though still extremely allocation heavy.

EDIT: Don't rely on these numbers, they are misleading. See #104955 (comment)

Benchmarks - No tracing enabled

| Scenario | Baseline time (s) | PR time (s) |
| --- | --- | --- |
| Default | 20.8 | 21.4 |
| Default + sohsi 100 | 40.9 | 41.8 |
| Default + sohsi 100 -lohar 10 | 42.8 | 43.9 |
| Default + sohsi 100 -lohar 10 -lohsi 200 | 43.6 | 44.6 |

Benchmarks - Tracing AllocSampling+GC keywords, verbose level

| Scenario | Baseline time (s) | PR time (s) |
| --- | --- | --- |
| Default | 22.0 | 23.1 |
| Default + sohsi 100 | 41.6 | 43.0 |
| Default + sohsi 100 -lohar 10 | 44.7 | 44.5 |
| Default + sohsi 100 -lohar 10 -lohsi 200 | 44.2 | 45.0 |

Overall it looks like around 0.9 additional seconds for the PR to do 500GB of allocations. On a tight microbenchmark that is noticeable, and as other GC or non-allocation costs increase it becomes relatively less noticeable. I'm investigating to see if it can be improved at all.

noahfalk (Member Author)
Continued perf investigation and testing has cast my previous results into doubt. After lots of searching for what could have caused the regression, my best explanation is that it actually had nothing to do with the source changes in this PR, and is instead either non-determinism in the build process or some user error on my part. I reached that conclusion by doing the following:

I've had a folder on my machine C:\git\runtime3 that throughout the entire process has been synced here:

commit 42b2b19e883f06af5771b5d85b26af263c62e781 (HEAD)
Author: Matous Kozak <55735845+matouskozak@users.noreply.github.com>
Date:   Fri Jul 12 09:42:55 2024 +0200

This folder has no changes from any of my PRs in it, and I've been using the build there for all the baseline measurements. Then I executed the following steps:

1. Move `artifacts` -> `artifacts_backup`
2. `build.cmd clr+libs -c release`
3. `src\tests\build.cmd generatelayoutonly Release`
4. Copy `artifacts_backup` -> `artifacts_backup_2`

I can consistently reproduce the same magnitude perf regression using the coreclr built in the artifacts directory, but the regression doesn't appear using the build in the backup or backup_2 directory. I've done many runs on each binary switching between them in a semi-randomized ordering trying to ensure that the results for each binary are repeatable and robust relative to background noise on the machine.

Beyond that, I've also got many other builds that include different subsets of the change, but there is no clear relationship between the source and the perf results. During one period I progressively added functionality starting from the baseline without the regression occurring; during another period I progressively removed functionality from the final PR and the regression always occurred. Even deleting the entirety of the source changes in that folder and syncing it back to the baseline didn't eliminate the perf overhead. Every build was done in a new folder, starting without an artifacts folder, to remove the opportunity for incremental build problems to play a role.

The only explanations that make sense to me are either (a) non-deterministic builds are giving bi-modal perf results for the same input source code, or (b) I am repeatedly making some other error in my testing methodology.

I'm going to see if I can get another machine to repeat some of the original experiments but at the moment I no longer have any evidence the PR is causing a regression.

These tests were never intended to be built or run automatically, but recursive globbing patterns are causing them to get included. I considered locating each such globbing pattern and adding an exclusion, or changing the tests so that they would build successfully in the automated build, but those options seemed like more work now and potentially more work to maintain in the future. Given that these manual tests will probably have very little ongoing usage, I went with the cheap and simple option of adding an underscore to the csproj files.
noahfalk (Member Author)
@jkotas @MichalStrehovsky - Functional and perf testing both look good now, all outstanding comments on the PRs have been addressed, and CI is green. From my perspective this is ready to be merged, unless any further review is planned?

I could check in #104849 and #104851, then this PR, in sequence, but I'm not sure that gives any advantage over just checking in this PR alone and closing #104849 and #104851 as no longer needed.

jkotas (Member) commented Jul 29, 2024

@@ -0,0 +1,112 @@
# Manual Testing for Randomized Allocation Sampling
Member
What prevents these tests from being automated?

Member Author

The test as currently designed performs large numbers of allocations with different types of objects and then prints statistical distributions to the console. Nothing fundamental prevents it from being converted to an automated test, but the main factors that discourage it are:

- it's a slow test even on a fast machine, because it does large amounts of allocation to gather many statistical distributions
- it would require effort to convert it and to ensure we've reasonably calibrated the sensitivity for the randomized results the test generates.

Historically our automated testing on tracing events validates that the events are generated and checks some fields for reasonable data, but doesn't go into great detail. That level of testing has had a good cost-benefit tradeoff in the past, so we repeated it here.

Member

@jkotas jkotas left a comment

> this PR in sequence but I'm not sure that gives any advantage over just checking in this PR alone

I think it still makes sense. I am not familiar enough with tracing to tell whether it is hooked up correctly. It would be best for somebody familiar with the tracing to review that part. It will be much easier if the delta does not show the other changes.

Also, this is a non-trivial feature (a few thousand lines) with a risk of introducing regressions (as demonstrated by the GC stress crash). Given that we are feature complete for .NET 9, should this get an approval from Jeff or tactics before merging for .NET 9?

(flags & GC_ALLOC_LARGE_OBJECT_HEAP) ? 1 :
0; // SOH
unsigned int heapIndex = 0;
#ifdef BACKGROUND_GC
Member

BACKGROUND_GC is a GC-specific ifdef. It has no meaning in the VM.

(flags & GC_ALLOC_LARGE_OBJECT_HEAP) ? 1 :
0; // SOH
unsigned int heapIndex = 0;
#ifdef BACKGROUND_GC
Member
Same

noahfalk (Member Author)

> Given that we are feature complete for .NET 9, should this get an approval from Jeff or tactics before merging for .NET 9?

Yes, I was assuming that would be part of the process.

Contributor

Tagging subscribers to this area: @tommcdon
See info in area-owners.md if you want to be subscribed.
