
Enable FEATURE_PERFMAP on OSX, and update perfjitdump.cpp to work on OSX #99986

Merged (15 commits) on Apr 3, 2024

Conversation

@vvuk (Contributor) commented Mar 19, 2024

The jitdump/perfmap files are a generic way of providing JIT method information to profilers and other tools. On macOS, mstange/samply natively understands jitdump files, but FEATURE_PERFMAP is limited to Linux in coreclr (because Linux perf is where all this originally came from). This PR just enables this feature to work on macOS as well. Note that samply needs to be built and installed from source; the cargo-installed release is old and has issues.

With this patch, running a dotnet program with DOTNET_PerfMapEnabled=1 set in the environment (or any other way of setting PerfMapEnabled) allows something like samply record dotnet exec Foo.dll ... to provide mixed native/C# profile stack information on macOS, similar to what you'd get with some profilers on Windows.

The dotnet-policy-service bot added the community-contribution label (indicates that the PR has been added by a community member) on Mar 19, 2024
@vvuk (Contributor, Author) commented Mar 19, 2024

Obligatory profiler screenshot on macOS running csc (using a debug coreclr build):

[screenshot]

@vvuk (Contributor, Author) commented Mar 19, 2024

@dotnet-policy-service agree company="Unity Technologies"

@brianrob (Member)

@vvuk, this is certainly interesting. One thing that I'd like to understand a bit more: you mention that the profiler knows how to process a jitdump file. Can you confirm that the profiler can properly resolve managed code symbols for ready to run code? I looked at the screenshot, but everything I can see shows that the methods are jitted.

@vvuk (Contributor, Author) commented Mar 21, 2024

> Can you confirm that the profiler can properly resolve managed code symbols for ready to run code? I looked at the screenshot, but everything I can see shows that the methods are jitted.

Hmm. No, I don't expect that it would. Reading a bit about ready to run + profiling symbols info, it looks like ready to run will generate an adjacent .pdb on Windows and an adjacent .r2rmap on Linux, with nothing mentioned for macOS, though from a quick check an .r2rmap is generated on macOS as well. But samply doesn't know what to do with those files.

Given that, until tiered compilation kicks in to specialize some methods (and generate a jitdump record), there will be no symbols for functions that are loaded as AOT. I don't expect this to be any different from any other generic CPU profiler on macOS (maybe a .NET-focused profiler will understand them?) -- e.g. I'd expect Instruments to behave the same, along with not having any info about JIT-compiled methods at all.

For samply in particular, it would not be difficult to add support for parsing r2rmap files. There would just need to be some mechanism to communicate the existence of that file, plus I assume a load offset. I can think of a few ways to do this, but it should probably be a separate PR (probably a small change to coreclr to flag an r2rmap file by opening it, and a separate PR to samply to handle that open and parse the file).

@noahfalk (Member) left a comment

I think we should also edit the text for the PerfMapEnabled config switch to indicate OSX is also supported:

RETAIL_CONFIG_DWORD_INFO(EXTERNAL_PerfMapEnabled, W("PerfMapEnabled"), 0, "This flag is used on Linux to enable writing /tmp/perf-$pid.map. It is disabled by default")

Is the default path where the perfmap is generated the same?
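
For illustration, the updated switch description could read along these lines (a sketch only, not necessarily the exact wording that landed in the PR):

```cpp
// Hypothetical rewording of the switch description -- the merged text may differ.
RETAIL_CONFIG_DWORD_INFO(EXTERNAL_PerfMapEnabled, W("PerfMapEnabled"), 0,
    "This flag is used on Linux and macOS to enable writing /tmp/perf-$pid.map. It is disabled by default")
```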

@vvuk (Contributor, Author) commented Mar 21, 2024

> I think we should also edit the text for the PerfMapEnabled config switch to indicate OSX is also supported:

Good catch, done. I looked for other obvious spots to update docs, but didn't see any; the Linux performance tracing doc is tracing-specific rather than about profiling.

> Is the default path where the perfmap is generated the same?

Yep, /tmp.

@brianrob (Member)

> Hmm. No, I don't expect that it would. Reading a bit about ready to run + profiling symbols info […] But samply doesn't know what to do with those files. […]

The thing that I want to confirm is that samply is using the jitdump file and not the perfmap file. The jitdump file does contain symbols for R2R methods - this is in addition to the r2rmap functionality, which is used for scenarios that don't support jitdump. Can you confirm that it's the jitdump file that's being used? If so, can you try running with tiered compilation disabled and see if you get samples for R2R code (e.g. System.Private.CoreLib)?

@vvuk (Contributor, Author) commented Mar 25, 2024

> The thing that I want to confirm is that samply is using the jitdump file and not the perfmap file.

Ah! Good catch, this was in fact what was happening. The issue was due to timestamps: samply uses mach_absolute_time() (i.e. clock_gettime(CLOCK_UPTIME_RAW)), while the coreclr jitdump code was using CLOCK_MONOTONIC. Those timebases aren't compatible, so all the jit records were being ignored because they fell outside of what samply saw as the thread's execution range.

I changed this code to use QueryPerformanceCounter(), with a check in Start() that QueryPerformanceFrequency() reports nanosecond resolution (1,000,000,000 ticks per second). This is the case on both Linux and macOS, and the assert should catch platforms where it isn't. Alternatively, there's an interim commit where I use mach_absolute_time() directly and leave the existing CLOCK_MONOTONIC-based code in the jitdump path as-is, if that's preferable.
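
For context, a minimal sketch of the clock concern described above (illustrative only and not the PR's code; the helper name is made up, and the PR itself routes this through the PAL's QueryPerformanceCounter()/QueryPerformanceFrequency() with an assert that the frequency is 1,000,000,000):

```cpp
// Illustrative sketch, not the actual perfjitdump.cpp change.
// On macOS, samply's clock is mach_absolute_time(), which matches
// CLOCK_UPTIME_RAW; CLOCK_MONOTONIC (the pre-existing Linux path) is a
// different timebase, and its values can't be compared against it.
#include <stdint.h>
#include <time.h>

static uint64_t GetTimeStampNS()
{
#if defined(__APPLE__)
    // Same timebase as mach_absolute_time(), already scaled to nanoseconds.
    return clock_gettime_nsec_np(CLOCK_UPTIME_RAW);
#else
    // Keep the original behavior elsewhere: CLOCK_MONOTONIC in nanoseconds.
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (uint64_t)ts.tv_sec * 1000000000ull + (uint64_t)ts.tv_nsec;
#endif
}
```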

With this change, I get JIT functions with DOTNET_PerfMapEnabled=2 (write only jitdump files), including with COMPlus_TieredCompilation=0. But I don't get jit info if I do write perfmap files (PerfMapEnabled=1), and even with only jitdump files I'm still missing some symbols -- I'm going to track down where those are coming from and figure out the perfmap issue:
[screenshot]

@vvuk (Contributor, Author) commented Mar 25, 2024

> With this change, I get JIT functions with DOTNET_PerfMapEnabled=2 (write only jitdump files), including with COMPlus_TieredCompilation=0. But I don't get jit info if I do write perfmap files (PerfMapEnabled=1)

Actually, I'm wrong here. Not sure what I was looking at.

With PerfMapEnabled=2 (jitdump files only), I see the following annotations in the profile: PreJIT, QuickJitted, and Optimized. I believe that means that I'm seeing both precompiled symbols (PreJIT) and actual JIT generated function symbols (QuickJitted/Optimized). But I also see some unresolved addresses.

With PerfMapEnabled=1 (jitdump and perfmap files), I see PreJIT, QuickJitted, and Optimized, and I don't see any unresolved addresses. Everything looks correct. I see some being resolved from perfmap, and some from jitdump, even for an identical symbol and optimization tier. Is this expected? (See below)

I'm not sure if the samples are actually at different addresses, or if it's two calls to the same address where one symbol was resolved at a time before that address was written to a jitdump file but was already present in the perfmap. At that point the jitdump entry would override the perfmap entry, and samply would see it as coming from a different source. If that's the case, I can add an option to samply to fully merge perfmap & jitdump info for a process, which I think is the correct behaviour anyway.

[screenshot]

@brianrob (Member)

> With PerfMapEnabled=2 (jitdump files only), I see the following annotations in the profile: PreJIT, QuickJitted, and Optimized. […] But I also see some unresolved addresses.
>
> With PerfMapEnabled=1 (jitdump and perfmap files), I see PreJIT, QuickJitted, and Optimized, and I don't see any unresolved addresses. […] I see some being resolved from perfmap, and some from jitdump, even for an identical symbol and optimization tier. Is this expected?

Interesting that you see some unresolved symbols with jitdump only, but are able to resolve everything with the perfmap file as well. I would expect them to be essentially identical, with the exception of cases where code gets collected and the address space gets re-used. Then, I would expect jitdump to be correct, and perfmap to be incorrect (though silently so).

The fact that the perfmap solves this does imply that there's perhaps a timing issue, though I'm surprised, since I don't believe that there is any async behavior here and the data should be written before the code can be executed.

@vvuk (Contributor, Author) commented Mar 26, 2024

> Interesting that you see some unresolved symbols with jitdump only, but are able to resolve everything with the perfmap file as well. […]
>
> The fact that the perfmap solves this does imply that there's perhaps a timing issue […]

I think what's happening is that there are stack samples that are themselves timestamped (by samply) before the timestamp recorded for the function in the jitdump file.

If I set DOTNET_PerfMapEnabled=2 (jitdump files only), I get some of the unresolved symbols. All the JIT functions that I see in the trace are QuickJitted. If I modify samply to shift all timestamps backwards by 100ns, I get all symbols resolved. If I modify it to shift timestamps backwards by 1ns, I get a lot more, but not all symbols resolved. I confirmed that both coreclr and samply are using the same timebase (mach_absolute_time, either directly or via clock_gettime_nsec_np; I also tried making the calls identical to make sure).
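
(For reference, the timebase check mentioned above can be sketched roughly as follows; this is an illustrative standalone program, not code from the PR, and it assumes macOS, where mach_absolute_time() scaled by mach_timebase_info() should track clock_gettime_nsec_np(CLOCK_UPTIME_RAW).)

```cpp
// Illustrative check, not from the PR: both readings should be on the same
// timebase, so values taken back to back should differ by well under a
// microsecond.
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>
#include <time.h>
#include <mach/mach_time.h>

int main(void)
{
    mach_timebase_info_data_t tb;
    mach_timebase_info(&tb);

    uint64_t via_clock = clock_gettime_nsec_np(CLOCK_UPTIME_RAW);
    uint64_t via_mach  = mach_absolute_time() * tb.numer / tb.denom;

    printf("clock_gettime_nsec_np(CLOCK_UPTIME_RAW): %" PRIu64 " ns\n", via_clock);
    printf("mach_absolute_time (scaled):             %" PRIu64 " ns\n", via_mach);
    return 0;
}
```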

From comparing some traces, I think the only methods that are missing are also ones that are QuickJitted. I can't prove that though because I'm manually comparing results with and without perfmap files.

Is it possible that there are JIT compiled methods that execute before PerfMap::Enable? I see Enable logging R2R methods as well as iterating through the code heap, so at least some methods must be compiled at that point. I'm going to add a code path so that those can be logged with timestamp 0 in the jitdump file and see what happens.

@vvuk (Contributor, Author) commented Mar 27, 2024

From some investigation... I see call stacks getting captured at timestamps that are 35µs, 85µs etc. before the timestamp in the jitdump record. samply's default capture interval is 1ms, so this looks to be a race condition in samply -- it timestamps samples from the time at the start of sample capture for the entire process/thread tree, but those threads are still running from that point onwards until the sample is actually taken. During this interval coreclr emits a jitdump record and jumps into jit-compiled code, which is captured in the stack... but with the old/earlier timestamp.

I've filed mstange/samply#127 for this -- a fix is to just timestamp samples with the time immediately after the actual thread capture.

@vvuk (Contributor, Author) commented Mar 27, 2024

With a fix for that issue I no longer get any unresolved samples with just jitdump files from coreclr.

@vvuk requested a review from jkotas on March 30, 2024
@jkotas (Member) left a comment

LGTM.

@brianrob @noahfalk Any additional feedback?

@brianrob (Member) left a comment

Overall, LGTM, but one question below.

src/coreclr/pal/src/misc/perfjitdump.cpp (review thread, resolved)
@brianrob (Member) commented Apr 1, 2024

> From some investigation... I see call stacks getting captured at timestamps that are 35µs, 85µs etc. before the timestamp in the jitdump record. […]
>
> I've filed mstange/samply#127 for this -- a fix is to just timestamp samples with the time immediately after the actual thread capture.

@vvuk, thank you for tracking this down and getting it fixed.

@noahfalk (Member) left a comment

This looks fine to me, except I am confused at the same place Brian is: part of the change appears to be in unreachable code.

@jkotas (Member) left a comment

Thank you!

@jkotas (Member) commented Apr 3, 2024

/ba-g Infrastructure timeouts. The affected legs do not exercise the change in a meaningful way; no reason for retrying.

@vvuk (Contributor, Author) commented Apr 3, 2024

Thanks all for being welcoming of a random outside PR -- quite happy with how my first coreclr contribution went (as minor as it was!) :)

@noahfalk (Member) commented Apr 3, 2024

We greatly appreciate the PR and look forward to the cool profiling, thanks @vvuk!

@brianrob (Member) commented Apr 3, 2024

> Thanks all for being welcoming of a random outside PR -- quite happy with how my first coreclr contribution went (as minor as it was!) :)

Awesome. Thanks so much @vvuk! Appreciate the contribution!

@vvuk deleted the upstream-mac-jitdump branch on April 4, 2024
matouskozak pushed a commit to matouskozak/runtime that referenced this pull request on Apr 30, 2024: Enable FEATURE_PERFMAP on OSX, and update perfjitdump.cpp to work on OSX (dotnet#99986)

* Enable FEATURE_PERFMAP on OSX, and update perfjitdump.cpp to work on OSX

* Update PerfMapEnabled documentation

* Enable only on OSX, and use PlatformGetCurrentThreadId

* Manual mach_absolute_time calls

* Use QueryPerformanceCounter

* Cleaner QueryPerformanceFrequency verification

* Use FEATURE_PERFMAP

* Put back conditional, but all __APPLE__

* Fix logic error when disabling jitdump
The github-actions bot locked and limited the conversation to collaborators on May 5, 2024
Labels: area-Diagnostics-coreclr, community-contribution
Participants: 5