Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Marshal.AllocHGlobal/FreeHGlobal is ~150x slower in .NET than legacy mono on device (tvOS) #58939

Open
Tracked by #14548
rolfbjarne opened this issue Sep 10, 2021 · 18 comments
Assignees
Milestone

Comments

@rolfbjarne
Copy link
Member

rolfbjarne commented Sep 10, 2021

Description

Calling Marshal.AllocHGlobal / Marshal.FreeHGlobal is ~150x slower in .NET compared to legacy Mono when running on a tvOS device.

Sample test code: https://gist.github.com/rolfbjarne/b22b844e6f351ad40c4f30e20a2a36d8

Outputs something like this with legacy Mono (Xamarin.iOS from d16-10):

[...]
Elapsed: 17.03 seconds Counter: 265,779,440.00 Iterations per second: 15,608,870
Elapsed: 18.03 seconds Counter: 281,406,670.00 Iterations per second: 15,608,886
Elapsed: 19.03 seconds Counter: 297,098,501.00 Iterations per second: 15,612,343

which is roughly 15.5M calls to Marshal.AllocHGlobal+FreeHGlobal per second.

Now in .NET I get this:

[...]
Elapsed: 17.10 seconds Counter: 1,773,883.00 Iterations per second: 103,745
Elapsed: 18.10 seconds Counter: 1,878,446.00 Iterations per second: 103,784
Elapsed: 19.10 seconds Counter: 1,982,100.00 Iterations per second: 103,771

that's roughly 103k calls to Marshal.AllocHGlobal+FreeHGlobal per second; ~150x slower.

This is on an Apple TV 4K from 2017.

There's a difference in the simulator too, just not so stark (on an iMac Pro)

Legacy Mono:

[...]
Elapsed: 12.02 seconds Counter: 165,442,863.00 Iterations per second: 13,759,790
Elapsed: 13.02 seconds Counter: 179,568,197.00 Iterations per second: 13,787,606
Elapsed: 14.02 seconds Counter: 193,717,056.00 Iterations per second: 13,812,942

and with .NET:

[...]
Elapsed: 12.03 seconds Counter: 39,917,875.00 Iterations per second: 3,317,120
Elapsed: 13.03 seconds Counter: 43,252,347.00 Iterations per second: 3,318,392
Elapsed: 14.04 seconds Counter: 46,617,318.00 Iterations per second: 3,321,424

so ~4x slower.

I profiled the .NET version on device using instruments:
Marshal.trace.zip

Here's a preview:

Screen Shot 2021-09-10 at 15 31 11

It seems most of the time is spent inside mono_threads_enter_gc_safe_region_unbalanced.

This function isn't even called in legacy Mono.

Here's an Instruments trace:
MarshalMono.trace.zip

and a preview:

Screen Shot 2021-09-10 at 15 35 25

I don't know if this applies to other platforms as well, I only tested tvOS.

@dotnet-issue-labeler
Copy link

I couldn't figure out the best area label to add to this issue. If you have write-permissions please help me learn by adding exactly one area label.

@dotnet-issue-labeler dotnet-issue-labeler bot added the untriaged New issue has not been triaged by the area owner label Sep 10, 2021
@EgorBo
Copy link
Member

EgorBo commented Sep 10, 2021

Unrelated, just a note: there is also NativeMemory.Alloc which should be faster than AllocHGlobal

@filipnavara
Copy link
Member

here is also NativeMemory.Alloc which should be faster than AllocHGlobal

On Unix systems Marshal.AllocHGlobal is just wrapper around NativeMemory.Alloc. On legacy Mono it was an icall, now it's a P/Invoke. It doesn't have the SuppressGCTransition attribute applied so it goes through the expensive GC transitions.

@jkotas
Copy link
Member

jkotas commented Sep 10, 2021

It doesn't have the SuppressGCTransition attribute applied so it goes through the expensive GC transitions.

malloc/free can take locks or it can end up calling OS syscalls like mmap. It would not be appropriate to mark it with SuppressGCTransition.

@filipnavara
Copy link
Member

malloc/free can take locks or it can end up calling OS syscalls like mmap. It would not be appropriate to mark it with SuppressGCTransition.

I agree it would be a correctness issue to do so. I am still thinking about benchmarking the impact though.

Depending on particular use case there may be other ways to allocate the memory which are less costly (eg. stackalloc for small allocations).

@tannergooding
Copy link
Member

I think malloc is an interesting case where common implementations try to ensure it is as fast as possible. If you look at the various implementations available, most only take locks or do syscalls in the rare edge case, not in the common path.

That is, these calls that make it "incompatible" with SuppressGCTransition largely only happen when the underlying heap needs to be created or expanded. Otherwise, small allocations avoid all of this as do many medium sized allocations.

It would perhaps be interesting to see if there was something we could do that could help support this kind of scenario.

@EgorBo
Copy link
Member

EgorBo commented Sep 10, 2021

It would perhaps be interesting to see if there was something we could do that could help support this kind of scenario.

like emitting it as

if (len >= someLimit)
   emit machinery for gc transition
NativeMemory.Alloc(len);

? where someLimit is some small value that won't cause any mmap under the hood

@jkotas
Copy link
Member

jkotas commented Sep 10, 2021

? where someLimit is some small value that won't cause any mmap under the hood

You cannot tell. Any malloc call, no matter how small the request block size is, can end up taking locks or calling mmap.

@EgorBo
Copy link
Member

EgorBo commented Sep 10, 2021

? where someLimit is some small value that won't cause any mmap under the hood

You cannot tell. Any malloc call, no matter how small the request block size is, can end up taking locks or calling mmap.

Right, just discovered that in glibc sources 👍

@jkotas
Copy link
Member

jkotas commented Sep 10, 2021

I think the root cause of this problem is the high overhead implementation strategy used for PInvoke transitions on tvOS. This high overhead is a problem for every other PInvoke. For example, globalization PInvokes will hit it too.

rolfbjarne added a commit to rolfbjarne/xamarin-macios that referenced this issue Sep 10, 2021
Also add a 'SupportedOSPlatformVersion' value to the .NET perftest project
file, to cope with recent changes in our .NET support.

Ref: dotnet/runtime#58939.
@filipnavara
Copy link
Member

I'll have to revisit mono/mono#17110 to see if the optimizations can be done correctly.

On my MacBook Air M1 I get these numbers with the provided test case:

  • CoreCLR under Rosetta: Iterations per second: 12,019,291
  • Mono under Rosetta: Iterations per second: 4,003,878
  • Mono ARM64: Iterations per second: 6,845,245

@filipnavara
Copy link
Member

filipnavara commented Sep 11, 2021

There's something weird with my local runtime builds because they behave quite differently to the official ones. I'll have to figure that out first.

Local dotnet/runtime, MacCatalyst ARM64 Debug, Iterations per second: 2,591,953
The same with some optimizations cherry-picked from my earlier attempts to speed up the GC transitions, Iterations per second: 2,750,647

Release builds are way more comparable to the numbers above. My local changes produce Iterations per second: 6,402,773 so basically it's even worse or a problem with my benchmarking numbers. Definitely not the improvement I hoped for.

Another experiment is measuring the overhead by adding SuppressGCTransition (MacCatalyst, Mono, x64 under Rosetta):

  • Baseline: Iterations per second: 4,003,878
  • SuppressGCTransition on Free: Iterations per second: 5,573,670
  • SuppressGCTransition on both Free and Malloc: Iterations per second: 15,170,537

So, yeah, the transitions are crazy expensive.

@filipnavara
Copy link
Member

filipnavara commented Sep 11, 2021

I've done some experiment locally that ensures registers are saved to stack and then saves only part of the context on the thread state transition. Saves some memory copying. It would need a lot of polishing and validation to ensure it does not break anything. Mono ARM64 gets Iterations per second: 8,807,528 with the changes, or about 28% improvement.

Not sure if I can get it ready anytime soon but here's a gist of what I was testing:

  • The marshalling method already saves the LMF structure (last managed frame) which contains callee saved registers. I added saving of the reference-type arguments into stack variables. This ensures that all the passed parameters (such as SafeHandle) exist somewhere on the stack before the native method is called or before the GC transition frame is established.
  • Since everything is on the stack there's no need to accurately capture all registers since variables/arguments will need to be spilled to the stack. Thus we need to capture only IP/SP (ie. thread_state_init can be simplified). This saves about 12% of the run time.
  • The copy_stack method is no longer needed in the P/Invoke flow since everything is on the stack already by the time mono_threads_enter_gc_safe_region_unbalanced is called and the stack is not unwound. This saves another ±15% of the run time.
  • LLVM-only mode is missing support for save_lmf logic. It should emit llvm.eh.unwind.init intrinsic to spill the callee saved registers.

rolfbjarne added a commit to xamarin/xamarin-macios that referenced this issue Sep 13, 2021
…12696)

Also add a 'SupportedOSPlatformVersion' value to the .NET perftest project
file, to cope with recent changes in our .NET support.

Ref: dotnet/runtime#58939.
@marek-safar marek-safar added os-tvos Apple tvOS and removed untriaged New issue has not been triaged by the area owner labels Sep 13, 2021
@marek-safar marek-safar added this to the 6.0.0 milestone Sep 13, 2021
@marek-safar
Copy link
Contributor

/cc @vargaz

@vargaz
Copy link
Contributor

vargaz commented Sep 13, 2021

Fixing/improving this would require risky changes so this will unlikely to be fixed for 6.0.

@marek-safar marek-safar modified the milestones: 6.0.0, 7.0.0 Sep 13, 2021
@filipnavara
Copy link
Member

I agree that this is not generally fixable for .NET 6 timeframe. Something like PR #59029 may help a bit while being backportable. PR #58992 is exploring a more radical solution (but definitely not ready for prime time).

steveisok pushed a commit that referenced this issue Sep 21, 2021
…elease builds (#59269)

Backport of #59029

Profiling shows that large part of the GC transition overhead (~30%) in #58939 is caused by assert-style checks. Disabling them seems to be the best bang-for-the-buck option for reducing the overhead without fundamental changes to the code.

Co-authored-by: Filip Navara <navara@emclient.com>
@SupinePandora43
Copy link

I'll have to revisit mono/mono#17110 to see if the optimizations can be done correctly.

On my MacBook Air M1 I get these numbers with the provided test case:

  • CoreCLR under Rosetta: Iterations per second: 12,019,291
  • Mono under Rosetta: Iterations per second: 4,003,878
  • Mono ARM64: Iterations per second: 6,845,245

What about CoreCLR ARM64?

@akoeplinger
Copy link
Member

Moving to 8.0

@marek-safar marek-safar modified the milestones: 8.0.0, Future Jul 24, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
Status: No status
Development

No branches or pull requests