Marshal.AllocHGlobal/FreeHGlobal is ~150x slower in .NET than legacy mono on device (tvOS) #58939

rolfbjarne · 2021-09-10T13:36:07Z

Description

Calling Marshal.AllocHGlobal / Marshal.FreeHGlobal is ~150x slower in .NET compared to legacy Mono when running on a tvOS device.

Sample test code: https://gist.github.com/rolfbjarne/b22b844e6f351ad40c4f30e20a2a36d8

Outputs something like this with legacy Mono (Xamarin.iOS from d16-10):

[...]
Elapsed: 17.03 seconds Counter: 265,779,440.00 Iterations per second: 15,608,870
Elapsed: 18.03 seconds Counter: 281,406,670.00 Iterations per second: 15,608,886
Elapsed: 19.03 seconds Counter: 297,098,501.00 Iterations per second: 15,612,343

which is roughly 15.5M calls to Marshal.AllocHGlobal+FreeHGlobal per second.

Now in .NET I get this:

[...]
Elapsed: 17.10 seconds Counter: 1,773,883.00 Iterations per second: 103,745
Elapsed: 18.10 seconds Counter: 1,878,446.00 Iterations per second: 103,784
Elapsed: 19.10 seconds Counter: 1,982,100.00 Iterations per second: 103,771

that's roughly 103k calls to Marshal.AllocHGlobal+FreeHGlobal per second; ~150x slower.

This is on an Apple TV 4K from 2017.

There's a difference in the simulator too, just not so stark (on an iMac Pro)

Legacy Mono:

[...]
Elapsed: 12.02 seconds Counter: 165,442,863.00 Iterations per second: 13,759,790
Elapsed: 13.02 seconds Counter: 179,568,197.00 Iterations per second: 13,787,606
Elapsed: 14.02 seconds Counter: 193,717,056.00 Iterations per second: 13,812,942

and with .NET:

[...]
Elapsed: 12.03 seconds Counter: 39,917,875.00 Iterations per second: 3,317,120
Elapsed: 13.03 seconds Counter: 43,252,347.00 Iterations per second: 3,318,392
Elapsed: 14.04 seconds Counter: 46,617,318.00 Iterations per second: 3,321,424

so ~4x slower.

I profiled the .NET version on device using instruments:
Marshal.trace.zip

Here's a preview:

It seems most of the time is spent inside mono_threads_enter_gc_safe_region_unbalanced.

This function isn't even called in legacy Mono.

Here's an Instruments trace:
MarshalMono.trace.zip

and a preview:

I don't know if this applies to other platforms as well, I only tested tvOS.

The text was updated successfully, but these errors were encountered:

dotnet-issue-labeler · 2021-09-10T13:36:10Z

I couldn't figure out the best area label to add to this issue. If you have write-permissions please help me learn by adding exactly one area label.

EgorBo · 2021-09-10T13:51:40Z

Unrelated, just a note: there is also NativeMemory.Alloc which should be faster than AllocHGlobal

filipnavara · 2021-09-10T14:05:53Z

here is also NativeMemory.Alloc which should be faster than AllocHGlobal

On Unix systems Marshal.AllocHGlobal is just wrapper around NativeMemory.Alloc. On legacy Mono it was an icall, now it's a P/Invoke. It doesn't have the SuppressGCTransition attribute applied so it goes through the expensive GC transitions.

jkotas · 2021-09-10T15:01:22Z

It doesn't have the SuppressGCTransition attribute applied so it goes through the expensive GC transitions.

malloc/free can take locks or it can end up calling OS syscalls like mmap. It would not be appropriate to mark it with SuppressGCTransition.

filipnavara · 2021-09-10T15:06:42Z

malloc/free can take locks or it can end up calling OS syscalls like mmap. It would not be appropriate to mark it with SuppressGCTransition.

I agree it would be a correctness issue to do so. I am still thinking about benchmarking the impact though.

Depending on particular use case there may be other ways to allocate the memory which are less costly (eg. stackalloc for small allocations).

tannergooding · 2021-09-10T15:20:10Z

I think malloc is an interesting case where common implementations try to ensure it is as fast as possible. If you look at the various implementations available, most only take locks or do syscalls in the rare edge case, not in the common path.

That is, these calls that make it "incompatible" with SuppressGCTransition largely only happen when the underlying heap needs to be created or expanded. Otherwise, small allocations avoid all of this as do many medium sized allocations.

It would perhaps be interesting to see if there was something we could do that could help support this kind of scenario.

EgorBo · 2021-09-10T15:30:26Z

It would perhaps be interesting to see if there was something we could do that could help support this kind of scenario.

like emitting it as

if (len >= someLimit)
   emit machinery for gc transition
NativeMemory.Alloc(len);

? where someLimit is some small value that won't cause any mmap under the hood

jkotas · 2021-09-10T15:32:59Z

? where someLimit is some small value that won't cause any mmap under the hood

You cannot tell. Any malloc call, no matter how small the request block size is, can end up taking locks or calling mmap.

EgorBo · 2021-09-10T15:33:53Z

? where someLimit is some small value that won't cause any mmap under the hood

You cannot tell. Any malloc call, no matter how small the request block size is, can end up taking locks or calling mmap.

Right, just discovered that in glibc sources 👍

jkotas · 2021-09-10T15:36:40Z

I think the root cause of this problem is the high overhead implementation strategy used for PInvoke transitions on tvOS. This high overhead is a problem for every other PInvoke. For example, globalization PInvokes will hit it too.

Also add a 'SupportedOSPlatformVersion' value to the .NET perftest project file, to cope with recent changes in our .NET support. Ref: dotnet/runtime#58939.

filipnavara · 2021-09-11T09:16:14Z

I'll have to revisit mono/mono#17110 to see if the optimizations can be done correctly.

On my MacBook Air M1 I get these numbers with the provided test case:

CoreCLR under Rosetta: Iterations per second: 12,019,291
Mono under Rosetta: Iterations per second: 4,003,878
Mono ARM64: Iterations per second: 6,845,245

filipnavara · 2021-09-11T11:27:56Z

There's something weird with my local runtime builds because they behave quite differently to the official ones. I'll have to figure that out first.

Local dotnet/runtime, MacCatalyst ARM64 Debug, Iterations per second: 2,591,953
The same with some optimizations cherry-picked from my earlier attempts to speed up the GC transitions, Iterations per second: 2,750,647

Release builds are way more comparable to the numbers above. My local changes produce Iterations per second: 6,402,773 so basically it's even worse or a problem with my benchmarking numbers. Definitely not the improvement I hoped for.

Another experiment is measuring the overhead by adding SuppressGCTransition (MacCatalyst, Mono, x64 under Rosetta):

Baseline: Iterations per second: 4,003,878
SuppressGCTransition on Free: Iterations per second: 5,573,670
SuppressGCTransition on both Free and Malloc: Iterations per second: 15,170,537

So, yeah, the transitions are crazy expensive.

filipnavara · 2021-09-11T21:18:16Z

I've done some experiment locally that ensures registers are saved to stack and then saves only part of the context on the thread state transition. Saves some memory copying. It would need a lot of polishing and validation to ensure it does not break anything. Mono ARM64 gets Iterations per second: 8,807,528 with the changes, or about 28% improvement.

Not sure if I can get it ready anytime soon but here's a gist of what I was testing:

The marshalling method already saves the LMF structure (last managed frame) which contains callee saved registers. I added saving of the reference-type arguments into stack variables. This ensures that all the passed parameters (such as SafeHandle) exist somewhere on the stack before the native method is called or before the GC transition frame is established.
Since everything is on the stack there's no need to accurately capture all registers since variables/arguments will need to be spilled to the stack. Thus we need to capture only IP/SP (ie. thread_state_init can be simplified). This saves about 12% of the run time.
The copy_stack method is no longer needed in the P/Invoke flow since everything is on the stack already by the time mono_threads_enter_gc_safe_region_unbalanced is called and the stack is not unwound. This saves another ±15% of the run time.
LLVM-only mode is missing support for save_lmf logic. It should emit llvm.eh.unwind.init intrinsic to spill the callee saved registers.

…12696) Also add a 'SupportedOSPlatformVersion' value to the .NET perftest project file, to cope with recent changes in our .NET support. Ref: dotnet/runtime#58939.

marek-safar · 2021-09-13T15:18:23Z

/cc @vargaz

vargaz · 2021-09-13T15:29:12Z

Fixing/improving this would require risky changes so this will unlikely to be fixed for 6.0.

filipnavara · 2021-09-13T15:38:27Z

I agree that this is not generally fixable for .NET 6 timeframe. Something like PR #59029 may help a bit while being backportable. PR #58992 is exploring a more radical solution (but definitely not ready for prime time).

…elease builds (#59269) Backport of #59029 Profiling shows that large part of the GC transition overhead (~30%) in #58939 is caused by assert-style checks. Disabling them seems to be the best bang-for-the-buck option for reducing the overhead without fundamental changes to the code. Co-authored-by: Filip Navara <navara@emclient.com>

SupinePandora43 · 2022-04-12T15:34:13Z

I'll have to revisit mono/mono#17110 to see if the optimizations can be done correctly.

On my MacBook Air M1 I get these numbers with the provided test case:

CoreCLR under Rosetta: Iterations per second: 12,019,291

Mono under Rosetta: Iterations per second: 4,003,878

Mono ARM64: Iterations per second: 6,845,245

What about CoreCLR ARM64?

akoeplinger · 2022-08-12T22:19:57Z

Moving to 8.0

dotnet-issue-labeler bot added the untriaged New issue has not been triaged by the area owner label Sep 10, 2021

rolfbjarne mentioned this issue Sep 10, 2021

[.NET 6] External dotnet issues xamarin/xamarin-macios#8901

Closed

61 tasks

danmoseley added the area-Interop-coreclr label Sep 10, 2021

jkoritzinsky added area-Interop-mono and removed area-Interop-coreclr labels Sep 10, 2021

rolfbjarne mentioned this issue Sep 10, 2021

[tests] Add perf test for calling Marshal.AllocHGlobal/FreeHGlobal. xamarin/xamarin-macios#12696

Merged

filipnavara mentioned this issue Sep 13, 2021

[MonoVM] Reduce P/Invoke GC transition asserts in release builds #59029

Merged

marek-safar added os-tvos Apple tvOS and removed untriaged New issue has not been triaged by the area owner labels Sep 13, 2021

marek-safar added this to the 6.0.0 milestone Sep 13, 2021

marek-safar modified the milestones: 6.0.0, 7.0.0 Sep 13, 2021

lambdageek mentioned this issue Sep 17, 2021

[release/6.0-rc2] [MonoVM] Reduce P/Invoke GC transition asserts in release builds #59269

Merged

rolfbjarne mentioned this issue Mar 29, 2022

[.NET 7] External dotnet issues xamarin/xamarin-macios#14548

Closed

5 tasks

rolfbjarne mentioned this issue Apr 27, 2022

[net6] mono_threads_enter_gc_safe_region_unbalanced performance regression xamarin/xamarin-macios#14835

Open

steveisok assigned vargaz Jul 11, 2022

akoeplinger modified the milestones: 7.0.0, 8.0.0 Aug 12, 2022

marek-safar modified the milestones: 8.0.0, Future Jul 24, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Marshal.AllocHGlobal/FreeHGlobal is ~150x slower in .NET than legacy mono on device (tvOS) #58939

Marshal.AllocHGlobal/FreeHGlobal is ~150x slower in .NET than legacy mono on device (tvOS) #58939

rolfbjarne commented Sep 10, 2021 •

edited

Loading

dotnet-issue-labeler bot commented Sep 10, 2021

EgorBo commented Sep 10, 2021

filipnavara commented Sep 10, 2021

jkotas commented Sep 10, 2021

filipnavara commented Sep 10, 2021

tannergooding commented Sep 10, 2021

EgorBo commented Sep 10, 2021

jkotas commented Sep 10, 2021

EgorBo commented Sep 10, 2021

jkotas commented Sep 10, 2021

filipnavara commented Sep 11, 2021

filipnavara commented Sep 11, 2021 •

edited

Loading

filipnavara commented Sep 11, 2021 •

edited

Loading

marek-safar commented Sep 13, 2021

vargaz commented Sep 13, 2021

filipnavara commented Sep 13, 2021

SupinePandora43 commented Apr 12, 2022

akoeplinger commented Aug 12, 2022

Marshal.AllocHGlobal/FreeHGlobal is ~150x slower in .NET than legacy mono on device (tvOS) #58939

Marshal.AllocHGlobal/FreeHGlobal is ~150x slower in .NET than legacy mono on device (tvOS) #58939

Comments

rolfbjarne commented Sep 10, 2021 • edited Loading

Description

dotnet-issue-labeler bot commented Sep 10, 2021

EgorBo commented Sep 10, 2021

filipnavara commented Sep 10, 2021

jkotas commented Sep 10, 2021

filipnavara commented Sep 10, 2021

tannergooding commented Sep 10, 2021

EgorBo commented Sep 10, 2021

jkotas commented Sep 10, 2021

EgorBo commented Sep 10, 2021

jkotas commented Sep 10, 2021

filipnavara commented Sep 11, 2021

filipnavara commented Sep 11, 2021 • edited Loading

filipnavara commented Sep 11, 2021 • edited Loading

marek-safar commented Sep 13, 2021

vargaz commented Sep 13, 2021

filipnavara commented Sep 13, 2021

SupinePandora43 commented Apr 12, 2022

akoeplinger commented Aug 12, 2022

rolfbjarne commented Sep 10, 2021 •

edited

Loading

filipnavara commented Sep 11, 2021 •

edited

Loading

filipnavara commented Sep 11, 2021 •

edited

Loading