Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

NativeAOT size conscious authoring observations #83902

Closed
NinoFloris opened this issue Mar 24, 2023 · 11 comments
Closed

NativeAOT size conscious authoring observations #83902

NinoFloris opened this issue Mar 24, 2023 · 11 comments
Labels
area-NativeAOT-coreclr linkable-framework Issues associated with delivering a linker friendly framework
Milestone

Comments

@NinoFloris
Copy link
Contributor

NativeAOT authoring observations

As I'm continuing to iterate to shave off overhead from the new serializer for Npgsql, and slowly getting more familiar with the code patterns that work well with NativeAOT I wanted to report on my findings so far. I will be using this issue to centralize future observations of a similar nature as well.

Async

One class of issues in particular I want to highlight is the cost of everything relating to async code. Having any of these methods in generic types is one major source of bloat due to their inherent IL type + codegen explosion` multiplied by the number of canonical instantiations.

It might be worthwhile for the C# and runtime team to take a look at what can be done to reduce its footprint, as this interaction is supremely problematic for more casual authors. If there is no significant improvement to be made here it could help to expose relevant tools for authors to reduce it on a case by case basis.

Some of those tools could be for instance having public methods on ValueTask and ValueTask to move from its generic form to non generic and back, allowing us external authors to support pooled IValueTaskSource(T) while doing so.
Tasks have inherent support of up/downcasting to do this. For ValueTask these apis could be used in the same way as up/downcasting Tasks, for the purpose of pushing this bulky async codegen to non generic types. There already is the internal method ValueTask.DangerousCreateFromTypedValueTask as a validation of this being useful internally.

Async Types

Next up I started looking at the general cost of even having apis returning ValueTask types, it's not cheap... Not only do you pay for the Task types but also for ValueTask<T>, IValueTaskSource<T>, ValueTask<T>.ValueTaskSourceAsTask, ValueTaskAwaiter<T> and others.

I have two links here, one before dropping about 20(?) mostly reference type instantiations of ValueTask<T> and one report after doing so. The difference is a hefty 80kb, with methods taking up just 24kb of that difference. The remainder is mostly types and type dictionary metadata.

Before: https://github.com/NinoFloris/Slon/actions/runs/4507705639
After: https://github.com/NinoFloris/Slon/actions/runs/4507808228

The mstats are attached if you want to explore the System.Private.Corelib types in more detail.

Reference Types

Onto the next papercut, reference type instantiations in particular. Now their methods are shared via __Canon, however all their concrete types are still required to exist in full, take a look:

Screenshot 2023-03-24 at 19 44 15

These all add to the binary size, but I'm not sure what they're exactly uniquely adding. Could these EETypes/method tables (what is the correct name of this?) be shared in any way via __Canon? Only keeping a concrete generic context around per reference type instantiation? I understand the latter would always be needed to have correct type testing etc. inside methods.

Improving this situation seems like it may also reduce the previously discussed ValueTask<T> type bloat?

Value Type Code Sharing

Finally I would urge runtime devs to consider what it would take to share code across same size/layout value types, a la gc shapes in golang. For a type like int that could allow eligible code for say List<T> to be reused across int, uint, enums,ValueTuple<int>, DateOnly and other types wrapping a single int. The same would go for other primitives like long that have a lot of representationally isomorphic types to share code with.

I understand this cuts into the ability to do runtime intrinsic optimizations (like uints never being negative values etc). I also see how this may complicate codegen as this sharing is only possible when the different instantiations don't actually produce different bodies (i.e. int.Equals will produce different code than DateOnly.Equals), however there will be many methods that don't depend on the generic type's methods at all, just their data representation. For an initial good enough experiment it might just be sufficient to add a stage that aliases same-type method instantiations by their body being byte for byte identical?

Such a stage may also help with sharing code across all instantiations when it doesn't make use of the generic context at all.
IIRC there is already some global deduplication mode (which impacts stack traces) but only sharing per generic type seems to be more suitable to be enabled by default?

I can see the theoretical version of these things working but I'm obviously not sure what this would mean more concretely. (and the practical problems flowing from this, which I'm surely glossing over here)

Conclusion

If we're really serious about NativeAOT being 'effortlessly' competitive (so no crazy authoring) these issues must be explored. If only just to understand the problematic elements better.

All in all it's been challenging to keep size down to acceptable levels in this particular area of generics, async and serializer-like code.

@DamianEdwards Is there a world in which we drive dotnet/aspnetcore#45910 stage 2 efforts across internal and external collaborators more effectively than just github issues? Is that something you're open to?

@dotnet-issue-labeler dotnet-issue-labeler bot added the needs-area-label An area label is needed to ensure this gets routed to the appropriate area owners label Mar 24, 2023
@ghost ghost added the untriaged New issue has not been triaged by the area owner label Mar 24, 2023
@NinoFloris
Copy link
Contributor Author

The details below dive a bit deeper into some of the async overhead workarounds, this may be useful for other NativeAOT code authors. I've folded it to keep the issue focussed.

Workarounds

One of the hacks I've used to reduce size of async methods in generic classes (when they have a single await and no generic return type) can be seen being implemented in this commit NinoFloris/Slon@bcd8572

The basic steps come down to:

  • Create a non generic base class for the generic class
  • Define in it the async method which would otherwise live in the generic class, it should be taking the Task<T> value to be awaited, but upcasted to Task, the return type of this method should match the generic class method return type.
  • Define an abstract method to post the awaited task back to + any arguments that may be additionally required to do so.
    Then, in the derived generic class:
  • Implement the abstract and downcast the task back to its generic form, pull out the result and work with it.
  • Where once was the async method you now call into the base class async method.

if you already inherit from a class you can't modify or when you are implementing a generic struct, an alternative is to move the async method to a static class which then takes a callback delegate as an alternative to the virtual method. This is worse in certain ways, better in others, see https://github.com/NinoFloris/Slon/blob/c7f2b856e35e91d1f025640785fae4172d960f00/Slon/Pg/Converters/CollectionConverter.cs#L242-L255 for an example.

When working with ValueTasks the generic class method should first check whether the ValueTask is completed and skip the base method call if so, saving the alloc and virtual call.

That's basically it, when dealing with ValueTask's that wrap an IValueTaskSource this technique won't work without allocating, AsTask() will have to allocate a Task to proxy the IValueTaskSource. This is where a public version of ValueTask.DangerousCreateFromTypedValueTask could be relevant.

That being said, if that method would exist it wouldn't be an easy pattern to get right. As you can only use an IValueTaskSource once you can't use normal async await code, it would call GetResult on the non generic interface. And by doing so it would free the source without you getting the result. It would instead require you to write your own state machine to call into the virtual/callback after awaiting (some help from the language or some api to simplify doing this could be useful again).

Once there you can proceed as usual, by converting the ValueTask back to its generic form ValueTask<T>.GetAwaiter.GetResult (or ValueTask.Result) now correctly goes through IValueTaskSource<T>.GetResult giving the desired free + result.

@ghost
Copy link

ghost commented Mar 24, 2023

Tagging subscribers to this area: @agocke, @MichalStrehovsky, @jkotas
See info in area-owners.md if you want to be subscribed.

Issue Details

NativeAOT authoring observations

As I'm continuing to iterate to shave off overhead from the new serializer for Npgsql, and slowly getting more familiar with the code patterns that work well with NativeAOT I wanted to report on my findings so far. I will be using this issue to centralize future observations of a similar nature as well.

Async

One class of issues in particular I want to highlight is the cost of everything relating to async code. Having any of these methods in generic types is one major source of bloat due to their inherent IL type + codegen explosion` multiplied by the number of canonical instantiations.

It might be worthwhile for the C# and runtime team to take a look at what can be done to reduce its footprint, as this interaction is supremely problematic for more casual authors. If there is no significant improvement to be made here it could help to expose relevant tools for authors to reduce it on a case by case basis.

Some of those tools could be for instance having public methods on ValueTask and ValueTask to move from its generic form to non generic and back, allowing us external authors to support pooled IValueTaskSource(T) while doing so.
Tasks have inherent support of up/downcasting to do this. For ValueTask these apis could be used in the same way as up/downcasting Tasks, for the purpose of pushing this bulky async codegen to non generic types. There already is the internal method ValueTask.DangerousCreateFromTypedValueTask as a validation of this being useful internally.

Async Types

Next up I started looking at the general cost of even having apis returning ValueTask types, it's not cheap... Not only do you pay for the Task types but also for ValueTask<T>, IValueTaskSource<T>, ValueTask<T>.ValueTaskSourceAsTask, ValueTaskAwaiter<T> and others.

I have two links here, one before dropping about 20(?) mostly reference type instantiations of ValueTask<T> and one report after doing so. The difference is a hefty 80kb, with methods taking up just 24kb of that difference. The remainder is mostly types and type dictionary metadata.

Before: https://github.com/NinoFloris/Slon/actions/runs/4507705639
After: https://github.com/NinoFloris/Slon/actions/runs/4507808228

The mstats are attached if you want to explore the System.Private.Corelib types in more detail.

Reference Types

Onto the next papercut, reference type instantiations in particular. Now their methods are shared via __Canon, however all their concrete types are still required to exist in full, take a look:

Screenshot 2023-03-24 at 19 44 15

These all add to the binary size, but I'm not sure what they're exactly uniquely adding. Could these EETypes/method tables (what is the correct name of this?) be shared in any way via __Canon? Only keeping a concrete generic context around per reference type instantiation? I understand the latter would always be needed to have correct type testing etc. inside methods.

Improving this situation seems like it may also reduce the previously discussed ValueTask<T> type bloat?

Value Type Code Sharing

Finally I would urge runtime devs to consider what it would take to share code across same size/layout value types, a la gc shapes in golang. For a type like int that could allow eligible code for say List<T> to be reused across int, uint, enums,ValueTuple<int>, DateOnly and other types wrapping a single int. The same would go for other primitives like long that have a lot of representationally isomorphic types to share code with.

I understand this cuts into the ability to do runtime intrinsic optimizations (like uints never being negative values etc). I also see how this may complicate codegen as this sharing is only possible when the different instantiations don't actually produce different bodies (i.e. int.Equals will produce different code than DateOnly.Equals), however there will be many methods that don't depend on the generic type's methods at all, just their data representation. For an initial good enough experiment it might just be sufficient to add a stage that aliases same-type method instantiations by their body being byte for byte identical?

Such a stage may also help with sharing code across all instantiations when it doesn't make use of the generic context at all.
IIRC there is already some global deduplication mode (which impacts stack traces) but only sharing per generic type seems to be more suitable to be enabled by default?

I can see the theoretical version of these things working but I'm obviously not sure what this would mean more concretely. (and the practical problems flowing from this, which I'm surely glossing over here)

Conclusion

If we're really serious about NativeAOT being 'effortlessly' competitive (so no crazy authoring) these issues must be explored. If only just to understand the problematic elements better.

All in all it's been challenging to keep size down to acceptable levels in this particular area of generics, async and serializer-like code.

@DamianEdwards Is there a world in which we drive dotnet/aspnetcore#45910 stage 2 efforts across internal and external collaborators more effectively than just github issues? Is that something you're open to?

Author: NinoFloris
Assignees: -
Labels:

untriaged, area-NativeAOT-coreclr, needs-area-label

Milestone: -

@SingleAccretion SingleAccretion removed the needs-area-label An area label is needed to ensure this gets routed to the appropriate area owners label Mar 24, 2023
@agocke agocke added this to the Future milestone Mar 24, 2023
@agocke agocke removed the untriaged New issue has not been triaged by the area owner label Mar 24, 2023
@jkotas
Copy link
Member

jkotas commented Mar 25, 2023

Could these EETypes/method tables (what is the correct name of this?) be shared in any way via __Canon?

260 bytes is the size after sharing via __Canon. In general, smaller size implies worse speed. Would you be willing to pay for smaller size by worse startup time or worse throughput?

there is already some global deduplication mode (which impacts stack traces) but only sharing per generic type seems to be more suitable to be enabled by default?

This undocumented mode can be enable by <IlcFoldIdenticalMethodBodies>true</IlcFoldIdenticalMethodBodies>. Enabling it will allow you to measure upper bound of what can be potentially achieved by this optimization. This mode breaks stacktraces and reflection in subtle ways, for both generic and non-generic code.

@NinoFloris
Copy link
Contributor Author

260 bytes is the size after sharing via __Canon. In general, smaller size implies worse time. Would you be willing to pay for smaller size by worse startup time or worse throughput?

There's no sign of a __Canon entry, just a type definition, (non virtual method tables can get trimmed afaik?). (you can view the report here, its the last folded panel https://github.com/NinoFloris/Slon/actions/runs/4507808228)

If the BOTR broadly applies to NativeAOT I'm aware the entire virtual method table is not shared, but I don't quite understand why. Is it really just to have concrete method table pointers on object instance headers? Would the throughput loss mean adding an indirection here? While the startup time increase would be for synthesizing these copies? Could that be ad-hoc on first allocation?

The problem I'm dealing with is keeping code size small while having a reasonable supported set of types baked in, the chances that they'll all be used by the app is small, and the chances they're needed immediately even smaller. User directed rooting - which would certainly be preferable - is not a great story in ADO.NET, but that's a different matter.

To answer your question directly, it might be worth some form of perf loss yes, but it'd have to be quantified first. The bad alternative is working purely with System.Object results everywhere, downcasting to its known type at the edges instead. It gets rid of all the ValueTask types, the CollectionConverter types and all the other type bloat. I have a commit that went down this road but I backed out of it as it just adds more implementation complexity, I hope to avoid it.

Is this issue in any way related? #83438

This undocumented mode can be enable by true.

Yeah that's the one! Thanks, saves a me search :) I'll see what it does.

@jkotas
Copy link
Member

jkotas commented Mar 25, 2023

Would the throughput loss mean adding an indirection here?

Yes, it would mean adding indirection in the method table and an extra instruction in all virtual callsites. It is not clear whether it would be a size win at the end.

While the startup time increase would be for synthesizing these copies?

Correct.

@danmoseley danmoseley added the linkable-framework Issues associated with delivering a linker friendly framework label Mar 28, 2023
@ghost
Copy link

ghost commented Mar 28, 2023

Tagging subscribers to 'linkable-framework': @eerhardt, @vitek-karas, @LakshanF, @sbomer, @joperezr, @marek-safar
See info in area-owners.md if you want to be subscribed.

Issue Details

NativeAOT authoring observations

As I'm continuing to iterate to shave off overhead from the new serializer for Npgsql, and slowly getting more familiar with the code patterns that work well with NativeAOT I wanted to report on my findings so far. I will be using this issue to centralize future observations of a similar nature as well.

Async

One class of issues in particular I want to highlight is the cost of everything relating to async code. Having any of these methods in generic types is one major source of bloat due to their inherent IL type + codegen explosion` multiplied by the number of canonical instantiations.

It might be worthwhile for the C# and runtime team to take a look at what can be done to reduce its footprint, as this interaction is supremely problematic for more casual authors. If there is no significant improvement to be made here it could help to expose relevant tools for authors to reduce it on a case by case basis.

Some of those tools could be for instance having public methods on ValueTask and ValueTask to move from its generic form to non generic and back, allowing us external authors to support pooled IValueTaskSource(T) while doing so.
Tasks have inherent support of up/downcasting to do this. For ValueTask these apis could be used in the same way as up/downcasting Tasks, for the purpose of pushing this bulky async codegen to non generic types. There already is the internal method ValueTask.DangerousCreateFromTypedValueTask as a validation of this being useful internally.

Async Types

Next up I started looking at the general cost of even having apis returning ValueTask types, it's not cheap... Not only do you pay for the Task types but also for ValueTask<T>, IValueTaskSource<T>, ValueTask<T>.ValueTaskSourceAsTask, ValueTaskAwaiter<T> and others.

I have two links here, one before dropping about 20(?) mostly reference type instantiations of ValueTask<T> and one report after doing so. The difference is a hefty 80kb, with methods taking up just 24kb of that difference. The remainder is mostly types and type dictionary metadata.

Before: https://github.com/NinoFloris/Slon/actions/runs/4507705639
After: https://github.com/NinoFloris/Slon/actions/runs/4507808228

The mstats are attached if you want to explore the System.Private.Corelib types in more detail.

Reference Types

Onto the next papercut, reference type instantiations in particular. Now their methods are shared via __Canon, however all their concrete types are still required to exist in full, take a look:

Screenshot 2023-03-24 at 19 44 15

These all add to the binary size, but I'm not sure what they're exactly uniquely adding. Could these EETypes/method tables (what is the correct name of this?) be shared in any way via __Canon? Only keeping a concrete generic context around per reference type instantiation? I understand the latter would always be needed to have correct type testing etc. inside methods.

Improving this situation seems like it may also reduce the previously discussed ValueTask<T> type bloat?

Value Type Code Sharing

Finally I would urge runtime devs to consider what it would take to share code across same size/layout value types, a la gc shapes in golang. For a type like int that could allow eligible code for say List<T> to be reused across int, uint, enums,ValueTuple<int>, DateOnly and other types wrapping a single int. The same would go for other primitives like long that have a lot of representationally isomorphic types to share code with.

I understand this cuts into the ability to do runtime intrinsic optimizations (like uints never being negative values etc). I also see how this may complicate codegen as this sharing is only possible when the different instantiations don't actually produce different bodies (i.e. int.Equals will produce different code than DateOnly.Equals), however there will be many methods that don't depend on the generic type's methods at all, just their data representation. For an initial good enough experiment it might just be sufficient to add a stage that aliases same-type method instantiations by their body being byte for byte identical?

Such a stage may also help with sharing code across all instantiations when it doesn't make use of the generic context at all.
IIRC there is already some global deduplication mode (which impacts stack traces) but only sharing per generic type seems to be more suitable to be enabled by default?

I can see the theoretical version of these things working but I'm obviously not sure what this would mean more concretely. (and the practical problems flowing from this, which I'm surely glossing over here)

Conclusion

If we're really serious about NativeAOT being 'effortlessly' competitive (so no crazy authoring) these issues must be explored. If only just to understand the problematic elements better.

All in all it's been challenging to keep size down to acceptable levels in this particular area of generics, async and serializer-like code.

@DamianEdwards Is there a world in which we drive dotnet/aspnetcore#45910 stage 2 efforts across internal and external collaborators more effectively than just github issues? Is that something you're open to?

Author: NinoFloris
Assignees: -
Labels:

linkable-framework, area-NativeAOT-coreclr

Milestone: Future

@danmoseley
Copy link
Member

@DamianEdwards Is there a world in which we drive dotnet/aspnetcore#45910 stage 2 efforts across internal and external collaborators more effectively than just github issues? Is that something you're open to?

@eerhardt are there tasks where the warnings for a library (stage 2 or not) seem relatively well understood and community folks could work through fixing them? or perhaps that would too much parallelism right now.

@eerhardt
Copy link
Member

are there tasks where the warnings for a library (stage 2 or not) seem relatively well understood and community folks could work through fixing them?

I'm not sure that is what @NinoFloris is asking for. But I will tag the issues for each area in dotnet/aspnetcore#45910 stage 2.a.

Is there a world in which we drive dotnet/aspnetcore#45910 stage 2 efforts across internal and external collaborators more effectively than just github issues? Is that something you're open to?

@NinoFloris - do you have a specific proposal? What other medium besides github issues would you use?

@MichalPetryka
Copy link
Contributor

MichalPetryka commented Mar 31, 2023

260 bytes is the size after sharing via __Canon. In general, smaller size implies worse speed. Would you be willing to pay for smaller size by worse startup time or worse throughput?

Personally I feel like games are an important usecase for AOTs and there the code size mostly doesn't matter since even if you have 100MB of binaries, you'll likely have multiple gigabytes of assets with it. As such, I believe that all size optimizations that negatively impact performance should be optional (or possible to disable via <IlcOptimizationPreference>Speed</IlcOptimizationPreference>).

@MichalStrehovsky
Copy link
Member

The size of async is tracked in #79204 and similar.

The valuetype sharing would come at a throughput cost - consider code like this:

[MethodImpl(MethodImplOptions.AggressiveInlining)] // compiles to a single comparison or method call
private static bool LessThan(ref T left, ref T right)
{
if (typeof(T) == typeof(byte)) return (byte)(object)left < (byte)(object)right;
if (typeof(T) == typeof(sbyte)) return (sbyte)(object)left < (sbyte)(object)right;
if (typeof(T) == typeof(ushort)) return (ushort)(object)left < (ushort)(object)right;
if (typeof(T) == typeof(short)) return (short)(object)left < (short)(object)right;
if (typeof(T) == typeof(uint)) return (uint)(object)left < (uint)(object)right;
if (typeof(T) == typeof(int)) return (int)(object)left < (int)(object)right;
if (typeof(T) == typeof(ulong)) return (ulong)(object)left < (ulong)(object)right;
if (typeof(T) == typeof(long)) return (long)(object)left < (long)(object)right;
if (typeof(T) == typeof(nuint)) return (nuint)(object)left < (nuint)(object)right;
if (typeof(T) == typeof(nint)) return (nint)(object)left < (nint)(object)right;
if (typeof(T) == typeof(float)) return (float)(object)left < (float)(object)right;
if (typeof(T) == typeof(double)) return (double)(object)left < (double)(object)right;
if (typeof(T) == typeof(Half)) return (Half)(object)left < (Half)(object)right;
return left.CompareTo(right) < 0 ? true : false;
}

If we do valuetype sharing here, what was originally a single instruction for long, now potentially becomes 4 comparisons (long/ulong/nint/nuint all have the same size). There's tons of high-performance code within the framework that assumes this code gets specialized at compile time. We do this especially when perf matters. We wouldn't be able to enable this by default, only if the user says "I care about size more than anything else". It would be a niche feature.

@jkotas
Copy link
Member

jkotas commented Aug 7, 2023

I do not see anything actionable in this issue. The actionable suggestions are tracked in linked issues.

@jkotas jkotas closed this as completed Aug 7, 2023
@ghost ghost locked as resolved and limited conversation to collaborators Sep 6, 2023
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
area-NativeAOT-coreclr linkable-framework Issues associated with delivering a linker friendly framework
Projects
Archived in project
Development

No branches or pull requests

8 participants