Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

JIT: optimizations for multi-use boxes #9118

Closed
AndyAyersMS opened this issue Oct 13, 2017 · 15 comments
Closed

JIT: optimizations for multi-use boxes #9118

AndyAyersMS opened this issue Oct 13, 2017 · 15 comments
Assignees
Labels
area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI enhancement Product code improvement that does NOT require public API changes/additions optimization Priority:2 Work that is important, but not critical for the release tenet-performance Performance related issue
Milestone

Comments

@AndyAyersMS
Copy link
Member

A fairly common pattern (especially after inlining) is to see a box that feeds an isinst and if that succeeds, an unbox.any. For example:

using System;
using System.Collections.Generic;
using System.Runtime.CompilerServices;

internal class ObjectEqualityComparer<T> : EqualityComparer<T>
{
    [MethodImpl(MethodImplOptions.AggressiveInlining)]
    public override bool Equals(T x, T y)
    {
        if (x != null)
        {
            if (y != null) return x.Equals(y);
            return false;
        }
        if (y != null) return false;
        return true;
    }
    
    [MethodImpl(MethodImplOptions.AggressiveInlining)]
    public override int GetHashCode(T obj) => obj?.GetHashCode() ?? 0;
    
    // Equals method for the comparer itself.
    public override bool Equals(Object obj) =>
    obj != null && GetType() == obj.GetType();
    
    public override int GetHashCode() =>
    GetType().GetHashCode();
}

class C
{
    public static int Main()
    {
        var comp = new ObjectEqualityComparer<int>();
        bool result = comp.Equals(3, 4);
        return result ? 0 : 100;
    }
}

We get pretty far when optimizing Main here -- we can devirtualize the call to Equals, inline it and remove the null checks since we have a value type, then inline the inner call to Equals. But along the way we have to box y and the inner Equals has the following IL:

IL_0000  03                ldarg.1     
IL_0001  75 f1 00 00 02    isinst       0x20000F1
IL_0006  2d 02             brtrue.s     2 (IL_000a)
IL_0008  16                ldc.i4.0    
IL_0009  2a                ret         
IL_000a  02                ldarg.0     
IL_000b  4a                ldind.i4    
IL_000c  03                ldarg.1     
IL_000d  a5 f1 00 00 02    unbox.any    0x20000F1
IL_0012  fe 01             ceq         
IL_0014  2a                ret         

With the advent of dotnet/coreclr#14420 the jit will now optimize away the isinst, but the box cleanup opts for unbox.any don't fire because there is usually a temp in the way, and so we generate the following code for Main:

G_M4930_IG01:
       57                   push     rdi
       56                   push     rsi
       4883EC28             sub      rsp, 40

G_M4930_IG02:
; ** BOX (y) **
       48B9086014E2FA7F0000 mov      rcx, 0x7FFAE2146008
       E8BB3C835F           call     CORINFO_HELP_NEWSFAST   
       C7400804000000       mov      dword ptr [rax+8], 4
       488BF0               mov      rsi, rax
       4885F6               test     rsi, rsi                 ; gratuitous null check ?
       7504                 jne      SHORT G_M4930_IG03
       33FF                 xor      edi, edi
       EB2D                 jmp      SHORT G_M4930_IG05

G_M4930_IG03:
; * UNBOX.ANY type check
       48BA086014E2FA7F0000 mov      rdx, 0x7FFAE2146008
       483916               cmp      qword ptr [rsi], rdx
       7412                 je       SHORT G_M4930_IG04
; * call helper if type check fails (which it won't)
       488BD6               mov      rdx, rsi
       48B9086014E2FA7F0000 mov      rcx, 0x7FFAE2146008
       E8470B395F           call     CORINFO_HELP_UNBOX

G_M4930_IG04:
       837E0803             cmp      dword ptr [rsi+8], 3
       400F94C7             sete     dil
       400FB6FF             movzx    rdi, dil

G_M4930_IG05:
       85FF                 test     edi, edi
       750C                 jne      SHORT G_M4930_IG07
       B864000000           mov      eax, 100

G_M4930_IG06:
       4883C428             add      rsp, 40
       5E                   pop      rsi
       5F                   pop      rdi
       C3                   ret

G_M4930_IG07:
       33C0                 xor      eax, eax

G_M4930_IG08:
       4883C428             add      rsp, 40
       5E                   pop      rsi
       5F                   pop      rdi
       C3                   ret

If when optimizing a successful cast we copy the result to a new more strongly typed temp (see #9117) we might be able to optimize away the type equality check in the downstream unbox.any (see dotnet/coreclr#14473). And perhaps if we are lucky and the box is simple we might be able to propagate the value to be boxed through the box/unbox to the ultimate use, and so not need the unbox. But the box would remain as it is difficult to remove unless it is known to be dead and whatever transformation makes it dead explicitly cleans it up.

A couple of ways we could approach this:

  • The optimizer should be able to reason about and propagate boxes and perhaps trigger the box/unbox.any peephole, and turn the result into a simple copy.
  • BOX is just an expression "wrapper" in the spirit of JIT: some ideas on high-level representation of runtime operations in IR #9056. So we could allow the inliner to give BOX(y) the same treatment as y and duplicate it within the inlinee body (essentially, generalize the logic in impInlineFetchArg that begins with else if (argInfo.argIsLclVar && !argCanBeModified) to also apply to BOX(y)). If we added suitable "reference counting" to boxes to track the duplicates then optimizing away the last use of the box could trigger the box cleanup. We have this today but the reference count is implicit and always = 1 since we don't duplicate the boxed values.

If all this kicked in, the code for Main above would collapse to simply returning a constant.

category:cq
theme:importer
skill-level:expert
cost:medium

@AndyAyersMS
Copy link
Member Author

Considering for 2.1.

@mikedn
Copy link
Contributor

mikedn commented Jan 1, 2018

4885F6 test rsi, rsi ; gratuitous null check ?

Yes. Can be removed by proper value numbering of GT_BOX. See dotnet/coreclr#15666.

I think that the subsequent unbox can be removed too - we need a proper VN for IND(JIT_New) and we should be able to remove CORINFO_HELP_UNBOX in optAssertionProp_Call.

But that still leaves us with the box:

G_M33790_IG01:
       4883EC28             sub      rsp, 40
G_M33790_IG02:
       48B968D0F2DEFC7F0000 mov      rcx, 0x7FFCDEF2D068
       E83DAF805F           call     CORINFO_HELP_NEWSFAST
       C7400804000000       mov      dword ptr [rax+8], 4
       83780803             cmp      dword ptr [rax+8], 3
       0F94C0               sete     al
       0FB6C0               movzx    rax, al
       85C0                 test     eax, eax
       750A                 jne      SHORT G_M33790_IG04
       B864000000           mov      eax, 100
G_M33790_IG03:
       4883C428             add      rsp, 40
       C3                   ret
G_M33790_IG04:
       33C0                 xor      eax, eax
G_M33790_IG05:
       4883C428             add      rsp, 40
       C3                   ret

It would be nice if the optimizer could handle this but I don't think there's any mechanism in the JIT that would allow the elimination of useless stores. So even with the fixes I mentioned above it would still be preferable to get rid of the box sooner.

Another related example:

struct RGB { public byte r, g, b; }
static int Main() => Test<RGB>() ? 42 : 0;
static bool Test<T>() => return default(T) == null;

Test generates:

G_M1515_IG01:
       57                   push     rdi
       56                   push     rsi
       4883EC28             sub      rsp, 40
G_M1515_IG02:
       33F6                 xor      esi, esi
       33FF                 xor      edi, edi
       48B9805AFC7FFC7F0000 mov      rcx, 0x7FFC7FFC5A80
       E827AF845F           call     CORINFO_HELP_NEWSFAST
       488D5008             lea      rdx, bword ptr [rax+8]
       408832               mov      byte  ptr [rdx], sil
       40887A01             mov      byte  ptr [rdx+1], dil
       40887A02             mov      byte  ptr [rdx+2], dil
       4885C0               test     rax, rax
       0F94C0               sete     al
       0FB6C0               movzx    rax, al
G_M1515_IG03:
       4883C428             add      rsp, 40
       5E                   pop      rsi
       5F                   pop      rdi
       C3                   ret

The null check can be eliminated but we're again stuck with the box. And this time it's completely useless, the code doesn't even attempt to read from the allocated memory.

It's not clear to me why the importer is so eager to expand box. It seems that it would be easier if the importer generates a single GT_BOX node that's then expanded during global morph.

@AndyAyersMS
Copy link
Member Author

Expansion of some runtime constructs like box can cause inlining to fail. So this creates tension in the jit: expanding too early reveals ugly details that inhibit or complicates high-level optimizations; expanding too late means the jit can't easily undo inlines that are now invalid.

Hence the wrapper idea from #9056 where we try and have our cake and eat it too. We at least need to do enough sanity checking in importation that we can spot the cases where inlining will fail.

@AndyAyersMS
Copy link
Member Author

Looked a bit at the reference counting idea. For instance genTreeBox nodes that share the same assign and copy could be linked in a circular list that is updated by gtClone and by deletion of box nodes.

Having IR cross links is obviously questionable but we already have these in genTreeBox and are potentially skating on thin ice because we don't have any ref counting.

At any rate, assuming we can somehow track how many uses there are, gtTryRemoveBoxUpstreamEffects should probably fail if there is more than one use. However, callers cannot tell why this method fails. Failing because the box is multi-use may still allow the optimization the caller has in mind to go forward, then allow them to remove their use of the box, and then potentially allow full cleanup to be triggered elsewhere. Hmm.

@AndyAyersMS
Copy link
Member Author

In the RGB example the jit doesn't try optimizing the box early as it is in a return tree. The importer currently only calls gtFoldExpr on trees feeding conditional branches.

We could presumably invoke the expression folder more often during importation and catch more cases. I suppose the question then is "how much more often" -- likely whether the size reduction wins counteract the extra cost of searching for foldable patterns. I'll play around with this and see how likely more eager folding is to lead to changes.

We also try to fold the box tree again during morph but the struct assignment to copy the box payload has been expanded into an exception-flagged comma laden tree that the box cleanup utility does not expect, so it bails out.

Attempting to optimize BOX(valueType) EQ null [000021]
gtTryRemoveBoxUpstreamEffects: attempting to remove side effects of BOX (valuetype) [000019] (assign/newobj [000011] copy [000017])

In this case the exception flags on the morphed copy tree are bogus as we're copying from a local struct to the box payload. The copy tree doesn't have an exception flag initially but fgMorphCopyBlock throws one in for some reason.

@AndyAyersMS
Copy link
Member Author

Looked at the flags issue a bit without gaining much insight. The GTF_EXCEPT flag is flipped on for GT_BLK during morph regardless of whether it was set before. So any upstream knowledge about potentially safe memory operations is seemingly just tossed out. And if we don't set the flag then fgDebugCheckFlags gets cranky.

@mikedn
Copy link
Contributor

mikedn commented Jan 30, 2018

In the RGB example the jit doesn't try optimizing the box early as it is in a return tree. The importer currently only calls gtFoldExpr on trees feeding conditional branches.

Aha! So the trick I used for string.IsNullOrEmpty works here as well but for completely different reasons:

static bool Test<T>() => default(T) == null ? true : false;

generates

G_M8956_IG02:
       33C0                 xor      eax, eax
G_M8956_IG03:
       C3                   ret

Funny.

@mikedn
Copy link
Contributor

mikedn commented Jan 30, 2018

Or in this List<T> code where I extracted the example from:
https://github.com/dotnet/coreclr/blob/8a9c59260b8cc04d266e021229f4b044bf40ad52/src/mscorlib/shared/System/Collections/Generic/List.cs#L193-L198

static bool Test<T>(object value) => (value is T) || (value == null && (default(T) == null ? true : false));

generates

G_M8956_IG02:
       488BC1               mov      rax, rcx
       4885C0               test     rax, rax
       7411                 je       SHORT G_M8956_IG03
       48BA585CC5F0FE7F0000 mov      rdx, 0x7FFEF0C55C58
       483910               cmp      qword ptr [rax], rdx
       7402                 je       SHORT G_M8956_IG03
       33C0                 xor      rax, rax
G_M8956_IG03:
       4885C0               test     rax, rax
       750B                 jne      SHORT G_M8956_IG07
       4885C9               test     rcx, rcx
       7503                 jne      SHORT G_M8956_IG05
       33C0                 xor      eax, eax
G_M8956_IG04:
       C3                   ret
G_M8956_IG05:
       33C0                 xor      eax, eax
G_M8956_IG06:
       C3                   ret
G_M8956_IG07:
       B801000000           mov      eax, 1
G_M8956_IG08:
       C3                   ret

While not ideal at least it no longer does boxing.

@AndyAyersMS
Copy link
Member Author

Seems like importer folding at relational ops is pretty cheap and effective. Will send a PR.

AndyAyersMS referenced this issue in AndyAyersMS/coreclr Jan 30, 2018
Eagerly fold expression trees for non-branch conditional operations.

Leads to elimination of boxes in some idiomatic uses. See notes and
examples in #14472.
@AndyAyersMS
Copy link
Member Author

Think the more general opt needs more conceptual bake time, so moving out of 2.1.

AndyAyersMS referenced this issue in dotnet/coreclr Jan 31, 2018
Eagerly fold expression trees for non-branch conditional operations.

Leads to elimination of boxes in some idiomatic uses. See notes and
examples in #14472.
@AndyAyersMS
Copy link
Member Author

Though I like this proposal I'm still not quite sure how to handle the case where there are multiple uses and the uses have different demands on the box tree. Say for some uses are type tests (and so just want the type handle) and some others are feeding an unboxed entry point (and so just want a local copy of the payload) and some others feed a null test.

Say we refcount and there are N uses. Until we understand all those use contexts we can't decide what shape the box tree should take on. So in addition to the reference count we need to track the kinds of references (basically the BR_xxx values that the uses would request if they were to optimize). And if any context needs the full box, then we can only satisfy the non-destructive uses (type tests and null tests).

Also since optimizations like the Enum.HasFlag couple the fates of two boxes together, box optimizations can potentially be mutually dependent. Though perhaps we can ignore this dependence as the current cases like this are single-use and so can just be handled directly.

@zpodlovics
Copy link

@AndyAyersMS Is it possible to also recognize + optimize box/unbox sequence? It seems this IL sequence is common in F#.

dotnet/fsharp#5019 (comment)

@AndyAyersMS
Copy link
Member Author

Adjacent box / unbox.any is handled via a peephole optimization in the jit importer (look at importer.cpp around line 14890).

If there is IL in between or a dup of the box the jit may not optimize.

@AndyAyersMS
Copy link
Member Author

Looks like a longshot for .NET 9 now too, so moving to future.

@AndyAyersMS AndyAyersMS modified the milestones: 9.0.0, Future Apr 14, 2024
AndyAyersMS added a commit that referenced this issue Jul 1, 2024
Enable object stack allocation for ref classes and extend the support to include boxed value classes. Use a specialized unbox helper for stack allocated boxes, both to avoid apparent escape of the box by the helper, and to ensure all box field accesses are visible to the JIT. Update the local address visitor to rewrite trees involving address of stack allocated boxes in some cases to avoid address exposure. Disable old promotion for stack allocated boxes (since we have no field handles) and allow physical promotion to enregister the box method table and/or payload as appropriate. In OSR methods handle the fact that the stack allocation may actually have been a heap allocation by the Tier0 method.

The analysis TP cost is around 0.4-0.7% (notes below). Boxes are much less likely to escape than ref classes (roughly ~90% of boxes escape, ~99.8% of ref classes escape). Codegen impact is diminished somewhat because many of the boxes are dead and were already getting optimized away.
 
Fixes #4584, #9118, #10195, #11192, #53585, #58554, #85570

---------

Co-authored-by: Jakob Botsch Nielsen <jakob.botsch.nielsen@gmail.com>
Co-authored-by: Jan Kotas <jkotas@microsoft.com>
@AndyAyersMS
Copy link
Member Author

Fixed by #103361

@github-actions github-actions bot locked and limited conversation to collaborators Aug 1, 2024
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI enhancement Product code improvement that does NOT require public API changes/additions optimization Priority:2 Work that is important, but not critical for the release tenet-performance Performance related issue
Projects
None yet
Development

No branches or pull requests

7 participants