Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Array interface method devirtualization #62457

Open
AndyAyersMS opened this issue Dec 6, 2021 · 10 comments
Open

Array interface method devirtualization #62457

AndyAyersMS opened this issue Dec 6, 2021 · 10 comments
Assignees
Labels
area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI
Milestone

Comments

@AndyAyersMS
Copy link
Member

AndyAyersMS commented Dec 6, 2021

The jit cannot devirtualize interface calls made through arrays, because this aspect of arrays has a special implementation in the runtime. I noticed this a while back but never wrote up anything. I was reminded of this again by #62266.

Consider a simple program like:

using System;
using System.Collections.Generic;

class X {
    public static void Main() {
        IList<int> list = new int[10];
        for (int i = 0; i < 100; i++) {
            foreach (var _ in list) {
                // nothing
            }
        }
    }
}

Here all types are known and we should be able to de-abstract this and come close to the equivalent for loop.

Currently none of the interface calls devirtualize or inline, the method allocates, and the try/finally can't be optimized away, despite the empty Dispose from the enumerator.

Fixing all this requires resolving a number of issues:

  1. resolveVirtualMethodHelper on the jit interface calls CanCastToInterface -- this bypasses special checks in CanCastTo that handle interface methods for arrays.
  2. Once that is fixed, the method that gets devirtualized to is SZArrayHelper.GetEnumerator<T> which is a generic method, not an instance method on a generic class. So the context handling in resolveVirtualMethodHelper needs to be updated here and also over in impDevirtualizeMethod -- currently we always assume class context.
  3. Similar changes are needed in crossgen2 (haven't looked into this yet)
  4. Once that is fixed, the jit can devirtualize the GetEnumerator call, but this happens "too late" to devirtualize the subsequent MoveNext and get_Current calls through the enumerator. The fix here is likely to mark these GetEnumerator methods as intrinsics and extend getSpecialIntrinsicExactReturnType to propagate the right type downstream during importation.
  5. The GetEnumerator call is too big to inline by default. Likely we can just mark this with AggressiveInlining. But pragmatically we might want to boost inlining for methods that return a type (especially a sealed type) that is more derived that the declared return type. Doing so likely requires resolving more tokens during the pre-inline scan, but perhaps we're already doing this for newobj and it might not be too costly.
  6. If we inline (and haven't fixed (4) yet) then we can late devirtualize the MoveNext and Dispose as we see the improved type flowing out of the inlined method -- but not the call to get_Current -- not sure why just yet.
  7. The enumerator is a class, not a struct; if we made it a struct we might have a shot at removing the boxing (rather than proving we can stack allocate a class). But this box would have multiple uses, so we'd also need to tackle JIT: optimizations for multi-use boxes #9118.
  8. It remains to be seen if we actually do inline all those methods and undo the boxing whether we can then enable struct promotion and remove the overhead of iterating via an enumerator.

I will point out that the "equivalent" direct code (with loop non-empty to avoid it being optimized away entirely)

    public static int M_enum() 
    {
        int[] a = new int[1000];
        int result = 0;
        foreach (var v in a)
        {
           result ^= v;
        }
        return result;
    } 

is marginally less efficient than the for loop version:

    public static int M() 
    {
        int[] a = new int[1000];
        int result = 0;
        for (int i = 0; i < a.Length; i++)
        {
           result ^= a[i]; 
        }
        return result;
    }

In the more general case where we don't know the exact collection class, we might hope PGO would get us to an equivalent place. But here there are additional challenges.

  1. The enumerator GDV doesn't allow us to know the enumerator type in the subsequent loop. We ought to notice a correlation between the type of the enumerator returned and the class type seen in the loop interface calls (or the fact that the latter are invariants) and decide to clone the loop based on a type test. That would remove the GDV tests from within the loop.
  2. Ideally then we see that the loop cloning check is partially redundant based on the outcome of the GetEnumerator GDV test and simplify that path further. So, in the "good" case we do one type test to see what kind of collection we have, and then branch to a specialized loop that knows exactly what kind of enumerator we have.

This is perhaps the simplest case of de-abstraction. Other collection types have more complex enumerators.

category:cq
theme:devirtualization
skill-level:expert
cost:small
impact:medium

@dotnet-issue-labeler dotnet-issue-labeler bot added area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI untriaged New issue has not been triaged by the area owner labels Dec 6, 2021
@ghost
Copy link

ghost commented Dec 6, 2021

Tagging subscribers to this area: @JulieLeeMSFT
See info in area-owners.md if you want to be subscribed.

Issue Details

The jit cannot devirtualize interface calls made through arrays, because this aspect of arrays has a special implementation in the runtime. I noticed this a while back but never wrote up anything. I was reminded of this again by #62266.

Consider a simple program like:

using System;
using System.Collections.Generic;

class X {
    public static void Main() {
        IList<int> list = new int[10];
        for (int i = 0; i < 100; i++) {
            foreach (var _ in list) {
                // nothing
            }
        }
    }
}

Here all types are known and we should be able to de-abstract this and come close to the equivalent for loop.

Currently none of the interface calls devirtualize or inline, the method allocates, and the try/finally can't be optimized away, despite the empty Dispose from the enumerator.

Fixing all this requires resolving a number of issues:

  1. resolveVirtualMethodHelper on the jit interface calls CanCastToInterface -- this bypasses special checks in CanCastTo that handle interface methods for arrays.
  2. Once that is fixed, the method that gets devirtualized to is SZArrayHelper.GetEnumerator<T> which is a generic method, not an instance method on a generic class. So the context handling in resolveVirtualMethodHelper needs to be updated here and also over in impDevirtualizeMethod -- currently we always assume class context.
  3. Similar changes are needed in crossgen2 (haven't looked into this yet)
  4. Once that is fixed, the jit can devirtualize the GetEnumerator call, but this happens "too late" to devirtualize the subsequent MoveNext and get_Current calls through the enumerator. The fix here is likely to mark these GetEnumerator methods as intrinsics and extend getSpecialIntrinsicExactReturnType to propagate the right type downstream during importation.
  5. The GetEnumerator call is too big to inline by default. Likely we can just mark this with AggressiveInlining. But pragmatically we might want to boost inlining for methods that return a type (especially a sealed type) that is more derived that the declared return type. Doing so likely requires resolving more tokens during the pre-inline scan, but perhaps we're already doing this for newobj and it might not be too costly.
  6. If we inline (and haven't fixed (4) yet) then we can late devirtualize the MoveNext and Dispose as we see the improved type flowing out of the inlined method -- but not the call to get_Current -- not sure why just yet.
  7. The enumerator is a class, not a struct; if we made it a struct we might have a shot at removing the boxing (rather than proving we can stack allocate a class). But this box would have multiple uses, so we'd also need to tackle JIT: optimizations for multi-use boxes #9118.
  8. It remains to be seen if we actually do inline all those methods and undo the boxing whether we can then enable struct promotion and remove the overhead of iterating via an enumerator.

I will point out that the "equivalent" direct code (with loop non-empty to avoid it being optimized away entirely)

    public static int M_enum() 
    {
        int[] a = new int[1000];
        int result = 0;
        foreach (var v in a)
        {
           result ^= v;
        }
        return result;
    } 

is marginally less efficient than the for loop version:

    public static int M() 
    {
        int[] a = new int[1000];
        int result = 0;
        for (int i = 0; i < a.Length; i++)
        {
           result ^= a[i]; 
        }
        return result;
    }

In the more general case where we don't know the exact collection class, we might hope PGO would get us to an equivalent place. But here there are additional challenges.

  1. The enumerator GDV doesn't allow us to know the enumerator type in the subsequent loop. We ought to notice a correlation between the type of the enumerator returned and the class type seen in the loop interface calls (or the fact that the latter are invariants) and decide to clone the loop based on a type test. That would remove the GDV tests from within the loop.
  2. Ideally then we see that the loop cloning check is partially redundant based on the outcome of the GetEnumerator GDV test and simplify that path further. So, in the "good" case we do one type test to see what kind of collection we have, and then branch to a specialized loop that knows exactly what kind of enumerator we have.

This is perhaps the simplest case of de-abstraction. Other collection types have more complex enumerators.

Author: AndyAyersMS
Assignees: -
Labels:

area-CodeGen-coreclr, untriaged

Milestone: -

AndyAyersMS added a commit to AndyAyersMS/runtime that referenced this issue Dec 7, 2021
Update jit and runtime to devirtualize interface calls on arrays.
This resolves the first two issues noted in dotnet#62457.
@JulieLeeMSFT JulieLeeMSFT added this to the Future milestone Dec 7, 2021
@JulieLeeMSFT JulieLeeMSFT removed the untriaged New issue has not been triaged by the area owner label Dec 7, 2021
@JulieLeeMSFT JulieLeeMSFT modified the milestones: Future, 7.0.0 Feb 2, 2022
@AndyAyersMS
Copy link
Member Author

Key loader log diff in cwt case #62497 (comment).

In the BAD case the stub desc is created early during devirt.

BAD

TID 1185c: GENERICS: Created instantiated method desc System.SZArrayHelper.GetEnumerator[System.Collections.Generic.KeyValuePair`2[System.__Canon,System.__Canon]]()
TID 1185c: GENERICS: Created instantiating-stub method desc System.SZArrayHelper.GetEnumerator[System.Collections.Generic.KeyValuePair`2[System.__Canon,System.__Canon]]() with dictionary size 24
..
TID 1185c: In PreStubWorker for System.SZArrayHelper::System.SZArrayHelper.GetEnumerator[System.Collections.Generic.KeyValuePair`2[System.__Canon,System.__Canon]]()


GOOD

TID 2344: GENERICS: Created instantiated method desc System.SZArrayHelper.GetEnumerator[System.Collections.Generic.KeyValuePair`2[System.__Canon,System.__Canon]]()
TID 2344: GENERICS: Created instantiating-stub method desc System.SZArrayHelper.GetEnumerator[System.Collections.Generic.KeyValuePair`2[System.Object,System.Object]]() with dictionary size 24
TID 2344: In PreStubWorker for System.SZArrayHelper::System.SZArrayHelper.GetEnumerator[System.Collections.Generic.KeyValuePair`2[System.Object,System.Object]]()

AndyAyersMS added a commit to AndyAyersMS/runtime that referenced this issue May 10, 2022
Update jit and runtime to devirtualize interface calls on arrays
over non-shared element types.

Shared types are not handled. The resulting enumerator calls
would not inline as they are on a different class.

This addresses the first two issues noted in dotnet#62457.
@AndyAyersMS
Copy link
Member Author

I have a version that backs off from devirtualizing arrays over shared types. Still trying to decide if there's sufficient benefit in this overall.

@jkotas is it feasible to change SZGenericArrayEnumerator<T> from a class to a struct, or would that cause problems? Doing that along with array interface devirt would allow promotion and (in conjunction with PGO or direct type deduction) get us closer to de-abstracting IEnumerable<T> over some types of arrays.

@jkotas
Copy link
Member

jkotas commented May 11, 2022

@jkotas is it feasible to change SZGenericArrayEnumerator from a class to a struct, or would that cause problems?

It would de-optimize it. The interface method calls would have to go through unboxing instantiating stubs.

@AndyAyersMS
Copy link
Member Author

@jkotas is it feasible to change SZGenericArrayEnumerator from a class to a struct, or would that cause problems?

It would de-optimize it. The interface method calls would have to go through unboxing instantiating stubs

Ok, thanks.

Keeping it as a class, we'd need to rely on object stack allocation to drive promotion. But we can't yet rely on object stack allocation until we can devirtualize the subsequent calls so we can know the enumerator can't escape.

That is, with the current information flow in the JIT, we'd need to know what type of enumerator was going to be returned without having to inline SZArrayHelper.GetEnumerator (similar to what we do now for other generic factory methods like EqualityComparer<T>.Default) so that we could propagate that type down to the subsequent enumerator call sites; then ensure those calls were inlined; at that point we'd see that the enumerator could not escape.

And that would only work well when we directly devirtualize; if we also need guarded devirtualization to guess at the type, then the flow from the known and unknown enumerator types merges before the calls on the enumerator and we can't conditionally disambiguate this in the GDVs that those calls inspire, so it will again look like even if we know the type that the enumerator object might escape.

Seems like too many missing pieces here and unknowable overall benefit. Going to shelve this for now.

@AndyAyersMS AndyAyersMS modified the milestones: 7.0.0, Future May 11, 2022
@AndyAyersMS
Copy link
Member Author

AndyAyersMS added a commit to AndyAyersMS/runtime that referenced this issue Sep 17, 2024
Update JIT and runtime to devirtualize interface calls on arrays
over non-shared element types.

Shared types are not (yet) handled.

Add intrinsic and inlining attributes to key methods in the BCL.
This allows the JIT to devirtualize and inline enumerator creation
and devirtualize and inline all methods that access the enumerator.

And this in turn allows the enumerator to be stack allocated.

However, the enumerator fields are not (yet) physically promoted,
because of an optimization in the BCL to return a static empty
array enumerator. So the object being accessed later is ambiguous.

Progress towards dotnet#62457.
@AndyAyersMS
Copy link
Member Author

AndyAyersMS commented Sep 17, 2024

I have revived this. Now that we have stack allocation and physical promotion, some of the issues noted above are more tractable.

Latest changes here: https://github.com/AndyAyersMS/runtime/tree/ArrayInterfaceDevirtV2

This includes fixes for points (4) and (5) above: we special-case knowledge of the return type early on, so the subsequent calls devirtualize, and we mark them as aggressive inline candidates.

With this the enumerator object is stack allocated(*) and all methods that touch its fields are devirtualized and inlined. So things are set up to enable physical promotion, but it doesn't happen yet because of the empty array optimization done here:

return length == 0 ? SZGenericArrayEnumerator<T>.Empty : new SZGenericArrayEnumerator<T>(@this, length);

This pattern causes the enumerator methods to see two reaching defs, one from the stack allocated case and another from the static.

If the array length is known we might be able to propagate it to the conditional above and eliminate one case or the other, but if the length is not known (which will commonly be the case) this won't help.

Broadly speaking if we can stack allocate then we'd be better off if the empty case wasn't handled like this, but it's not clear yet how to anticipate and/or undo that optimization...


(*) unfortunately not in the example above, there the allocation site is in a loop and we don't support stack allocation when the site is in a loop yet. So we're looking at the simpler case

public static void Main()
{
    IList<int> list = new int[10];

    foreach (var _ in list)
    {
        // nothing
    }
}

and the with the changes above the JIT produces

; Assembly listing for method X:Main() (FullOpts)
; Emitting BLENDED_CODE for X64 with AVX - Windows
; FullOpts code
; optimized code
; rbp based frame
; fully interruptible
; No PGO data
; 3 inlinees with PGO data; 4 single block inlinees; 2 inlinees without PGO data
; Final local variable assignments
;
;  V00 loc0         [V00,T02] (  9, 47   )   byref  ->  rcx         class-hnd exact single-def <System.SZGenericArrayEnumerator`1[int]>
;  V01 OutArgs      [V01    ] (  1,  1   )  struct (32) [rsp+0x00]  do-not-enreg[XS] addr-exposed "OutgoingArgSpace"
;  V02 tmp1         [V02,T05] (  2,  2   )   byref  ->  rcx         class-hnd "Inline return value spill temp" <System.Collections.Generic.IEnumerator`1[int]>
;* V03 tmp2         [V03    ] (  0,  0   )     ref  ->  zero-ref    class-hnd exact "Inlining Arg" <int[]>
;  V04 tmp3         [V04,T06] (  2,  2   )     ref  ->  rax         class-hnd exact single-def "Inline stloc first use temp" <int[]>
;* V05 tmp4         [V05,T07] (  0,  0   )     int  ->  zero-ref    single-def "Inline stloc first use temp"
;* V06 tmp5         [V06    ] (  0,  0   )    long  ->  zero-ref    class-hnd exact "NewObj constructor temp" <System.SZGenericArrayEnumerator`1[int]>
;* V07 tmp6         [V07,T08] (  0,  0   )     int  ->  zero-ref    single-def
;* V08 tmp7         [V08    ] (  0,  0   )   ubyte  ->  zero-ref    "Inlining Arg"
;* V09 tmp8         [V09    ] (  0,  0   )   ubyte  ->  zero-ref    "Inlining Arg"
;* V10 tmp9         [V10    ] (  0,  0   )   ubyte  ->  zero-ref    "Inline return value spill temp"
;  V11 tmp10        [V11,T04] (  3, 20   )     int  ->  rax         "Inline stloc first use temp"
;  V12 tmp11        [V12    ] (  5,  5   )  struct (24) [rbp-0x18]  do-not-enreg[XSF] must-init addr-exposed "stack allocated ref class temp" <System.SZGenericArrayEnumerator`1[int]>
;  V13 tmp12        [V13,T00] (  2, 64   )     ref  ->  rdx         "arr expr"
;  V14 tmp13        [V14,T01] (  2, 64   )     int  ->  rax         "index expr"
;  V15 PSPSym       [V15,T09] (  1,  1   )    long  ->  [rbp-0x20]  do-not-enreg[V] "PSPSym"
;  V16 cse0         [V16,T03] (  4, 24   )     int  ->  rax         "CSE #02: aggressive"
;
; Lcl frame size = 64

G_M9330_IG01:  ;; offset=0x0000
       push     rbp
       sub      rsp, 64
       lea      rbp, [rsp+0x40]
       vxorps   xmm4, xmm4, xmm4
       vmovdqu  xmmword ptr [rbp-0x18], xmm4
       xor      eax, eax
       mov      qword ptr [rbp-0x08], rax
       mov      qword ptr [rbp-0x20], rsp
                                                ;; size=29 bbWeight=1 PerfScore 6.33
G_M9330_IG02:  ;; offset=0x001D
       mov      rcx, 0x7FFF9098E958      ; int[]
       mov      edx, 10
       call     CORINFO_HELP_NEWARR_1_VC
       mov      rcx, 0x7FFF91289818      ; System.SZGenericArrayEnumerator`1[int]
       mov      qword ptr [rbp-0x18], rcx
       mov      dword ptr [rbp-0x10], -1
       mov      dword ptr [rbp-0x0C], 10
       mov      gword ptr [rbp-0x08], rax
       lea      rcx, [rbp-0x18]
       align    [0 bytes for IG03]
                                                ;; size=56 bbWeight=1 PerfScore 6.25
G_M9330_IG03:  ;; offset=0x0055
       mov      eax, dword ptr [rcx+0x08]
       inc      eax
       cmp      eax, dword ptr [rcx+0x0C]
       jb       SHORT G_M9330_IG05
                                                ;; size=10 bbWeight=8 PerfScore 50.00
G_M9330_IG04:  ;; offset=0x005F
       mov      eax, dword ptr [rcx+0x0C]
       mov      dword ptr [rcx+0x08], eax
       jmp      SHORT G_M9330_IG09
                                                ;; size=8 bbWeight=1 PerfScore 5.00
G_M9330_IG05:  ;; offset=0x0067
       mov      dword ptr [rcx+0x08], eax
       mov      eax, dword ptr [rcx+0x08]
       cmp      eax, dword ptr [rcx+0x0C]
       jae      SHORT G_M9330_IG08
                                                ;; size=11 bbWeight=4 PerfScore 28.00
G_M9330_IG06:  ;; offset=0x0072
       mov      rdx, gword ptr [rcx+0x10]
       cmp      eax, dword ptr [rdx+0x08]
       jae      SHORT G_M9330_IG07
       jmp      SHORT G_M9330_IG03
                                                ;; size=11 bbWeight=16 PerfScore 128.00
G_M9330_IG07:  ;; offset=0x007D
       call     CORINFO_HELP_RNGCHKFAIL
       int3
                                                ;; size=6 bbWeight=0 PerfScore 0.00
G_M9330_IG08:  ;; offset=0x0083
       mov      ecx, eax
       call     [System.ThrowHelper:ThrowInvalidOperationException_EnumCurrent(int)]
       int3
                                                ;; size=9 bbWeight=0 PerfScore 0.00
G_M9330_IG09:  ;; offset=0x008C
       add      rsp, 64
       pop      rbp
       ret
                                                ;; size=6 bbWeight=1 PerfScore 1.75
G_M9330_IG10:  ;; offset=0x0092
       push     rbp
       sub      rsp, 48
       mov      rbp, qword ptr [rcx+0x20]
       mov      qword ptr [rsp+0x20], rbp
       lea      rbp, [rbp+0x40]
                                                ;; size=18 bbWeight=0 PerfScore 0.00
G_M9330_IG11:  ;; offset=0x00A4
       add      rsp, 48
       pop      rbp
       ret
                                                ;; size=6 bbWeight=0 PerfScore 0.00

Note that the empty array case is optimized away, but that happens too late for physical promotion, so all the enumerator accesses are stack accesses.

There's also an empty fault handler... need to look into why that doesn't get optimized away.

@AndyAyersMS
Copy link
Member Author

Getting this case to work is just the first step -- after that we need to look at the cases where we have only IEnumerable<T> and profiling tells is it is likely an array. So there is another level of conditionalization involved.

@AndyAyersMS
Copy link
Member Author

Looks like the finally/fault has some residual IR which ends up getting optimized away by RBO. So perhaps it makes sense to rerun the EH opts after global opts?

@AndyAyersMS
Copy link
Member Author

I worked up some (arguably hacky) changes to flag the ALLOCOBJ in SZArray's GetEnumerator as part of an "empty static pattern" and used that to prune away the control flow when the JIT is able to stack allocate the enumerator object.

However even with this the JIT is unable to promote the enumerator fields, because the enumerator address &V12 gets copied to a local V00 before the loop, and then V00 is used in a loop. Local morph is not able to resolve those in-loop uses to &V12 (since it fears V00 will be redefined) and concludes that V12 is thus exposed.

So we either need to keep track of where definitions can happen (so local morph can know not to kill certain assertions when entering loops) or try to do a more aggressive address propagation beforehand.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI
Projects
Development

No branches or pull requests

3 participants