Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

JIT: Smarter ordering of late args based on register uses #76017

Closed
wants to merge 18 commits into from

Conversation

jakobbotsch
Copy link
Member

No description provided.

@dotnet-issue-labeler dotnet-issue-labeler bot added the area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI label Sep 22, 2022
@ghost ghost assigned jakobbotsch Sep 22, 2022
@ghost
Copy link

ghost commented Sep 22, 2022

Tagging subscribers to this area: @JulieLeeMSFT, @jakobbotsch
See info in area-owners.md if you want to be subscribed.

Issue Details

null

Author: jakobbotsch
Assignees: -
Labels:

area-CodeGen-coreclr

Milestone: -

@kunalspathak
Copy link
Member

So essentially this rearranges the args depending on how many of them will get overwritten/usage?

@jakobbotsch
Copy link
Member Author

So essentially this rearranges the args depending on how many of them will get overwritten/usage?

Right, maybe an example is easiest:

 Sorting the arguments:
-  [000027] -> #0 (cost 6)
-  [000029] -> #1 (last arg)
-Deferred argument ('rcx'):
-     (  6,  5) [000027] ---XG------                         ▌  IND       ref   
-     (  4,  3) [000033] -------N---                         └──▌  ADD       byref 
-     (  3,  2) [000026] -----------                            ├──▌  LCL_VAR   ref    V02 loc0         
-     (  1,  1) [000032] -----------                            └──▌  CNS_INT   long   16 Fseq[_delegate]
-Moved to late list
+  [000029] -> #0 (cost 6)
+  [000027] -> #1 (last arg)
 Deferred argument ('rdx'):
      (  6,  5) [000029] ---XG------                         ▌  IND       ref   
      (  4,  3) [000035] -------N---                         └──▌  ADD       byref 
      (  3,  2) [000028] -----------                            ├──▌  LCL_VAR   ref    V00 this         
      (  1,  1) [000034] -----------                            └──▌  CNS_INT   long   16 Fseq[_delegate]
 Moved to late list
+Deferred argument ('rcx'):
+     (  6,  5) [000027] ---XG------                         ▌  IND       ref   
+     (  4,  3) [000033] -------N---                         └──▌  ADD       byref 
+     (  3,  2) [000026] -----------                            ├──▌  LCL_VAR   ref    V02 loc0         
+     (  1,  1) [000032] -----------                            └──▌  CNS_INT   long   16 Fseq[_delegate]
+Moved to late list
 
-Register placement order:    rcx rdx 
+Register placement order:    rdx rcx 

As you can see, the argument placed in rdx uses this. this is likely to be in rcx already, so if we place rcx first we need to spill this somewhere else.

Overall an improvement, but there are many regressions too -- this stuff is quite sensitive. I also only currently do the reordering for trees with equal cost (so e.g. if [000029] was LCL_VAR ref V00 this instead this PR would not change anything and the conflict would remain). Trying to find the right heuristics to use.

Another thing to consider is that if there is a previous call then the parameters are likely to have been spilled into other registers anyway, so the sort is not helpful (but probably not harmful either). However this depends on the exact block linearization that LSRA picks. Might be worth experimenting here in morph to see if we predict it in an easy way.

@jakobbotsch
Copy link
Member Author

There are also other smart things we can do, e.g. if we notice a cycle, we can evaluate one of the arguments in the cycle into a temp and place it last. That essentially does a small live-range split which should help LSRA spill into a register.

This starts communicating more information about diffs back from
superpmi and starts using it in the driver. The current information is
the base size, diff size, base instructions, diff instructions and
context size.

The driver uses the base size/diff size to pick a small number of
interesting contexts and only creates disassembly for these contexts.
jit-analyze is then also invoked on only these contexts, but the
contexts are picked so that they overlap with what jit-analyze would
display. Additionally, we also pick the top 20 smallest contexts (in
terms of context size) -- I frequently use the smallest contexts with
diffs to investigate my change.

The new behavior is only enabled when no custom metrics are specified.
If custom metrics are specified we fall back to producing all
disassembly files and leave it up to jit-analyze to extract and analyze
the metrics specified.

Also remove --retainOnlyTopFiles. This is no longer necessary since we
avoid creating these disassembly files in the first place. Another
benefit is that all contexts mentioned by jit-analyze output will now be
part of the artifacts.

The net result is that we can now get SPMI diffs for changes with even
hundreds of thousands of diffs in the same time as it takes to get diffs
for a small change.

Fix dotnet#76178
Since there can be multiple conflicting pairs this strategy works
better.
@jakobbotsch
Copy link
Member Author

jakobbotsch commented Sep 28, 2022

Spot-checking some regressions.

libraries.pmi regression
10 (47.62% of base) : 122054.dasm - Microsoft.FSharp.Control.ObservableModule:succeed@14[System.__Canon](System.IObserver`1[System.__Canon],Microsoft.FSharp.Core.FSharpOption`1[System.__Canon])

The diff looks like:

 ; No matching PGO data
 ; Final local variable assignments
 ;
-;  V00 TypeCtx      [V00,T00] (  5,  4   )    long  ->  r11         single-def
-;  V01 arg0         [V01,T02] (  3,  2.50)     ref  ->  rcx         class-hnd single-def
+;  V00 TypeCtx      [V00,T00] (  5,  4   )    long  ->  [rsp+08H]   single-def
+;  V01 arg0         [V01,T02] (  3,  2.50)     ref  ->  rdx         class-hnd single-def
 ;  V02 arg1         [V02,T01] (  4,  3.50)     ref  ->   r8         class-hnd single-def
 ;* V03 loc0         [V03    ] (  0,  0   )     ref  ->  zero-ref    class-hnd single-def
 ;# V04 OutArgs      [V04    ] (  1,  1   )  lclBlk ( 0) [rsp+00H]   "OutgoingArgSpace"
@@ -16,30 +16,31 @@
 ; Lcl frame size = 0
 
 G_M32974_IG01:        ; gcVars=0000000000000000 {}, gcrefRegs=00000000 {}, byrefRegs=00000000 {}, gcvars, byref, nogc <-- Prolog IG
-       mov      r11, rcx
-       mov      rcx, rdx
-       ; gcrRegs +[rcx]
-						;; size=6 bbWeight=1    PerfScore 0.50
-G_M32974_IG02:        ; gcrefRegs=00000102 {rcx r8}, byrefRegs=00000000 {}, byref, isz
-       ; gcrRegs +[r8]
+						;; size=0 bbWeight=1    PerfScore 0.00
+G_M32974_IG02:        ; gcrefRegs=00000104 {rdx r8}, byrefRegs=00000000 {}, byref, isz
+       ; gcrRegs +[rdx r8]
        test     r8, r8
        jne      SHORT G_M32974_IG04
 						;; size=5 bbWeight=1    PerfScore 1.25
 G_M32974_IG03:        ; gcrefRegs=00000000 {}, byrefRegs=00000000 {}, byref, epilog, nogc
-       ; gcrRegs -[rcx r8]
+       ; gcrRegs -[rdx r8]
        ret      
 						;; size=1 bbWeight=0.50 PerfScore 0.50
-G_M32974_IG04:        ; gcVars=0000000000000000 {}, gcrefRegs=00000102 {rcx r8}, byrefRegs=00000000 {}, gcvars, byref
-       ; gcrRegs +[rcx r8]
+G_M32974_IG04:        ; gcVars=0000000000000000 {}, gcrefRegs=00000104 {rdx r8}, byrefRegs=00000000 {}, gcvars, byref
+       ; gcrRegs +[rdx r8]
+       mov      qword ptr [rsp+08H], rcx
+       mov      r11, rcx
+       mov      rcx, rdx
+       ; gcrRegs +[rcx]
        mov      rdx, gword ptr [r8+08H]
-       ; gcrRegs +[rdx]
        cmp      dword ptr [rcx], ecx
-						;; size=6 bbWeight=0.50 PerfScore 2.50
+       mov      rax, qword ptr [rsp+08H]
+						;; size=22 bbWeight=0.50 PerfScore 3.75
 G_M32974_IG05:        ; , epilog, nogc, extend
-       tail.jmp [r11]
+       tail.jmp [rax]
 						;; size=3 bbWeight=0.50 PerfScore 1.00
@@ -5,17 +5,23 @@ Sorting the arguments:
     [000010] [000008] [000019]
   [000010] -> #0 (last arg)
     [000010] [000008] [000019]
+  [000010] -> #1 (clobbers rdx used by [000008])
+    [000008] [000010] [000019]
+  [000008] -> #2 (clobbers rcx used by [000019])
+    [000010] [000019] [000008]
+  [000010] -> #2 (clobbers rdx used by [000008])
+    [000019] [000008] [000010]
+Deferred argument ('r11'):
+               [000019] -----+-----                         ▌  LCL_VAR   long   V00 TypeCtx      
+Moved to late list
+Deferred argument ('rcx'):
+               [000008] -----+-----                         ▌  LCL_VAR   ref    V01 arg0         
+Moved to late list
 Deferred argument ('rdx'):
                [000010] ---XG+-----                         ▌  IND       ref   
                [000021] -----+-----                         └──▌  ADD       byref 
                [000009] -----+-----                            ├──▌  LCL_VAR   ref    V02 arg1         
                [000020] -----+-----                            └──▌  CNS_INT   long   8 Fseq[value]
 Moved to late list
-Deferred argument ('rcx'):
-               [000008] -----+-----                         ▌  LCL_VAR   ref    V01 arg0         
-Moved to late list
-Deferred argument ('r11'):
-               [000019] -----+-----                         ▌  LCL_VAR   long   V00 TypeCtx      
-Moved to late list
 
-Register placement order:    rdx rcx r11 
+Register placement order:    r11 rcx rdx 

On the surface of it this register placement order is much better, we avoid two conflicts. However, for some reason it causes LSRA to not allocate TypeCtx into r11, meaning that it is kept in rcx, which gets overridden during the call. TypeCtx is used for the calli target as well, so this causes rcx to be spilled to stack.

I think the best solution would be to apply the same trick to VSD calls with dynamically computed VSD stub as we do for static VSD stubs -- simply always do the call through indirection of the indir cell.

@jakobbotsch
Copy link
Member Author

jakobbotsch commented Sep 28, 2022

libraries.pmi regression
17 (73.91% of base) : 112173.dasm - System.Xml.Xsl.Runtime.XmlNavTypeFilter:MoveToFollowing(System.Xml.XPath.XPathNavigator,System.Xml.XPath.XPathNavigator):bool:this

Diff looks like:

 ; No matching PGO data
 ; Final local variable assignments
 ;
-;  V00 this         [V00,T01] (  3,  3   )     ref  ->  rcx         this class-hnd single-def
-;  V01 arg1         [V01,T00] (  4,  4   )     ref  ->  rax         class-hnd single-def
+;  V00 this         [V00,T01] (  3,  3   )     ref  ->  [rsp+08H]   this class-hnd single-def
+;  V01 arg1         [V01,T00] (  4,  4   )     ref  ->  [rsp+10H]   class-hnd single-def
 ;  V02 arg2         [V02,T02] (  3,  3   )     ref  ->   r8         class-hnd single-def
 ;# V03 OutArgs      [V03    ] (  1,  1   )  lclBlk ( 0) [rsp+00H]   "OutgoingArgSpace"
 ;
 ; Lcl frame size = 0
 
 G_M53345_IG01:        ; gcrefRegs=00000000 {}, byrefRegs=00000000 {}, byref, nogc <-- Prolog IG
-       mov      rax, rdx
+       mov      gword ptr [rsp+08H], rcx
+       ; GC ptr vars +{V00}
+						;; size=5 bbWeight=1    PerfScore 1.00
+G_M53345_IG02:        ; gcVars=0000000000000002 {V00}, gcrefRegs=00000104 {rdx r8}, byrefRegs=00000000 {}, gcvars, byref
+       ; gcrRegs +[rdx r8]
+       mov      gword ptr [rsp+10H], rdx
+       ; GC ptr vars +{V01}
+       mov      rcx, rdx
+       ; gcrRegs +[rcx]
+       mov      rax, gword ptr [rsp+08H]
        ; gcrRegs +[rax]
-						;; size=3 bbWeight=1    PerfScore 0.25
-G_M53345_IG02:        ; gcrefRegs=00000103 {rax rcx r8}, byrefRegs=00000000 {}, byref
-       ; gcrRegs +[rcx r8]
-       mov      edx, dword ptr [rcx+08H]
-       mov      rcx, rax
+       mov      edx, dword ptr [rax+08H]
+       ; gcrRegs -[rdx]
+       mov      rax, gword ptr [rsp+10H]
        mov      rax, qword ptr [rax]
        ; gcrRegs -[rax]
        mov      rax, qword ptr [rax+80H]
-						;; size=16 bbWeight=1    PerfScore 6.25
+						;; size=31 bbWeight=1    PerfScore 9.25
 G_M53345_IG03:        ; , epilog, nogc, extend
        tail.jmp [rax+10H]System.Xml.XPath.XPathNavigator:hackishMethodName
 						;; size=4 bbWeight=1    PerfScore 2.00
 
@@ -5,17 +5,23 @@ Sorting the arguments:
     [000002] [000000] [000003]
   [000002] -> #0 (last arg)
     [000002] [000000] [000003]
+  [000002] -> #1 (clobbers rdx used by [000000])
+    [000000] [000002] [000003]
+  [000000] -> #1 (clobbers rcx used by [000002])
+    [000002] [000000] [000003]
+  [000002] -> #1 (clobbers rdx used by [000000])
+    [000000] [000002] [000003]
+Deferred argument ('rcx'):
+               [000000] -----+-----                         ▌  LCL_VAR   ref    V01 arg1         
+Moved to late list
 Deferred argument ('rdx'):
                [000002] ---XG+-----                         ▌  IND       int   
                [000007] -----+-----                         └──▌  ADD       byref 
                [000001] -----+-----                            ├──▌  LCL_VAR   ref    V00 this         
                [000006] -----+-----                            └──▌  CNS_INT   long   8 Fseq[hackishFieldName]
 Moved to late list
-Deferred argument ('rcx'):
-               [000000] -----+-----                         ▌  LCL_VAR   ref    V01 arg1         
-Moved to late list
 Deferred argument ('r8'):
                [000003] -----+-----                         ▌  LCL_VAR   ref    V02 arg2         
 Moved to late list
 
-Register placement order:    rdx rcx r8 
+Register placement order:    rcx rdx r8 

Looks like there is a cycle in the interference graph and the algorithm ends up swapping them compared to the previous order, and LSRA does not cope well with this. Seems like it may be beneficial to keep LCL_VAR in the later positions to have LSRA rehome these.

@jakobbotsch
Copy link
Member Author

Seems like it may be beneficial to keep LCL_VAR in the later positions to have LSRA rehome these.

Nope, not in general. It solves the above regression but it is overall a regression to put LCL_VAR nodes with conflicts at the end.

@jakobbotsch
Copy link
Member Author

jakobbotsch commented Sep 28, 2022

Nope, not in general. It solves the above regression but it is overall a regression to put LCL_VAR nodes with conflicts at the end.

Problem here was that I was reordering locals. If we have a cycle caused by complex arg vs local, then it does look slightly better to keep the local at the end.

@jakobbotsch
Copy link
Member Author

jakobbotsch commented Sep 28, 2022

libraries.pmi regression
10 (43.48% of base) : 78804.dasm - EnumeratorImpl:get_Current():Microsoft.CodeAnalysis.SyntaxNodeOrToken:this

Diff:

 ; Final local variable assignments
 ;
 ;  V00 this         [V00,T01] (  3,  3   )     ref  ->  rcx         this class-hnd single-def
-;  V01 RetBuf       [V01,T02] (  3,  3   )   byref  ->  rax         single-def
+;  V01 RetBuf       [V01,T02] (  3,  3   )   byref  ->  rdx         single-def
 ;# V02 OutArgs      [V02    ] (  1,  1   )  lclBlk ( 0) [rsp+00H]   "OutgoingArgSpace"
-;  V03 tmp1         [V03,T00] (  3,  6   )   byref  ->  rcx         single-def "Inlining Arg"
+;  V03 tmp1         [V03,T00] (  3,  6   )   byref  ->  [rsp+00H]   spill-single-def "Inlining Arg"
 ;
-; Lcl frame size = 0
+; Lcl frame size = 8
 
 G_M28311_IG01:        ; gcrefRegs=00000000 {}, byrefRegs=00000000 {}, byref, nogc <-- Prolog IG
-       mov      rax, rdx
-       ; byrRegs +[rax]
-						;; size=3 bbWeight=1    PerfScore 0.25
-G_M28311_IG02:        ; gcrefRegs=00000002 {rcx}, byrefRegs=00000001 {rax}, byref
+       push     rax
+						;; size=1 bbWeight=1    PerfScore 1.00
+G_M28311_IG02:        ; gcrefRegs=00000002 {rcx}, byrefRegs=00000004 {rdx}, byref
        ; gcrRegs +[rcx]
+       ; byrRegs +[rdx]
        add      rcx, 8
        ; gcrRegs -[rcx]
        ; byrRegs +[rcx]
+       mov      bword ptr [rsp], rcx
+       ; GC ptr vars +{V03}
        mov      r8d, dword ptr [rcx+0CH]
-       mov      rdx, gword ptr [rcx]
+       mov      rcx, rdx
+       mov      rdx, bword ptr [rsp]
+       mov      rdx, gword ptr [rdx]
        ; gcrRegs +[rdx]
-       mov      rcx, rax
-						;; size=14 bbWeight=1    PerfScore 4.50
+       ; byrRegs -[rdx]
+						;; size=22 bbWeight=1    PerfScore 6.50
 G_M28311_IG03:        ; , epilog, nogc, extend
+       add      rsp, 8
        tail.jmp [hackishMethodName]
        ; gcr arg pop 0
-						;; size=6 bbWeight=1    PerfScore 2.00
+						;; size=10 bbWeight=1    PerfScore 2.25
@@ -5,18 +5,20 @@ Sorting the arguments:
     [000011] [000009] [000013]
   [000009] -> #1 (last arg)
     [000011] [000009] [000013]
+  [000009] -> #2 (clobbers rdx used by [000013])
+    [000011] [000013] [000009]
 Deferred argument ('r8'):
      (  6,  5) [000011] ---XG------                         ▌  IND       int   
      (  4,  3) [000025] -------N---                         └──▌  ADD       byref 
      (  3,  2) [000010] -----------                            ├──▌  LCL_VAR   byref  V03 tmp1         
      (  1,  1) [000024] -----------                            └──▌  CNS_INT   long   12
 Moved to late list
+Deferred argument ('rcx'):
+               [000013] -----+-----                         ▌  LCL_VAR   byref  V01 RetBuf       
+Moved to late list
 Deferred argument ('rdx'):
      (  6,  4) [000009] ---XG------                         ▌  IND       ref   
      (  3,  2) [000008] -----------                         └──▌  LCL_VAR   byref  V03 tmp1         
 Moved to late list
-Deferred argument ('rcx'):
-               [000013] -----+-----                         ▌  LCL_VAR   byref  V01 RetBuf       
-Moved to late list
 
-Register placement order:    r8 rdx rcx 
+Register placement order:    r8 rcx rdx 

The new arg order looks preferable, we get rid of the conflict that was there before by placing rdx first which has the ret buffer.

Not clear to me why tmp1 is being spilled here. Looks like LSRA picks rcx for it and then notices the conflict after. But rax is free and will not result in any conflicts, so I guess some heuristic does not work out well here. Full jit dump attached if you want to take a look @kunalspathak.

out.txt

@jakobbotsch
Copy link
Member Author

jakobbotsch commented Sep 28, 2022

aspnet.run regression
15 (26.32% of base) : 16805.dasm - System.Threading.Tasks.Task:NotifyParentIfPotentiallyAttachedTask():this

Diff:

 ; Assembly listing for method System.Threading.Tasks.Task:NotifyParentIfPotentiallyAttachedTask():this
 ; Emitting BLENDED_CODE for X64 CPU with AVX - Windows
 ; Tier-1 compilation
 ; optimized code
 ; optimized using profile data
 ; rsp based frame
 ; fully interruptible
 ; with Dynamic PGO: edge weights are valid, and fgCalledCount is 492
 ; 0 inlinees with PGO data; 3 single block inlinees; 0 inlinees without PGO data
 ; Final local variable assignments
 ;
-;  V00 this         [V00,T00] (  5,  3   )     ref  ->  [rsp+08H]   this class-hnd single-def
-;  V01 loc0         [V01,T03] (  6,  2   )     ref  ->  rdx         class-hnd single-def
+;  V00 this         [V00,T00] (  5,  3   )     ref  ->  rcx         this class-hnd single-def
+;  V01 loc0         [V01,T03] (  6,  2   )     ref  ->  [rsp+00H]   class-hnd single-def
 ;# V02 OutArgs      [V02    ] (  1,  1   )  lclBlk ( 0) [rsp+00H]   "OutgoingArgSpace"
 ;  V03 tmp1         [V03,T01] (  2,  4   )     ref  ->  rdx         class-hnd single-def "dup spill"
 ;  V04 tmp2         [V04,T02] (  3,  3   )     ref  ->  rdx         single-def
 ;  V05 tmp3         [V05,T04] (  4,  2   )     ref  ->  rdx        
 ;* V06 tmp4         [V06    ] (  0,  0   )     int  ->  zero-ref    "Inlining Arg"
 ;
-; Lcl frame size = 0
+; Lcl frame size = 8
 
 G_M22329_IG01:        ; gcrefRegs=00000000 {}, byrefRegs=00000000 {}, byref, nogc <-- Prolog IG
-						;; size=0 bbWeight=1    PerfScore 0.00
+       push     rax
+						;; size=1 bbWeight=1    PerfScore 1.00
 G_M22329_IG02:        ; gcrefRegs=00000002 {rcx}, byrefRegs=00000000 {}, byref, isz
        ; gcrRegs +[rcx]
        mov      rdx, gword ptr [rcx+28H]
        ; gcrRegs +[rdx]
        test     rdx, rdx
        je       SHORT G_M22329_IG04
        mov      rdx, gword ptr [rdx+30H]
        test     rdx, rdx
        jne      SHORT G_M22329_IG05
 						;; size=18 bbWeight=1    PerfScore 6.50
 G_M22329_IG03:        ; gcrefRegs=00000000 {}, byrefRegs=00000000 {}, byref, epilog, nogc
        ; gcrRegs -[rcx rdx]
+       add      rsp, 8
        ret      
-						;; size=1 bbWeight=1    PerfScore 1.00
+						;; size=5 bbWeight=1    PerfScore 1.25
 G_M22329_IG04:        ; gcVars=0000000000000000 {}, gcrefRegs=00000002 {rcx}, byrefRegs=00000000 {}, gcvars, byref, isz
        ; gcrRegs +[rcx]
        xor      rdx, rdx
        ; gcrRegs +[rdx]
        test     rdx, rdx
        je       SHORT G_M22329_IG03
 						;; size=7 bbWeight=0    PerfScore 0.00
 G_M22329_IG05:        ; gcrefRegs=00000006 {rcx rdx}, byrefRegs=00000000 {}, byref, isz
+       mov      gword ptr [rsp], rdx
+       ; GC ptr vars +{V01}
        test     byte  ptr [rdx+34H], 8
        jne      SHORT G_M22329_IG03
-       mov      gword ptr [rsp+08H], rcx
-       ; GC ptr vars +{V00}
+       mov      rdx, gword ptr [rsp]
        test     byte  ptr [rcx+34H], 4
+       mov      gword ptr [rsp], rdx
        je       SHORT G_M22329_IG03
-       mov      rcx, rdx
-       mov      rdx, gword ptr [rsp+08H]
-						;; size=25 bbWeight=0    PerfScore 0.00
+       mov      rdx, rcx
+       mov      rcx, gword ptr [rsp]
+						;; size=31 bbWeight=0    PerfScore 0.00
 G_M22329_IG06:        ; , epilog, nogc, extend
+       add      rsp, 8
        tail.jmp [hackishMethodName]
        ; gcr arg pop 0
-						;; size=6 bbWeight=0    PerfScore 0.00
+						;; size=10 bbWeight=0    PerfScore 0.00
@@ -3,11 +3,13 @@ Sorting the arguments:
     [000043] [000044]
   [000043] -> #0 (non-struct local)
     [000043] [000044]
-Deferred argument ('rcx'):
-               [000043] -----+-----                         ▌  LCL_VAR   ref    V01 loc0         
-Moved to late list
+  [000043] -> #1 (clobbers rcx used by [000044])
+    [000044] [000043]
 Deferred argument ('rdx'):
                [000044] -----+-----                         ▌  LCL_VAR   ref    V00 this         
 Moved to late list
+Deferred argument ('rcx'):
+               [000043] -----+-----                         ▌  LCL_VAR   ref    V01 loc0         
+Moved to late list
 
-Register placement order:    rcx rdx 
+Register placement order:    rdx rcx 

Seems like the same as the previous case -- loc0 is allocated to rdx which is an unfortunate choice because it ends up conflicting with arg placement. rax is also free here, so not clear why that isn't preferred.

@kunalspathak
Copy link
Member

Not clear to me why tmp1 is being spilled here. Looks like LSRA picks rcx for it and then notices the conflict after.

Correct. We allocate rcx for refpositions related to [000011], but then it sees there are FixedRegister (rcx) usage at [000013]] for PUT_ARG because of which it has to spill rcx`.

But rax is free and will not result in any conflicts, so I guess some heuristic does not work out well here.

rcx is picked because of ABI needs and not because of any heuristics.

However this depends on the exact block linearization that LSRA picks. Might be worth experimenting here in morph to see if we predict it in an easy way.

Morph would be too early to predict the block order that LSRA would pick. If you just renumber the blocks right before LSRA, you would see 1000s of improvements/regressions. I wouldn't recommend relying on "block ordering" factor at least to predict register allocation decision.

@jakobbotsch
Copy link
Member Author

rcx is picked because of ABI needs and not because of any heuristics.

Can you elaborate? I don't see any ABI need for putting tmp1 in rcx.

@kunalspathak
Copy link
Member

Can you elaborate? I don't see any ABI need for putting tmp1 in rcx.

I meant to say rcx being picked for Put_Arg, not V03.

lowering arg : N004 (  4,  4) [000011] ---XG------                         ▌  IND       int    <l:$280, c:$281>
new node is :                [000028] ---XG------                         ▌  PUTARG_REG int    REG r8

lowering arg : N005 (  1,  1) [000013] -----------                         ▌  LCL_VAR   byref  V01 RetBuf       u:1 (last use) $c0
new node is :                [000029] -----------                         ▌  PUTARG_REG byref  REG rcx

lowering arg : N007 (  3,  2) [000009] n---GO-----                         ▌  IND       ref    <l:$241, c:$242>
new node is :                [000030] ----GO-----                         ▌  PUTARG_REG ref    REG rdx

For V03, we do not do forward looking to great extent and readjust our choices (and hence it is linear allocation), but perhaps, we should be able to filter out registers from getting assigned if they are soon will be needed because of FixedReg requirements. So in this case, as you said, we might chose rcx but then see it is needed at FixedReg refposition and instead avoid picking rcx and pick a different one that doesn't have FixedReg requirement. I will add it to my list of things.

@jakobbotsch
Copy link
Member Author

For V03, we do not do forward looking to great extent and readjust our choices (and hence it is linear allocation), but perhaps, we should be able to filter out registers from getting assigned if they are soon will be needed because of FixedReg requirements. So in this case, as you said, we might chose rcx but then see it is needed at FixedReg refposition and instead avoid picking rcx and pick a different one that doesn't have FixedReg requirement. I will add it to my list of things.

I understand, but why allocate it to rcx? I would expect rax to be picked before it. Is it preferenced that way due to the assignment

STMT00003 ( 0x000[E-] ... ??? )
N005 (  3,  3) [000015] -A--G---R--ASG       byref  $VN.Void
N004 (  1,  1) [000014] D------N---                         ├──▌  LCL_VAR   byref  V03 tmp1         d:1 $VN.Void
N003 (  3,  3) [000021] -----------                         └──▌  ADD       byref  $180
N001 (  1,  1) [000019] -----------                            ├──▌  LCL_VAR   ref    V00 this         u:1 (last use) $80
N002 (  1,  1) [000020] -----------                            └──▌  CNS_INT   long   8 Fseq[hackishFieldName] $140

?

@kunalspathak
Copy link
Member

Is it preferenced that way due to the assignment

Yes. It seems, out of multiple choices, we end up picking rcx based upon "COVERS" heuristic (refposition# 6 and #7). Now, I will have to debug and see what other choices we had, what the state was for other intervals and why it thought rcx covers the entire lifetime although, the last use refposition #17 and rcx definitely doesn't cover upto that point because of presence of fixed registers in between.

 9.#3  V0   Use *  Keep     rcx  │    │V0 a│V1 a│    │    │    │    │    │    │
10.#4  I3   Def    COVRS(A) rcx  │    │I3 a│V1 a│    │    │    │    │    │    │
11.#5  I3   Use *  Keep     rcx  │    │I3 a│V1 a│    │    │    │    │    │    │
12.#6  V3   Def    COVRS(A) rcx  │    │V3 a│V1 a│    │    │    │    │    │    │
19.#7  V3   Use    Keep     rcx  │    │V3 a│V1 a│    │    │    │    │    │    │

@kunalspathak
Copy link
Member

rcx is chosen because the interval V03 is tied to V00 because of use-def chain and we try to prefer the same register floating around without having to move the values between different registers. V00 was in rcx and we prefer that for V03 as well.

Regarding why rcx was chosen by cover set, because that's the only register in the preferred set of that interval, so we 100% assign that register.

why it thought rcx covers the entire lifetime

Well, here is the answer:

runtime/src/coreclr/jit/lsra.h

Lines 2435 to 2438 in 65bde8c

// It would seem to make sense to only return 'nextRefPosition' if it is a lastUse,
// and otherwise return `lastRefPosition', but that tends to excessively lengthen
// the range for heuristic purposes.
// TODO-CQ: Look into how this might be improved .

@jakobbotsch
Copy link
Member Author

rcx is chosen because the interval V03 is tied to V00 because of use-def chain and we try to prefer the same register floating around without having to move the values between different registers. V00 was in rcx and we prefer that for V03 as well.

Thanks, that makes sense. Note that it's not a straight copy, it is rcx = rcx + 8, but I guess we still prefer it (lea rax, [rcx+8] would have the same size).

we might chose rcx but then see it is needed at FixedReg refposition and instead avoid picking rcx and pick a different one that doesn't have FixedReg requirement

I actually noticed that we have this logic for parameters already:

else if (assignedRegister != REG_NA)
{
// Handle the case where this is a preassigned register (i.e. parameter).
// We don't want to actually use the preassigned register if it's not
// going to cover the lifetime - but we had to preallocate it to ensure
// that it remained live.
// TODO-CQ: At some point we may want to refine the analysis here, in case
// it might be beneficial to keep it in this reg for PART of the lifetime
if (currentInterval->isLocalVar)
{
regMaskTP preferences = currentInterval->registerPreferences;
bool keepAssignment = true;
bool matchesPreferences = (preferences & genRegMask(assignedRegister)) != RBM_NONE;
// Will the assigned register cover the lifetime? If not, does it at least
// meet the preferences for the next RefPosition?
LsraLocation nextPhysRegLocation = nextFixedRef[assignedRegister];
if (nextPhysRegLocation <= currentInterval->lastRefPosition->nodeLocation)
{
// Check to see if the existing assignment matches the preferences (e.g. callee save registers)
// and ensure that the next use of this localVar does not occur after the nextPhysRegRefPos
// There must be a next RefPosition, because we know that the Interval extends beyond the
// nextPhysRegRefPos.
assert(nextRefPosition != nullptr);
if (!matchesPreferences || nextPhysRegLocation < nextRefPosition->nodeLocation)
{
keepAssignment = false;
}
else if ((nextRefPosition->registerAssignment != assignedRegBit) &&
(nextPhysRegLocation <= nextRefPosition->getRefEndLocation()))
{
keepAssignment = false;
}
}
else if (refType == RefTypeParamDef && !matchesPreferences)
{
// Don't use the register, even if available, if it doesn't match the preferences.
// Note that this case is only for ParamDefs, for which we haven't yet taken preferences
// into account (we've just automatically got the initial location). In other cases,
// we would already have put it in a preferenced register, if it was available.
// TODO-CQ: Consider expanding this to check availability - that would duplicate
// code here, but otherwise we may wind up in this register anyway.
keepAssignment = false;
}
if (keepAssignment == false)
{
RegRecord* physRegRecord = getRegisterRecord(currentInterval->physReg);
currentRefPosition->registerAssignment = allRegs(currentInterval->registerType);
currentRefPosition->isFixedRegRef = false;
unassignPhysRegNoSpill(physRegRecord);
// If the preferences are currently set to just this register, reset them to allRegs
// of the appropriate type (just as we just reset the registerAssignment for this
// RefPosition.
// Otherwise, simply remove this register from the preferences, if it's there.
if (currentInterval->registerPreferences == assignedRegBit)
{
currentInterval->registerPreferences = currentRefPosition->registerAssignment;
}
else
{
currentInterval->registerPreferences &= ~assignedRegBit;
}
assignedRegister = REG_NA;
assignedRegBit = RBM_NONE;
}
}
}

I tried to extend it to other locals too but I wasn't able to find out the right way to check this properly (I hit some snags where the next/last ref positions were in other blocks and I could not check if there was an actual use after a fixed reg).

@kunalspathak
Copy link
Member

I actually noticed that we have this logic for parameters already:

Yep, and that's precisely what #7999 is suggesting to improve.

@jakobbotsch
Copy link
Member Author

that's precisely what #7999 is suggesting to improve

Is it? The improvement here would be to extend this optimization to other locals (not just parameters). #7999 seems to be about allowing parameters to move between different registers

@kunalspathak
Copy link
Member

The improvement here would be to extend this optimization to other locals

Sure, that's a fair point.

@jakobbotsch
Copy link
Member Author

I polished the implementation of this a bit. We now do an actual stable sort of the arguments that puts constants at the end and calls at the beginning. The previous implementation used swapping which had the annoying side effect of reordering other arguments in random ways, causing any changes to the algorithm to hit spurious other diffs.

After the stable sort there is a separate step to resolve conflicts, which is done by an implementation of Tarjan's algorithm to pick a register order with a minimal number of conflicts. win-x64 diffs are looking pretty good and the TP impact seems reasonable, but some other platforms are seeing regressions (especially linux-arm). I need to investigate why in more detail tomorrow.

@kunalspathak
Copy link
Member

which is done by an implementation of Tarjan's algorithm to pick a register order with a minimal number of conflicts

I want to do some experiments using similar concept in establishing the relation between various blocks during sequencing as mentioned in #66994 (comment).

@jakobbotsch
Copy link
Member Author

jakobbotsch commented Sep 30, 2022

Investigating arm32 diffs.

libraries.pmi arm32 regression
           8 (33.33% of base) : 173904.dasm - Builder[int,System.Nullable`1[int]]:ContainsKey(int):bool:this

Diff:

 ; Final local variable assignments
 ;
 ;  V00 this         [V00,T00] (  4,  4   )     ref  ->   r0         this class-hnd single-def
-;  V01 arg1         [V01,T01] (  3,  3   )     int  ->   r3         single-def
+;  V01 arg1         [V01,T01] (  3,  3   )     int  ->   r1         single-def
 ;# V02 OutArgs      [V02    ] (  1,  1   )  lclBlk ( 0) [sp+00H]   "OutgoingArgSpace"
 ;* V03 tmp1         [V03    ] (  0,  0   )  struct ( 8) zero-ref    "struct address for call/obj"
 ;* V04 tmp2         [V04    ] (  0,  0   )  struct ( 8) zero-ref    multireg-arg ld-addr-op "NewObj constructor temp"
-;  V05 tmp3         [V05,T02] (  2,  4   )     ref  ->   r1         class-hnd single-def "Inlining Arg"
-;  V06 tmp4         [V06,T03] (  2,  4   )     ref  ->   r2         class-hnd single-def "Inlining Arg"
+;  V05 tmp3         [V05,T02] (  2,  4   )     ref  ->   r2         class-hnd single-def "Inlining Arg"
+;  V06 tmp4         [V06,T03] (  2,  4   )     ref  ->   r0         class-hnd single-def "Inlining Arg"
 ;* V07 tmp5         [V07    ] (  0,  0   )     ref  ->  zero-ref    V03._root(offs=0x00) P-INDEP "field V03._root (fldOffset=0x0)"
 ;* V08 tmp6         [V08    ] (  0,  0   )     ref  ->  zero-ref    V03._comparers(offs=0x04) P-INDEP "field V03._comparers (fldOffset=0x4)"
-;  V09 tmp7         [V09,T04] (  2,  2   )     ref  ->   r1         single-def V04._root(offs=0x00) P-INDEP "field V04._root (fldOffset=0x0)"
-;  V10 tmp8         [V10,T05] (  2,  2   )     ref  ->   r2         single-def V04._comparers(offs=0x04) P-INDEP "field V04._comparers (fldOffset=0x4)"
+;  V09 tmp7         [V09,T04] (  2,  2   )     ref  ->   r2         single-def V04._root(offs=0x00) P-INDEP "field V04._root (fldOffset=0x0)"
+;  V10 tmp8         [V10,T05] (  2,  2   )     ref  ->  [sp+04H]   single-def V04._comparers(offs=0x04) P-INDEP "field V04._comparers (fldOffset=0x4)"
 ;
-; Lcl frame size = 4
+; Lcl frame size = 12
 
 G_M10664_IG01:        ; gcrefRegs=0000 {}, byrefRegs=0000 {}, byref, nogc <-- Prolog IG
-            push    {r3,lr}
-            mov     r3, r1
+            push    {lr}
+            sub     sp, 12
 						;; size=4 bbWeight=1    PerfScore 2.00
 G_M10664_IG02:        ; gcrefRegs=0001 {r0}, byrefRegs=0000 {}, byref
             ; gcrRegs +[r0]
-            ldr     r1, [r0+0x04]
-            ; gcrRegs +[r1]
-            ldr     r2, [r0+0x08]
+            ldr     r2, [r0+0x04]
             ; gcrRegs +[r2]
-            mov     r0, r3
+            ldr     r0, [r0+0x08]
+            str     r0, [sp+0x04]	// [V10 tmp8]
+            ; GC ptr vars +{V10}
+            mov     r0, r1
             ; gcrRegs -[r0]
+            mov     r1, r2
+            ; gcrRegs +[r1]
+            ldr     r2, [sp+0x04]	// [V10 tmp8]
             movw    r3, 0xd1ff
             movt    r3, 0xd1ff
             ldr     r3, [r3]
+            ; GC ptr vars -{V10}
             blx     r3		// System.Collections.Immutable.ImmutableDictionary`2[int,System.Nullable`1[int]]:ContainsKey(int,MutationInput[int,System.Nullable`1[int]]):bool
             ; gcrRegs -[r1-r2]
             ; gcr arg pop 0
-						;; size=18 bbWeight=1    PerfScore 7.00
+						;; size=24 bbWeight=1    PerfScore 10.00
 G_M10664_IG03:        ; , epilog, nogc, extend
-            pop     {r3,pc}
-						;; size=2 bbWeight=1    PerfScore 1.00
+            add     sp, 12
+            pop     {pc}
+						;; size=4 bbWeight=1    PerfScore 2.00

Arg ordering diff:

 Sorting the arguments:
-  [000000] -> #1 (non-struct local)
-    [000019] [000000]
-  [000019] -> #0 (last arg)
-    [000019] [000000]
+  Arguments have register interference. Argument order and SCCs:
+  [ [000000] ]
+  [ [000019] ]
+Deferred argument ('r0'):
+               [000000] -----+------                        ▌  LCL_VAR   int    V01 arg1         
+Moved to late list
 Deferred argument ('r1'):
      (  9,  6) [000019] ------------                        ▌  LCL_VAR   struct<System.Collections.Immutable.ImmutableDictionary`2+MutationInput[System.Int32,System.Nullable`1[System.Int32]], 8>(P) V04 tmp2         
                                                             ▌    ref    V04._root (offs=0x00) -> V09 tmp7         
                                                             ▌    ref    V04._comparers (offs=0x04) -> V10 tmp8         
 Moved to late list
-Deferred argument ('r0'):
-               [000000] -----+------                        ▌  LCL_VAR   int    V01 arg1         
-Moved to late list
 
-Register placement order:    r1 r0 
-Multireg struct argument V04 : CallArg[[000019].LCL_VAR struct (By value), 2 regs: r1 r2, byteAlignment=4, isLate, processed, isStruct]
+Register placement order:    r0 r1 
+Multireg struct argument V04 : CallArg[[000019].LCL_VAR struct (By value), 2 regs: r1 r2, byteAlignment=4, isLate, isStruct]

LSRA puts tmp7 in r2 but seems to allocate tmp8 to r0 that conflicts, so ends up with the spill. tmp8 (V10) is related to V06 that is put in r0 by a copy, so that's probably why.

Full IR with some comments:

***** BB01
STMT00008 ( INL01 @ 0x000[E-] ... ??? ) <- INLRT @ 0x000[E-]
               [000037] -A-XG+------ASG       ref   
               [000036] D----+-N----                        ├──▌  LCL_VAR   ref    V05 tmp3          // put in r2
               [000024] ---XG+------                        └──▌  IND       ref   
               [000043] -----+------                           └──▌  ADD       byref 
               [000001] -----+------                              ├──▌  LCL_VAR   ref    V00 this         
               [000042] -----+------                              └──▌  CNS_INT   int    4 Fseq[_root]

***** BB01
STMT00009 ( INL01 @ 0x000[E-] ... ??? ) <- INLRT @ 0x000[E-]
               [000039] -A-XG+------ASG       ref   
               [000038] D----+-N----                        ├──▌  LCL_VAR   ref    V06 tmp4         // put in r0, presumably because of last use of V00 on the RHS, and V00 is in r0
               [000015] ---XG+------                        └──▌  IND       ref   
               [000045] -----+------                           └──▌  ADD       byref 
               [000014] -----+------                              ├──▌  LCL_VAR   ref    V00 this         
               [000044] -----+------                              └──▌  CNS_INT   int    8 Fseq[_comparers]

***** BB01
STMT00006 ( INL03 @ 0x000[E-] ... ??? ) <- INL01 @ 0x000[E-] <- INLRT @ 0x000[E-]
               [000030] -A---+------ASG       ref   
               [000029] D----+-N----                        ├──▌  LCL_VAR   ref    V09 tmp7         
               [000028] -----+------                        └──▌  LCL_VAR   ref    V05 tmp3         

***** BB01
STMT00007 ( INL03 @ 0x007[E-] ... ??? ) <- INL01 @ 0x000[E-] <- INLRT @ 0x000[E-]
               [000035] -A---+------ASG       ref   
               [000034] D----+-N----                        ├──▌  LCL_VAR   ref    V10 tmp8         
               [000033] -----+------                        └──▌  LCL_VAR   ref    V06 tmp4         

***** BB01
STMT00003 ( ??? ... ??? )
               [000011] --CXG+------RETURN    int   
               [000004] --CXG+------                        └──▌  CALL      int    System.Collections.Immutable.ImmutableDictionary`2[Int32,Nullable`1][System.Int32,System.Nullable`1[System.Int32]].ContainsKey
               [000000] -----+------ arg0 in r0                ├──▌  LCL_VAR   int    V01 arg1         
               [000046] -c---------- arg1 r1,r2                └──▌  FIELD_LIST struct
               [000047] ------------ ofs 0                        ├──▌  LCL_VAR   ref    V09 tmp7         
               [000048] ------------ ofs 4                        └──▌  LCL_VAR   ref    V10 tmp8         

@jakobbotsch
Copy link
Member Author

jakobbotsch commented Sep 30, 2022

libraries.pmi arm32 regression
          40 (37.74% of base) : 77451.dasm - System.Security.Cryptography.ECDiffieHellmanOpenSsl:GenerateKey(System.Security.Cryptography.ECCurve):this

Diff:

@@ -28,46 +28,48 @@ G_M52640_IG02:        ; gcrefRegs=0010 {r4}, byrefRegs=0000 {}, byref
             ldr     r3, [r3]
             blx     r3		// System.Security.Cryptography.ECDiffieHellmanOpenSsl:ThrowIfDisposed():this
             ; gcrRegs -[r0]
-            ldr     r0, [sp+0x40]
+            ldr     r0, [r4+0x0C]
             ; gcrRegs +[r0]
-            str     r0, [sp]
-            ldr     r0, [sp+0x44]
-            str     r0, [sp+0x04]
-            ldr     r0, [sp+0x48]
-            str     r0, [sp+0x08]
-            ldr     r0, [sp+0x4C]
-            str     r0, [sp+0x0C]
-            ldr     r0, [sp+0x50]
-            str     r0, [sp+0x10]
-            ldr     r0, [sp+0x54]
-            ; gcrRegs -[r0]
-            str     r0, [sp+0x14]
-            ldr     r0, [sp+0x58]
-            ; gcrRegs +[r0]
-            str     r0, [sp+0x18]
-            ldr     r0, [sp+0x5C]
-            str     r0, [sp+0x1C]
-            ldr     r0, [sp+0x60]
-            ; gcrRegs -[r0]
-            str     r0, [sp+0x20]
-            ldr     r0, [sp+0x64]
-            ; gcrRegs +[r0]
-            str     r0, [sp+0x24]
+            ldr     lr, [sp+0x40]
+            ; gcrRegs +[lr]
+            str     lr, [sp]
+            ldr     lr, [sp+0x44]
+            str     lr, [sp+0x04]
+            ldr     lr, [sp+0x48]
+            str     lr, [sp+0x08]
+            ldr     lr, [sp+0x4C]
+            str     lr, [sp+0x0C]
+            ldr     lr, [sp+0x50]
+            str     lr, [sp+0x10]
+            ldr     lr, [sp+0x54]
+            ; gcrRegs -[lr]
+            str     lr, [sp+0x14]
+            ldr     lr, [sp+0x58]
+            ; gcrRegs +[lr]
+            str     lr, [sp+0x18]
+            ldr     lr, [sp+0x5C]
+            str     lr, [sp+0x1C]
+            ldr     lr, [sp+0x60]
+            ; gcrRegs -[lr]
+            str     lr, [sp+0x20]
+            ldr     lr, [sp+0x64]
+            ; gcrRegs +[lr]
+            str     lr, [sp+0x24]
             ldr     r1, [sp+0x34]
             ; gcrRegs +[r1]
             ldr     r2, [sp+0x38]
             ; gcrRegs +[r2]
             ldr     r3, [sp+0x3C]
             ; gcrRegs +[r3]
-            ldr     r0, [r4+0x0C]
             movw    lr, 0xd1ff
+            ; gcrRegs -[lr]
             movt    lr, 0xd1ff
             ldr     lr, [lr]
             ldr     r12, [r0]
             blx     lr		// hackishMethodName
             ; gcrRegs -[r0-r3]
             str     r0, [r4+0x08]
-						;; size=82 bbWeight=1    PerfScore 35.00
+						;; size=122 bbWeight=1    PerfScore 35.00
 G_M52640_IG03:        ; , epilog, nogc, extend
             add     sp, 40
             pop     {r4,r11,lr}
@@ -75,7 +77,7 @@ G_M52640_IG03:        ; , epilog, nogc, extend
             bx      lr
 						;; size=10 bbWeight=1    PerfScore 4.00
diff --git "a/.\\base.txt" "b/.\\diff.txt"
index 131af5a..bcc2b16 100644
--- "a/.\\base.txt"
+++ "b/.\\diff.txt"
@@ -1,16 +1,13 @@
 Sorting the arguments:
-  [000008] -> #0 (cost 9)
-    [000008] [000004]
-  [000004] -> #1 (last arg)
-    [000008] [000004]
+  No interference found between arguments.
+Deferred argument ('r0'):
+               [000004] ---XG+------                        ▌  IND       ref   
+               [000015] -----+------                        └──▌  ADD       byref 
+               [000003] -----+------                           ├──▌  LCL_VAR   ref    V00 this         
+               [000014] -----+------                           └──▌  CNS_INT   int    12 Fseq[hackishFieldName]
+Moved to late list
 Deferred argument ('r1'):
      (  9,  6) [000008] ------------                        ▌  LCL_VAR   struct<System.Security.Cryptography.ECCurve, 52> V01 arg1         
 Moved to late list
-Deferred argument ('r0'):
-     (  6,  3) [000004] ---XG-------                        ▌  IND       ref   
-     (  5,  5) [000015] -------N----                        └──▌  ADD       byref 
-     (  3,  2) [000003] ------------                           ├──▌  LCL_VAR   ref    V00 this         
-     (  1,  2) [000014] ------------                           └──▌  CNS_INT   int    12 Fseq[hackishFieldName]
-Moved to late list
 
-Register placement order:    r1 r0 
+Register placement order:    r0 r1 

r0 is no longer free for the copy, and apparently we pick lr instead. lr cannot be encoded with thumb so all the copy instructions double in size.

Seems we allocate an internal register for the following PUTARG_SPLIT:

N027 (???,???) [000024] ------------                  t24 =PUTARG_SPLIT struct (40 stackByteSize), (3 numRegs) REG r1,r2,r3

Presumably we don't need any such register and can just use one of the target registers as a temporary scratch register. Since these are argument registers, they can always use thumb encoding.

We can almost always avoid allocating an internal register, which may be
expensive if lr is picked since we cannot use thumb encoding for it.

The only case where we need an internal register is when we have a
source that is in a register, and we have a single taget register to
place that conflicts with that source register. The to-stack copies then
need an intermediate scratch register to avoid clobbering the source
register.

Nit

Nit 2
@jakobbotsch
Copy link
Member Author

jakobbotsch commented Sep 30, 2022

benchmarks.run arm32 regression
[14:31:43]            8 (22.22% of base) : 10715.dasm - <>c__DisplayClass23_0:<UpdateStack>b__0(Sigil.Impl.StackTransition):this

Diff:

 ; Assembly listing for method <>c__DisplayClass23_0:<UpdateStack>b__0(Sigil.Impl.StackTransition):this
 ; Emitting BLENDED_CODE for generic ARM CPU - Unix
 ; optimized code
 ; sp based frame
 ; fully interruptible
 ; No matching PGO data
 ; 0 inlinees with PGO data; 3 single block inlinees; 0 inlinees without PGO data
 ; Final local variable assignments
 ;
-;  V00 this         [V00,T00] (  6,  5   )     ref  ->   r0         this class-hnd single-def
+;  V00 this         [V00,T00] (  6,  5   )     ref  ->  [sp+04H]   this class-hnd single-def
 ;  V01 arg1         [V01,T01] (  4,  4   )     ref  ->   r1         class-hnd single-def
 ;# V02 OutArgs      [V02    ] (  1,  1   )  lclBlk ( 0) [sp+00H]   "OutgoingArgSpace"
 ;  V03 tmp1         [V03,T02] (  2,  4   )     int  ->   r2         "impAppendStmt"
 ;  V04 cse0         [V04,T03] (  4,  3   )     ref  ->   r3         "CSE - aggressive"
 ;
-; Lcl frame size = 4
+; Lcl frame size = 12
 
 G_M11432_IG01:        ; gcVars=00000000 {}, gcrefRegs=0000 {}, byrefRegs=0000 {}, gcvars, byref, nogc <-- Prolog IG
-            push    {r3,lr}
-						;; size=2 bbWeight=1    PerfScore 1.00
+            push    {lr}
+            sub     sp, 12
+						;; size=4 bbWeight=1    PerfScore 2.00
 G_M11432_IG02:        ; gcrefRegs=0003 {r0 r1}, byrefRegs=0000 {}, byref, isz
             ; gcrRegs +[r0-r1]
             ldr     r2, [r0+0x0C]
             ldr     r3, [r1+0x08]
             ; gcrRegs +[r3]
             ldr     r3, [r3+0x04]
             ; gcrRegs -[r3]
             adds    r2, r3, r2
             str     r2, [r0+0x0C]
             ldr     r3, [r1+0x0C]
             ; gcrRegs +[r3]
             cmp     r3, 0
             beq     SHORT G_M11432_IG05
 						;; size=16 bbWeight=1    PerfScore 8.00
 G_M11432_IG03:        ; gcrefRegs=0009 {r0 r3}, byrefRegs=0000 {}, byref
             ; gcrRegs -[r1]
+            str     r0, [sp+0x04]
+            ; GC ptr vars +{V00}
             ldrb    r2, [r0+0x10]
-            ldr     r1, [r0+0x04]
-            ; gcrRegs +[r1]
             ldr     r0, [r3+0x04]
+            ldr     r1, [sp+0x04]	// [V00 this]
+            ; gcrRegs +[r1]
+            ldr     r1, [r1+0x04]
             ldr     r3, [r3+0x0C]
             ; gcrRegs -[r3]
-						;; size=8 bbWeight=0.50 PerfScore 2.00
+						;; size=12 bbWeight=0.50 PerfScore 3.00
 G_M11432_IG04:        ; , epilog, nogc, extend
-            add     sp, 4
+            add     sp, 12
             pop     lr
             bx      r3		// hackishMethodName
             ; gcr arg pop 0
 						;; size=8 bbWeight=0.50 PerfScore 1.50
 G_M11432_IG05:        ; gcrefRegs=0000 {}, byrefRegs=0000 {}, byref, epilog, nogc
             ; gcrRegs -[r0-r1]
-            pop     {r3,pc}
-						;; size=2 bbWeight=0.50 PerfScore 0.50
+            ; GC ptr vars -{V00}
+            add     sp, 12
+            pop     {pc}
+						;; size=4 bbWeight=0.50 PerfScore 1.00
 
-; Total bytes of code 36, prolog size 2, PerfScore 16.60, instruction count 17, allocated bytes for code 36 (MethodHash=a976d357) for method <>c__DisplayClass23_0:<UpdateStack>b__0(Sigil.Impl.StackTransition):this
+; Total bytes of code 44, prolog size 4, PerfScore 19.90, instruction count 21, allocated bytes for code 44 (MethodHash=a976d357) for method <>c__DisplayClass23_0:<UpdateStack>b__0(Sigil.Impl.StackTransition):this
 ; ============================================================

Ordering diff:

 Sorting the arguments:
-  [000026] -> #0 (cost 7)
-    [000026] [000024] [000032]
-  [000024] -> #1 (cost 6)
-    [000026] [000024] [000032]
-  [000032] -> #2 (last arg)
-    [000026] [000024] [000032]
+  Arguments have register interference. Argument order and SCCs:
+  [ [000026] ]
+  [ [000032] [000024] ]
 Deferred argument ('r2'):
-     (  7,  4) [000026] ---XG-------                        ▌  IND       bool  
-     (  5,  5) [000047] -------N----                        └──▌  ADD       byref 
-     (  3,  2) [000025] ------------                           ├──▌  LCL_VAR   ref    V00 this         
-     (  1,  2) [000046] ------------                           └──▌  CNS_INT   int    16 Fseq[hackishFieldName]
-Moved to late list
-Deferred argument ('r1'):
-     (  6,  3) [000024] ---XG-------                        ▌  IND       ref   
-     (  5,  5) [000045] -------N----                        └──▌  ADD       byref 
-     (  3,  2) [000023] ------------                           ├──▌  LCL_VAR   ref    V00 this         
-     (  1,  2) [000044] ------------                           └──▌  CNS_INT   int    4 Fseq[hackishFieldName]
+               [000026] ---XG+------                        ▌  IND       bool  
+               [000047] -----+------                        └──▌  ADD       byref 
+               [000025] -----+------                           ├──▌  LCL_VAR   ref    V00 this         
+               [000046] -----+------                           └──▌  CNS_INT   int    16 Fseq[hackishFieldName]
 Moved to late list
 Deferred argument ('r0'):
-     (  6,  3) [000032] ---XG-------                        ▌  IND       ref   
-     (  5,  5) [000043] -------N----                        └──▌  ADD       byref 
-     (  3,  2) [000020] ------------                           ├──▌  LCL_VAR   ref    V01 arg1         
-     (  1,  2) [000042] ------------                           └──▌  CNS_INT   int    12 Fseq[<Before>k__BackingField]
+               [000032] ---XG+------                        ▌  IND       ref   
+               [000043] -----+------                        └──▌  ADD       byref 
+               [000020] -----+------                           ├──▌  LCL_VAR   ref    V01 arg1         
+               [000042] -----+------                           └──▌  CNS_INT   int    12 Fseq[<Before>k__BackingField]
+Moved to late list
+Deferred argument ('r1'):
+               [000024] ---XG+------                        ▌  IND       ref   
+               [000045] -----+------                        └──▌  ADD       byref 
+               [000023] -----+------                           ├──▌  LCL_VAR   ref    V00 this         
+               [000044] -----+------                           └──▌  CNS_INT   int    4 Fseq[hackishFieldName]
 Moved to late list
 
-Register placement order:    r2 r1 r0 
+Register placement order:    r2 r0 r1 

Looks like a cycle. Previously we place the "more expensive" arg first, which is [000026] due to it being a bool load (questionable that this is more expensive on arm, but I digress). We are then lucky that CSE comes along and CSEs [000032] out which breaks the cycle (puts it into r3). With the new order, the same CSE happens but it does not break the cycle.

@jakobbotsch
Copy link
Member Author

jakobbotsch commented Sep 30, 2022

benchmarks.run arm32 regression
           4 (15.38% of base) : 17304.dasm - System.Collections.CopyTo`1[int]:ImmutableArray():this

Diff:

 ; Assembly listing for method System.Collections.CopyTo`1[int]:ImmutableArray():this
 ; Emitting BLENDED_CODE for generic ARM CPU - Unix
 ; optimized code
 ; sp based frame
 ; partially interruptible
 ; No matching PGO data
 ; 0 inlinees with PGO data; 3 single block inlinees; 0 inlinees without PGO data
 ; Final local variable assignments
 ;
 ;  V00 this         [V00,T00] (  4,  4   )     ref  ->   r0         this class-hnd single-def
 ;# V01 OutArgs      [V01    ] (  1,  1   )  lclBlk ( 0) [sp+00H]   "OutgoingArgSpace"
-;  V02 tmp1         [V02,T01] (  2,  4   )   byref  ->   r2         single-def "Inlining Arg"
+;  V02 tmp1         [V02,T01] (  2,  4   )   byref  ->   r1         single-def "Inlining Arg"
 ;* V03 tmp2         [V03    ] (  0,  0   )  struct ( 4) zero-ref    ld-addr-op "Inline stloc first use temp"
-;  V04 tmp3         [V04,T02] (  2,  4   )     ref  ->   r1         class-hnd single-def "Inlining Arg"
+;  V04 tmp3         [V04,T02] (  2,  4   )     ref  ->   r2         class-hnd single-def "Inlining Arg"
 ;* V05 tmp4         [V05    ] (  0,  0   )     ref  ->  zero-ref    class-hnd "impAppendStmt"
 ;  V06 tmp5         [V06,T03] (  3,  3   )     ref  ->   r0         single-def V03.array(offs=0x00) P-INDEP "field V03.array (fldOffset=0x0)"
-;  V07 cse0         [V07,T04] (  2,  2   )     int  ->   r2         "CSE - aggressive"
+;  V07 cse0         [V07,T04] (  2,  2   )     int  ->   r3         "CSE - aggressive"
 ;
 ; Lcl frame size = 4
 
 G_M316_IG01:        ; gcrefRegs=0000 {}, byrefRegs=0000 {}, byref, nogc <-- Prolog IG
             push    {r3,lr}
 						;; size=2 bbWeight=1    PerfScore 1.00
 G_M316_IG02:        ; gcrefRegs=0001 {r0}, byrefRegs=0000 {}, byref
             ; gcrRegs +[r0]
-            add     r2, r0, 20
-            ; byrRegs +[r2]
-            ldr     r1, [r0+0x0C]
+            add     r1, r0, 20
+            ; byrRegs +[r1]
+            ldr     r2, [r0+0x0C]
+            ; gcrRegs +[r2]
+            ldr     r0, [r1]
+            ldr     r3, [r0+0x04]
+            mov     r1, r2
             ; gcrRegs +[r1]
-            ldr     r0, [r2]
-            ldr     r2, [r0+0x04]
-            ; byrRegs -[r2]
+            ; byrRegs -[r1]
+            mov     r2, r3
+            ; gcrRegs -[r2]
             movw    r3, 0xd1ff
             movt    r3, 0xd1ff
             ldr     r3, [r3]
             blx     r3		// hackishMethodName
             ; gcrRegs -[r0-r1]
             ; gcr arg pop 0
-						;; size=22 bbWeight=1    PerfScore 8.00
+						;; size=26 bbWeight=1    PerfScore 10.00
 G_M316_IG03:        ; , epilog, nogc, extend
             pop     {r3,pc}
 						;; size=2 bbWeight=1    PerfScore 1.00
 
-; Total bytes of code 26, prolog size 2, PerfScore 12.60, instruction count 10, allocated bytes for code 26 (MethodHash=ad42fec3) for method System.Collections.CopyTo`1[int]:ImmutableArray():this
+; Total bytes of code 30, prolog size 2, PerfScore 15.00, instruction count 12, allocated bytes for code 30 (MethodHash=ad42fec3) for method System.Collections.CopyTo`1[int]:ImmutableArray():this

Ordering diff:

 Sorting the arguments:
-  [000018] -> #2 (non-struct local)
-    [000017] [000042] [000018]
-  [000017] -> #1 (non-struct local)
-    [000042] [000017] [000018]
-  [000042] -> #0 (last arg)
-    [000042] [000017] [000018]
-Deferred argument ('r2'):
-               [000042] ---X-+------                        ▌  ARR_LENGTH int   
-               [000041] -----+------                        └──▌  LCL_VAR   ref    V06 tmp5         
-Moved to late list
+  No interference found between arguments.
 Deferred argument ('r0'):
                [000017] -----+------                        ▌  LCL_VAR   ref    V06 tmp5         
 Moved to late list
 Deferred argument ('r1'):
                [000018] -----+------                        ▌  LCL_VAR   ref    V04 tmp3         
 Moved to late list
+Deferred argument ('r2'):
+               [000042] ---X-+------                        ▌  ARR_LENGTH int   
+               [000041] -----+------                        └──▌  LCL_VAR   ref    V06 tmp5         
+Moved to late list
 
-Register placement order:    r2 r0 r1 
+Register placement order:    r0 r1 r2 

Just seems like slightly worse RA, instructions here are otherwise in the same order as before so the only thing added is some extra resolution moves.

@jakobbotsch
Copy link
Member Author

jakobbotsch commented Sep 30, 2022

benchmarks.run arm32 regression
          10 (12.20% of base) : 49443.dasm - System.Collections.CopyTo`1[int]:Span():this
 ; Assembly listing for method System.Collections.CopyTo`1[int]:Span():this
 ; Emitting BLENDED_CODE for generic ARM CPU - Unix
 ; optimized code
 ; r11 based frame
 ; partially interruptible
 ; No matching PGO data
 ; 0 inlinees with PGO data; 4 single block inlinees; 5 inlinees without PGO data
 ; Final local variable assignments
 ;
 ;  V00 this         [V00,T00] (  4,  4   )     ref  ->   r0         this class-hnd single-def
 ;* V01 loc0         [V01    ] (  0,  0   )  struct ( 8) zero-ref    ld-addr-op
 ;# V02 OutArgs      [V02    ] (  1,  1   )  lclBlk ( 0) [sp+00H]   "OutgoingArgSpace"
 ;* V03 tmp1         [V03    ] (  0,  0   )  struct ( 8) zero-ref    ld-addr-op "NewObj constructor temp"
 ;* V04 tmp2         [V04    ] (  0,  0   )  struct ( 8) zero-ref    ld-addr-op "NewObj constructor temp"
-;  V05 tmp3         [V05,T01] (  4,  6   )     ref  ->   r2         class-hnd single-def "Inlining Arg"
-;  V06 tmp4         [V06,T02] (  4,  6   )     ref  ->   r2         class-hnd single-def "Inlining Arg"
+;  V05 tmp3         [V05,T01] (  4,  6   )     ref  ->   r1         class-hnd single-def "Inlining Arg"
+;  V06 tmp4         [V06,T02] (  4,  6   )     ref  ->   r0         class-hnd single-def "Inlining Arg"
 ;* V07 tmp5         [V07    ] (  0,  0   )  struct ( 8) zero-ref    ld-addr-op "Inlining Arg"
 ;* V08 tmp6         [V08    ] (  0,  0   )     int  ->  zero-ref    "impAppendStmt"
 ;* V09 tmp7         [V09    ] (  0,  0   )   byref  ->  zero-ref    single-def "Inlining Arg"
 ;* V10 tmp8         [V10    ] (  0,  0   )   byref  ->  zero-ref    single-def "Inlining Arg"
 ;* V11 tmp9         [V11    ] (  0,  0   )     int  ->  zero-ref    "Inlining Arg"
 ;* V12 tmp10        [V12    ] (  0,  0   )     int  ->  zero-ref    "Inlining Arg"
 ;  V13 tmp11        [V13,T08] (  2,  2   )   byref  ->   r1         single-def V01._reference(offs=0x00) P-INDEP "field V01._reference (fldOffset=0x0)"
-;  V14 tmp12        [V14,T03] (  3,  3   )     int  ->   r3         single-def V01._length(offs=0x04) P-INDEP "field V01._length (fldOffset=0x4)"
-;  V15 tmp13        [V15,T04] (  3,  2   )   byref  ->   r1         V03._reference(offs=0x00) P-INDEP "field V03._reference (fldOffset=0x0)"
-;  V16 tmp14        [V16,T06] (  3,  2   )     int  ->   r3         V03._length(offs=0x04) P-INDEP "field V03._length (fldOffset=0x4)"
-;  V17 tmp15        [V17,T05] (  3,  2   )   byref  ->   r0         V04._reference(offs=0x00) P-INDEP "field V04._reference (fldOffset=0x0)"
-;  V18 tmp16        [V18,T07] (  3,  2   )     int  ->   lr         V04._length(offs=0x04) P-INDEP "field V04._length (fldOffset=0x4)"
+;  V14 tmp12        [V14,T03] (  3,  3   )     int  ->   r2         single-def V01._length(offs=0x04) P-INDEP "field V01._length (fldOffset=0x4)"
+;  V15 tmp13        [V15,T04] (  3,  2   )   byref  ->  registers   V03._reference(offs=0x00) P-INDEP "field V03._reference (fldOffset=0x0)"
+;  V16 tmp14        [V16,T06] (  3,  2   )     int  ->  registers   V03._length(offs=0x04) P-INDEP "field V03._length (fldOffset=0x4)"
+;  V17 tmp15        [V17,T05] (  3,  2   )   byref  ->  registers   V04._reference(offs=0x00) P-INDEP "field V04._reference (fldOffset=0x0)"
+;  V18 tmp16        [V18,T07] (  3,  2   )     int  ->  registers   V04._length(offs=0x04) P-INDEP "field V04._length (fldOffset=0x4)"
 ;  V19 tmp17        [V19,T09] (  2,  2   )   byref  ->   r0         single-def V07._reference(offs=0x00) P-INDEP "field V07._reference (fldOffset=0x0)"
 ;* V20 tmp18        [V20    ] (  0,  0   )     int  ->  zero-ref    single-def V07._length(offs=0x04) P-INDEP "field V07._length (fldOffset=0x4)"
 ;
-; Lcl frame size = 0
+; Lcl frame size = 4
 
 G_M42859_IG01:        ; gcrefRegs=0000 {}, byrefRegs=0000 {}, byref, nogc <-- Prolog IG
-            push    {r11,lr}
-            mov     r11, sp
-						;; size=6 bbWeight=1    PerfScore 2.00
+            push    {r3,r4,r11,lr}
+            add     r11, sp, 8
+						;; size=8 bbWeight=1    PerfScore 2.00
 G_M42859_IG02:        ; gcrefRegs=0001 {r0}, byrefRegs=0000 {}, byref, isz
             ; gcrRegs +[r0]
-            ldr     r2, [r0+0x04]
-            ; gcrRegs +[r2]
-            cmp     r2, 0
+            ldr     r1, [r0+0x04]
+            ; gcrRegs +[r1]
+            cmp     r1, 0
             bne     SHORT G_M42859_IG04
 						;; size=6 bbWeight=1    PerfScore 3.00
 G_M42859_IG03:        ; gcrefRegs=0001 {r0}, byrefRegs=0000 {}, byref, isz
-            ; gcrRegs -[r2]
+            ; gcrRegs -[r1]
             movs    r1, 0
-            movs    r3, 0
+            movs    r2, 0
             b       SHORT G_M42859_IG05
 						;; size=6 bbWeight=0.50 PerfScore 1.50
-G_M42859_IG04:        ; gcrefRegs=0005 {r0 r2}, byrefRegs=0000 {}, byref
-            ; gcrRegs +[r2]
-            add     r1, r2, 8
+G_M42859_IG04:        ; gcrefRegs=0003 {r0 r1}, byrefRegs=0000 {}, byref
+            ; gcrRegs +[r1]
+            add     r2, r1, 8
+            ; byrRegs +[r2]
+            ldr     r1, [r1+0x04]
+            ; gcrRegs -[r1]
+            mov     r3, r1
+            mov     r1, r2
             ; byrRegs +[r1]
-            ldr     r3, [r2+0x04]
-						;; size=6 bbWeight=0.50 PerfScore 1.00
+            mov     r2, r3
+            ; byrRegs -[r2]
+						;; size=12 bbWeight=0.50 PerfScore 2.50
 G_M42859_IG05:        ; gcrefRegs=0001 {r0}, byrefRegs=0002 {r1}, byref, isz
-            ; gcrRegs -[r2]
-            ldr     r2, [r0+0x0C]
-            ; gcrRegs +[r2]
-            cmp     r2, 0
+            ldr     r0, [r0+0x0C]
+            cmp     r0, 0
             bne     SHORT G_M42859_IG07
 						;; size=6 bbWeight=1    PerfScore 3.00
 G_M42859_IG06:        ; gcrefRegs=0000 {}, byrefRegs=0002 {r1}, byref, isz
-            ; gcrRegs -[r0 r2]
+            ; gcrRegs -[r0]
             movs    r0, 0
-            mov     lr, 0
+            movs    r3, 0
             b       SHORT G_M42859_IG08
-						;; size=8 bbWeight=0.50 PerfScore 1.50
-G_M42859_IG07:        ; gcrefRegs=0004 {r2}, byrefRegs=0002 {r1}, byref
-            ; gcrRegs +[r2]
-            add     r0, r2, 8
+						;; size=6 bbWeight=0.50 PerfScore 1.50
+G_M42859_IG07:        ; gcrefRegs=0001 {r0}, byrefRegs=0002 {r1}, byref
+            ; gcrRegs +[r0]
+            add     r3, r0, 8
+            ; byrRegs +[r3]
+            ldr     r0, [r0+0x04]
+            ; gcrRegs -[r0]
+            mov     r4, r0
+            mov     r0, r3
             ; byrRegs +[r0]
-            ldr     lr, [r2+0x04]
-						;; size=8 bbWeight=0.50 PerfScore 1.00
+            mov     r3, r4
+            ; byrRegs -[r3]
+						;; size=12 bbWeight=0.50 PerfScore 2.50
 G_M42859_IG08:        ; gcrefRegs=0000 {}, byrefRegs=0003 {r0 r1}, byref, isz
-            ; gcrRegs -[r2]
-            cmp     r3, lr
+            cmp     r2, r3
             bhi     SHORT G_M42859_IG10
-            lsls    r2, r3, 2
+            lsls    r2, r2, 2
             movw    r3, 0xd1ff
             movt    r3, 0xd1ff
             ldr     r3, [r3]
             blx     r3		// hackishMethodName
             ; byrRegs -[r0-r1]
 						;; size=18 bbWeight=1    PerfScore 7.00
 G_M42859_IG09:        ; , epilog, nogc, extend
-            pop     {r11,pc}
+            pop     {r3,r4,r11,pc}
 						;; size=4 bbWeight=1    PerfScore 1.00
 G_M42859_IG10:        ; gcVars=00000000 {}, gcrefRegs=0000 {}, byrefRegs=0000 {}, gcvars, byref
             movw    r3, 0xd1ff
             movt    r3, 0xd1ff
             ldr     r3, [r3]
             blx     r3		// System.ThrowHelper:ThrowArgumentException_DestinationTooShort()
             bkpt    
 						;; size=14 bbWeight=0    PerfScore 0.00
 
-; Total bytes of code 82, prolog size 6, PerfScore 29.20, instruction count 31, allocated bytes for code 82 (MethodHash=a24e5894) for method System.Collections.CopyTo`1[int]:Span():this
+; Total bytes of code 92, prolog size 8, PerfScore 33.20, instruction count 37, allocated bytes for code 92 (MethodHash=a24e5894) for method System.Collections.CopyTo`1[int]:Span():this
 Sorting the arguments:
-  [000130] -> #2 (non-struct local)
-    [000129] [000133] [000130]
-  [000129] -> #1 (non-struct local)
-    [000133] [000129] [000130]
-  [000133] -> #0 (last arg)
-    [000133] [000129] [000130]
-Deferred argument ('r2'):
-               [000133] -----+------                        ▌  LSH       int   
-               [000131] -----+------                        ├──▌  LCL_VAR   int    V14 tmp12        
-               [000132] -----+------                        └──▌  CNS_INT   int    2
-Moved to late list
+  No interference found between arguments.
 Deferred argument ('r0'):
                [000129] -----+------                        ▌  LCL_VAR   byref  V19 tmp17        
 Moved to late list
 Deferred argument ('r1'):
                [000130] -----+------                        ▌  LCL_VAR   byref  V13 tmp11        
 Moved to late list
+Deferred argument ('r2'):
+               [000133] -----+------                        ▌  LSH       int   
+               [000131] -----+------                        ├──▌  LCL_VAR   int    V14 tmp12        
+               [000132] -----+------                        └──▌  CNS_INT   int    2
+Moved to late list
 
-Register placement order:    r2 r0 r1 
+Register placement order:    r0 r1 r2 

A small change that seems to lead to overall globally worse register allocation. Thankfully not inside the main CopyTo loop, so probably not an actual perf issue.

@jakobbotsch
Copy link
Member Author

benchmarks.run linux-x64
          18 (78.26% of base) : 1413.dasm - System.ValueTuple`2[int,int]:GetHashCode():int:this
 ;# V03 OutArgs      [V03    ] (  1,  1   )  lclBlk ( 0) [rsp+00H]   "OutgoingArgSpace"
 ;* V04 tmp1         [V04    ] (  0,  0   )   byref  ->  zero-ref    single-def "impAppendStmt"
 ;* V05 tmp2         [V05    ] (  0,  0   )   byref  ->  zero-ref   
-;  V06 tmp3         [V06,T02] (  2,  2   )     int  ->  rax        
-;  V07 tmp4         [V07,T01] (  2,  4   )   byref  ->  rdi         single-def "impAppendStmt"
+;  V06 tmp3         [V06,T02] (  2,  2   )     int  ->  rsi        
+;  V07 tmp4         [V07,T01] (  2,  4   )   byref  ->  [rbp-08H]   single-def "impAppendStmt"
 ;* V08 tmp5         [V08    ] (  0,  0   )     int  ->  zero-ref   
 ;* V09 tmp6         [V09    ] (  0,  0   )   byref  ->  zero-ref   
 ;* V10 tmp7         [V10    ] (  0,  0   )     int  ->  zero-ref   
 ;* V11 tmp8         [V11    ] (  0,  0   )     int  ->  zero-ref   
 ;
-; Lcl frame size = 0
+; Lcl frame size = 16
 
 G_M63581_IG01:        ; gcrefRegs=00000000 {}, byrefRegs=00000000 {}, byref, nogc <-- Prolog IG
        push     rbp
-       mov      rbp, rsp
-						;; size=4 bbWeight=1    PerfScore 1.25
+       sub      rsp, 16
+       lea      rbp, [rsp+10H]
+						;; size=10 bbWeight=1    PerfScore 1.75
 G_M63581_IG02:        ; gcrefRegs=00000000 {}, byrefRegs=00000080 {rdi}, byref
        ; byrRegs +[rdi]
-       mov      eax, dword ptr [rdi]
-       add      rdi, 4
        mov      esi, dword ptr [rdi]
-       mov      edi, eax
+       add      rdi, 4
+       mov      bword ptr [rbp-08H], rdi
+       ; GC ptr vars +{V07}
+       mov      edi, esi
        ; byrRegs -[rdi]
+       mov      rsi, bword ptr [rbp-08H]
+       ; byrRegs +[rsi]
+       mov      esi, dword ptr [rsi]
+       ; byrRegs -[rsi]
+       ; GC ptr vars -{V07}
        call     [System.HashCode:Combine[int,int](int,int):int]
        nop      
-						;; size=17 bbWeight=1    PerfScore 7.75
+						;; size=25 bbWeight=1    PerfScore 9.75
 G_M63581_IG03:        ; , epilog, nogc, extend
+       add      rsp, 16
        pop      rbp
        ret      
-						;; size=2 bbWeight=1    PerfScore 1.50
+						;; size=6 bbWeight=1    PerfScore 1.75
 
-; Total bytes of code 23, prolog size 4, PerfScore 12.80, instruction count 10, allocated bytes for code 23 (MethodHash=47bb07a2) for method System.ValueTuple`2[int,int]:GetHashCode():int:this
+; Total bytes of code 41, prolog size 10, PerfScore 17.35, instruction count 14, allocated bytes for code 41 (MethodHash=47bb07a2) for method System.ValueTuple`2[int,int]:GetHashCode():int:this
 Sorting the arguments:
-  [000024] -> #1 (non-struct local)
-    [000063] [000024]
-  [000063] -> #0 (last arg)
-    [000063] [000024]
+  No interference found between arguments.
+Deferred argument ('rdi'):
+               [000024] -----+-----                         ▌  LCL_VAR   int    V06 tmp3         
+Moved to late list
 Deferred argument ('rsi'):
                [000063] ---XG+-----                         ▌  IND       int   
                [000035] -----+-----                         └──▌  LCL_VAR   byref  V07 tmp4         
 Moved to late list
-Deferred argument ('rdi'):
-               [000024] -----+-----                         ▌  LCL_VAR   int    V06 tmp3         
-Moved to late list
 
-Register placement order:    rsi rdi 
+Register placement order:    rdi rsi 

Another case where we were lucky to get the order right before -- LSRA allocates tmp4 into rdi, so the new order requires a spill. I will probably need to figure out how to either do something in LSRA about this, or go back to more or less the same sort as before to avoid the churn.

@jakobbotsch
Copy link
Member Author

#76671 gets a lot of the diffs, but the diffs of this is still quite large, especially on win-x64. I'm investigating some of the regressions with that change in now.

win-x64 benchmarks regression
17 (70.83% of base) : 3299.dasm - System.Xml.Schema.XmlSchemaObjectCollection:OnInsert(int,System.Object):this

Diff:

@@ -6,40 +6,48 @@
 ; No matching PGO data
 ; Final local variable assignments
 ;
-;  V00 this         [V00,T00] (  4,  3.50)     ref  ->  rdx         this class-hnd single-def
+;  V00 this         [V00,T00] (  4,  3.50)     ref  ->  rcx         this class-hnd single-def
 ;* V01 arg1         [V01    ] (  0,  0   )     int  ->  zero-ref    single-def
 ;  V02 arg2         [V02,T01] (  3,  2.50)     ref  ->   r8         class-hnd single-def
 ;# V03 OutArgs      [V03    ] (  1,  1   )  lclBlk ( 0) [rsp+00H]   "OutgoingArgSpace"
-;  V04 tmp1         [V04,T02] (  2,  4   )     ref  ->  rcx         class-hnd single-def "dup spill"
-;  V05 tmp2         [V05,T03] (  4,  3   )     ref  ->  rcx         single-def
+;  V04 tmp1         [V04,T02] (  2,  4   )     ref  ->  rdx         class-hnd single-def "dup spill"
+;  V05 tmp2         [V05,T03] (  4,  3   )     ref  ->  [rsp+00H]   spill-single-def
 ;
-; Lcl frame size = 0
+; Lcl frame size = 8
 
 G_M49570_IG01:        ; gcVars=0000000000000000 {}, gcrefRegs=00000000 {}, byrefRegs=00000000 {}, gcvars, byref, nogc <-- Prolog IG
-       mov      rdx, rcx
+       push     rax
+						;; size=1 bbWeight=1    PerfScore 1.00
+G_M49570_IG02:        ; gcrefRegs=00000102 {rcx r8}, byrefRegs=00000000 {}, byref, isz
+       ; gcrRegs +[rcx r8]
+       mov      rdx, gword ptr [rcx+10H]
        ; gcrRegs +[rdx]
-						;; size=3 bbWeight=1    PerfScore 0.25
-G_M49570_IG02:        ; gcrefRegs=00000104 {rdx r8}, byrefRegs=00000000 {}, byref, isz
-       ; gcrRegs +[r8]
-       mov      rcx, gword ptr [rdx+10H]
-       ; gcrRegs +[rcx]
-       test     rcx, rcx
+       mov      gword ptr [rsp], rdx
+       ; GC ptr vars +{V05}
+       test     rdx, rdx
        jne      SHORT G_M49570_IG04
-						;; size=9 bbWeight=1    PerfScore 3.25
+						;; size=13 bbWeight=1    PerfScore 4.25
 G_M49570_IG03:        ; gcrefRegs=00000000 {}, byrefRegs=00000000 {}, byref, epilog, nogc
        ; gcrRegs -[rcx rdx r8]
+       ; GC ptr vars -{V05}
+       add      rsp, 8
        ret      
-						;; size=1 bbWeight=0.50 PerfScore 0.50
-G_M49570_IG04:        ; gcVars=0000000000000000 {}, gcrefRegs=00000106 {rcx rdx r8}, byrefRegs=00000000 {}, gcvars, byref
-       ; gcrRegs +[rcx rdx r8]
+						;; size=5 bbWeight=0.50 PerfScore 0.62
+G_M49570_IG04:        ; gcVars=0000000000000008 {V05}, gcrefRegs=00000102 {rcx r8}, byrefRegs=00000000 {}, gcvars, byref
+       ; gcrRegs +[rcx r8]
+       ; GC ptr vars +{V05}
+       mov      rdx, rcx
+       ; gcrRegs +[rdx]
+       mov      rcx, gword ptr [rsp]
        mov      rax, qword ptr [rcx]
        mov      rax, qword ptr [rax+48H]
-						;; size=7 bbWeight=0.50 PerfScore 2.00
+						;; size=14 bbWeight=0.50 PerfScore 2.62
 G_M49570_IG05:        ; , epilog, nogc, extend
+       add      rsp, 8
        tail.jmp [rax+20H]System.Xml.Schema.XmlSchemaObject:hackishMethodName
-						;; size=4 bbWeight=0.50 PerfScore 1.00
+						;; size=8 bbWeight=0.50 PerfScore 1.12

New arg placement order:

@@ -1,18 +1,16 @@
 Sorting the arguments:
-  [000015] -> #2 (non-struct local)
-    [000013] [000014] [000015]
-  [000014] -> #1 (non-struct local)
-    [000013] [000014] [000015]
-  [000013] -> #0 (non-struct local)
-    [000013] [000014] [000015]
-Deferred argument ('rcx'):
-               [000013] -----+-----                         ▌  LCL_VAR   ref    V05 tmp2         
-Moved to late list
+  Arguments have register interference. Argument order and SCCs:
+  [ [000014] ]
+  [ [000013] ]
+  [ [000015] ]
 Deferred argument ('rdx'):
                [000014] -----+-----                         ▌  LCL_VAR   ref    V00 this         
 Moved to late list
+Deferred argument ('rcx'):
+               [000013] -----+-----                         ▌  LCL_VAR   ref    V05 tmp2         
+Moved to late list
 Deferred argument ('r8'):
                [000015] -----+-----                         ▌  LCL_VAR   ref    V02 arg2         
 Moved to late list
 
-Register placement order:    rcx rdx r8 
+Register placement order:    rdx rcx r8 

LSRA then realizes that it can leave this in rcx. However, for some reason it still allocates V05 (that is being placed into rcx) into rdx, causing the spill. That's a little odd since the new preferencing logic does kick in properly:

Last use of V05 between PUTARG and CALL. Removing occupied arg regs from preferences: [rdx r8]

Will need to look at bit closer at this one to figure out why the preferencing does not kick in. It seems like allocating V05 into rax instead would solve all problems.

jitdump.txt

@kunalspathak
Copy link
Member

`Last use of V05 between PUTARG and CALL. Removing occupied arg regs from preferences: [rdx r8]

Just realized, that we should also print the new preference set on this line, just to make things clear.

Will need to look at bit closer at this one to figure out why the preferencing does not kick in.

This are all the refpositions for that Interval

Interval 3: (V05) ref RefPositions {#8@16 #9@21 #17@39 #24@47} physReg:NA Preferences=[rcx]

Although here we say that preference is just [rcx], for Refpositions, it is not set and can pick one of the [allInt] registes.

REFPOSITIONS BEFORE ALLOCATION: 
....
<RefPosition #8   @16  RefTypeDef <Ivl:3 V05> STORE_LCL_VAR BB01 regmask=[allInt] minReg=1 wt=300.00>
<RefPosition #9   @21  RefTypeUse <Ivl:3 V05> LCL_VAR BB01 regmask=[allInt] minReg=1 regOptional wt=300.00>
<RefPosition #17  @39  RefTypeUse <Ivl:3 V05> LCL_VAR BB03 regmask=[rcx] minReg=1 fixed wt=300.00>
<RefPosition #24  @47  RefTypeUse <Ivl:3 V05> LCL_VAR BB03 regmask=[allInt] minReg=1 last wt=300.00>

So, check when allocating for the RefTypeDef #8, are we are ignoring the Interval's preference that is set in "dying local" method.

@ghost ghost closed this Nov 9, 2022
@ghost
Copy link

ghost commented Nov 9, 2022

Draft Pull Request was automatically closed for 30 days of inactivity. Please let us know if you'd like to reopen it.

@ghost ghost locked as resolved and limited conversation to collaborators Dec 10, 2022
This pull request was closed.
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants