-
Notifications
You must be signed in to change notification settings - Fork 4.7k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
JIT: Smarter ordering of late args based on register uses #76017
Conversation
Tagging subscribers to this area: @JulieLeeMSFT, @jakobbotsch Issue Detailsnull
|
So essentially this rearranges the args depending on how many of them will get overwritten/usage? |
Right, maybe an example is easiest: Sorting the arguments:
- [000027] -> #0 (cost 6)
- [000029] -> #1 (last arg)
-Deferred argument ('rcx'):
- ( 6, 5) [000027] ---XG------ ▌ IND ref
- ( 4, 3) [000033] -------N--- └──▌ ADD byref
- ( 3, 2) [000026] ----------- ├──▌ LCL_VAR ref V02 loc0
- ( 1, 1) [000032] ----------- └──▌ CNS_INT long 16 Fseq[_delegate]
-Moved to late list
+ [000029] -> #0 (cost 6)
+ [000027] -> #1 (last arg)
Deferred argument ('rdx'):
( 6, 5) [000029] ---XG------ ▌ IND ref
( 4, 3) [000035] -------N--- └──▌ ADD byref
( 3, 2) [000028] ----------- ├──▌ LCL_VAR ref V00 this
( 1, 1) [000034] ----------- └──▌ CNS_INT long 16 Fseq[_delegate]
Moved to late list
+Deferred argument ('rcx'):
+ ( 6, 5) [000027] ---XG------ ▌ IND ref
+ ( 4, 3) [000033] -------N--- └──▌ ADD byref
+ ( 3, 2) [000026] ----------- ├──▌ LCL_VAR ref V02 loc0
+ ( 1, 1) [000032] ----------- └──▌ CNS_INT long 16 Fseq[_delegate]
+Moved to late list
-Register placement order: rcx rdx
+Register placement order: rdx rcx
As you can see, the argument placed in Overall an improvement, but there are many regressions too -- this stuff is quite sensitive. I also only currently do the reordering for trees with equal cost (so e.g. if Another thing to consider is that if there is a previous call then the parameters are likely to have been spilled into other registers anyway, so the sort is not helpful (but probably not harmful either). However this depends on the exact block linearization that LSRA picks. Might be worth experimenting here in morph to see if we predict it in an easy way. |
There are also other smart things we can do, e.g. if we notice a cycle, we can evaluate one of the arguments in the cycle into a temp and place it last. That essentially does a small live-range split which should help LSRA spill into a register. |
This starts communicating more information about diffs back from superpmi and starts using it in the driver. The current information is the base size, diff size, base instructions, diff instructions and context size. The driver uses the base size/diff size to pick a small number of interesting contexts and only creates disassembly for these contexts. jit-analyze is then also invoked on only these contexts, but the contexts are picked so that they overlap with what jit-analyze would display. Additionally, we also pick the top 20 smallest contexts (in terms of context size) -- I frequently use the smallest contexts with diffs to investigate my change. The new behavior is only enabled when no custom metrics are specified. If custom metrics are specified we fall back to producing all disassembly files and leave it up to jit-analyze to extract and analyze the metrics specified. Also remove --retainOnlyTopFiles. This is no longer necessary since we avoid creating these disassembly files in the first place. Another benefit is that all contexts mentioned by jit-analyze output will now be part of the artifacts. The net result is that we can now get SPMI diffs for changes with even hundreds of thousands of diffs in the same time as it takes to get diffs for a small change. Fix dotnet#76178
Since there can be multiple conflicting pairs this strategy works better.
Spot-checking some regressions. libraries.pmi regression10 (47.62% of base) : 122054.dasm - Microsoft.FSharp.Control.ObservableModule:succeed@14[System.__Canon](System.IObserver`1[System.__Canon],Microsoft.FSharp.Core.FSharpOption`1[System.__Canon]) The diff looks like: ; No matching PGO data
; Final local variable assignments
;
-; V00 TypeCtx [V00,T00] ( 5, 4 ) long -> r11 single-def
-; V01 arg0 [V01,T02] ( 3, 2.50) ref -> rcx class-hnd single-def
+; V00 TypeCtx [V00,T00] ( 5, 4 ) long -> [rsp+08H] single-def
+; V01 arg0 [V01,T02] ( 3, 2.50) ref -> rdx class-hnd single-def
; V02 arg1 [V02,T01] ( 4, 3.50) ref -> r8 class-hnd single-def
;* V03 loc0 [V03 ] ( 0, 0 ) ref -> zero-ref class-hnd single-def
;# V04 OutArgs [V04 ] ( 1, 1 ) lclBlk ( 0) [rsp+00H] "OutgoingArgSpace"
@@ -16,30 +16,31 @@
; Lcl frame size = 0
G_M32974_IG01: ; gcVars=0000000000000000 {}, gcrefRegs=00000000 {}, byrefRegs=00000000 {}, gcvars, byref, nogc <-- Prolog IG
- mov r11, rcx
- mov rcx, rdx
- ; gcrRegs +[rcx]
- ;; size=6 bbWeight=1 PerfScore 0.50
-G_M32974_IG02: ; gcrefRegs=00000102 {rcx r8}, byrefRegs=00000000 {}, byref, isz
- ; gcrRegs +[r8]
+ ;; size=0 bbWeight=1 PerfScore 0.00
+G_M32974_IG02: ; gcrefRegs=00000104 {rdx r8}, byrefRegs=00000000 {}, byref, isz
+ ; gcrRegs +[rdx r8]
test r8, r8
jne SHORT G_M32974_IG04
;; size=5 bbWeight=1 PerfScore 1.25
G_M32974_IG03: ; gcrefRegs=00000000 {}, byrefRegs=00000000 {}, byref, epilog, nogc
- ; gcrRegs -[rcx r8]
+ ; gcrRegs -[rdx r8]
ret
;; size=1 bbWeight=0.50 PerfScore 0.50
-G_M32974_IG04: ; gcVars=0000000000000000 {}, gcrefRegs=00000102 {rcx r8}, byrefRegs=00000000 {}, gcvars, byref
- ; gcrRegs +[rcx r8]
+G_M32974_IG04: ; gcVars=0000000000000000 {}, gcrefRegs=00000104 {rdx r8}, byrefRegs=00000000 {}, gcvars, byref
+ ; gcrRegs +[rdx r8]
+ mov qword ptr [rsp+08H], rcx
+ mov r11, rcx
+ mov rcx, rdx
+ ; gcrRegs +[rcx]
mov rdx, gword ptr [r8+08H]
- ; gcrRegs +[rdx]
cmp dword ptr [rcx], ecx
- ;; size=6 bbWeight=0.50 PerfScore 2.50
+ mov rax, qword ptr [rsp+08H]
+ ;; size=22 bbWeight=0.50 PerfScore 3.75
G_M32974_IG05: ; , epilog, nogc, extend
- tail.jmp [r11]
+ tail.jmp [rax]
;; size=3 bbWeight=0.50 PerfScore 1.00 @@ -5,17 +5,23 @@ Sorting the arguments:
[000010] [000008] [000019]
[000010] -> #0 (last arg)
[000010] [000008] [000019]
+ [000010] -> #1 (clobbers rdx used by [000008])
+ [000008] [000010] [000019]
+ [000008] -> #2 (clobbers rcx used by [000019])
+ [000010] [000019] [000008]
+ [000010] -> #2 (clobbers rdx used by [000008])
+ [000019] [000008] [000010]
+Deferred argument ('r11'):
+ [000019] -----+----- ▌ LCL_VAR long V00 TypeCtx
+Moved to late list
+Deferred argument ('rcx'):
+ [000008] -----+----- ▌ LCL_VAR ref V01 arg0
+Moved to late list
Deferred argument ('rdx'):
[000010] ---XG+----- ▌ IND ref
[000021] -----+----- └──▌ ADD byref
[000009] -----+----- ├──▌ LCL_VAR ref V02 arg1
[000020] -----+----- └──▌ CNS_INT long 8 Fseq[value]
Moved to late list
-Deferred argument ('rcx'):
- [000008] -----+----- ▌ LCL_VAR ref V01 arg0
-Moved to late list
-Deferred argument ('r11'):
- [000019] -----+----- ▌ LCL_VAR long V00 TypeCtx
-Moved to late list
-Register placement order: rdx rcx r11
+Register placement order: r11 rcx rdx On the surface of it this register placement order is much better, we avoid two conflicts. However, for some reason it causes LSRA to not allocate I think the best solution would be to apply the same trick to VSD calls with dynamically computed VSD stub as we do for static VSD stubs -- simply always do the call through indirection of the indir cell. |
libraries.pmi regression17 (73.91% of base) : 112173.dasm - System.Xml.Xsl.Runtime.XmlNavTypeFilter:MoveToFollowing(System.Xml.XPath.XPathNavigator,System.Xml.XPath.XPathNavigator):bool:this Diff looks like: ; No matching PGO data
; Final local variable assignments
;
-; V00 this [V00,T01] ( 3, 3 ) ref -> rcx this class-hnd single-def
-; V01 arg1 [V01,T00] ( 4, 4 ) ref -> rax class-hnd single-def
+; V00 this [V00,T01] ( 3, 3 ) ref -> [rsp+08H] this class-hnd single-def
+; V01 arg1 [V01,T00] ( 4, 4 ) ref -> [rsp+10H] class-hnd single-def
; V02 arg2 [V02,T02] ( 3, 3 ) ref -> r8 class-hnd single-def
;# V03 OutArgs [V03 ] ( 1, 1 ) lclBlk ( 0) [rsp+00H] "OutgoingArgSpace"
;
; Lcl frame size = 0
G_M53345_IG01: ; gcrefRegs=00000000 {}, byrefRegs=00000000 {}, byref, nogc <-- Prolog IG
- mov rax, rdx
+ mov gword ptr [rsp+08H], rcx
+ ; GC ptr vars +{V00}
+ ;; size=5 bbWeight=1 PerfScore 1.00
+G_M53345_IG02: ; gcVars=0000000000000002 {V00}, gcrefRegs=00000104 {rdx r8}, byrefRegs=00000000 {}, gcvars, byref
+ ; gcrRegs +[rdx r8]
+ mov gword ptr [rsp+10H], rdx
+ ; GC ptr vars +{V01}
+ mov rcx, rdx
+ ; gcrRegs +[rcx]
+ mov rax, gword ptr [rsp+08H]
; gcrRegs +[rax]
- ;; size=3 bbWeight=1 PerfScore 0.25
-G_M53345_IG02: ; gcrefRegs=00000103 {rax rcx r8}, byrefRegs=00000000 {}, byref
- ; gcrRegs +[rcx r8]
- mov edx, dword ptr [rcx+08H]
- mov rcx, rax
+ mov edx, dword ptr [rax+08H]
+ ; gcrRegs -[rdx]
+ mov rax, gword ptr [rsp+10H]
mov rax, qword ptr [rax]
; gcrRegs -[rax]
mov rax, qword ptr [rax+80H]
- ;; size=16 bbWeight=1 PerfScore 6.25
+ ;; size=31 bbWeight=1 PerfScore 9.25
G_M53345_IG03: ; , epilog, nogc, extend
tail.jmp [rax+10H]System.Xml.XPath.XPathNavigator:hackishMethodName
;; size=4 bbWeight=1 PerfScore 2.00
@@ -5,17 +5,23 @@ Sorting the arguments:
[000002] [000000] [000003]
[000002] -> #0 (last arg)
[000002] [000000] [000003]
+ [000002] -> #1 (clobbers rdx used by [000000])
+ [000000] [000002] [000003]
+ [000000] -> #1 (clobbers rcx used by [000002])
+ [000002] [000000] [000003]
+ [000002] -> #1 (clobbers rdx used by [000000])
+ [000000] [000002] [000003]
+Deferred argument ('rcx'):
+ [000000] -----+----- ▌ LCL_VAR ref V01 arg1
+Moved to late list
Deferred argument ('rdx'):
[000002] ---XG+----- ▌ IND int
[000007] -----+----- └──▌ ADD byref
[000001] -----+----- ├──▌ LCL_VAR ref V00 this
[000006] -----+----- └──▌ CNS_INT long 8 Fseq[hackishFieldName]
Moved to late list
-Deferred argument ('rcx'):
- [000000] -----+----- ▌ LCL_VAR ref V01 arg1
-Moved to late list
Deferred argument ('r8'):
[000003] -----+----- ▌ LCL_VAR ref V02 arg2
Moved to late list
-Register placement order: rdx rcx r8
+Register placement order: rcx rdx r8 Looks like there is a cycle in the interference graph and the algorithm ends up swapping them compared to the previous order, and LSRA does not cope well with this. Seems like it may be beneficial to keep |
Nope, not in general. It solves the above regression but it is overall a regression to put |
Problem here was that I was reordering locals. If we have a cycle caused by complex arg vs local, then it does look slightly better to keep the local at the end. |
libraries.pmi regression10 (43.48% of base) : 78804.dasm - EnumeratorImpl:get_Current():Microsoft.CodeAnalysis.SyntaxNodeOrToken:this Diff: ; Final local variable assignments
;
; V00 this [V00,T01] ( 3, 3 ) ref -> rcx this class-hnd single-def
-; V01 RetBuf [V01,T02] ( 3, 3 ) byref -> rax single-def
+; V01 RetBuf [V01,T02] ( 3, 3 ) byref -> rdx single-def
;# V02 OutArgs [V02 ] ( 1, 1 ) lclBlk ( 0) [rsp+00H] "OutgoingArgSpace"
-; V03 tmp1 [V03,T00] ( 3, 6 ) byref -> rcx single-def "Inlining Arg"
+; V03 tmp1 [V03,T00] ( 3, 6 ) byref -> [rsp+00H] spill-single-def "Inlining Arg"
;
-; Lcl frame size = 0
+; Lcl frame size = 8
G_M28311_IG01: ; gcrefRegs=00000000 {}, byrefRegs=00000000 {}, byref, nogc <-- Prolog IG
- mov rax, rdx
- ; byrRegs +[rax]
- ;; size=3 bbWeight=1 PerfScore 0.25
-G_M28311_IG02: ; gcrefRegs=00000002 {rcx}, byrefRegs=00000001 {rax}, byref
+ push rax
+ ;; size=1 bbWeight=1 PerfScore 1.00
+G_M28311_IG02: ; gcrefRegs=00000002 {rcx}, byrefRegs=00000004 {rdx}, byref
; gcrRegs +[rcx]
+ ; byrRegs +[rdx]
add rcx, 8
; gcrRegs -[rcx]
; byrRegs +[rcx]
+ mov bword ptr [rsp], rcx
+ ; GC ptr vars +{V03}
mov r8d, dword ptr [rcx+0CH]
- mov rdx, gword ptr [rcx]
+ mov rcx, rdx
+ mov rdx, bword ptr [rsp]
+ mov rdx, gword ptr [rdx]
; gcrRegs +[rdx]
- mov rcx, rax
- ;; size=14 bbWeight=1 PerfScore 4.50
+ ; byrRegs -[rdx]
+ ;; size=22 bbWeight=1 PerfScore 6.50
G_M28311_IG03: ; , epilog, nogc, extend
+ add rsp, 8
tail.jmp [hackishMethodName]
; gcr arg pop 0
- ;; size=6 bbWeight=1 PerfScore 2.00
+ ;; size=10 bbWeight=1 PerfScore 2.25 @@ -5,18 +5,20 @@ Sorting the arguments:
[000011] [000009] [000013]
[000009] -> #1 (last arg)
[000011] [000009] [000013]
+ [000009] -> #2 (clobbers rdx used by [000013])
+ [000011] [000013] [000009]
Deferred argument ('r8'):
( 6, 5) [000011] ---XG------ ▌ IND int
( 4, 3) [000025] -------N--- └──▌ ADD byref
( 3, 2) [000010] ----------- ├──▌ LCL_VAR byref V03 tmp1
( 1, 1) [000024] ----------- └──▌ CNS_INT long 12
Moved to late list
+Deferred argument ('rcx'):
+ [000013] -----+----- ▌ LCL_VAR byref V01 RetBuf
+Moved to late list
Deferred argument ('rdx'):
( 6, 4) [000009] ---XG------ ▌ IND ref
( 3, 2) [000008] ----------- └──▌ LCL_VAR byref V03 tmp1
Moved to late list
-Deferred argument ('rcx'):
- [000013] -----+----- ▌ LCL_VAR byref V01 RetBuf
-Moved to late list
-Register placement order: r8 rdx rcx
+Register placement order: r8 rcx rdx The new arg order looks preferable, we get rid of the conflict that was there before by placing Not clear to me why |
aspnet.run regression15 (26.32% of base) : 16805.dasm - System.Threading.Tasks.Task:NotifyParentIfPotentiallyAttachedTask():this Diff: ; Assembly listing for method System.Threading.Tasks.Task:NotifyParentIfPotentiallyAttachedTask():this
; Emitting BLENDED_CODE for X64 CPU with AVX - Windows
; Tier-1 compilation
; optimized code
; optimized using profile data
; rsp based frame
; fully interruptible
; with Dynamic PGO: edge weights are valid, and fgCalledCount is 492
; 0 inlinees with PGO data; 3 single block inlinees; 0 inlinees without PGO data
; Final local variable assignments
;
-; V00 this [V00,T00] ( 5, 3 ) ref -> [rsp+08H] this class-hnd single-def
-; V01 loc0 [V01,T03] ( 6, 2 ) ref -> rdx class-hnd single-def
+; V00 this [V00,T00] ( 5, 3 ) ref -> rcx this class-hnd single-def
+; V01 loc0 [V01,T03] ( 6, 2 ) ref -> [rsp+00H] class-hnd single-def
;# V02 OutArgs [V02 ] ( 1, 1 ) lclBlk ( 0) [rsp+00H] "OutgoingArgSpace"
; V03 tmp1 [V03,T01] ( 2, 4 ) ref -> rdx class-hnd single-def "dup spill"
; V04 tmp2 [V04,T02] ( 3, 3 ) ref -> rdx single-def
; V05 tmp3 [V05,T04] ( 4, 2 ) ref -> rdx
;* V06 tmp4 [V06 ] ( 0, 0 ) int -> zero-ref "Inlining Arg"
;
-; Lcl frame size = 0
+; Lcl frame size = 8
G_M22329_IG01: ; gcrefRegs=00000000 {}, byrefRegs=00000000 {}, byref, nogc <-- Prolog IG
- ;; size=0 bbWeight=1 PerfScore 0.00
+ push rax
+ ;; size=1 bbWeight=1 PerfScore 1.00
G_M22329_IG02: ; gcrefRegs=00000002 {rcx}, byrefRegs=00000000 {}, byref, isz
; gcrRegs +[rcx]
mov rdx, gword ptr [rcx+28H]
; gcrRegs +[rdx]
test rdx, rdx
je SHORT G_M22329_IG04
mov rdx, gword ptr [rdx+30H]
test rdx, rdx
jne SHORT G_M22329_IG05
;; size=18 bbWeight=1 PerfScore 6.50
G_M22329_IG03: ; gcrefRegs=00000000 {}, byrefRegs=00000000 {}, byref, epilog, nogc
; gcrRegs -[rcx rdx]
+ add rsp, 8
ret
- ;; size=1 bbWeight=1 PerfScore 1.00
+ ;; size=5 bbWeight=1 PerfScore 1.25
G_M22329_IG04: ; gcVars=0000000000000000 {}, gcrefRegs=00000002 {rcx}, byrefRegs=00000000 {}, gcvars, byref, isz
; gcrRegs +[rcx]
xor rdx, rdx
; gcrRegs +[rdx]
test rdx, rdx
je SHORT G_M22329_IG03
;; size=7 bbWeight=0 PerfScore 0.00
G_M22329_IG05: ; gcrefRegs=00000006 {rcx rdx}, byrefRegs=00000000 {}, byref, isz
+ mov gword ptr [rsp], rdx
+ ; GC ptr vars +{V01}
test byte ptr [rdx+34H], 8
jne SHORT G_M22329_IG03
- mov gword ptr [rsp+08H], rcx
- ; GC ptr vars +{V00}
+ mov rdx, gword ptr [rsp]
test byte ptr [rcx+34H], 4
+ mov gword ptr [rsp], rdx
je SHORT G_M22329_IG03
- mov rcx, rdx
- mov rdx, gword ptr [rsp+08H]
- ;; size=25 bbWeight=0 PerfScore 0.00
+ mov rdx, rcx
+ mov rcx, gword ptr [rsp]
+ ;; size=31 bbWeight=0 PerfScore 0.00
G_M22329_IG06: ; , epilog, nogc, extend
+ add rsp, 8
tail.jmp [hackishMethodName]
; gcr arg pop 0
- ;; size=6 bbWeight=0 PerfScore 0.00
+ ;; size=10 bbWeight=0 PerfScore 0.00 @@ -3,11 +3,13 @@ Sorting the arguments:
[000043] [000044]
[000043] -> #0 (non-struct local)
[000043] [000044]
-Deferred argument ('rcx'):
- [000043] -----+----- ▌ LCL_VAR ref V01 loc0
-Moved to late list
+ [000043] -> #1 (clobbers rcx used by [000044])
+ [000044] [000043]
Deferred argument ('rdx'):
[000044] -----+----- ▌ LCL_VAR ref V00 this
Moved to late list
+Deferred argument ('rcx'):
+ [000043] -----+----- ▌ LCL_VAR ref V01 loc0
+Moved to late list
-Register placement order: rcx rdx
+Register placement order: rdx rcx Seems like the same as the previous case -- |
Correct. We allocate
Morph would be too early to predict the block order that LSRA would pick. If you just renumber the blocks right before LSRA, you would see 1000s of improvements/regressions. I wouldn't recommend relying on "block ordering" factor at least to predict register allocation decision. |
Can you elaborate? I don't see any ABI need for putting |
I meant to say
For |
I understand, but why allocate it to STMT00003 ( 0x000[E-] ... ??? )
N005 ( 3, 3) [000015] -A--G---R-- ▌ ASG byref $VN.Void
N004 ( 1, 1) [000014] D------N--- ├──▌ LCL_VAR byref V03 tmp1 d:1 $VN.Void
N003 ( 3, 3) [000021] ----------- └──▌ ADD byref $180
N001 ( 1, 1) [000019] ----------- ├──▌ LCL_VAR ref V00 this u:1 (last use) $80
N002 ( 1, 1) [000020] ----------- └──▌ CNS_INT long 8 Fseq[hackishFieldName] $140 ? |
Yes. It seems, out of multiple choices, we end up picking
|
Regarding why
Well, here is the answer: runtime/src/coreclr/jit/lsra.h Lines 2435 to 2438 in 65bde8c
|
Thanks, that makes sense. Note that it's not a straight copy, it is
I actually noticed that we have this logic for parameters already: runtime/src/coreclr/jit/lsra.cpp Lines 5173 to 5243 in cc934c7
I tried to extend it to other locals too but I wasn't able to find out the right way to check this properly (I hit some snags where the next/last ref positions were in other blocks and I could not check if there was an actual use after a fixed reg). |
Yep, and that's precisely what #7999 is suggesting to improve. |
Sure, that's a fair point. |
I polished the implementation of this a bit. We now do an actual stable sort of the arguments that puts constants at the end and calls at the beginning. The previous implementation used swapping which had the annoying side effect of reordering other arguments in random ways, causing any changes to the algorithm to hit spurious other diffs. After the stable sort there is a separate step to resolve conflicts, which is done by an implementation of Tarjan's algorithm to pick a register order with a minimal number of conflicts. win-x64 diffs are looking pretty good and the TP impact seems reasonable, but some other platforms are seeing regressions (especially linux-arm). I need to investigate why in more detail tomorrow. |
I want to do some experiments using similar concept in establishing the relation between various blocks during sequencing as mentioned in #66994 (comment). |
Investigating arm32 diffs. libraries.pmi arm32 regression 8 (33.33% of base) : 173904.dasm - Builder[int,System.Nullable`1[int]]:ContainsKey(int):bool:this Diff: ; Final local variable assignments
;
; V00 this [V00,T00] ( 4, 4 ) ref -> r0 this class-hnd single-def
-; V01 arg1 [V01,T01] ( 3, 3 ) int -> r3 single-def
+; V01 arg1 [V01,T01] ( 3, 3 ) int -> r1 single-def
;# V02 OutArgs [V02 ] ( 1, 1 ) lclBlk ( 0) [sp+00H] "OutgoingArgSpace"
;* V03 tmp1 [V03 ] ( 0, 0 ) struct ( 8) zero-ref "struct address for call/obj"
;* V04 tmp2 [V04 ] ( 0, 0 ) struct ( 8) zero-ref multireg-arg ld-addr-op "NewObj constructor temp"
-; V05 tmp3 [V05,T02] ( 2, 4 ) ref -> r1 class-hnd single-def "Inlining Arg"
-; V06 tmp4 [V06,T03] ( 2, 4 ) ref -> r2 class-hnd single-def "Inlining Arg"
+; V05 tmp3 [V05,T02] ( 2, 4 ) ref -> r2 class-hnd single-def "Inlining Arg"
+; V06 tmp4 [V06,T03] ( 2, 4 ) ref -> r0 class-hnd single-def "Inlining Arg"
;* V07 tmp5 [V07 ] ( 0, 0 ) ref -> zero-ref V03._root(offs=0x00) P-INDEP "field V03._root (fldOffset=0x0)"
;* V08 tmp6 [V08 ] ( 0, 0 ) ref -> zero-ref V03._comparers(offs=0x04) P-INDEP "field V03._comparers (fldOffset=0x4)"
-; V09 tmp7 [V09,T04] ( 2, 2 ) ref -> r1 single-def V04._root(offs=0x00) P-INDEP "field V04._root (fldOffset=0x0)"
-; V10 tmp8 [V10,T05] ( 2, 2 ) ref -> r2 single-def V04._comparers(offs=0x04) P-INDEP "field V04._comparers (fldOffset=0x4)"
+; V09 tmp7 [V09,T04] ( 2, 2 ) ref -> r2 single-def V04._root(offs=0x00) P-INDEP "field V04._root (fldOffset=0x0)"
+; V10 tmp8 [V10,T05] ( 2, 2 ) ref -> [sp+04H] single-def V04._comparers(offs=0x04) P-INDEP "field V04._comparers (fldOffset=0x4)"
;
-; Lcl frame size = 4
+; Lcl frame size = 12
G_M10664_IG01: ; gcrefRegs=0000 {}, byrefRegs=0000 {}, byref, nogc <-- Prolog IG
- push {r3,lr}
- mov r3, r1
+ push {lr}
+ sub sp, 12
;; size=4 bbWeight=1 PerfScore 2.00
G_M10664_IG02: ; gcrefRegs=0001 {r0}, byrefRegs=0000 {}, byref
; gcrRegs +[r0]
- ldr r1, [r0+0x04]
- ; gcrRegs +[r1]
- ldr r2, [r0+0x08]
+ ldr r2, [r0+0x04]
; gcrRegs +[r2]
- mov r0, r3
+ ldr r0, [r0+0x08]
+ str r0, [sp+0x04] // [V10 tmp8]
+ ; GC ptr vars +{V10}
+ mov r0, r1
; gcrRegs -[r0]
+ mov r1, r2
+ ; gcrRegs +[r1]
+ ldr r2, [sp+0x04] // [V10 tmp8]
movw r3, 0xd1ff
movt r3, 0xd1ff
ldr r3, [r3]
+ ; GC ptr vars -{V10}
blx r3 // System.Collections.Immutable.ImmutableDictionary`2[int,System.Nullable`1[int]]:ContainsKey(int,MutationInput[int,System.Nullable`1[int]]):bool
; gcrRegs -[r1-r2]
; gcr arg pop 0
- ;; size=18 bbWeight=1 PerfScore 7.00
+ ;; size=24 bbWeight=1 PerfScore 10.00
G_M10664_IG03: ; , epilog, nogc, extend
- pop {r3,pc}
- ;; size=2 bbWeight=1 PerfScore 1.00
+ add sp, 12
+ pop {pc}
+ ;; size=4 bbWeight=1 PerfScore 2.00
Arg ordering diff: Sorting the arguments:
- [000000] -> #1 (non-struct local)
- [000019] [000000]
- [000019] -> #0 (last arg)
- [000019] [000000]
+ Arguments have register interference. Argument order and SCCs:
+ [ [000000] ]
+ [ [000019] ]
+Deferred argument ('r0'):
+ [000000] -----+------ ▌ LCL_VAR int V01 arg1
+Moved to late list
Deferred argument ('r1'):
( 9, 6) [000019] ------------ ▌ LCL_VAR struct<System.Collections.Immutable.ImmutableDictionary`2+MutationInput[System.Int32,System.Nullable`1[System.Int32]], 8>(P) V04 tmp2
▌ ref V04._root (offs=0x00) -> V09 tmp7
▌ ref V04._comparers (offs=0x04) -> V10 tmp8
Moved to late list
-Deferred argument ('r0'):
- [000000] -----+------ ▌ LCL_VAR int V01 arg1
-Moved to late list
-Register placement order: r1 r0
-Multireg struct argument V04 : CallArg[[000019].LCL_VAR struct (By value), 2 regs: r1 r2, byteAlignment=4, isLate, processed, isStruct]
+Register placement order: r0 r1
+Multireg struct argument V04 : CallArg[[000019].LCL_VAR struct (By value), 2 regs: r1 r2, byteAlignment=4, isLate, isStruct] LSRA puts Full IR with some comments: ***** BB01
STMT00008 ( INL01 @ 0x000[E-] ... ??? ) <- INLRT @ 0x000[E-]
[000037] -A-XG+------ ▌ ASG ref
[000036] D----+-N---- ├──▌ LCL_VAR ref V05 tmp3 // put in r2
[000024] ---XG+------ └──▌ IND ref
[000043] -----+------ └──▌ ADD byref
[000001] -----+------ ├──▌ LCL_VAR ref V00 this
[000042] -----+------ └──▌ CNS_INT int 4 Fseq[_root]
***** BB01
STMT00009 ( INL01 @ 0x000[E-] ... ??? ) <- INLRT @ 0x000[E-]
[000039] -A-XG+------ ▌ ASG ref
[000038] D----+-N---- ├──▌ LCL_VAR ref V06 tmp4 // put in r0, presumably because of last use of V00 on the RHS, and V00 is in r0
[000015] ---XG+------ └──▌ IND ref
[000045] -----+------ └──▌ ADD byref
[000014] -----+------ ├──▌ LCL_VAR ref V00 this
[000044] -----+------ └──▌ CNS_INT int 8 Fseq[_comparers]
***** BB01
STMT00006 ( INL03 @ 0x000[E-] ... ??? ) <- INL01 @ 0x000[E-] <- INLRT @ 0x000[E-]
[000030] -A---+------ ▌ ASG ref
[000029] D----+-N---- ├──▌ LCL_VAR ref V09 tmp7
[000028] -----+------ └──▌ LCL_VAR ref V05 tmp3
***** BB01
STMT00007 ( INL03 @ 0x007[E-] ... ??? ) <- INL01 @ 0x000[E-] <- INLRT @ 0x000[E-]
[000035] -A---+------ ▌ ASG ref
[000034] D----+-N---- ├──▌ LCL_VAR ref V10 tmp8
[000033] -----+------ └──▌ LCL_VAR ref V06 tmp4
***** BB01
STMT00003 ( ??? ... ??? )
[000011] --CXG+------ ▌ RETURN int
[000004] --CXG+------ └──▌ CALL int System.Collections.Immutable.ImmutableDictionary`2[Int32,Nullable`1][System.Int32,System.Nullable`1[System.Int32]].ContainsKey
[000000] -----+------ arg0 in r0 ├──▌ LCL_VAR int V01 arg1
[000046] -c---------- arg1 r1,r2 └──▌ FIELD_LIST struct
[000047] ------------ ofs 0 ├──▌ LCL_VAR ref V09 tmp7
[000048] ------------ ofs 4 └──▌ LCL_VAR ref V10 tmp8 |
libraries.pmi arm32 regression 40 (37.74% of base) : 77451.dasm - System.Security.Cryptography.ECDiffieHellmanOpenSsl:GenerateKey(System.Security.Cryptography.ECCurve):this Diff: @@ -28,46 +28,48 @@ G_M52640_IG02: ; gcrefRegs=0010 {r4}, byrefRegs=0000 {}, byref
ldr r3, [r3]
blx r3 // System.Security.Cryptography.ECDiffieHellmanOpenSsl:ThrowIfDisposed():this
; gcrRegs -[r0]
- ldr r0, [sp+0x40]
+ ldr r0, [r4+0x0C]
; gcrRegs +[r0]
- str r0, [sp]
- ldr r0, [sp+0x44]
- str r0, [sp+0x04]
- ldr r0, [sp+0x48]
- str r0, [sp+0x08]
- ldr r0, [sp+0x4C]
- str r0, [sp+0x0C]
- ldr r0, [sp+0x50]
- str r0, [sp+0x10]
- ldr r0, [sp+0x54]
- ; gcrRegs -[r0]
- str r0, [sp+0x14]
- ldr r0, [sp+0x58]
- ; gcrRegs +[r0]
- str r0, [sp+0x18]
- ldr r0, [sp+0x5C]
- str r0, [sp+0x1C]
- ldr r0, [sp+0x60]
- ; gcrRegs -[r0]
- str r0, [sp+0x20]
- ldr r0, [sp+0x64]
- ; gcrRegs +[r0]
- str r0, [sp+0x24]
+ ldr lr, [sp+0x40]
+ ; gcrRegs +[lr]
+ str lr, [sp]
+ ldr lr, [sp+0x44]
+ str lr, [sp+0x04]
+ ldr lr, [sp+0x48]
+ str lr, [sp+0x08]
+ ldr lr, [sp+0x4C]
+ str lr, [sp+0x0C]
+ ldr lr, [sp+0x50]
+ str lr, [sp+0x10]
+ ldr lr, [sp+0x54]
+ ; gcrRegs -[lr]
+ str lr, [sp+0x14]
+ ldr lr, [sp+0x58]
+ ; gcrRegs +[lr]
+ str lr, [sp+0x18]
+ ldr lr, [sp+0x5C]
+ str lr, [sp+0x1C]
+ ldr lr, [sp+0x60]
+ ; gcrRegs -[lr]
+ str lr, [sp+0x20]
+ ldr lr, [sp+0x64]
+ ; gcrRegs +[lr]
+ str lr, [sp+0x24]
ldr r1, [sp+0x34]
; gcrRegs +[r1]
ldr r2, [sp+0x38]
; gcrRegs +[r2]
ldr r3, [sp+0x3C]
; gcrRegs +[r3]
- ldr r0, [r4+0x0C]
movw lr, 0xd1ff
+ ; gcrRegs -[lr]
movt lr, 0xd1ff
ldr lr, [lr]
ldr r12, [r0]
blx lr // hackishMethodName
; gcrRegs -[r0-r3]
str r0, [r4+0x08]
- ;; size=82 bbWeight=1 PerfScore 35.00
+ ;; size=122 bbWeight=1 PerfScore 35.00
G_M52640_IG03: ; , epilog, nogc, extend
add sp, 40
pop {r4,r11,lr}
@@ -75,7 +77,7 @@ G_M52640_IG03: ; , epilog, nogc, extend
bx lr
;; size=10 bbWeight=1 PerfScore 4.00 diff --git "a/.\\base.txt" "b/.\\diff.txt"
index 131af5a..bcc2b16 100644
--- "a/.\\base.txt"
+++ "b/.\\diff.txt"
@@ -1,16 +1,13 @@
Sorting the arguments:
- [000008] -> #0 (cost 9)
- [000008] [000004]
- [000004] -> #1 (last arg)
- [000008] [000004]
+ No interference found between arguments.
+Deferred argument ('r0'):
+ [000004] ---XG+------ ▌ IND ref
+ [000015] -----+------ └──▌ ADD byref
+ [000003] -----+------ ├──▌ LCL_VAR ref V00 this
+ [000014] -----+------ └──▌ CNS_INT int 12 Fseq[hackishFieldName]
+Moved to late list
Deferred argument ('r1'):
( 9, 6) [000008] ------------ ▌ LCL_VAR struct<System.Security.Cryptography.ECCurve, 52> V01 arg1
Moved to late list
-Deferred argument ('r0'):
- ( 6, 3) [000004] ---XG------- ▌ IND ref
- ( 5, 5) [000015] -------N---- └──▌ ADD byref
- ( 3, 2) [000003] ------------ ├──▌ LCL_VAR ref V00 this
- ( 1, 2) [000014] ------------ └──▌ CNS_INT int 12 Fseq[hackishFieldName]
-Moved to late list
-Register placement order: r1 r0
+Register placement order: r0 r1
Seems we allocate an internal register for the following N027 (???,???) [000024] ------------ t24 = ▌ PUTARG_SPLIT struct (40 stackByteSize), (3 numRegs) REG r1,r2,r3 Presumably we don't need any such register and can just use one of the target registers as a temporary scratch register. Since these are argument registers, they can always use thumb encoding. |
We can almost always avoid allocating an internal register, which may be expensive if lr is picked since we cannot use thumb encoding for it. The only case where we need an internal register is when we have a source that is in a register, and we have a single taget register to place that conflicts with that source register. The to-stack copies then need an intermediate scratch register to avoid clobbering the source register. Nit Nit 2
benchmarks.run arm32 regression[14:31:43] 8 (22.22% of base) : 10715.dasm - <>c__DisplayClass23_0:<UpdateStack>b__0(Sigil.Impl.StackTransition):this Diff: ; Assembly listing for method <>c__DisplayClass23_0:<UpdateStack>b__0(Sigil.Impl.StackTransition):this
; Emitting BLENDED_CODE for generic ARM CPU - Unix
; optimized code
; sp based frame
; fully interruptible
; No matching PGO data
; 0 inlinees with PGO data; 3 single block inlinees; 0 inlinees without PGO data
; Final local variable assignments
;
-; V00 this [V00,T00] ( 6, 5 ) ref -> r0 this class-hnd single-def
+; V00 this [V00,T00] ( 6, 5 ) ref -> [sp+04H] this class-hnd single-def
; V01 arg1 [V01,T01] ( 4, 4 ) ref -> r1 class-hnd single-def
;# V02 OutArgs [V02 ] ( 1, 1 ) lclBlk ( 0) [sp+00H] "OutgoingArgSpace"
; V03 tmp1 [V03,T02] ( 2, 4 ) int -> r2 "impAppendStmt"
; V04 cse0 [V04,T03] ( 4, 3 ) ref -> r3 "CSE - aggressive"
;
-; Lcl frame size = 4
+; Lcl frame size = 12
G_M11432_IG01: ; gcVars=00000000 {}, gcrefRegs=0000 {}, byrefRegs=0000 {}, gcvars, byref, nogc <-- Prolog IG
- push {r3,lr}
- ;; size=2 bbWeight=1 PerfScore 1.00
+ push {lr}
+ sub sp, 12
+ ;; size=4 bbWeight=1 PerfScore 2.00
G_M11432_IG02: ; gcrefRegs=0003 {r0 r1}, byrefRegs=0000 {}, byref, isz
; gcrRegs +[r0-r1]
ldr r2, [r0+0x0C]
ldr r3, [r1+0x08]
; gcrRegs +[r3]
ldr r3, [r3+0x04]
; gcrRegs -[r3]
adds r2, r3, r2
str r2, [r0+0x0C]
ldr r3, [r1+0x0C]
; gcrRegs +[r3]
cmp r3, 0
beq SHORT G_M11432_IG05
;; size=16 bbWeight=1 PerfScore 8.00
G_M11432_IG03: ; gcrefRegs=0009 {r0 r3}, byrefRegs=0000 {}, byref
; gcrRegs -[r1]
+ str r0, [sp+0x04]
+ ; GC ptr vars +{V00}
ldrb r2, [r0+0x10]
- ldr r1, [r0+0x04]
- ; gcrRegs +[r1]
ldr r0, [r3+0x04]
+ ldr r1, [sp+0x04] // [V00 this]
+ ; gcrRegs +[r1]
+ ldr r1, [r1+0x04]
ldr r3, [r3+0x0C]
; gcrRegs -[r3]
- ;; size=8 bbWeight=0.50 PerfScore 2.00
+ ;; size=12 bbWeight=0.50 PerfScore 3.00
G_M11432_IG04: ; , epilog, nogc, extend
- add sp, 4
+ add sp, 12
pop lr
bx r3 // hackishMethodName
; gcr arg pop 0
;; size=8 bbWeight=0.50 PerfScore 1.50
G_M11432_IG05: ; gcrefRegs=0000 {}, byrefRegs=0000 {}, byref, epilog, nogc
; gcrRegs -[r0-r1]
- pop {r3,pc}
- ;; size=2 bbWeight=0.50 PerfScore 0.50
+ ; GC ptr vars -{V00}
+ add sp, 12
+ pop {pc}
+ ;; size=4 bbWeight=0.50 PerfScore 1.00
-; Total bytes of code 36, prolog size 2, PerfScore 16.60, instruction count 17, allocated bytes for code 36 (MethodHash=a976d357) for method <>c__DisplayClass23_0:<UpdateStack>b__0(Sigil.Impl.StackTransition):this
+; Total bytes of code 44, prolog size 4, PerfScore 19.90, instruction count 21, allocated bytes for code 44 (MethodHash=a976d357) for method <>c__DisplayClass23_0:<UpdateStack>b__0(Sigil.Impl.StackTransition):this
; ============================================================ Ordering diff: Sorting the arguments:
- [000026] -> #0 (cost 7)
- [000026] [000024] [000032]
- [000024] -> #1 (cost 6)
- [000026] [000024] [000032]
- [000032] -> #2 (last arg)
- [000026] [000024] [000032]
+ Arguments have register interference. Argument order and SCCs:
+ [ [000026] ]
+ [ [000032] [000024] ]
Deferred argument ('r2'):
- ( 7, 4) [000026] ---XG------- ▌ IND bool
- ( 5, 5) [000047] -------N---- └──▌ ADD byref
- ( 3, 2) [000025] ------------ ├──▌ LCL_VAR ref V00 this
- ( 1, 2) [000046] ------------ └──▌ CNS_INT int 16 Fseq[hackishFieldName]
-Moved to late list
-Deferred argument ('r1'):
- ( 6, 3) [000024] ---XG------- ▌ IND ref
- ( 5, 5) [000045] -------N---- └──▌ ADD byref
- ( 3, 2) [000023] ------------ ├──▌ LCL_VAR ref V00 this
- ( 1, 2) [000044] ------------ └──▌ CNS_INT int 4 Fseq[hackishFieldName]
+ [000026] ---XG+------ ▌ IND bool
+ [000047] -----+------ └──▌ ADD byref
+ [000025] -----+------ ├──▌ LCL_VAR ref V00 this
+ [000046] -----+------ └──▌ CNS_INT int 16 Fseq[hackishFieldName]
Moved to late list
Deferred argument ('r0'):
- ( 6, 3) [000032] ---XG------- ▌ IND ref
- ( 5, 5) [000043] -------N---- └──▌ ADD byref
- ( 3, 2) [000020] ------------ ├──▌ LCL_VAR ref V01 arg1
- ( 1, 2) [000042] ------------ └──▌ CNS_INT int 12 Fseq[<Before>k__BackingField]
+ [000032] ---XG+------ ▌ IND ref
+ [000043] -----+------ └──▌ ADD byref
+ [000020] -----+------ ├──▌ LCL_VAR ref V01 arg1
+ [000042] -----+------ └──▌ CNS_INT int 12 Fseq[<Before>k__BackingField]
+Moved to late list
+Deferred argument ('r1'):
+ [000024] ---XG+------ ▌ IND ref
+ [000045] -----+------ └──▌ ADD byref
+ [000023] -----+------ ├──▌ LCL_VAR ref V00 this
+ [000044] -----+------ └──▌ CNS_INT int 4 Fseq[hackishFieldName]
Moved to late list
-Register placement order: r2 r1 r0
+Register placement order: r2 r0 r1 Looks like a cycle. Previously we place the "more expensive" arg first, which is |
benchmarks.run arm32 regression 4 (15.38% of base) : 17304.dasm - System.Collections.CopyTo`1[int]:ImmutableArray():this Diff: ; Assembly listing for method System.Collections.CopyTo`1[int]:ImmutableArray():this
; Emitting BLENDED_CODE for generic ARM CPU - Unix
; optimized code
; sp based frame
; partially interruptible
; No matching PGO data
; 0 inlinees with PGO data; 3 single block inlinees; 0 inlinees without PGO data
; Final local variable assignments
;
; V00 this [V00,T00] ( 4, 4 ) ref -> r0 this class-hnd single-def
;# V01 OutArgs [V01 ] ( 1, 1 ) lclBlk ( 0) [sp+00H] "OutgoingArgSpace"
-; V02 tmp1 [V02,T01] ( 2, 4 ) byref -> r2 single-def "Inlining Arg"
+; V02 tmp1 [V02,T01] ( 2, 4 ) byref -> r1 single-def "Inlining Arg"
;* V03 tmp2 [V03 ] ( 0, 0 ) struct ( 4) zero-ref ld-addr-op "Inline stloc first use temp"
-; V04 tmp3 [V04,T02] ( 2, 4 ) ref -> r1 class-hnd single-def "Inlining Arg"
+; V04 tmp3 [V04,T02] ( 2, 4 ) ref -> r2 class-hnd single-def "Inlining Arg"
;* V05 tmp4 [V05 ] ( 0, 0 ) ref -> zero-ref class-hnd "impAppendStmt"
; V06 tmp5 [V06,T03] ( 3, 3 ) ref -> r0 single-def V03.array(offs=0x00) P-INDEP "field V03.array (fldOffset=0x0)"
-; V07 cse0 [V07,T04] ( 2, 2 ) int -> r2 "CSE - aggressive"
+; V07 cse0 [V07,T04] ( 2, 2 ) int -> r3 "CSE - aggressive"
;
; Lcl frame size = 4
G_M316_IG01: ; gcrefRegs=0000 {}, byrefRegs=0000 {}, byref, nogc <-- Prolog IG
push {r3,lr}
;; size=2 bbWeight=1 PerfScore 1.00
G_M316_IG02: ; gcrefRegs=0001 {r0}, byrefRegs=0000 {}, byref
; gcrRegs +[r0]
- add r2, r0, 20
- ; byrRegs +[r2]
- ldr r1, [r0+0x0C]
+ add r1, r0, 20
+ ; byrRegs +[r1]
+ ldr r2, [r0+0x0C]
+ ; gcrRegs +[r2]
+ ldr r0, [r1]
+ ldr r3, [r0+0x04]
+ mov r1, r2
; gcrRegs +[r1]
- ldr r0, [r2]
- ldr r2, [r0+0x04]
- ; byrRegs -[r2]
+ ; byrRegs -[r1]
+ mov r2, r3
+ ; gcrRegs -[r2]
movw r3, 0xd1ff
movt r3, 0xd1ff
ldr r3, [r3]
blx r3 // hackishMethodName
; gcrRegs -[r0-r1]
; gcr arg pop 0
- ;; size=22 bbWeight=1 PerfScore 8.00
+ ;; size=26 bbWeight=1 PerfScore 10.00
G_M316_IG03: ; , epilog, nogc, extend
pop {r3,pc}
;; size=2 bbWeight=1 PerfScore 1.00
-; Total bytes of code 26, prolog size 2, PerfScore 12.60, instruction count 10, allocated bytes for code 26 (MethodHash=ad42fec3) for method System.Collections.CopyTo`1[int]:ImmutableArray():this
+; Total bytes of code 30, prolog size 2, PerfScore 15.00, instruction count 12, allocated bytes for code 30 (MethodHash=ad42fec3) for method System.Collections.CopyTo`1[int]:ImmutableArray():this Ordering diff: Sorting the arguments:
- [000018] -> #2 (non-struct local)
- [000017] [000042] [000018]
- [000017] -> #1 (non-struct local)
- [000042] [000017] [000018]
- [000042] -> #0 (last arg)
- [000042] [000017] [000018]
-Deferred argument ('r2'):
- [000042] ---X-+------ ▌ ARR_LENGTH int
- [000041] -----+------ └──▌ LCL_VAR ref V06 tmp5
-Moved to late list
+ No interference found between arguments.
Deferred argument ('r0'):
[000017] -----+------ ▌ LCL_VAR ref V06 tmp5
Moved to late list
Deferred argument ('r1'):
[000018] -----+------ ▌ LCL_VAR ref V04 tmp3
Moved to late list
+Deferred argument ('r2'):
+ [000042] ---X-+------ ▌ ARR_LENGTH int
+ [000041] -----+------ └──▌ LCL_VAR ref V06 tmp5
+Moved to late list
-Register placement order: r2 r0 r1
+Register placement order: r0 r1 r2 Just seems like slightly worse RA, instructions here are otherwise in the same order as before so the only thing added is some extra resolution moves. |
benchmarks.run arm32 regression 10 (12.20% of base) : 49443.dasm - System.Collections.CopyTo`1[int]:Span():this ; Assembly listing for method System.Collections.CopyTo`1[int]:Span():this
; Emitting BLENDED_CODE for generic ARM CPU - Unix
; optimized code
; r11 based frame
; partially interruptible
; No matching PGO data
; 0 inlinees with PGO data; 4 single block inlinees; 5 inlinees without PGO data
; Final local variable assignments
;
; V00 this [V00,T00] ( 4, 4 ) ref -> r0 this class-hnd single-def
;* V01 loc0 [V01 ] ( 0, 0 ) struct ( 8) zero-ref ld-addr-op
;# V02 OutArgs [V02 ] ( 1, 1 ) lclBlk ( 0) [sp+00H] "OutgoingArgSpace"
;* V03 tmp1 [V03 ] ( 0, 0 ) struct ( 8) zero-ref ld-addr-op "NewObj constructor temp"
;* V04 tmp2 [V04 ] ( 0, 0 ) struct ( 8) zero-ref ld-addr-op "NewObj constructor temp"
-; V05 tmp3 [V05,T01] ( 4, 6 ) ref -> r2 class-hnd single-def "Inlining Arg"
-; V06 tmp4 [V06,T02] ( 4, 6 ) ref -> r2 class-hnd single-def "Inlining Arg"
+; V05 tmp3 [V05,T01] ( 4, 6 ) ref -> r1 class-hnd single-def "Inlining Arg"
+; V06 tmp4 [V06,T02] ( 4, 6 ) ref -> r0 class-hnd single-def "Inlining Arg"
;* V07 tmp5 [V07 ] ( 0, 0 ) struct ( 8) zero-ref ld-addr-op "Inlining Arg"
;* V08 tmp6 [V08 ] ( 0, 0 ) int -> zero-ref "impAppendStmt"
;* V09 tmp7 [V09 ] ( 0, 0 ) byref -> zero-ref single-def "Inlining Arg"
;* V10 tmp8 [V10 ] ( 0, 0 ) byref -> zero-ref single-def "Inlining Arg"
;* V11 tmp9 [V11 ] ( 0, 0 ) int -> zero-ref "Inlining Arg"
;* V12 tmp10 [V12 ] ( 0, 0 ) int -> zero-ref "Inlining Arg"
; V13 tmp11 [V13,T08] ( 2, 2 ) byref -> r1 single-def V01._reference(offs=0x00) P-INDEP "field V01._reference (fldOffset=0x0)"
-; V14 tmp12 [V14,T03] ( 3, 3 ) int -> r3 single-def V01._length(offs=0x04) P-INDEP "field V01._length (fldOffset=0x4)"
-; V15 tmp13 [V15,T04] ( 3, 2 ) byref -> r1 V03._reference(offs=0x00) P-INDEP "field V03._reference (fldOffset=0x0)"
-; V16 tmp14 [V16,T06] ( 3, 2 ) int -> r3 V03._length(offs=0x04) P-INDEP "field V03._length (fldOffset=0x4)"
-; V17 tmp15 [V17,T05] ( 3, 2 ) byref -> r0 V04._reference(offs=0x00) P-INDEP "field V04._reference (fldOffset=0x0)"
-; V18 tmp16 [V18,T07] ( 3, 2 ) int -> lr V04._length(offs=0x04) P-INDEP "field V04._length (fldOffset=0x4)"
+; V14 tmp12 [V14,T03] ( 3, 3 ) int -> r2 single-def V01._length(offs=0x04) P-INDEP "field V01._length (fldOffset=0x4)"
+; V15 tmp13 [V15,T04] ( 3, 2 ) byref -> registers V03._reference(offs=0x00) P-INDEP "field V03._reference (fldOffset=0x0)"
+; V16 tmp14 [V16,T06] ( 3, 2 ) int -> registers V03._length(offs=0x04) P-INDEP "field V03._length (fldOffset=0x4)"
+; V17 tmp15 [V17,T05] ( 3, 2 ) byref -> registers V04._reference(offs=0x00) P-INDEP "field V04._reference (fldOffset=0x0)"
+; V18 tmp16 [V18,T07] ( 3, 2 ) int -> registers V04._length(offs=0x04) P-INDEP "field V04._length (fldOffset=0x4)"
; V19 tmp17 [V19,T09] ( 2, 2 ) byref -> r0 single-def V07._reference(offs=0x00) P-INDEP "field V07._reference (fldOffset=0x0)"
;* V20 tmp18 [V20 ] ( 0, 0 ) int -> zero-ref single-def V07._length(offs=0x04) P-INDEP "field V07._length (fldOffset=0x4)"
;
-; Lcl frame size = 0
+; Lcl frame size = 4
G_M42859_IG01: ; gcrefRegs=0000 {}, byrefRegs=0000 {}, byref, nogc <-- Prolog IG
- push {r11,lr}
- mov r11, sp
- ;; size=6 bbWeight=1 PerfScore 2.00
+ push {r3,r4,r11,lr}
+ add r11, sp, 8
+ ;; size=8 bbWeight=1 PerfScore 2.00
G_M42859_IG02: ; gcrefRegs=0001 {r0}, byrefRegs=0000 {}, byref, isz
; gcrRegs +[r0]
- ldr r2, [r0+0x04]
- ; gcrRegs +[r2]
- cmp r2, 0
+ ldr r1, [r0+0x04]
+ ; gcrRegs +[r1]
+ cmp r1, 0
bne SHORT G_M42859_IG04
;; size=6 bbWeight=1 PerfScore 3.00
G_M42859_IG03: ; gcrefRegs=0001 {r0}, byrefRegs=0000 {}, byref, isz
- ; gcrRegs -[r2]
+ ; gcrRegs -[r1]
movs r1, 0
- movs r3, 0
+ movs r2, 0
b SHORT G_M42859_IG05
;; size=6 bbWeight=0.50 PerfScore 1.50
-G_M42859_IG04: ; gcrefRegs=0005 {r0 r2}, byrefRegs=0000 {}, byref
- ; gcrRegs +[r2]
- add r1, r2, 8
+G_M42859_IG04: ; gcrefRegs=0003 {r0 r1}, byrefRegs=0000 {}, byref
+ ; gcrRegs +[r1]
+ add r2, r1, 8
+ ; byrRegs +[r2]
+ ldr r1, [r1+0x04]
+ ; gcrRegs -[r1]
+ mov r3, r1
+ mov r1, r2
; byrRegs +[r1]
- ldr r3, [r2+0x04]
- ;; size=6 bbWeight=0.50 PerfScore 1.00
+ mov r2, r3
+ ; byrRegs -[r2]
+ ;; size=12 bbWeight=0.50 PerfScore 2.50
G_M42859_IG05: ; gcrefRegs=0001 {r0}, byrefRegs=0002 {r1}, byref, isz
- ; gcrRegs -[r2]
- ldr r2, [r0+0x0C]
- ; gcrRegs +[r2]
- cmp r2, 0
+ ldr r0, [r0+0x0C]
+ cmp r0, 0
bne SHORT G_M42859_IG07
;; size=6 bbWeight=1 PerfScore 3.00
G_M42859_IG06: ; gcrefRegs=0000 {}, byrefRegs=0002 {r1}, byref, isz
- ; gcrRegs -[r0 r2]
+ ; gcrRegs -[r0]
movs r0, 0
- mov lr, 0
+ movs r3, 0
b SHORT G_M42859_IG08
- ;; size=8 bbWeight=0.50 PerfScore 1.50
-G_M42859_IG07: ; gcrefRegs=0004 {r2}, byrefRegs=0002 {r1}, byref
- ; gcrRegs +[r2]
- add r0, r2, 8
+ ;; size=6 bbWeight=0.50 PerfScore 1.50
+G_M42859_IG07: ; gcrefRegs=0001 {r0}, byrefRegs=0002 {r1}, byref
+ ; gcrRegs +[r0]
+ add r3, r0, 8
+ ; byrRegs +[r3]
+ ldr r0, [r0+0x04]
+ ; gcrRegs -[r0]
+ mov r4, r0
+ mov r0, r3
; byrRegs +[r0]
- ldr lr, [r2+0x04]
- ;; size=8 bbWeight=0.50 PerfScore 1.00
+ mov r3, r4
+ ; byrRegs -[r3]
+ ;; size=12 bbWeight=0.50 PerfScore 2.50
G_M42859_IG08: ; gcrefRegs=0000 {}, byrefRegs=0003 {r0 r1}, byref, isz
- ; gcrRegs -[r2]
- cmp r3, lr
+ cmp r2, r3
bhi SHORT G_M42859_IG10
- lsls r2, r3, 2
+ lsls r2, r2, 2
movw r3, 0xd1ff
movt r3, 0xd1ff
ldr r3, [r3]
blx r3 // hackishMethodName
; byrRegs -[r0-r1]
;; size=18 bbWeight=1 PerfScore 7.00
G_M42859_IG09: ; , epilog, nogc, extend
- pop {r11,pc}
+ pop {r3,r4,r11,pc}
;; size=4 bbWeight=1 PerfScore 1.00
G_M42859_IG10: ; gcVars=00000000 {}, gcrefRegs=0000 {}, byrefRegs=0000 {}, gcvars, byref
movw r3, 0xd1ff
movt r3, 0xd1ff
ldr r3, [r3]
blx r3 // System.ThrowHelper:ThrowArgumentException_DestinationTooShort()
bkpt
;; size=14 bbWeight=0 PerfScore 0.00
-; Total bytes of code 82, prolog size 6, PerfScore 29.20, instruction count 31, allocated bytes for code 82 (MethodHash=a24e5894) for method System.Collections.CopyTo`1[int]:Span():this
+; Total bytes of code 92, prolog size 8, PerfScore 33.20, instruction count 37, allocated bytes for code 92 (MethodHash=a24e5894) for method System.Collections.CopyTo`1[int]:Span():this Sorting the arguments:
- [000130] -> #2 (non-struct local)
- [000129] [000133] [000130]
- [000129] -> #1 (non-struct local)
- [000133] [000129] [000130]
- [000133] -> #0 (last arg)
- [000133] [000129] [000130]
-Deferred argument ('r2'):
- [000133] -----+------ ▌ LSH int
- [000131] -----+------ ├──▌ LCL_VAR int V14 tmp12
- [000132] -----+------ └──▌ CNS_INT int 2
-Moved to late list
+ No interference found between arguments.
Deferred argument ('r0'):
[000129] -----+------ ▌ LCL_VAR byref V19 tmp17
Moved to late list
Deferred argument ('r1'):
[000130] -----+------ ▌ LCL_VAR byref V13 tmp11
Moved to late list
+Deferred argument ('r2'):
+ [000133] -----+------ ▌ LSH int
+ [000131] -----+------ ├──▌ LCL_VAR int V14 tmp12
+ [000132] -----+------ └──▌ CNS_INT int 2
+Moved to late list
-Register placement order: r2 r0 r1
+Register placement order: r0 r1 r2 A small change that seems to lead to overall globally worse register allocation. Thankfully not inside the main |
…ARG_SPLIT" This reverts commit a54c113.
benchmarks.run linux-x64 18 (78.26% of base) : 1413.dasm - System.ValueTuple`2[int,int]:GetHashCode():int:this ;# V03 OutArgs [V03 ] ( 1, 1 ) lclBlk ( 0) [rsp+00H] "OutgoingArgSpace"
;* V04 tmp1 [V04 ] ( 0, 0 ) byref -> zero-ref single-def "impAppendStmt"
;* V05 tmp2 [V05 ] ( 0, 0 ) byref -> zero-ref
-; V06 tmp3 [V06,T02] ( 2, 2 ) int -> rax
-; V07 tmp4 [V07,T01] ( 2, 4 ) byref -> rdi single-def "impAppendStmt"
+; V06 tmp3 [V06,T02] ( 2, 2 ) int -> rsi
+; V07 tmp4 [V07,T01] ( 2, 4 ) byref -> [rbp-08H] single-def "impAppendStmt"
;* V08 tmp5 [V08 ] ( 0, 0 ) int -> zero-ref
;* V09 tmp6 [V09 ] ( 0, 0 ) byref -> zero-ref
;* V10 tmp7 [V10 ] ( 0, 0 ) int -> zero-ref
;* V11 tmp8 [V11 ] ( 0, 0 ) int -> zero-ref
;
-; Lcl frame size = 0
+; Lcl frame size = 16
G_M63581_IG01: ; gcrefRegs=00000000 {}, byrefRegs=00000000 {}, byref, nogc <-- Prolog IG
push rbp
- mov rbp, rsp
- ;; size=4 bbWeight=1 PerfScore 1.25
+ sub rsp, 16
+ lea rbp, [rsp+10H]
+ ;; size=10 bbWeight=1 PerfScore 1.75
G_M63581_IG02: ; gcrefRegs=00000000 {}, byrefRegs=00000080 {rdi}, byref
; byrRegs +[rdi]
- mov eax, dword ptr [rdi]
- add rdi, 4
mov esi, dword ptr [rdi]
- mov edi, eax
+ add rdi, 4
+ mov bword ptr [rbp-08H], rdi
+ ; GC ptr vars +{V07}
+ mov edi, esi
; byrRegs -[rdi]
+ mov rsi, bword ptr [rbp-08H]
+ ; byrRegs +[rsi]
+ mov esi, dword ptr [rsi]
+ ; byrRegs -[rsi]
+ ; GC ptr vars -{V07}
call [System.HashCode:Combine[int,int](int,int):int]
nop
- ;; size=17 bbWeight=1 PerfScore 7.75
+ ;; size=25 bbWeight=1 PerfScore 9.75
G_M63581_IG03: ; , epilog, nogc, extend
+ add rsp, 16
pop rbp
ret
- ;; size=2 bbWeight=1 PerfScore 1.50
+ ;; size=6 bbWeight=1 PerfScore 1.75
-; Total bytes of code 23, prolog size 4, PerfScore 12.80, instruction count 10, allocated bytes for code 23 (MethodHash=47bb07a2) for method System.ValueTuple`2[int,int]:GetHashCode():int:this
+; Total bytes of code 41, prolog size 10, PerfScore 17.35, instruction count 14, allocated bytes for code 41 (MethodHash=47bb07a2) for method System.ValueTuple`2[int,int]:GetHashCode():int:this Sorting the arguments:
- [000024] -> #1 (non-struct local)
- [000063] [000024]
- [000063] -> #0 (last arg)
- [000063] [000024]
+ No interference found between arguments.
+Deferred argument ('rdi'):
+ [000024] -----+----- ▌ LCL_VAR int V06 tmp3
+Moved to late list
Deferred argument ('rsi'):
[000063] ---XG+----- ▌ IND int
[000035] -----+----- └──▌ LCL_VAR byref V07 tmp4
Moved to late list
-Deferred argument ('rdi'):
- [000024] -----+----- ▌ LCL_VAR int V06 tmp3
-Moved to late list
-Register placement order: rsi rdi
+Register placement order: rdi rsi
Another case where we were lucky to get the order right before -- LSRA allocates |
#76671 gets a lot of the diffs, but the diffs of this is still quite large, especially on win-x64. I'm investigating some of the regressions with that change in now. win-x64 benchmarks regression17 (70.83% of base) : 3299.dasm - System.Xml.Schema.XmlSchemaObjectCollection:OnInsert(int,System.Object):this Diff: @@ -6,40 +6,48 @@
; No matching PGO data
; Final local variable assignments
;
-; V00 this [V00,T00] ( 4, 3.50) ref -> rdx this class-hnd single-def
+; V00 this [V00,T00] ( 4, 3.50) ref -> rcx this class-hnd single-def
;* V01 arg1 [V01 ] ( 0, 0 ) int -> zero-ref single-def
; V02 arg2 [V02,T01] ( 3, 2.50) ref -> r8 class-hnd single-def
;# V03 OutArgs [V03 ] ( 1, 1 ) lclBlk ( 0) [rsp+00H] "OutgoingArgSpace"
-; V04 tmp1 [V04,T02] ( 2, 4 ) ref -> rcx class-hnd single-def "dup spill"
-; V05 tmp2 [V05,T03] ( 4, 3 ) ref -> rcx single-def
+; V04 tmp1 [V04,T02] ( 2, 4 ) ref -> rdx class-hnd single-def "dup spill"
+; V05 tmp2 [V05,T03] ( 4, 3 ) ref -> [rsp+00H] spill-single-def
;
-; Lcl frame size = 0
+; Lcl frame size = 8
G_M49570_IG01: ; gcVars=0000000000000000 {}, gcrefRegs=00000000 {}, byrefRegs=00000000 {}, gcvars, byref, nogc <-- Prolog IG
- mov rdx, rcx
+ push rax
+ ;; size=1 bbWeight=1 PerfScore 1.00
+G_M49570_IG02: ; gcrefRegs=00000102 {rcx r8}, byrefRegs=00000000 {}, byref, isz
+ ; gcrRegs +[rcx r8]
+ mov rdx, gword ptr [rcx+10H]
; gcrRegs +[rdx]
- ;; size=3 bbWeight=1 PerfScore 0.25
-G_M49570_IG02: ; gcrefRegs=00000104 {rdx r8}, byrefRegs=00000000 {}, byref, isz
- ; gcrRegs +[r8]
- mov rcx, gword ptr [rdx+10H]
- ; gcrRegs +[rcx]
- test rcx, rcx
+ mov gword ptr [rsp], rdx
+ ; GC ptr vars +{V05}
+ test rdx, rdx
jne SHORT G_M49570_IG04
- ;; size=9 bbWeight=1 PerfScore 3.25
+ ;; size=13 bbWeight=1 PerfScore 4.25
G_M49570_IG03: ; gcrefRegs=00000000 {}, byrefRegs=00000000 {}, byref, epilog, nogc
; gcrRegs -[rcx rdx r8]
+ ; GC ptr vars -{V05}
+ add rsp, 8
ret
- ;; size=1 bbWeight=0.50 PerfScore 0.50
-G_M49570_IG04: ; gcVars=0000000000000000 {}, gcrefRegs=00000106 {rcx rdx r8}, byrefRegs=00000000 {}, gcvars, byref
- ; gcrRegs +[rcx rdx r8]
+ ;; size=5 bbWeight=0.50 PerfScore 0.62
+G_M49570_IG04: ; gcVars=0000000000000008 {V05}, gcrefRegs=00000102 {rcx r8}, byrefRegs=00000000 {}, gcvars, byref
+ ; gcrRegs +[rcx r8]
+ ; GC ptr vars +{V05}
+ mov rdx, rcx
+ ; gcrRegs +[rdx]
+ mov rcx, gword ptr [rsp]
mov rax, qword ptr [rcx]
mov rax, qword ptr [rax+48H]
- ;; size=7 bbWeight=0.50 PerfScore 2.00
+ ;; size=14 bbWeight=0.50 PerfScore 2.62
G_M49570_IG05: ; , epilog, nogc, extend
+ add rsp, 8
tail.jmp [rax+20H]System.Xml.Schema.XmlSchemaObject:hackishMethodName
- ;; size=4 bbWeight=0.50 PerfScore 1.00
+ ;; size=8 bbWeight=0.50 PerfScore 1.12
New arg placement order: @@ -1,18 +1,16 @@
Sorting the arguments:
- [000015] -> #2 (non-struct local)
- [000013] [000014] [000015]
- [000014] -> #1 (non-struct local)
- [000013] [000014] [000015]
- [000013] -> #0 (non-struct local)
- [000013] [000014] [000015]
-Deferred argument ('rcx'):
- [000013] -----+----- ▌ LCL_VAR ref V05 tmp2
-Moved to late list
+ Arguments have register interference. Argument order and SCCs:
+ [ [000014] ]
+ [ [000013] ]
+ [ [000015] ]
Deferred argument ('rdx'):
[000014] -----+----- ▌ LCL_VAR ref V00 this
Moved to late list
+Deferred argument ('rcx'):
+ [000013] -----+----- ▌ LCL_VAR ref V05 tmp2
+Moved to late list
Deferred argument ('r8'):
[000015] -----+----- ▌ LCL_VAR ref V02 arg2
Moved to late list
-Register placement order: rcx rdx r8
+Register placement order: rdx rcx r8 LSRA then realizes that it can leave Last use of V05 between PUTARG and CALL. Removing occupied arg regs from preferences: [rdx r8] Will need to look at bit closer at this one to figure out why the preferencing does not kick in. It seems like allocating |
Just realized, that we should also print the new preference set on this line, just to make things clear.
This are all the refpositions for that Interval
Although here we say that preference is just
So, check when allocating for the |
Draft Pull Request was automatically closed for 30 days of inactivity. Please let us know if you'd like to reopen it. |
No description provided.