Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

loop cloning and pgo #48850

Open
AndyAyersMS opened this issue Feb 26, 2021 · 6 comments
Open

loop cloning and pgo #48850

AndyAyersMS opened this issue Feb 26, 2021 · 6 comments
Assignees
Labels
area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI
Milestone

Comments

@AndyAyersMS
Copy link
Member

AndyAyersMS commented Feb 26, 2021

Loop cloning is not a good citizen when it comes to PGO. There are two aspects to this:

  1. Cloning heuristics should take PGO into account (along with other things). The benefit of cloning is in part related to the relative weight of the loop body, which is knowable with PGO. It is also related to the relative trip count of the loop, which is also knowable with PGO. So a loop that executes a lot and iterates a lot is a better cloning candidate than a loop that either doesn't execute a lot or doesn't iterate a lot. Within the loop the relative frequency of the operations that can be bypassed or further optimized with cloning should also be taken into account (eg if there's a bounds check but only under some rarely true condition, the benefit of cloning is reduced).
  2. Cloning does not make appropriate updates to profile data (see example below). One would imagine that post cloning the profile data should show that we now think the code is more likely to execute the cloned loop; otherwise why would we have cloned in the first place? Exactly how to split up the profile data is perhaps an open question. In cases like ours where the focus is on bounds checks and control flow checks are implicit there is no obvious signal we can measure (short of tracking how often exceptions are thrown) and we may have to simply guess; something like a 0.1/0.9 split seems plausible but we might consider even more biased splits.

If we start cloning to enable explicit control flow optimization (aka "loop unswitching") then we will have data telling us roughly how to divide up the flow. For example if the loop body contains a loop-invariant test and branch (or a test that can be rendered loop invariant with a suitable upfront check) we will know how often the branches were taken from PGO data and that tells us how to split the profile data.

An example of what happens today:

BeforeAfter

Note how cloning duplicates the block weights for the loop blocks, doesn't set reasonable weights for the new blocks it adds, and messes up the edge weights even in some "remote" parts of the flow graph (suspect this is because fgUpdateChangedFlowGraph recomputes the pred lists, which seems a bit odd).

cc @BruceForstall

category:cq
theme:loop-opt
skill-level:expert
cost:large
impact:medium

@dotnet-issue-labeler
Copy link

I couldn't figure out the best area label to add to this issue. If you have write-permissions please help me learn by adding exactly one area label.

@dotnet-issue-labeler dotnet-issue-labeler bot added the untriaged New issue has not been triaged by the area owner label Feb 26, 2021
@AndyAyersMS AndyAyersMS added area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI and removed untriaged New issue has not been triaged by the area owner labels Feb 26, 2021
@BruceForstall BruceForstall added this to the 6.0.0 milestone Feb 26, 2021
@AndyAyersMS
Copy link
Member Author

The extra calls to fgComputePreds seem like the could be avoided. Currently they are masking some kind of faulty pred list maintenance in places, and a general failure to properly maintain BBF_JUMP_TARGET. Both of those seem fixable and then we can remove the need to rebuild pred lists (and thereby preserve the edge weights).

Sample of spmi failures if I suppress the recomputations:

ISSUE: <ASSERT> C:\repos\runtime1\src\coreclr\jit\emitxarch.cpp (6775) - Assertion failed 'dst->bbFlags & BBF_JMP_TARGET' in 'System.Text.EncodingProvider:GetEncodingFromProvider(int):System.Text.Encoding' during 'Generate code' (IL size 51)
ISSUE: <ASSERT> C:\repos\runtime1\src\coreclr\jit\emitxarch.cpp (6775) - Assertion failed 'dst->bbFlags & BBF_JMP_TARGET' in 'System.RuntimeType:IsSubclassOf(System.Type):bool:this' during 'Generate code' (IL size 94)
ISSUE: <ASSERT> C:\repos\runtime1\src\coreclr\jit\fgdiagnostic.cpp (1807) - Assertion failed 'blockPred->bbNext == block || blockPred->bbJumpDest == block' in 'MemberInfoCache`1[__Canon][System.__Canon]:MergeWithGlobalList(System.__Canon[]):this' during 'Clone loops' (IL size 256)
ISSUE: <ASSERT> C:\repos\runtime1\src\coreclr\jit\emitxarch.cpp (6775) - Assertion failed 'dst->bbFlags & BBF_JMP_TARGET' in 'System.Reflection.RuntimeMethodInfo:get_ContainsGenericParameters():bool:this' during 'Generate code' (IL size 74)
ISSUE: <ASSERT> C:\repos\runtime1\src\coreclr\jit\fgdiagnostic.cpp (1807) - Assertion failed 'blockPred->bbNext == block || blockPred->bbJumpDest == block' in 'System.RuntimeType:MakeGenericType(System.Type[]):System.Type:this' during 'Clone loops' (IL size 315)
ISSUE: <ASSERT> C:\repos\runtime1\src\coreclr\jit\emitxarch.cpp (6775) - Assertion failed 'dst->bbFlags & BBF_JMP_TARGET' in 'System.String:IsNullOrWhiteSpace(System.String):bool' during 'Generate code' (IL size 40)
ISSUE: <ASSERT> C:\repos\runtime1\src\coreclr\jit\fgdiagnostic.cpp (1807) - Assertion failed 'blockPred->bbNext == block || blockPred->bbJumpDest == block' in 'System.Linq.EnumerableSorter`2[__Canon,Int32][System.__Canon,System.Int32]:ComputeKeys(System.__Canon[],int):this' during 'Clone loops' (IL size 73)
ISSUE: <ASSERT> C:\repos\runtime1\src\coreclr\jit\fgdiagnostic.cpp (1811) - Assertion failed 'blockPred->bbNext == block' in 'System.Number:FindSection(System.ReadOnlySpan`1[Char],int):int' during 'Find loops' (IL size 190)

Problematic phases for pred list maintenance seem to be "Find loops", "Unroll loops", and "Clone loops".

Not as clear yet where we're dropping BBF_JUMP_TARGET updates. We should add checking for this to the post-phase checks.

@EgorBo
Copy link
Member

EgorBo commented Feb 27, 2021

Shouldn't all blocks of the cloned loop have zero weight by default even without the PGO?

@EgorBo
Copy link
Member

EgorBo commented Feb 27, 2021

@AndyAyersMS same for CloneFinally optimization I believe:

    [MethodImpl(MethodImplOptions.NoInlining)]
    static void DoWork() { }

    [MethodImpl(MethodImplOptions.NoInlining)]
    static void DoWorkFinally() { }


    [MethodImpl(MethodImplOptions.NoInlining)]
    static void Test()
    {
        try
        {
            DoWork();
        }
        finally
        {
            DoWorkFinally();
        }
    }

Tier0->Tier1 + TieredPGO:

G_M24707_IG01:              ;; offset=0000H
       55                   push     rbp
       4883EC30             sub      rsp, 48
       488D6C2430           lea      rbp, [rsp+30H]
       488965F0             mov      qword ptr [rbp-10H], rsp
                                                ;; bbWeight=1    PerfScore 2.75
G_M24707_IG02:              ;; offset=000EH
       E8050BFEFF           call     Program:DoWork()
       90                   nop
                                                ;; bbWeight=1    PerfScore 1.25
G_M24707_IG03:              ;; offset=0014H
       E8070BFEFF           call     Program:DoWorkFinally()
       90                   nop
                                                ;; bbWeight=1    PerfScore 1.25
G_M24707_IG04:              ;; offset=001AH
       488D6500             lea      rsp, [rbp]
       5D                   pop      rbp
       C3                   ret
                                                ;; bbWeight=1    PerfScore 2.00
G_M24707_IG05:              ;; offset=0020H
       55                   push     rbp
       4883EC30             sub      rsp, 48
       488B6920             mov      rbp, qword ptr [rcx+32]
       48896C2420           mov      qword ptr [rsp+20H], rbp
       488D6D30             lea      rbp, [rbp+30H]
                                                ;; bbWeight=1    PerfScore 4.75  <-- bbWeight should be zero
G_M24707_IG06:              ;; offset=0032H
       E8E90AFEFF           call     Program:DoWorkFinally()
       90                   nop
                                                ;; bbWeight=1    PerfScore 1.25  <-- bbWeight should be zero
G_M24707_IG07:              ;; offset=0038H
       4883C430             add      rsp, 48
       5D                   pop      rbp
       C3                   ret
                                                ;; bbWeight=1    PerfScore 1.75  <-- bbWeight should be zero

(with TieredPGO=0 weights look better)

@AndyAyersMS
Copy link
Member Author

@EgorBo not sure what the right division is for cloning. Setting the unoptimized loops counts to zero may be too drastic; we still want to lightly optimize it in case that's where the program goes at runtime.

Good point about finally cloning -- this isn't caught by the consistency checker as it considers each EH handler more or less as its own separate graph. Also note we can't always divert all the non-eh flow to the clone.... so we need to do some math and figure out the right scaling here. Likely something like: scale the finally clone to match the incoming flow, scale the original finally to preserve overall count balance for non EH. If there is no other non-EH flow or all other non-EH flow is zero, then perhaps still keep a bit of residual profile around so we can lightly optimize the original finally.

@BruceForstall
Copy link
Member

Loop cloning no longer recomputes preds lists: #51757

Loop cloning now scales block weights for the cloned (slow path) loop, at 1%/99% ratio: #51901

Cloning currently doesn't change / scale any edge weights.

The only benefit analysis cloning does is to determine if there are loop optimizations that would be done if cloning occurs, namely, removal of array bounds checks. Adding other considerations, such as PGO weight considerations, and comparing cost against benefit, can still be done.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI
Projects
Status: Optimizations
Development

No branches or pull requests

3 participants