[NativeAOT/ARM64] Generate frames compatible with Apple compact unwinding #107766

filipnavara · 2024-09-12T21:18:44Z

Contributes to #76371

There are two changes in the PR that work in tandem. The first is a JIT change that generates a slightly different frame layout on NativeAOT/ARM64/Apple platforms (iOS, macOS and tvOS). The second is ObjWriter code that recognizes structurally compatible unwinding information and emits compact unwinding codes in the object files instead of verbose DWARF unwinding information.
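
To illustrate what "structurally compatible" means, here is a minimal Python sketch of the ARM64 frame-mode compact encoding. The constants are the ones published in Apple's `compact_unwind_encoding.h`; the helper `encode_frame` is hypothetical and is not the ObjWriter code in this PR. It also ignores the requirement that the pairs sit in a fixed order next to FP/LR.

```python
# Mode constants from Apple's compact_unwind_encoding.h (ARM64).
UNWIND_ARM64_MODE_DWARF = 0x03000000
UNWIND_ARM64_MODE_FRAME = 0x04000000

# Frame mode has one bit per callee-saved register pair.
FRAME_PAIR_BITS = {
    ("x19", "x20"): 0x001,
    ("x21", "x22"): 0x002,
    ("x23", "x24"): 0x004,
    ("x25", "x26"): 0x008,
    ("x27", "x28"): 0x010,
    ("d8", "d9"): 0x100,
    ("d10", "d11"): 0x200,
    ("d12", "d13"): 0x400,
    ("d14", "d15"): 0x800,
}


def encode_frame(saved_pairs):
    """Hypothetical helper: build a frame-mode code from saved register pairs.

    Returns the bare DWARF mode value when a pair has no bit in the encoding.
    A real DWARF entry would also carry an __eh_frame offset in the low bits.
    """
    code = UNWIND_ARM64_MODE_FRAME
    for pair in saved_pairs:
        bit = FRAME_PAIR_BITS.get(tuple(pair))
        if bit is None:
            return UNWIND_ARM64_MODE_DWARF
        code |= bit
    return code
```

For example, `encode_frame([("x19", "x20"), ("d8", "d9")])` fits in a single 32-bit code, while a save of `("x20", "x21")` cannot be expressed and falls back to DWARF. This is why the JIT side has to save registers in the pairs the encoding expects.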

For NativeAOT/ARM64/Apple ABI do the following:

- Save callee-saved registers in opposite order and in pairs.
- Prefer saving FP/LR at the top of the frame. Heuristics are used to avoid worse code quality outside of the prolog/epilog due to addressing range limits of the ARM64 instruction set.
- Add an optimization to lvaFrameAddress that rewrites FP-x references to SP+y when possible. This allows efficient addressing using positive offsets when FP points to the top of the frame. It mimics a similar optimization on ARM32.
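
The addressing range limits behind the last item can be sketched as follows. This is only an illustration in Python, not the lvaFrameAddress logic: the function names are hypothetical, and it assumes the distance between SP and FP is fixed for the whole method body. The ranges are those of the A64 load/store immediate forms (scaled unsigned 12-bit offset for LDR/STR, unscaled signed 9-bit offset for LDUR/STUR).

```python
def fits_scaled(offset, size=8):
    """LDR/STR unsigned-offset form: non-negative multiple of size, imm12 scaled."""
    return offset >= 0 and offset % size == 0 and offset // size <= 4095


def fits_unscaled(offset):
    """LDUR/STUR form: signed 9-bit byte offset."""
    return -256 <= offset <= 255


def single_instruction(offset, size=8):
    """True if base+offset can be addressed by one load/store instruction."""
    return fits_scaled(offset, size) or fits_unscaled(offset)


def rewrite_fp_to_sp(fp_offset, fp_minus_sp):
    """Hypothetical rewrite: FP + fp_offset equals SP + (fp_minus_sp + fp_offset)."""
    return fp_minus_sp + fp_offset
```

With FP at the top of a 2048-byte frame, a local at FP-1024 is out of reach of the unscaled form (`single_instruction(-1024)` is false), but the same slot expressed as SP+1024 is encodable directly.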

Each of these changes comes with some caveats:

- Saving registers in pairs may use 16 more bytes of stack space. The ARM64 stack has to be 16-byte aligned. If the prolog previously saved an odd number of integer callee-saved registers and an odd number of floating-point callee-saved registers, we now generate one more 16-byte stack slot. This seems to be a rather rare occurrence. It does not affect the number of instructions used. Likewise, the changed order doesn't have any impact on code size.
- Saving FP/LR at the top of the frame may cause one additional instruction in the prolog if any local or temporary variables are used. Additionally, addressing using negative offsets from FP is less efficient than addressing with positive offsets, since the ARM64 instruction encoding can only address negative offsets with the non-scaled encoding, which significantly reduces the addressable range. This is mostly mitigated by the optimization in lvaFrameAddress. The increase in prolog size also has a cascading effect where loop alignment and related code alignment (32-byte alignment for method start) significantly contribute to the code section size. In some cases the same alignment logic may also reduce the size.
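
The stack cost of saving in pairs can be worked through with a small Python model. It is a simplification under stated assumptions: every callee-saved register takes 8 bytes, FP/LR are left out, and the "packed" function is only a stand-in for the previous layout, not a description of the actual JIT code.

```python
def align16(n):
    """Round n up to the 16-byte stack alignment required on ARM64."""
    return (n + 15) & ~15


def save_area_packed(n_int, n_fp):
    """Assumed previous layout: registers packed back to back, then aligned."""
    return align16(8 * (n_int + n_fp))


def save_area_paired(n_int, n_fp):
    """Paired layout: each register class is rounded up to whole 16-byte pairs."""
    pairs = (n_int + 1) // 2 + (n_fp + 1) // 2
    return 16 * pairs
```

With 3 integer and 1 floating-point register the packed area is 32 bytes and the paired area is 48 bytes. With 3 and 2, or 2 and 2, both layouts need the same space, so the extra slot only shows up when both counts are odd.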

As you can see, this is a bit of a trade-off and it's valid to ask if it's worth it.

Firstly, let me address the size question. Loop alignment seems to be the biggest contributor to the code size variation, and I filed issue #107284 to investigate whether we can come up with better defaults for Apple platforms. The code size changes are predominantly limited to the small fixed change in the prolog size and the additional alignment.

When testing the prototype on MAUI apps, the biggest contributor to increased code size was the code generated from XAML, which produces methods with a ton of local variables. The change in lvaFrameAddress practically eliminated any negative effect of the frame layout on this type of code. Without that change the regression was nearly 50%, because stack references needed an indirect load through a register and an extra instruction. Outside of MAUI, the biggest visible effect is on code from the Regex source generator, but in that case it's quite evenly split between size improvements and regressions. That suggests the regex code may be a good candidate for measuring the performance characteristics of loop alignment.

The code size regressions on the few examples I tried amount to around 2% (including the alignment). The DWARF unwinding section shrinks by roughly 90% ± 3%. In absolute numbers, we are looking at savings of around 0.7 MB for a `dotnet new maui` app and around 3 MB for System.Runtime.Tests in the linked executables. The savings in the size of the unwinding information far outweigh any increase in the code size.

Secondly, part of the motivation is that the Apple linker is notoriously buggy when processing DWARF unwinding data. The compact unwinding tables are used as an index into the DWARF data for anything that cannot be expressed directly as a compact unwinding code. Due to the structure of the tables, the effective offset into the DWARF info is limited to 24 bits, which places a hard limit on the size of the DWARF unwinding info. At least with some versions of the Apple linker, exceeding this limit results in silent corruption and runtime failures.
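
The 24-bit limit follows from how a DWARF-mode compact entry is laid out. The Python sketch below uses the ARM64 constants from Apple's `compact_unwind_encoding.h`; the helper `dwarf_entry` is hypothetical and does not model what the linker does when the limit is exceeded.

```python
# Constants from Apple's compact_unwind_encoding.h (ARM64).
UNWIND_ARM64_MODE_DWARF = 0x03000000
UNWIND_ARM64_DWARF_SECTION_OFFSET = 0x00FFFFFF  # low 24 bits hold the offset


def dwarf_entry(eh_frame_offset):
    """Hypothetical helper: compact entry that points at an FDE in __eh_frame.

    Raises instead of truncating when the offset needs more than 24 bits.
    """
    if not 0 <= eh_frame_offset <= UNWIND_ARM64_DWARF_SECTION_OFFSET:
        raise OverflowError("__eh_frame offset does not fit in 24 bits")
    return UNWIND_ARM64_MODE_DWARF | eh_frame_offset
```

The largest representable offset is `0xFFFFFF`, just under 16 MiB, so any FDE placed beyond that point in `__eh_frame` cannot be referenced from the compact unwinding table.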

Comparison between `main` and PR for System.Runtime.Tests in Release build

Compare raw size of linked binaries:

ls -l ./artifacts/bin/System.Runtime.Tests/Release/net9.0-unix/osx-arm64/publish/System.Runtime.Tests ../runtime-main/artifacts/bin/System.Runtime.Tests/Release/net9.0-unix/osx-arm64/publish/System.Runtime.Tests
-rwxr-xr-x  1 filipnavara  staff  48913112 Sep 12 22:26 ../runtime-main/artifacts/bin/System.Runtime.Tests/Release/net9.0-unix/osx-arm64/publish/System.Runtime.Tests
-rwxr-xr-x  1 filipnavara  staff  45827128 Sep 12 22:34 ./artifacts/bin/System.Runtime.Tests/Release/net9.0-unix/osx-arm64/publish/System.Runtime.Tests
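
As a quick cross-check of the two sizes listed above, the difference can be computed directly. This is plain arithmetic on the reported byte counts; the variable names are only illustrative.

```python
main_size = 48913112  # bytes, binary built from main
pr_size = 45827128    # bytes, binary built from this PR

saved_bytes = main_size - pr_size
saved_mib = saved_bytes / 2**20
saved_pct = 100 * saved_bytes / main_size

print(f"{saved_bytes} bytes, {saved_mib:.2f} MiB, {saved_pct:.1f}%")
```

That is 3085984 bytes, about 2.94 MiB or 6.3% of the original file, which agrees with the TOTAL row of the bloaty comparison of the linked binaries.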

Bloaty check of linked binaries:

bloaty -d symbols ./artifacts/bin/System.Runtime.Tests/Release/net9.0-unix/osx-arm64/publish/System.Runtime.Tests -- ../runtime-main/artifacts/bin/System.Runtime.Tests/Release/net9.0-unix/osx-arm64/publish/System.Runtime.Tests
    FILE SIZE        VM SIZE    
 --------------  -------------- 
  +1.9%  +406Ki  +1.9%  +406Ki    [__TEXT,__managedcode]
   +55% +8.80Ki   +54% +8.80Ki    [__TEXT]
  +0.1% +2.51Ki  +0.1% +2.51Ki    [__DATA,.dotnet_eh_table]
  +0.1%    +272  +0.1%    +272    [__DATA,__data]
  -0.0%      -5  -0.0%      -5    [__TEXT,__cstring]
  -0.0%     -16  -0.0%     -16    [__TEXT,__const]
 -33.0% -2.78Ki -25.0% -2.78Ki    [__DATA]
  -3.6% -21.7Ki  -2.6% -16.0Ki    [__LINKEDIT]
 -20.5%  -437Ki -20.5%  -437Ki    [__TEXT,__unwind_info]
 -88.7% -2.90Mi -88.7% -2.90Mi    [__TEXT,__eh_frame]
  -6.3% -2.94Mi  -5.4% -2.94Mi    TOTAL

Bloaty check of object files:

bloaty ./artifacts/obj/System.Runtime.Tests/Release/net9.0-unix/osx-arm64/native/System.Runtime.Tests.o -- ../runtime-main/artifacts/obj/System.Runtime.Tests/Release/net9.0-unix/osx-arm64/native/System.Runtime.Tests.o 
    FILE SIZE        VM SIZE    
 --------------  -------------- 
  +1.9%  +406Ki  +1.9%  +406Ki    ,__managedcode
  +0.1% +9.64Ki  +0.1% +9.64Ki    ,__debug_loc
  +0.1% +2.51Ki  +0.1% +2.51Ki    ,.dotnet_eh_table
  +0.0%    +348  +0.0%    +348    ,__debug_line
  +0.1%    +272  +0.1%    +272    ,__data
  +0.0%     +20  +0.0%     +20    ,__debug_info
  -0.1%      -1  -2.3%      -1    []
  -0.0%      -9  -0.0%      -9    ,__const
  -6.8% -1.76Mi  [ = ]       0    [Unmapped]
 -88.7% -2.90Mi -88.7% -2.90Mi    ,__eh_frame
  -2.4% -4.25Mi  -2.4% -2.49Mi    TOTAL

Bloaty check of object files (detailed):

bloaty -d symbols ./artifacts/obj/System.Runtime.Tests/Release/net9.0-unix/osx-arm64/native/System.Runtime.Tests.o -- ../runtime-main/artifacts/obj/System.Runtime.Tests/Release/net9.0-unix/osx-arm64/native/System.Runtime.Tests.o
    FILE SIZE        VM SIZE    
 --------------  -------------- 
  +1.9%  +410Ki  +1.6%  +410Ki    [36252 Others]
  +0.1% +9.64Ki  +0.1% +9.64Ki    [,__debug_loc]
  [NEW] +7.19Ki  [NEW] +6.97Ki    _TestUtilities_Unicode_System_Text_RegularExpressions_Generated__RegexGenerator_g_F026929D4AD63EB6EDB749A9B02B133C084237B3AFB8777B3DE0107C755B91565__GetRegex_0_RunnerFactory_Runner__TryMatchAtCurrentPosition
  [NEW] +2.42Ki  [NEW] +2.20Ki    _S_P_Xml_System_Text_RegularExpressions_Generated__RegexGenerator_g_F417AD2970E27AC5D022778EC7081B94838CBCB072379596BB27CB8792C23ED76__EnsureArrayIndexRegex_5_RunnerFactory_Runner__TryMatchAtCurrentPosition
  [NEW] +2.02Ki  [NEW] +1.81Ki    _S_P_Xml_System_Text_RegularExpressions_Generated__RegexGenerator_g_F417AD2970E27AC5D022778EC7081B94838CBCB072379596BB27CB8792C23ED76__Regex1_3_RunnerFactory_Runner__TryMatchAtCurrentPosition
  [NEW] +1.54Ki  [NEW] +1.51Ki    _lsda0_TestUtilities_Unicode_System_Text_RegularExpressions_Generated__RegexGenerator_g_F026929D4AD63EB6EDB749A9B02B133C084237B3AFB8777B3DE0107C755B91565__GetRegex_0_RunnerFactory_Runner__TryMatchAtCurrentPosition
  [NEW] +1.44Ki  [NEW] +1.23Ki    _S_P_Xml_System_Text_RegularExpressions_Generated__RegexGenerator_g_F417AD2970E27AC5D022778EC7081B94838CBCB072379596BB27CB8792C23ED76__P0Regex_6_RunnerFactory_Runner__TryMatchAtCurrentPosition
  [NEW] +1.18Ki  [NEW]    +976    _System_Runtime_Tests_System_Text_RegularExpressions_Generated__RegexGenerator_g_FB4BB6619E624A9FE5F4E687DA0CF38E2970A0CF70BB059A398625E558ACC7DAC__IanaAbbreviationRegex_0_RunnerFactory_Runner__TryMatchAtCurrentPosition
  [NEW] +1.14Ki  [NEW]    +960    _S_P_Xml_System_Text_RegularExpressions_Generated__RegexGenerator_g_F417AD2970E27AC5D022778EC7081B94838CBCB072379596BB27CB8792C23ED76__Regex2_4_RunnerFactory_Runner__TryMatchAtCurrentPosition
  [NEW] +1.11Ki  [NEW]    +912    _S_P_Xml_System_Text_RegularExpressions_Generated__RegexGenerator_g_F417AD2970E27AC5D022778EC7081B94838CBCB072379596BB27CB8792C23ED76__UnknownNodeObjectEmptyRegex_8_RunnerFactory_Runner__TryMatchAtCurrentPosition
  [DEL]   -1017  [DEL]    -800    _S_P_Xml_System_Text_RegularExpressions_Generated__RegexGenerator_g_F24C2B164CF72F5E63C813A2B442D56F3F17BFFC74EAF7C2818EB7F6278C5183A__EncodeCharRegex_1_RunnerFactory_Runner__TryMatchAtCurrentPosition
  [DEL] -1.11Ki  [DEL]    -912    _S_P_Xml_System_Text_RegularExpressions_Generated__RegexGenerator_g_F24C2B164CF72F5E63C813A2B442D56F3F17BFFC74EAF7C2818EB7F6278C5183A__UnknownNodeObjectEmptyRegex_8_RunnerFactory_Runner__TryMatchAtCurrentPosition
  [DEL] -1.14Ki  [DEL]    -960    _S_P_Xml_System_Text_RegularExpressions_Generated__RegexGenerator_g_F24C2B164CF72F5E63C813A2B442D56F3F17BFFC74EAF7C2818EB7F6278C5183A__Regex2_4_RunnerFactory_Runner__TryMatchAtCurrentPosition
  [DEL] -1.17Ki  [DEL]    -960    _System_Runtime_Tests_System_Text_RegularExpressions_Generated__RegexGenerator_g_F23FFE14CED6C53CC123B603EF102D84787AD8CF9A59E83434F2BDF516814C2CC__IanaAbbreviationRegex_0_RunnerFactory_Runner__TryMatchAtCurrentPosition
  [DEL] -1.44Ki  [DEL] -1.23Ki    _S_P_Xml_System_Text_RegularExpressions_Generated__RegexGenerator_g_F24C2B164CF72F5E63C813A2B442D56F3F17BFFC74EAF7C2818EB7F6278C5183A__P0Regex_6_RunnerFactory_Runner__TryMatchAtCurrentPosition
  [DEL] -1.57Ki  [DEL] -1.54Ki    _lsda0_TestUtilities_Unicode_System_Text_RegularExpressions_Generated__RegexGenerator_g_F25E40F247ED975110B5E93851ECC9A2CB78C4DC17A73BF5F5EE6D76B142C41B6__GetRegex_0_RunnerFactory_Runner__TryMatchAtCurrentPosition
  [DEL] -2.00Ki  [DEL] -1.80Ki    _S_P_Xml_System_Text_RegularExpressions_Generated__RegexGenerator_g_F24C2B164CF72F5E63C813A2B442D56F3F17BFFC74EAF7C2818EB7F6278C5183A__Regex1_3_RunnerFactory_Runner__TryMatchAtCurrentPosition
  [DEL] -2.42Ki  [DEL] -2.20Ki    _S_P_Xml_System_Text_RegularExpressions_Generated__RegexGenerator_g_F24C2B164CF72F5E63C813A2B442D56F3F17BFFC74EAF7C2818EB7F6278C5183A__EnsureArrayIndexRegex_5_RunnerFactory_Runner__TryMatchAtCurrentPosition
  [DEL] -7.19Ki  [DEL] -6.97Ki    _TestUtilities_Unicode_System_Text_RegularExpressions_Generated__RegexGenerator_g_F25E40F247ED975110B5E93851ECC9A2CB78C4DC17A73BF5F5EE6D76B142C41B6__GetRegex_0_RunnerFactory_Runner__TryMatchAtCurrentPosition
  -6.8% -1.76Mi  [ = ]       0    [Unmapped]
 -88.7% -2.90Mi -88.7% -2.90Mi    lsection3
  -2.4% -4.25Mi  -2.4% -2.49Mi    TOTAL

filipnavara · 2024-09-12T21:20:28Z

The dominant part is the JIT changes, hence area-CodeGen-coreclr is the right label.

cc @dotnet/ilc-contrib @VSadov @ivanpovazan

dotnet-policy-service · 2024-09-12T21:26:26Z

Tagging subscribers to 'os-ios': @vitek-karas, @kotlarmilos, @ivanpovazan, @steveisok, @akoeplinger
See info in area-owners.md if you want to be subscribed.

dotnet-policy-service · 2024-09-12T21:26:27Z

Tagging subscribers to 'os-tvos': @vitek-karas, @kotlarmilos, @ivanpovazan, @steveisok, @akoeplinger
See info in area-owners.md if you want to be subscribed.

filipnavara added 2 commits September 12, 2024 22:42

ObjWriter: For Mach-O ARM64 try to convert the DWARF CFI unwinding codes into compact unwinding code (73bea2e)

filipnavara requested a review from MichalStrehovsky as a code owner September 12, 2024 21:18

dotnet-issue-labeler bot added the area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI label Sep 12, 2024

filipnavara requested review from BruceForstall and removed request for MichalStrehovsky September 12, 2024 21:18

dotnet-policy-service bot added the community-contribution Indicates that the PR has been added by a community member label Sep 12, 2024

filipnavara added arch-arm64 os-mac-os-x macOS aka OSX os-ios Apple iOS os-tvos Apple tvOS labels Sep 12, 2024

This was referenced Sep 13, 2024

slow macOS - "##[error]The job running on agent Azure Pipelines 9 ran longer than the maximum time of 60 minutes." dotnet/dnceng#1883

Open

The Operation will be canceled. The next steps may not contain expected logs. dotnet/dnceng#3008

Open

AndyAyersMS mentioned this pull request Sep 16, 2024

BasicObjectsRoundTripAndMatch fails #107629

Closed
