[NativeAOT/ARM64] Generate frames compatible with Apple compact unwinding #107766

@filipnavara filipnavara commented Sep 12, 2024

Contributes to #76371

There are two changes in this PR that work in tandem. The first is a JIT change that generates a slightly different frame layout on NativeAOT/ARM64/Apple platforms (iOS, macOS, and tvOS). The second is an ObjWriter change that recognizes structurally compatible unwinding information and emits compact unwinding codes in the object files instead of verbose DWARF unwinding information.
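
To illustrate what "structurally compatible" means, here is a rough sketch of how a frame-based ARM64 compact unwinding code is formed. The constants are taken from Apple's compact_unwind_encoding.h as I recall them. The helper name `encode_frame` is hypothetical and this is not the ObjWriter code. The sketch only checks that every saved register belongs to a known pair; the real format also expects the pairs in a fixed layout directly below FP/LR.

```python
# Values from Apple's compact_unwind_encoding.h (ARM64), quoted from memory.
UNWIND_ARM64_MODE_FRAME = 0x04000000

# One bit per callee-saved register pair; single registers cannot be expressed.
PAIR_BITS = {
    ("x19", "x20"): 0x001,
    ("x21", "x22"): 0x002,
    ("x23", "x24"): 0x004,
    ("x25", "x26"): 0x008,
    ("x27", "x28"): 0x010,
    ("d8", "d9"): 0x100,
    ("d10", "d11"): 0x200,
    ("d12", "d13"): 0x400,
    ("d14", "d15"): 0x800,
}


def encode_frame(saved_pairs):
    """Hypothetical helper: return a compact unwinding code for an FP-based
    frame, or None when the saved registers do not form known pairs."""
    encoding = UNWIND_ARM64_MODE_FRAME
    for pair in saved_pairs:
        bit = PAIR_BITS.get(tuple(pair))
        if bit is None:
            return None  # not expressible; the caller would keep DWARF
        encoding |= bit
    return encoding


print(hex(encode_frame([("x19", "x20"), ("d8", "d9")])))
print(encode_frame([("x19", "x21")]))
```

Because the format has no way to describe a single saved register, the JIT has to save callee-saved registers in pairs for the frame to qualify.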

For NativeAOT/ARM64/Apple ABI do the following:

  1. Save callee-saved registers in the opposite order and in pairs.
  2. Prefer saving FP/LR at the top of the frame. Heuristics are used to avoid worse code quality outside of the prolog/epilog due to the addressing range limits of the ARM64 instruction set.
  3. Add an optimization to lvaFrameAddress that rewrites FP-x references to SP+y when possible. This allows efficient addressing with positive offsets when FP points to the top of the frame. It mimics a similar optimization on ARM32.
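
The FP-x to SP+y rewrite can be sketched as follows. This is a simplified model, not the actual lvaFrameAddress code: the function names are hypothetical, and it assumes SP does not move after the prolog so that the distance between FP and SP is a constant. The immediate ranges are those of the ARM64 load/store encodings (a 12-bit unsigned offset scaled by the access size, or a 9-bit signed unscaled offset).

```python
def encodable_in_one_instruction(offset, size=8):
    """True if a `size`-byte load/store at [base, #offset] needs no extra
    instruction to materialize the address."""
    # LDR/STR (unsigned offset): imm12 scaled by the access size.
    scaled = offset >= 0 and offset % size == 0 and offset // size <= 4095
    # LDUR/STUR: signed 9-bit unscaled offset.
    unscaled = -256 <= offset <= 255
    return scaled or unscaled


def rewrite_fp_relative(fp_offset, fp_minus_sp):
    """Hypothetical helper: turn an FP-relative offset into an SP-relative
    one. Only valid when SP is fixed after the prolog (assumption)."""
    return fp_offset + fp_minus_sp


# Example: FP points 1008 bytes above SP, local variable lives at FP-512.
fp_offset = -512
sp_offset = rewrite_fp_relative(fp_offset, fp_minus_sp=1008)
print(encodable_in_one_instruction(fp_offset))  # FP-512
print(sp_offset, encodable_in_one_instruction(sp_offset))  # SP+496
```

In this example FP-512 falls outside the unscaled range, while the equivalent SP+496 fits the scaled encoding.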

Each of these changes comes with some caveats:

  1. Saving registers in pairs may use 16 more bytes of stack space. The ARM64 stack has to be 16-byte aligned. If the prolog previously saved an odd number of integer callee-saved registers and an odd number of floating-point callee-saved registers, we now generate one more 16-byte stack slot. This seems to be a rather rare occurrence. It does not affect the number of instructions used. Likewise, the changed order doesn't have any impact on code size.
  2. Saving FP/LR at the top of the frame may cause one additional instruction in the prolog if any local or temporary variables are used. Additionally, addressing with negative offsets from FP is less efficient than addressing with positive offsets, since the ARM64 instruction encoding can only express negative offsets with the non-scaled encoding, which significantly reduces the addressable range. This is mostly mitigated by the optimization in lvaFrameAddress. The increase in prolog size also has a cascading effect, where loop alignment and related code alignment (32-byte alignment for method start) significantly contribute to the code section size. In some cases the same alignment logic may also reduce the size.
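
The first caveat can be checked with a bit of arithmetic. The sketch below is a simplification that looks at the callee-saved area in isolation, excludes FP/LR, and assumes 8 bytes per saved register; the function names are hypothetical and the real frame layout in the JIT has more inputs.

```python
def align_up(n, alignment):
    return (n + alignment - 1) // alignment * alignment


def save_area_unpaired(n_int, n_float):
    """Registers packed back to back, area rounded up to 16 bytes."""
    return align_up(8 * (n_int + n_float), 16)


def save_area_paired(n_int, n_float):
    """Integer and floating-point registers each rounded up to whole pairs."""
    return 8 * (align_up(n_int, 2) + align_up(n_float, 2))


for n_int, n_float in [(3, 1), (3, 2), (2, 1), (2, 2)]:
    before = save_area_unpaired(n_int, n_float)
    after = save_area_paired(n_int, n_float)
    print(n_int, n_float, before, after, after - before)
```

Under these assumptions, only the odd/odd combination costs an extra 16 bytes; in the other cases the padding was already being paid for by the 16-byte alignment.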

As you can see, this is a bit of a trade-off and it's valid to ask if it's worth it.

Firstly, let me address the size question. Loop alignment seems to be the biggest contributor to the code size variation, and I filed issue #107284 to investigate whether we can come up with better defaults for Apple platforms. The code size changes are predominantly confined to the small fixed change in the prolog size and the additional alignment.

When testing the prototype on MAUI apps, the biggest contributor to the increased code size was the code generated from XAML, which produces methods with a ton of local variables. The change in lvaFrameAddress practically eliminated any negative effect of the frame layout on this type of code. Without the change, the regression was nearly 50% because stack references needed an indirect load with a register and an extra instruction. Outside of MAUI, the biggest visible effect is on code from the Regex source generator, but in that case it's quite evenly split between size improvements and regressions. That suggests the regex code may be a good candidate for measuring the performance characteristics of the loop alignment.

The code size regressions on the few examples I tried amount to around 2% (including the alignment). The space saved in the DWARF unwinding section hovers around 90% ± 3%. In absolute numbers, we are looking at savings of around 0.7 MB for a dotnet new maui app and around 3 MB for System.Runtime.Tests in the linked executables. The savings in the size of the unwinding information far outweigh any increase in the code size.

Secondly, part of the motivation is that the Apple linker is notoriously buggy when processing the DWARF unwinding data. The compact unwinding tables are used as an index into the DWARF data for anything that cannot be expressed directly with a compact unwinding code. Due to the structure of the tables, this limits the effective offset of the DWARF info to 24 bits and places a hard limit on the DWARF unwinding info size. At least with some versions of the Apple linker, exceeding this limit results in silent corruption and runtime failures.
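
The 24-bit limit follows from how a DWARF fallback entry is packed into a 32-bit compact unwinding code. A minimal sketch, using the ARM64 constants from Apple's compact_unwind_encoding.h as I recall them; `encode_dwarf_fallback` is a hypothetical name, and unlike this sketch, the linker does not necessarily report an error when the offset overflows.

```python
# Values from Apple's compact_unwind_encoding.h (ARM64), quoted from memory.
UNWIND_ARM64_MODE_DWARF = 0x03000000
UNWIND_ARM64_DWARF_SECTION_OFFSET = 0x00FFFFFF  # 24 bits for the FDE offset


def encode_dwarf_fallback(fde_offset):
    """Hypothetical helper: pack an __eh_frame offset into a compact
    unwinding code that defers to DWARF."""
    if not 0 <= fde_offset <= UNWIND_ARM64_DWARF_SECTION_OFFSET:
        raise ValueError("FDE offset does not fit in 24 bits")
    return UNWIND_ARM64_MODE_DWARF | fde_offset


print(hex(encode_dwarf_fallback(0x1234)))
print((UNWIND_ARM64_DWARF_SECTION_OFFSET + 1) // 2**20, "MiB addressable")
```

Any function whose DWARF entry sits beyond that offset cannot be referenced from the compact unwinding table, which is why shrinking the DWARF section matters beyond raw size.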

Comparison between `main` and PR for System.Runtime.Tests in Release build

Compare raw size of linked binaries:

ls -l ./artifacts/bin/System.Runtime.Tests/Release/net9.0-unix/osx-arm64/publish/System.Runtime.Tests ../runtime-main/artifacts/bin/System.Runtime.Tests/Release/net9.0-unix/osx-arm64/publish/System.Runtime.Tests
-rwxr-xr-x  1 filipnavara  staff  48913112 Sep 12 22:26 ../runtime-main/artifacts/bin/System.Runtime.Tests/Release/net9.0-unix/osx-arm64/publish/System.Runtime.Tests
-rwxr-xr-x  1 filipnavara  staff  45827128 Sep 12 22:34 ./artifacts/bin/System.Runtime.Tests/Release/net9.0-unix/osx-arm64/publish/System.Runtime.Tests
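
As a quick sanity check, the difference between the two file sizes listed by ls can be computed directly; the variable names are just for illustration.

```python
main_size = 48913112  # ../runtime-main build
pr_size = 45827128    # build with this PR

saved = main_size - pr_size
print(saved, "bytes saved")
print(round(saved / 2**20, 2), "MiB")
print(round(100 * saved / main_size, 1), "% of the main binary")
```

This comes out to roughly 2.94 MiB, or about 6.3% of the binary built from main.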

Bloaty check of linked binaries:

bloaty -d symbols ./artifacts/bin/System.Runtime.Tests/Release/net9.0-unix/osx-arm64/publish/System.Runtime.Tests -- ../runtime-main/artifacts/bin/System.Runtime.Tests/Release/net9.0-unix/osx-arm64/publish/System.Runtime.Tests
    FILE SIZE        VM SIZE    
 --------------  -------------- 
  +1.9%  +406Ki  +1.9%  +406Ki    [__TEXT,__managedcode]
   +55% +8.80Ki   +54% +8.80Ki    [__TEXT]
  +0.1% +2.51Ki  +0.1% +2.51Ki    [__DATA,.dotnet_eh_table]
  +0.1%    +272  +0.1%    +272    [__DATA,__data]
  -0.0%      -5  -0.0%      -5    [__TEXT,__cstring]
  -0.0%     -16  -0.0%     -16    [__TEXT,__const]
 -33.0% -2.78Ki -25.0% -2.78Ki    [__DATA]
  -3.6% -21.7Ki  -2.6% -16.0Ki    [__LINKEDIT]
 -20.5%  -437Ki -20.5%  -437Ki    [__TEXT,__unwind_info]
 -88.7% -2.90Mi -88.7% -2.90Mi    [__TEXT,__eh_frame]
  -6.3% -2.94Mi  -5.4% -2.94Mi    TOTAL

Bloaty check of object files:

bloaty ./artifacts/obj/System.Runtime.Tests/Release/net9.0-unix/osx-arm64/native/System.Runtime.Tests.o -- ../runtime-main/artifacts/obj/System.Runtime.Tests/Release/net9.0-unix/osx-arm64/native/System.Runtime.Tests.o 
    FILE SIZE        VM SIZE    
 --------------  -------------- 
  +1.9%  +406Ki  +1.9%  +406Ki    ,__managedcode
  +0.1% +9.64Ki  +0.1% +9.64Ki    ,__debug_loc
  +0.1% +2.51Ki  +0.1% +2.51Ki    ,.dotnet_eh_table
  +0.0%    +348  +0.0%    +348    ,__debug_line
  +0.1%    +272  +0.1%    +272    ,__data
  +0.0%     +20  +0.0%     +20    ,__debug_info
  -0.1%      -1  -2.3%      -1    []
  -0.0%      -9  -0.0%      -9    ,__const
  -6.8% -1.76Mi  [ = ]       0    [Unmapped]
 -88.7% -2.90Mi -88.7% -2.90Mi    ,__eh_frame
  -2.4% -4.25Mi  -2.4% -2.49Mi    TOTAL

Bloaty check of object files (detailed):

bloaty -d symbols ./artifacts/obj/System.Runtime.Tests/Release/net9.0-unix/osx-arm64/native/System.Runtime.Tests.o -- ../runtime-main/artifacts/obj/System.Runtime.Tests/Release/net9.0-unix/osx-arm64/native/System.Runtime.Tests.o
    FILE SIZE        VM SIZE    
 --------------  -------------- 
  +1.9%  +410Ki  +1.6%  +410Ki    [36252 Others]
  +0.1% +9.64Ki  +0.1% +9.64Ki    [,__debug_loc]
  [NEW] +7.19Ki  [NEW] +6.97Ki    _TestUtilities_Unicode_System_Text_RegularExpressions_Generated__RegexGenerator_g_F026929D4AD63EB6EDB749A9B02B133C084237B3AFB8777B3DE0107C755B91565__GetRegex_0_RunnerFactory_Runner__TryMatchAtCurrentPosition
  [NEW] +2.42Ki  [NEW] +2.20Ki    _S_P_Xml_System_Text_RegularExpressions_Generated__RegexGenerator_g_F417AD2970E27AC5D022778EC7081B94838CBCB072379596BB27CB8792C23ED76__EnsureArrayIndexRegex_5_RunnerFactory_Runner__TryMatchAtCurrentPosition
  [NEW] +2.02Ki  [NEW] +1.81Ki    _S_P_Xml_System_Text_RegularExpressions_Generated__RegexGenerator_g_F417AD2970E27AC5D022778EC7081B94838CBCB072379596BB27CB8792C23ED76__Regex1_3_RunnerFactory_Runner__TryMatchAtCurrentPosition
  [NEW] +1.54Ki  [NEW] +1.51Ki    _lsda0_TestUtilities_Unicode_System_Text_RegularExpressions_Generated__RegexGenerator_g_F026929D4AD63EB6EDB749A9B02B133C084237B3AFB8777B3DE0107C755B91565__GetRegex_0_RunnerFactory_Runner__TryMatchAtCurrentPosition
  [NEW] +1.44Ki  [NEW] +1.23Ki    _S_P_Xml_System_Text_RegularExpressions_Generated__RegexGenerator_g_F417AD2970E27AC5D022778EC7081B94838CBCB072379596BB27CB8792C23ED76__P0Regex_6_RunnerFactory_Runner__TryMatchAtCurrentPosition
  [NEW] +1.18Ki  [NEW]    +976    _System_Runtime_Tests_System_Text_RegularExpressions_Generated__RegexGenerator_g_FB4BB6619E624A9FE5F4E687DA0CF38E2970A0CF70BB059A398625E558ACC7DAC__IanaAbbreviationRegex_0_RunnerFactory_Runner__TryMatchAtCurrentPosition
  [NEW] +1.14Ki  [NEW]    +960    _S_P_Xml_System_Text_RegularExpressions_Generated__RegexGenerator_g_F417AD2970E27AC5D022778EC7081B94838CBCB072379596BB27CB8792C23ED76__Regex2_4_RunnerFactory_Runner__TryMatchAtCurrentPosition
  [NEW] +1.11Ki  [NEW]    +912    _S_P_Xml_System_Text_RegularExpressions_Generated__RegexGenerator_g_F417AD2970E27AC5D022778EC7081B94838CBCB072379596BB27CB8792C23ED76__UnknownNodeObjectEmptyRegex_8_RunnerFactory_Runner__TryMatchAtCurrentPosition
  [DEL]   -1017  [DEL]    -800    _S_P_Xml_System_Text_RegularExpressions_Generated__RegexGenerator_g_F24C2B164CF72F5E63C813A2B442D56F3F17BFFC74EAF7C2818EB7F6278C5183A__EncodeCharRegex_1_RunnerFactory_Runner__TryMatchAtCurrentPosition
  [DEL] -1.11Ki  [DEL]    -912    _S_P_Xml_System_Text_RegularExpressions_Generated__RegexGenerator_g_F24C2B164CF72F5E63C813A2B442D56F3F17BFFC74EAF7C2818EB7F6278C5183A__UnknownNodeObjectEmptyRegex_8_RunnerFactory_Runner__TryMatchAtCurrentPosition
  [DEL] -1.14Ki  [DEL]    -960    _S_P_Xml_System_Text_RegularExpressions_Generated__RegexGenerator_g_F24C2B164CF72F5E63C813A2B442D56F3F17BFFC74EAF7C2818EB7F6278C5183A__Regex2_4_RunnerFactory_Runner__TryMatchAtCurrentPosition
  [DEL] -1.17Ki  [DEL]    -960    _System_Runtime_Tests_System_Text_RegularExpressions_Generated__RegexGenerator_g_F23FFE14CED6C53CC123B603EF102D84787AD8CF9A59E83434F2BDF516814C2CC__IanaAbbreviationRegex_0_RunnerFactory_Runner__TryMatchAtCurrentPosition
  [DEL] -1.44Ki  [DEL] -1.23Ki    _S_P_Xml_System_Text_RegularExpressions_Generated__RegexGenerator_g_F24C2B164CF72F5E63C813A2B442D56F3F17BFFC74EAF7C2818EB7F6278C5183A__P0Regex_6_RunnerFactory_Runner__TryMatchAtCurrentPosition
  [DEL] -1.57Ki  [DEL] -1.54Ki    _lsda0_TestUtilities_Unicode_System_Text_RegularExpressions_Generated__RegexGenerator_g_F25E40F247ED975110B5E93851ECC9A2CB78C4DC17A73BF5F5EE6D76B142C41B6__GetRegex_0_RunnerFactory_Runner__TryMatchAtCurrentPosition
  [DEL] -2.00Ki  [DEL] -1.80Ki    _S_P_Xml_System_Text_RegularExpressions_Generated__RegexGenerator_g_F24C2B164CF72F5E63C813A2B442D56F3F17BFFC74EAF7C2818EB7F6278C5183A__Regex1_3_RunnerFactory_Runner__TryMatchAtCurrentPosition
  [DEL] -2.42Ki  [DEL] -2.20Ki    _S_P_Xml_System_Text_RegularExpressions_Generated__RegexGenerator_g_F24C2B164CF72F5E63C813A2B442D56F3F17BFFC74EAF7C2818EB7F6278C5183A__EnsureArrayIndexRegex_5_RunnerFactory_Runner__TryMatchAtCurrentPosition
  [DEL] -7.19Ki  [DEL] -6.97Ki    _TestUtilities_Unicode_System_Text_RegularExpressions_Generated__RegexGenerator_g_F25E40F247ED975110B5E93851ECC9A2CB78C4DC17A73BF5F5EE6D76B142C41B6__GetRegex_0_RunnerFactory_Runner__TryMatchAtCurrentPosition
  -6.8% -1.76Mi  [ = ]       0    [Unmapped]
 -88.7% -2.90Mi -88.7% -2.90Mi    lsection3
  -2.4% -4.25Mi  -2.4% -2.49Mi    TOTAL

@dotnet-issue-labeler dotnet-issue-labeler bot added the area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI label Sep 12, 2024
@filipnavara filipnavara requested review from BruceForstall and removed request for MichalStrehovsky September 12, 2024 21:18
@dotnet-policy-service dotnet-policy-service bot added the community-contribution Indicates that the PR has been added by a community member label Sep 12, 2024
filipnavara commented Sep 12, 2024

The JIT changes are the dominant part, hence area-CodeGen-coreclr is the right label.

cc @dotnet/ilc-contrib @VSadov @ivanpovazan

@filipnavara filipnavara added arch-arm64 os-mac-os-x macOS aka OSX os-ios Apple iOS os-tvos Apple tvOS labels Sep 12, 2024

Tagging subscribers to 'os-ios': @vitek-karas, @kotlarmilos, @ivanpovazan, @steveisok, @akoeplinger
See info in area-owners.md if you want to be subscribed.


Tagging subscribers to 'os-tvos': @vitek-karas, @kotlarmilos, @ivanpovazan, @steveisok, @akoeplinger
See info in area-owners.md if you want to be subscribed.
