Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[NativeAOT] Enable CI for macOS x64/arm64 #75421

Merged
merged 10 commits into from
Sep 30, 2022
Merged

Conversation

filipnavara
Copy link
Member

No description provided.

@ghost ghost added the community-contribution Indicates that the PR has been added by a community member label Sep 11, 2022
@ghost
Copy link

ghost commented Sep 11, 2022

Tagging subscribers to this area: @dotnet/area-infrastructure-libraries
See info in area-owners.md if you want to be subscribed.

Issue Details

null

Author: filipnavara
Assignees: -
Labels:

area-Infrastructure-libraries

Milestone: -

@filipnavara
Copy link
Member Author

I assume that someone will need to run appropriate /azp run runtime-extra-platforms-other trigger.

@jkotas

This comment was marked as outdated.

@azure-pipelines

This comment was marked as outdated.

@jkotas
Copy link
Member

jkotas commented Sep 11, 2022

/azp run runtime-extra-platforms

@azure-pipelines
Copy link

Azure Pipelines successfully started running 1 pipeline(s).

@azure-pipelines
Copy link

No pipelines are associated with this pull request.

@filipnavara
Copy link
Member Author

Thanks!

@filipnavara
Copy link
Member Author

filipnavara commented Sep 11, 2022

I realized a lot of tests will fail because #75333 was not merged yet. I also should enable regular NativeAOT osx-arm64 build in the non-extra pipelines. Let's see the results first though...

@jkotas
Copy link
Member

jkotas commented Sep 11, 2022

Looks like this needs fixing:

<CppCompilerAndLinker Condition="'$(TargetArchitecture)' != '$(BuildArchitecture)' and '$(HostOS)' != 'windows'">clang-9</CppCompilerAndLinker>

@filipnavara
Copy link
Member Author

Looks like this needs fixing

Yep. Not quite sure what the correct fix is but I will give it a try.

@jkotas
Copy link
Member

jkotas commented Sep 11, 2022

I think we want to use clang-9 for Linux only (it is a workaround for missing clang alias in our Linux docker images).

@filipnavara
Copy link
Member Author

Yep, seems like regular clang (ie. unspecified CppCompilerAndLinker) works for the cross compiled binaries.

@jkotas
Copy link
Member

jkotas commented Sep 11, 2022

The cross-compilation does not work as expected:

2022-09-11T21:40:09.9502410Z   ld: warning: ignoring file /Users/runner/work/1/s/artifacts/bin/coreclr/OSX.arm64.Release/aotsdk/libbootstrapper.a, building for macOS-x86_64 but attempting to link with file built for macOS-arm64

Do we need to pass some kind of target architecture flag to clang?

@am11
Copy link
Member

am11 commented Sep 12, 2022

A cross compilation flag (--target) is excluded from TargetOS:OSX, and since --target on OSX is version suffixed (such as, arm64-apple-darwin21.6.0), we prefer the simpler -arch argument.

Try applying this kind of a patch (untested, as I don't have mac with intel chip):

--- a/src/coreclr/nativeaot/BuildIntegration/Microsoft.NETCore.Native.Unix.targets
+++ b/src/coreclr/nativeaot/BuildIntegration/Microsoft.NETCore.Native.Unix.targets
@@ -29,11 +29,12 @@ The .NET Foundation licenses this file to you under the MIT license.
       <FullRuntimeName Condition="'$(ServerGarbageCollection)' == 'true'">libRuntime.ServerGC</FullRuntimeName>
 
       <CrossCompileRid />
-      <CrossCompileRid Condition="'$(TargetOS)' != 'OSX' and !$(RuntimeIdentifier.EndsWith('-$(OSHostArch)'))">$(RuntimeIdentifier)</CrossCompileRid>
+      <CrossCompileRid Condition="!$(RuntimeIdentifier.EndsWith('-$(OSHostArch)'))">$(RuntimeIdentifier)</CrossCompileRid>
 
       <CrossCompileArch />
       <CrossCompileArch Condition="$(CrossCompileRid.EndsWith('-x64'))">x86_64</CrossCompileArch>
-      <CrossCompileArch Condition="$(CrossCompileRid.EndsWith('-arm64'))">aarch64</CrossCompileArch>
+      <CrossCompileArch Condition="$(CrossCompileRid.EndsWith('-arm64')) and '$(TargetOS)' != 'OSX'">aarch64</CrossCompileArch>
+      <CrossCompileArch Condition="$(CrossCompileRid.EndsWith('-arm64')) and '$(TargetOS)' == 'OSX'">arm64</CrossCompileArch>
 
       <TargetTriple />
       <TargetTriple Condition="'$(CrossCompileArch)' != ''">$(CrossCompileArch)-linux-gnu</TargetTriple>
@@ -85,7 +86,8 @@ The .NET Foundation licenses this file to you under the MIT license.
     <ItemGroup>
       <LinkerArg Include="@(NativeLibrary)" />
       <LinkerArg Include="--sysroot=$(SysRoot)" Condition="'$(SysRoot)' != ''" />
-      <LinkerArg Include="--target=$(TargetTriple)" Condition="'$(TargetTriple)' != ''" />
+      <LinkerArg Include="--target=$(TargetTriple)" Condition="'$(TargetOS)' != 'OSX' and '$(TargetTriple)' != ''" />
+      <LinkerArg Include="-arch $(CrossCompileArch)" Condition="'$(TargetOS)' == 'OSX' and '$(CrossCompileArch)' != ''" />
       <LinkerArg Include="-g" Condition="$(NativeDebugSymbols) == 'true'" />
       <LinkerArg Include="-Wl,--strip-debug" Condition="$(NativeDebugSymbols) != 'true' and '$(TargetOS)' != 'OSX'" />
       <LinkerArg Include="-Wl,-rpath,'$(IlcRPath)'" />

@filipnavara
Copy link
Member Author

untested, as I don't have mac with intel chip

I will test it today. You can run the whole build under arch -arch x86-64 ./build.sh ... though. That should allow testing the scenario even on M1/M2 Macs.

@am11
Copy link
Member

am11 commented Sep 12, 2022

It took these changes to succeed the cross build: am11@542104d.

You can run the whole build under arch -arch x86-64 ./build.sh ... though

Good to know. It was -arch x86_64 though.

@filipnavara
Copy link
Member Author

It got past the point of previous failure. We will need to re-trigger the extra pipeline.

@jkotas
Copy link
Member

jkotas commented Sep 12, 2022

/azp run runtime-extra-platforms

@azure-pipelines
Copy link

Azure Pipelines successfully started running 1 pipeline(s).

@filipnavara
Copy link
Member Author

macOS x64 is still crash fest unfortunately. I will check the dumps but if I don't find anything obvious I will split that off into separate PR.

macOS ARM64 all seem like small things that should be easy to fix. I plan to do it later this week.

@filipnavara
Copy link
Member Author

filipnavara commented Sep 13, 2022

macOS arm64:

  • System.Memory tests crash could not be reproduced locally on my machine [even with the exact Helix payload]. It seems we don't do dumps on Helix for NativeAOT so there's no information to diagnose at the moment.
  • System.IO.MemoryMappedFiles tests only failed for cross-compilation. I pushed a fix.
  • OSVersion_ValidVersion_OSX test relies on how CoreCLR fills the RID information. NativeAOT hard-codes it during compilation process. I disabled the test since it was supposed to test Environment.OSVersion (which works) and not RuntimeInformation.RuntimeIdentifier.

macOS x64:
I can reproduce the crashes locally but it's basically undebuggable. When running under lldb I run into loop in SuspendAllThreads that seems to be never ending. It keeps trying to hijack threads through PalHijack which sends signal to the target thread. Since lldb is slower with delivering the signals it keeps calling PalHijack so frequently that ActivationHandler doesn't seem to have time to run. I am not sure if it's something specific to lldb or Rosetta but it seems bit fishy to me. cc @janvorli

UPD: The macOS x64 may have problem stack walking but it could easily be something in the context capture too... The Debug build has meaningful asserts at least.

UPD 2: There's definitely garbage compact unwinding information. In some cases it encodes stack size as reference into the code but it points to instruction that's not subl $xxx,%esp (functionStart is pointing into middle of the code) and computes grabage. This garbage is then used as relative offset to stack and causes the SIGSEGVs.

UPD 3: The incorrect functionStart is actually bug in my code, it just was not used on arm64 so it went unnoticed. Should be a one-line fix.

UPD 4: With the fixed compact unwinding I am back to failing thread suspension/hijacking. In debug builds I also get asserts from Rosetta all the time when running under lldb (or the hang mentioned above). When running without debugger it crashes with:

Thread 13 Crashed:
0   ???                           	    0x7ff896c7a9a8 ???
1   libsystem_kernel.dylib        	    0x7ff80790f30e __pthread_kill + 10
2   libsystem_pthread.dylib       	    0x7ff807946f7b pthread_kill + 263
3   libsystem_c.dylib             	    0x7ff807890ca5 abort + 123
4   System.Collections.Tests      	       0x104d754ec Assert(char const*, char const*, unsigned int, char const*) + 28 (rhassert.cpp:24)
5   System.Collections.Tests      	       0x104d8e264 Thread::HijackReturnAddressWorker(StackFrameIterator*, void (*)()) + 276 (thread.cpp:789)
6   System.Collections.Tests      	       0x104d8ddee Thread::HijackReturnAddress(UNIX_CONTEXT*, void (*)()) + 174 (thread.cpp:764)
7   System.Collections.Tests      	       0x104d8dc85 Thread::HijackCallback(UNIX_CONTEXT*, void*) + 485 (thread.cpp:689)
8   System.Collections.Tests      	       0x104e0fbb6 ActivationHandler(int, __siginfo*, void*) + 102 (PalRedhawkUnix.cpp:1021)
9   libsystem_platform.dylib      	    0x7ff807971c1d _sigtramp + 29
10  ???                           	       0x11c00e680 ???

@jkotas
Copy link
Member

jkotas commented Sep 13, 2022

With the fixed compact unwinding I am back to failing thread suspension/hijacking. In debug builds I also get asserts from Rosetta all the time when running under lldb (or the hang mentioned above).

@VSadov Could you please take a look?

@filipnavara
Copy link
Member Author

filipnavara commented Sep 13, 2022

With the fixed compact unwinding I am back to failing thread suspension/hijacking. In debug builds I also get asserts from Rosetta all the time when running under lldb (or the hang mentioned above).

@VSadov Could you please take a look?

My theory is that the underlying reason why the hijacking fails is that the stack unwinding doesn't quite work correctly for this particular use case. The compact unwinding may not be accurate when unwinding a frame which is currently in the prolog/epilog. This likely happens significantly more often in the Rosetta emulation layer where I seem to hit the asserts with the signal being delivered exactly at the first instruction of a method. In fact, not even lldb can unwind these frames in the bt command.

On ARM64 this doesn't happen, or at least not that often, because most of the managed methods get compiled with prologs/epilogs that cannot be represented as compact unwinding and thus it falls back to DWARF unwinding information in the end (compact unwinding is still needed to make the linker happy and insert pointers to the DWARF info). DWARF has enough information to unwind even prologs/epilogs correctly.

@filipnavara
Copy link
Member Author

I added some rudimentary handling of the function prologs in UnixNativeCodeManager::GetReturnAddressHijackInfo. It seems to be good enough to let me finish the tests that were previously failing on osx-x64. I didn't do extensive testing but I guess it would interesting to have another go at /azp run runtime-extra-platforms.

@filipnavara
Copy link
Member Author

Rebased on top of main and manually bumped ObjWriter packages (maestro merges are currently blocked by build failures). Should be ready for another /azp run runtime-extra-platforms run.

@jkotas
Copy link
Member

jkotas commented Sep 30, 2022

/azp run runtime-extra-platforms

@azure-pipelines
Copy link

Azure Pipelines successfully started running 1 pipeline(s).

@filipnavara
Copy link
Member Author

filipnavara commented Sep 30, 2022

(+) No more random crashes.
(-) Few more interop tests need a tweak
(-) The ManyConcurrentAddsTakes_ForceContentionWithToArray failure on osx-arm64 may be real problem

The AzDO link doesn't seem to work anymore. Here are the details:

 "HelixJobId": "9096e23d-ea87-461c-b90d-4f1a203edfcb", "HelixWorkItemName": "System.Collections.Concurrent.Tests" }


Error message
Assert.DoesNotContain() Failure
Found:    0
In value: Int32[] [0]


Stack trace
   at System.Collections.Concurrent.Tests.ProducerConsumerCollectionTests.ManyConcurrentAddsTakes_ForceContentionWithToArray(Double seconds) + 0x110
   at System.Collections.Concurrent!<BaseAddress>+0x9fe65c
   at System.Reflection.DynamicInvokeInfo.Invoke(Object, IntPtr, Object[], BinderBundle, Boolean) + 0x148

@filipnavara filipnavara marked this pull request as ready for review September 30, 2022 15:48
@jkotas
Copy link
Member

jkotas commented Sep 30, 2022

/azp run runtime-extra-platforms

@azure-pipelines
Copy link

Azure Pipelines successfully started running 1 pipeline(s).

@filipnavara
Copy link
Member Author

The test failures seem unrelated (tvOS succeeded but failed to pass the XML results due to TCP port in use; Linux Bionic failed to find a device) 🎉

Copy link
Member

@jkotas jkotas left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks great. Thank you!

…ime.InteropServices.UnitTests/System/Runtime/InteropServices/ObjectiveC/MessageSendTests.cs
@jkotas jkotas merged commit c8f9f29 into dotnet:main Sep 30, 2022
@filipnavara
Copy link
Member Author

Thanks for all the help on this one!

@filipnavara filipnavara mentioned this pull request Sep 30, 2022
6 tasks
@VSadov
Copy link
Member

VSadov commented Sep 30, 2022

@filipnavara Thank you!!

@filipnavara filipnavara deleted the patch-7 branch October 1, 2022 08:36
@ghost ghost locked as resolved and limited conversation to collaborators Oct 31, 2022
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
area-Infrastructure-libraries community-contribution Indicates that the PR has been added by a community member
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants