Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Intermittend failures in ThreadPoolTests on Mono Linux ARM64 #85960

Closed
jkoritzinsky opened this issue May 9, 2023 · 9 comments
Closed

Intermittend failures in ThreadPoolTests on Mono Linux ARM64 #85960

jkoritzinsky opened this issue May 9, 2023 · 9 comments
Assignees
Labels
arch-arm64 area-Diagnostics-mono blocking-clean-ci Blocking PR or rolling runs of 'runtime' or 'runtime-extra-platforms' Known Build Error Use this to report build issues in the .NET Helix tab os-linux Linux OS (any supported distro)
Milestone

Comments

@jkoritzinsky
Copy link
Member

jkoritzinsky commented May 9, 2023

Build Information

Build: https://dev.azure.com/dnceng-public/public/_build/results?buildId=266460
Build error leg or test failing: System.Threading.ThreadPools.Tests.ThreadPoolTests.ThreadPoolMinMaxThreadsEventTest
Pull request:

Error Message

Fill the error message using known issues guidance.

{
  "ErrorMessage": "Exit code was 139 but it should have been 42",
  "BuildRetry": false,
  "ExcludeConsoleLog": true
}

Report

Summary

24-Hour Hit Count 7-Day Hit Count 1-Month Count
0 0 0

Known issue validation

Build: 🔎
Result validation: ⚠️ Validation could not be done without an Azure DevOps build URL on the issue. Please add it to the "Build: 🔎" line.

@jkoritzinsky jkoritzinsky added blocking-clean-ci Blocking PR or rolling runs of 'runtime' or 'runtime-extra-platforms' Known Build Error Use this to report build issues in the .NET Helix tab labels May 9, 2023
@ghost ghost added the untriaged New issue has not been triaged by the area owner label May 9, 2023
@ghost
Copy link

ghost commented May 9, 2023

Tagging subscribers to this area: @mangod9
See info in area-owners.md if you want to be subscribed.

Issue Details

Build Information

Build: https://dev.azure.com/dnceng-public/public/_build/results?buildId=266460
Build error leg or test failing: System.Threading.ThreadPools.Tests.ThreadPoolTests.ThreadPoolMinMaxThreadsEventTest
Pull request:

Error Message

Fill the error message using known issues guidance.

{
  "ErrorMessage": "Exit code was 139 but it should have been 42",
  "BuildRetry": false,
  "ExcludeConsoleLog": true
}
Author: jkoritzinsky
Assignees: -
Labels:

area-System.Threading, blocking-clean-ci, Known Build Error

Milestone: -

@ghost
Copy link

ghost commented May 9, 2023

Tagging subscribers to this area: @mangod9
See info in area-owners.md if you want to be subscribed.

Issue Details

Build Information

Build: https://dev.azure.com/dnceng-public/public/_build/results?buildId=266460
Build error leg or test failing: System.Threading.ThreadPools.Tests.ThreadPoolTests.ThreadPoolMinMaxThreadsEventTest
Pull request:

Error Message

Fill the error message using known issues guidance.

{
  "ErrorMessage": "Exit code was 139 but it should have been 42",
  "BuildRetry": false,
  "ExcludeConsoleLog": true
}

Report

Build Definition Test Pull Request
266460 dotnet/runtime System.Threading.ThreadPools.Tests.ThreadPoolTests.ThreadPoolMinMaxThreadsEventTest #85941

Summary

24-Hour Hit Count 7-Day Hit Count 1-Month Count
1 1 1
Author: jkoritzinsky
Assignees: -
Labels:

arch-arm64, area-System.Threading, os-linux, blocking-clean-ci, untriaged, Known Build Error, area-Threading-mono

Milestone: -

@SamMonoRT
Copy link
Member

@lambdageek - didn't we have a similar intermittent threading issue somewhere?

@SamMonoRT SamMonoRT added this to the 9.0.0 milestone Jul 25, 2023
@ghost ghost removed the untriaged New issue has not been triaged by the area owner label Jul 25, 2023
@build-analysis build-analysis bot removed this from the 9.0.0 milestone Nov 15, 2023
@ghost ghost added the untriaged New issue has not been triaged by the area owner label Nov 15, 2023
@akoeplinger akoeplinger added this to the 9.0.0 milestone Nov 24, 2023
@ghost ghost removed the untriaged New issue has not been triaged by the area owner label Nov 24, 2023
@pavelsavara
Copy link
Member

here I have one on x64, with mono_dump_native_crash_info

=================================================================
	External Debugger Dump:
=================================================================
/tmp/mono-gdb-commands.209:1: Error in sourced command file:
ptrace: Operation not permitted.
[New LWP 210]
[New LWP 211]
[New LWP 212]
[New LWP 213]
Missing separate debuginfo for /root/helix/work/correlation/dotnet
Try: yum --enablerepo='*debug*' install /usr/lib/debug/.build-id/9a/ee12662a8f8c3b14915491ceba0663c152d89a.debug
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".
Missing separate debuginfo for /root/helix/work/correlation/host/fxr/9.0.0/libhostfxr.so
Try: yum --enablerepo='*debug*' install /usr/lib/debug/.build-id/85/5812b12a0494863911e4320a48b9c4d4450817.debug
Missing separate debuginfo for /root/helix/work/correlation/shared/Microsoft.NETCore.App/9.0.0/libhostpolicy.so
Try: yum --enablerepo='*debug*' install /usr/lib/debug/.build-id/ea/2c76e0d159d038a7eab2f0cc9f07155aa572ef.debug
Dwarf Error: Cannot handle DW_FORM_strx1 in DWARF reader [in module /root/helix/work/correlation/shared/Microsoft.NETCore.App/9.0.0/libcoreclr.so.dbg]
Dwarf Error: Cannot handle DW_FORM_strx1 in DWARF reader [in module /root/helix/work/correlation/shared/Microsoft.NETCore.App/9.0.0/libSystem.Native.so.dbg]
0x00007fe2b0683672 in waitpid () from /lib64/libpthread.so.0
  Id   Target Id                                         Frame 
* 1    Thread 0x7fe2b0ab5740 (LWP 209) "dotnet"          0x00007fe2b0683672 in waitpid () from /lib64/libpthread.so.0
  2    Thread 0x7fe2ae7ff700 (LWP 210) "SGen worker"     0x00007fe2b067f45c in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
  3    Thread 0x7fe2aee11700 (LWP 211) ".NET EventPipe"  0x00007fe2af89d301 in poll () from /lib64/libc.so.6
  4    Thread 0x7fe2ac9ff700 (LWP 212) "Finalizer"       0x00007fe2b0681da6 in do_futex_wait.constprop () from /lib64/libpthread.so.0
  5    Thread 0x7fe2a5f94700 (LWP 213) ".NET Long Runni" 0x00007fe2b0683672 in waitpid () from /lib64/libpthread.so.0

Thread 5 (Thread 0x7fe2a5f94700 (LWP 213)):
#0  0x00007fe2b0683672 in waitpid () from /lib64/libpthread.so.0
#1  0x00007fe2af3f2212 in mono_dump_native_crash_info () from /root/helix/work/correlation/shared/Microsoft.NETCore.App/9.0.0/libcoreclr.so
#2  0x00007fe2af39946e in mono_handle_native_crash () from /root/helix/work/correlation/shared/Microsoft.NETCore.App/9.0.0/libcoreclr.so
#3  0x00007fe2af3001e1 in mono_sigsegv_signal_handler_debug () from /root/helix/work/correlation/shared/Microsoft.NETCore.App/9.0.0/libcoreclr.so
#4  <signal handler called>
#5  0x00007fe2af52f412 in mono_profiler_set_jit_begin_callback () from /root/helix/work/correlation/shared/Microsoft.NETCore.App/9.0.0/libcoreclr.so
#6  0x00007fe2af468a08 in EventPipeEtwCallbackDotNETRuntime () from /root/helix/work/correlation/shared/Microsoft.NETCore.App/9.0.0/libcoreclr.so
#7  0x00007fe2af45a3e9 in provider_invoke_callback () from /root/helix/work/correlation/shared/Microsoft.NETCore.App/9.0.0/libcoreclr.so
#8  0x00007fe2af45ab34 in disable_helper () from /root/helix/work/correlation/shared/Microsoft.NETCore.App/9.0.0/libcoreclr.so
#9  0x00000000418b9cfa in ?? ()
#10 0x000000000000000a in ?? ()
#11 0x00007fe2af617150 in ?? ()
#12 0x00007fe2a5f93848 in ?? ()
#13 0x0000000000000001 in ?? ()
#14 0x0000000000000000 in ?? ()

Thread 4 (Thread 0x7fe2ac9ff700 (LWP 212)):
#0  0x00007fe2b0681da6 in do_futex_wait.constprop () from /lib64/libpthread.so.0
#1  0x00007fe2b0681e98 in __new_sem_wait_slow.constprop.0 () from /lib64/libpthread.so.0
#2  0x00007fe2af55f9e8 in finalizer_thread () from /root/helix/work/correlation/shared/Microsoft.NETCore.App/9.0.0/libcoreclr.so
#3  0x00007fe2af538669 in start_wrapper () from /root/helix/work/correlation/shared/Microsoft.NETCore.App/9.0.0/libcoreclr.so
#4  0x00007fe2b06791ca in start_thread () from /lib64/libpthread.so.0
#5  0x00007fe2af7b2e73 in clone () from /lib64/libc.so.6

Thread 3 (Thread 0x7fe2aee11700 (LWP 211)):
#0  0x00007fe2af89d301 in poll () from /lib64/libc.so.6
#1  0x00007fe2af484ece in ds_ipc_poll () from /root/helix/work/correlation/shared/Microsoft.NETCore.App/9.0.0/libcoreclr.so
#2  0x00007fe2af481de8 in ds_ipc_stream_factory_get_next_available_stream () from /root/helix/work/correlation/shared/Microsoft.NETCore.App/9.0.0/libcoreclr.so
#3  0x00007fe2af48044f in server_thread () from /root/helix/work/correlation/shared/Microsoft.NETCore.App/9.0.0/libcoreclr.so
#4  0x00007fe2af484341 in ep_rt_thread_mono_start_func () from /root/helix/work/correlation/shared/Microsoft.NETCore.App/9.0.0/libcoreclr.so
#5  0x00007fe2b06791ca in start_thread () from /lib64/libpthread.so.0
#6  0x00007fe2af7b2e73 in clone () from /lib64/libc.so.6

Thread 2 (Thread 0x7fe2ae7ff700 (LWP 210)):
#0  0x00007fe2b067f45c in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x00007fe2af5c7323 in thread_func () from /root/helix/work/correlation/shared/Microsoft.NETCore.App/9.0.0/libcoreclr.so
#2  0x00007fe2b06791ca in start_thread () from /lib64/libpthread.so.0
#3  0x00007fe2af7b2e73 in clone () from /lib64/libc.so.6

Thread 1 (Thread 0x7fe2b0ab5740 (LWP 209)):
#0  0x00007fe2b0683672 in waitpid () from /lib64/libpthread.so.0
#1  0x00007fe2af3f2212 in mono_dump_native_crash_info () from /root/helix/work/correlation/shared/Microsoft.NETCore.App/9.0.0/libcoreclr.so
#2  0x00007fe2af39946e in mono_handle_native_crash () from /root/helix/work/correlation/shared/Microsoft.NETCore.App/9.0.0/libcoreclr.so
#3  0x00007fe2af3f1981 in sigabrt_signal_handler () from /root/helix/work/correlation/shared/Microsoft.NETCore.App/9.0.0/libcoreclr.so
#4  <signal handler called>
#5  0x00007fe2af7c7acf in raise () from /lib64/libc.so.6
#6  0x00007fe2af79aea5 in abort () from /lib64/libc.so.6
#7  0x00007fe2af48fb05 in monoeg_assert_abort () from /root/helix/work/correlation/shared/Microsoft.NETCore.App/9.0.0/libcoreclr.so
#8  0x00007fe2af4a0106 in mono_log_write_logfile () from /root/helix/work/correlation/shared/Microsoft.NETCore.App/9.0.0/libcoreclr.so
#9  0x00007fe2af48fec6 in monoeg_g_logv () from /root/helix/work/correlation/shared/Microsoft.NETCore.App/9.0.0/libcoreclr.so
#10 0x00007fe2af48fffe in monoeg_g_log () from /root/helix/work/correlation/shared/Microsoft.NETCore.App/9.0.0/libcoreclr.so
#11 0x00007fe2af467b5b in ep_rt_mono_fini () from /root/helix/work/correlation/shared/Microsoft.NETCore.App/9.0.0/libcoreclr.so
#12 0x00007fe2af304601 in mini_cleanup () from /root/helix/work/correlation/shared/Microsoft.NETCore.App/9.0.0/libcoreclr.so
#13 0x00007fe2af4e5dbb in ves_icall_System_Environment_Exit () from /root/helix/work/correlation/shared/Microsoft.NETCore.App/9.0.0/libcoreclr.so
#14 0x00000000418b8bba in ?? ()
#15 0x0000000000000000 in ?? ()
[Inferior 1 (process 209) detached]

Log

@lambdageek
Copy link
Member

@lateralusX some kind of race condition in eventpipe shutdown, looks like.

@vargaz
Copy link
Contributor

vargaz commented Dec 6, 2023

Note that we don't shut down threads in Environment.Exit () so there is a possibility of races.

@lateralusX
Copy link
Member

lateralusX commented Dec 7, 2023

ep_rt_mono_fini assumes that all threads that might run EventPipe code has been stopped, so if there are still threads that can call into EventPipe at that point, it will race with shutdown logic. In this case Thread 5 is disabling a managed EventSource provider that will use the profiler callbacks, that is reset by ep_rt_mono_fini since it assumes no other threads are running at that point, the assert in Thread 1 happens since it will clean up a mutex, that is currently held by Thread 5.

On CoreClr/NativeAot we don't have any cleanup done in ep_rt_shutdown and those runtimes will leak resources, but on Mono we do cleanup of runtime resources. We probably need to detect if the shutdown is triggered in a way where other managed threads migth still be running when calling ep_rt_shutdown, if so we would probably need to leak these resources.

@tommcdon
Copy link
Member

This should be fixed on #99332

@mdh1418
Copy link
Member

mdh1418 commented May 21, 2024

Looks like the test passed on the linux-arm64 Debug Mono_MiniJIT_LibrariesTests lane on a mono related PR in the general runtime pipeline. Didn't find the testResults.xml, since the helix job only had a console, but I was able to filter the tests tab on the azdo build and found that ThreadPoolMinMaxThreadsEventTest passed on both the Debian 11 and Ubuntu helix machines https://dev.azure.com/dnceng-public/public/_build/results?buildId=682588&view=ms.vss-test-web.build-test-results-tab&runId=16939684&paneView=debug.

@mdh1418 mdh1418 closed this as completed May 21, 2024
@github-actions github-actions bot locked and limited conversation to collaborators Jun 21, 2024
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
arch-arm64 area-Diagnostics-mono blocking-clean-ci Blocking PR or rolling runs of 'runtime' or 'runtime-extra-platforms' Known Build Error Use this to report build issues in the .NET Helix tab os-linux Linux OS (any supported distro)
Projects
None yet
Development

No branches or pull requests