
gcstress=3 timeout failures #68529

Open
BruceForstall opened this issue Apr 26, 2022 · 9 comments

@BruceForstall
Member

A recent run of the runtime-coreclr gcstress0x3-gcstress0xc pipeline shows timeout failures after one hour in many different gcstress=3 tests:

https://dev.azure.com/dnceng/public/_build/results?buildId=1735353&view=ms.vss-test-web.build-test-results-tab&runId=46978262&resultId=111084&paneView=debug

e.g., the Methodical_* tests in:

  • coreclr Linux x64 Checked gcstress0x3 @ Ubuntu.1804.Amd64.Open
  • coreclr Linux arm64 Checked gcstress0x3 @ (Ubuntu.1804.Arm64.Open)Ubuntu.1804.Armarch.Open@mcr.microsoft.com/dotnet-buildtools/prereqs:ubuntu-18.04-helix-arm64v8-20210531091519-97d8652
  • coreclr Linux arm Checked gcstress0x3 @ (Ubuntu.1804.Arm32.Open)Ubuntu.1804.Armarch.Open@mcr.microsoft.com/dotnet-buildtools/prereqs:ubuntu-18.04-helix-arm32v7-bfcd90a-20200121150440
  • coreclr OSX arm64 Checked gcstress0x3 @ OSX.1200.ARM64.Open

@trylek @jkoritzinsky did the recent test "batching" change effectively change the per-job timeouts? Does gcstress need to be handled differently?

@dotnet-issue-labeler bot added the untriaged label Apr 26, 2022
@ghost

ghost commented Apr 26, 2022

Tagging subscribers to this area: @hoyosjs
See info in area-owners.md if you want to be subscribed.

Issue Details

Author: BruceForstall
Assignees: -
Labels: GCStress, area-Infrastructure-coreclr

Milestone: -

@trylek
Member

trylek commented Apr 26, 2022

I've just been discussing GCStress timeouts with JanV, who hit this locally, and I think I have a fix, or at least a mitigation, in

#68523

Please let me know if you're aware of yet another place where the timeouts are hard-coded; there are never enough duplicates.

@BruceForstall
Member Author

A separate question, besides upping the timeouts, is whether the gcstress work should be split across more Helix jobs, both to get greater parallelism and to lose less work past the timeout when one is actually hit. Also, we never see Helix work item progress until the job is finished, so it's advantageous not to let a Helix job run too long.

@trylek
Member

trylek commented Apr 26, 2022

@BruceForstall - thanks for pointing that out. I think there are two different aspects to all of this. I looked at multiple jobs in the run, and all of the timeouts in the newly merged Methodical_* tests happened after just a few minutes of activity, so I don't think they're "real" timeouts. I suspect that GCStress=0x3 probably has characteristics similar to GCStress=0xF, which we previously saw causing arbitrary timeouts in the r2r extra runs. According to @cshung's comments in the issue

#68060 (comment)

it might in fact be a test issue. In all of the failed tests I looked at, the failure happened either in the arrres_* test that Andrew explicitly commented on in that thread or in another test, copy_prop_byref_to_native_int. I don't have sufficient GC expertise to assess whether this has the same root cause, but that test also explicitly calls GC.WaitForPendingFinalizers, so the underlying problem for both tests may well be the same.
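For illustration, here is a minimal, hypothetical sketch (not the actual arrres_* or copy_prop_byref_to_native_int source; all names are made up) of the explicit finalizer-wait pattern under discussion. When dozens of merged tests share one process under GCStress, a loop like this runs much more slowly and can also end up waiting on finalization work queued by unrelated tests:

```csharp
// Hypothetical sketch of a test that blocks on its own finalizer; not real test code.
using System;
using System.Runtime.CompilerServices;
using System.Threading;

internal class FinalizableThing
{
    public static volatile bool Finalized;
    ~FinalizableThing() => Finalized = true;
}

internal static class FinalizerWaitSketch
{
    public static int Run()
    {
        CreateGarbage();

        // Under GCStress, with many merged tests in the same process, each
        // iteration is far slower and shares the finalizer queue with every
        // other test, so this loop can easily exceed the work-item timeout.
        while (!FinalizableThing.Finalized)
        {
            GC.Collect();
            GC.WaitForPendingFinalizers();
            Thread.Sleep(10);
        }

        return 100; // conventional "pass" exit code for coreclr tests
    }

    [MethodImpl(MethodImplOptions.NoInlining)]
    private static void CreateGarbage() => new FinalizableThing();
}
```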

As for the actual running time of the individual items, these failures tell us nothing about that. In JanV's local testing, the merged Methodical wrappers timed out after the pre-existing 20-minute timeout in GCStress mode, while the TypeGeneratorTests just made it in about 10-15 minutes. In Pri1 mode, each of the Methodical wrappers comprises about 200-300 tests; in other words, if they theoretically took about 45 minutes to run, further splitting would reduce the total job running time at the expense of spinning up more Helix VMs and more overhead from repeated managed runtime initialization, which is much slower in GC stress mode. I'm trying to mine Kusto for data that would let me better estimate the trade-off.

@BruceForstall
Member Author

> I looked at multiple jobs in the run, and all of the timeouts in the newly merged Methodical_* tests happened after just a few minutes of activity, so I don't think they're "real" timeouts. I suspect that GCStress=0x3 probably has characteristics similar to GCStress=0xF, which we previously saw causing arbitrary timeouts in the r2r extra runs.

Does this mean that maybe the way some tests are written makes them incompatible with runs that enable both GCStress and test merging? If so, can we identify those incompatible patterns and disallow merging for those tests?

@trylek
Member

trylek commented Apr 26, 2022

Yes, that more or less matches my understanding. In general, the easiest way to single out these tests is to mark them as RequiresProcessIsolation, which is basically identical to the legacy out-of-proc way the XUnit console runner executes the non-converted tests. It's a no-brainer to mark arrres and copy_prop_byref_to_native_int as out-of-proc; the only interesting question is whether by doing that we're really fixing something rather than just papering over a real problem.
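As a conceptual sketch only (this is not the real generated merged-wrapper code, and the method and parameter names are invented), the difference between in-proc merging and RequiresProcessIsolation is roughly:

```csharp
// Conceptual sketch; not the actual dotnet/runtime test wrapper infrastructure.
using System;
using System.Diagnostics;

internal static class WrapperSketch
{
    public static void RunGroup(Func<int>[] mergedTests, string[] isolatedTestPaths)
    {
        foreach (var test in mergedTests)
        {
            // In-proc merging: one runtime startup for the whole group, but a
            // shared GC heap and finalizer queue for every test, all slowed
            // down together under GCStress.
            int result = test();
        }

        foreach (var path in isolatedTestPaths)
        {
            // RequiresProcessIsolation: pay runtime startup again for each test
            // (expensive under GCStress), but get a clean process, as the legacy
            // xunit console runner did for non-converted tests.
            using var child = Process.Start(path);
            child?.WaitForExit();
        }
    }
}
```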

@BruceForstall
Member Author

@trylek A weekend run shows we're still hitting 4-hour timeouts with Linux-arm64 GCStress=3 in a couple of Methodical_* tests:

https://dev.azure.com/dnceng/public/_build/results?buildId=1807767&view=ms.vss-test-web.build-test-results-tab&runId=48108702&resultId=110196&paneView=debug

@trylek
Member

trylek commented Jun 6, 2022

Hmm, sadly this is not a previously overlooked case: there are no explicit GC management calls in this test. I'm afraid I don't see any way forward without some form of diagnosability, either by enabling dumps (e.g. just for a few instrumented runs if enabling them for all tests is problematic) or by reproducing the failure locally or on a lab VM.

@agocke added this to the 7.0.0 milestone Jul 1, 2022
@agocke removed the untriaged label Jul 1, 2022
@agocke modified the milestones: 7.0.0, Future Jul 26, 2022
@markples
Member

baseservices/threading/regressions/30032/30032.cs has the same behavior when merged. It calls Sleep, Collect, and WaitForPendingFinalizers. I don't think the test or wrapper directly adds any finalizers, but the test does use Timer, and TimerHolder has one.
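A hedged sketch of the shape described above (not the actual 30032.cs source; the callback, delays, and return values are made up). The test adds no finalizers of its own, but the Timer it uses is backed by a TimerHolder that does have one, per the comment above, so the finalizer wait still has work to contend with, and in a merged GCStress process that work is shared with every other test:

```csharp
// Sketch of the pattern described for 30032.cs; names, delays, and the callback
// are illustrative, not the real test.
using System;
using System.Threading;

internal static class TimerFinalizationSketch
{
    private static int s_ticks;

    public static int Run()
    {
        using var timer = new Timer(_ => Interlocked.Increment(ref s_ticks),
                                    state: null, dueTime: 0, period: 50);

        Thread.Sleep(500); // let the timer fire a few times

        // In a merged GCStress run, this wait is shared with finalization work
        // from every other test in the process, which is one theory for why the
        // merged wrapper stalls here even though the test looks harmless alone.
        GC.Collect();
        GC.WaitForPendingFinalizers();

        return s_ticks > 0 ? 100 : -1; // 100 is the conventional coreclr pass code
    }
}
```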

markples added a commit that referenced this issue Apr 1, 2023
See #68529 for theories.

Should solve the first timeouts in #83961 (baseservices/threading), Directed_1, Regression_1, and Regression_2 by marking tests as RequiresProcessIsolation.
Adds striping to Regression_1, Regression_4, threading_group1.