
gcstress=3 timeout failures #68529

Open
BruceForstall opened this issue Apr 26, 2022 · 9 comments

@BruceForstall
Member

A recent run of the runtime-coreclr gcstress0x3-gcstress0xc pipeline shows timeout failures after one hour in many different gcstress=3 tests:

https://dev.azure.com/dnceng/public/_build/results?buildId=1735353&view=ms.vss-test-web.build-test-results-tab&runId=46978262&resultId=111084&paneView=debug

e.g., the Methodical_* tests in:

  • coreclr Linux x64 Checked gcstress0x3 @ Ubuntu.1804.Amd64.Open
  • coreclr Linux arm64 Checked gcstress0x3 @ (Ubuntu.1804.Arm64.Open)Ubuntu.1804.Armarch.Open@mcr.microsoft.com/dotnet-buildtools/prereqs:ubuntu-18.04-helix-arm64v8-20210531091519-97d8652
  • coreclr Linux arm Checked gcstress0x3 @ (Ubuntu.1804.Arm32.Open)Ubuntu.1804.Armarch.Open@mcr.microsoft.com/dotnet-buildtools/prereqs:ubuntu-18.04-helix-arm32v7-bfcd90a-20200121150440
  • coreclr OSX arm64 Checked gcstress0x3 @ OSX.1200.ARM64.Open

@trylek @jkoritzinsky did the recent test "batching" change effectively change the per-job timeouts? Does gcstress need to be handled differently?

@dotnet-issue-labeler bot added the untriaged label Apr 26, 2022
@ghost

ghost commented Apr 26, 2022

Tagging subscribers to this area: @hoyosjs
See info in area-owners.md if you want to be subscribed.

Issue Details

Author: BruceForstall
Assignees: -
Labels: GCStress, area-Infrastructure-coreclr

Milestone: -

@trylek
Member

trylek commented Apr 26, 2022

I've just been discussing GCStress timeouts with JanV, who hit this locally, and I think I have a fix, or at least a mitigation, in

#68523

Please let me know if you're aware of yet another place where the timeouts are hard-coded; there are never enough duplicates.

@BruceForstall
Member Author

A separate question, besides upping the timeouts, is whether the gcstress work should be split across more Helix jobs, both to get greater parallelism and to lose less work past the timeout when one is actually hit. Also, we never see Helix work item progress until the job is finished, so it's advantageous not to let a Helix job run too long.

@trylek
Member

trylek commented Apr 26, 2022

@BruceForstall - thanks for pointing that out. I think there are two different aspects to all of this. I looked at multiple jobs in the run, and all of the timeouts in the newly merged Methodical_* tests happened after just a few minutes of activity, so I don't think they're "real" timeouts. I suspect that GCStress=0x3 probably has characteristics similar to GCStress=0xF, which we previously saw causing arbitrary timeouts in the r2r extra runs. According to @cshung's comments in the issue

#68060 (comment)

it might in fact be a test issue. In all of the failed tests I looked at, the failure happened either in the arrres_* test that Andrew explicitly commented on in that thread or in another test, copy_prop_byref_to_native_int. I don't have sufficient GC expertise to assess whether this has the same root cause, but that test also explicitly calls GC.WaitForPendingFinalizers, so the underlying problem for both tests may well be the same.
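For illustration, here is a minimal, hypothetical sketch (not the actual arrres_* or copy_prop_byref_to_native_int source; all names are made up) of the explicit finalizer-wait pattern under discussion. When dozens of merged tests share one process under GCStress, a loop like this runs much more slowly and can also end up waiting on finalization work queued by unrelated tests:

```csharp
// Hypothetical sketch of a test that blocks on its own finalizer; not real test code.
using System;
using System.Runtime.CompilerServices;
using System.Threading;

internal class FinalizableThing
{
    public static volatile bool Finalized;
    ~FinalizableThing() => Finalized = true;
}

internal static class FinalizerWaitSketch
{
    public static int Run()
    {
        CreateGarbage();

        // Under GCStress, with many merged tests in the same process, each
        // iteration is far slower and shares the finalizer queue with every
        // other test, so this loop can easily exceed the work-item timeout.
        while (!FinalizableThing.Finalized)
        {
            GC.Collect();
            GC.WaitForPendingFinalizers();
            Thread.Sleep(10);
        }

        return 100; // conventional "pass" exit code for coreclr tests
    }

    [MethodImpl(MethodImplOptions.NoInlining)]
    private static void CreateGarbage() => new FinalizableThing();
}
```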

As for the actual running time of the individual items, these failures tell us nothing about that. In JanV's local testing, the merged Methodical wrappers timed out after the pre-existing 20-minute timeout in GCStress mode, while the TypeGeneratorTests just made it in about 10-15 minutes. In Pri1 mode, each of the Methodical wrappers comprises about 200-300 tests; in other words, if they theoretically took about 45 minutes to run, further splitting would reduce the total job running time at the expense of spinning up more Helix VMs and more overhead from repeated managed runtime initialization, which is much slower in GC stress mode. I'm trying to mine Kusto for data that would let me better estimate the trade-off.

@BruceForstall
Member Author

> I looked at multiple jobs in the run, and all of the timeouts in the newly merged Methodical_* tests happened after just a few minutes of activity, so I don't think they're "real" timeouts. I suspect that GCStress=0x3 probably has characteristics similar to GCStress=0xF, which we previously saw causing arbitrary timeouts in the r2r extra runs.

Does this mean that maybe the way some tests are written makes them incompatible with runs that enable both GCStress and test merging? If so, can we identify those incompatible patterns and disallow merging for those tests?

@trylek
Member

trylek commented Apr 26, 2022

Yes, that more or less matches my understanding. In general, the easiest way to single out these tests is to mark them as RequiresProcessIsolation, which is basically identical to the legacy out-of-proc way the XUnit console runner executes the non-converted tests. It's a no-brainer to mark arrres and copy_prop_byref_to_native_int as out-of-proc; the only interesting question is whether by doing that we're really fixing something rather than just papering over a real problem.
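As a conceptual sketch only (this is not the real generated merged-wrapper code, and the method and parameter names are invented), the difference between in-proc merging and RequiresProcessIsolation is roughly:

```csharp
// Conceptual sketch; not the actual dotnet/runtime test wrapper infrastructure.
using System;
using System.Diagnostics;

internal static class WrapperSketch
{
    public static void RunGroup(Func<int>[] mergedTests, string[] isolatedTestPaths)
    {
        foreach (var test in mergedTests)
        {
            // In-proc merging: one runtime startup for the whole group, but a
            // shared GC heap and finalizer queue for every test, all slowed
            // down together under GCStress.
            int result = test();
        }

        foreach (var path in isolatedTestPaths)
        {
            // RequiresProcessIsolation: pay runtime startup again for each test
            // (expensive under GCStress), but get a clean process, as the legacy
            // xunit console runner did for non-converted tests.
            using var child = Process.Start(path);
            child?.WaitForExit();
        }
    }
}
```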

@BruceForstall
Member Author

@trylek A weekend run shows we're still hitting 4-hour timeouts with Linux-arm64 GCStress=3 in a couple of Methodical_* tests:

https://dev.azure.com/dnceng/public/_build/results?buildId=1807767&view=ms.vss-test-web.build-test-results-tab&runId=48108702&resultId=110196&paneView=debug

@trylek
Member

trylek commented Jun 6, 2022

Hmm, sadly this is not a previously overlooked case: there are no explicit GC management calls in this test. I'm afraid I don't see any way forward without some form of diagnosability, either by enabling dumps (e.g. just for a few instrumented runs if enabling them for all tests is problematic) or by reproducing the failure locally or on a lab VM.

@agocke added this to the 7.0.0 milestone Jul 1, 2022
@agocke removed the untriaged label Jul 1, 2022
@agocke modified the milestones: 7.0.0, Future Jul 26, 2022
@markples
Member

baseservices/threading/regressions/30032/30032.cs has the same behavior when merged. It calls Sleep, Collect, and WaitForPendingFinalizers. I don't think the test or wrapper directly adds any finalizers, but the test does use Timer, and TimerHolder has one.
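A hedged sketch of the shape described above (not the actual 30032.cs source; the callback, delays, and return values are made up). The test adds no finalizers of its own, but the Timer it uses is backed by a TimerHolder that does have one, per the comment above, so the finalizer wait still has work to contend with, and in a merged GCStress process that work is shared with every other test:

```csharp
// Sketch of the pattern described for 30032.cs; names, delays, and the callback
// are illustrative, not the real test.
using System;
using System.Threading;

internal static class TimerFinalizationSketch
{
    private static int s_ticks;

    public static int Run()
    {
        using var timer = new Timer(_ => Interlocked.Increment(ref s_ticks),
                                    state: null, dueTime: 0, period: 50);

        Thread.Sleep(500); // let the timer fire a few times

        // In a merged GCStress run, this wait is shared with finalization work
        // from every other test in the process, which is one theory for why the
        // merged wrapper stalls here even though the test looks harmless alone.
        GC.Collect();
        GC.WaitForPendingFinalizers();

        return s_ticks > 0 ? 100 : -1; // 100 is the conventional coreclr pass code
    }
}
```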

markples added a commit that referenced this issue Apr 1, 2023
See #68529 for theories.

Should solve the first timeouts in #83961 (baseservices/threading), Directed_1, Regression_1, and Regression_2 by marking tests as RequiresProcessIsolation.
Adds striping to Regression_1, Regression_4, threading_group1.