
Extend stress timeouts for merged tests #68523

Merged: 2 commits into dotnet:main from MergedTestGCStressTimeout on Apr 26, 2022

Conversation

@trylek (Member) commented Apr 25, 2022

As we're intentionally running larger merged tests in-proc to reduce
runtime overhead, we've become prone to hitting various infra
timeouts. As part of the switchover I extended the default test
timeout from 15 to 30 minutes; I however overlooked the fact that
the run.py script manually overrides this value for GC stress and
other special test build / execution modes. This change doubles
all these timeouts to cater for the larger merged test apps.

Thanks

Tomas

/cc @dotnet/runtime-infrastructure
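For background, here is a minimal illustrative sketch of the kind of per-mode timeout override the description refers to. It is simplified and not the actual contents of src/tests/run.py; per_test_timeout and gc_stress match names visible in the diff further down, while default_minutes and jit_stress are hypothetical stand-ins for the other build / execution modes.

# Illustrative sketch only; simplified from the pattern used in src/tests/run.py
# and not the actual file contents. Timeouts are expressed in milliseconds.
def pick_per_test_timeout(default_minutes, jit_stress=False, gc_stress=False):
    # Start from the default timeout (extended from 15 to 30 minutes for merged tests).
    per_test_timeout = default_minutes * 60 * 1000

    if jit_stress:
        # A special mode overrides the default with its own fixed value (hypothetical here).
        per_test_timeout = 60 * 60 * 1000

    if gc_stress:
        # GC stress is the slowest mode; before this change it used a fixed
        # 120-minute timeout, which this PR doubles to 240 minutes.
        per_test_timeout = 240 * 60 * 1000

    return per_test_timeout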

@ghost commented Apr 25, 2022

Tagging subscribers to this area: @hoyosjs
See info in area-owners.md if you want to be subscribed.

Issue Details


Author: trylek
Assignees: -
Labels: area-Infrastructure-coreclr

Milestone: -

src/tests/run.py Outdated
@@ -875,8 +875,8 @@ def run_tests(args,
         os.environ["LargeVersionBubble"] = "true"

     if gc_stress:
-        print("Running GCStress, extending timeout to 120 minutes.")
-        per_test_timeout = 120*60*1000
+        print("Running GCStress, extending timeout to 240 minutes.")
A Member commented:
It seems it would make sense to apply the gc_stress timeout as the last one, as it is by far the slowest mode. And maybe just multiply the previously picked timeout by a constant instead of setting a fixed timeout. I guess that any “type” of test would become roughly proportionally slower with GC stress enabled.
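For illustration, a minimal sketch of the multiplier idea suggested above, assuming a hypothetical GC_STRESS_TIMEOUT_MULTIPLIER constant; this is not the actual contents of the second commit.

# Sketch of the suggestion: apply the GC stress adjustment last and scale
# whatever timeout the earlier mode checks picked, instead of overwriting it
# with a fixed value. The multiplier value is hypothetical.
GC_STRESS_TIMEOUT_MULTIPLIER = 2

def adjust_timeout_for_gc_stress(per_test_timeout, gc_stress):
    if gc_stress:
        per_test_timeout *= GC_STRESS_TIMEOUT_MULTIPLIER
        print("Running GCStress, extending timeout to %d minutes."
              % (per_test_timeout // (60 * 1000)))
    return per_test_timeout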

@trylek (Member, Author) replied:
Thanks for the suggestion; addressed in the 2nd commit. Can you please take another look?

@janvorli (Member) left a comment:
LGTM, thank you!

@BruceForstall (Member) left a comment:
LGTM

Note that eng\pipelines\common\templates\runtimes\run-test-job.yml specifies timeoutPerTestInMinutes, timeoutPerTestCollectionInMinutes, and timeoutInMinutes (which I barely understand the difference between), and I'm sure other YML files also specify timeouts...

@trylek (Member, Author) commented Apr 26, 2022

Thanks Bruce for the heads-up. I believe the timeout values in yml scripts haven't been updated in a while so I'll continue monitoring them and adjusting them as necessary.

@trylek merged commit 054954c into dotnet:main on Apr 26, 2022
@BruceForstall (Member) commented:
fwiw, it seems like we could use Kusto to find the longest successful time for every Helix job for each pipeline, add 25%, and then set the timeout to that. But that would be a lot of work...

@trylek deleted the MergedTestGCStressTimeout branch on April 26, 2022 at 18:22
@trylek (Member, Author) commented Apr 26, 2022

Good point, Bruce. I'm actually working with Kusto right now to substantiate my perf claims regarding the merged tests; I think reusing similar queries for calibrating timeouts should be simple.

@ghost locked the conversation as resolved and limited it to collaborators on May 26, 2022
@BruceForstall (Member) commented:
@trylek This is at least partially related:

Is the test partitioning static?

Ideally we would create many more partitions for (slow) stress jobs, to keep the maximum timeout for any Helix partition under an hour (or two). I've been told many times by MattGal that long timeouts don't work well in Helix. One problem of course is you can't rerun part of a partition, and you get no incremental feedback about progress.

@trylek (Member, Author) commented Jun 7, 2022

According to our design, the test partitioning is "intentional", in other words "static" (as opposed to "automated" or "implicit"), in the sense that it's up to us to decide what gets tested where. One of the main motivations for this decision was the fact that the pre-existing runtime test logic employs three different implicit test partitionings that are super hard to reason about and maintain: the build partitioning aimed at overcoming msbuild limitations in the presence of a large number of projects, the XUnit wrapper partitioning, and the Helix sub-partitioning of the XUnit wrappers.

In our initial design discussions I originally proposed a somewhat more conservative partitioning involving hundreds of wrappers with tens or so of test projects each, while @jkotas suggested more aggressive merging that ended up with just tens of final merged test groups. We ultimately adopted his approach in the actual merge process, and it seems to me that in most cases it works just fine.

As we're discussing on the other thread ways of running the complete merged test suites even in the presence of native runtime assertion failures and crashes, my proposal is to have the full merged test wrapper launch itself as a child process so that the parent can pick up and run the remaining tests once the child has crashed. That would make it equally simple to run the child process in batches to watch out for particular tests taking too long. Having said that, it doesn't fully solve the "long Helix item" problem as it's still within the same work item.
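As a rough illustration of the relaunch idea described above: the real wrapper is a C# test app, and everything in this sketch, including the child-mode flag, exit-code handling, and progress file, is hypothetical.

# Hypothetical sketch of a parent/child relaunch loop; not the actual merged
# test wrapper. The parent re-launches itself in "child" mode and, whenever the
# child crashes, resumes from the test after the one that was being run.
import subprocess
import sys

PROGRESS_FILE = "progress.txt"

def read_last_attempted_index():
    # The child is assumed to record the index of each test just before running it.
    with open(PROGRESS_FILE) as f:
        return int(f.read().strip())

def run_all_tests_with_crash_recovery(test_count):
    next_index = 0
    while next_index < test_count:
        result = subprocess.run(
            [sys.executable, sys.argv[0], "--child", "--start-index", str(next_index)])
        if result.returncode == 0:
            break  # the child ran all remaining tests without crashing
        # Skip past the test the child crashed on and relaunch.
        next_index = read_last_attempted_index() + 1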

@BruceForstall (Member) commented:
> it doesn't fully solve the "long Helix item" problem as it's still within the same work item.

Also, with a lot of Helix work items, we can scale out testing as machine resources are available. (I seem to remember there can be starvation type of problems when there are long-lived Helix work items.) I'm sure you've thought through all of this already so I'm not adding anything :-)
