Extend stress timeouts for merged tests #68523
Conversation
As we're intentionally running larger merged tests in-proc to reduce runtime overhead, we've become prone to hitting various infra timeouts. As part of the switchover I extended the default test timeout from 15 to 30 minutes; I however overlooked the fact that the run.py script manually overrides this value for GC stress and other special test build / execution flavors. This change doubles all these timeouts to cater for the larger merged test apps. Thanks Tomas
Tagging subscribers to this area: @hoyosjs
/cc @dotnet/runtime-infrastructure
src/tests/run.py (Outdated)

```diff
@@ -875,8 +875,8 @@ def run_tests(args,
         os.environ["LargeVersionBubble"] = "true"

     if gc_stress:
-        print("Running GCStress, extending timeout to 120 minutes.")
-        per_test_timeout = 120*60*1000
+        print("Running GCStress, extending timeout to 240 minutes.")
+        per_test_timeout = 240*60*1000
```
It seems it would make sense to apply the gc_stress timeout last, as it is by far the slowest mode. And maybe just multiply the previously picked timeout by a constant instead of setting a fixed timeout. I guess that any "type" of test would become roughly proportionally slower with GC stress enabled.
Thanks for the suggestion, addressed in 2nd commit, can you please take another look?
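For illustration, the multiplier approach suggested above might look something like the following. This is a hypothetical sketch, not the actual run.py change; the names `compute_timeout`, `DEFAULT_TIMEOUT_MS`, and `GC_STRESS_MULTIPLIER` are invented.

```python
# Hypothetical sketch (not the actual run.py change): instead of overwriting
# the per-test timeout with a fixed value, scale whatever timeout the
# build/execution flavor already picked, applying the GC stress factor last.

DEFAULT_TIMEOUT_MS = 30 * 60 * 1000   # 30-minute default for merged tests
GC_STRESS_MULTIPLIER = 4              # GC stress is by far the slowest mode

def compute_timeout(base_timeout_ms, gc_stress=False):
    """Apply the GC stress multiplier on top of any flavor-specific
    timeout, so every test flavor scales proportionally."""
    timeout_ms = base_timeout_ms
    if gc_stress:
        timeout_ms *= GC_STRESS_MULTIPLIER
    return timeout_ms

# A flavor that already doubled the default still scales proportionally:
print(compute_timeout(2 * DEFAULT_TIMEOUT_MS, gc_stress=True) // 60000)  # 240
```

The benefit over a fixed value is that a flavor which already needed a longer base timeout keeps its headroom when GC stress is layered on top.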
LGTM, thank you!
LGTM
Note that eng\pipelines\common\templates\runtimes\run-test-job.yml specifies timeoutPerTestInMinutes, timeoutPerTestCollectionInMinutes, and timeoutInMinutes (which I barely understand the difference between), and I'm sure other YML files also specify timeouts...
Thanks Bruce for the heads-up. I believe the timeout values in yml scripts haven't been updated in a while so I'll continue monitoring them and adjusting them as necessary.
fwiw, it seems like we could use Kusto to find the longest successful time for every Helix job for each pipeline, add 25%, and then set the timeout to that. But that would be a lot of work...
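The calibration idea above reduces to a simple computation once the per-job durations are in hand. A minimal sketch, assuming the longest successful durations have already been pulled (e.g. from Kusto); the function name and job names are made up for illustration.

```python
# Hypothetical sketch of timeout calibration: given the longest successful
# duration observed for each Helix job, set the timeout to that value plus
# a 25% margin. Job names and durations below are invented.

def calibrate_timeouts(longest_success_minutes, margin=0.25):
    """Return a per-job timeout: longest observed success plus a margin."""
    return {job: duration * (1 + margin)
            for job, duration in longest_success_minutes.items()}

observed = {"gcstress0x3": 180.0, "jitstress2": 40.0}
print(calibrate_timeouts(observed))
# {'gcstress0x3': 225.0, 'jitstress2': 50.0}
```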
Good point Bruce, I'm actually right now working with Kusto to try to give substance to my perf claims regarding the merged tests; I think reusing similar queries for calibrating timeouts should be simple.
@trylek This is at least partially related: Is the test partitioning static? Ideally we would create many more partitions for (slow) stress jobs, to keep the maximum timeout for any Helix partition under an hour (or two). I've been told many times by MattGal that long timeouts don't work well in Helix. One problem of course is you can't rerun part of a partition, and you get no incremental feedback about progress.
According to our design, the test partitioning is "intentional", in other words "static" (as opposed to "automated" or "implicit"), in the sense that it's up to us to decide what gets tested where. One of the main motivations for this decision was the fact that the pre-existing runtime test logic employs three different implicit test partitionings that are super hard to reason about and maintain: the build partitioning aimed at overcoming msbuild limitations in the presence of a large number of projects, the XUnit wrapper partitioning, and the Helix sub-partitioning of the XUnit wrappers.

In our initial design discussions I originally proposed a somewhat more conservative partitioning involving hundreds of wrappers with tens or so test projects each, while @jkotas suggested more aggressive merging ending up with just tens of final merged test groups; we ultimately adopted his approach in the actual merge process and it seems to me that in most cases it works just fine.

As we're discussing ways of running the complete merged test suites even in the presence of native runtime assertion failures and crashes on the other thread, my proposal to have the full merged test wrapper launch itself as a child process (so the parent can pick up and run the remaining tests once the child has crashed) should make it equally simple to run the child process in batches to watch out for particular tests taking too long. Having said that, it doesn't fully solve the "long Helix item" problem, as everything still runs within the same work item.
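The relaunch-in-batches proposal described above could be sketched roughly as follows. This is a hypothetical illustration, not the actual wrapper design; `run_in_batches`, the `--run-tests` flag, and the parameter names are all invented.

```python
# Hypothetical sketch of the child-process idea: the merged test wrapper
# relaunches itself to run tests in batches, so a native crash or an
# overlong test only loses one batch and the parent can continue with the
# remaining tests. All names here are invented for illustration.
import subprocess
import sys

def run_in_batches(test_names, batch_size, timeout_seconds, child_cmd=None):
    """Run tests in fixed-size batches via child processes; return the
    batches that crashed or timed out so they can be reported or retried."""
    if child_cmd is None:
        # Default: relaunch this script restricted to the current batch
        # (a stand-in for the merged test wrapper relaunching itself).
        child_cmd = lambda batch: [sys.executable, sys.argv[0],
                                   "--run-tests", *batch]
    failed_batches = []
    for start in range(0, len(test_names), batch_size):
        batch = test_names[start:start + batch_size]
        try:
            subprocess.run(child_cmd(batch), timeout=timeout_seconds,
                           check=True)
        except (subprocess.TimeoutExpired, subprocess.CalledProcessError):
            failed_batches.append(batch)  # crash/timeout loses only this batch
    return failed_batches
```

As noted in the comment, this contains the blast radius of a crash but does not shorten the Helix work item itself, since all batches still run inside one item.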
Also, with a lot of Helix work items, we can scale out testing as machine resources are available. (I seem to remember there can be starvation-type problems when there are long-lived Helix work items.) I'm sure you've thought through all of this already so I'm not adding anything :-)