
Extend stress timeouts for merged tests #68523

Merged: 2 commits into dotnet:main from MergedTestGCStressTimeout on Apr 26, 2022

Conversation

@trylek (Member) commented Apr 25, 2022

As we're intentionally running larger merged tests in-proc to reduce
runtime overhead, we've become prone to hitting various infra
timeouts. As part of the switchover I extended the default test
timeout from 15 to 30 minutes; I however overlooked the fact that
the run.py script manually overrides this value for GC stress and
other special test build / execution modes. This change doubles
all these timeouts to cater for the larger merged test apps.

Thanks

Tomas

/cc @dotnet/runtime-infrastructure
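For background, here is a minimal illustrative sketch of the kind of per-mode timeout override the description refers to. It is simplified and not the actual contents of src/tests/run.py; per_test_timeout and gc_stress match names visible in the diff further down, while default_minutes and jit_stress are hypothetical stand-ins for the other build / execution modes.

# Illustrative sketch only; simplified from the pattern used in src/tests/run.py
# and not the actual file contents. Timeouts are expressed in milliseconds.
def pick_per_test_timeout(default_minutes, jit_stress=False, gc_stress=False):
    # Start from the default timeout (extended from 15 to 30 minutes for merged tests).
    per_test_timeout = default_minutes * 60 * 1000

    if jit_stress:
        # A special mode overrides the default with its own fixed value (hypothetical here).
        per_test_timeout = 60 * 60 * 1000

    if gc_stress:
        # GC stress is the slowest mode; before this change it used a fixed
        # 120-minute timeout, which this PR doubles to 240 minutes.
        per_test_timeout = 240 * 60 * 1000

    return per_test_timeout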

@ghost commented Apr 25, 2022

Tagging subscribers to this area: @hoyosjs
See info in area-owners.md if you want to be subscribed.

Issue Details


Author: trylek
Assignees: -
Labels: area-Infrastructure-coreclr

Milestone: -

src/tests/run.py Outdated
@@ -875,8 +875,8 @@ def run_tests(args,
         os.environ["LargeVersionBubble"] = "true"

     if gc_stress:
-        print("Running GCStress, extending timeout to 120 minutes.")
-        per_test_timeout = 120*60*1000
+        print("Running GCStress, extending timeout to 240 minutes.")
A Member commented:
It seems it would make sense to apply the gc_stress timeout as the last one, as it is by far the slowest mode. And maybe just multiply the previously picked timeout by a constant instead of setting a fixed timeout. I guess that any “type” of test would become roughly proportionally slower with GC stress enabled.
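For illustration, a minimal sketch of the multiplier idea suggested above, assuming a hypothetical GC_STRESS_TIMEOUT_MULTIPLIER constant; this is not the actual contents of the second commit.

# Sketch of the suggestion: apply the GC stress adjustment last and scale
# whatever timeout the earlier mode checks picked, instead of overwriting it
# with a fixed value. The multiplier value is hypothetical.
GC_STRESS_TIMEOUT_MULTIPLIER = 2

def adjust_timeout_for_gc_stress(per_test_timeout, gc_stress):
    if gc_stress:
        per_test_timeout *= GC_STRESS_TIMEOUT_MULTIPLIER
        print("Running GCStress, extending timeout to %d minutes."
              % (per_test_timeout // (60 * 1000)))
    return per_test_timeout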

@trylek (Member, Author) replied:
Thanks for the suggestion; addressed in the 2nd commit. Can you please take another look?

@janvorli (Member) left a comment:
LGTM, thank you!

@BruceForstall (Member) left a comment:
LGTM

Note that eng\pipelines\common\templates\runtimes\run-test-job.yml specifies timeoutPerTestInMinutes, timeoutPerTestCollectionInMinutes, and timeoutInMinutes (which I barely understand the difference between), and I'm sure other YML files also specify timeouts...

@trylek (Member, Author) commented Apr 26, 2022

Thanks Bruce for the heads-up. I believe the timeout values in yml scripts haven't been updated in a while so I'll continue monitoring them and adjusting them as necessary.

@trylek merged commit 054954c into dotnet:main on Apr 26, 2022
@BruceForstall (Member) commented:
fwiw, it seems like we could use Kusto to find the longest successful time for every Helix job for each pipeline, add 25%, and then set the timeout to that. But that would be a lot of work...

@trylek deleted the MergedTestGCStressTimeout branch on April 26, 2022 at 18:22
@trylek (Member, Author) commented Apr 26, 2022

Good point, Bruce. I'm actually working with Kusto right now to substantiate my perf claims regarding the merged tests; I think reusing similar queries for calibrating timeouts should be simple.

@ghost locked the conversation as resolved and limited it to collaborators on May 26, 2022
@BruceForstall (Member) commented:
@trylek This is at least partially related:

Is the test partitioning static?

Ideally we would create many more partitions for (slow) stress jobs, to keep the maximum timeout for any Helix partition under an hour (or two). I've been told many times by MattGal that long timeouts don't work well in Helix. One problem of course is you can't rerun part of a partition, and you get no incremental feedback about progress.

@trylek (Member, Author) commented Jun 7, 2022

According to our design, the test partitioning is "intentional", in other words "static" (as opposed to "automated" or "implicit"), in the sense that it's up to us to decide what gets tested where. One of the main motivations for this decision was the fact that the pre-existing runtime test logic employs three different implicit test partitionings that are super hard to reason about and maintain: the build partitioning aimed at overcoming msbuild limitations in the presence of a large number of projects, the XUnit wrapper partitioning, and the Helix sub-partitioning of the XUnit wrappers.

In our initial design discussions I originally proposed a somewhat more conservative partitioning involving hundreds of wrappers with tens or so of test projects each, while @jkotas suggested more aggressive merging that ended up with just tens of final merged test groups. We ultimately adopted his approach in the actual merge process, and it seems to me that in most cases it works just fine.

As we're discussing on the other thread ways of running the complete merged test suites even in the presence of native runtime assertion failures and crashes, my proposal is to have the full merged test wrapper launch itself as a child process so that the parent can pick up and run the remaining tests once the child has crashed. That would make it equally simple to run the child process in batches to watch out for particular tests taking too long. Having said that, it doesn't fully solve the "long Helix item" problem as it's still within the same work item.
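As a rough illustration of the relaunch idea described above: the real wrapper is a C# test app, and everything in this sketch, including the child-mode flag, exit-code handling, and progress file, is hypothetical.

# Hypothetical sketch of a parent/child relaunch loop; not the actual merged
# test wrapper. The parent re-launches itself in "child" mode and, whenever the
# child crashes, resumes from the test after the one that was being run.
import subprocess
import sys

PROGRESS_FILE = "progress.txt"

def read_last_attempted_index():
    # The child is assumed to record the index of each test just before running it.
    with open(PROGRESS_FILE) as f:
        return int(f.read().strip())

def run_all_tests_with_crash_recovery(test_count):
    next_index = 0
    while next_index < test_count:
        result = subprocess.run(
            [sys.executable, sys.argv[0], "--child", "--start-index", str(next_index)])
        if result.returncode == 0:
            break  # the child ran all remaining tests without crashing
        # Skip past the test the child crashed on and relaunch.
        next_index = read_last_attempted_index() + 1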

@BruceForstall (Member) commented:
> it doesn't fully solve the "long Helix item" problem as it's still within the same work item.

Also, with a lot of Helix work items, we can scale out testing as machine resources are available. (I seem to remember there can be starvation type of problems when there are long-lived Helix work items.) I'm sure you've thought through all of this already so I'm not adding anything :-)
