test: enable parallel redpanda startup by default #5972
Conversation
Let's see what the total runtime delta is vs. the tip of dev.
tests/rptest/services/redpanda.py
Outdated
-                 parallel: bool = False):
+                 parallel: bool = True):
+1. At one point we had observed that the basic start-up cost of an empty test was about 10 seconds. Multiply that by ~500 tests, and any little thing helps!
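To illustrate why parallel startup amortizes that per-test cost, here is a minimal, self-contained sketch of starting several nodes concurrently with a thread pool. `start_node` and the node names are hypothetical stand-ins for the real per-node startup work (SSH, binary launch, health checks), not the actual service API:

```python
import concurrent.futures
import time

def start_node(name: str) -> str:
    # Stand-in for real per-node startup work.
    time.sleep(0.1)
    return name

nodes = ["node-1", "node-2", "node-3"]

# Serial startup costs the sum of the per-node times; parallel startup
# costs roughly the slowest single node. map() preserves input order.
with concurrent.futures.ThreadPoolExecutor(max_workers=len(nodes)) as executor:
    started = list(executor.map(start_node, nodes))

print(started)  # ['node-1', 'node-2', 'node-3']
```

With three nodes at ~0.1 s each, the parallel version finishes in roughly one node's startup time instead of three.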
The actual parallelism didn't make much difference for most of the 3-node tests. I think to get the benefit we also need to tune the wait loops and stop in parallel too: the latest version of this does that. All of this will also benefit from #6003 once it's ready.
That last run exposed a couple of pre-existing bugs:
/ci-repeat 5
This just had a clean run through, with a release build runtime of 1h 50m, so we're saving about 15 minutes. That's nice, although the main benefit is for developers running subsets of tests locally: some test suites with many small tests are about twice as fast now.
/ci-repeat 5
Pretty stable in the last CI repeat:
The last CI run had only one failure. One more repeat run, as we might as well let this shake out any more failures it's going to have while it's still in PR.
/ci-repeat 5
5x runs with one test failure (#4702). This is good to go.
with concurrent.futures.ThreadPoolExecutor(
        max_workers=len(nodes)) as executor:
    list(
        executor.map(lambda n: self.stop_node(n, timeout=stop_timeout),
What will happen if an executor thread throws? I am wondering if we would lose the timeout exception in this case.
From the map() docs: "If a func call raises an exception, then that exception will be raised when its value is retrieved from the iterator."
The call to list() is the trick here: it pumps the iterator and makes sure that if any of the results were exceptional, they are raised.
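A small self-contained demonstration of that behavior: the map() call itself never raises, because the worker exception is stored in the future; it only surfaces when the result iterator is consumed, which is exactly what the list() call forces:

```python
import concurrent.futures

def work(n: int) -> int:
    # Deliberately fail on one input to show exception propagation.
    if n == 2:
        raise RuntimeError(f"failed on {n}")
    return n * 10

with concurrent.futures.ThreadPoolExecutor(max_workers=4) as executor:
    # No exception here: tasks are submitted, results not yet retrieved.
    results = executor.map(work, [1, 2, 3])

# The exception surfaces only when the iterator is drained.
try:
    list(results)
    raised = False
except RuntimeError:
    raised = True

print(raised)  # True
```

Without the list() (or some other consumption of the iterator), the RuntimeError would be silently dropped, which is why the wrapper matters in the stop_node code above.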
perfect
It's worth adding a comment; without it, list(x) looks like an artifact of refactoring and almost asks to be reduced to x.
This shaves a few seconds off the startup of every test that uses multiple nodes.
- On initial start, use a smaller retry interval while waiting for nodes to come up, to avoid spurious full-second waits.
- When stopping and restarting, do all nodes in parallel.
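The smaller-retry-interval idea can be sketched as a generic wait loop. The names `wait_until`, `check`, and `backoff_sec` here are illustrative, not the actual ducktape/test-framework API:

```python
import time

def wait_until(check, timeout_sec: float = 30.0, backoff_sec: float = 0.2):
    """Poll `check` until it returns True, sleeping only `backoff_sec`
    between attempts instead of a full second, so a node that comes up
    mid-interval is detected almost immediately."""
    deadline = time.monotonic() + timeout_sec
    while time.monotonic() < deadline:
        if check():
            return
        time.sleep(backoff_sec)
    raise TimeoutError("condition not met within timeout")

# Example: the condition becomes true on the third poll.
state = {"calls": 0}
def ready():
    state["calls"] += 1
    return state["calls"] >= 3

wait_until(ready, timeout_sec=5.0, backoff_sec=0.05)
print(state["calls"])  # 3
```

With a 1-second sleep, this example would waste up to two nearly full seconds; with a 50 ms backoff the wasted time is bounded by the backoff itself.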
It is not necessary to wait for the mtls feature, as it is no longer used to gate enabling the functionality in 22.2. The sleep(5) hack for waiting for ACLs is replaced by writing a phony user after the ACLs, and then waiting for that user to be visible everywhere, as evidence that the controller log has proceeded past the ACLs.
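The marker-user pattern replaces a fixed sleep with positive evidence of replication. In this sketch, `create_user`, `visible_everywhere`, and the in-memory `cluster` dict are hypothetical stand-ins for admin API calls against each broker:

```python
import time

# Simulated per-node view of the user database; in reality each entry
# would be an admin API query against one broker.
cluster = {"node-1": set(), "node-2": set(), "node-3": set()}

def create_user(name: str):
    # Hypothetical: the write goes to the controller, then replicates.
    for users in cluster.values():
        users.add(name)

def visible_everywhere(name: str) -> bool:
    return all(name in users for users in cluster.values())

# Write the ACLs ... then write a phony marker user afterwards.
create_user("__acl_sync_marker")

# Seeing the marker on every node proves each node's controller log has
# advanced past the earlier ACL writes, with no fixed sleep(5).
deadline = time.monotonic() + 10
while not visible_everywhere("__acl_sync_marker"):
    if time.monotonic() > deadline:
        raise TimeoutError("marker user never replicated")
    time.sleep(0.1)

print("ACLs visible everywhere")
```

The key design point: because the controller log is ordered, a later write being visible on a node implies all earlier writes are too, so the marker is a cheap proxy for "ACLs applied".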
This is quicker than waiting for startup to time out.
This was previously waiting for a timeout on normal redpanda startup. That's unnecessary: when redpanda fails to start up, it does so very quickly, so we can just wait for the PIDs to go away.
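Checking directly that the process has exited is much faster than waiting for a startup probe to time out. A POSIX-only sketch using signal 0 as the liveness check (`pid_alive` and `wait_for_exit` are illustrative helpers, not the actual service code):

```python
import os
import subprocess
import time

def pid_alive(pid: int) -> bool:
    # POSIX: signal 0 performs the existence/permission check without
    # actually delivering a signal.
    try:
        os.kill(pid, 0)
        return True
    except ProcessLookupError:
        return False
    except PermissionError:
        return True  # exists, but owned by another user

def wait_for_exit(pid: int, timeout_sec: float = 10.0,
                  backoff_sec: float = 0.05):
    deadline = time.monotonic() + timeout_sec
    while pid_alive(pid):
        if time.monotonic() > deadline:
            raise TimeoutError(f"pid {pid} still running")
        time.sleep(backoff_sec)

# Demo: a short-lived child; reap it so its pid truly goes away,
# then confirm the liveness check reflects that.
proc = subprocess.Popen(["sleep", "0.1"])
proc.wait()
print(pid_alive(os.getpid()), pid_alive(proc.pid))  # True False
```

One caveat worth a comment in real code: an unreaped zombie child still answers `kill(pid, 0)`, so the parent must wait()/reap before the pid actually disappears.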
Two issues:
- Calls to get_cluster_metrics(self.redpanda.controller()) right after a restart, where the controller might be None, fail the assertion that the node argument is in the _started list.
- assert_reported_by_controller assumes that controller leadership is stable between its initial call to controller() and its subsequent assertion that only that initially reported node returns stats (or no nodes, if that initial call returned None).
Fix with strategically placed waits for controller leadership.
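The fix pattern can be sketched as a wait that requires the controller to be both present and stable (the same non-None answer on two consecutive polls). `get_controller` here is a hypothetical stand-in for the real controller() query:

```python
import time

def wait_for_stable_controller(get_controller, timeout_sec: float = 30.0,
                               backoff_sec: float = 0.2):
    """Wait until get_controller() returns the same non-None node on two
    consecutive polls, guarding both against a None controller right
    after restart and against a leadership change mid-assertion."""
    deadline = time.monotonic() + timeout_sec
    previous = None
    while time.monotonic() < deadline:
        current = get_controller()
        if current is not None and current == previous:
            return current
        previous = current
        time.sleep(backoff_sec)
    raise TimeoutError("controller leadership never stabilized")

# Example: leadership settles on 'node-2' after a couple of polls.
answers = iter([None, "node-2", "node-2"])
leader = wait_for_stable_controller(lambda: next(answers), timeout_sec=5.0)
print(leader)  # node-2
```

Two consecutive identical polls do not strictly prove stability, but they cheaply filter out the common transient states (no leader yet, election in flight) that were breaking the assertions.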
It's not obvious to me why this failure wasn't happening more often before, but since speeding up test startup/teardown, this test injects errors and then fails on log lines like:
ERROR 2022-08-17 09:31:05,974 [shard 0] rpc - Service handler threw an exception: std::runtime_error (FailureInjector: raft::raftgen::vote)
...which are perfectly expected, as the test is intentionally doing failure injection.
This was checking the errno of IOError (parent class) rather than the status code of the HTTP response.
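The distinction that bit here, sketched with the stdlib's urllib: an HTTP error is an OSError subclass and therefore carries an errno attribute, but the HTTP status lives in a separate field (`.code` for urllib's HTTPError), so comparing errno against a status code silently never matches:

```python
from urllib.error import HTTPError

# HTTPError subclasses OSError, so it has .errno, but for HTTP failures
# that attribute is not the HTTP status; the status lives in .code.
err = HTTPError(url="http://example.invalid/x", code=404,
                msg="Not Found", hdrs=None, fp=None)

wrong = (err.errno == 404)   # compares the OSError errno: never matches
right = (err.code == 404)    # compares the HTTP status: matches
print(wrong, right)  # False True
```

The same trap exists with other HTTP clients where the exception type inherits from IOError/OSError: the fix is always to inspect the response's status field rather than the inherited errno.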
This could fail on the assertion that there were cache files open after the consumer finished, i.e. the consumer managed to do all its reads without hitting the SI cache. This can happen a couple of ways:
- the random reader happens to hit only the latest (un-evicted) segments on local disk
- all reads hit the batch cache
- retention enforcement happens to run slowly, so local segments remain available for all the reader's reads
Fix by not stopping the consumer until we see some SI cache files open.
To clarify for future generations why there is a spurious-looking `list()` around executor.map calls.
Added comments -- yes, this is something that was a bit of a trap for the unwary without a comment. The latest force-push is just those two refactors from Denis's review.
The CI failure is a resurgence of the super-rare issue #3032 -- I don't think there's much signal that the failure is from this PR: it happened just once in ~20 CI runs on this PR, and it also happened very rarely on the baseline code.
Cover letter
This change should be functionally equivalent but provide a small per-test speedup.
Backport Required
UX changes
None
Release notes