
test: enable parallel redpanda startup by default #5972

Merged: 11 commits, Aug 20, 2022

Conversation

@jcsp (Contributor) commented Aug 11, 2022

Cover letter

This change should be functionally equivalent but provide a small per-test speedup.

Backport Required

  • not a bug fix
  • papercut/not impactful enough to backport
  • v22.2.x
  • v22.1.x
  • v21.11.x

UX changes

None

Release notes

  • none

@jcsp (Contributor, Author) commented Aug 11, 2022

Let's see what the total runtime delta is vs. tip of dev

Comment on lines 709 to 703
-    parallel: bool = False):
+    parallel: bool = True):
Member:

+1. At one point we observed that the basic start-up cost of an empty test was about 10 seconds. Multiply that by ~500 tests and every little thing helps!
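
For context, a hedged sketch (not the PR's actual code) of what the parallel default buys: starting all nodes concurrently instead of one after another. The start_node stand-in below is illustrative only, not RedpandaService's real method.

    import concurrent.futures
    import time

    def start_node(node):
        # Stand-in for the real per-node startup work (assumed, not the service's method)
        time.sleep(1)
        print(f"started {node}")

    def start(nodes, parallel: bool = True):
        if parallel:
            with concurrent.futures.ThreadPoolExecutor(max_workers=len(nodes)) as executor:
                # Materialise the results so any worker exception propagates
                list(executor.map(start_node, nodes))
        else:
            # Sequential fallback, matching the old default behaviour
            for node in nodes:
                start_node(node)

    start(["node-1", "node-2", "node-3"])  # takes ~1s in parallel vs ~3s sequentially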

@jcsp (Contributor, Author) commented Aug 15, 2022

The parallelism on its own didn't make a lot of difference for most of the 3-node tests. To get the benefit we also need to tune the wait loops and stop nodes in parallel: the latest version of this does that.

This will all also benefit from #6003 once it's ready.

@jcsp (Contributor, Author) commented Aug 15, 2022

That last run exposed a couple of pre-existing bugs:

  • rpk not retrying properly on 503s in maintenance_test -- this became much more likely now that restarts are faster
  • a KafkaCliConsumer wait() bug, where consumers could still be running after the end of a test; this was much more likely to cause a problem now that stopping redpanda is much faster

@jcsp (Contributor, Author) commented Aug 17, 2022

/ci-repeat 5

@jcsp marked this pull request as ready for review on August 17, 2022, 07:44
@jcsp (Contributor, Author) commented Aug 17, 2022

This just had a clean run through, with release build runtime 1hr 50m, so we're saving about 15 minutes.

That's nice, although the main benefit is for developers running subsets of tests locally: some test suites with many small tests are about twice as fast now.

@jcsp (Contributor, Author) commented Aug 17, 2022

/ci-repeat 5

@jcsp (Contributor, Author) commented Aug 18, 2022

Pretty stable in the last CI repeat.

@jcsp (Contributor, Author) commented Aug 18, 2022

ShadowIndexingCacheSpaceLeakTest.test_si_cache looked like a pre-existing test bug; I've added a commit to fix it.

The last CI run had only one failure.

One more repeat run: we might as well let this shake out any remaining failures while it's still in PR.

@jcsp (Contributor, Author) commented Aug 18, 2022

/ci-repeat 5

@jcsp (Contributor, Author) commented Aug 19, 2022

5x runs with one test failure (#4702)

This is good to go.

with concurrent.futures.ThreadPoolExecutor(
        max_workers=len(nodes)) as executor:
    list(
        executor.map(lambda n: self.stop_node(n, timeout=stop_timeout),
Member:
What will happen if an executor thread throws? I am wondering if we would lose the timeout exception in this case.

Contributor (Author):

From the map() docs:

> If a func call raises an exception, then that exception will be raised when its value is retrieved from the iterator.

The call to list() is the trick here: it pumps the iterator and makes sure that if any of the calls raised an exception, it is re-raised.
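
A self-contained toy example (assumed names, not the PR's code) showing that behaviour:

    import concurrent.futures

    def stop_node(n):
        if n == 2:
            raise RuntimeError(f"node {n} did not stop in time")
        return n

    with concurrent.futures.ThreadPoolExecutor(max_workers=3) as executor:
        results = executor.map(stop_node, [1, 2, 3])  # lazy: nothing is raised yet
        try:
            list(results)  # consuming the iterator re-raises the worker's exception
        except RuntimeError as e:
            print(f"caught: {e}")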

Member:

perfect

Contributor:

It's worth adding a comment: without one, list(x) looks like an artifact of refactoring and almost asks to be reduced to x.

tests/rptest/services/redpanda.py: outdated review comment (resolved)
jcsp added 11 commits on August 19, 2022 at 17:26:
This shaves a few seconds off the startup of every
test that uses multiple nodes.
- on initial start, use a smaller retry interval
  while waiting for nodes to come up, to avoid
  spurious full-second waits (see the sketch after the commit list)
- when stopping and restarting, do all nodes in parallel
It is not necessary to wait for the mtls feature,
as it is no longer used to gate enabling the functionality
in 22.2.

The sleep(5) hack for waiting for ACLs is replaced
by writing a phony user after the ACLs, and then waiting
for that user to be visible everywhere, as evidence that
the controller log has proceeded past the ACLs
(see the sketch after the commit list).

This is quicker than waiting for startup to time out.
This was previously waiting for a timeout
on normal redpanda startup.

That's unnecessary: when redpanda fails to start up,
it does so very quickly; we can just wait for the
PIDs to go away.
Two issues:
- Calls to get_cluster_metrics(self.redpanda.controller()) right
  after a restart, where the controller might be None, fail
  the assertion that the node argument is in the _started list
- assert_reported_by_controller assumes that controller leadership
  is stable between its initial call to controller() and its subsequent
  assertion that only that initially reported node returns stats (or
  no nodes if that initial call returned None).

Fix with strategically placed waits for controller leadership
(see the sketch after the commit list).
It's not obvious to me why this failure wasn't happening more often
before, but since speeding up test startup/teardown, this test (which
injects errors) has been failing on log lines like:

ERROR 2022-08-17 09:31:05,974 [shard 0] rpc - Service handler threw an exception: std::runtime_error (FailureInjector: raft::raftgen::vote)

...which are perfectly expected, as the test is intentionally
doing failure injection.
This was checking the errno of IOError (the parent
class) rather than the status code of the HTTP response
(see the sketch after the commit list).
This could fail on the assertion that there
were cache files open after the consumer finished, i.e.
the consumer managed to do all its reads without
hitting the SI cache.

This can happen in a couple of ways:
- the random reader happens to hit only the latest
  (un-evicted) segments on local disk
- all reads hit the batch cache
- retention enforcement happens to run slowly,
  so that local segments remain available
  for all of the reader's reads.

Fix by not stopping the consumer until
we see some SI cache files open.
To clarify for future generations why there
is a spurious-looking `list()` around executor.map
calls.
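
For the "smaller retry interval" commit above, a hedged sketch of the kind of wait loop it describes, assuming ducktape's wait_until helper (with its timeout_sec/backoff_sec parameters); is_node_up is a hypothetical liveness check, not a real RedpandaService method.

    from ducktape.utils.util import wait_until

    def wait_for_node_up(is_node_up, node, timeout_sec=30):
        # Poll with a sub-second backoff so a node that comes up mid-interval
        # does not cost a spurious full-second wait on every test startup.
        wait_until(lambda: is_node_up(node),
                   timeout_sec=timeout_sec,
                   backoff_sec=0.25,
                   err_msg=f"{node} did not come up within {timeout_sec}s")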
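
For the "phony user" ACL commit, a hedged sketch of the barrier idea; the admin helpers (create_user, list_users) and their signatures are assumptions about the test suite's admin wrapper, not verified APIs.

    from ducktape.utils.util import wait_until

    BARRIER_USER = "acl_barrier_user"  # hypothetical marker name

    def wait_for_acls_everywhere(admin, nodes, timeout_sec=30):
        # Write a throwaway user *after* the ACLs; once every node can see it,
        # the controller log must have replayed past the ACL commands too.
        # create_user/list_users are assumed helper names, not verified signatures.
        admin.create_user(BARRIER_USER, "dummy-password", "SCRAM-SHA-256")
        wait_until(
            lambda: all(BARRIER_USER in admin.list_users(node=n) for n in nodes),
            timeout_sec=timeout_sec,
            backoff_sec=0.5,
            err_msg="barrier user not yet visible on all nodes")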
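
For the cluster-metrics fix, a hedged sketch of a "wait for controller leadership" guard; redpanda.controller() is the accessor named in the commit message, assumed to return None until a leader is elected.

    from ducktape.utils.util import wait_until

    def wait_for_controller(redpanda, timeout_sec=30):
        # Block until some node is reported as controller leader before calling
        # endpoints that need one, e.g. get_cluster_metrics().
        wait_until(lambda: redpanda.controller() is not None,
                   timeout_sec=timeout_sec,
                   backoff_sec=1,
                   err_msg="no controller leader elected")
        return redpanda.controller()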
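
For the "errno vs. HTTP status" commit, a small illustration of the distinction using the requests library; the 503 value echoes the maintenance_test comment earlier, and the helper is a sketch rather than the PR's code.

    import requests

    def is_retryable(err: requests.exceptions.HTTPError) -> bool:
        # Wrong check: err.errno (inherited via IOError/OSError) is typically
        # None for HTTP errors and is never the HTTP status code.
        # Right check: inspect the response object attached to the exception.
        return err.response is not None and err.response.status_code == 503
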
@jcsp (Contributor, Author) commented Aug 19, 2022

> It's worth adding a comment: without one, list(x) looks like an artifact of refactoring and almost asks to be reduced to x.

Added comments -- yes, without one this was a bit of a trap for the unwary.

The latest force-push is just those two refactors from Denis's review.

@mmaslankaprv self-requested a review on August 19, 2022, 16:36
@jcsp (Contributor, Author) commented Aug 20, 2022

The CI failure is a resurgence of the super-rare issue #3032 -- I don't think there's much signal that the failure comes from this PR: it happened just once in ~20 CI runs here, and it also occurred, very rarely, on the baseline code.

Labels: area/tests, kind/bug, kind/enhance