
test: enable parallel redpanda startup by default #5972

Merged: 11 commits, Aug 20, 2022

Conversation

@jcsp (Contributor) commented Aug 11, 2022

Cover letter

This change should be functionally equivalent but provide a small per-test speedup.

Backport Required

  • not a bug fix
  • papercut/not impactful enough to backport
  • v22.2.x
  • v22.1.x
  • v21.11.x

UX changes

None

Release notes

  • none

@jcsp (Contributor, Author) commented Aug 11, 2022

Let's see what the total runtime delta is vs. tip of dev

Comment on lines 709 to 703
-    parallel: bool = False):
+    parallel: bool = True):
Member:

+1. At one point we observed that the basic start-up cost of an empty test was about 10 seconds. Multiply that by ~500 tests and every little thing helps!
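
For context, a hedged sketch (not the PR's actual code) of what the parallel default buys: starting all nodes concurrently instead of one after another. The start_node stand-in below is illustrative only, not RedpandaService's real method.

    import concurrent.futures
    import time

    def start_node(node):
        # Stand-in for the real per-node startup work (assumed, not the service's method)
        time.sleep(1)
        print(f"started {node}")

    def start(nodes, parallel: bool = True):
        if parallel:
            with concurrent.futures.ThreadPoolExecutor(max_workers=len(nodes)) as executor:
                # Materialise the results so any worker exception propagates
                list(executor.map(start_node, nodes))
        else:
            # Sequential fallback, matching the old default behaviour
            for node in nodes:
                start_node(node)

    start(["node-1", "node-2", "node-3"])  # takes ~1s in parallel vs ~3s sequentially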

@jcsp (Contributor, Author) commented Aug 15, 2022

The parallelism on its own didn't make a lot of difference for most of the 3-node tests. To get the benefit we also need to tune the wait loops and stop nodes in parallel: the latest version of this does that.

This will all also benefit from #6003 once it's ready.

@jcsp (Contributor, Author) commented Aug 15, 2022

That last run exposed a couple of pre-existing bugs:

  • rpk not retrying properly on 503s in maintenance_test -- this became much more likely now that restarts are faster
  • a KafkaCliConsumer wait() bug, where consumers could still be running after the end of a test; this was much more likely to cause a problem now that stopping redpanda is much faster

@jcsp (Contributor, Author) commented Aug 17, 2022

/ci-repeat 5

@jcsp marked this pull request as ready for review on August 17, 2022, 07:44
@jcsp (Contributor, Author) commented Aug 17, 2022

This just had a clean run through, with release build runtime 1hr 50m, so we're saving about 15 minutes.

That's nice, although the main benefit is for developers running subsets of tests locally: some test suites with many small tests are about twice as fast now.

@jcsp (Contributor, Author) commented Aug 17, 2022

/ci-repeat 5

@jcsp (Contributor, Author) commented Aug 18, 2022

Pretty stable in the last CI repeat.

@jcsp (Contributor, Author) commented Aug 18, 2022

ShadowIndexingCacheSpaceLeakTest.test_si_cache looked like a pre-existing test bug; I've added a commit to fix it.

The last CI run had only one failure.

One more repeat run: we might as well let this shake out any remaining failures while it's still in PR.

@jcsp (Contributor, Author) commented Aug 18, 2022

/ci-repeat 5

@jcsp (Contributor, Author) commented Aug 19, 2022

5x runs with one test failure (#4702)

This is good to go.

with concurrent.futures.ThreadPoolExecutor(
        max_workers=len(nodes)) as executor:
    list(
        executor.map(lambda n: self.stop_node(n, timeout=stop_timeout),
Member:
What will happen if an executor thread throws? I am wondering if we would lose the timeout exception in this case.

Contributor (Author):

From the map() docs:

> If a func call raises an exception, then that exception will be raised when its value is retrieved from the iterator.

The call to list() is the trick here: it pumps the iterator and makes sure that if any of the calls raised an exception, it is re-raised.
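
A self-contained toy example (assumed names, not the PR's code) showing that behaviour:

    import concurrent.futures

    def stop_node(n):
        if n == 2:
            raise RuntimeError(f"node {n} did not stop in time")
        return n

    with concurrent.futures.ThreadPoolExecutor(max_workers=3) as executor:
        results = executor.map(stop_node, [1, 2, 3])  # lazy: nothing is raised yet
        try:
            list(results)  # consuming the iterator re-raises the worker's exception
        except RuntimeError as e:
            print(f"caught: {e}")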

Member:

perfect

Contributor:

It's worth adding a comment: without one, list(x) looks like an artifact of refactoring and almost asks to be reduced to x.

tests/rptest/services/redpanda.py: outdated review comment (resolved)
jcsp added 11 commits on August 19, 2022 at 17:26:
This shaves a few seconds off the startup of every
test that uses multiple nodes.
- on initial start, use a smaller retry interval
  while waiting for nodes to come up, to avoid
  spurious full-second waits (see the sketch after the commit list)
- when stopping and restarting, do all nodes in parallel
It is not necessary to wait for the mtls feature,
as it is no longer used to gate enabling the functionality
in 22.2.

The sleep(5) hack for waiting for ACLs is replaced
by writing a phony user after the ACLs, and then waiting
for that user to be visible everywhere, as evidence that
the controller log has proceeded past the ACLs
(see the sketch after the commit list).

This is quicker than waiting for startup to time out.
This was previously waiting for a timeout
on normal redpanda startup.

That's unnecessary: when redpanda fails to start up,
it does so very quickly; we can just wait for the
PIDs to go away.
Two issues:
- Calls to get_cluster_metrics(self.redpanda.controller()) right
  after a restart, where the controller might be None, fail
  the assertion that the node argument is in the _started list
- assert_reported_by_controller assumes that controller leadership
  is stable between its initial call to controller() and its subsequent
  assertion that only that initially reported node returns stats (or
  no nodes if that initial call returned None).

Fix with strategically placed waits for controller leadership
(see the sketch after the commit list).
It's not obvious to me why this failure wasn't happening more often
before, but since speeding up test startup/teardown, this test (which
injects errors) has been failing on log lines like:

ERROR 2022-08-17 09:31:05,974 [shard 0] rpc - Service handler threw an exception: std::runtime_error (FailureInjector: raft::raftgen::vote)

...which are perfectly expected, as the test is intentionally
doing failure injection.
This was checking the errno of IOError (the parent
class) rather than the status code of the HTTP response
(see the sketch after the commit list).
This could fail on the assertion that there
were cache files open after the consumer finished, i.e.
the consumer managed to do all its reads without
hitting the SI cache.

This can happen in a couple of ways:
- the random reader happens to hit only the latest
  (un-evicted) segments on local disk
- all reads hit the batch cache
- retention enforcement happens to run slowly,
  so that local segments remain available
  for all of the reader's reads.

Fix by not stopping the consumer until
we see some SI cache files open.
To clarify for future generations why there
is a spurious-looking `list()` around executor.map
calls.
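
For the "smaller retry interval" commit above, a hedged sketch of the kind of wait loop it describes, assuming ducktape's wait_until helper (with its timeout_sec/backoff_sec parameters); is_node_up is a hypothetical liveness check, not a real RedpandaService method.

    from ducktape.utils.util import wait_until

    def wait_for_node_up(is_node_up, node, timeout_sec=30):
        # Poll with a sub-second backoff so a node that comes up mid-interval
        # does not cost a spurious full-second wait on every test startup.
        wait_until(lambda: is_node_up(node),
                   timeout_sec=timeout_sec,
                   backoff_sec=0.25,
                   err_msg=f"{node} did not come up within {timeout_sec}s")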
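
For the "phony user" ACL commit, a hedged sketch of the barrier idea; the admin helpers (create_user, list_users) and their signatures are assumptions about the test suite's admin wrapper, not verified APIs.

    from ducktape.utils.util import wait_until

    BARRIER_USER = "acl_barrier_user"  # hypothetical marker name

    def wait_for_acls_everywhere(admin, nodes, timeout_sec=30):
        # Write a throwaway user *after* the ACLs; once every node can see it,
        # the controller log must have replayed past the ACL commands too.
        # create_user/list_users are assumed helper names, not verified signatures.
        admin.create_user(BARRIER_USER, "dummy-password", "SCRAM-SHA-256")
        wait_until(
            lambda: all(BARRIER_USER in admin.list_users(node=n) for n in nodes),
            timeout_sec=timeout_sec,
            backoff_sec=0.5,
            err_msg="barrier user not yet visible on all nodes")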
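
For the cluster-metrics fix, a hedged sketch of a "wait for controller leadership" guard; redpanda.controller() is the accessor named in the commit message, assumed to return None until a leader is elected.

    from ducktape.utils.util import wait_until

    def wait_for_controller(redpanda, timeout_sec=30):
        # Block until some node is reported as controller leader before calling
        # endpoints that need one, e.g. get_cluster_metrics().
        wait_until(lambda: redpanda.controller() is not None,
                   timeout_sec=timeout_sec,
                   backoff_sec=1,
                   err_msg="no controller leader elected")
        return redpanda.controller()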
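
For the "errno vs. HTTP status" commit, a small illustration of the distinction using the requests library; the 503 value echoes the maintenance_test comment earlier, and the helper is a sketch rather than the PR's code.

    import requests

    def is_retryable(err: requests.exceptions.HTTPError) -> bool:
        # Wrong check: err.errno (inherited via IOError/OSError) is typically
        # None for HTTP errors and is never the HTTP status code.
        # Right check: inspect the response object attached to the exception.
        return err.response is not None and err.response.status_code == 503
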
@jcsp (Contributor, Author) commented Aug 19, 2022

> It's worth adding a comment: without one, list(x) looks like an artifact of refactoring and almost asks to be reduced to x.

Added comments -- yes, without one this was a bit of a trap for the unwary.

The latest force-push is just those two refactors from Denis's review.

@mmaslankaprv self-requested a review on August 19, 2022, 16:36
@jcsp (Contributor, Author) commented Aug 20, 2022

The CI failure is a resurgence of the super-rare issue #3032 -- I don't think there's much signal that the failure comes from this PR: it happened just once in ~20 CI runs here, and it also occurred, very rarely, on the baseline code.

Labels: area/tests, kind/bug, kind/enhance