
tests: start using RedpandaInstaller in more tests #5282

Merged: 9 commits, Jul 27, 2022

Conversation

andrwng (Contributor) commented Jun 30, 2022

Cover letter

This PR starts using the RedpandaInstaller to extend some existing tests with upgrades, and adds a means to perform rolling restarts. Along the way, it fixes some test infra issues found during testing.

Also fixes #5378

Release notes

  • none

andrwng force-pushed the upgrade-tests branch 2 times, most recently from fca5b43 to 2b1e20b on July 7, 2022.
andrwng force-pushed the upgrade-tests branch 4 times, most recently from 78dbf10 to d9625f5 on July 8, 2022.
jcsp (Contributor) commented Jul 11, 2022

Slack thread about cloudsmith download caps: https://redpandadata.slack.com/archives/C01ND4SVB6Z/p1657544219673169?thread_ts=1657541425.043679&cid=C01ND4SVB6Z

I think we need to make RedpandaInstaller smarter about re-using downloads before using it in more tests:

  • It should download once for N nodes rather than to each node individually
  • It should keep packages around between tests, assuming there will be tests that want to install similar versions to one another.
  • Maybe we should also skip tests that use RedpandaInstaller when running in debug mode, since these presumably won't actually be downloading debug binaries; they end up being redundant runs of the release mode code.

andrwng (Contributor, Author) commented Jul 11, 2022

It should download once for N nodes rather than to each node individually

I suppose each install() call could download to the ducktape container and rsync/scp the binaries into each node. I'll have to experiment a little to see how this affects test runtimes (though at worst I think it'd only add some seconds to an upgrade test).
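A minimal sketch of that download-once-then-distribute idea (not code from this PR; the helper name, URL template, staging directory, and remote destination are hypothetical placeholders, not the RedpandaInstaller layout):

import os
import subprocess


def stage_and_distribute(version, nodes, url_template,
                         staging_dir="/tmp/redpanda-pkgs"):
    """Download a package once on the ducktape node, then copy it to each
    cluster node. All paths and names here are illustrative placeholders."""
    os.makedirs(staging_dir, exist_ok=True)
    local_path = os.path.join(staging_dir, f"redpanda-{version}.tar.gz")
    if not os.path.exists(local_path):
        # One download per version, shared by every node and every test run.
        subprocess.run(
            ["curl", "-fsSL", "-o", local_path,
             url_template.format(version=version)],
            check=True)
    for node in nodes:
        # Push the cached archive instead of downloading on each node.
        subprocess.run(["scp", local_path, f"{node}:/opt/redpanda-versions/"],
                       check=True)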

It should keep packages around between tests, assuming there will be tests that want to install similar versions to one another.

This seems reasonable. The clean_node methods only really need to worry about making sure each test starts out on the correct head version. There isn't too much harm in leaving around downloaded bits, I don't think. Mileage may vary depending on how tests get scheduled.

Maybe we should also skip tests that use RedpandaInstaller when running in debug mode, since these presumably won't actually be downloading debug binaries; they end up being redundant runs of the release mode code.

I had this thought too when first implementing the infra, though opted to still use debug mode because it presumably does still add coverage for the post-upgrade debug bits of the head version. I'd be okay with doing this short term, but I do think there is value to running in debug mode (whether it's worth 2x the downloads, maybe not, but we should probably address that by improving infra rather than not testing).

andrwng (Contributor, Author) commented Jul 19, 2022

Took the first couple of suggestions. Leaving the test around in debug mode since I think it's still valuable, though we can see how things go. With the improvements from #5459, this PR shouldn't introduce any new downloads that weren't already there.

NyaliaLui (Contributor) left a comment


Looking good to me. I only have nits for you.

If this PR is blocking something important, feel free to set up a 15-minute meeting with me so you can help me understand the other comments and we can get this approved more quickly :)

The method used when catching failures uses the wrong method name to get
the path for the original binary.

The interval was previously hardcoded at a low value that resulted in
some flakiness when running locally.

If the 'CI' environment variable is unset, we save the executable by
default. The comment around not saving the executable was a bit
confusing, since it expects this to only be the case when running on a
developer workstation. This isn't the case, e.g. if running `ducktape`
against a remote cluster without setting the `CI` environment variable.

This updates the comment to avoid that confusion.
andrwng (Contributor, Author) left a comment


Most recent push is just a rebase on dev

andrwng added the ci-repeat-10 label (repeat tests 10x concurrently to check for flakey tests; self-cancelling) on Jul 20, 2022.
vbotbuildovich removed the ci-repeat-10 label on Jul 20, 2022.
andrwng added the ci-repeat-5 label (repeat tests 5x concurrently to check for flakey tests; self-cancelling) on Jul 21, 2022.
vbotbuildovich removed the ci-repeat-5 label on Jul 21, 2022.
andrwng (Contributor, Author) commented Jul 22, 2022

I ran this with the repeat=5 option, and one test was flaky; that is addressed in the latest revision.

Other failures were #4772, #5473, #5510, #5531, #4848

andrwng (Contributor, Author) commented Jul 22, 2022

Test failures were #5471 and #4772

andrwng requested a review from NyaliaLui on July 22, 2022.
dotnwat (Member) left a comment


Looks good, just some minor feedback.

Regarding John's comment:

I think we need to make RedpandaInstaller smarter about re-using downloads before using it in more tests:

Is this something that is now done via the installer, so caching is already happening in this update, which adds more users of the RedpandaInstaller?

If not, what's the timeline for that, or has it become a non-issue?

def has_offsets_for_all_partitions(out):
    # NOTE: partitions may not be returned if their fields can't be
    # populated, e.g. during leadership changes.
    partitions = list(rpk.describe_topic(spec.name))
Member:

Nit: something to keep in mind: I don't think users of rpk.describe_topic should have to deal with this. A better solution would be to either make describe_topic more robust, signal to the caller that the results may be incomplete, or follow some other reasonable protocol.

andrwng (Contributor, Author):

Yeah, overall I agree. Going to punt on this for now because I don't want it to block these tests.
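As a hedged illustration of the "signal to the caller" option the reviewer mentions above (not code from this PR), a small wrapper could report whether the describe output looks complete and let callers decide how to react; the wrapper name and the expected_partitions parameter are hypothetical:

def describe_topic_checked(rpk, topic, expected_partitions):
    """Return (partitions, complete) so callers can see partial results.

    Assumes rpk.describe_topic(topic) yields one entry per partition, as in
    the snippet above; everything else here is illustrative.
    """
    partitions = list(rpk.describe_topic(topic))
    complete = len(partitions) == expected_partitions
    return partitions, complete

A predicate like has_offsets_for_all_partitions could then retry only while complete is False, instead of guessing why entries are missing.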

andrwng (Contributor, Author) commented Jul 23, 2022

Is this something that is now done via the installer and therefore caching is now happening in this update which adds more users of the redpanda installer?

Caching is happening now with #5459 in place.

  • With local ducktape, every test will install to and use binaries from a single shared bind mount. We are still pulling from Cloudsmith, but the inclusion of that PR means we're downloading each binary once per run of all tests instead of once per node per test case.
  • With clustered ducktape (and local ducktape for that matter), tests will leave binaries alone when tearing down and starting up a cluster, to encourage tests to reuse binaries.

@@ -1382,8 +1382,17 @@ def restart_nodes(self,
                       nodes,
                       override_cfg_params=None,
                       start_timeout=None,
-                      stop_timeout=None):
+                      stop_timeout=None,
+                      rolling_restart_nodes=False):
Member:

The rolling_restart parameter was fine. I was thinking that we'd add a new method and have restart_nodes(self) and restart_nodes_rolling(self). Delegating immediately to some other code based on the parameter makes it seem like it already is a separate method. Wdyt?

andrwng (Contributor, Author) commented Jul 24, 2022

Yeah, I guess there is enough of a disparity between a hard restart and a rolling restart that it warrants its own method. I originally thought it might be useful to keep them together, e.g. for the sake of parameterizing tests or some such, but I think the more likely future parameterizations would be on arguments to a rolling restart (e.g. the number of nodes to restart at a time), rather than to a generic restart method.

dotnwat previously approved these changes Jul 24, 2022.
dotnwat (Member) left a comment


LGTM once CI is passing.

andrwng (Contributor, Author) commented Jul 25, 2022

Hm, CI failed on ControllerUpgradeTest.test_updating_cluster_when_executing_operations in a way similar to #3811

ERROR 2022-07-24 19:11:32,110 [shard 1] cluster - controller_backend.cc:521 - exception while executing partition operation: {type: deletion, ntp: {kafka/fuzzy-operator-2712-fkvGFw/3}, offset: 131, new_assignment: { id: 3, group_id: 29, replicas: {{node_id: 4, shard: 1}} }, previous_assignment: {nullopt}} - std::__1::__fs::filesystem::filesystem_error (error system:39, filesystem error: remove failed: Directory not empty [/var/lib/redpanda/data/kafka/fuzzy-operator-2712-fkvGFw/3_107])

The fix for that should have been in 22.1.1, and this test currently upgrades from 22.1.4, so I'm not sure what's happening. The log message is also in our list of allowed logs for chaos tests, but I don't think we're doing anything unclean in this test. Digging in.

We previously relied on logical versions to partially simulate different
Redpanda versions. This commit uses the newly added RedpandaInstaller to
allow performing a more organic upgrade.

This also makes the test wait for a bit in between restarts to ensure we actually run through ops during the upgrade.

These will be useful for defining the initial state of a cluster.

This commit adds a new rolling_restart_nodes() method in RedpandaService
to orchestrate bits of a rolling restart. The actual bits to exercise
the rolling restart are placed in a separate file so we avoid ballooning
redpanda.py as our rolling restart capabilities and test requirements
expand.

It also adds a basic test to test this with a workload running in the
background.
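A minimal sketch of what such a rolling-restart orchestration can look like, assuming hypothetical stop_node/start_node/node_is_healthy helpers rather than the actual RedpandaService interface:

import time


def rolling_restart_nodes(service, nodes, readiness_timeout_sec=60):
    """Restart nodes one at a time, waiting for each to come back healthy.

    service.stop_node, service.start_node, and service.node_is_healthy are
    assumed helpers for this sketch; the real RedpandaService API may differ.
    """
    for node in nodes:
        service.stop_node(node)
        service.start_node(node)
        deadline = time.time() + readiness_timeout_sec
        # Wait for the restarted node to rejoin before touching the next one,
        # so the cluster keeps serving the background workload throughout.
        while not service.node_is_healthy(node):
            if time.time() > deadline:
                raise TimeoutError(f"{node} did not become healthy in time")
            time.sleep(1)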
This adds a basic test that rolls back a Redpanda cluster after a
partial upgrade.

Given the test moves replicas from node to node, it seems like a good
candidate to test compatibility between Redpanda versions.

The test currently relies on selecting random nodes for the sake of
moving replicas around, so I wasn't very intentional in deciding what
nodes to upgrade -- this commit just selects the first couple in index
order.

Occasionally we would hit errors like the following:

RunnerClient: rptest.tests.partition_movement_test.PartitionMovementTest.test_bootstrapping_after_move.num_to_upgrade=2: Summary: KeyError(1)
Traceback (most recent call last):
  File "/usr/local/lib/python3.9/dist-packages/ducktape/tests/runner_client.py", line 135, in run
    data = self.run_test()
  File "/usr/local/lib/python3.9/dist-packages/ducktape/tests/runner_client.py", line 227, in run_test
    return self.test_context.function(self.test)
  File "/usr/local/lib/python3.9/dist-packages/ducktape/mark/_mark.py", line 476, in wrapper
    return functools.partial(f, *args, **kwargs)(*w_args, **w_kwargs)
  File "/root/tests/rptest/services/cluster.py", line 35, in wrapped
    r = f(self, *args, **kwargs)
  File "/root/tests/rptest/tests/partition_movement_test.py", line 410, in test_bootstrapping_after_move
    wait_until(offsets_are_recovered, 30, 2)
  File "/usr/local/lib/python3.9/dist-packages/ducktape/utils/util.py", line 53, in wait_until
    raise e
  File "/usr/local/lib/python3.9/dist-packages/ducktape/utils/util.py", line 44, in wait_until
    if condition():
  File "/root/tests/rptest/tests/partition_movement_test.py", line 405, in offsets_are_recovered
    return all([
  File "/root/tests/rptest/tests/partition_movement_test.py", line 406, in <listcomp>
    offset_map[p.id] == p.high_watermark
KeyError: 1

It's possible that the command run by `describe_topic()` results in
lines that don't match the partition line regex. The regex expects
numeric values for each field, which isn't always the case e.g. if
there's a leadership change.

This commit adjusts the test to ensure there are three partitions in the
returned output before proceeding.
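A hedged sketch of that guard (not the exact test code), assuming ducktape's wait_until and an rpk client like the one in the snippet earlier in the thread; the helper name and the three-partition count are taken from the commit message above:

from ducktape.utils.util import wait_until


def wait_for_all_partitions_reported(rpk, topic, expected_partitions=3,
                                     timeout_sec=30):
    # Retry until describe_topic reports every partition; during leadership
    # changes some partition lines may be missing from the output entirely.
    def all_reported():
        return len(list(rpk.describe_topic(topic))) == expected_partitions

    wait_until(all_reported,
               timeout_sec=timeout_sec,
               backoff_sec=2,
               err_msg=f"never saw {expected_partitions} partitions for {topic}")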
andrwng (Contributor, Author) commented Jul 25, 2022

Added a workaround for #5629, though it points to a real bug that probably existed in the test before (this PR just makes the test run for longer).

Diff https://github.com/redpanda-data/redpanda/compare/29b8f35163da1166287f5f65ec48d33b3554ebfe..57b91efd3c5b078538dfedd04519e5555da6edbc

andrwng requested a review from dotnwat on July 27, 2022.
NyaliaLui (Contributor) left a comment


Consider bringing up #5629 in standup or somewhere. My reasoning: if #5629 is a real bug (which it seems to be), we should have a test that automatically checks for that instead of continuing to add to the log allow list.

andrwng (Contributor, Author) commented Jul 27, 2022

Consider bringing up #5629 in standup or somewhere. My reasoning: if #5629 is a real bug (which it seems to be), we should have a test that automatically checks for that instead of continuing to add to the log allow list.

Yeah, I'll bring this up more broadly. Opted to work around it for now since I don't want to block testing on bug fixes that aren't directly related to the upcoming release.

Thanks for the review!

andrwng merged commit 0b51519 into redpanda-data:dev on Jul 27, 2022.
andrwng deleted the upgrade-tests branch on July 27, 2022.
Successfully merging this pull request may close these issues:
Failure in ControllerUpgradeTest test_updating_cluster_when_executing_operations