tests: start using RedpandaInstaller in more tests #5282
Slack thread about cloudsmith download caps: https://redpandadata.slack.com/archives/C01ND4SVB6Z/p1657544219673169?thread_ts=1657541425.043679&cid=C01ND4SVB6Z I think we need to make RedpandaInstaller smarter about re-using downloads before using it in more tests:
I suppose each
This seems reasonable. The
I had this thought too when first implementing the infra, though opted to still use debug mode because it presumably does still add coverage for the post-upgrade debug bits of the head version. I'd be okay with doing this short term, but I do think there is value to running in debug mode (whether it's worth 2x the downloads, maybe not, but we should probably address that by improving infra rather than not testing).
Took the first couple of suggestions. Leaving the test around in debug mode since I think it's still valuable, though we can see how things go. With the improvements from #5459, this PR shouldn't introduce any new downloads that weren't already there.
Looking good to me. I only have nits for you.
If this PR is blocking something important, feel free to set up a 15-minute meeting with me so you can help me understand the other comments and we can get this approved quicker :)
The method used when catching failures uses the wrong method name to get the path for the original binary.
The interval was previously hardcoded at a low value that resulted in some flakiness when running locally.
If the 'CI' environment variable is unset, we save the executable by default. The comment around not saving the executable was a bit confusing, since it expects this to only be the case when running on a developer workstation. This isn't the case, e.g. if running `ducktape` against a remote cluster without setting the `CI` environment variable. This updates the comment to avoid that confusion.
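For illustration only, the default described in this commit message can be sketched as below. The function name `should_save_executable` is hypothetical, and the sketch assumes that only the complete absence of `CI` counts as "unset"; the real check in the test infra may differ.

```python
import os


def should_save_executable(environ=None):
    """Return True when the executable should be saved.

    Assumption: the executable is saved whenever 'CI' is absent from the
    environment, regardless of where the tests are running (a developer
    workstation, a remote cluster, etc.).
    """
    env = os.environ if environ is None else environ
    return "CI" not in env
```

Passing the environment in as a mapping keeps the decision easy to check without mutating `os.environ`.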
Most recent push is just a rebase on dev
looks good. just some minor feedback.
Regarding John's comment:
I think we need to make RedpandaInstaller smarter about re-using downloads before using it in more tests:
Is this something that is now done via the installer and therefore caching is now happening in this update which adds more users of the redpanda installer?
If not, what's the timeline for that, or has it become a non-issue?
def has_offsets_for_all_partitions(out):
    # NOTE: partitions may not be returned if their fields can't be
    # populated, e.g. during leadership changes.
    partitions = list(rpk.describe_topic(spec.name))
nit: something to keep in mind: i don't think users of `rpk.describe_topic` should have to deal with this. a better solution would be to either make `describe_topic` more robust, or signal to the caller that the results may be incomplete, or some other reasonable protocol.
Yeah, overall I agree, gonna punt on this for now because I don't want it to block these tests
Caching is happening now with #5459 in place.
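As a rough illustration of what re-using downloads can look like, here is a minimal sketch. It does not describe what #5459 actually does: the function name `fetch_cached`, the `download` callback, and the `redpanda-<version>.tar.gz` file name are all hypothetical.

```python
import os
import tempfile


def fetch_cached(version, cache_dir, download):
    """Return the cached package path for `version`.

    `download(version, path)` is a hypothetical callback that writes the
    package to `path`; it is only invoked on a cache miss.
    """
    os.makedirs(cache_dir, exist_ok=True)
    dest = os.path.join(cache_dir, f"redpanda-{version}.tar.gz")
    if os.path.exists(dest):
        return dest  # cache hit: no new download

    # Download to a temporary file first and rename it into place, so a
    # partial download never looks like a valid cache entry.
    fd, tmp = tempfile.mkstemp(dir=cache_dir)
    os.close(fd)
    try:
        download(version, tmp)
        os.replace(tmp, dest)
    finally:
        if os.path.exists(tmp):
            os.remove(tmp)
    return dest
```

With something along these lines, several tests that install the same version would share one download instead of each counting against the download cap.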
tests/rptest/services/redpanda.py
@@ -1382,8 +1382,17 @@ def restart_nodes(self,
                       nodes,
                       override_cfg_params=None,
                       start_timeout=None,
-                      stop_timeout=None):
+                      stop_timeout=None,
+                      rolling_restart_nodes=False):
`rolling_restart` parameter was fine. i was thinking that we'd add a new method and have `restart_nodes(self)` and `restart_nodes_rolling(self)`. delegating immediately to some other code based on the parameter makes it seem like it already is a separate method. wdyt?
Yeah, I guess there is enough of a disparity between a hard restart and a rolling restart that it warrants its own method. I originally thought it might be useful to keep them together, e.g. for the sake of parameterizing tests or somesuch, but I think the more likely future parameterizations would be on arguments to a rolling restart (e.g. number to restart at a time, etc.), rather than to a generic restart method.
lgtm once CI is passing
Hm, CI failed on
The fix for that should have been in 22.1.1, and this test currently upgrades from 22.1.4, so I'm not sure what's happening. The log message is also in our list of allowed logs for chaos tests, but I don't think we're doing anything unclean in this test. Digging in.
We previously relied on logical versions to partially simulate different Redpanda versions. This commit uses the newly added RedpandaInstaller to allow performing a more organic upgrade. This also makes the test wait for a bit in between restarts to ensure we actually run through ops during the upgrade.
These will be useful to define the initial starting state of a cluster.
This commit adds a new rolling_restart_nodes() method in RedpandaService to orchestrate bits of a rolling restart. The actual bits to exercise the rolling restart are placed in a separate file so we avoid ballooning redpanda.py as our rolling restart capabilities and test requirements expand. It also adds a basic test to test this with a workload running in the background.
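The general shape of a rolling restart can be sketched as follows. This is not the code added in this commit: `rolling_restart`, `restart_node`, and `is_healthy` are hypothetical names, and the health check stands in for whatever readiness condition the real implementation waits on.

```python
import time


def rolling_restart(nodes, restart_node, is_healthy,
                    timeout_sec=30.0, backoff_sec=0.5):
    """Restart nodes one at a time.

    Each node must report healthy before the next one is restarted, so at
    most one node is down at any point. Returns the nodes in restart order.
    """
    restarted = []
    for node in nodes:
        restart_node(node)
        deadline = time.monotonic() + timeout_sec
        while not is_healthy(node):
            if time.monotonic() >= deadline:
                raise TimeoutError(
                    f"node {node!r} not healthy after {timeout_sec}s")
            time.sleep(backoff_sec)
        restarted.append(node)
    return restarted
```

Keeping this as its own function, rather than a flag on a generic restart, leaves room for rolling-specific parameters such as how many nodes to restart at a time.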
This adds a basic test that rolls back a Redpanda cluster after a partial upgrade.
Given the test moves replicas from node to node, it seems like a good candidate to test compatibility between Redpanda versions. The test currently relies on selecting random nodes for the sake of moving replicas around, so I wasn't very intentional in deciding what nodes to upgrade -- this commit just selects the first couple in index order.
Occasionally we would hit errors like the following:

RunnerClient: rptest.tests.partition_movement_test.PartitionMovementTest.test_bootstrapping_after_move.num_to_upgrade=2: Summary: KeyError(1)
Traceback (most recent call last):
  File "/usr/local/lib/python3.9/dist-packages/ducktape/tests/runner_client.py", line 135, in run
    data = self.run_test()
  File "/usr/local/lib/python3.9/dist-packages/ducktape/tests/runner_client.py", line 227, in run_test
    return self.test_context.function(self.test)
  File "/usr/local/lib/python3.9/dist-packages/ducktape/mark/_mark.py", line 476, in wrapper
    return functools.partial(f, *args, **kwargs)(*w_args, **w_kwargs)
  File "/root/tests/rptest/services/cluster.py", line 35, in wrapped
    r = f(self, *args, **kwargs)
  File "/root/tests/rptest/tests/partition_movement_test.py", line 410, in test_bootstrapping_after_move
    wait_until(offsets_are_recovered, 30, 2)
  File "/usr/local/lib/python3.9/dist-packages/ducktape/utils/util.py", line 53, in wait_until
    raise e
  File "/usr/local/lib/python3.9/dist-packages/ducktape/utils/util.py", line 44, in wait_until
    if condition():
  File "/root/tests/rptest/tests/partition_movement_test.py", line 405, in offsets_are_recovered
    return all([
  File "/root/tests/rptest/tests/partition_movement_test.py", line 406, in <listcomp>
    offset_map[p.id] == p.high_watermark
KeyError: 1

It's possible that the command run by `describe_topic()` results in lines that don't match the partition line regex. The regex expects numeric values for each field, which isn't always the case, e.g. if there's a leadership change. This commit adjusts the test to ensure there are three partitions in the returned output before proceeding.
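To make the failure mode concrete, here is a small sketch of how a numeric-only regex can silently drop a partition, and how a count check guards against it. The line format, the `Partition` fields, and the function names are assumptions for illustration; they are not the actual `rpk` output or the parser in the test infra.

```python
import re
from collections import namedtuple

Partition = namedtuple("Partition", ["id", "leader", "high_watermark"])

# Hypothetical line format: "<id> <leader> <high_watermark>", all numeric.
PARTITION_LINE = re.compile(r"^\s*(\d+)\s+(\d+)\s+(\d+)\s*$")


def parse_partitions(lines):
    """Parse partition lines, skipping any line the regex does not match."""
    parsed = []
    for line in lines:
        m = PARTITION_LINE.match(line)
        if m is None:
            # e.g. a non-numeric leader field during a leadership change
            continue
        parsed.append(Partition(*(int(g) for g in m.groups())))
    return parsed


def has_all_partitions(lines, expected_count):
    """Guard: only trust the output once every partition was parsed."""
    return len(parse_partitions(lines)) == expected_count
```

A caller would poll `has_all_partitions(...)` until it returns true before indexing into a per-partition map, which avoids a `KeyError` for a partition that was dropped during parsing.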
Added a workaround for #5629, though it points to a real bug that probably existed in the test before (this PR just makes the test run for longer).
Yeah, I'll bring this up more broadly. Opted to work around it for now since I don't want to block testing on bug fixes that aren't directly related to the upcoming release. Thanks for the review!
Cover letter
This PR starts using the `RedpandaInstaller` to extend some existing tests with upgrades, also adding a means to perform rolling restarts. Along the way, I fixed some test infra issues found in testing.

Also fixes #5378
Release notes