
tests: start using RedpandaInstaller in more tests #5282

Merged: 9 commits, Jul 27, 2022

Conversation

andrwng (Contributor) commented Jun 30, 2022

Cover letter

This PR starts using the RedpandaInstaller to extend some existing tests with upgrades, and adds a means to perform rolling restarts. Along the way, it fixes some test infra issues found during testing.

Also fixes #5378

Release notes

  • none

andrwng force-pushed the upgrade-tests branch 2 times, most recently from fca5b43 to 2b1e20b on July 7, 2022.
andrwng force-pushed the upgrade-tests branch 4 times, most recently from 78dbf10 to d9625f5 on July 8, 2022.
jcsp (Contributor) commented Jul 11, 2022

Slack thread about cloudsmith download caps: https://redpandadata.slack.com/archives/C01ND4SVB6Z/p1657544219673169?thread_ts=1657541425.043679&cid=C01ND4SVB6Z

I think we need to make RedpandaInstaller smarter about re-using downloads before using it in more tests:

  • It should download once for N nodes rather than to each node individually
  • It should keep packages around between tests, assuming there will be tests that want to install similar versions to one another.
  • Maybe we should also skip tests that use RedpandaInstaller when running in debug mode, since these presumably won't actually be downloading debug binaries; they end up being redundant runs of the release mode code.

andrwng (Contributor, Author) commented Jul 11, 2022

It should download once for N nodes rather than to each node individually

I suppose each install() call could download to the ducktape container and rsync/scp the binaries into each node. I'll have to experiment a little to see how this affects test runtimes (though at worst I think it'd only add some seconds to an upgrade test).
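A minimal sketch of that download-once-then-distribute idea (not code from this PR; the helper name, URL template, staging directory, and remote destination are hypothetical placeholders, not the RedpandaInstaller layout):

import os
import subprocess


def stage_and_distribute(version, nodes, url_template,
                         staging_dir="/tmp/redpanda-pkgs"):
    """Download a package once on the ducktape node, then copy it to each
    cluster node. All paths and names here are illustrative placeholders."""
    os.makedirs(staging_dir, exist_ok=True)
    local_path = os.path.join(staging_dir, f"redpanda-{version}.tar.gz")
    if not os.path.exists(local_path):
        # One download per version, shared by every node and every test run.
        subprocess.run(
            ["curl", "-fsSL", "-o", local_path,
             url_template.format(version=version)],
            check=True)
    for node in nodes:
        # Push the cached archive instead of downloading on each node.
        subprocess.run(["scp", local_path, f"{node}:/opt/redpanda-versions/"],
                       check=True)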

It should keep packages around between tests, assuming there will be tests that want to install similar versions to one another.

This seems reasonable. The clean_node methods only really need to worry about making sure each test starts out on the correct head version. There isn't too much harm in leaving around downloaded bits, I don't think. Mileage may vary depending on how tests get scheduled.

Maybe we should also skip tests that use RedpandaInstaller when running in debug mode, since these presumably won't actually be downloading debug binaries; they end up being redundant runs of the release mode code.

I had this thought too when first implementing the infra, though opted to still use debug mode because it presumably does still add coverage for the post-upgrade debug bits of the head version. I'd be okay with doing this short term, but I do think there is value to running in debug mode (whether it's worth 2x the downloads, maybe not, but we should probably address that by improving infra rather than not testing).

andrwng (Contributor, Author) commented Jul 19, 2022

Took the first couple of suggestions. Leaving the test around in debug mode since I think it's still valuable, though we can see how things go. With the improvements from #5459, this PR shouldn't introduce any new downloads that weren't already there.

NyaliaLui (Contributor) left a comment


Looking good to me. I only have nits for you.

If this PR is blocking something important, feel free to set up a 15-minute meeting with me so you can help me understand the other comments and we can get this approved more quickly :)

The method used when catching failures uses the wrong method name to get
the path for the original binary.

The interval was previously hardcoded at a low value that resulted in
some flakiness when running locally.

If the 'CI' environment variable is unset, we save the executable by
default. The comment around not saving the executable was a bit
confusing, since it expects this to only be the case when running on a
developer workstation. This isn't the case, e.g. if running `ducktape`
against a remote cluster without setting the `CI` environment variable.

This updates the comment to avoid that confusion.
andrwng (Contributor, Author) left a comment


Most recent push is just a rebase on dev

andrwng added the ci-repeat-10 label (repeat tests 10x concurrently to check for flakey tests; self-cancelling) on Jul 20, 2022.
vbotbuildovich removed the ci-repeat-10 label on Jul 20, 2022.
andrwng added the ci-repeat-5 label (repeat tests 5x concurrently to check for flakey tests; self-cancelling) on Jul 21, 2022.
vbotbuildovich removed the ci-repeat-5 label on Jul 21, 2022.
andrwng (Contributor, Author) commented Jul 22, 2022

I ran this with the repeat=5 option, and one test was flaky; that is addressed in the latest revision.

Other failures were #4772, #5473, #5510, #5531, #4848

andrwng (Contributor, Author) commented Jul 22, 2022

Test failures were #5471 and #4772

andrwng requested a review from NyaliaLui on July 22, 2022.
dotnwat (Member) left a comment


Looks good, just some minor feedback.

Regarding John's comment:

I think we need to make RedpandaInstaller smarter about re-using downloads before using it in more tests:

Is this something that is now done via the installer, so caching is already happening in this update, which adds more users of the RedpandaInstaller?

If not, what's the timeline for that, or has it become a non-issue?

def has_offsets_for_all_partitions(out):
    # NOTE: partitions may not be returned if their fields can't be
    # populated, e.g. during leadership changes.
    partitions = list(rpk.describe_topic(spec.name))
Member:

Nit: something to keep in mind: I don't think users of rpk.describe_topic should have to deal with this. A better solution would be to either make describe_topic more robust, signal to the caller that the results may be incomplete, or follow some other reasonable protocol.

andrwng (Contributor, Author):

Yeah, overall I agree. Going to punt on this for now because I don't want it to block these tests.
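As a hedged illustration of the "signal to the caller" option the reviewer mentions above (not code from this PR), a small wrapper could report whether the describe output looks complete and let callers decide how to react; the wrapper name and the expected_partitions parameter are hypothetical:

def describe_topic_checked(rpk, topic, expected_partitions):
    """Return (partitions, complete) so callers can see partial results.

    Assumes rpk.describe_topic(topic) yields one entry per partition, as in
    the snippet above; everything else here is illustrative.
    """
    partitions = list(rpk.describe_topic(topic))
    complete = len(partitions) == expected_partitions
    return partitions, complete

A predicate like has_offsets_for_all_partitions could then retry only while complete is False, instead of guessing why entries are missing.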

andrwng (Contributor, Author) commented Jul 23, 2022

Is this something that is now done via the installer and therefore caching is now happening in this update which adds more users of the redpanda installer?

Caching is happening now with #5459 in place.

  • With local ducktape, every test will install to and use binaries from a single shared bind mount. We are still pulling from Cloudsmith, but the inclusion of that PR means we're downloading each binary once per run of all tests instead of once per node per test case.
  • With clustered ducktape (and local ducktape for that matter), tests will leave binaries alone when tearing down and starting up a cluster, to encourage tests to reuse binaries.

@@ -1382,8 +1382,17 @@ def restart_nodes(self,
                       nodes,
                       override_cfg_params=None,
                       start_timeout=None,
-                      stop_timeout=None):
+                      stop_timeout=None,
+                      rolling_restart_nodes=False):
Member:

The rolling_restart parameter was fine. I was thinking that we'd add a new method and have restart_nodes(self) and restart_nodes_rolling(self). Delegating immediately to some other code based on the parameter makes it seem like it already is a separate method. Wdyt?

andrwng (Contributor, Author) commented Jul 24, 2022

Yeah, I guess there is enough of a disparity between a hard restart and a rolling restart that it warrants its own method. I originally thought it might be useful to keep them together, e.g. for the sake of parameterizing tests or some such, but I think the more likely future parameterizations would be on arguments to a rolling restart (e.g. the number of nodes to restart at a time), rather than to a generic restart method.

dotnwat previously approved these changes Jul 24, 2022.
dotnwat (Member) left a comment


LGTM once CI is passing.

andrwng (Contributor, Author) commented Jul 25, 2022

Hm, CI failed on ControllerUpgradeTest.test_updating_cluster_when_executing_operations in a way similar to #3811

ERROR 2022-07-24 19:11:32,110 [shard 1] cluster - controller_backend.cc:521 - exception while executing partition operation: {type: deletion, ntp: {kafka/fuzzy-operator-2712-fkvGFw/3}, offset: 131, new_assignment: { id: 3, group_id: 29, replicas: {{node_id: 4, shard: 1}} }, previous_assignment: {nullopt}} - std::__1::__fs::filesystem::filesystem_error (error system:39, filesystem error: remove failed: Directory not empty [/var/lib/redpanda/data/kafka/fuzzy-operator-2712-fkvGFw/3_107])

The fix for that should have been in 22.1.1, and this test currently upgrades from 22.1.4, so I'm not sure what's happening. The log message is also in our list of allowed logs for chaos tests, but I don't think we're doing anything unclean in this test. Digging in.

We previously relied on logical versions to partially simulate different
Redpanda versions. This commit uses the newly added RedpandaInstaller to
allow performing a more organic upgrade.

This also makes the test wait for a bit in between restarts to ensure we actually run through ops during the upgrade.

These will be useful for defining the initial state of a cluster.

This commit adds a new rolling_restart_nodes() method in RedpandaService
to orchestrate bits of a rolling restart. The actual bits to exercise
the rolling restart are placed in a separate file so we avoid ballooning
redpanda.py as our rolling restart capabilities and test requirements
expand.

It also adds a basic test to test this with a workload running in the
background.
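A minimal sketch of what such a rolling-restart orchestration can look like, assuming hypothetical stop_node/start_node/node_is_healthy helpers rather than the actual RedpandaService interface:

import time


def rolling_restart_nodes(service, nodes, readiness_timeout_sec=60):
    """Restart nodes one at a time, waiting for each to come back healthy.

    service.stop_node, service.start_node, and service.node_is_healthy are
    assumed helpers for this sketch; the real RedpandaService API may differ.
    """
    for node in nodes:
        service.stop_node(node)
        service.start_node(node)
        deadline = time.time() + readiness_timeout_sec
        # Wait for the restarted node to rejoin before touching the next one,
        # so the cluster keeps serving the background workload throughout.
        while not service.node_is_healthy(node):
            if time.time() > deadline:
                raise TimeoutError(f"{node} did not become healthy in time")
            time.sleep(1)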
This adds a basic test that rolls back a Redpanda cluster after a
partial upgrade.

Given the test moves replicas from node to node, it seems like a good
candidate to test compatibility between Redpanda versions.

The test currently relies on selecting random nodes for the sake of
moving replicas around, so I wasn't very intentional in deciding what
nodes to upgrade -- this commit just selects the first couple in index
order.

Occasionally we would hit errors like the following:

RunnerClient: rptest.tests.partition_movement_test.PartitionMovementTest.test_bootstrapping_after_move.num_to_upgrade=2: Summary: KeyError(1)
Traceback (most recent call last):
  File "/usr/local/lib/python3.9/dist-packages/ducktape/tests/runner_client.py", line 135, in run
    data = self.run_test()
  File "/usr/local/lib/python3.9/dist-packages/ducktape/tests/runner_client.py", line 227, in run_test
    return self.test_context.function(self.test)
  File "/usr/local/lib/python3.9/dist-packages/ducktape/mark/_mark.py", line 476, in wrapper
    return functools.partial(f, *args, **kwargs)(*w_args, **w_kwargs)
  File "/root/tests/rptest/services/cluster.py", line 35, in wrapped
    r = f(self, *args, **kwargs)
  File "/root/tests/rptest/tests/partition_movement_test.py", line 410, in test_bootstrapping_after_move
    wait_until(offsets_are_recovered, 30, 2)
  File "/usr/local/lib/python3.9/dist-packages/ducktape/utils/util.py", line 53, in wait_until
    raise e
  File "/usr/local/lib/python3.9/dist-packages/ducktape/utils/util.py", line 44, in wait_until
    if condition():
  File "/root/tests/rptest/tests/partition_movement_test.py", line 405, in offsets_are_recovered
    return all([
  File "/root/tests/rptest/tests/partition_movement_test.py", line 406, in <listcomp>
    offset_map[p.id] == p.high_watermark
KeyError: 1

It's possible that the command run by `describe_topic()` results in
lines that don't match the partition line regex. The regex expects
numeric values for each field, which isn't always the case e.g. if
there's a leadership change.

This commit adjusts the test to ensure there are three partitions in the
returned output before proceeding.
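A hedged sketch of that guard (not the exact test code), assuming ducktape's wait_until and an rpk client like the one in the snippet earlier in the thread; the helper name and the three-partition count are taken from the commit message above:

from ducktape.utils.util import wait_until


def wait_for_all_partitions_reported(rpk, topic, expected_partitions=3,
                                     timeout_sec=30):
    # Retry until describe_topic reports every partition; during leadership
    # changes some partition lines may be missing from the output entirely.
    def all_reported():
        return len(list(rpk.describe_topic(topic))) == expected_partitions

    wait_until(all_reported,
               timeout_sec=timeout_sec,
               backoff_sec=2,
               err_msg=f"never saw {expected_partitions} partitions for {topic}")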
andrwng (Contributor, Author) commented Jul 25, 2022

Added a workaround for #5629, though it points to a real bug that probably existed in the test before (this PR just makes the test run for longer).

Diff https://github.com/redpanda-data/redpanda/compare/29b8f35163da1166287f5f65ec48d33b3554ebfe..57b91efd3c5b078538dfedd04519e5555da6edbc

andrwng requested a review from dotnwat on July 27, 2022.
NyaliaLui (Contributor) left a comment


Consider bringing up #5629 in standup or somewhere. My reasoning: if #5629 is a real bug (which it seems to be), we should have a test that automatically checks for that instead of continuing to add to the log allow list.

andrwng (Contributor, Author) commented Jul 27, 2022

Consider bringing up #5629 in standup or somewhere. My reasoning: if #5629 is a real bug (which it seems to be), we should have a test that automatically checks for that instead of continuing to add to the log allow list.

Yeah, I'll bring this up more broadly. Opted to work around it for now since I don't want to block testing on bug fixes that aren't directly related to the upcoming release.

Thanks for the review!

andrwng merged commit 0b51519 into redpanda-data:dev on Jul 27, 2022.
andrwng deleted the upgrade-tests branch on July 27, 2022.
Successfully merging this pull request may close these issues:
Failure in ControllerUpgradeTest test_updating_cluster_when_executing_operations