-
Notifications
You must be signed in to change notification settings - Fork 579
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Scale test for recovery from S3 #5818
Conversation
81d6933
to
76feeda
Compare
76feeda
to
f2144dc
Compare
|
||
# Get current bucket usage | ||
pre_usage = self._bucket_usage() | ||
self.logger.info(f"pre_usage {pre_usage}") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: Maybe move this to debug?
self.run_consumer_validation() | ||
|
||
post_usage = self._bucket_usage() | ||
self.logger.info(f"post_usage {post_usage}") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: maybe move this to debug?
@@ -1080,10 +1086,13 @@ def do_run(self, test_case: BaseCase): | |||
test_case.generate_baseline() | |||
|
|||
# Give time to finish all uploads | |||
time.sleep(10) | |||
self.logger.info(f"Waiting {upload_delay_sec} sec for S3 uploads...") | |||
time.sleep(upload_delay_sec) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I feel like we should wait on a condition here instead of a fixed time.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I generally agree. This is an existing test so I left it for now. One good thing about a sleep here is it saves us some $$ on S3 metadata queries. The bigger problem with this code is that it doesn't seem to reliably find all the segments it expects, even with very long (over an hour) wait times. This is with this scale test though--which creates a ton of segments.
def tearDown(self): | ||
super(ExtremeRecoveryTest, self).tearDown() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This isn't needed unless you intend to add something specific for this class.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, so far it is just a reminder that "any shutdown code you need goes here".
f2144dc
to
6b44d41
Compare
6b44d41
to
d9a2409
Compare
@ajfabbri looks like a merge conflict |
d9a2409
to
1639b74
Compare
1639b74
to
641ccf1
Compare
Force-push:
|
641ccf1
to
d59dc0d
Compare
e27f8b7
to
ab03495
Compare
Force-push: Rebase on latest dev and fix conflicts. Address two nits from @abhijat |
ab03495
to
353a4d4
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
lgtm
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
One question. As I understand the test uses the same size based logic as old recovery test to validate results. But also it uses verifiable consumer. Is it correct? If this is the case, maybe we should get rid of size based validation and keep only verifiable consumer validation.
@@ -469,18 +469,18 @@ ss::future<ntp_archiver::scheduled_upload> ntp_archiver::schedule_single_upload( | |||
// invariant: | |||
// - A == C (because the name contains base offset) | |||
// cases: | |||
// - B < C: | |||
// - B < D: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good catch!
|
||
def after_restart_validation(self): | ||
"""Check that topic is writable after restart""" | ||
# XXX TODO |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks like syntax error
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
At least in this commit, it's probably fixed in following commits
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks. I will add a pass
so the body won't be empty.
353a4d4
to
ec61474
Compare
Force push: address nit (empty python method--we should add this to our linter).. CI failure is k8s operator (not related). |
Thank you @Lazin .. I plan on doing some more refactoring on the test in future (to avoid subclassing the main recovery test directly).. and I think this is a good idea. |
k8s operator test is failing again |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
In preparation for a scale test which uses parts of topic_recovery_test.py; make some tweaks to TopicRecoveryTest: - Plumb extra config options through constructor. - Allow passing in upload delay time. - Raise max s3 uploads 10 -> 20 when running on dedicated nodes, to help deal with larger scale tests (e.g. many partitions).
Adds type annotations for redpanda service's segment checksum utility, including helper methods in its caller, topic_recovery_test. Also update a comment about the sleep waiting for cloud uploads.
In extreme_recovery_test.py, the large-scale version of topic recovery test, a significant part of test runtime was this computation of each nodes' segments' checksums. It was done node by node, serially. To speed this up, parallelize this operation over all nodes.
So we can integrate with nightly testing, while we work on setting up less-frequent schedules.
ec61474
to
2025e03
Compare
Force-push: rebase to latest dev in hopes of resolving k8s tCI est issue. |
Create a scale test which exercises cluster recovery from S3 with larger amounts of data and partitions.
Note: Based on #5667, so ignore duplicate commits here until that is merged.