Scale test for recovery from S3 #5818

Merged
merged 7 commits into from
Sep 13, 2022

ajfabbri
Contributor

@ajfabbri ajfabbri commented Aug 3, 2022

Create a scale test which exercises cluster recovery from S3 with larger amounts of data and partitions.

Note: Based on #5667, so ignore duplicate commits here until that is merged.

@ajfabbri ajfabbri force-pushed the extreme-recovery branch 3 times, most recently from 81d6933 to 76feeda Compare August 5, 2022 06:42
@ajfabbri ajfabbri marked this pull request as ready for review August 5, 2022 06:42
@piyushredpanda piyushredpanda requested review from abhijat, bharathv, Lazin and jcsp and removed request for dotnwat and NyaliaLui August 5, 2022 06:57
tests/rptest/tests/read_replica_e2e_test.py Outdated Show resolved Hide resolved

# Get current bucket usage
pre_usage = self._bucket_usage()
self.logger.info(f"pre_usage {pre_usage}")
Contributor

Nit: Maybe move this to debug?

self.run_consumer_validation()

post_usage = self._bucket_usage()
self.logger.info(f"post_usage {post_usage}")
Contributor

Nit: maybe move this to debug?

@@ -1080,10 +1086,13 @@ def do_run(self, test_case: BaseCase):
test_case.generate_baseline()

# Give time to finish all uploads
time.sleep(10)
self.logger.info(f"Waiting {upload_delay_sec} sec for S3 uploads...")
time.sleep(upload_delay_sec)
Contributor

I feel like we should wait on a condition here instead of a fixed time.

Contributor Author

I generally agree. This is an existing test so I left it for now. One good thing about a sleep here is it saves us some $$ on S3 metadata queries. The bigger problem with this code is that it doesn't seem to reliably find all the segments it expects, even with very long (over an hour) wait times. This is with this scale test though--which creates a ton of segments.
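
The "wait on a condition" idea suggested above could look roughly like the following. This is a minimal sketch, not the test's actual code: `wait_for_condition` and its parameters are hypothetical names, and the condition callable stands in for whatever check confirms that the expected segments have reached S3.

```python
import time
from typing import Callable


def wait_for_condition(condition: Callable[[], bool],
                       timeout_sec: float,
                       backoff_sec: float,
                       sleep: Callable[[float], None] = time.sleep,
                       clock: Callable[[], float] = time.monotonic) -> bool:
    """Poll `condition` until it returns True or `timeout_sec` elapses.

    Hypothetical helper: `sleep` and `clock` are injectable so the loop
    can be exercised without real waiting.
    """
    deadline = clock() + timeout_sec
    while True:
        if condition():
            return True
        if clock() >= deadline:
            # Timed out; the caller decides whether this is a failure.
            return False
        # A larger backoff means fewer (billable) metadata queries.
        sleep(backoff_sec)
```

A larger `backoff_sec` keeps most of the cost benefit of a fixed sleep, since each poll is one round of S3 metadata queries, while still letting the test proceed as soon as the uploads are visible.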

tests/rptest/scale_tests/extreme_recovery_test.py Outdated Show resolved Hide resolved
Comment on lines +148 to +170
def tearDown(self):
    super(ExtremeRecoveryTest, self).tearDown()
Contributor

This isn't needed unless you intend to add something specific for this class.

Contributor Author

Yeah, so far it is just a reminder that "any shutdown code you need goes here".
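
To illustrate the reviewer's point, here is a small sketch with plain stand-in classes (`Base`, `WithOverride`, and `WithoutOverride` are hypothetical, not the real test classes). An override that only forwards to the parent behaves the same as no override at all, because the inherited method is found through normal method resolution.

```python
class Base:
    """Hypothetical stand-in for the parent test class."""

    def __init__(self) -> None:
        self.events = []

    def tearDown(self) -> None:
        self.events.append("base teardown")


class WithOverride(Base):
    def tearDown(self) -> None:
        # Only forwards to the parent: a placeholder for future cleanup code.
        super().tearDown()


class WithoutOverride(Base):
    # No override: Base.tearDown is inherited unchanged.
    pass


def run_teardown(test: Base) -> list:
    """Call tearDown and return the recorded events."""
    test.tearDown()
    return test.events
```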

tests/rptest/scale_tests/extreme_recovery_test.py Outdated Show resolved Hide resolved
@dotnwat
Member

dotnwat commented Aug 14, 2022

@ajfabbri looks like a merge conflict

@ajfabbri
Contributor Author

Force-push:

  • Latest test code, and fixing merge conflicts.
  • Remove scale test suite.. work towards getting into nightlies for now.
  • Parallelize file checksum computation (was ~30% of test runtime)
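
The parallelization mentioned in the last bullet can be sketched as follows. This is an illustration under stated assumptions, not the PR's implementation: `checksum_node` and `checksum_all_nodes` are hypothetical names, in-memory byte strings stand in for segment files that the real test would read from each node, and SHA-256 is an arbitrary choice of hash.

```python
import hashlib
from concurrent.futures import ThreadPoolExecutor
from typing import Dict


def checksum_node(files: Dict[str, bytes]) -> Dict[str, str]:
    """Checksum every file on one node (hypothetical: data is in memory)."""
    return {path: hashlib.sha256(data).hexdigest()
            for path, data in files.items()}


def checksum_all_nodes(
        nodes: Dict[str, Dict[str, bytes]]) -> Dict[str, Dict[str, str]]:
    """Run checksum_node for all nodes concurrently, one worker per node."""
    with ThreadPoolExecutor(max_workers=max(1, len(nodes))) as pool:
        # Submit every node first so the work overlaps, then collect results.
        futures = {name: pool.submit(checksum_node, files)
                   for name, files in nodes.items()}
        return {name: fut.result() for name, fut in futures.items()}
```

Threads are a reasonable fit when the per-node work is dominated by waiting on a remote command or I/O rather than by local CPU time.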

@ajfabbri ajfabbri force-pushed the extreme-recovery branch 2 times, most recently from e27f8b7 to ab03495 Compare August 23, 2022 23:43
@ajfabbri
Contributor Author

Force-push: Rebase on latest dev and fix conflicts. Address two nits from @abhijat

abhijat
abhijat previously approved these changes Aug 27, 2022
Contributor

@abhijat abhijat left a comment

lgtm

Lazin
Lazin previously approved these changes Aug 30, 2022
Contributor

@Lazin Lazin left a comment

LGTM
One question. As I understand the test uses the same size based logic as old recovery test to validate results. But also it uses verifiable consumer. Is it correct? If this is the case, maybe we should get rid of size based validation and keep only verifiable consumer validation.

@@ -469,18 +469,18 @@ ss::future<ntp_archiver::scheduled_upload> ntp_archiver::schedule_single_upload(
// invariant:
// - A == C (because the name contains base offset)
// cases:
// - B < C:
// - B < D:
Contributor

Good catch!


def after_restart_validation(self):
    """Check that topic is writable after restart"""
    # XXX TODO
Contributor

Looks like syntax error

Contributor

At least in this commit, it's probably fixed in following commits

Contributor Author

Thanks. I will add a pass so the body won't be empty.
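
As a general illustration of the fix, a method whose body consists only of a comment is rejected by the Python parser, and adding `pass` makes it valid. The sketch below checks this with the built-in `compile()`; the `compiles` helper and the two source strings are hypothetical and written for this example only.

```python
def compiles(source: str) -> bool:
    """Return True if `source` parses as valid Python."""
    try:
        compile(source, "<snippet>", "exec")
        return True
    except SyntaxError:
        # IndentationError is a subclass of SyntaxError.
        return False


# A comment is not a statement, so this function has no body.
COMMENT_ONLY = (
    "def after_restart_validation(self):\n"
    "    # XXX TODO\n"
)

# Adding `pass` gives the function a body.
WITH_PASS = COMMENT_ONLY + "    pass\n"
```

A check along these lines could back the linter idea raised later in the thread, although most Python linters already report such files as unparsable.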

@ajfabbri
Contributor Author

ajfabbri commented Sep 6, 2022

Force push: address nit (empty python method--we should add this to our linter).. CI failure is k8s operator (not related).

@ajfabbri ajfabbri requested a review from Lazin September 6, 2022 21:42
@ajfabbri
Contributor Author

ajfabbri commented Sep 7, 2022

> LGTM One question. As I understand the test uses the same size based logic as old recovery test to validate results. But also it uses verifiable consumer. Is it correct? If this is the case, maybe we should get rid of size based validation and keep only verifiable consumer validation.

Thank you @Lazin .. I plan on doing some more refactoring on the test in future (to avoid subclassing the main recovery test directly).. and I think this is a good idea.

@Lazin
Contributor

Lazin commented Sep 12, 2022

k8s operator test is failing again

Lazin
Lazin previously approved these changes Sep 12, 2022
Contributor

@Lazin Lazin left a comment

LGTM

Aaron Fabbri added 7 commits September 12, 2022 17:50

In preparation for a scale test which uses parts of
topic_recovery_test.py, make some tweaks to TopicRecoveryTest:

- Plumb extra config options through constructor.
- Allow passing in upload delay time.
- Raise max s3 uploads 10 -> 20 when running on dedicated nodes, to help
  deal with larger scale tests (e.g. many partitions).

Adds type annotations for redpanda service's segment checksum utility,
including helper methods in its caller, topic_recovery_test.

Also update a comment about the sleep waiting for cloud uploads.

In extreme_recovery_test.py, the large-scale version of topic recovery
test, a significant part of test runtime was this computation of each
node's segment checksums. It was done node by node, serially.

To speed this up, parallelize this operation over all nodes.

So we can integrate with nightly testing, while we work on setting up
less-frequent schedules.

@ajfabbri
Contributor Author

Force-push: rebase to latest dev in hopes of resolving k8s CI test issue.

@ajfabbri ajfabbri merged commit 232ce37 into redpanda-data:dev Sep 13, 2022
@ajfabbri ajfabbri deleted the extreme-recovery branch September 13, 2022 18:45
This pull request was closed.
Labels
area/cloud-storage Shadow indexing subsystem area/redpanda area/tests kind/enhance New feature or request
7 participants