Scale test for recovery from S3 #5818

ajfabbri · 2022-08-03T18:53:16Z

Create a scale test which exercises cluster recovery from S3 with larger amounts of data and partitions.

~~Note: Based on #5667, so ignore duplicate commits here until that is merged.~~

VladLazar · 2022-08-05T11:02:32Z

tests/rptest/tests/read_replica_e2e_test.py

+
+        # Get current bucket usage
+        pre_usage = self._bucket_usage()
+        self.logger.info(f"pre_usage {pre_usage}")


Nit: Maybe move this to debug?

VladLazar · 2022-08-05T11:02:46Z

tests/rptest/tests/read_replica_e2e_test.py

+        self.run_consumer_validation()
+
+        post_usage = self._bucket_usage()
+        self.logger.info(f"post_usage {post_usage}")


Nit: maybe move this to debug?

VladLazar · 2022-08-05T11:05:58Z

tests/rptest/tests/topic_recovery_test.py

@@ -1080,10 +1086,13 @@ def do_run(self, test_case: BaseCase):
        test_case.generate_baseline()

        # Give time to finish all uploads
-        time.sleep(10)
+        self.logger.info(f"Waiting {upload_delay_sec} sec for S3 uploads...")
+        time.sleep(upload_delay_sec)


I feel like we should wait on a condition here instead of a fixed time.

ajfabbri · 2022-08-05

I generally agree. This is an existing test, so I left it as is for now. One upside of a sleep here is that it saves us some money on S3 metadata queries. The bigger problem with this code is that it doesn't seem to reliably find all the segments it expects, even with very long (over an hour) wait times. That is with this scale test, though, which creates a ton of segments.
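
For reference, a minimal sketch of what "wait on a condition" could look like in place of the fixed sleep. It is not the code in this PR: `wait_for_condition`, `make_upload_check`, and the `list_uploaded` callable are hypothetical names, and how the set of uploaded segments is actually obtained (for example, by listing the bucket) is left as an assumption.

```python
import time
from typing import Callable, Set


def wait_for_condition(condition: Callable[[], bool],
                       timeout_sec: float,
                       backoff_sec: float = 1.0) -> None:
    """Poll `condition` until it returns True or `timeout_sec` elapses."""
    deadline = time.monotonic() + timeout_sec
    while True:
        if condition():
            return
        if time.monotonic() >= deadline:
            raise TimeoutError(f"condition not met within {timeout_sec} sec")
        # A longer backoff means fewer polls, i.e. fewer S3 listing requests.
        time.sleep(backoff_sec)


def make_upload_check(list_uploaded: Callable[[], Set[str]],
                      expected: Set[str]) -> Callable[[], bool]:
    """Build a predicate that is true once every expected segment is uploaded.

    `list_uploaded` is a hypothetical callable returning the segment names
    currently visible in the bucket.
    """
    return lambda: expected.issubset(list_uploaded())
```

With this shape, the cost concern raised above turns into a tuning choice: a larger `backoff_sec` keeps the number of metadata queries low while still letting the test proceed as soon as the uploads are visible.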

VladLazar · 2022-08-05T11:40:22Z

tests/rptest/scale_tests/extreme_recovery_test.py

+    def tearDown(self):
+        super(ExtremeRecoveryTest, self).tearDown()


This isn't needed unless you intend to add something specific for this class.

ajfabbri · 2022-08-05

Yeah, so far it is just a reminder that "any shutdown code you need goes here".

dotnwat · 2022-08-14T18:38:10Z

@ajfabbri looks like a merge conflict

ajfabbri · 2022-08-15T03:37:22Z

Force-push:

- Latest test code, and fixing merge conflicts.
- Remove scale test suite; work towards getting into nightlies for now.
- Parallelize file checksum computation (was ~30% of test runtime).

ajfabbri · 2022-08-23T23:44:38Z

Force-push: Rebase on latest dev and fix conflicts. Address two nits from @abhijat

abhijat

lgtm

Lazin

LGTM
One question. As I understand it, the test uses the same size-based logic as the old recovery test to validate results, but it also uses the verifiable consumer. Is that correct? If so, maybe we should get rid of size-based validation and keep only verifiable consumer validation.

Lazin · 2022-08-18T08:47:42Z

src/v/archival/ntp_archiver_service.cc

@@ -469,18 +469,18 @@ ss::future<ntp_archiver::scheduled_upload> ntp_archiver::schedule_single_upload(
        // invariant:
        // - A == C (because the name contains base offset)
        // cases:
-        // - B < C:
+        // - B < D:


Good catch!

Lazin · 2022-08-30T07:40:53Z

tests/rptest/scale_tests/extreme_recovery_test.py

+
+    def after_restart_validation(self):
+        """Check that topic is writable after restart"""
+        # XXX TODO


Looks like syntax error

At least in this commit, it's probably fixed in following commits

ajfabbri · 2022-09-06

Thanks. I will add a pass so the body won't be empty.
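
A side note on the snippet as quoted above: in Python a docstring is itself a statement, so a method whose body is a docstring followed by a comment does parse, whereas a body containing only a comment does not. An explicit pass is harmless either way and makes the "not implemented yet" intent clearer. The check below is a small sketch using the standard `ast` module; the two source strings are illustrative copies, not the file from this PR.

```python
import ast

# Body is a docstring plus a comment (the shape shown in the diff above).
DOCSTRING_ONLY = '''
def after_restart_validation(self):
    """Check that topic is writable after restart"""
    # XXX TODO
'''

# Body is only a comment: there is no statement at all.
COMMENT_ONLY = '''
def after_restart_validation(self):
    # XXX TODO
'''


def parses(source: str) -> bool:
    """Return True if `source` is syntactically valid Python."""
    try:
        ast.parse(source)
        return True
    except SyntaxError:  # IndentationError is a subclass of SyntaxError
        return False


docstring_ok = parses(DOCSTRING_ONLY)
comment_ok = parses(COMMENT_ONLY)
```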

ajfabbri · 2022-09-06T21:41:11Z

Force-push: address nit (empty Python method; we should add this to our linter). The CI failure is the k8s operator test (not related).

ajfabbri · 2022-09-07T03:17:40Z

> LGTM
> One question. As I understand it, the test uses the same size-based logic as the old recovery test to validate results, but it also uses the verifiable consumer. Is that correct? If so, maybe we should get rid of size-based validation and keep only verifiable consumer validation.

Thank you @Lazin. I plan on doing some more refactoring of the test in the future (to avoid subclassing the main recovery test directly), and I think this is a good idea.

Lazin · 2022-09-12T11:22:51Z

k8s operator test is failing again

Lazin

LGTM

In preparation for a scale test which uses parts of topic_recovery_test.py, make some tweaks to TopicRecoveryTest:
- Plumb extra config options through constructor.
- Allow passing in upload delay time.
- Raise max S3 uploads 10 -> 20 when running on dedicated nodes, to help deal with larger scale tests (e.g. many partitions).

Adds type annotations for redpanda service's segment checksum utility, including helper methods in its caller, topic_recovery_test. Also update a comment about the sleep waiting for cloud uploads.

In extreme_recovery_test.py, the large-scale version of the topic recovery test, a significant part of the test runtime was spent computing checksums of each node's segments. This was done node by node, serially. To speed it up, parallelize this operation over all nodes.
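
As an illustration of the change described in this commit message, here is a rough sketch of fanning the per-node checksum work out over a thread pool. It is not the code from this PR: `checksums_for_all_nodes` and `fake_checksum_node` are hypothetical names, nodes are represented as plain strings, and the per-node work is assumed to be I/O-bound (for example, a remote command), which is why threads are used here.

```python
import hashlib
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, List

# Hypothetical shapes for illustration: a node is identified by name, and
# a node's checksums map segment path -> hex digest.
Checksums = Dict[str, str]


def checksums_for_all_nodes(
        nodes: List[str],
        checksum_node: Callable[[str], Checksums]) -> Dict[str, Checksums]:
    """Run `checksum_node` for every node concurrently, one thread per node."""
    if not nodes:
        return {}
    with ThreadPoolExecutor(max_workers=len(nodes)) as pool:
        # map() yields results in input order and re-raises any worker error.
        results = pool.map(checksum_node, nodes)
        return dict(zip(nodes, results))


def fake_checksum_node(node: str) -> Checksums:
    """Stand-in for the real per-node work (e.g. a remote command)."""
    segments = {f"{node}/0-1-v1.log": node.encode()}
    return {path: hashlib.sha256(data).hexdigest()
            for path, data in segments.items()}


all_checksums = checksums_for_all_nodes(["n1", "n2", "n3"], fake_checksum_node)
```

Because each node is independent, the wall-clock time approaches that of the slowest node rather than the sum over all nodes.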

So we can integrate with nightly testing, while we work on setting up less-frequent schedules.

ajfabbri · 2022-09-13T00:51:21Z

Force-push: rebase to latest dev in hopes of resolving the k8s CI test issue.

ajfabbri force-pushed the extreme-recovery branch 3 times, most recently from 81d6933 to 76feeda Compare August 5, 2022 06:42

ajfabbri marked this pull request as ready for review August 5, 2022 06:42

ajfabbri requested review from dotnwat and NyaliaLui as code owners August 5, 2022 06:42

piyushredpanda requested review from abhijat, bharathv, Lazin and jcsp and removed request for dotnwat and NyaliaLui August 5, 2022 06:57

ajfabbri force-pushed the extreme-recovery branch from 76feeda to f2144dc Compare August 5, 2022 07:01

abhijat reviewed Aug 5, 2022

tests/rptest/scale_tests/extreme_recovery_test.py (resolved)

VladLazar reviewed Aug 5, 2022

ajfabbri force-pushed the extreme-recovery branch from f2144dc to 6b44d41 Compare August 5, 2022 20:22

jcsp reviewed Aug 8, 2022

tests/rptest/test_suite_scale.yml (outdated, resolved)

ajfabbri mentioned this pull request Aug 9, 2022

extreme_recovery_test.py fails to upload one manifest #5928

Closed

ajfabbri force-pushed the extreme-recovery branch from 6b44d41 to d9a2409 Compare August 10, 2022 19:47

ajfabbri force-pushed the extreme-recovery branch from d9a2409 to 1639b74 Compare August 15, 2022 03:04

github-actions bot added the area/redpanda label Aug 15, 2022

ajfabbri force-pushed the extreme-recovery branch from 1639b74 to 641ccf1 Compare August 15, 2022 03:10

mmedenjak added the kind/enhance, area/tests, and area/cloud-storage labels and removed the area/redpanda label Aug 15, 2022

ajfabbri force-pushed the extreme-recovery branch from 641ccf1 to d59dc0d Compare August 17, 2022 01:04

github-actions bot added the area/redpanda label Aug 17, 2022

ajfabbri force-pushed the extreme-recovery branch 2 times, most recently from e27f8b7 to ab03495 Compare August 23, 2022 23:43

ajfabbri force-pushed the extreme-recovery branch from ab03495 to 353a4d4 Compare August 26, 2022 02:00

jcsp mentioned this pull request Aug 26, 2022

HighThroughputPartitionMovementTest doesn't drive load throughout test #6245

Closed

ajfabbri requested a review from abhijat August 27, 2022 03:22

piyushredpanda removed the request for review from bharathv August 27, 2022 03:23

abhijat previously approved these changes Aug 27, 2022

Lazin previously approved these changes Aug 30, 2022

ajfabbri dismissed stale reviews from Lazin and abhijat via ec61474 September 6, 2022 17:51

ajfabbri force-pushed the extreme-recovery branch from 353a4d4 to ec61474 Compare September 6, 2022 17:51

ajfabbri requested a review from Lazin September 6, 2022 21:42

Lazin previously approved these changes Sep 12, 2022

Aaron Fabbri added 7 commits September 12, 2022 17:50

tests/scale: add larger-scale recovery test

690260c

tests/topic_recovery: add type hints

dbc5787

ntp_archiver_service: fix comment on upload conditions

5971660

tests: add typing to redpanda service checksum code

88b2c1d

Adds type annotations for redpanda service's segment checksum utility, including helper methods in its caller, topic_recovery_test. Also update a comment about the sleep waiting for cloud uploads.

scale_tests: reduce scale in extreme_recovery_test.

2025e03

So we can integrate with nightly testing, while we work on setting up less-frequent schedules.

ajfabbri dismissed Lazin’s stale review via 2025e03 September 13, 2022 00:50

ajfabbri force-pushed the extreme-recovery branch from ec61474 to 2025e03 Compare September 13, 2022 00:50

Lazin approved these changes Sep 13, 2022

ajfabbri merged commit 232ce37 into redpanda-data:dev Sep 13, 2022

ajfabbri deleted the extreme-recovery branch September 13, 2022 18:45

This pull request was closed.
