
cloud_storage/tests: adjust delta for restored partitions #5372

Conversation

@abhijat (Contributor) commented Jul 7, 2022

When asserting that the size of a restored partition is close to that of the original partition in topic recovery tests, we use the default segment size with a 1.5 multiplier to tolerate cases where the last segment of the NTP is not uploaded to tiered storage because it is still open.

Sometimes that segment is larger than 1.5x the default size, causing an assertion failure in the test. This change stores the last segment size for each partition and uses it as the tolerance size in assertions.

fixes #4972
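
For illustration, a minimal sketch of the assertion scheme described above (the names and the default segment size are stand-ins, not the actual test code):

```python
# Minimal sketch of the tolerance check described in this PR; the names and
# the default segment size here are illustrative stand-ins.
DEFAULT_LOG_SEGMENT_SIZE = 1048576  # assumed value, for illustration only

def check_restored_size(orig_ntp_size, rest_ntp_size, last_segment_size):
    # the restored partition should never be larger than the original
    assert rest_ntp_size <= orig_ntp_size
    delta = orig_ntp_size - rest_ntp_size
    # old scheme: a fixed 1.5x multiplier on the default segment size
    tolerance = int(1.5 * DEFAULT_LOG_SEGMENT_SIZE)
    # this change: also accept a delta up to the recorded last segment size
    tolerance = max(tolerance, last_segment_size)
    assert delta <= tolerance, f"delta {delta} exceeds tolerance {tolerance}"
```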

@abhijat abhijat force-pushed the topic-recovery-test-fast2-adjust-assertion-delta-size branch from 54d5da9 to 686ac05 Compare July 7, 2022 08:28
@abhijat abhijat marked this pull request as ready for review July 7, 2022 08:58
@abhijat abhijat added the ci-repeat-5 repeat tests 5x concurrently to check for flakey tests; self-cancelling label Jul 7, 2022
@vbotbuildovich vbotbuildovich removed the ci-repeat-5 repeat tests 5x concurrently to check for flakey tests; self-cancelling label Jul 7, 2022
@abhijat abhijat requested review from Lazin and LenaAn July 7, 2022 09:59
@piyushredpanda piyushredpanda requested a review from jcsp July 8, 2022 03:34
```
@@ -164,10 +186,10 @@ def get_ntp_sizes(fdata_per_host, hosts_can_vary=True):
        rest_ntp_size = restored_ntps[ntp]
        assert rest_ntp_size <= orig_ntp_size, f"NTP {ntp} the restored partition is larger {rest_ntp_size} than the original one {orig_ntp_size}."
        delta = orig_ntp_size - rest_ntp_size
        tolerance = int(1.5 * default_log_segment_size)
        tolerance = max(tolerance, baseline_last_segment_sizes.get(ntp, 0))
```
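
For context, a hedged sketch of how the per-NTP last segment sizes used above might be collected (the input shape and names are assumptions, not the real test code):

```python
def record_last_segment_sizes(baseline_segments):
    # baseline_segments: dict mapping NTP -> list of (base_offset, size_bytes)
    # tuples; this input shape is assumed for illustration.
    sizes = {}
    for ntp, segs in baseline_segments.items():
        # the segment with the highest base offset may still be open and
        # therefore not yet uploaded to tiered storage
        _base_offset, size_bytes = max(segs)
        sizes[ntp] = size_bytes
    return sizes
```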
Contributor

Now that we have the exact last segment size, do we still need the 1.5 tolerance?

Maybe there is still some tolerance needed (if the restored system has e.g. created an extra empty segment), but probably not the full 1.5?

Contributor Author

We probably don't need the default any more because we filter out the empty segments when getting ntp sizes. I had added it as a fallback but I think we should be okay removing it completely.

Contributor

yep, that should be OK

Contributor Author

It looks like this test is consistently failing in CI when we rely on the last segment size; sometimes the 1.5x default segment size is greater than the last segment. Will look through the logs.

@jcsp (Contributor) commented Jul 8, 2022

I'm guessing the reason we're overshooting so much + nondeterministically is that the batch size is large relative to the segment size, and we're getting different batch sizes depending on client timing -- is that right?

@abhijat abhijat force-pushed the topic-recovery-test-fast2-adjust-assertion-delta-size branch from 686ac05 to 7fbda0e Compare July 8, 2022 09:45
@abhijat abhijat added the ci-repeat-5 repeat tests 5x concurrently to check for flakey tests; self-cancelling label Jul 8, 2022
@vbotbuildovich vbotbuildovich removed the ci-repeat-5 repeat tests 5x concurrently to check for flakey tests; self-cancelling label Jul 8, 2022
@abhijat abhijat added the ci-repeat-5 repeat tests 5x concurrently to check for flakey tests; self-cancelling label Jul 8, 2022
@vbotbuildovich vbotbuildovich removed the ci-repeat-5 repeat tests 5x concurrently to check for flakey tests; self-cancelling label Jul 8, 2022
@abhijat (Contributor Author) commented Jul 8, 2022

> I'm guessing the reason we're overshooting so much + nondeterministically is that the batch size is large relative to the segment size, and we're getting different batch sizes depending on client timing -- is that right?

This issue became much more frequent after switching to the rpk producer, so it might be related to the client. I have not yet been able to look into the exact reason why we generate segments much larger than the default size, but what you mentioned sounds plausible; I will look into why this could be happening.

@mmedenjak mmedenjak added kind/bug Something isn't working area/tests ci-failure labels Jul 8, 2022
Lazin previously approved these changes Jul 8, 2022
@jcsp (Contributor) commented Jul 11, 2022

@abhijat the latest push is failing test_fast1 + test_fast3

@abhijat (Contributor Author) commented Jul 11, 2022

Looking at one instance of failure:

```
AssertionError: NTP NTP(ns='kafka', topic='panda-topic-1', partition=1) the restored partition is too small 5260482. The original is 5260941 bytes which 459 bytes larger.
```

Original segment sizes for this NTP were:
1947153 + 1946694 + 1367094 + 455 = 5261396

We skip the small segments, so effectively: 1947153 + 1946694 + 1367094 = 5260941

Restored sizes are:
1946694 + 1946694 + 1367094 + 4096 = 5264578

Again, skipping the small segments: 1946694 + 1946694 + 1367094 = 5260482

diff = 459 bytes. This is mostly due to the very first segment: the original was 1947153 bytes and the restored is 1946694 bytes.

Looking at the segment download log during recovery for this segment:

```
DEBUG 2022-07-08 11:17:54,576 [shard 1] cloud_storage - [fiber0~0|0|299981ms [kafka/panda-topic-1/0, rev: 20]] - partition_recovery_manager.cc:633 - Log segment downloaded. 1947153 bytes expected, 1946694 bytes after pre-processing.
...
DEBUG 2022-07-08 11:17:54,563 [shard 1] cloud_storage - [fiber0~0|0|299994ms] - offset_translation_layer.cc:42 - skipping batch {header_crc:4147633031, size_bytes:459, base_offset:{0}, type:batch_type::raft_configuration, crc:-839928373, attrs:{compression:none, type:CreateTime, transactional: 0, control: 0}, last_offset_delta:0, first_timestamp:{timestamp: 1657279044052}, max_timestamp:{timestamp: 1657279044052}, producer_id:-1, producer_epoch:-1, base_sequence:-1, record_count:1, ctx:{term:{-9223372036854775808}, owner_shard:{1}}}
```

Offset translation skips a 459-byte batch, which accounts for the difference in sizes.

But this is expected, since we remove the config batches from the restored segment. So using the exact last segment size does not cover all cases; we also need to account for the config batches being removed.

cc @Lazin

Notably, recording the last segment size did not help in this case, as it was too small to be accounted for.
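
A quick script to re-check the arithmetic above (sizes copied from the logs):

```python
# Segment sizes from the failing run quoted above.
original = [1947153, 1946694, 1367094, 455]
restored = [1946694, 1946694, 1367094, 4096]
# The test filters out tiny segments; judging by the sums quoted above, the
# 4096-byte restored segment is dropped too, so filter on size > 4096.
orig_total = sum(s for s in original if s > 4096)  # 5260941
rest_total = sum(s for s in restored if s > 4096)  # 5260482
assert orig_total - rest_total == 459  # the skipped raft_configuration batch
```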

@abhijat abhijat added the ci-repeat-10 repeat tests 10x concurrently to check for flakey tests; self-cancelling label Jul 21, 2022
@vbotbuildovich vbotbuildovich removed the ci-repeat-10 repeat tests 10x concurrently to check for flakey tests; self-cancelling label Jul 21, 2022
@abhijat abhijat added the ci-repeat-10 repeat tests 10x concurrently to check for flakey tests; self-cancelling label Jul 21, 2022
@vbotbuildovich vbotbuildovich removed the ci-repeat-10 repeat tests 10x concurrently to check for flakey tests; self-cancelling label Jul 21, 2022
@abhijat (Contributor Author) commented Jul 21, 2022

Trying a new approach here: count the delta offset in the partition manifest and use that to derive tolerable error limits.
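
A heavily hedged sketch of what such a check could look like; the manifest field names below are assumptions for illustration, not the actual manifest schema:

```python
def count_non_data_batches(manifest: dict) -> int:
    # Hypothetical: estimate the number of non-data (e.g. raft_configuration)
    # batches for a partition from delta offsets recorded in its manifest.
    # The 'delta_offset'/'delta_offset_end' field names are assumed.
    total = 0
    for _name, meta in manifest.get('segments', {}).items():
        total += meta.get('delta_offset_end', 0) - meta.get('delta_offset', 0)
    return total
```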

@abhijat abhijat force-pushed the topic-recovery-test-fast2-adjust-assertion-delta-size branch from 6bda7f3 to 4b52dd0 Compare July 26, 2022 11:51
@abhijat abhijat added the ci-repeat-10 repeat tests 10x concurrently to check for flakey tests; self-cancelling label Jul 26, 2022
@vbotbuildovich vbotbuildovich removed the ci-repeat-10 repeat tests 10x concurrently to check for flakey tests; self-cancelling label Jul 26, 2022
@abhijat abhijat force-pushed the topic-recovery-test-fast2-adjust-assertion-delta-size branch 2 times, most recently from 3523993 to fd40c3a Compare July 28, 2022 03:47
Files on disk can be larger than the actual data uploaded to cloud storage. This appears to be due to block allocation of segments: the maximum difference can be 4095 bytes, when an extra block is allocated for one byte on disk. So we adjust the margin of error to 4096 bytes to account for these differences.
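
An illustration of the rounding described in this commit message:

```python
# A segment's on-disk size is its data size rounded up to the filesystem
# block size, so disk usage can exceed the uploaded size by up to 4095 bytes.
BLOCK = 4096

def on_disk_size(data_bytes: int) -> int:
    return ((data_bytes + BLOCK - 1) // BLOCK) * BLOCK

assert on_disk_size(4097) - 4097 == 4095  # worst case: one byte in a new block
```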
@abhijat abhijat force-pushed the topic-recovery-test-fast2-adjust-assertion-delta-size branch 3 times, most recently from 2bc743e to 2d193b4 Compare July 28, 2022 05:11
@abhijat abhijat force-pushed the topic-recovery-test-fast2-adjust-assertion-delta-size branch 3 times, most recently from e1aa4cf to 7aec931 Compare July 28, 2022 07:15
@abhijat abhijat force-pushed the topic-recovery-test-fast2-adjust-assertion-delta-size branch from 7aec931 to 9f04de0 Compare July 28, 2022 08:00
@abhijat (Contributor Author) commented Jul 28, 2022

Errors are #5699 (new) and #5695.

@abhijat (Contributor Author) commented Jul 28, 2022

It was tricky to predict the difference between the baseline and S3 accurately; for now I have used the following (see the sketch after this list):

  1. Accept a difference of one segment per partition (this should not happen in most cases, since we upload every 10 seconds, but sometimes the diff is larger than expected).
  2. Wait until the difference between baseline and S3 is stable: we maintain a sequence of the last N diff sizes, and once they are all the same AND less than the tolerance level, we consider it stable.
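
A hedged sketch of the stability check in point 2 (function and parameter names are illustrative, not the actual test code):

```python
import time
from collections import deque

def wait_for_stable_diff(get_diff, tolerance, samples=5,
                         timeout_sec=300, poll_sec=10):
    # get_diff() returns the current baseline-vs-S3 size difference.
    recent = deque(maxlen=samples)
    deadline = time.time() + timeout_sec
    while time.time() < deadline:
        recent.append(get_diff())
        # stable: the last N diffs are identical AND within tolerance
        if len(recent) == samples and len(set(recent)) == 1 \
                and recent[0] <= tolerance:
            return recent[0]
        time.sleep(poll_sec)
    raise TimeoutError("baseline vs S3 diff never stabilized")
```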

@abhijat (Contributor Author) commented Jul 28, 2022

Tried several iterations locally and it looks good:

```
================================================================================
SESSION REPORT (ALL TESTS)
ducktape version: 0.8.8
session_id:       2022-07-28--010
run time:         174 minutes 16.514 seconds
tests run:        100
passed:           80
failed:           0
ignored:          0
opassed:          20
ofailed:          0
================================================================================
```

@abhijat abhijat requested a review from jcsp July 28, 2022 11:32
```python
non_data_batches_per_ntp = defaultdict(lambda: 0)
segments = [
    item for item in self._s3.list_objects(self._bucket)
    if not item.Key.endswith('manifest.json')
]
```
Contributor

After filtering out all manifests you will have log segments and also transaction manifests (*.tx extension). But here, since you don't use transactions, no .tx manifests will be created.
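
A sketch of the filter extended along these lines, excluding transaction manifests as well (hedged; the PR itself only filters manifest.json, since the test does not use transactions):

```python
segments = [
    item for item in self._s3.list_objects(self._bucket)
    # exclude both JSON manifests and transaction manifests (*.tx)
    if not (item.Key.endswith('manifest.json') or item.Key.endswith('.tx'))
]
```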

@abhijat abhijat merged commit 677155c into redpanda-data:dev Jul 28, 2022
@abhijat (Contributor Author) commented Jul 29, 2022

/backport v22.1.x

@vbotbuildovich (Collaborator)

Failed to run cherry-pick command. I executed the below command:

```
git cherry-pick -x 7fbda0e89c1a9ae0a31da5f6a64894b70574c84f 79e654b7253e480b61ba9c6b1b4bbd77c3096aab 07e87f8b7dab6d1d611425ed5561058064684723 3f5b81e3bb2f67f81ec1f253dc738bac41489b0a 4b52dd001c2da236575411f7dbd2b8ab838cd0ba 8e92bf49515cb99294447cdd2af349993cc9cc66 9f04de03954a20215fbf7833e2d9b426ba37efc4
```

Workflow run logs.

abhijat added a commit to abhijat/redpanda that referenced this pull request Jul 29, 2022
test_missing_segment was marked ok_to_fail; recent changes in PR redpanda-data#5372 should improve this test by frequently uploading segments to S3. The test failure had occurred because the sole segment in S3 was deleted, resulting in an empty restored topic.

However, looking at pandaresults reveals this test has been OPASS in the last 30 days with no failures. Unmarking this test; the issue will be closed in a few days if no further failures are seen.
@abhijat (Contributor Author) commented Jul 29, 2022

Need to backport a previous PR for the rpk producer before this one can be backported.

@mmedenjak mmedenjak added the area/cloud-storage Shadow indexing subsystem label Jul 29, 2022
@jcsp (Contributor) commented Jul 29, 2022

Let's resolve #5711 before backporting this

@abhijat (Contributor Author) commented Jul 29, 2022

> Let's resolve #5711 before backporting this

#5711 is due to an incorrect manifest being deleted; I created PR #5721 for it.

Labels: area/cloud-storage (Shadow indexing subsystem), area/tests, ci-failure, kind/bug (Something isn't working)

Successfully merging this pull request may close these issues.

Failure in topic_recovery_test.TopicRecoveryTest.test_fast3
6 participants