
cloud_storage/tests: adjust delta for restored partitions #5372

Conversation

@abhijat (Contributor) commented Jul 7, 2022

When asserting that the size of a restored partition is close to that of the original partition in topic recovery tests, we use the default segment size with a 1.5 multiplier to tolerate cases where the last segment of the NTP is not uploaded to tiered storage because it is still open.

Sometimes that segment is larger than 1.5x the default size, causing an assertion failure in the test. This change stores the last segment size for each partition and uses it as the tolerance size in assertions.

fixes #4972
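
For illustration, a minimal sketch of the assertion scheme described above (the names and the default segment size are stand-ins, not the actual test code):

```python
# Minimal sketch of the tolerance check described in this PR; the names and
# the default segment size here are illustrative stand-ins.
DEFAULT_LOG_SEGMENT_SIZE = 1048576  # assumed value, for illustration only

def check_restored_size(orig_ntp_size, rest_ntp_size, last_segment_size):
    # the restored partition should never be larger than the original
    assert rest_ntp_size <= orig_ntp_size
    delta = orig_ntp_size - rest_ntp_size
    # old scheme: a fixed 1.5x multiplier on the default segment size
    tolerance = int(1.5 * DEFAULT_LOG_SEGMENT_SIZE)
    # this change: also accept a delta up to the recorded last segment size
    tolerance = max(tolerance, last_segment_size)
    assert delta <= tolerance, f"delta {delta} exceeds tolerance {tolerance}"
```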

@abhijat abhijat force-pushed the topic-recovery-test-fast2-adjust-assertion-delta-size branch from 54d5da9 to 686ac05 Compare July 7, 2022 08:28
@abhijat abhijat marked this pull request as ready for review July 7, 2022 08:58
@abhijat abhijat added the ci-repeat-5 repeat tests 5x concurrently to check for flakey tests; self-cancelling label Jul 7, 2022
@vbotbuildovich vbotbuildovich removed the ci-repeat-5 repeat tests 5x concurrently to check for flakey tests; self-cancelling label Jul 7, 2022
@abhijat abhijat requested review from Lazin and LenaAn July 7, 2022 09:59
@piyushredpanda piyushredpanda requested a review from jcsp July 8, 2022 03:34
```
@@ -164,10 +186,10 @@ def get_ntp_sizes(fdata_per_host, hosts_can_vary=True):
        rest_ntp_size = restored_ntps[ntp]
        assert rest_ntp_size <= orig_ntp_size, f"NTP {ntp} the restored partition is larger {rest_ntp_size} than the original one {orig_ntp_size}."
        delta = orig_ntp_size - rest_ntp_size
        tolerance = int(1.5 * default_log_segment_size)
        tolerance = max(tolerance, baseline_last_segment_sizes.get(ntp, 0))
```
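
For context, a hedged sketch of how the per-NTP last segment sizes used above might be collected (the input shape and names are assumptions, not the real test code):

```python
def record_last_segment_sizes(baseline_segments):
    # baseline_segments: dict mapping NTP -> list of (base_offset, size_bytes)
    # tuples; this input shape is assumed for illustration.
    sizes = {}
    for ntp, segs in baseline_segments.items():
        # the segment with the highest base offset may still be open and
        # therefore not yet uploaded to tiered storage
        _base_offset, size_bytes = max(segs)
        sizes[ntp] = size_bytes
    return sizes
```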
Contributor

Now that we have the exact last segment size, do we still need the 1.5 tolerance?

Maybe there is still some tolerance needed (if the restored system has e.g. created an extra empty segment), but probably not the full 1.5?

Contributor Author

We probably don't need the default any more because we filter out the empty segments when getting ntp sizes. I had added it as a fallback but I think we should be okay removing it completely.

Contributor

yep, that should be OK

Contributor Author

It looks like this test is consistently failing in CI when we rely on the last segment size; sometimes the 1.5x default segment size is greater than the last segment. Will look through the logs.

@jcsp (Contributor) commented Jul 8, 2022

I'm guessing the reason we're overshooting so much + nondeterministically is that the batch size is large relative to the segment size, and we're getting different batch sizes depending on client timing -- is that right?

@abhijat abhijat force-pushed the topic-recovery-test-fast2-adjust-assertion-delta-size branch from 686ac05 to 7fbda0e Compare July 8, 2022 09:45
@abhijat abhijat added the ci-repeat-5 repeat tests 5x concurrently to check for flakey tests; self-cancelling label Jul 8, 2022
@vbotbuildovich vbotbuildovich removed the ci-repeat-5 repeat tests 5x concurrently to check for flakey tests; self-cancelling label Jul 8, 2022
@abhijat abhijat added the ci-repeat-5 repeat tests 5x concurrently to check for flakey tests; self-cancelling label Jul 8, 2022
@vbotbuildovich vbotbuildovich removed the ci-repeat-5 repeat tests 5x concurrently to check for flakey tests; self-cancelling label Jul 8, 2022
@abhijat (Contributor Author) commented Jul 8, 2022

> I'm guessing the reason we're overshooting so much + nondeterministically is that the batch size is large relative to the segment size, and we're getting different batch sizes depending on client timing -- is that right?

This issue became much more frequent after switching to the rpk producer, so it might be related to the client. I have not yet been able to look into the exact reason why we generate segments much larger than the default size, but what you mentioned sounds plausible; I will look into why this could be happening.

@mmedenjak mmedenjak added kind/bug Something isn't working area/tests ci-failure labels Jul 8, 2022
Lazin previously approved these changes Jul 8, 2022
@jcsp (Contributor) commented Jul 11, 2022

@abhijat the latest push is failing test_fast1 + test_fast3

@abhijat (Contributor Author) commented Jul 11, 2022

Looking at one instance of failure:

```
AssertionError: NTP NTP(ns='kafka', topic='panda-topic-1', partition=1) the restored partition is too small 5260482. The original is 5260941 bytes which 459 bytes larger.
```

Original segment sizes for this NTP were:
1947153 + 1946694 + 1367094 + 455 = 5261396

We skip the small segments, so effectively: 1947153 + 1946694 + 1367094 = 5260941

Restored sizes are:
1946694 + 1946694 + 1367094 + 4096 = 5264578

Again, skipping the small segments: 1946694 + 1946694 + 1367094 = 5260482

diff = 459 bytes. This is mostly due to the very first segment: the original was 1947153 bytes and the restored is 1946694 bytes.

Looking at the segment download log during recovery for this segment:

```
DEBUG 2022-07-08 11:17:54,576 [shard 1] cloud_storage - [fiber0~0|0|299981ms [kafka/panda-topic-1/0, rev: 20]] - partition_recovery_manager.cc:633 - Log segment downloaded. 1947153 bytes expected, 1946694 bytes after pre-processing.
...
DEBUG 2022-07-08 11:17:54,563 [shard 1] cloud_storage - [fiber0~0|0|299994ms] - offset_translation_layer.cc:42 - skipping batch {header_crc:4147633031, size_bytes:459, base_offset:{0}, type:batch_type::raft_configuration, crc:-839928373, attrs:{compression:none, type:CreateTime, transactional: 0, control: 0}, last_offset_delta:0, first_timestamp:{timestamp: 1657279044052}, max_timestamp:{timestamp: 1657279044052}, producer_id:-1, producer_epoch:-1, base_sequence:-1, record_count:1, ctx:{term:{-9223372036854775808}, owner_shard:{1}}}
```

Offset translation skips a 459-byte batch, which accounts for the difference in sizes.

But this is expected, since we remove the config batches from the restored segment. So using the exact last segment size does not cover all cases; we also need to account for the config batches being removed.

cc @Lazin

Notably, recording the last segment size did not help in this case, as it was too small to be accounted for.
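
A quick script to re-check the arithmetic above (sizes copied from the logs):

```python
# Segment sizes from the failing run quoted above.
original = [1947153, 1946694, 1367094, 455]
restored = [1946694, 1946694, 1367094, 4096]
# The test filters out tiny segments; judging by the sums quoted above, the
# 4096-byte restored segment is dropped too, so filter on size > 4096.
orig_total = sum(s for s in original if s > 4096)  # 5260941
rest_total = sum(s for s in restored if s > 4096)  # 5260482
assert orig_total - rest_total == 459  # the skipped raft_configuration batch
```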

@abhijat abhijat added the ci-repeat-10 repeat tests 10x concurrently to check for flakey tests; self-cancelling label Jul 21, 2022
@vbotbuildovich vbotbuildovich removed the ci-repeat-10 repeat tests 10x concurrently to check for flakey tests; self-cancelling label Jul 21, 2022
@abhijat abhijat added the ci-repeat-10 repeat tests 10x concurrently to check for flakey tests; self-cancelling label Jul 21, 2022
@vbotbuildovich vbotbuildovich removed the ci-repeat-10 repeat tests 10x concurrently to check for flakey tests; self-cancelling label Jul 21, 2022
@abhijat (Contributor Author) commented Jul 21, 2022

Trying a new approach here: count the delta offset in the partition manifest and use that to derive tolerable error limits.
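
A heavily hedged sketch of what such a check could look like; the manifest field names below are assumptions for illustration, not the actual manifest schema:

```python
def count_non_data_batches(manifest: dict) -> int:
    # Hypothetical: estimate the number of non-data (e.g. raft_configuration)
    # batches for a partition from delta offsets recorded in its manifest.
    # The 'delta_offset'/'delta_offset_end' field names are assumed.
    total = 0
    for _name, meta in manifest.get('segments', {}).items():
        total += meta.get('delta_offset_end', 0) - meta.get('delta_offset', 0)
    return total
```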

@abhijat abhijat force-pushed the topic-recovery-test-fast2-adjust-assertion-delta-size branch from 6bda7f3 to 4b52dd0 Compare July 26, 2022 11:51
@abhijat abhijat added the ci-repeat-10 repeat tests 10x concurrently to check for flakey tests; self-cancelling label Jul 26, 2022
@vbotbuildovich vbotbuildovich removed the ci-repeat-10 repeat tests 10x concurrently to check for flakey tests; self-cancelling label Jul 26, 2022
@abhijat abhijat force-pushed the topic-recovery-test-fast2-adjust-assertion-delta-size branch 2 times, most recently from 3523993 to fd40c3a Compare July 28, 2022 03:47
Files on disk can be larger than the actual data uploaded to cloud storage. This appears to be due to block allocation of segments: the maximum difference can be 4095 bytes, when an extra block is allocated for one byte on disk. So we adjust the margin of error to 4096 bytes to account for these differences.
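
An illustration of the rounding described in this commit message:

```python
# A segment's on-disk size is its data size rounded up to the filesystem
# block size, so disk usage can exceed the uploaded size by up to 4095 bytes.
BLOCK = 4096

def on_disk_size(data_bytes: int) -> int:
    return ((data_bytes + BLOCK - 1) // BLOCK) * BLOCK

assert on_disk_size(4097) - 4097 == 4095  # worst case: one byte in a new block
```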
@abhijat abhijat force-pushed the topic-recovery-test-fast2-adjust-assertion-delta-size branch 3 times, most recently from 2bc743e to 2d193b4 Compare July 28, 2022 05:11
@abhijat abhijat force-pushed the topic-recovery-test-fast2-adjust-assertion-delta-size branch 3 times, most recently from e1aa4cf to 7aec931 Compare July 28, 2022 07:15
@abhijat abhijat force-pushed the topic-recovery-test-fast2-adjust-assertion-delta-size branch from 7aec931 to 9f04de0 Compare July 28, 2022 08:00
@abhijat (Contributor Author) commented Jul 28, 2022

Errors are #5699 (new) and #5695.

@abhijat (Contributor Author) commented Jul 28, 2022

It was tricky to predict the difference between the baseline and S3 accurately; for now I have used the following (see the sketch after this list):

  1. Accept a difference of one segment per partition (this should not happen in most cases, since we upload every 10 seconds, but sometimes the diff is larger than expected).
  2. Wait until the difference between baseline and S3 is stable: we maintain a sequence of the last N diff sizes, and once they are all the same AND less than the tolerance level, we consider it stable.
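
A hedged sketch of the stability check in point 2 (function and parameter names are illustrative, not the actual test code):

```python
import time
from collections import deque

def wait_for_stable_diff(get_diff, tolerance, samples=5,
                         timeout_sec=300, poll_sec=10):
    # get_diff() returns the current baseline-vs-S3 size difference.
    recent = deque(maxlen=samples)
    deadline = time.time() + timeout_sec
    while time.time() < deadline:
        recent.append(get_diff())
        # stable: the last N diffs are identical AND within tolerance
        if len(recent) == samples and len(set(recent)) == 1 \
                and recent[0] <= tolerance:
            return recent[0]
        time.sleep(poll_sec)
    raise TimeoutError("baseline vs S3 diff never stabilized")
```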

@abhijat (Contributor Author) commented Jul 28, 2022

Tried several iterations locally and it looks good:

```
================================================================================
SESSION REPORT (ALL TESTS)
ducktape version: 0.8.8
session_id:       2022-07-28--010
run time:         174 minutes 16.514 seconds
tests run:        100
passed:           80
failed:           0
ignored:          0
opassed:          20
ofailed:          0
================================================================================
```

@abhijat abhijat requested a review from jcsp July 28, 2022 11:32
```python
non_data_batches_per_ntp = defaultdict(lambda: 0)
segments = [
    item for item in self._s3.list_objects(self._bucket)
    if not item.Key.endswith('manifest.json')
]
```
Contributor

After filtering out all manifests you will have log segments and also transaction manifests (*.tx extension). But here, since you don't use transactions, no .tx manifests will be created.
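
A sketch of the filter extended along these lines, excluding transaction manifests as well (hedged; the PR itself only filters manifest.json, since the test does not use transactions):

```python
segments = [
    item for item in self._s3.list_objects(self._bucket)
    # exclude both JSON manifests and transaction manifests (*.tx)
    if not (item.Key.endswith('manifest.json') or item.Key.endswith('.tx'))
]
```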

@abhijat abhijat merged commit 677155c into redpanda-data:dev Jul 28, 2022
@abhijat (Contributor Author) commented Jul 29, 2022

/backport v22.1.x

@vbotbuildovich (Collaborator)

Failed to run cherry-pick command. I executed the below command:

```
git cherry-pick -x 7fbda0e89c1a9ae0a31da5f6a64894b70574c84f 79e654b7253e480b61ba9c6b1b4bbd77c3096aab 07e87f8b7dab6d1d611425ed5561058064684723 3f5b81e3bb2f67f81ec1f253dc738bac41489b0a 4b52dd001c2da236575411f7dbd2b8ab838cd0ba 8e92bf49515cb99294447cdd2af349993cc9cc66 9f04de03954a20215fbf7833e2d9b426ba37efc4
```

Workflow run logs.

abhijat added a commit to abhijat/redpanda that referenced this pull request Jul 29, 2022
test_missing_segment was marked ok_to_fail; recent changes in PR redpanda-data#5372 should improve this test by frequently uploading segments to S3. The test failure had occurred because the sole segment in S3 was deleted, resulting in an empty restored topic.

However, looking at pandaresults reveals this test has been OPASS in the last 30 days with no failures. Unmarking this test; the issue will be closed in a few days if no further failures are seen.
@abhijat (Contributor Author) commented Jul 29, 2022

Need to backport a previous PR for the rpk producer before this one can be backported.

@mmedenjak mmedenjak added the area/cloud-storage Shadow indexing subsystem label Jul 29, 2022
@jcsp (Contributor) commented Jul 29, 2022

Let's resolve #5711 before backporting this

@abhijat (Contributor Author) commented Jul 29, 2022

> Let's resolve #5711 before backporting this

#5711 is due to an incorrect manifest being deleted; I created PR #5721 for it.

Labels: area/cloud-storage (Shadow indexing subsystem), area/tests, ci-failure, kind/bug (Something isn't working)

Successfully merging this pull request may close these issues.

Failure in topic_recovery_test.TopicRecoveryTest.test_fast3
6 participants