Failure in TopicRecoveryTest.test_size_based_retention #4887

Closed
ZeDRoman opened this issue May 23, 2022 · 18 comments · Fixed by #6797
@ZeDRoman
Contributor

Build: https://buildkite.com/redpanda/redpanda/builds/10396#677124b6-8fb4-418b-bd49-d89e63578bd7

FAIL test: TopicRecoveryTest.test_size_based_retention (1/19 runs)
  failure at 2022-05-23T07:38:51.539Z: AssertionError('Too much or not enough data restored, expected 10485760 got 10209301')
      in job https://buildkite.com/redpanda/redpanda/builds/10396#677124b6-8fb4-418b-bd49-d89e63578bd7

Error:

test_id:    rptest.tests.topic_recovery_test.TopicRecoveryTest.test_size_based_retention
status:     FAIL
run time:   51.011 seconds

AssertionError('Too much or not enough data restored, expected 10485760 got 10209301')
Traceback (most recent call last):
  File "/usr/local/lib/python3.9/dist-packages/ducktape/tests/runner_client.py", line 135, in run
    data = self.run_test()
  File "/usr/local/lib/python3.9/dist-packages/ducktape/tests/runner_client.py", line 227, in run_test
    return self.test_context.function(self.test)
  File "/root/tests/rptest/services/cluster.py", line 35, in wrapped
    r = f(self, *args, **kwargs)
  File "/root/tests/rptest/tests/topic_recovery_test.py", line 1293, in test_size_based_retention
    self.do_run(test_case)
  File "/root/tests/rptest/tests/topic_recovery_test.py", line 1180, in do_run
    test_case.validate_cluster(baseline, restored)
  File "/root/tests/rptest/tests/topic_recovery_test.py", line 776, in validate_cluster
    assert is_close_size(size_bytes, self.restored_size_bytes), \
AssertionError: Too much or not enough data restored, expected 10485760 got 10209301


@ZeDRoman ZeDRoman added kind/bug Something isn't working ci-failure labels May 23, 2022
@ZeDRoman
Contributor Author

@twmb
Contributor

twmb commented May 25, 2022

@ZeDRoman
Contributor Author

@jcsp
Contributor

jcsp commented May 27, 2022

6/97 runs failed in last 72h -- this one is quite frequent.

@VadimPlh
Contributor

@andrewhsu
Member

@VadimPlh
Contributor

VadimPlh commented Jun 1, 2022

@NyaliaLui
Contributor

@VadimPlh
Contributor

VadimPlh commented Jun 2, 2022

@NyaliaLui
Contributor

@ztlpn
Contributor

ztlpn commented Jun 6, 2022

@ztlpn
Contributor

ztlpn commented Jun 7, 2022

@ajfabbri
Contributor

ajfabbri commented Jun 7, 2022

@BenPope
Member

BenPope commented Jun 15, 2022

@jcsp
Contributor

jcsp commented Jul 4, 2022

4/738 failures in last 30 days.

Most recent failure on dev https://buildkite.com/redpanda/redpanda/builds/11002#01813cd4-e0b3-4e92-ac3e-681fe2d6e08b

@BenPope
Member

BenPope commented Oct 11, 2022

@piyushredpanda piyushredpanda assigned ZeDRoman and unassigned andijcr Oct 11, 2022
@piyushredpanda
Contributor

@ZeDRoman is helping pick this up. Thanks, Roman.

@ZeDRoman
Contributor Author

ZeDRoman commented Oct 14, 2022

Reason for failure:

Shadow Indexing (SI) recovery restores at least retention.bytes of data: it downloads segments until the sum of their sizes is greater than or equal to the retention.bytes property (partition_recovery_manager.cc, download_log_with_capped_size).

Disk log GC starts deleting segments once their total size exceeds retention.bytes, so after GC the total size is less than or equal to retention.bytes (disk_log_impl.cc, size_based_gc_max_offset).

When the two work together, the result is: SI downloads segments totaling more than retention.bytes, and disk log GC then removes one segment because the total size exceeds retention.bytes.

This is what happened in TopicRecoveryTest.test_size_based_retention: SI downloaded the segments, disk log GC automatically deleted a segment, and the test then checked that SI had restored at least retention.bytes and failed (because the segment had already been deleted).
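
To make the interaction concrete, here is a minimal Python sketch of the two policies as described above. It is a simplified model, not the actual Redpanda code: the function names, the newest-first list of segment sizes, and the 1,000,000-byte segment size are all assumptions made for illustration. Only the 10485760-byte limit comes from the failing test output.

```python
# Simplified model of the two policies; NOT the real Redpanda implementation.
# Segment sizes are listed newest first; all sizes here are hypothetical.

def download_with_capped_size(segment_sizes, retention_bytes):
    """Model of SI recovery: take segments until the total is >= the limit."""
    picked, total = [], 0
    for size in segment_sizes:
        if total >= retention_bytes:
            break
        picked.append(size)
        total += size
    return picked


def size_based_gc(segment_sizes, retention_bytes):
    """Model of disk log GC: drop oldest segments while total > the limit."""
    kept = list(segment_sizes)
    while kept and sum(kept) > retention_bytes:
        kept.pop()  # the oldest segment is at the end of the list
    return kept


retention_bytes = 10485760          # value from the failing test output
available = [1_000_000] * 20        # hypothetical uploaded segments

restored = download_with_capped_size(available, retention_bytes)
after_gc = size_based_gc(restored, retention_bytes)

print(sum(restored), sum(after_gc))
```

In this model recovery overshoots the limit by part of one segment, and GC then removes a whole segment, so the final size ends up below retention.bytes, which is the shape of the failure reported by the test.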

Solution:
Evgeny Lazin proposed adjusting this behavior so that recovery downloads strictly less than retention.bytes.
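
A rough sketch of what the proposed rule could look like, under the same simplified model (hypothetical function names and segment sizes, not the actual change that was merged): recovery stops before adding a segment that would bring the total up to the limit, so the disk log GC rule has nothing to remove.

```python
# Simplified model of the proposed behavior; NOT the real Redpanda code.
# Segment sizes are listed newest first; all sizes here are hypothetical.

def download_strictly_below(segment_sizes, retention_bytes):
    """Take segments only while the total stays strictly below the limit."""
    picked, total = [], 0
    for size in segment_sizes:
        if total + size >= retention_bytes:
            break
        picked.append(size)
        total += size
    return picked


def size_based_gc(segment_sizes, retention_bytes):
    """Model of disk log GC: drop oldest segments while total > the limit."""
    kept = list(segment_sizes)
    while kept and sum(kept) > retention_bytes:
        kept.pop()  # the oldest segment is at the end of the list
    return kept


retention_bytes = 10485760          # value from the failing test output
available = [1_000_000] * 20        # hypothetical uploaded segments

restored = download_strictly_below(available, retention_bytes)
after_gc = size_based_gc(restored, retention_bytes)

print(sum(restored), sum(after_gc))
```

With this rule the restored size and the post-GC size are the same in the model, so a check made after GC sees exactly what recovery downloaded.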
