
Failure in TopicRecoveryTest.test_missing_segment #4849

Closed
dimitriscruz opened this issue May 20, 2022 · 9 comments
Labels
area/cloud-storage (Shadow indexing subsystem), area/tests, ci-failure, kind/bug (Something isn't working)

Comments

@dimitriscruz

Build: https://buildkite.com/redpanda/redpanda/builds/10338#3a55721a-05e6-4eeb-b891-d57d2b042e92

FAIL test: TopicRecoveryTest.test_missing_segment (1/48 runs)
  failure at 2022-05-20T13:02:16.384Z: AssertionError('High watermark has unexpected value 0, last offset: 1047')
      in job https://buildkite.com/redpanda/redpanda/builds/10338#3a55721a-05e6-4eeb-b891-d57d2b042e92

Error:

test_id:    rptest.tests.topic_recovery_test.TopicRecoveryTest.test_missing_segment
status:     FAIL
run time:   53.938 seconds

AssertionError('High watermark has unexpected value 0, last offset: 1047')
Traceback (most recent call last):
  File "/usr/local/lib/python3.9/dist-packages/ducktape/tests/runner_client.py", line 135, in run
    data = self.run_test()
  File "/usr/local/lib/python3.9/dist-packages/ducktape/tests/runner_client.py", line 227, in run_test
    return self.test_context.function(self.test)
  File "/root/tests/rptest/services/cluster.py", line 35, in wrapped
    r = f(self, *args, **kwargs)
  File "/root/tests/rptest/tests/topic_recovery_test.py", line 1228, in test_missing_segment
    self.do_run(test_case)
  File "/root/tests/rptest/tests/topic_recovery_test.py", line 1179, in do_run
    test_case.validate_cluster(baseline, restored)
  File "/root/tests/rptest/tests/topic_recovery_test.py", line 615, in validate_cluster
    self._validate_partition_last_offset()
  File "/root/tests/rptest/tests/topic_recovery_test.py", line 308, in _validate_partition_last_offset
    assert hw <= last_offset and hw >= last_offset - num_segments, \
AssertionError: High watermark has unexpected value 0, last offset: 1047
andrwng commented Jun 10, 2022

Addressed in #4858

andrwng closed this as completed Jun 10, 2022
jcsp commented Jul 5, 2022

@andrwng it looks like this test is still marked @ok_to_fail -- can you open a PR to reenable it if it was fixed in #4858?

(I see no failures in last 30 days so I can believe it's fixed, although not obvious to me how fixing abandoned futures errors addressed a high watermark error)

jcsp reopened this Jul 5, 2022
andrwng commented Jul 5, 2022

> @andrwng it looks like this test is still marked @ok_to_fail -- can you open a PR to reenable it if it was fixed in #4858?
>
> (I see no failures in last 30 days so I can believe it's fixed, although not obvious to me how fixing abandoned futures errors addressed a high watermark error)

Ooh, yikes. Thanks for calling this out. I misread the linked issue. Will look into this shortly.

piyushredpanda assigned abhijat and unassigned andrwng Jul 19, 2022
mmedenjak added the area/cloud-storage (Shadow indexing subsystem) label Jul 21, 2022
abhijat commented Jul 27, 2022

Looking at the logs, it seems the segment was missing during the restore process:

WARN  2022-05-20 12:50:06,531 [shard 1] s3 - client.cc:524 - S3 replied with error: HTTP/1.1 404 Not Found
Accept-Ranges: bytes
Content-Length: 444
Content-Security-Policy: block-all-mixed-content
Content-Type: application/xml
Server: MinIO
Vary: Origin
X-Amz-Bucket-Region: panda-region
X-Amz-Request-Id: 16F0D121D7E96E94
X-Xss-Protection: 1; mode=block
Date: Fri, 20 May 2022 12:50:06 GMT


INFO  2022-05-20 12:50:06,531 [shard 1] cloud_storage - [fiber0~1~3|1|300000ms] - remote.cc:73 - NoSuchKey response received {"55b3ccae/kafka/panda-topic/0_18/0-1-v1.log.1"}
WARN  2022-05-20 12:50:06,531 [shard 1] cloud_storage - [fiber0~1~3|1|300000ms] - remote.cc:467 - Downloading segment from {panda-bucket-460b9fc6-d83b-11ec-b71b-0242ac120033}, {key_not_found}, segment at {"55b3ccae/kafka/panda-topic/0_18/0-1-v1.log.1"} not available
ERROR 2022-05-20 12:50:06,531 [shard 1] cloud_storage - [fiber0~1|0|300000ms [kafka/panda-topic/0, rev: 24]] - partition_recovery_manager.cc:643 - Failed segment download for {"55b3ccae/kafka/panda-topic/0_18/0-1-v1.log.1"}
INFO  2022-05-20 12:50:06,531 [shard 1] cloud_storage - [fiber0~1|0|300000ms [kafka/panda-topic/0, rev: 24]] - partition_recovery_manager.cc:658 - Renaming directory "/var/lib/redpanda/data/kafka/panda-topic/0_24_part" to "/var/lib/redpanda/data/kafka/panda-topic/0_24"

EDIT: This is in line with the actual test case, so it is expected behaviour.

abhijat commented Jul 28, 2022

This should improve with the changes in #5372, which force regular segment uploads to S3. That reduces the chance of having just one segment in S3 that is then deleted during the test, which is what caused this failure.

What happened in this test failure was:

  1. only one segment was uploaded to S3
  2. that segment was deleted during the test
  3. the restored topic therefore had no data
  4. the assertion expected the topic to have data, because even with the first segment deleted, other segments with higher offsets should still have been present in S3 (see the sketch below)
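
For reference, here is a minimal sketch of the check that tripped, mirroring the inequality from _validate_partition_last_offset in the traceback above; the num_segments value used below is illustrative, not taken from the failing run.

```python
# Illustrative sketch of the last-offset validation that failed (the real
# assertion lives in rptest/tests/topic_recovery_test.py,
# _validate_partition_last_offset); num_segments here is a hypothetical value.
def hw_in_expected_range(hw: int, last_offset: int, num_segments: int) -> bool:
    # Mirrors the inequality from the traceback: the restored high watermark
    # must not exceed the baseline last offset and may trail it by at most
    # num_segments.
    return last_offset - num_segments <= hw <= last_offset

# Failing case from this report: the only uploaded segment was deleted, so the
# restored partition came back empty (hw == 0) while the baseline last offset
# was 1047.
print(hw_in_expected_range(hw=0, last_offset=1047, num_segments=1))  # -> False
```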

abhijat commented Jul 28, 2022

With #5372 merged, I am going to unmark this test as ok-to-fail and observe it for a while.

jcsp commented Jul 28, 2022

@abhijat ok_to_fail tests still report their status in test results, as OPASS or OFAIL, and you can see these in pandaresults. So it's not necessary to remove the ok_to_fail to check that.

abhijat commented Jul 29, 2022

> @abhijat ok_to_fail tests still report their status in test results, as OPASS or OFAIL, and you can see these in pandaresults. So it's not necessary to remove the ok_to_fail to check that.

Thanks for the tip.

(venv) abhijat@terminus-est ~/dev/pandaresults/results
 $ grep test_missing_segment summary.csv | cut -d, -f2 | wc -l
887
(venv) abhijat@terminus-est ~/dev/pandaresults/results
 $ grep test_missing_segment summary.csv | cut -d, -f2 | sort -u
opass

These are the last 30 days' worth of runs. I think I can go ahead and unmark the test and close this issue after waiting a couple of days, to confirm the recent changes have not negatively affected the test outcome.

jcsp commented Jul 29, 2022

Let's go ahead and close this, as the test is reinstated at https://github.com/redpanda-data/redpanda/pull/5717/files -- if failures happen we can always reopen.

jcsp closed this as completed Jul 29, 2022
This issue was closed.