
Failure in TopicRecoveryTest.test_missing_segment #4849

Closed
dimitriscruz opened this issue May 20, 2022 · 9 comments
Labels
area/cloud-storage (Shadow indexing subsystem), area/tests, ci-failure, kind/bug (Something isn't working)

Comments

@dimitriscruz

Build: https://buildkite.com/redpanda/redpanda/builds/10338#3a55721a-05e6-4eeb-b891-d57d2b042e92

FAIL test: TopicRecoveryTest.test_missing_segment (1/48 runs)
  failure at 2022-05-20T13:02:16.384Z: AssertionError('High watermark has unexpected value 0, last offset: 1047')
      in job https://buildkite.com/redpanda/redpanda/builds/10338#3a55721a-05e6-4eeb-b891-d57d2b042e92

Error:

test_id:    rptest.tests.topic_recovery_test.TopicRecoveryTest.test_missing_segment
status:     FAIL
run time:   53.938 seconds

AssertionError('High watermark has unexpected value 0, last offset: 1047')
Traceback (most recent call last):
  File "/usr/local/lib/python3.9/dist-packages/ducktape/tests/runner_client.py", line 135, in run
    data = self.run_test()
  File "/usr/local/lib/python3.9/dist-packages/ducktape/tests/runner_client.py", line 227, in run_test
    return self.test_context.function(self.test)
  File "/root/tests/rptest/services/cluster.py", line 35, in wrapped
    r = f(self, *args, **kwargs)
  File "/root/tests/rptest/tests/topic_recovery_test.py", line 1228, in test_missing_segment
    self.do_run(test_case)
  File "/root/tests/rptest/tests/topic_recovery_test.py", line 1179, in do_run
    test_case.validate_cluster(baseline, restored)
  File "/root/tests/rptest/tests/topic_recovery_test.py", line 615, in validate_cluster
    self._validate_partition_last_offset()
  File "/root/tests/rptest/tests/topic_recovery_test.py", line 308, in _validate_partition_last_offset
    assert hw <= last_offset and hw >= last_offset - num_segments, \
AssertionError: High watermark has unexpected value 0, last offset: 1047
andrwng commented Jun 10, 2022

Addressed in #4858

andrwng closed this as completed Jun 10, 2022
jcsp commented Jul 5, 2022

@andrwng it looks like this test is still marked @ok_to_fail -- can you open a PR to reenable it if it was fixed in #4858?

(I see no failures in last 30 days so I can believe it's fixed, although not obvious to me how fixing abandoned futures errors addressed a high watermark error)

jcsp reopened this Jul 5, 2022
andrwng commented Jul 5, 2022

> @andrwng it looks like this test is still marked @ok_to_fail -- can you open a PR to reenable it if it was fixed in #4858?
>
> (I see no failures in last 30 days so I can believe it's fixed, although not obvious to me how fixing abandoned futures errors addressed a high watermark error)

Ooh, yikes. Thanks for calling this out. I misread the linked issue. Will look into this shortly.

piyushredpanda assigned abhijat and unassigned andrwng Jul 19, 2022
mmedenjak added the area/cloud-storage (Shadow indexing subsystem) label Jul 21, 2022
abhijat commented Jul 27, 2022

Looking at the logs, it seems the segment was missing during the restore process:

WARN  2022-05-20 12:50:06,531 [shard 1] s3 - client.cc:524 - S3 replied with error: HTTP/1.1 404 Not Found
Accept-Ranges: bytes
Content-Length: 444
Content-Security-Policy: block-all-mixed-content
Content-Type: application/xml
Server: MinIO
Vary: Origin
X-Amz-Bucket-Region: panda-region
X-Amz-Request-Id: 16F0D121D7E96E94
X-Xss-Protection: 1; mode=block
Date: Fri, 20 May 2022 12:50:06 GMT


INFO  2022-05-20 12:50:06,531 [shard 1] cloud_storage - [fiber0~1~3|1|300000ms] - remote.cc:73 - NoSuchKey response received {"55b3ccae/kafka/panda-topic/0_18/0-1-v1.log.1"}
WARN  2022-05-20 12:50:06,531 [shard 1] cloud_storage - [fiber0~1~3|1|300000ms] - remote.cc:467 - Downloading segment from {panda-bucket-460b9fc6-d83b-11ec-b71b-0242ac120033}, {key_not_found}, segment at {"55b3ccae/kafka/panda-topic/0_18/0-1-v1.log.1"} not available
ERROR 2022-05-20 12:50:06,531 [shard 1] cloud_storage - [fiber0~1|0|300000ms [kafka/panda-topic/0, rev: 24]] - partition_recovery_manager.cc:643 - Failed segment download for {"55b3ccae/kafka/panda-topic/0_18/0-1-v1.log.1"}
INFO  2022-05-20 12:50:06,531 [shard 1] cloud_storage - [fiber0~1|0|300000ms [kafka/panda-topic/0, rev: 24]] - partition_recovery_manager.cc:658 - Renaming directory "/var/lib/redpanda/data/kafka/panda-topic/0_24_part" to "/var/lib/redpanda/data/kafka/panda-topic/0_24"

EDIT: This is in line with the actual test case, so it is expected behaviour.

abhijat commented Jul 28, 2022

This should improve with the changes in #5372, which force regular segment uploads to S3. That reduces the chance of having just one segment in S3 that is then deleted during the test, which is what caused this failure.

What happened in this test failure was:

  1. only one segment was uploaded to S3
  2. that segment was deleted during the test
  3. the restored topic therefore had no data
  4. the assertion expected the topic to have data, because even with the first segment deleted, other segments with higher offsets should still have been present in S3 (see the sketch below)
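
For reference, here is a minimal sketch of the check that tripped, mirroring the inequality from _validate_partition_last_offset in the traceback above; the num_segments value used below is illustrative, not taken from the failing run.

```python
# Illustrative sketch of the last-offset validation that failed (the real
# assertion lives in rptest/tests/topic_recovery_test.py,
# _validate_partition_last_offset); num_segments here is a hypothetical value.
def hw_in_expected_range(hw: int, last_offset: int, num_segments: int) -> bool:
    # Mirrors the inequality from the traceback: the restored high watermark
    # must not exceed the baseline last offset and may trail it by at most
    # num_segments.
    return last_offset - num_segments <= hw <= last_offset

# Failing case from this report: the only uploaded segment was deleted, so the
# restored partition came back empty (hw == 0) while the baseline last offset
# was 1047.
print(hw_in_expected_range(hw=0, last_offset=1047, num_segments=1))  # -> False
```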

abhijat commented Jul 28, 2022

With #5372 merged, I am going to unmark this test as ok-to-fail and observe it for a while.

jcsp commented Jul 28, 2022

@abhijat ok_to_fail tests still report their status in test results, as OPASS or OFAIL, and you can see these in pandaresults. So it's not necessary to remove the ok_to_fail to check that.

abhijat commented Jul 29, 2022

> @abhijat ok_to_fail tests still report their status in test results, as OPASS or OFAIL, and you can see these in pandaresults. So it's not necessary to remove the ok_to_fail to check that.

Thanks for the tip.

(venv) abhijat@terminus-est ~/dev/pandaresults/results
 $ grep test_missing_segment summary.csv | cut -d, -f2 | wc -l
887
(venv) abhijat@terminus-est ~/dev/pandaresults/results
 $ grep test_missing_segment summary.csv | cut -d, -f2 | sort -u
opass

These are the last 30 days' worth of runs. I think I can go ahead and unmark the test and close this issue after waiting a couple of days, to confirm the recent changes have not negatively affected the test outcome.

jcsp commented Jul 29, 2022

Let's go ahead and close this, as the test is reinstated at https://github.com/redpanda-data/redpanda/pull/5717/files -- if failures happen we can always reopen.

jcsp closed this as completed Jul 29, 2022
This issue was closed.