Timeout failure in EndToEndShadowIndexingTestWithDisruptions.test_write_with_node_failures #5390

Closed
NyaliaLui opened this issue Jul 7, 2022 · 8 comments · Fixed by #5812
Labels: area/cloud-storage, area/tests, ci-failure, kind/bug

@NyaliaLui
Contributor

NyaliaLui commented Jul 7, 2022

https://buildkite.com/redpanda/redpanda/builds/12243#0181d907-c5e0-4ccd-ad4e-87f624f9ae5d/1595-8717

test_id:    rptest.tests.e2e_shadow_indexing_test.EndToEndShadowIndexingTestWithDisruptions.test_write_with_node_failures
status:     FAIL
run time:   3 minutes 41.829 seconds
 
    TimeoutError('Segments were not removed')
Traceback (most recent call last):
  File "/usr/local/lib/python3.9/dist-packages/ducktape/tests/runner_client.py", line 135, in run
    data = self.run_test()
  File "/usr/local/lib/python3.9/dist-packages/ducktape/tests/runner_client.py", line 227, in run_test
    return self.test_context.function(self.test)
  File "/root/tests/rptest/services/cluster.py", line 35, in wrapped
    r = f(self, *args, **kwargs)
  File "/root/tests/rptest/tests/e2e_shadow_indexing_test.py", line 132, in test_write_with_node_failures
    wait_for_segments_removal(redpanda=self.redpanda,
  File "/root/tests/rptest/util.py", line 156, in wait_for_segments_removal
    wait_until(done,
  File "/root/tests/rptest/util.py", line 72, in wait_until
    raise TimeoutError(
ducktape.errors.TimeoutError: Segments were not removed
NyaliaLui added the kind/bug, ci-failure, and area/cloud-storage labels Jul 7, 2022
abhijat self-assigned this Jul 8, 2022
@abhijat
Contributor

abhijat commented Jul 8, 2022

This is similar to #4639.

piyushredpanda assigned VladLazar and unassigned abhijat Jul 28, 2022
@jcsp
Contributor

jcsp commented Aug 1, 2022

The last example of this failure mode was 3 weeks ago:
failure at 2022-07-07T15:47:23.138Z: TimeoutError('Segments were not removed')
in job https://buildkite.com/redpanda/redpanda/builds/12243#0181d907-c5e0-4ccd-ad4e-87f624f9ae5d

More recent failures of this test have all been #4639

@LenaAn
Contributor

LenaAn commented Aug 2, 2022

@jcsp
Contributor

jcsp commented Aug 2, 2022

That last one is a "failed to consume up to", so #4639

@VladLazar
Contributor

I've figured this specific failure mode out (Segments were not removed).

The test waits for the number of segments to become less than or equal to 6 while randomly killing nodes.
I think the idea here is to ensure that some segments are removed, in order to test that reads hit the
shadow indexing read path.

with random_process_kills(self.redpanda) as ctx:
    wait_for_segments_removal(redpanda=self.redpanda,
                              topic=self.topic,
                              partition_idx=0,
                              count=6)

For each node that is killed, the current segment is rolled and a new one is created.
In this particular instance of the test, 6 nodes were randomly killed, which is more failures than this test injects on average. The frequent restarts left a large enough number of segments retained (while still obeying the retention policy) that the segment count never dropped to the expected threshold and the test timed out.

There are a number of things we could do to address this issue with the test:

  1. Do nothing and ignore these failures. This failure mode is very rare. I've not been able to reproduce it in hundreds of runs.
  2. Increase the minimum time between the failures being injected. It's currently set to 10 seconds.
  3. Assert against the number of removed segments instead of asserting against the total number of current segments.

I could use a second opinion here. @LenaAn, @jcsp what do you think?

@piyushredpanda
Contributor

@Lazin and @abhijat as FYIs above. Amazing sleuthing, @VladLazar!

@abhijat
Contributor

abhijat commented Aug 2, 2022

Great find!

  2. Increase the minimum time between the failures being injected. It's currently set to 10 seconds.

FWIW, I tried increasing the minimum time between failures to 20 seconds (IIRC) in a test branch; the failure became much rarer, but was still present in 1 of 500 runs in my local setup. Your third option seems more viable to me, specifically for this test.

@jcsp
Contributor

jcsp commented Aug 2, 2022

Option 3 please -- as you say, the test is wrong to assert on the segment count in the presence of failures.

self.redpanda.storage gets you a snapshot of the segments on disk; you can use the list of segment names before/after to check that the earlier segments are removed, without depending on the total segment count.
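
For illustration, a rough sketch of what that before/after check could look like. The segments_for_partition helper and the way it reads the self.redpanda.storage() snapshot are assumptions rather than the actual rptest API, the wait_until signature is assumed to match the ducktape-style helper the test already uses, and the removal threshold is arbitrary:

    def segments_for_partition(redpanda, topic, partition_idx):
        # Hypothetical helper: return the set of segment file names currently
        # on disk for one partition. The real accessors exposed by the
        # self.redpanda.storage() snapshot may differ.
        snapshot = redpanda.storage()
        return set(snapshot.segment_names(topic, partition_idx))

    # Record which segments exist before any failures are injected.
    before = segments_for_partition(self.redpanda, self.topic, 0)

    with random_process_kills(self.redpanda) as ctx:
        def enough_original_segments_removed():
            after = segments_for_partition(self.redpanda, self.topic, 0)
            # Count how many of the originally observed segments have gone
            # away, instead of asserting on the total segment count.
            return len(before - after) >= 3  # illustrative threshold

        wait_until(enough_original_segments_removed,
                   timeout_sec=180,
                   backoff_sec=5,
                   err_msg="Original segments were not removed")

The key difference from the current assertion is that restarts can roll as many new segments as they like without affecting the check, because only the disappearance of the originally observed segment names matters.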
