Timeout failure in EndToEndShadowIndexingTestWithDisruptions.test_write_with_node_failures #5390

Closed
NyaliaLui opened this issue Jul 7, 2022 · 8 comments · Fixed by #5812
Labels: area/cloud-storage, area/tests, ci-failure, kind/bug

@NyaliaLui
Contributor

NyaliaLui commented Jul 7, 2022

https://buildkite.com/redpanda/redpanda/builds/12243#0181d907-c5e0-4ccd-ad4e-87f624f9ae5d/1595-8717

test_id:    rptest.tests.e2e_shadow_indexing_test.EndToEndShadowIndexingTestWithDisruptions.test_write_with_node_failures
status:     FAIL
run time:   3 minutes 41.829 seconds
 
    TimeoutError('Segments were not removed')
Traceback (most recent call last):
  File "/usr/local/lib/python3.9/dist-packages/ducktape/tests/runner_client.py", line 135, in run
    data = self.run_test()
  File "/usr/local/lib/python3.9/dist-packages/ducktape/tests/runner_client.py", line 227, in run_test
    return self.test_context.function(self.test)
  File "/root/tests/rptest/services/cluster.py", line 35, in wrapped
    r = f(self, *args, **kwargs)
  File "/root/tests/rptest/tests/e2e_shadow_indexing_test.py", line 132, in test_write_with_node_failures
    wait_for_segments_removal(redpanda=self.redpanda,
  File "/root/tests/rptest/util.py", line 156, in wait_for_segments_removal
    wait_until(done,
  File "/root/tests/rptest/util.py", line 72, in wait_until
    raise TimeoutError(
ducktape.errors.TimeoutError: Segments were not removed
NyaliaLui added the kind/bug, ci-failure, and area/cloud-storage labels Jul 7, 2022
abhijat self-assigned this Jul 8, 2022
@abhijat
Contributor

abhijat commented Jul 8, 2022

This is similar to #4639.

piyushredpanda assigned VladLazar and unassigned abhijat Jul 28, 2022
@jcsp
Contributor

jcsp commented Aug 1, 2022

The last example of this failure mode was 3 weeks ago:
failure at 2022-07-07T15:47:23.138Z: TimeoutError('Segments were not removed')
in job https://buildkite.com/redpanda/redpanda/builds/12243#0181d907-c5e0-4ccd-ad4e-87f624f9ae5d

More recent failures of this test have all been #4639

@LenaAn
Contributor

LenaAn commented Aug 2, 2022

@jcsp
Contributor

jcsp commented Aug 2, 2022

That last one is a "failed to consume up to", so #4639

@VladLazar
Contributor

I've figured this specific failure mode out (Segments were not removed).

The test waits for the number of segments to become less than or equal to 6 while randomly killing nodes.
I think the idea here is to ensure that some segments are removed, in order to test that reads hit the
shadow indexing read path.

with random_process_kills(self.redpanda) as ctx:
    wait_for_segments_removal(redpanda=self.redpanda,
                              topic=self.topic,
                              partition_idx=0,
                              count=6)

For each node that is killed, the current segment is rolled and a new one is created.
In this particular instance of the test, 6 nodes were randomly killed, which is more failures than this test injects on average. The frequent restarts left a large enough number of segments retained (while still obeying the retention policy) that the segment count never dropped to the expected threshold and the test timed out.

There are a number of things we could do to address this issue with the test:

  1. Do nothing and ignore these failures. This failure mode is very rare. I've not been able to reproduce it in hundreds of runs.
  2. Increase the minimum time between the failures being injected. It's currently set to 10 seconds.
  3. Assert against the number of removed segments instead of asserting against the total number of current segments.

I could use a second opinion here. @LenaAn, @jcsp what do you think?

@piyushredpanda
Contributor

@Lazin and @abhijat as FYIs above. Amazing sleuthing, @VladLazar!

@abhijat
Contributor

abhijat commented Aug 2, 2022

Great find!

  2. Increase the minimum time between the failures being injected. It's currently set to 10 seconds.

FWIW, I tried increasing the minimum time between failures to 20 seconds (IIRC) in a test branch; the failure became much rarer, but was still present in 1 of 500 runs in my local setup. Your third option seems more viable to me, specifically for this test.

@jcsp
Contributor

jcsp commented Aug 2, 2022

Option 3 please -- as you say, the test is wrong to assert on the segment count in the presence of failures.

self.redpanda.storage gets you a snapshot of the segments on disk; you can use the list of segment names before/after to check that the earlier segments are removed, without depending on the total segment count.
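
For illustration, a rough sketch of what that before/after check could look like. The segments_for_partition helper and the way it reads the self.redpanda.storage() snapshot are assumptions rather than the actual rptest API, the wait_until signature is assumed to match the ducktape-style helper the test already uses, and the removal threshold is arbitrary:

    def segments_for_partition(redpanda, topic, partition_idx):
        # Hypothetical helper: return the set of segment file names currently
        # on disk for one partition. The real accessors exposed by the
        # self.redpanda.storage() snapshot may differ.
        snapshot = redpanda.storage()
        return set(snapshot.segment_names(topic, partition_idx))

    # Record which segments exist before any failures are injected.
    before = segments_for_partition(self.redpanda, self.topic, 0)

    with random_process_kills(self.redpanda) as ctx:
        def enough_original_segments_removed():
            after = segments_for_partition(self.redpanda, self.topic, 0)
            # Count how many of the originally observed segments have gone
            # away, instead of asserting on the total segment count.
            return len(before - after) >= 3  # illustrative threshold

        wait_until(enough_original_segments_removed,
                   timeout_sec=180,
                   backoff_sec=5,
                   err_msg="Original segments were not removed")

The key difference from the current assertion is that restarts can roll as many new segments as they like without affecting the check, because only the disappearance of the originally observed segment names matters.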
