Failure in `EndToEndShadowIndexingTestWithDisruptions.test_write_with_node_failures` (Failed to consume up to offsets) #4639

rystsov · 2022-05-10T05:58:46Z

https://buildkite.com/redpanda/redpanda/builds/9910#0ad95b8f-ad64-45eb-bd6d-081ecbbd9f81

test_id:    rptest.tests.e2e_shadow_indexing_test.EndToEndShadowIndexingTestWithDisruptions.test_write_with_node_failures
status:     FAIL
run time:   2 minutes 47.183 seconds


    TimeoutError("Consumer failed to consume up to offsets {TopicPartition(topic='panda-topic', partition=0): 36217} after waiting 30s.")
Traceback (most recent call last):
  File "/usr/local/lib/python3.9/dist-packages/ducktape/tests/runner_client.py", line 135, in run
    data = self.run_test()
  File "/usr/local/lib/python3.9/dist-packages/ducktape/tests/runner_client.py", line 227, in run_test
    return self.test_context.function(self.test)
  File "/root/tests/rptest/services/cluster.py", line 35, in wrapped
    r = f(self, *args, **kwargs)
  File "/root/tests/rptest/tests/e2e_shadow_indexing_test.py", line 137, in test_write_with_node_failures
    self.run_validation()
  File "/root/tests/rptest/tests/end_to_end.py", line 188, in run_validation
    self.await_consumed_offsets(self.producer.last_acked_offsets,
  File "/root/tests/rptest/tests/end_to_end.py", line 154, in await_consumed_offsets
    wait_until(has_finished_consuming,
  File "/usr/local/lib/python3.9/dist-packages/ducktape/utils/util.py", line 58, in wait_until
    raise TimeoutError(err_msg() if callable(err_msg) else err_msg) from last_exception
ducktape.errors.TimeoutError: Consumer failed to consume up to offsets {TopicPartition(topic='panda-topic', partition=0): 36217} after waiting 30s.


dimitriscruz · 2022-05-17T21:06:21Z

New CI instance:

https://buildkite.com/redpanda/redpanda/builds/10216#cfcbdbd3-a1bf-4f59-a54a-86a1fc79a653

FAIL test: EndToEndShadowIndexingTestWithDisruptions.test_write_with_node_failures (1/69 runs)
  failure at 2022-05-17T09:02:49.159Z: TimeoutError("Consumer failed to consume up to offsets {TopicPartition(topic='panda-topic', partition=0): 38051} after waiting 30s.")
      in job https://buildkite.com/redpanda/redpanda/builds/10216#cfcbdbd3-a1bf-4f59-a54a-86a1fc79a653

NyaliaLui · 2022-06-03T21:36:02Z

Seen again https://buildkite.com/redpanda/redpanda/builds/10901#0181274c-bc09-4b3f-8023-c39a635b4de1/1531-7975

ztlpn · 2022-06-07T12:29:47Z

https://buildkite.com/redpanda/redpanda/builds/10993#01813b29-e435-4607-8205-360f1351c227

ztlpn · 2022-06-09T09:01:43Z

https://buildkite.com/redpanda/redpanda/builds/11112#01814720-a6d9-41c8-933c-670ebfac956f

ztlpn · 2022-06-10T13:56:22Z

https://buildkite.com/redpanda/redpanda/builds/11163#01814c58-5f17-453f-b913-7d93c9e58e6b

BenPope · 2022-06-14T15:34:35Z

https://buildkite.com/redpanda/redpanda/builds/11225#01815692-c844-452e-bb33-b248b3c1be11

ZeDRoman · 2022-07-19T10:45:05Z

+1 https://buildkite.com/redpanda/redpanda/builds/12712#01821516-070e-49a6-bbcd-e59d4c677045

VladLazar · 2022-07-28T18:13:24Z

Most recent failure: https://buildkite.com/redpanda/redpanda/builds/13191#01824099-d72c-4811-b32a-13d178f6e02e.

VladLazar · 2022-07-29T16:05:28Z

This looks like a test failure to me, but I'm not sure what the root cause is yet.

All records produced by the verifiable producer are consumed by the verifiable consumer, yet the test fails while waiting for all records to be consumed.

Producer tail logs:

{"timestamp":1658946331245,"name":"producer_send_success","key":null,"value":"1.25155","offset":34101,"topic":"panda-topic","partition":0}
{"timestamp":1658946331245,"name":"producer_send_success","key":null,"value":"1.25156","offset":34102,"topic":"panda-topic","partition":0}
{"timestamp":1658946331250,"name":"shutdown_complete"}

Consumer tail logs:

{"timestamp":1658946374260,"name":"record_data","key":null,"value":"1.25155","topic":"panda-topic","partition":0,"offset":34101}
{"timestamp":1658946374260,"name":"record_data","key":null,"value":"1.25156","topic":"panda-topic","partition":0,"offset":34102}
{"timestamp":1658946374260,"name":"records_consumed","count":202,"partitions":[{"topic":"panda-topic","partition":0,"count":202,"minOffset":33901,"maxOffset":34102}]}
{"timestamp":1658946374290,"name":"offsets_committed","offsets":[{"topic":"panda-topic","partition":0,"offset":34103}],"success":true}

Failure:

ducktape.errors.TimeoutError: Consumer failed to consume up to offsets {TopicPartition(topic='panda-topic', partition=0): 34102} after waiting 30s, last consumed offsets: [].

The empty "last consumed offsets" list indicates that none of the "record_data" events were processed by the Python verifiable consumer wrapper.
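One way to check the "everything produced was consumed" observation against the log tails is to compare the highest acknowledged offset with the highest consumed offset per partition. The following is a minimal sketch, assuming only the JSON lines quoted above; the helper name `max_offset` is made up for this example and is not part of rptest or the verifiable clients.

```python
import json

# Log tails copied from the producer and consumer output quoted in this comment.
producer_tail = [
    '{"timestamp":1658946331245,"name":"producer_send_success","key":null,"value":"1.25155","offset":34101,"topic":"panda-topic","partition":0}',
    '{"timestamp":1658946331245,"name":"producer_send_success","key":null,"value":"1.25156","offset":34102,"topic":"panda-topic","partition":0}',
    '{"timestamp":1658946331250,"name":"shutdown_complete"}',
]
consumer_tail = [
    '{"timestamp":1658946374260,"name":"record_data","key":null,"value":"1.25155","topic":"panda-topic","partition":0,"offset":34101}',
    '{"timestamp":1658946374260,"name":"record_data","key":null,"value":"1.25156","topic":"panda-topic","partition":0,"offset":34102}',
]


def max_offset(lines, event_name):
    """Return the highest offset per (topic, partition) among events called event_name.

    Hypothetical helper for illustration only.
    """
    result = {}
    for line in lines:
        event = json.loads(line)
        if event.get("name") != event_name:
            continue  # skip events such as shutdown_complete
        tp = (event["topic"], event["partition"])
        result[tp] = max(result.get(tp, -1), event["offset"])
    return result


acked = max_offset(producer_tail, "producer_send_success")
consumed = max_offset(consumer_tail, "record_data")

# Partitions where the consumer output stops short of the last acked offset.
missing = {tp: off for tp, off in acked.items() if consumed.get(tp, -1) < off}
print("acked:", acked)
print("consumed:", consumed)
print("missing:", missing)
```

If `missing` comes back empty while the test still reports an empty "last consumed offsets" list, the gap is more likely in how the wrapper tracks these events than in the data path.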

The only interesting observation I've made is that only two brokers are used when starting the verifiable consumer (the third one was chaos killed by the test).

abhijat · 2022-07-29T17:57:26Z

One thing to look out for may be what @LenaAn mentioned: there is some issue with the way producers are set up in this test suite. The verifiable producer and the normal CLI producer are mixed, while the verifiable consumer does the consuming.

BenPope · 2022-08-11T10:25:56Z

This one looks the same: https://buildkite.com/redpanda/redpanda/builds/13918#01828650-7bd3-4c4e-a58c-0d79399d2070

Module: rptest.tests.controller_upgrade_test
Class:  ControllerUpgradeTest
Method: test_updating_cluster_when_executing_operations

BenPope · 2022-08-12T10:04:53Z

https://buildkite.com/redpanda/redpanda/builds/14047#0182906f-6f65-42df-a4e1-ab6d8842f230

Module: rptest.tests.e2e_shadow_indexing_test
Class:  EndToEndShadowIndexingTestWithDisruptions
Method: test_write_with_node_failures

ztlpn · 2022-08-16T13:12:34Z

https://buildkite.com/redpanda/redpanda/builds/14191#0182a4e2-c10c-4bfd-8ad6-71f777cdf1e5

rystsov · 2022-08-21T15:48:41Z

Another instance https://buildkite.com/redpanda/redpanda/builds/14457#0182bf75-cf3b-44aa-9f5c-19a79832e8de

rystsov · 2022-08-22T19:10:42Z

This one looks the same: https://buildkite.com/redpanda/redpanda/builds/14467#0182c40a-b61d-4f43-81e4-69e6f3bd33ef

Module: rptest.tests.e2e_shadow_indexing_test
Class:  EndToEndShadowIndexingTestWithDisruptions
Method: test_write_with_node_failures

test_id:    rptest.tests.e2e_shadow_indexing_test.EndToEndShadowIndexingTestWithDisruptions.test_write_with_node_failures
status:     FAIL
run time:   1 minute 31.525 seconds

    TimeoutError("Consumer failed to consume up to offsets {TopicPartition(topic='panda-topic', partition=0): 34338} after waiting 30s, last committed offsets: {TopicPartition(topic='panda-topic', partition=0): 26093}.")
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/ducktape/tests/runner_client.py", line 135, in run
    data = self.run_test()
  File "/usr/local/lib/python3.10/dist-packages/ducktape/tests/runner_client.py", line 227, in run_test
    return self.test_context.function(self.test)
  File "/root/tests/rptest/services/cluster.py", line 35, in wrapped
    r = f(self, *args, **kwargs)
  File "/root/tests/rptest/tests/e2e_shadow_indexing_test.py", line 137, in test_write_with_node_failures
    self.run_validation()
  File "/root/tests/rptest/tests/end_to_end.py", line 261, in run_validation
    self.run_consumer_validation(
  File "/root/tests/rptest/tests/end_to_end.py", line 280, in run_consumer_validation
    self.await_consumed_offsets(last_acked_offsets,
  File "/root/tests/rptest/tests/end_to_end.py", line 220, in await_consumed_offsets
    wait_until(
  File "/root/tests/rptest/util.py", line 72, in wait_until
    raise TimeoutError(
ducktape.errors.TimeoutError: Consumer failed to consume up to offsets {TopicPartition(topic='panda-topic', partition=0): 34338} after waiting 30s, last committed offsets: {TopicPartition(topic='panda-topic', partition=0): 26093}.

jcsp · 2022-09-16T18:43:20Z

This test is killing random nodes, but only allowing up to 30 seconds for the consumer to read all records -- that is probably not enough to allow consumer groups to re-negotiate after failures reliably.

Looking at the logs from this example (https://buildkite.com/redpanda/redpanda/builds/14457#0182bf75-cf3b-44aa-9f5c-19a79832e8de) we can see the consumer repeatedly trying to join the group, and eventually succeeding moments after the test timeout hits. There are `offsets_committed` messages in the consumer stdout moments after this happens, but they arrive too late for the test to be satisfied by them.
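The timing problem can be sketched with a toy polling loop. This is only an illustration, assuming a consumer that reports no progress until it has rejoined its group; `wait_until` here is a local stand-in written for the example (not ducktape's or rptest's implementation), `make_consumer` and `survives` are hypothetical names, and the durations are scaled down so that the snippet runs in well under a second.

```python
import time


def wait_until(condition, timeout_sec, backoff_sec=0.01, err_msg=""):
    """Poll condition() until it returns True or timeout_sec elapses.

    Local stand-in for a test framework's polling helper.
    """
    deadline = time.monotonic() + timeout_sec
    while True:
        if condition():
            return
        if time.monotonic() >= deadline:
            raise TimeoutError(err_msg)
        time.sleep(backoff_sec)


def make_consumer(rejoin_delay_sec):
    """Simulate a consumer that only makes progress after rejoining its group."""
    start = time.monotonic()
    return lambda: time.monotonic() - start >= rejoin_delay_sec


def survives(timeout_sec, rejoin_delay_sec):
    """Return True if the wait finishes before the timeout, False otherwise."""
    try:
        wait_until(make_consumer(rejoin_delay_sec),
                   timeout_sec,
                   err_msg="Consumer failed to consume up to offsets")
        return True
    except TimeoutError:
        return False


# The rejoin (0.2 s) completes just after the short budget (0.1 s) expires,
# but comfortably inside the longer budget (0.5 s).
short_budget_ok = survives(timeout_sec=0.1, rejoin_delay_sec=0.2)
long_budget_ok = survives(timeout_sec=0.5, rejoin_delay_sec=0.2)
print(short_budget_ok, long_budget_ok)
```

The consumer behaves identically in both runs; only the wait budget decides whether the check passes, which matches the pattern of the group join succeeding moments after the 30s timeout.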

BenPope · 2022-09-16T19:05:15Z

Yep.
https://github.com/redpanda-data/redpanda/blob/dev/src/v/config/configuration.cc#L545-L550

VladLazar · 2022-09-20T11:45:25Z

That makes sense to me. I guess it depends on how "lucky" we get in terms of picking brokers that are consumer group coordinators. Increasing the timeout for fetching data decreases the chance of this happening.

rystsov added kind/bug Something isn't working area/tests ci-failure labels May 10, 2022

abhijat self-assigned this May 10, 2022

abhijat mentioned this issue Jun 9, 2022

tests: update internal topic replication factor #5077

Closed

jcsp mentioned this issue Jul 5, 2022

raft: require full leadership before leadership transfer #5333

Merged

abhijat mentioned this issue Jul 8, 2022

Timeout failure in EndToEndShadowIndexingTestWithDisruptions.test_write_with_node_failures #5390

Closed

andrwng mentioned this issue Jul 15, 2022

tests: re-use installs in upgrade tests #5459

Merged

mmedenjak added the area/cloud-storage Shadow indexing subsystem label Jul 21, 2022

jcsp changed the title ~~Failure in EndToEndShadowIndexingTestWithDisruptions.test_write_with_node_failures~~ Failure in EndToEndShadowIndexingTestWithDisruptions.test_write_with_node_failures (Failed to consume up to offsets) Jul 25, 2022

jcsp mentioned this issue Jul 25, 2022

kafka: fix assertion failure in join_empty_group_static_member #5607

Merged

piyushredpanda assigned VladLazar and unassigned abhijat Jul 28, 2022

ZeDRoman mentioned this issue Jul 29, 2022

Support rack awareness on partition reallocation #5614

Merged

VladLazar mentioned this issue Aug 10, 2022

test: use snapshots for detecting segment removal #5812

Merged


jcsp mentioned this issue Aug 19, 2022

tests: make FeaturesMultiNodeTest more robust #6091

Merged


rystsov mentioned this issue Aug 21, 2022

ducky: pin to the latest version #6003

Closed


rystsov mentioned this issue Aug 22, 2022

Reduce duration of partition_movement_test from 25min to 8min #5238

Draft

rystsov added the ci-disabled-test label Sep 16, 2022

VladLazar mentioned this issue Sep 20, 2022

tests: extend consumer timeout in test with kills #6478

Merged


VladLazar closed this as completed in #6478 Sep 20, 2022

jcsp mentioned this issue Oct 18, 2022

tests: remove stale ok_to_fail markers on cloud storage tests #6813

Merged

