Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Failure in EndToEndShadowIndexingTestWithDisruptions.test_write_with_node_failures (Failed to consume up to offsets) #4639

Closed
rystsov opened this issue May 10, 2022 · 18 comments · Fixed by #6478
Assignees
Labels

Comments

@rystsov
Copy link
Contributor

rystsov commented May 10, 2022

https://buildkite.com/redpanda/redpanda/builds/9910#0ad95b8f-ad64-45eb-bd6d-081ecbbd9f81

test_id:    rptest.tests.e2e_shadow_indexing_test.EndToEndShadowIndexingTestWithDisruptions.test_write_with_node_failures
status:     FAIL
run time:   2 minutes 47.183 seconds


    TimeoutError("Consumer failed to consume up to offsets {TopicPartition(topic='panda-topic', partition=0): 36217} after waiting 30s.")
Traceback (most recent call last):
  File "/usr/local/lib/python3.9/dist-packages/ducktape/tests/runner_client.py", line 135, in run
    data = self.run_test()
  File "/usr/local/lib/python3.9/dist-packages/ducktape/tests/runner_client.py", line 227, in run_test
    return self.test_context.function(self.test)
  File "/root/tests/rptest/services/cluster.py", line 35, in wrapped
    r = f(self, *args, **kwargs)
  File "/root/tests/rptest/tests/e2e_shadow_indexing_test.py", line 137, in test_write_with_node_failures
    self.run_validation()
  File "/root/tests/rptest/tests/end_to_end.py", line 188, in run_validation
    self.await_consumed_offsets(self.producer.last_acked_offsets,
  File "/root/tests/rptest/tests/end_to_end.py", line 154, in await_consumed_offsets
    wait_until(has_finished_consuming,
  File "/usr/local/lib/python3.9/dist-packages/ducktape/utils/util.py", line 58, in wait_until
    raise TimeoutError(err_msg() if callable(err_msg) else err_msg) from last_exception
ducktape.errors.TimeoutError: Consumer failed to consume up to offsets {TopicPartition(topic='panda-topic', partition=0): 36217} after waiting 30s.
@rystsov rystsov added kind/bug Something isn't working area/tests ci-failure labels May 10, 2022
@abhijat abhijat self-assigned this May 10, 2022
@dimitriscruz
Copy link
Contributor

dimitriscruz commented May 17, 2022

New CI instance:

https://buildkite.com/redpanda/redpanda/builds/10216#cfcbdbd3-a1bf-4f59-a54a-86a1fc79a653

FAIL test: EndToEndShadowIndexingTestWithDisruptions.test_write_with_node_failures (1/69 runs)
  failure at 2022-05-17T09:02:49.159Z: TimeoutError("Consumer failed to consume up to offsets {TopicPartition(topic='panda-topic', partition=0): 38051} after waiting 30s.")
      in job https://buildkite.com/redpanda/redpanda/builds/10216#cfcbdbd3-a1bf-4f59-a54a-86a1fc79a653

@NyaliaLui
Copy link
Contributor

@ztlpn
Copy link
Contributor

ztlpn commented Jun 7, 2022

@ztlpn
Copy link
Contributor

ztlpn commented Jun 9, 2022

@ztlpn
Copy link
Contributor

ztlpn commented Jun 10, 2022

@BenPope
Copy link
Member

BenPope commented Jun 14, 2022

@ZeDRoman
Copy link
Contributor

@mmedenjak mmedenjak added the area/cloud-storage Shadow indexing subsystem label Jul 21, 2022
@jcsp jcsp changed the title Failure in EndToEndShadowIndexingTestWithDisruptions.test_write_with_node_failures Failure in EndToEndShadowIndexingTestWithDisruptions.test_write_with_node_failures (Failed to consume up to offsets) Jul 25, 2022
@piyushredpanda piyushredpanda assigned VladLazar and unassigned abhijat Jul 28, 2022
@VladLazar
Copy link
Contributor

@VladLazar
Copy link
Contributor

VladLazar commented Jul 29, 2022

This looks like a test failure to me, but I'm not sure what the root cause is yet.

All records produced by the verifiable producer are consumed by the verifiable consumer, yet the test fails while waiting for all record to be consumed.

Producer tail logs:

{"timestamp":1658946331245,"name":"producer_send_success","key":null,"value":"1.25155","offset":34101,"topic":"panda-topic","partition":0}
{"timestamp":1658946331245,"name":"producer_send_success","key":null,"value":"1.25156","offset":34102,"topic":"panda-topic","partition":0}
{"timestamp":1658946331250,"name":"shutdown_complete"}

Consumer tail logs:

{"timestamp":1658946374260,"name":"record_data","key":null,"value":"1.25155","topic":"panda-topic","partition":0,"offset":34101}
{"timestamp":1658946374260,"name":"record_data","key":null,"value":"1.25156","topic":"panda-topic","partition":0,"offset":34102}
{"timestamp":1658946374260,"name":"records_consumed","count":202,"partitions":[{"topic":"panda-topic","partition":0,"count":202,"minOffset":33901,"maxOffset":34102}]}
{"timestamp":1658946374290,"name":"offsets_committed","offsets":[{"topic":"panda-topic","partition":0,"offset":34103}],"success":true}

Failure:

ducktape.errors.TimeoutError: Consumer failed to consume up to offsets {TopicPartition(topic='panda-topic', partition=0): 34102} after waiting 30s, last consumed offsets: [].

The fact that the "last consumed offsets" list is empty indicates the fact that none of the "record_data" events where processed by the python verifiable consumer wrapper.

The only interesting observation I've made is that only two brokers are used when starting the verifiable consumer (the third one was chaos killed by the test).

@abhijat
Copy link
Contributor

abhijat commented Jul 29, 2022

One thing to look out for maybe that @LenaAn mentioned that there is some issue with the way producers are setup in this test suite, verifiable producer and normal cli producer are mixed and verifiable consumer consumes.

@BenPope
Copy link
Member

BenPope commented Aug 11, 2022

This one looks the same: https://buildkite.com/redpanda/redpanda/builds/13918#01828650-7bd3-4c4e-a58c-0d79399d2070

Module: rptest.tests.controller_upgrade_test
Class:  ControllerUpgradeTest
Method: test_updating_cluster_when_executing_operations

@BenPope
Copy link
Member

BenPope commented Aug 12, 2022

https://buildkite.com/redpanda/redpanda/builds/14047#0182906f-6f65-42df-a4e1-ab6d8842f230

Module: rptest.tests.e2e_shadow_indexing_test
Class:  EndToEndShadowIndexingTestWithDisruptions
Method: test_write_with_node_failures

@ztlpn
Copy link
Contributor

ztlpn commented Aug 16, 2022

@rystsov
Copy link
Contributor Author

rystsov commented Aug 21, 2022

@rystsov
Copy link
Contributor Author

rystsov commented Aug 22, 2022

This one looks the same: https://buildkite.com/redpanda/redpanda/builds/14467#0182c40a-b61d-4f43-81e4-69e6f3bd33ef

Module: rptest.tests.e2e_shadow_indexing_test
Class:  EndToEndShadowIndexingTestWithDisruptions
Method: test_write_with_node_failures
test_id:    rptest.tests.e2e_shadow_indexing_test.EndToEndShadowIndexingTestWithDisruptions.test_write_with_node_failures
status:     FAIL
run time:   1 minute 31.525 seconds

    TimeoutError("Consumer failed to consume up to offsets {TopicPartition(topic='panda-topic', partition=0): 34338} after waiting 30s, last committed offsets: {TopicPartition(topic='panda-topic', partition=0): 26093}.")
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/ducktape/tests/runner_client.py", line 135, in run
    data = self.run_test()
  File "/usr/local/lib/python3.10/dist-packages/ducktape/tests/runner_client.py", line 227, in run_test
    return self.test_context.function(self.test)
  File "/root/tests/rptest/services/cluster.py", line 35, in wrapped
    r = f(self, *args, **kwargs)
  File "/root/tests/rptest/tests/e2e_shadow_indexing_test.py", line 137, in test_write_with_node_failures
    self.run_validation()
  File "/root/tests/rptest/tests/end_to_end.py", line 261, in run_validation
    self.run_consumer_validation(
  File "/root/tests/rptest/tests/end_to_end.py", line 280, in run_consumer_validation
    self.await_consumed_offsets(last_acked_offsets,
  File "/root/tests/rptest/tests/end_to_end.py", line 220, in await_consumed_offsets
    wait_until(
  File "/root/tests/rptest/util.py", line 72, in wait_until
    raise TimeoutError(
ducktape.errors.TimeoutError: Consumer failed to consume up to offsets {TopicPartition(topic='panda-topic', partition=0): 34338} after waiting 30s, last committed offsets: {TopicPartition(topic='panda-topic', partition=0): 26093}.

@jcsp
Copy link
Contributor

jcsp commented Sep 16, 2022

This test is killing random nodes, but only allowing up to 30 seconds for the consumer to read all records -- that is probably not enough to allow consumer groups to re-negotiate after failures reliably.

Looking at the logs from this example (https://buildkite.com/redpanda/redpanda/builds/14457#0182bf75-cf3b-44aa-9f5c-19a79832e8de) we can see the consumer repeatedly trying to join the group, and eventually succeeding moments after the test timeout hits. There are offset_committed messages in the consumer stdout moments after this happens, but too late for the test to be satisfied by them.

@BenPope
Copy link
Member

BenPope commented Sep 16, 2022

@VladLazar
Copy link
Contributor

That makes sense to me. I guess it depends on how "lucky" we get in terms of picking brokers that are consumer group coordinators. Increasing the timeout for fetching data decreases the chance of this happening.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
10 participants