Failure in EndToEndShadowIndexingTestWithDisruptions.test_write_with_node_failures
(Failed to consume up to offsets)
#4639
Comments
New CI instance: https://buildkite.com/redpanda/redpanda/builds/10216#cfcbdbd3-a1bf-4f59-a54a-86a1fc79a653
EndToEndShadowIndexingTestWithDisruptions.test_write_with_node_failures
(Failed to consume up to offsets)
This looks like a test failure to me, but I'm not sure what the root cause is yet. All records produced by the verifiable producer are consumed by the verifiable consumer, yet the test fails while waiting for all records to be consumed. Producer tail logs:
Consumer tail logs:
Failure:
The fact that the "last consumed offsets" list is empty indicates that none of the "record_data" events were processed by the Python verifiable consumer wrapper. The only interesting observation I've made is that only two brokers are used when starting the verifiable consumer (the third one was chaos-killed by the test).
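For context, the kind of tracking the wrapper does can be sketched as follows. This is a minimal illustration, not the actual test-suite code: the `record_data` event name comes from the comment above, and the field names (`topic`, `partition`, `offset`) are assumptions about the JSON line format such tools typically emit. An empty result here corresponds to the empty "last consumed offsets" list observed in the failure.

```python
import json

def track_last_consumed(lines):
    """Track the last consumed offset per (topic, partition) from JSON
    event lines emitted by a verifiable-consumer-style tool.

    Hypothetical sketch: the 'record_data' event name and field names
    are assumptions, not the real wrapper's schema."""
    last = {}
    for line in lines:
        try:
            event = json.loads(line)
        except ValueError:
            continue  # skip non-JSON log noise interleaved in stdout
        if event.get("name") == "record_data":
            key = (event["topic"], event["partition"])
            # keep the highest offset seen so far for this partition
            last[key] = max(last.get(key, -1), event["offset"])
    return last
```

If the wrapper never sees any `record_data` events (for example because the consumer spends the whole window rejoining its group), the returned map stays empty, which would produce exactly the empty list seen in the failure output.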
One thing to look out for: @LenaAn mentioned that there is some issue with the way producers are set up in this test suite; the verifiable producer and a normal CLI producer are mixed, while the verifiable consumer does all the consuming.
This one looks the same: https://buildkite.com/redpanda/redpanda/builds/13918#01828650-7bd3-4c4e-a58c-0d79399d2070
https://buildkite.com/redpanda/redpanda/builds/14047#0182906f-6f65-42df-a4e1-ab6d8842f230
This one looks the same: https://buildkite.com/redpanda/redpanda/builds/14467#0182c40a-b61d-4f43-81e4-69e6f3bd33ef
This test is killing random nodes, but only allowing up to 30 seconds for the consumer to read all records -- that is probably not enough to reliably allow consumer groups to re-negotiate after failures. Looking at the logs from this example (https://buildkite.com/redpanda/redpanda/builds/14457#0182bf75-cf3b-44aa-9f5c-19a79832e8de) we can see the consumer repeatedly trying to join the group, eventually succeeding moments after the test timeout hits. There are offset_committed messages in the consumer stdout shortly after that, but too late for the test to be satisfied by them.
That makes sense to me. I guess it depends on how "lucky" we get in terms of which brokers end up as consumer group coordinators. Increasing the timeout for fetching data decreases the chance of this happening.
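The timing race described above can be sketched with a minimal polling helper. This is illustrative only: `wait_until` mimics the ducktape-style helper such suites use, and `FakeConsumer` is a hypothetical stand-in that simulates a consumer blocked on a group rebalance, not the real verifiable consumer.

```python
import time

def wait_until(condition, timeout_sec, backoff_sec=0.5):
    """Poll `condition` until it returns True or the timeout expires.
    Sketch of a ducktape-style wait helper; returns False on timeout."""
    deadline = time.time() + timeout_sec
    while time.time() < deadline:
        if condition():
            return True
        time.sleep(backoff_sec)
    return False

class FakeConsumer:
    """Hypothetical stand-in: reports zero records consumed until a
    simulated group rebalance finishes, then reports all of them."""
    def __init__(self, rebalance_delay_sec, total_records):
        self._ready_at = time.time() + rebalance_delay_sec
        self.total_records = total_records

    def consumed(self):
        return self.total_records if time.time() >= self._ready_at else 0

# A consumer that needs ~1 s to rejoin its group after a node kill:
consumer = FakeConsumer(rebalance_delay_sec=1.0, total_records=1000)

# A timeout shorter than the rebalance window fails, even though the
# consumer would have caught up moments later...
assert not wait_until(lambda: consumer.consumed() >= 1000,
                      timeout_sec=0.3, backoff_sec=0.1)

# ...while a more generous timeout lets the same consumer succeed.
assert wait_until(lambda: consumer.consumed() >= 1000,
                  timeout_sec=5.0, backoff_sec=0.1)
```

This mirrors the failure mode in the logs: the consumer's offset commits arrive just after the 30-second window closes, so simply widening the wait makes the test insensitive to how long the coordinator re-election happens to take.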
https://buildkite.com/redpanda/redpanda/builds/9910#0ad95b8f-ad64-45eb-bd6d-081ecbbd9f81