Consumer failed to consume up to offsets in SIPartitionMovementTest.test_shadow_indexing (#4702)
The logs also have this record:
It comes from
If we look at the
The record which should have been written at offset 1879 failed with an indecisive error (a timeout). The next successfully written offset is 1881. So it seems the test is wrong: it treats timeouts as definite errors and complains when it sees 1879 instead of 1881.
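For illustration, here is a minimal sketch of the distinction being described: a producer-side ledger that records a timeout as indecisive rather than as a definite failure. The names (`ProducerLedger`, `check_consumed`, keying by a sequence number embedded in the payload) are hypothetical, not the actual test code:

```python
from enum import Enum


class ProduceOutcome(Enum):
    ACKED = 1        # broker confirmed the write
    INDECISIVE = 2   # timeout etc.: the record may or may not be on the log


class ProducerLedger:
    """Tracks the fate of each produced message, keyed by a sequence
    number embedded in the payload (as verifiable producers typically do)."""

    def __init__(self):
        self.outcomes = {}  # seq -> ProduceOutcome

    def on_ack(self, seq):
        self.outcomes[seq] = ProduceOutcome.ACKED

    def on_timeout(self, seq):
        # The key point: a timeout is not proof of absence. The broker may
        # still have persisted the record, so it must not be treated as a
        # definite failure.
        self.outcomes[seq] = ProduceOutcome.INDECISIVE


def check_consumed(ledger, consumed_seqs):
    """Every acked message must be consumed; an indecisive one may or may
    not appear, and neither case is an error."""
    consumed = set(consumed_seqs)
    for seq, outcome in ledger.outcomes.items():
        if outcome is ProduceOutcome.ACKED and seq not in consumed:
            raise AssertionError(f"acked message {seq} was lost")
```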
Another instance
#5238 introduces a new workload & online verifier which isn't subject to such errors; once we switch to it, that will fix this issue.
When I looked into https://buildkite.com/redpanda/redpanda/builds/14432#0182baf4-737e-4b49-9326-d58fe8abb2a5 (technically #6076 but same failure mode as this issue) it looks a lot like the server really is leaving a gap in the kafka offsets. I don't think the consumer is paying any attention to what was/wasn't acked to the producer: it's just consuming the partition, and seeing the kafka offset jump forward around the same point in the log where we did a partition move. It's not clear to me that this is valid behavior in redpanda.
@jcsp what made you think there are gaps? If the verifiable producer treats indecisive errors as definite errors, then we'll get a verifiable consumer thinking there is a gap while in reality all the offsets are monotonic, and when reality collides with expectations we get "Consumer failed to consume up to offsets".
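To make the consumer's expectation concrete, here is a minimal sketch (a hypothetical helper, not the actual verifiable-consumer code) of the contiguity check implied above; it assumes a non-compacted, non-transactional topic, where consumed kafka offsets should be dense:

```python
def assert_contiguous(fetched_offsets):
    """Raise if consumed kafka offsets are not dense.

    Compaction and aborted transactions make gaps legitimate, so this
    check is only meaningful for a plain topic like the test's."""
    prev = None
    for off in fetched_offsets:
        if prev is not None and off != prev + 1:
            raise AssertionError(
                f"offset jumped from {prev} to {off}: "
                f"{off - prev - 1} offset(s) skipped")
        prev = off
```

Note that a real server-side gap and a producer that wrongly wrote off a timed-out record both surface here the same way: as a jump in the consumed offsets.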
What I saw in the consumer stdout was that its fetch responses were skipping offsets (this is in run https://buildkite.com/redpanda/redpanda/builds/14432#0182baf4-737e-4b49-9326-d58fe8abb2a5):
This isn't code that checks the producer's claims; it's just consuming the topic naively and seeing a jump in the offsets returned by the server. Looking at the server side, this jump corresponds with a hop between nodes. The last contiguous read response was from docker-rp-5:
Then on docker-rp-12 a few seconds later we see this:
It looks like during the movement the
Interesting, so we have several issues hidden behind the "Consumer failed to consume up to offsets" umbrella.
The root cause in this issue is exactly the same as in #6076.
https://buildkite.com/redpanda/redpanda/builds/9985#74311a35-dfef-42fb-a669-5e752e0248a0