test: Failure in rptest.tests.raft_availability_test.RaftAvailabilityTest.test_follower_isolation #4602
Comments
I wonder if these timeout failures are caused by resource contention in dockerized runs and would just go away if the tests were run on CDT? WDYT, @VadimPlh?
In log I see only 2 But as I understand before it I should see After it I see a lot of
The eventual success of the producer ID allocation happens after the test fails, and part of teardown is de-isolating the node. docker-rp-1 is the isolated node, and it is the one sending these not_coordinator responses. The client treats that as a retryable error and keeps asking the same broker to initialize the producer ID.

docker-rp-1 was isolated at about the same time as the id_allocator partition got its leadership, so it doesn't have the leadership info in its cache: it's hitting the "failed to allocate pid" path in init_producer_id.cc. docker-rp-1 initially knows the leadership of the id_allocator partition (06:04:58,558), but it sets it to null when it reaches its election timeout after isolation (06:05:05,765). This all happens before the producer sends its first request for an ID (06:05:06,831).

So why isn't this happening all the time? I think this commit is the reason it passes today:
That changed the test to initialize the client once, very soon after isolating the node, and then use the same client throughout, so it isn't trying to allocate IDs again. It will only have an issue if the test runs slowly enough to delay the producer setup until after the election timeout.

For the issue to happen, the client's startup needs to be delayed far enough for the isolated node to have set the leader to null for id_allocator, but not so far that the client sees cluster metadata that excludes the isolated node (in the latter case it would not send init_producer_id to the isolated node to begin with).

In the failure, franz-go is in use; the test more recently uses rdkafka (via its Python bindings). I think the two may have different retry behaviors: I can reproduce this issue with rpk (by hacking the test to wait 1.5s after failure injection and using an rpk producer), but not with the rdkafka producer (where I just insert the sleep).
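As a rough illustration of the timing argument above, here is a minimal Python sketch that replays the three timestamps quoted from the log and checks whether a request falls inside the failure window. The helper `in_failure_window` and the `metadata_excludes_node_at` value are hypothetical, introduced only for illustration; the real point at which a client drops the isolated node from its metadata is not stated in this thread.

```python
from datetime import datetime, timedelta

LOG_TIME_FORMAT = "%H:%M:%S,%f"  # matches timestamps such as 06:04:58,558


def parse_log_time(text: str) -> datetime:
    """Parse a time-of-day log timestamp (the date part is irrelevant here)."""
    return datetime.strptime(text, LOG_TIME_FORMAT)


# Timestamps quoted in the analysis above.
leader_known = parse_log_time("06:04:58,558")       # docker-rp-1 learns the id_allocator leader
leader_cleared = parse_log_time("06:05:05,765")     # election timeout, leader set to null
first_pid_request = parse_log_time("06:05:06,831")  # producer's first init_producer_id request


def in_failure_window(request_at: datetime,
                      cleared_at: datetime,
                      metadata_excludes_node_at: datetime) -> bool:
    """Hypothetical predicate for the window described in the comment:
    the request arrives after the isolated node has cleared its leader,
    but before the client's metadata stops listing that node."""
    return cleared_at <= request_at < metadata_excludes_node_at


# Assumption for illustration only: the client would stop seeing the
# isolated node 5 seconds after the leader was cleared.
metadata_excludes_node_at = leader_cleared + timedelta(seconds=5)

time_until_cleared = (leader_cleared - leader_known).total_seconds()
request_delay = (first_pid_request - leader_cleared).total_seconds()

print(f"leader cleared {time_until_cleared:.3f}s after it was learned")
print(f"first pid request arrived {request_delay:.3f}s after the leader was cleared")
print("in failure window:",
      in_failure_window(first_pid_request, leader_cleared, metadata_excludes_node_at))
```

With the logged timestamps, the first request lands roughly one second after the leader was cleared, which is consistent with the observation that a 1.5s sleep after failure injection is enough to reproduce the problem with an rpk producer.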
I don't think this test is unstable in practice (it hasn't failed in 30 days, and I couldn't get it to fail with strategically placed sleeps), so I'm closing this ticket and moving the discussion to #5648.
https://buildkite.com/redpanda/redpanda/builds/9821#58ccc865-daf9-4429-bbe0-563e8835c70a/1527-7591
seen during build of #4404
rpk produce timed out after a minute. This seems similar to #4360, which is closed.