
test: Failure in rptest.tests.raft_availability_test.RaftAvailabilityTest.test_follower_isolation #4602

Closed
abhijat opened this issue May 6, 2022 · 4 comments
abhijat commented May 6, 2022

https://buildkite.com/redpanda/redpanda/builds/9821#58ccc865-daf9-4429-bbe0-563e8835c70a/1527-7591
Seen during the build of #4404.

rpk produce timed out after a minute; this looks similar to #4360, which is closed.

test_id:    rptest.tests.raft_availability_test.RaftAvailabilityTest.test_follower_isolation

RpkException('command /var/lib/buildkite-agent/builds/buildkite-amd64-builders-i-015916b4485b41996-1/redpanda/redpanda/vbuild/debug/clang/dist/local/redpanda/bin/rpk topic --brokers docker-rp-1:9092,docker-rp-19:9092,docker-rp-14:9092 produce --key tkey -z none --delivery-timeout 60s -f %v topic-gdffgawotc timed out')
Traceback (most recent call last):
  File "/root/tests/rptest/clients/rpk.py", line 479, in _execute
    output, error = p.communicate(input=stdin, timeout=timeout)
  File "/usr/lib/python3.9/subprocess.py", line 1134, in communicate
    stdout, stderr = self._communicate(input, endtime, timeout)
  File "/usr/lib/python3.9/subprocess.py", line 1980, in _communicate
    self._check_timeout(endtime, orig_timeout, stdout, stderr)
  File "/usr/lib/python3.9/subprocess.py", line 1178, in _check_timeout
    raise TimeoutExpired(
subprocess.TimeoutExpired: Command '['/var/lib/buildkite-agent/builds/buildkite-amd64-builders-i-015916b4485b41996-1/redpanda/redpanda/vbuild/debug/clang/dist/local/redpanda/bin/rpk', 'topic', '--brokers', 'docker-rp-1:9092,docker-rp-19:9092,docker-rp-14:9092', 'produce', '--key', 'tkey', '-z', 'none', '--delivery-timeout', '60s', '-f', '%v', 'topic-gdffgawotc']' timed out after 60.5 seconds
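
A minimal sketch, assuming nothing beyond what the traceback shows, of how a wrapper like rptest/clients/rpk.py ends up raising subprocess.TimeoutExpired (the function name and argument handling here are illustrative, not the real rpk.py code):

    import subprocess

    def run_rpk_produce(brokers, topic, key, value, timeout_s=60.5):
        # Build the same kind of command line as in the failure above.
        cmd = [
            "rpk", "topic",
            "--brokers", ",".join(brokers),
            "produce", "--key", key,
            "-z", "none",
            "--delivery-timeout", "60s",
            "-f", "%v",
            topic,
        ]
        p = subprocess.Popen(cmd,
                             stdin=subprocess.PIPE,
                             stdout=subprocess.PIPE,
                             stderr=subprocess.PIPE,
                             text=True)
        try:
            # communicate() raises subprocess.TimeoutExpired when rpk does not
            # exit within timeout_s -- the exception seen in the traceback.
            return p.communicate(input=value, timeout=timeout_s)
        except subprocess.TimeoutExpired:
            p.kill()
            raise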
 
@piyushredpanda
Contributor

I wonder if these timeouts are due to resource contention in the dockerized runs and would just go away if these tests were run on CDT? WDYT, @VadimPlh?

@VadimPlh
Contributor

In the log I see only two "processing name:produce, key:0, version:7" entries for rpk.

But as I understand it, before those I should see an init producer_id request complete without error.
On the first node I see "sending 22:init producer_id response {throttle_time_ms=0 error_code={ error_code: none [0] } producer_id=1 producer_epoch=0}", so that is for rpk's first produce request.
On the second node I see "init producer_id response {throttle_time_ms=0 error_code={ error_code: none [0] } producer_id=2 producer_epoch=0}" for the second produce request.

After that, on the first node I see many "init producer_id response {throttle_time_ms=0 error_code={ error_code: not_coordinator [16] } producer_id=-1 producer_epoch=0}" responses for about a minute.
And finally I see "init producer_id response {throttle_time_ms=0 error_code={ error_code: none [0] } producer_id=4 producer_epoch=0}", but as I understand it, that came too late.

@jcsp
Contributor

jcsp commented Jul 26, 2022

The eventual success of the producer id allocation happens after the test fails, because part of teardown is de-isolating the node.

docker-rp-1 is the isolated node, and it is sending these not_coordinator responses. The client treats it as a retryable error, and keeps trying to ask the same broker to initialize the producer id.
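
An illustrative sketch of that retry behaviour (plain Python, not franz-go or librdkafka internals; send_request and the constants are stand-ins):

    import time

    ERR_NONE = 0
    ERR_NOT_COORDINATOR = 16

    def init_producer_id(send_request, deadline_s=60.0, backoff_s=0.1):
        # send_request() stands in for one InitProducerId round trip to the
        # chosen broker and returns (error_code, producer_id).
        deadline = time.monotonic() + deadline_s
        while time.monotonic() < deadline:
            error_code, producer_id = send_request()
            if error_code == ERR_NONE:
                return producer_id
            if error_code == ERR_NOT_COORDINATOR:
                # Retryable: keep asking the same broker, which is exactly
                # what the isolated docker-rp-1 keeps answering for ~1 min.
                time.sleep(backoff_s)
                continue
            raise RuntimeError(f"non-retryable error {error_code}")
        raise TimeoutError("init_producer_id did not succeed before the deadline")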

docker-rp-1 was isolated at about the same time as the id_allocator partition got its leadership, so it doesn't have the leadership info in its cache: it's hitting the "failed to allocate pid" path in init_producer_id.cc.

docker-rp-1 initially knows the leadership of the id_allocator partition (06:04:58,558), but it sets it to null when it reaches its election timeout after isolation (06:05:05,765). This all happens before the producer sends its first request for id (06:05:06,831).

So why isn't this happening all the time? I think this commit is the reason it passes today:

commit c28792777836078a7de4a7903c224838aac6f670 (rystsov/fast-raft-availability)
Author: Denis Rystsov <denis@vectorized.io>
Date:   Fri May 6 00:34:06 2022 -0700

    ducky: make raft_availability run faster
    
    Replaces ad hoc ping-pong check with ping-pong client. The client
    maintains a connection so when we invoke ping-pong 128 times it
    initializes a producer (init_producer_id) only once.
    
    The change reduces test_follower_isolation execution time from 6
    minutes to 30 seconds.

That changed the test to initialize the client once, very soon after isolating the node, and then use the same client throughout. So it isn't trying to allocate IDs again. It will only have an issue if the test runs slowly enough to delay the producer setup until after the election timeout.
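
A rough sketch of the difference that commit makes (class and function names are assumptions, not the actual ducky helpers): a long-lived ping-pong client pays the init_producer_id cost exactly once, right after isolation, instead of once per round.

    class PingPongClient:
        def __init__(self, make_producer):
            # init_producer_id happens here, exactly once, when the client
            # is constructed shortly after the node is isolated.
            self._producer = make_producer()

        def ping_pong(self, topic):
            # Subsequent rounds reuse the same producer id.
            self._producer.produce(topic, key="tkey", value="v")

    def ping_pong_ad_hoc(make_producer, topic):
        # The pre-commit ad hoc check: a fresh producer (and a fresh
        # init_producer_id request) for every single round.
        make_producer().produce(topic, key="tkey", value="v")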

For the issue to happen, the client's startup needs to be delayed far enough for the isolated node to have set leader to null for id_allocator, but not so far that the client sees cluster metadata that excludes the isolated node (in the latter case it would just not send the init_producer_id to the isolated node to begin with).
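
Stated as a condition (purely illustrative; all three parameters are assumptions measured from the moment of failure injection):

    def first_init_pid_hits_bad_window(client_start_delay,
                                       election_timeout,
                                       metadata_excludes_isolated_node):
        # Reproduces only if the first init_producer_id lands after the
        # isolated node nulled its cached id_allocator leader, but before the
        # client's metadata stops routing requests to that node.
        return election_timeout < client_start_delay < metadata_excludes_isolated_node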

In the failure, franz-go is in use -- the test more recently uses rdkafka (via python bindings). I think they may have different retry behaviors, as I can get this issue to reproduce with rpk (by hacking the test to wait 1.5s after failure injection, and using an RPK producer), but not with the rdkafka producer (where I just insert the sleep).
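
The reproduction recipe above, as a rough ducktape-style sketch (failure_injector, rpk and their methods are placeholders, not the real RaftAvailabilityTest code):

    import time

    def reproduce_follower_isolation_timeout(failure_injector, rpk, topic):
        # 1. Isolate the follower, as the test already does.
        failure_injector.isolate_follower()

        # 2. Wait ~1.5s so the isolated node's election timeout fires and it
        #    sets its cached id_allocator leader to null.
        time.sleep(1.5)

        # 3. The first produce triggers init_producer_id; if it is routed to
        #    the isolated node, the client retries not_coordinator until the
        #    60s delivery timeout (and the test's subprocess timeout) expires.
        rpk.produce(topic, key="tkey", msg="value")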

@jcsp
Contributor

jcsp commented Jul 26, 2022

I don't think this test is unstable in practice (it hasn't failed in 30 days and I couldn't get it to fail with strategically placed sleeps), so I'm closing this ticket and moving the discussion to #5648.

@jcsp jcsp closed this as completed Jul 26, 2022