test: Failure in rptest.tests.raft_availability_test.RaftAvailabilityTest.test_follower_isolation #4602

abhijat · 2022-05-06T11:00:42Z

https://buildkite.com/redpanda/redpanda/builds/9821#58ccc865-daf9-4429-bbe0-563e8835c70a/1527-7591
seen during build of #4404

rpk produce timed out after a minute, seems similar to #4360 which is closed.

test_id:    rptest.tests.raft_availability_test.RaftAvailabilityTest.test_follower_isolation

RpkException('command /var/lib/buildkite-agent/builds/buildkite-amd64-builders-i-015916b4485b41996-1/redpanda/redpanda/vbuild/debug/clang/dist/local/redpanda/bin/rpk topic --brokers docker-rp-1:9092,docker-rp-19:9092,docker-rp-14:9092 produce --key tkey -z none --delivery-timeout 60s -f %v topic-gdffgawotc timed out')
Traceback (most recent call last):
  File "/root/tests/rptest/clients/rpk.py", line 479, in _execute
    output, error = p.communicate(input=stdin, timeout=timeout)
  File "/usr/lib/python3.9/subprocess.py", line 1134, in communicate
    stdout, stderr = self._communicate(input, endtime, timeout)
  File "/usr/lib/python3.9/subprocess.py", line 1980, in _communicate
    self._check_timeout(endtime, orig_timeout, stdout, stderr)
  File "/usr/lib/python3.9/subprocess.py", line 1178, in _check_timeout
    raise TimeoutExpired(
subprocess.TimeoutExpired: Command '['/var/lib/buildkite-agent/builds/buildkite-amd64-builders-i-015916b4485b41996-1/redpanda/redpanda/vbuild/debug/clang/dist/local/redpanda/bin/rpk', 'topic', '--brokers', 'docker-rp-1:9092,docker-rp-19:9092,docker-rp-14:9092', 'produce', '--key', 'tkey', '-z', 'none', '--delivery-timeout', '60s', '-f', '%v', 'topic-gdffgawotc']' timed out after 60.5 seconds

The text was updated successfully, but these errors were encountered:

piyushredpanda · 2022-05-18T23:37:55Z

I wonder if these timeout ones are due to resource contention with dockerized runs and would just go away if these were run on CDT? WDYT, @VadimPlh ?

VadimPlh · 2022-07-25T10:21:29Z

In log I see only 2 processing name:produce, key:0, version:7 for rpk

But as I understand before it I should see producer_id request without error.
On first node I see sending 22:init producer_id response {throttle_time_ms=0 error_code={ error_code: none [0] } producer_id=1 producer_epoch=0} so it is for first produce request for rpk
on second node I see init producer_id response {throttle_time_ms=0 error_code={ error_code: none [0] } producer_id=2 producer_epoch=0} for second produce request

After it I see a lot of init producer_id response {throttle_time_ms=0 error_code={ error_code: not_coordinator [16] } producer_id=-1 producer_epoch=0} during ~1 min on first node
And finally I see
init producer_id response {throttle_time_ms=0 error_code={ error_code: none [0] } producer_id=4 producer_epoch=0} but As I understand it was too late

jcsp · 2022-07-26T11:49:08Z

The eventual success of the producer id allocation is happening after the test fails, and part of teardown is de-isolating the node.

docker-rp-1 is the isolated node, and it is sending these not_coordinator responses. The client treats it as a retryable error, and keeps trying to ask the same broker to initialize the producer id.

docker-rp-1 was isolated about the same time as the id_allocator partition got its leadership, so it doesn't have the leadership info in it's cache: it's hitting the "failed to allocate pid" path in init_producer_id.cc.

docker-rp-1 initially knows the leadership of the id_allocator partition (06:04:58,558), but it sets it to null when it reaches its election timeout after isolation (06:05:05,765). This all happens before the producer sends its first request for id (06:05:06,831).

So why isn't this happening all the time? I think this commit is the reason it passes today:

commit c28792777836078a7de4a7903c224838aac6f670 (rystsov/fast-raft-availability)
Author: Denis Rystsov <denis@vectorized.io>
Date:   Fri May 6 00:34:06 2022 -0700

    ducky: make raft_availability run faster
    
    Replaces ad hoc ping-pong check with ping-pong client. The client
    maintains a connection so when we invoke ping-pong 128 times it
    initializes a producer (init_producer_id) only once.
    
    The change reduces test_follower_isolation execution time from 6
    minutes to 30 seconds.

That changed the test to initialize the client once, very soon after isolating the node, and then use the same client throughout. So it isn't trying to allocate IDs again. It will only have an issue if the test runs slowly enough to delay the producer setup until after the election timeout.

For the issue to happen, the client's startup needs to be delayed far enough for the isolated node to have set leader to null for id_allocator, but not so far that the client sees cluster metadata that excludes the isolated node (in the latter case it would just not send the init_producer_id to the isolated node to begin with).

In the failure, franz-go is in use -- the test more recently uses rdkafka (via python bindings). I think they may have different retry behaviors, as I can get this issue to reproduce with rpk (by hacking the test to wait 1.5s after failure injection, and using an RPK producer), but not with the rdkafka producer (where I just insert the sleep).

jcsp · 2022-07-26T11:54:26Z

I don't think this test is unstable in practice (hasn't failed in 30 days and I couldn't get it to fail with strategically placed sleeps), so closing this ticket and moving discussion to #5648

abhijat added kind/bug Something isn't working ci-failure labels May 6, 2022

abhijat mentioned this issue May 6, 2022

tests: Add random action injector #4404

Merged

piyushredpanda assigned VadimPlh May 16, 2022

mmedenjak added area/raft area/tests labels Jul 5, 2022

jcsp mentioned this issue Jul 26, 2022

Isolated Redpanda node hangs client trying to allocate producer ID #5648

Closed

jcsp closed this as completed Jul 26, 2022

This issue was closed.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

test: Failure in rptest.tests.raft_availability_test.RaftAvailabilityTest.test_follower_isolation #4602

test: Failure in rptest.tests.raft_availability_test.RaftAvailabilityTest.test_follower_isolation #4602

abhijat commented May 6, 2022

piyushredpanda commented May 18, 2022

VadimPlh commented Jul 25, 2022

jcsp commented Jul 26, 2022

jcsp commented Jul 26, 2022 •

edited

Loading

test: Failure in rptest.tests.raft_availability_test.RaftAvailabilityTest.test_follower_isolation #4602

test: Failure in rptest.tests.raft_availability_test.RaftAvailabilityTest.test_follower_isolation #4602

Comments

abhijat commented May 6, 2022

piyushredpanda commented May 18, 2022

VadimPlh commented Jul 25, 2022

jcsp commented Jul 26, 2022

jcsp commented Jul 26, 2022 • edited Loading

jcsp commented Jul 26, 2022 •

edited

Loading