Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

DescribeGroups failed in ConsumerGroupTest.test_dead_group_recovery.static_members=True #5079

Closed
ztlpn opened this issue Jun 9, 2022 · 20 comments · Fixed by #5610 or #6499
Closed

Comments

@ztlpn
Copy link
Contributor

ztlpn commented Jun 9, 2022

https://buildkite.com/redpanda/redpanda/builds/11093#018144a0-4373-4133-b2d2-09624ddc39bd

Module: rptest.tests.consumer_group_test
Class:  ConsumerGroupTest
Method: test_dead_group_recovery
Arguments:
{
  "static_members": true
}
RpkException('command /var/lib/buildkite-agent/builds/buildkite-amd64-xfs-builders-i-03b8eb9dbfa8ff623-1/redpanda/redpanda/vbuild/release/clang/dist/local/redpanda/bin/rpk group --brokers docker-rp-17:9092,docker-rp-8:9092,docker-rp-7:9092 describe test-gr-1 returned 1, output: all 1 DescribeGroups request failures, first error: Kafka replied that group test-gr-1 has broker coordinator 3, but did not reply with that broker in the broker list\n', '')
Traceback (most recent call last):
  File "/usr/local/lib/python3.9/dist-packages/ducktape/tests/runner_client.py", line 135, in run
    data = self.run_test()
  File "/usr/local/lib/python3.9/dist-packages/ducktape/tests/runner_client.py", line 227, in run_test
    return self.test_context.function(self.test)
  File "/usr/local/lib/python3.9/dist-packages/ducktape/mark/_mark.py", line 476, in wrapper
    return functools.partial(f, *args, **kwargs)(*w_args, **w_kwargs)
  File "/root/tests/rptest/services/cluster.py", line 35, in wrapped
    r = f(self, *args, **kwargs)
  File "/root/tests/rptest/tests/consumer_group_test.py", line 360, in test_dead_group_recovery
    wait_until(group_is_dead, 30, 2)
  File "/usr/local/lib/python3.9/dist-packages/ducktape/utils/util.py", line 53, in wait_until
    raise e
  File "/usr/local/lib/python3.9/dist-packages/ducktape/utils/util.py", line 44, in wait_until
    if condition():
  File "/root/tests/rptest/tests/consumer_group_test.py", line 351, in group_is_dead
    rpk_group = rpk.group_describe(group)
  File "/root/tests/rptest/clients/rpk.py", line 383, in group_describe
    rpk_group = try_describe_group(group)
  File "/root/tests/rptest/clients/rpk.py", line 352, in try_describe_group
    out = self._run_group(cmd)
  File "/root/tests/rptest/clients/rpk.py", line 425, in _run_group
    return self._execute(cmd, stdin=stdin, timeout=timeout)
  File "/root/tests/rptest/clients/rpk.py", line 537, in _execute
    raise RpkException(
rptest.clients.rpk.RpkException: RpkException<command /var/lib/buildkite-agent/builds/buildkite-amd64-xfs-builders-i-03b8eb9dbfa8ff623-1/redpanda/redpanda/vbuild/release/clang/dist/local/redpanda/bin/rpk group --brokers docker-rp-17:9092,docker-rp-8:9092,docker-rp-7:9092 describe test-gr-1 returned 1, output: all 1 DescribeGroups request failures, first error: Kafka replied that group test-gr-1 has broker coordinator 3, but did not reply with that broker in the broker list
>
@bharathv
Copy link
Contributor

bharathv commented Jun 9, 2022

@ztlpn
Copy link
Contributor Author

ztlpn commented Jun 10, 2022

@LenaAn
Copy link
Contributor

LenaAn commented Jun 20, 2022

Also with static_members=False

From PandaResults:

Unstable test: ConsumerGroupTest.test_dead_group_recovery.static_members=False (1/66 runs failed)
  Latest failure at 2022-06-19T07:32:38.146Z: RpkException('command /var/lib/buildkite-agent/builds/arm64-xfs-builders-i-0135eef11f8192030-1/redpanda/redpanda/vbuild/release/clang/dist/local/redpanda/bin/rpk group --brokers docker-rp-17:9092,docker-rp-21:9092,docker-rp-16:9092 describe test-gr-1 returned 1, output: all 1 DescribeGroups request failures, first error: Kafka replied that group test-gr-1 has broker coordinator 3, but did not reply with that broker in the broker list\n', '')
             in job https://buildkite.com/redpanda/redpanda/builds/11472#01817a96-21a4-4c30-be02-e67c6070c656

@dlex
Copy link
Contributor

dlex commented Jul 7, 2022

#5275 is a duplicate of this one

@jcsp
Copy link
Contributor

jcsp commented Jul 8, 2022

@LenaAn
Copy link
Contributor

LenaAn commented Jul 13, 2022

@dlex
Copy link
Contributor

dlex commented Jul 16, 2022

The test is run on 3 nodes. The sequence is: create topic - produce and consume in a group - delete topic - restart all nodes - * - create topic. At the moment * rpk group describe fails with the error.

After the restart of the nodes, there are two outcomes possible.

  1. Successful:
note over node1, node2, node3 : {kafka/__consumer_offsets/0} Recovered persistent state from kvstore: voted for: 2
...
test -> rpk : describe the-group --brokers 1,2,3
rpk -> node1 : FindCoordinator(the-group)
rpk <-- node1 : coordinator is node 2
rpk -> node1 : MetadataRequest
rpk <-- node1 : nodes: {1,2,3}, controller: 1, {kafka/__consumer_offsets/0} leader: 2
rpk -> node3 : MetadataRequest
rpk <-- node3 : nodes: {1,2,3}, controller: 1, {kafka/__consumer_offsets/0} leader: 2
test <-- rpk : coordinator: 2, state: dead

svg

  1. Failure:
note over node1 : {kafka/__consumer_offsets/0} Recovered persistent state from kvstore: voted for: 1
note over node2 : {kafka/__consumer_offsets/0} Recovered persistent state from kvstore: voted for: 2 or 3
note over node3 : {kafka/__consumer_offsets/0} Recovered persistent state from kvstore: voted for: 3
...
node3 -> node2 : {kafka/__consumer_offsets/0} Vote request 
node3 -> node1 : {kafka/__consumer_offsets/0} Vote request
node3 <-- node1 : vote_granted: 0 or 1
node3 <-- node2 : vote_granted: 1
note over node3 : becoming the leader
test -> rpk : describe the-group --brokers 1,2,3
rpk -> node2 : MetadataRequest
rpk <-- node2 : nodes: {1,2}, controller: 1, {kafka/__consumer_offsets/0} leader: 3
rpk -> node1 : FindCoordinator(the-group)
rpk <-- node1 : coordinator is node 3
rpk -> node2 : MetadataRequest
rpk <-- node2 : nodes: {1,2}, controller: 1, {kafka/__consumer_offsets/0} leader: 3
rpk -> node1 : MetadataRequest
rpk <-- node1 : nodes: {1,2,3}, controller: 1, {kafka/__consumer_offsets/0} leader: 3
test <-- rpk : "Kafka replied that group test-gr-1 has broker coordinator 3, but did not reply with that broker in the broker list"

svg

In each case, role of each of the nodes is the same.

In the failure case, node2 does not list node3 because its node_state has is_alive: false.

@dotnwat
Copy link
Member

dotnwat commented Jul 17, 2022

Any thoughts on what the solution is? If it is an issue with test, redpanda, or kafka protocol in general?

@VladLazar
Copy link
Contributor

VladLazar commented Jul 18, 2022

I can't see the root cause, but I've experienced issues with the correctness of is_alive in the past (internal Slack link). Might be related somehow. I'll try and run my repro of that again.

@piyushredpanda
Copy link
Contributor

@dlex is OOO for a week, fyi.

@ZeDRoman
Copy link
Contributor

jcsp added a commit to jcsp/redpanda that referenced this issue Jul 25, 2022
If coordinator isn't available, we should retry, not throw.

Fixes redpanda-data#5079
@jcsp
Copy link
Contributor

jcsp commented Jul 25, 2022

The rpk group_describe function is meant to be a retry loop, but wasn't correctly handling the case of the coordinator being unavailable -- I hit the same thing with scale testing, and had an un-merged fix handy. Opened #5610 with it.

I haven't dissected the logs from this failure but I wouldn't be surprised if this became more likely when we recently tweaked the leader balancer to move more partitions at once.

@VadimPlh
Copy link
Contributor

The rpk group_describe function is meant to be a retry loop, but wasn't correctly handling the case of the coordinator being unavailable -- I hit the same thing with scale testing, and had an un-merged fix handy. Opened #5610 with it.

I haven't dissected the logs from this failure but I wouldn't be surprised if this became more likely when we recently tweaked the leader balancer to move more partitions at once.

I think it is the same issue
link

RpkException('command /var/lib/buildkite-agent/builds/arm64-xfs-builders-i-07e7288186208f090-1/redpanda/redpanda/vbuild/release/clang/dist/local/redpanda/bin/rpk group --brokers docker-rp-27:9092,docker-rp-17:9092,docker-rp-28:9092 describe test-gr-1 returned 1, output: all 1 DescribeGroups request failures, first error: COORDINATOR_NOT_AVAILABLE: The coordinator is not available.\n', '')

@dlex
Copy link
Contributor

dlex commented Jul 25, 2022

@jcsp, ok let's see if that helps. Meanwhile I'm trying to run the older code with retry in the test itself to verify whether the coordinator reappears in the list in case of the failure.

@dlex
Copy link
Contributor

dlex commented Jul 25, 2022

In my local tests, a single retry was enough to recover in all 3 instances where I have reproduced the error.

@twmb
Copy link
Contributor

twmb commented Aug 5, 2022

@VladLazar
Copy link
Contributor

@jcsp
Copy link
Contributor

jcsp commented Aug 31, 2022

I think I ended up the assignee here because of my previous change to the describe groups function, but am not actively working on it.

mmaslankaprv added a commit to mmaslankaprv/redpanda that referenced this issue Sep 22, 2022
When describing a group in a `test_dead_group` recovery test we must
allow for RPK to return an error and then retry.

Fixes: redpanda-data#5079

Signed-off-by: Michal Maslanka <michal@redpanda.com>
@mmedenjak mmedenjak assigned mmaslankaprv and unassigned dlex and piyushredpanda Sep 22, 2022
@mmedenjak mmedenjak added the kind/bug Something isn't working label Sep 22, 2022
ballard26 pushed a commit to ballard26/redpanda that referenced this issue Sep 27, 2022
When describing a group in a `test_dead_group` recovery test we must
allow for RPK to return an error and then retry.

Fixes: redpanda-data#5079

Signed-off-by: Michal Maslanka <michal@redpanda.com>
@BenPope
Copy link
Member

BenPope commented Nov 17, 2022

ztlpn pushed a commit to ztlpn/redpanda that referenced this issue Nov 24, 2022
When describing a group in a `test_dead_group` recovery test we must
allow for RPK to return an error and then retry.

Fixes: redpanda-data#5079

Signed-off-by: Michal Maslanka <michal@redpanda.com>
(cherry picked from commit 47711c8)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet