Failure in rptest.tests.node_resize_test/NodeResizeTest/test_node_resize #4382

Closed · BenPope opened this issue Apr 22, 2022 · 2 comments

BenPope (Member) commented Apr 22, 2022

The failure is in: https://buildkite.com/redpanda/redpanda/builds/9271#e6faefd1-6dd7-414a-8941-f5fde026c3f6

recovery append entries error: raft group does not exist on target broker

[INFO  - 2022-04-21 18:13:29,233 - runner_client - log - lineno:278]: RunnerClient: rptest.tests.node_resize_test.NodeResizeTest.test_node_resize: FAIL: <BadLogLines nodes=docker_n_6(1) example="ERROR 2022-04-21 18:13:17,383 [shard 1] raft - [follower: {id: {1}, revision: {20}}] [group_id:1, {kafka/test/0}] - recovery_stm.cc:436 - recovery append entries error: raft group does not exist on target broker">
Traceback (most recent call last):
  File "/usr/local/lib/python3.9/dist-packages/ducktape/tests/runner_client.py", line 135, in run
    data = self.run_test()
  File "/usr/local/lib/python3.9/dist-packages/ducktape/tests/runner_client.py", line 227, in run_test
    return self.test_context.function(self.test)
  File "/root/tests/rptest/services/cluster.py", line 47, in wrapped
    self.redpanda.raise_on_bad_logs(allow_list=log_allow_list)
  File "/root/tests/rptest/services/redpanda.py", line 853, in raise_on_bad_logs
    raise BadLogLines(bad_lines)
rptest.services.redpanda.BadLogLines: <BadLogLines nodes=docker_n_6(1) example="ERROR 2022-04-21 18:13:17,383 [shard 1] raft - [follower: {id: {1}, revision: {20}}] [group_id:1, {kafka/test/0}] - recovery_stm.cc:436 - recovery append entries error: raft group does not exist on target broker">
TRACE 2022-04-21 18:13:17,376 [shard 1] raft - [group_id:1, {kafka/test/0}] consensus.cc:263 - Append entries response: {node_id: {id: {1}, revision: {20}}, target_node_id{id: {2}, revision: {20}}, group: {1}, term:{1}, last_dirty_log_index:{93}, last_flushed_log_index:{93}, last_term_base_offset:{-9223372036854775808}, result: failure}
TRACE 2022-04-21 18:13:17,376 [shard 1] raft - [group_id:1, {kafka/test/0}] consensus.cc:342 - Updated node {id: {1}, revision: {20}} last committed log index: 93
TRACE 2022-04-21 18:13:17,376 [shard 1] raft - [group_id:1, {kafka/test/0}] consensus.cc:374 - Starting recovery process for {id: {1}, revision: {20}} - current reply: {node_id: {id: {1}, revision: {20}}, target_node_id{id: {2}, revision: {20}}, group: {1}, term:{1}, last_dirty_log_index:{93}, last_flushed_log_index:{93}, last_term_base_offset:{-9223372036854775808}, result: failure}
TRACE 2022-04-21 18:13:17,377 [shard 1] raft - [follower: {id: {1}, revision: {20}}] [group_id:1, {kafka/test/0}] - recovery_stm.cc:221 - Reading batches, starting from: 94
TRACE 2022-04-21 18:13:17,377 [shard 1] storage - readers_cache.cc:88 - {kafka/test/0} - trying to get reader for: {start_offset:{94}, max_offset:{9223372036854775807}, min_bytes:1, max_bytes:524288, type_filter:nullopt, first_timestamp:nullopt}
TRACE 2022-04-21 18:13:17,378 [shard 1] storage - readers_cache.cc:117 - {kafka/test/0} - reader cache miss for: {start_offset:{94}, max_offset:{9223372036854775807}, min_bytes:1, max_bytes:524288, type_filter:nullopt, first_timestamp:nullopt}
TRACE 2022-04-21 18:13:17,378 [shard 1] storage - readers_cache.cc:75 - {kafka/test/0} - adding reader [0,95]
TRACE 2022-04-21 18:13:17,379 [shard 1] raft - [follower: {id: {1}, revision: {20}}] [group_id:1, {kafka/test/0}] - recovery_stm.cc:237 - Read batches in range [94,95] for recovery
TRACE 2022-04-21 18:13:17,379 [shard 1] storage - readers_cache.cc:309 - {kafka/test/0} - removing reader: [0,95] lower_bound: 96
WARN  2022-04-21 18:13:17,383 [shard 1] raft - [group_id:1, {kafka/test/0}] consensus.h:507 - received append_entries_recovery reply from the node where ntp {kafka/test/0} does not exists
ERROR 2022-04-21 18:13:17,383 [shard 1] raft - [follower: {id: {1}, revision: {20}}] [group_id:1, {kafka/test/0}] - recovery_stm.cc:436 - recovery append entries error: raft group does not exist on target broker
TRACE 2022-04-21 18:13:17,384 [shard 1] raft - [follower: {id: {1}, revision: {20}}] [group_id:1, {kafka/test/0}] - recovery_stm.cc:536 - Finished recovery
WARN  2022-04-21 18:13:17,384 [shard 1] raft - [group_id:1, {kafka/test/0}] consensus.cc:496 - Node {id: {1}, revision: {20}} recovery failed - boost::wrapexcept<std::__1::system_error> (error raft::errc:21, raft group does not exist on target broker)
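For context on how this surfaced as a test failure: the traceback above shows the @cluster wrapper calling redpanda.raise_on_bad_logs(allow_list=log_allow_list) after the test body runs. A simplified, hypothetical re-implementation of that check (the real logic lives in rptest/services/redpanda.py and is more involved) looks roughly like this; tests that expect a given ERROR line pass a matching pattern in their allow list:

```python
# Simplified sketch of the post-test log scan seen in the traceback above;
# not the actual rptest implementation.
import re

class BadLogLines(Exception):
    def __init__(self, bad_lines):
        super().__init__(f"<BadLogLines example={bad_lines[0]!r}>")

def raise_on_bad_logs(log_lines, allow_list=None):
    """Raise BadLogLines for any ERROR line not matched by an allow-list pattern."""
    patterns = [re.compile(p) for p in (allow_list or [])]
    bad = [
        line
        for line in log_lines
        if "ERROR" in line and not any(p.search(line) for p in patterns)
    ]
    if bad:
        raise BadLogLines(bad)

# A test expecting this message would allow-list it, e.g.:
ALLOW_LIST = [r"raft group does not exist on target broker"]
```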
ajfabbri (Contributor) commented Jul 13, 2022

Related to #5076. IMO, we should establish an invariant that RPCs can only be sent to nodes in the current cluster quorum. Combined with proper node-id handling, this should eliminate the "sleep and hope it works" code around the raft group does not exist on target broker log message. Instead, we can hold a single "quorum changed" abort source that fires whenever the set of nodes in the cluster changes. Timeouts like this sleep() can then be eliminated; RPC heartbeats and socket errors can detect nodes leaving and joining.
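A minimal sketch of the proposed pattern, in Python for brevity (Redpanda itself is C++/Seastar): QuorumWatch and every name below are hypothetical, not Redpanda's API. The idea is that recovery code waits on a membership-change signal instead of sleeping for a fixed interval and hoping the target node has appeared.

```python
# Hypothetical sketch, not Redpanda code: a single "quorum changed" signal
# that wakes waiters when cluster membership changes, replacing blind sleeps.
import threading

class QuorumWatch:
    """Tracks the set of node ids in the quorum and notifies waiters on change."""

    def __init__(self, members):
        self._cond = threading.Condition()
        self._members = set(members)
        self._epoch = 0

    def update(self, members):
        # Called by membership/heartbeat machinery when a node joins or leaves.
        with self._cond:
            self._members = set(members)
            self._epoch += 1
            self._cond.notify_all()

    def contains(self, node_id):
        with self._cond:
            return node_id in self._members

    def wait_for_change(self, timeout=None):
        # Block until membership changes (or timeout) instead of sleeping.
        # Returns False on timeout, True if a change was observed.
        with self._cond:
            epoch = self._epoch
            return self._cond.wait_for(lambda: self._epoch != epoch, timeout)

# Usage: before sending a recovery RPC to a target, check membership; if the
# target is absent, wait for the quorum to change rather than retrying blindly.
watch = QuorumWatch({1, 2, 3})
if not watch.contains(4):
    watch.wait_for_change(timeout=5.0)
```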

ajfabbri (Contributor) commented:

This error is fairly normal today during partition movement, and the severity of the log message was lowered in PR #4561. Closing.

This issue was closed.