
Failure in RaftAvailabilityTest.test_follower_isolation #3098

Closed
jcsp opened this issue Nov 29, 2021 · 4 comments · Fixed by #3132
Labels: area/raft, ci-failure, kind/bug

Comments

jcsp (Contributor) commented Nov 29, 2021

Seen twice:
https://buildkite.com/vectorized/redpanda/builds/4730#4a3e6431-118d-453a-9075-109039c2f6b9
https://buildkite.com/vectorized/redpanda/builds/4811#57c8a844-486a-4c17-b4a4-13c95a3f89db


Traceback (most recent call last):
  File "/usr/local/lib/python3.8/dist-packages/ducktape/tests/runner_client.py", line 135, in run
    data = self.run_test()
  File "/usr/local/lib/python3.8/dist-packages/ducktape/tests/runner_client.py", line 215, in run_test
    return self.test_context.function(self.test)
  File "/root/tests/rptest/tests/raft_availability_test.py", line 505, in test_follower_isolation
    self._ping_pong()
  File "/root/tests/rptest/tests/raft_availability_test.py", line 181, in _ping_pong
    offset = rpk.produce(self.topic, "tkey", payload, timeout=5)
  File "/root/tests/rptest/clients/rpk.py", line 101, in produce
    out = self._run_topic(cmd, stdin=msg, timeout=timeout)
  File "/root/tests/rptest/clients/rpk.py", line 152, in _run_topic
    return self._execute(cmd, stdin=stdin, timeout=timeout)
  File "/root/tests/rptest/clients/rpk.py", line 205, in _execute
    raise RpkException(f"command {' '.join(cmd)} timed out")
rptest.clients.rpk.RpkException: RpkException<command /var/lib/buildkite-agent/builds/buildkite-arm64-builders-i-001dca3e9f55e6fdc-1/vectorized/redpanda/vbuild/release/clang/dist/local/redpanda/bin/rpk topic --brokers docker_n_24:9092,docker_n_1:9092,docker_n_33:9092 produce --key tkey -z none --delivery-timeout 4.5s -f %v topic-fzpzdxqqbm timed out>

This is the test trying to verify availability of the cluster after isolation, and timing out while doing so.

The delivery timeout is 4.5s, which is tight enough that this could just be a slow election when the leader fails; I haven't gone over it forensically.
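
For context, the availability probe that times out here is just a single produce with a tight timeout. A minimal sketch of that pattern, assuming the shape of `_ping_pong` from the traceback above (the helper below, its payload, and its error handling are illustrative guesses, not the actual test code):

```python
from rptest.clients.rpk import RpkException  # RpkException appears in the traceback

# Illustrative sketch only: approximates the availability probe seen in the
# traceback (rpk.produce and its arguments are taken from the traceback; the
# payload and error handling here are assumptions).
def ping_pong(rpk, topic, timeout_s=5):
    payload = "x" * 16  # assumed small payload
    try:
        # A single produce; if no partition leader is reachable within rpk's
        # delivery timeout, rpk fails and rptest raises RpkException.
        return rpk.produce(topic, "tkey", payload, timeout=timeout_s)
    except RpkException as e:
        raise AssertionError(f"cluster unavailable: produce timed out ({e})")
```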

jcsp added the kind/bug, area/raft, and ci-failure labels on Nov 29, 2021

jcsp (Contributor, Author) commented Nov 29, 2021

Two things are strange here:

  1. The leader of the partition under test is not the node that is killed, but it steps down as leader a few seconds later anyway, before quickly re-electing itself. I don't think this is why rpk is timing out, though.
  2. The controller leader /was/ on the node that was killed, but long after another node has become leader, there are still "timed out when refreshing cluster health state" messages on another node that is servicing Kafka metadata requests:
docker_n_1/redpanda.log:INFO  2021-11-29 18:08:50,556 [shard 1] raft - [group_id:1, {kafka/topic-fzpzdxqqbm/0}] vote_stm.cc:252 - became the leader term:1
docker_n_1/redpanda.log:INFO  2021-11-29 18:08:57,870 [shard 0] raft - [group_id:0, {redpanda/controller/0}] vote_stm.cc:252 - became the leader term:2
docker_n_1/redpanda.log:INFO  2021-11-29 18:09:09,116 [shard 1] raft - [group_id:1, {kafka/topic-fzpzdxqqbm/0}] consensus.cc:134 - Stepping down as leader in term 1, dirty offset 12
docker_n_1/redpanda.log:INFO  2021-11-29 18:09:10,868 [shard 1] raft - [group_id:1, {kafka/topic-fzpzdxqqbm/0}] vote_stm.cc:252 - became the leader term:2
docker_n_24/redpanda.log:INFO  2021-11-29 18:08:56,368 [shard 0] cluster - health_monitor_backend.cc:263 - timed out when refreshing cluster health state, falling back to previous cluster health snapshot
docker_n_24/redpanda.log:INFO  2021-11-29 18:08:58,730 [shard 0] cluster - health_monitor_backend.cc:263 - timed out when refreshing cluster health state, falling back to previous cluster health snapshot
docker_n_24/redpanda.log:INFO  2021-11-29 18:09:00,911 [shard 0] cluster - health_monitor_backend.cc:263 - timed out when refreshing cluster health state, falling back to previous cluster health snapshot
docker_n_24/redpanda.log:INFO  2021-11-29 18:09:02,912 [shard 0] cluster - health_monitor_backend.cc:263 - timed out when refreshing cluster health state, falling back to previous cluster health snapshot
docker_n_24/redpanda.log:INFO  2021-11-29 18:09:04,914 [shard 0] cluster - health_monitor_backend.cc:263 - timed out when refreshing cluster health state, falling back to previous cluster health snapshot
docker_n_24/redpanda.log:INFO  2021-11-29 18:09:05,774 [shard 0] cluster - health_monitor_backend.cc:263 - timed out when refreshing cluster health state, falling back to previous cluster health snapshot
docker_n_24/redpanda.log:INFO  2021-11-29 18:09:09,226 [shard 0] cluster - health_monitor_backend.cc:263 - timed out when refreshing cluster health state, falling back to previous cluster health snapshot
docker_n_24/redpanda.log:INFO  2021-11-29 18:09:11,728 [shard 0] cluster - health_monitor_backend.cc:263 - timed out when refreshing cluster health state, falling back to previous cluster health snapshot
docker_n_24/redpanda.log:INFO  2021-11-29 18:09:13,729 [shard 0] cluster - health_monitor_backend.cc:263 - timed out when refreshing cluster health state, falling back to previous cluster health snapshot
docker_n_33/redpanda.log:INFO  2021-11-29 18:08:41,516 [shard 0] raft - [group_id:0, {redpanda/controller/0}] vote_stm.cc:252 - became the leader term:1
docker_n_33/redpanda.log:INFO  2021-11-29 18:08:53,918 [shard 0] raft - [group_id:0, {redpanda/controller/0}] consensus.cc:134 - Stepping down as leader in term 1, dirty offset 11

(grepping out of https://buildkite.com/vectorized/redpanda/builds/4811#57c8a844-486a-4c17-b4a4-13c95a3f89db logs)

jcsp (Contributor, Author) commented Nov 29, 2021

Stranger and stranger...

$ grep -e triggering.leadership.notifi -e refreshing.health.cache docker_n_24/redpanda.log 
DEBUG 2021-11-29 18:08:43,551 [shard 0] raft - [group_id:0, {redpanda/controller/0}] consensus.cc:2380 - triggering leadership notification with term: 1, new leader: {{id: {1}, revision: {0}}}
DEBUG 2021-11-29 18:08:43,784 [shard 0] cluster - health_monitor_backend.cc:184 - refreshing health cache, leader id: {1}
DEBUG 2021-11-29 18:08:50,657 [shard 1] raft - [group_id:1, {kafka/topic-fzpzdxqqbm/0}] consensus.cc:2380 - triggering leadership notification with term: 1, new leader: {{id: {2}, revision: {10}}}
DEBUG 2021-11-29 18:08:54,368 [shard 0] cluster - health_monitor_backend.cc:184 - refreshing health cache, leader id: {1}
DEBUG 2021-11-29 18:08:57,860 [shard 0] raft - [group_id:0, {redpanda/controller/0}] consensus.cc:2380 - triggering leadership notification with term: 2, new leader: {nullopt}
DEBUG 2021-11-29 18:08:58,009 [shard 0] raft - [group_id:0, {redpanda/controller/0}] consensus.cc:2380 - triggering leadership notification with term: 2, new leader: {{id: {2}, revision: {0}}}
DEBUG 2021-11-29 18:09:04,373 [shard 0] cluster - health_monitor_backend.cc:184 - refreshing health cache, leader id: {1}
DEBUG 2021-11-29 18:09:10,857 [shard 1] raft - [group_id:1, {kafka/topic-fzpzdxqqbm/0}] consensus.cc:2380 - triggering leadership notification with term: 2, new leader: {nullopt}
DEBUG 2021-11-29 18:09:10,917 [shard 1] raft - [group_id:1, {kafka/topic-fzpzdxqqbm/0}] consensus.cc:2380 - triggering leadership notification with term: 2, new leader: {{id: {2}, revision: {10}}}
DEBUG 2021-11-29 18:09:14,380 [shard 0] cluster - health_monitor_backend.cc:184 - refreshing health cache, leader id: {2}

Somehow health_monitor_backend is still trying to load health status from the old controller leader at 18:09:04,373, and only much later, at 18:09:14,380, does it finally send a request to the new controller leader. I don't see how this is happening: it seems like the _raft0->get_leader_id() call in the health monitor should always be up to date with the "triggering leadership notification" messages :-/ @mmaslankaprv does this make sense to you?

Aside from that ghost in the machine, I think there is also a more general issue with the way we handle timeouts when loading health info from the leader: that timeout should be reset whenever the leader changes.
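
As a rough illustration of that idea (plain Python asyncio, not the actual C++ health monitor; every name below is an assumption): the wait for a health report races against a leader-change notification as well as the wall-clock timeout, so a request stuck on a stale leader is abandoned as soon as leadership moves.

```python
import asyncio

# Illustrative sketch only: the refresh wait also wakes on a leader change,
# so a stale request does not have to run out its full timeout. All names
# here are assumptions, not Redpanda code.
async def wait_for_health_report(fetch_report, leader_changed: asyncio.Event,
                                 timeout_s: float = 10.0):
    report = asyncio.create_task(fetch_report())
    change = asyncio.create_task(leader_changed.wait())
    done, pending = await asyncio.wait(
        {report, change},
        timeout=timeout_s,
        return_when=asyncio.FIRST_COMPLETED,
    )
    for task in pending:
        task.cancel()
    if report in done:
        return report.result()  # got a status report from the leader
    if change in done:
        raise InterruptedError("leader changed, retry against the new leader")
    raise TimeoutError("health refresh timed out")
```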

jcsp added a commit to jcsp/redpanda that referenced this issue Nov 29, 2021
We should improve the health monitor updates
to not take as long, but in the meantime let's
make the test more tolerant.

Related: redpanda-data#3098

Signed-off-by: John Spray <jcs@vectorized.io>

jcsp (Contributor, Author) commented Dec 1, 2021

(revert timeout change in #3105 when underlying issue is fixed)

mmaslankaprv (Member) commented:

There are two problems here:

  1. We coalesce refresh requests with a semaphore; when the controller leader is isolated, we wait for the timeout to elapse before allowing the next request to proceed, which makes refreshes wait a long time before the new leader is actually asked for the status.
  2. We ask for the leader before waiting on the semaphore, so the request may be sent with stale leader information.

Going to fix that by aborting health requests pending to the old leader when leadership changes, and issuing a request to the new leader immediately.
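
A hedged sketch of that plan in the same illustrative Python style (none of these names are Redpanda's; the real code is the C++ health_monitor_backend): a lock still coalesces refreshes, but the leader is resolved only after the lock is acquired, and a leadership notification cancels the in-flight request so the retry targets the new leader right away.

```python
import asyncio

# Illustrative sketch only: assumed names, not the actual health_monitor_backend.
class HealthRefresher:
    def __init__(self, get_leader_id, fetch_status_from):
        self._get_leader_id = get_leader_id          # e.g. raft0 leader lookup
        self._fetch_status_from = fetch_status_from  # status RPC to a given node
        self._lock = asyncio.Lock()                  # coalesces concurrent refreshes
        self._inflight = None                        # request to the current leader
        self._aborted = False

    def on_leadership_changed(self):
        # Driven by the leadership notification: abort the request pending
        # against the old leader so the refresh retries immediately instead of
        # sitting out the full timeout (problem 1).
        self._aborted = True
        if self._inflight is not None and not self._inflight.done():
            self._inflight.cancel()

    async def refresh(self, timeout_s=5.0):
        async with self._lock:
            while True:
                # Resolve the leader only after acquiring the lock, so a queued
                # refresh never uses leader information captured before it
                # started waiting (problem 2).
                leader = self._get_leader_id()
                self._aborted = False
                self._inflight = asyncio.create_task(self._fetch_status_from(leader))
                try:
                    return await asyncio.wait_for(self._inflight, timeout_s)
                except asyncio.CancelledError:
                    if not self._aborted:
                        raise  # refresh() itself was cancelled, propagate
                    continue   # leadership changed: ask the new leader right away
                finally:
                    self._inflight = None
```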

This issue was closed.