
Failure in RaftAvailabilityTest.test_follower_isolation #3098

Closed
jcsp opened this issue Nov 29, 2021 · 4 comments · Fixed by #3132
Labels: area/raft, ci-failure, kind/bug

Comments

jcsp (Contributor) commented Nov 29, 2021

Seen twice:
https://buildkite.com/vectorized/redpanda/builds/4730#4a3e6431-118d-453a-9075-109039c2f6b9
https://buildkite.com/vectorized/redpanda/builds/4811#57c8a844-486a-4c17-b4a4-13c95a3f89db


Traceback (most recent call last):
  File "/usr/local/lib/python3.8/dist-packages/ducktape/tests/runner_client.py", line 135, in run
    data = self.run_test()
  File "/usr/local/lib/python3.8/dist-packages/ducktape/tests/runner_client.py", line 215, in run_test
    return self.test_context.function(self.test)
  File "/root/tests/rptest/tests/raft_availability_test.py", line 505, in test_follower_isolation
    self._ping_pong()
  File "/root/tests/rptest/tests/raft_availability_test.py", line 181, in _ping_pong
    offset = rpk.produce(self.topic, "tkey", payload, timeout=5)
  File "/root/tests/rptest/clients/rpk.py", line 101, in produce
    out = self._run_topic(cmd, stdin=msg, timeout=timeout)
  File "/root/tests/rptest/clients/rpk.py", line 152, in _run_topic
    return self._execute(cmd, stdin=stdin, timeout=timeout)
  File "/root/tests/rptest/clients/rpk.py", line 205, in _execute
    raise RpkException(f"command {' '.join(cmd)} timed out")
rptest.clients.rpk.RpkException: RpkException<command /var/lib/buildkite-agent/builds/buildkite-arm64-builders-i-001dca3e9f55e6fdc-1/vectorized/redpanda/vbuild/release/clang/dist/local/redpanda/bin/rpk topic --brokers docker_n_24:9092,docker_n_1:9092,docker_n_33:9092 produce --key tkey -z none --delivery-timeout 4.5s -f %v topic-fzpzdxqqbm timed out>

This is the test trying to verify availability of the cluster after isolation, and timing out while doing so.

The delivery timeout is 4.5s, which is tight enough that this could just be a slow election when the leader fails; I haven't gone over it forensically.
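
For context, the availability probe that times out here is just a single produce with a tight timeout. A minimal sketch of that pattern, assuming the shape of `_ping_pong` from the traceback above (the helper below, its payload, and its error handling are illustrative guesses, not the actual test code):

```python
from rptest.clients.rpk import RpkException  # RpkException appears in the traceback

# Illustrative sketch only: approximates the availability probe seen in the
# traceback (rpk.produce and its arguments are taken from the traceback; the
# payload and error handling here are assumptions).
def ping_pong(rpk, topic, timeout_s=5):
    payload = "x" * 16  # assumed small payload
    try:
        # A single produce; if no partition leader is reachable within rpk's
        # delivery timeout, rpk fails and rptest raises RpkException.
        return rpk.produce(topic, "tkey", payload, timeout=timeout_s)
    except RpkException as e:
        raise AssertionError(f"cluster unavailable: produce timed out ({e})")
```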

jcsp added the kind/bug, area/raft, and ci-failure labels on Nov 29, 2021

jcsp (Contributor, Author) commented Nov 29, 2021

Two things are strange here:

  1. The leader of the partition under test is not the node that is killed, but it steps down as leader a few seconds later anyway, before quickly re-electing itself. I don't think this is why rpk is timing out, though.
  2. The controller leader /was/ on the node that was killed, but long after another node has become leader, there are still "timed out when refreshing cluster health state" messages on another node that is servicing Kafka metadata requests:
docker_n_1/redpanda.log:INFO  2021-11-29 18:08:50,556 [shard 1] raft - [group_id:1, {kafka/topic-fzpzdxqqbm/0}] vote_stm.cc:252 - became the leader term:1
docker_n_1/redpanda.log:INFO  2021-11-29 18:08:57,870 [shard 0] raft - [group_id:0, {redpanda/controller/0}] vote_stm.cc:252 - became the leader term:2
docker_n_1/redpanda.log:INFO  2021-11-29 18:09:09,116 [shard 1] raft - [group_id:1, {kafka/topic-fzpzdxqqbm/0}] consensus.cc:134 - Stepping down as leader in term 1, dirty offset 12
docker_n_1/redpanda.log:INFO  2021-11-29 18:09:10,868 [shard 1] raft - [group_id:1, {kafka/topic-fzpzdxqqbm/0}] vote_stm.cc:252 - became the leader term:2
docker_n_24/redpanda.log:INFO  2021-11-29 18:08:56,368 [shard 0] cluster - health_monitor_backend.cc:263 - timed out when refreshing cluster health state, falling back to previous cluster health snapshot
docker_n_24/redpanda.log:INFO  2021-11-29 18:08:58,730 [shard 0] cluster - health_monitor_backend.cc:263 - timed out when refreshing cluster health state, falling back to previous cluster health snapshot
docker_n_24/redpanda.log:INFO  2021-11-29 18:09:00,911 [shard 0] cluster - health_monitor_backend.cc:263 - timed out when refreshing cluster health state, falling back to previous cluster health snapshot
docker_n_24/redpanda.log:INFO  2021-11-29 18:09:02,912 [shard 0] cluster - health_monitor_backend.cc:263 - timed out when refreshing cluster health state, falling back to previous cluster health snapshot
docker_n_24/redpanda.log:INFO  2021-11-29 18:09:04,914 [shard 0] cluster - health_monitor_backend.cc:263 - timed out when refreshing cluster health state, falling back to previous cluster health snapshot
docker_n_24/redpanda.log:INFO  2021-11-29 18:09:05,774 [shard 0] cluster - health_monitor_backend.cc:263 - timed out when refreshing cluster health state, falling back to previous cluster health snapshot
docker_n_24/redpanda.log:INFO  2021-11-29 18:09:09,226 [shard 0] cluster - health_monitor_backend.cc:263 - timed out when refreshing cluster health state, falling back to previous cluster health snapshot
docker_n_24/redpanda.log:INFO  2021-11-29 18:09:11,728 [shard 0] cluster - health_monitor_backend.cc:263 - timed out when refreshing cluster health state, falling back to previous cluster health snapshot
docker_n_24/redpanda.log:INFO  2021-11-29 18:09:13,729 [shard 0] cluster - health_monitor_backend.cc:263 - timed out when refreshing cluster health state, falling back to previous cluster health snapshot
docker_n_33/redpanda.log:INFO  2021-11-29 18:08:41,516 [shard 0] raft - [group_id:0, {redpanda/controller/0}] vote_stm.cc:252 - became the leader term:1
docker_n_33/redpanda.log:INFO  2021-11-29 18:08:53,918 [shard 0] raft - [group_id:0, {redpanda/controller/0}] consensus.cc:134 - Stepping down as leader in term 1, dirty offset 11

(grepping out of https://buildkite.com/vectorized/redpanda/builds/4811#57c8a844-486a-4c17-b4a4-13c95a3f89db logs)

jcsp (Contributor, Author) commented Nov 29, 2021

Stranger and stranger...

$ grep -e triggering.leadership.notifi -e refreshing.health.cache docker_n_24/redpanda.log 
DEBUG 2021-11-29 18:08:43,551 [shard 0] raft - [group_id:0, {redpanda/controller/0}] consensus.cc:2380 - triggering leadership notification with term: 1, new leader: {{id: {1}, revision: {0}}}
DEBUG 2021-11-29 18:08:43,784 [shard 0] cluster - health_monitor_backend.cc:184 - refreshing health cache, leader id: {1}
DEBUG 2021-11-29 18:08:50,657 [shard 1] raft - [group_id:1, {kafka/topic-fzpzdxqqbm/0}] consensus.cc:2380 - triggering leadership notification with term: 1, new leader: {{id: {2}, revision: {10}}}
DEBUG 2021-11-29 18:08:54,368 [shard 0] cluster - health_monitor_backend.cc:184 - refreshing health cache, leader id: {1}
DEBUG 2021-11-29 18:08:57,860 [shard 0] raft - [group_id:0, {redpanda/controller/0}] consensus.cc:2380 - triggering leadership notification with term: 2, new leader: {nullopt}
DEBUG 2021-11-29 18:08:58,009 [shard 0] raft - [group_id:0, {redpanda/controller/0}] consensus.cc:2380 - triggering leadership notification with term: 2, new leader: {{id: {2}, revision: {0}}}
DEBUG 2021-11-29 18:09:04,373 [shard 0] cluster - health_monitor_backend.cc:184 - refreshing health cache, leader id: {1}
DEBUG 2021-11-29 18:09:10,857 [shard 1] raft - [group_id:1, {kafka/topic-fzpzdxqqbm/0}] consensus.cc:2380 - triggering leadership notification with term: 2, new leader: {nullopt}
DEBUG 2021-11-29 18:09:10,917 [shard 1] raft - [group_id:1, {kafka/topic-fzpzdxqqbm/0}] consensus.cc:2380 - triggering leadership notification with term: 2, new leader: {{id: {2}, revision: {10}}}
DEBUG 2021-11-29 18:09:14,380 [shard 0] cluster - health_monitor_backend.cc:184 - refreshing health cache, leader id: {2}

Somehow health_monitor_backend is still trying to load health status from the old controller leader at 18:09:04,373, and only much later, at 18:09:14,380, does it finally send a request to the new controller leader. I don't see how this is happening: it seems like the _raft0->get_leader_id() call in the health monitor should always be up to date with the "triggering leadership notification" messages :-/ @mmaslankaprv does this make sense to you?

Aside from that ghost in the machine, I think there is also a more general issue with the way we handle timeouts when loading health info from the leader: that timeout should be reset whenever the leader changes.
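
As a rough illustration of that idea (plain Python asyncio, not the actual C++ health monitor; every name below is an assumption): the wait for a health report races against a leader-change notification as well as the wall-clock timeout, so a request stuck on a stale leader is abandoned as soon as leadership moves.

```python
import asyncio

# Illustrative sketch only: the refresh wait also wakes on a leader change,
# so a stale request does not have to run out its full timeout. All names
# here are assumptions, not Redpanda code.
async def wait_for_health_report(fetch_report, leader_changed: asyncio.Event,
                                 timeout_s: float = 10.0):
    report = asyncio.create_task(fetch_report())
    change = asyncio.create_task(leader_changed.wait())
    done, pending = await asyncio.wait(
        {report, change},
        timeout=timeout_s,
        return_when=asyncio.FIRST_COMPLETED,
    )
    for task in pending:
        task.cancel()
    if report in done:
        return report.result()  # got a status report from the leader
    if change in done:
        raise InterruptedError("leader changed, retry against the new leader")
    raise TimeoutError("health refresh timed out")
```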

jcsp added a commit to jcsp/redpanda that referenced this issue Nov 29, 2021
We should improve the health monitor updates
to not take as long, but in the meantime let's
make the test more tolerant.

Related: redpanda-data#3098

Signed-off-by: John Spray <jcs@vectorized.io>

jcsp (Contributor, Author) commented Dec 1, 2021

(revert timeout change in #3105 when underlying issue is fixed)

mmaslankaprv (Member) commented:

There are two problems here:

  1. We coalesce refresh requests with a semaphore; when the controller leader is isolated, we wait for the timeout to elapse before allowing the next request to proceed, which makes refreshes wait a long time before the new leader is actually asked for the status.
  2. We ask for the leader before waiting on the semaphore, so the request may be sent with stale leader information.

Going to fix that by aborting health requests pending to the old leader when leadership changes, and issuing a request to the new leader immediately.
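
A hedged sketch of that plan in the same illustrative Python style (none of these names are Redpanda's; the real code is the C++ health_monitor_backend): a lock still coalesces refreshes, but the leader is resolved only after the lock is acquired, and a leadership notification cancels the in-flight request so the retry targets the new leader right away.

```python
import asyncio

# Illustrative sketch only: assumed names, not the actual health_monitor_backend.
class HealthRefresher:
    def __init__(self, get_leader_id, fetch_status_from):
        self._get_leader_id = get_leader_id          # e.g. raft0 leader lookup
        self._fetch_status_from = fetch_status_from  # status RPC to a given node
        self._lock = asyncio.Lock()                  # coalesces concurrent refreshes
        self._inflight = None                        # request to the current leader
        self._aborted = False

    def on_leadership_changed(self):
        # Driven by the leadership notification: abort the request pending
        # against the old leader so the refresh retries immediately instead of
        # sitting out the full timeout (problem 1).
        self._aborted = True
        if self._inflight is not None and not self._inflight.done():
            self._inflight.cancel()

    async def refresh(self, timeout_s=5.0):
        async with self._lock:
            while True:
                # Resolve the leader only after acquiring the lock, so a queued
                # refresh never uses leader information captured before it
                # started waiting (problem 2).
                leader = self._get_leader_id()
                self._aborted = False
                self._inflight = asyncio.create_task(self._fetch_status_from(leader))
                try:
                    return await asyncio.wait_for(self._inflight, timeout_s)
                except asyncio.CancelledError:
                    if not self._aborted:
                        raise  # refresh() itself was cancelled, propagate
                    continue   # leadership changed: ask the new leader right away
                finally:
                    self._inflight = None
```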

This issue was closed.