Fixed unavailability during follower & leader isolation #2856
Conversation
Does this issue require the follower be network partitioned, or can it just be down? If just down, then this seems like something we should be able to cover in a ducktape test (there's RaftAvailabilityTest.test_one_node_down that might be a good starting point)
```diff
@@ -210,7 +210,7 @@ In order to make it possible new fields have to be added to
   // next index to send to this follower
   model::offset next_index;
   // timestamp of last append_entries_rpc call
-  clock_type::time_point last_append_timestamp;
+  clock_type::time_point last_sent_append_entries_req_timesptamp;
```
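The distinction the rename captures can be sketched as follows (a minimal illustration with made-up type and field names, not the actual Redpanda definitions): the follower state keeps separate timestamps for the last request *sent* and the last reply *received*, and only the latter should count as evidence that the follower is alive.

```cpp
#include <cassert>
#include <chrono>

using clock_type = std::chrono::steady_clock;

// Illustrative follower state: conflating "sent" and "received" timestamps
// made an unreachable follower look responsive, because merely sending a
// request refreshed the same timestamp a reply would.
struct follower_index_metadata {
    // next index to send to this follower
    long next_index = 0;
    // timestamp of the last append_entries request we *sent*
    clock_type::time_point last_sent_append_entries_req_timestamp{};
    // timestamp of the last append_entries reply we *received*
    clock_type::time_point last_received_reply_timestamp{};
};

// A follower counts as responsive only if a reply arrived recently;
// having just sent it a request must not count.
bool is_responsive(const follower_index_metadata& f,
                   clock_type::time_point now,
                   clock_type::duration timeout) {
    return now - f.last_received_reply_timestamp < timeout;
}
```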
Big 👍 for clearer names
The code change LGTM, would be interested in thoughts on whether we can encapsulate this in a ducktape test (chaos tests are good, but this seems like a simple enough situation to have a dedicated test for)
are you thinking about network partition vs down in terms of being able to create the test in ducktape, or something more nuanced? assuming the test would drive everything implicitly through the leader and that client can reach leader, then from raft perspective, network partition seems like it would be the same as follower being down.
There is a connection missing for me going from the problem:
leader was propagating backpressure to the Kafka API even though the follower was down
to the solution:
do not update reply timestamp when sending request
Is it because the leader would wait on a quorum of responses that never arrived?
- clarifying this in the PR would be really useful for future rafters
- is there a related chaos test we can reference as a reproducer?
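For illustration, the backpressure mechanism under discussion can be sketched roughly like this (the names and the limit are hypothetical, not Redpanda's actual code): the leader keeps a bounded number of in-flight append_entries requests per follower; a send consumes a slot and only a reply frees one, so a follower that never replies eventually exhausts the budget and stalls the produce path all the way back to the Kafka client.

```cpp
#include <cassert>

// Hypothetical per-follower in-flight accounting. If replies never arrive,
// can_send() stays false forever and the caller has to wait, which is how
// the backpressure reached the Kafka API in the bug being fixed.
struct follower_dispatch {
    static constexpr int max_in_flight = 8; // illustrative constant
    int in_flight = 0;

    bool can_send() const { return in_flight < max_in_flight; }
    void on_send() { ++in_flight; }   // a request consumes a slot
    void on_reply() { --in_flight; }  // only a reply releases it
};
```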
I was thinking of network partitions in the general sense of including cases where a node was only isolated from some of its peers (not sure how crazy the chaos tests get with that kind of thing). It seems like this is really just a straight "node down" situation though.
got it. iirc in raft there is only leader <-> follower communication so if we're thinking about a single raft group it might be the same scenario here.
Force-pushed from b11d060 to 6622484
Looks good, the changes are crisp, very easy to read!
Do not update last received append entries reply timestamp when updating last sent append entries timestamp. Signed-off-by: Michal Maslanka <michal@vectorized.io>
Kafka Metadata API responses should contain a leader id even though the partition leader is unknown at a given instant. In Kafka, the previous partition leader is returned when the leader id is unknown. To match this behavior without losing the ability to track the current leader state, the previous leader id is now stored in the partition leaders table. Signed-off-by: Michal Maslanka <michal@vectorized.io>
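A minimal sketch of the idea, with illustrative names rather than the actual partition leaders table API: the table remembers the previous leader alongside the current one, so a Metadata response can still report the former leader while the current leader is unknown.

```cpp
#include <cassert>
#include <optional>

using model_node_id = int; // illustrative stand-in for model::node_id

// Illustrative leader-table entry: updating the current leader preserves
// the previous one, so Kafka-style "report the former leader" is possible
// without corrupting the real leadership state.
struct leader_info {
    std::optional<model_node_id> current;
    std::optional<model_node_id> previous;

    void update(std::optional<model_node_id> new_leader) {
        if (current) {
            previous = current;
        }
        current = new_leader;
    }

    // leader id to put in a Metadata response
    std::optional<model_node_id> reported() const {
        return current ? current : previous;
    }
};
```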
Force-pushed from 6622484 to bc3f1f5
We use a simple heuristic to tolerate isolation of a node hosting both a partition leader and a follower. Kafka clients request a metadata refresh when they receive an error related to stale metadata, e.g. NOT_LEADER. A metadata request can be processed by any broker, and there is no general rule for which broker to refresh metadata from (e.g. the Java Kafka client uses the broker with the least loaded active connection). This may lead to a situation in which the client always asks the same broker for metadata. When that broker is isolated from the rest of the cluster it will never update its metadata view, so the client will keep receiving stale metadata. This behavior may lead to a livelock in the event of a network partition. If the current partition leader is isolated from the cluster, it will keep answering with its own id in the leader_id field for that partition (per the policy of returning the former leader; from that broker's perspective there is no leader, it is a candidate). The client will retry the produce or fetch request, receive a NOT_LEADER error, request a metadata update, get the same metadata back, and the whole cycle will loop indefinitely. To break the loop and force the client to make progress we use the following heuristics: 1) when the current leader is unknown, return the former leader (Kafka behavior); 2) when the current leader is unknown and the previous leader is equal to the current node id, select a random replica_id as the leader (to indicate leader isolation). With these heuristics we always force the client to communicate with nodes that may not be partitioned. Signed-off-by: Michal Maslanka <michal@vectorized.io>
Force-pushed from bc3f1f5 to fef4a6f
Force-pushed from fef4a6f to a698c34
@mmaslankaprv and @rystsov - great work! let's backport to 21.10.x and 21.11.x
Cover letter
Fixed an availability issue occurring during follower isolation. The follower isolation chaos test isolates one of the followers from the other group members (it uses iptables to drop all packets targeting the isolated follower). In this situation the leader is supposed to stop sending requests to the isolated follower. The bug that has been fixed caused the leader to keep trying to send append entries to the follower even though the follower was unreachable. This triggered backpressure propagation (based on a constant limit on in-flight requests per follower). The backpressure was propagated to the client, which resulted in timeouts.
Metadata response heuristics
We use a simple heuristic to tolerate isolation of a node hosting both a partition leader and a follower. Kafka clients request a metadata refresh when they receive an error related to stale metadata, e.g. NOT_LEADER. A metadata request can be processed by any broker, and there is no general rule for which broker to refresh metadata from (e.g. the Java Kafka client uses the broker with the least loaded active connection). This may lead to a situation in which the client always asks the same broker for metadata. When that broker is isolated from the rest of the cluster it will never update its metadata view, so the client will keep receiving stale metadata.
This behavior may lead to a livelock in the event of a network partition. If the current partition leader is isolated from the cluster, it will keep answering with its own id in the leader_id field for that partition (per the policy of returning the former leader; from that broker's perspective there is no leader, it is a candidate). The client will retry the produce or fetch request, receive a NOT_LEADER error, request a metadata update, get the same metadata back, and the whole cycle will loop indefinitely.
To break the loop and force the client to make progress we use the following heuristics:
1) when the current leader is unknown, return the former leader (Kafka behavior)
2) when the current leader is unknown and the previous leader is equal to the current node id, select a random replica_id as the leader (to indicate leader isolation)
With these heuristics we always force the client to communicate with nodes that may not be partitioned.
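The two heuristics above can be sketched as a single decision function (type and function names here are illustrative, not the actual Redpanda API): given what the broker knows about the current and previous leader and the replica set, decide which leader id to put in the Metadata response.

```cpp
#include <cassert>
#include <optional>
#include <random>
#include <vector>

using model_node_id = int; // illustrative stand-in for model::node_id

std::optional<model_node_id> leader_to_report(
  std::optional<model_node_id> current_leader,
  std::optional<model_node_id> previous_leader,
  model_node_id self,
  const std::vector<model_node_id>& replicas) {
    if (current_leader) {
        return current_leader; // leader known, report it
    }
    if (!previous_leader) {
        return std::nullopt; // no former leader to fall back to
    }
    if (*previous_leader != self) {
        // 1) leader unknown -> return the former leader (Kafka behavior)
        return previous_leader;
    }
    // 2) leader unknown and *we* were the leader: this node may be the
    // isolated one, so point the client at a random other replica to
    // break the stale-metadata loop.
    std::vector<model_node_id> others;
    for (model_node_id r : replicas) {
        if (r != self) {
            others.push_back(r);
        }
    }
    if (others.empty()) {
        return previous_leader; // single-replica group: nothing better
    }
    static std::mt19937 gen{std::random_device{}()};
    std::uniform_int_distribution<std::size_t> dist(0, others.size() - 1);
    return others[dist(gen)];
}
```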
Release notes
Release note: [1-2 sentences of what this PR changes]