Fixed unavailability during follower & leader isolation #2856
Conversation
Does this issue require the follower be network partitioned, or can it just be down? If just down, then this seems like something we should be able to cover in a ducktape test (there's RaftAvailabilityTest.test_one_node_down that might be a good starting point)
```diff
@@ -210,7 +210,7 @@ In order to make it possible new fields have to be added to
   // next index to send to this follower
   model::offset next_index;
   // timestamp of last append_entries_rpc call
-  clock_type::time_point last_append_timestamp;
+  clock_type::time_point last_sent_append_entries_req_timesptamp;
```
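The distinction the rename captures can be sketched as follows (a minimal illustration with made-up type and field names, not the actual Redpanda definitions): the follower state keeps separate timestamps for the last request *sent* and the last reply *received*, and only the latter should count as evidence that the follower is alive.

```cpp
#include <cassert>
#include <chrono>

using clock_type = std::chrono::steady_clock;

// Illustrative follower state: conflating "sent" and "received" timestamps
// made an unreachable follower look responsive, because merely sending a
// request refreshed the same timestamp a reply would.
struct follower_index_metadata {
    // next index to send to this follower
    long next_index = 0;
    // timestamp of the last append_entries request we *sent*
    clock_type::time_point last_sent_append_entries_req_timestamp{};
    // timestamp of the last append_entries reply we *received*
    clock_type::time_point last_received_reply_timestamp{};
};

// A follower counts as responsive only if a reply arrived recently;
// having just sent it a request must not count.
bool is_responsive(const follower_index_metadata& f,
                   clock_type::time_point now,
                   clock_type::duration timeout) {
    return now - f.last_received_reply_timestamp < timeout;
}
```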
Big 👍 for clearer names
The code change LGTM, would be interested in thoughts on whether we can encapsulate this in a ducktape test (chaos tests are good, but this seems like a simple enough situation to have a dedicated test for)
are you thinking about network partition vs down in terms of being able to create the test in ducktape, or something more nuanced? assuming the test would drive everything implicitly through the leader and that client can reach leader, then from raft perspective, network partition seems like it would be the same as follower being down.
There is a connection missing for me going from the problem:
leader was propagating backpressure to the Kafka API even though the follower was down
to the solution:
do not update reply timestamp when sending request
Is it because the leader would wait on a quorum of responses that never arrived?
- clarifying this in the PR would be really useful for future rafters
- is there a related chaos test we can reference as a reproducer?
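For illustration, the backpressure mechanism under discussion can be sketched roughly like this (the names and the limit are hypothetical, not Redpanda's actual code): the leader keeps a bounded number of in-flight append_entries requests per follower; a send consumes a slot and only a reply frees one, so a follower that never replies eventually exhausts the budget and stalls the produce path all the way back to the Kafka client.

```cpp
#include <cassert>

// Hypothetical per-follower in-flight accounting. If replies never arrive,
// can_send() stays false forever and the caller has to wait, which is how
// the backpressure reached the Kafka API in the bug being fixed.
struct follower_dispatch {
    static constexpr int max_in_flight = 8; // illustrative constant
    int in_flight = 0;

    bool can_send() const { return in_flight < max_in_flight; }
    void on_send() { ++in_flight; }   // a request consumes a slot
    void on_reply() { --in_flight; }  // only a reply releases it
};
```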
I was thinking of network partitions in the general sense of including cases where a node was only isolated from some of its peers (not sure how crazy the chaos tests get with that kind of thing). It seems like this is really just a straight "node down" situation though.
got it. iirc in raft there is only leader <-> follower communication so if we're thinking about a single raft group it might be the same scenario here.
Force-pushed from b11d060 to 6622484
Looks good, the changes are crisp, very easy to read!
Do not update last received append entries reply timestamp when updating last sent append entries timestamp. Signed-off-by: Michal Maslanka <michal@vectorized.io>
Kafka Metadata API responses should contain a leader id even though the partition leader is unknown at a given instant. In Kafka, the previous partition leader is returned when the leader id is unknown. To match this behavior without losing the ability to track the current leader state, the previous leader id is now stored in the partition leaders table. Signed-off-by: Michal Maslanka <michal@vectorized.io>
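A minimal sketch of the idea, with illustrative names rather than the actual partition leaders table API: the table remembers the previous leader alongside the current one, so a Metadata response can still report the former leader while the current leader is unknown.

```cpp
#include <cassert>
#include <optional>

using model_node_id = int; // illustrative stand-in for model::node_id

// Illustrative leader-table entry: updating the current leader preserves
// the previous one, so Kafka-style "report the former leader" is possible
// without corrupting the real leadership state.
struct leader_info {
    std::optional<model_node_id> current;
    std::optional<model_node_id> previous;

    void update(std::optional<model_node_id> new_leader) {
        if (current) {
            previous = current;
        }
        current = new_leader;
    }

    // leader id to put in a Metadata response
    std::optional<model_node_id> reported() const {
        return current ? current : previous;
    }
};
```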
Force-pushed from 6622484 to bc3f1f5
We use a simple heuristic to tolerate isolation of a node hosting both a partition leader and a follower. Kafka clients request a metadata refresh when they receive an error related to stale metadata, e.g. NOT_LEADER. A metadata request can be processed by any broker, and there is no general rule for which broker to refresh metadata from (e.g. the Java Kafka client uses the broker with the least loaded active connection). This may lead to a situation in which the client always asks the same broker for metadata. When that broker is isolated from the rest of the cluster it will never update its metadata view, so the client will keep receiving stale metadata. This behavior may lead to a livelock in the event of a network partition. If the current partition leader is isolated from the cluster, it will keep answering with its own id in the leader_id field for that partition (per the policy of returning the former leader; from that broker's perspective there is no leader, it is a candidate). The client will retry the produce or fetch request, receive a NOT_LEADER error, request a metadata update, get the same metadata back, and the whole cycle will loop indefinitely. To break the loop and force the client to make progress we use the following heuristics: 1) when the current leader is unknown, return the former leader (Kafka behavior); 2) when the current leader is unknown and the previous leader is equal to the current node id, select a random replica_id as the leader (to indicate leader isolation). With these heuristics we always force the client to communicate with nodes that may not be partitioned. Signed-off-by: Michal Maslanka <michal@vectorized.io>
Force-pushed from bc3f1f5 to fef4a6f
Force-pushed from fef4a6f to a698c34
@mmaslankaprv and @rystsov - great work! let's backport to 21.10.x and 21.11.x
Cover letter
Fixed an availability issue occurring during follower isolation. The follower isolation chaos test isolates one of the followers from the other group members (it uses iptables to drop all packets targeting the isolated follower). In this situation the leader is supposed to stop sending requests to the isolated follower. The bug that has been fixed caused the leader to keep trying to send append entries to the follower even though the follower was unreachable. This triggered backpressure propagation (based on a constant limit on in-flight requests per follower). The backpressure was propagated to the client, which resulted in timeouts.
Metadata response heuristics
We use a simple heuristic to tolerate isolation of a node hosting both a partition leader and a follower. Kafka clients request a metadata refresh when they receive an error related to stale metadata, e.g. NOT_LEADER. A metadata request can be processed by any broker, and there is no general rule for which broker to refresh metadata from (e.g. the Java Kafka client uses the broker with the least loaded active connection). This may lead to a situation in which the client always asks the same broker for metadata. When that broker is isolated from the rest of the cluster it will never update its metadata view, so the client will keep receiving stale metadata.
This behavior may lead to a livelock in the event of a network partition. If the current partition leader is isolated from the cluster, it will keep answering with its own id in the leader_id field for that partition (per the policy of returning the former leader; from that broker's perspective there is no leader, it is a candidate). The client will retry the produce or fetch request, receive a NOT_LEADER error, request a metadata update, get the same metadata back, and the whole cycle will loop indefinitely.
To break the loop and force the client to make progress we use the following heuristics:
1) when the current leader is unknown, return the former leader (Kafka behavior)
2) when the current leader is unknown and the previous leader is equal to the current node id, select a random replica_id as the leader (to indicate leader isolation)
With these heuristics we always force the client to communicate with nodes that may not be partitioned.
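The two heuristics above can be sketched as a single decision function (type and function names here are illustrative, not the actual Redpanda API): given what the broker knows about the current and previous leader and the replica set, decide which leader id to put in the Metadata response.

```cpp
#include <cassert>
#include <optional>
#include <random>
#include <vector>

using model_node_id = int; // illustrative stand-in for model::node_id

std::optional<model_node_id> leader_to_report(
  std::optional<model_node_id> current_leader,
  std::optional<model_node_id> previous_leader,
  model_node_id self,
  const std::vector<model_node_id>& replicas) {
    if (current_leader) {
        return current_leader; // leader known, report it
    }
    if (!previous_leader) {
        return std::nullopt; // no former leader to fall back to
    }
    if (*previous_leader != self) {
        // 1) leader unknown -> return the former leader (Kafka behavior)
        return previous_leader;
    }
    // 2) leader unknown and *we* were the leader: this node may be the
    // isolated one, so point the client at a random other replica to
    // break the stale-metadata loop.
    std::vector<model_node_id> others;
    for (model_node_id r : replicas) {
        if (r != self) {
            others.push_back(r);
        }
    }
    if (others.empty()) {
        return previous_leader; // single-replica group: nothing better
    }
    static std::mt19937 gen{std::random_device{}()};
    std::uniform_int_distribution<std::size_t> dist(0, others.size() - 1);
    return others[dist(gen)];
}
```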
Release notes
Release note: [1-2 sentences of what this PR changes]