
raft: notify peers about startup to avoid vote/pre-vote #4151

Merged: 6 commits, Apr 6, 2022

Conversation

dotnwat (Member) commented Mar 31, 2022

Cover letter

PR #2071 provides a heuristic fix for leadership instability after
followers restart, caused by an interaction between the leader's RPC
backoff rules and follower timeouts (more information in #2048).

The heuristic works most of the time, but still depends on several moons
aligning: failed heartbeats, the pre-vote phase starting, etc. This patch
does away with the heuristic and instead communicates an explicit startup
event to peers within a raft group via a new 'hello' RPC sent when a raft
group first starts. Peers use this startup message to reset backoff (as the
heuristic did), but the reset is more precise because it is tied to a
specific event rather than attempting to infer the scenario of interest.
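
For illustration only, here is a minimal sketch of what the receiving side of such an announcement could look like: on a 'hello' from a restarted peer, any accumulated reconnect backoff toward that node is cleared. The names below (`hello_request`, `node_backoff_table`, etc.) are hypothetical and not the actual Redpanda types.

```cpp
#include <chrono>
#include <cstdint>
#include <unordered_map>

// Hypothetical sketch: per-node reconnect backoff, keyed by node id.
struct backoff_state {
    std::chrono::milliseconds current{0};
    void reset() { current = std::chrono::milliseconds{0}; }
};

// Hypothetical wire types; the real request/reply live in the raft RPC layer.
struct hello_request {
    int32_t peer_id; // node announcing that it has just (re)started
};
struct hello_reply {
    bool ok{true};
};

class node_backoff_table {
public:
    // On receiving a hello, drop any accumulated backoff toward that peer so
    // the next reconnect / heartbeat attempt is not delayed.
    hello_reply handle_hello(const hello_request& req) {
        _by_node[req.peer_id].reset();
        return hello_reply{};
    }

private:
    std::unordered_map<int32_t, backoff_state> _by_node;
};
```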

Fixes: #4083

Release notes

Improvements

  • Improved behavior for restarted raft groups that reduces election churn.

jcsp (Contributor) commented Mar 31, 2022

I'm a bit concerned about the scale implications of implementing this at a per-group level: as we progress to having 100,000s of partitions per node, the queuing latency of emitting all these messages will be substantial. I think we mainly just need to know that the server is up, rather than that an individual group is starting up.

The implementation of consensus::hello is to call down through client_protocol and ultimately into connection_cache to reset the backoff on the restarted node, so if there are lots of partitions they'll all ultimately be resetting the same thing.

dotnwat (Member, Author) commented Mar 31, 2022

> I'm a bit concerned about the scale implications of implementing this at a per-group level: as we progress to having 100,000s of partitions per node, the queuing latency of emitting all these messages will be substantial. I think we mainly just need to know that the server is up, rather than that an individual group is starting up.

Ahh, yeah, good point. Debouncing the message should not be a problem, whether it stays at a per-group level or gets bumped up to the controller/node level.

> The implementation of consensus::hello is to call down through client_protocol and ultimately into connection_cache to reset the backoff on the restarted node, so if there are lots of partitions they'll all ultimately be resetting the same thing.

I had naively been operating under the assumption that this back-off was per group. Now I'm wondering if this is more complicated.

At scale, with 100K partitions, I would expect the start-up of any given raft group to lag the start of the node (or of some other raft group) by much more than the default vote timeout. The effect of this would seem to be that an initial backoff reset on the leader wouldn't help for a raft group that starts up much later on the restarted follower.

I guess an option would be to wait until all of the raft groups on a restarted node are recovered/ready before saying hello and starting the vote timeout. But that would seem to be potentially problematic for boot-up time, and later on when we want to "pause" a raft group.


EDIT.0: I guess once the backoff is reset it won't back off again as long as something on the connection looks healthy, even if some raft group on the connection hasn't booted up yet. I was thinking that the backoff was initiated any time a raft group was unresponsive. I need to go look a little closer at this mechanism.

EDIT.1: Ok, now I see the backoff is applied down at the reconnecting-transport level; I was working at the wrong level of abstraction. I think the question of how to handle raft groups at scale, when they can't be treated equally in a gang-scheduled way, is still relevant, but is more appropriate to address at a later time.
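
As a rough illustration of the debouncing idea above (announcing startup once per peer node rather than once per raft group), here is a small hypothetical sketch; `startup_announcer` and the send callback are made-up names, not Redpanda code.

```cpp
#include <cstdint>
#include <functional>
#include <unordered_set>
#include <vector>

// Hypothetical sketch: announce startup once per peer *node* rather than once
// per raft group, so the number of hello messages scales with cluster size
// instead of partition count.
class startup_announcer {
public:
    explicit startup_announcer(std::vector<int32_t> peer_nodes)
      : _peers(std::move(peer_nodes)) {}

    // send_hello stands in for whatever RPC call delivers the announcement.
    void announce(const std::function<void(int32_t)>& send_hello) {
        for (int32_t node : _peers) {
            // de-duplicate: each peer hears about this startup at most once
            if (_announced.insert(node).second) {
                send_hello(node);
            }
        }
    }

private:
    std::vector<int32_t> _peers;
    std::unordered_set<int32_t> _announced;
};
```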

The loop which establishes connections in members_manager::start was not
executing because the temporary returned by raft0->config() was being
destroyed before the loop ran, so the brokers reference was either reading
a valid empty vector or dangling (undefined behavior). This is a general
danger of using temporaries in range-based for loops.

Signed-off-by: Noah Watkins <noah@redpanda.com>
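
For readers unfamiliar with the pitfall, a minimal standalone example (not the actual members_manager code) of the lifetime issue and the fix:

```cpp
#include <vector>

struct configuration {
    std::vector<int> _brokers{1, 2, 3};
    const std::vector<int>& brokers() const { return _brokers; }
};

// Stand-in for raft0->config(): returns the configuration by value.
configuration current_config() { return configuration{}; }

void connect_all() {
    // BUG (pre-C++23): the temporary returned by current_config() is
    // destroyed before the loop body runs, because lifetime extension does
    // not reach through the brokers() member-function call. The loop then
    // iterates over a dangling reference.
    for (const auto& b : current_config().brokers()) {
        (void)b; // e.g. establish a connection to broker b
    }

    // FIX: name the object so it outlives the loop.
    auto cfg = current_config();
    for (const auto& b : cfg.brokers()) {
        (void)b;
    }
}
```

Note that C++23 extends the lifetime of temporaries appearing in the range-initializer, but earlier standards do not.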
The hello rpc is used by nodes to announce that they are starting up.
Peers should expect at most one such message from a given peer, sent
shortly after that peer starts up.

A node may use the rpc to perform optimizations such as resetting the
connection backoff for the peer in the connection cache.

Signed-off-by: Noah Watkins <noah@redpanda.com>
dotnwat (Member, Author) commented Apr 1, 2022

@jcsp ok, I think this solution should be more palatable.

mmaslankaprv (Member):

This looks really good, clean and simple.

mmaslankaprv previously approved these changes Apr 1, 2022
The announcement is made when the members manager is started. At this
point the connection cache is populated with connections to all known
members in the latest configuration.

Signed-off-by: Noah Watkins <noah@redpanda.com>
dotnwat (Member, Author) commented Apr 5, 2022

Forced push:

  1. Say hello and set up connections in the background
  2. Include the process start time in the hello message (see the sketch below)

The goods: https://github.com/redpanda-data/redpanda/compare/7ba472aa5fce41a0efc6c2b3d13a1046e438fbf6..6a4a1cee2581417d06ea3212c3bec221d02a1ab0
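
The PR text here does not spell out how the start time is used; one plausible use, sketched below with made-up types, is to let a receiver ignore a hello that is not newer than the last one it saw from that peer (for example, a redelivered or stale announcement).

```cpp
#include <chrono>
#include <cstdint>
#include <unordered_map>

// Hypothetical sketch: a hello that carries the sender's process start time,
// letting the receiver act only on announcements that are newer than the
// last one observed from that peer.
struct hello_request {
    int32_t peer_id;
    std::chrono::milliseconds start_time; // sender's process start timestamp
};

class hello_tracker {
public:
    // Returns true if this hello represents a new startup of the peer and the
    // local backoff toward it should therefore be reset.
    bool should_reset_backoff(const hello_request& req) {
        auto [it, inserted] = _last_seen.try_emplace(req.peer_id, req.start_time);
        if (inserted || req.start_time > it->second) {
            it->second = req.start_time;
            return true;
        }
        return false;
    }

private:
    std::unordered_map<int32_t, std::chrono::milliseconds> _last_seen;
};
```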

jcsp (Contributor) left a comment

👍

jcsp merged commit 4b8abc4 into redpanda-data:dev on Apr 6, 2022
This pull request was closed.

Successfully merging this pull request may close these issues:

  • Fix disruption to leaders (via exponential backoff) when followers roll