
raft: notify peers about startup to avoid vote/pre-vote #4151

Merged: 6 commits, Apr 6, 2022

Conversation

dotnwat (Member) commented Mar 31, 2022

Cover letter

PR #2071 provides a heuristic fix for leadership instability after
followers restart, caused by an interaction between the leader's RPC
backoff rules and follower timeouts (more information in #2048).

The heuristic works most of the time, but still depends on several moons
aligning: failed heartbeats, the pre-vote phase starting, etc. This patch
does away with the heuristic and instead communicates an explicit startup
event to peers within a raft group via a new 'hello' RPC sent when a raft
group first starts. Peers use this startup message to reset backoff (as the
heuristic did), but the reset is more precise because it is tied to a
specific event rather than attempting to infer the scenario of interest.
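
For illustration only, here is a minimal sketch of what the receiving side of such an announcement could look like: on a 'hello' from a restarted peer, any accumulated reconnect backoff toward that node is cleared. The names below (`hello_request`, `node_backoff_table`, etc.) are hypothetical and not the actual Redpanda types.

```cpp
#include <chrono>
#include <cstdint>
#include <unordered_map>

// Hypothetical sketch: per-node reconnect backoff, keyed by node id.
struct backoff_state {
    std::chrono::milliseconds current{0};
    void reset() { current = std::chrono::milliseconds{0}; }
};

// Hypothetical wire types; the real request/reply live in the raft RPC layer.
struct hello_request {
    int32_t peer_id; // node announcing that it has just (re)started
};
struct hello_reply {
    bool ok{true};
};

class node_backoff_table {
public:
    // On receiving a hello, drop any accumulated backoff toward that peer so
    // the next reconnect / heartbeat attempt is not delayed.
    hello_reply handle_hello(const hello_request& req) {
        _by_node[req.peer_id].reset();
        return hello_reply{};
    }

private:
    std::unordered_map<int32_t, backoff_state> _by_node;
};
```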

Fixes: #4083

Release notes

Improvements

  • Improved behavior for restarted raft groups that reduces election churn.

jcsp (Contributor) commented Mar 31, 2022

I'm a bit concerned about the scale implications of implementing this at a per-group level: as we progress to having 100,000s of partitions per node, the queuing latency of emitting all these messages will be substantial. I think we mainly just need to know that the server is up, rather than that an individual group is starting up.

The implementation of consensus::hello is to call down through client_protocol and ultimately into connection_cache to reset the backoff on the restarted node, so if there are lots of partitions they'll all ultimately be resetting the same thing.

dotnwat (Member, Author) commented Mar 31, 2022

> I'm a bit concerned about the scale implications of implementing this at a per-group level: as we progress to having 100,000s of partitions per node, the queuing latency of emitting all these messages will be substantial. I think we mainly just need to know that the server is up, rather than that an individual group is starting up.

Ahh, yeah, good point. Debouncing the message should not be a problem, whether it stays at a per-group level or gets bumped up to the controller/node level.

> The implementation of consensus::hello is to call down through client_protocol and ultimately into connection_cache to reset the backoff on the restarted node, so if there are lots of partitions they'll all ultimately be resetting the same thing.

I had naively been operating under the assumption that this back-off was per group. Now I'm wondering if this is more complicated.

At scale, with 100K partitions, I would expect the start-up of any given raft group to lag the start of the node (or of some other raft group) by much more than the default vote timeout. The effect of this would seem to be that an initial backoff reset on the leader wouldn't help for a raft group that starts up much later on the restarted follower.

I guess an option would be to wait until all of the raft groups on a restarted node are recovered/ready before saying hello and starting the vote timeout. But that would seem to be potentially problematic for boot-up time, and later on when we want to "pause" a raft group.


EDIT.0: I guess once the backoff is reset it won't back off again as long as something on the connection looks healthy, even if some raft group on the connection hasn't booted up yet. I was thinking that the backoff was initiated any time a raft group was unresponsive. I need to go look a little closer at this mechanism.

EDIT.1: Ok, now I see the backoff is applied down at the reconnecting-transport level; I was working at the wrong level of abstraction. I think the question of how to handle raft groups at scale, when they can't be treated equally in a gang-scheduled way, is still relevant, but is more appropriate to address at a later time.
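
As a rough illustration of the debouncing idea above (announcing startup once per peer node rather than once per raft group), here is a small hypothetical sketch; `startup_announcer` and the send callback are made-up names, not Redpanda code.

```cpp
#include <cstdint>
#include <functional>
#include <unordered_set>
#include <vector>

// Hypothetical sketch: announce startup once per peer *node* rather than once
// per raft group, so the number of hello messages scales with cluster size
// instead of partition count.
class startup_announcer {
public:
    explicit startup_announcer(std::vector<int32_t> peer_nodes)
      : _peers(std::move(peer_nodes)) {}

    // send_hello stands in for whatever RPC call delivers the announcement.
    void announce(const std::function<void(int32_t)>& send_hello) {
        for (int32_t node : _peers) {
            // de-duplicate: each peer hears about this startup at most once
            if (_announced.insert(node).second) {
                send_hello(node);
            }
        }
    }

private:
    std::vector<int32_t> _peers;
    std::unordered_set<int32_t> _announced;
};
```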

The loop which establishes connections in members_manager::start was not
executing because the temporary returned by raft0->config() was being
destroyed before the loop ran, so the brokers reference was either reading
a valid empty vector or dangling (undefined behavior). This is a general
danger of using temporaries in range-based for loops.

Signed-off-by: Noah Watkins <noah@redpanda.com>
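
For readers unfamiliar with the pitfall, a minimal standalone example (not the actual members_manager code) of the lifetime issue and the fix:

```cpp
#include <vector>

struct configuration {
    std::vector<int> _brokers{1, 2, 3};
    const std::vector<int>& brokers() const { return _brokers; }
};

// Stand-in for raft0->config(): returns the configuration by value.
configuration current_config() { return configuration{}; }

void connect_all() {
    // BUG (pre-C++23): the temporary returned by current_config() is
    // destroyed before the loop body runs, because lifetime extension does
    // not reach through the brokers() member-function call. The loop then
    // iterates over a dangling reference.
    for (const auto& b : current_config().brokers()) {
        (void)b; // e.g. establish a connection to broker b
    }

    // FIX: name the object so it outlives the loop.
    auto cfg = current_config();
    for (const auto& b : cfg.brokers()) {
        (void)b;
    }
}
```

Note that C++23 extends the lifetime of temporaries appearing in the range-initializer, but earlier standards do not.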
The hello rpc is used by nodes to announce that they are starting up.
Peers should expect at most one such message from a given peer, sent
shortly after that peer starts up.

A node may use the rpc to perform optimizations such as resetting the
connection backoff for the peer in the connection cache.

Signed-off-by: Noah Watkins <noah@redpanda.com>
dotnwat (Member, Author) commented Apr 1, 2022

@jcsp ok, I think this solution should be more palatable.

mmaslankaprv (Member):

This looks really good, clean and simple.

mmaslankaprv previously approved these changes Apr 1, 2022
The announcement is made when the members manager is started. At this
point the connection cache is populated with connections to all known
members in the latest configuration.

Signed-off-by: Noah Watkins <noah@redpanda.com>
dotnwat (Member, Author) commented Apr 5, 2022

Forced push:

  1. Say hello and set up connections in the background
  2. Include the process start time in the hello message (see the sketch below)

The goods: https://github.com/redpanda-data/redpanda/compare/7ba472aa5fce41a0efc6c2b3d13a1046e438fbf6..6a4a1cee2581417d06ea3212c3bec221d02a1ab0
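
The PR text here does not spell out how the start time is used; one plausible use, sketched below with made-up types, is to let a receiver ignore a hello that is not newer than the last one it saw from that peer (for example, a redelivered or stale announcement).

```cpp
#include <chrono>
#include <cstdint>
#include <unordered_map>

// Hypothetical sketch: a hello that carries the sender's process start time,
// letting the receiver act only on announcements that are newer than the
// last one observed from that peer.
struct hello_request {
    int32_t peer_id;
    std::chrono::milliseconds start_time; // sender's process start timestamp
};

class hello_tracker {
public:
    // Returns true if this hello represents a new startup of the peer and the
    // local backoff toward it should therefore be reset.
    bool should_reset_backoff(const hello_request& req) {
        auto [it, inserted] = _last_seen.try_emplace(req.peer_id, req.start_time);
        if (inserted || req.start_time > it->second) {
            it->second = req.start_time;
            return true;
        }
        return false;
    }

private:
    std::unordered_map<int32_t, std::chrono::milliseconds> _last_seen;
};
```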

jcsp (Contributor) left a comment

👍

jcsp merged commit 4b8abc4 into redpanda-data:dev on Apr 6, 2022
This pull request was closed.

Successfully merging this pull request may close these issues:

  • Fix disruption to leaders (via exponential backoff) when followers roll