-
Notifications
You must be signed in to change notification settings - Fork 579
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
raft: notify peers about startup to avoid vote/pre-vote #4151
Conversation
I'm a bit concerned about the scale implications of implementing this at a per-group level: as we progress to having 100,000s of partitions per node, the queuing latency of emitting all these messages will be substantial. I think we mainly just need to know that the server is up, rather than that an individual group is starting up. The implementation of |
Ahh, yeh good point. Debouncing the message should not be a problem, either remaining at a per-group level or bumping up to the controller/node level for the message.
I had naively been operating under the assumption that this back-off was per group. Now I'm wondering if this is more complicated. At scale with 100K partitions I would expect the start-up time of any given raft group to be much longer than the default vote timeout relative to the start of the node or some other raft group. The effect of this would seem to be that an initial backoff reset on the leader wouldn't help for a raft group that starts up much later on the restarted follower. I guess an option would be to wait until all of the raft groups on a restarted are recovered/ready before saying hello and starting the vote timeout. But that would seem to potentially be problematic for boot up time and later on when we want to "pause" a raft group. EDIT.0: I guess once the backoff is reset then it won't backoff as long as something on the connection looks healthy, even if some raft group on the connection hasn't booted up yet. I was thinking that the backoff was initiated any time a raft group was unresponsive. I need to go look a little closer at this mechanism. EDIT.1: Ok now I see the backoff is applied down at the reconnection transport level. I was working at the wrong level of abstraction. I think the question about how to handle raft groups at scale when they can't be treated equally in a gang scheduled way is still relevant, but for more appropriate at a later time. |
The loop which establishes connections in members_manager::start was not executing because the temporary returned by raft0->config() was being destroyed and the brokers reference was therefore operating on either a valid empty vector or it was UB. This is a general danger with using temporaries in range based for loops. Signed-off-by: Noah Watkins <noah@redpanda.com>
The hello rpc is used by nodes to announce that they are starting up. Peers should expect at most one message from a peer shortly after starting up. A node may use the rpc to perform optimizations such as resetting the connection backoff for the peer in the connection cache. Signed-off-by: Noah Watkins <noah@redpanda.com>
@jcsp ok think this solution should be more palatable. |
This looks really good, clean and simple |
The announcement is made when the members manager is started. At this point the connection cache is populated with connections to all known members in the latest configuration. Signed-off-by: Noah Watkins <noah@redpanda.com>
Signed-off-by: Noah Watkins <noah@redpanda.com>
Signed-off-by: Noah Watkins <noah@redpanda.com>
Signed-off-by: Noah Watkins <noah@redpanda.com>
Forced push:
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍
Cover letter
The PR #2071 provides a heuristic fix for leadership instability after
followers restart due to an interaction with leader rpc backoff rules and
follower timeouts (more information here #2048).
The heuristic works most of the time, but still depends on several moons to
align: failed heartbeats, prevote phase starting etc... This patch does away
with the heuristic and adds a new RPC for communicating an explicit startup
event to peers within a raft group via a new 'hello' rpc when a raft group
first starts. Peers now use this startup message to reset backoff (as it did
with the heuristic) but the reset is more precise because it is tied a specific
event rather than attempting to infer the scenario of interest.
Fixes: #4083
Release notes
Improvements