Now that we're building more complicated cluster management features like leadership rebalancing, unresponsive nodes in the system become increasingly troublesome:
- they disrupt or break collective algorithms, e.g. leadership rebalancing sees a down node as a node holding no leaderships, and therefore as an attractive target for leadership migration.
- they generate log noise from connection errors.
When the administrator is intentionally stopping nodes, we should let them inform the cluster and thereby avoid these issues.
When entering maintenance mode, nodes should give up leaderships (an abdicate admin API is added in #1936).
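The sequence above could look roughly like the following sketch. This is illustrative only, under assumed names: `ClusterState`, `Node`, `abdicate`, and `enter_maintenance` are hypothetical stand-ins, not the project's actual API.

```python
# Hypothetical sketch of entering maintenance mode: the node first records
# the flag in shared cluster state (so rebalancing stops targeting it),
# then gives up every leadership it holds.

class ClusterState:
    """Toy stand-in for shared cluster metadata."""
    def __init__(self):
        self.maintenance = set()  # node ids flagged as in maintenance

class Node:
    def __init__(self, node_id, cluster, leaderships):
        self.node_id = node_id
        self.cluster = cluster
        self.leaderships = list(leaderships)  # raft groups this node leads

    def abdicate(self, group):
        # In the real system this would hand leadership to a peer
        # (the admin API referenced above); here we simply drop it.
        self.leaderships.remove(group)

    def enter_maintenance(self):
        # 1. Inform the cluster first, so collective algorithms such as
        #    leadership rebalancing stop treating this node as a target.
        self.cluster.maintenance.add(self.node_id)
        # 2. Give up leaderships before stopping.
        for group in list(self.leaderships):
            self.abdicate(group)

cluster = ClusterState()
node = Node("n1", cluster, ["shard-a", "shard-b"])
node.enter_maintenance()
```

The ordering matters: flagging maintenance before abdicating means the migrated leaderships won't bounce straight back to the departing node.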
In raft, we should still send heartbeats to nodes in maintenance mode, to enable them to catch up before being brought back into normal service. However, it would be nice to avoid emitting connection errors to the log for nodes in maintenance mode -- this might be something to implement in the RPC layer.
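One possible shape for that RPC-layer behaviour, as a hedged sketch (none of these names come from the actual codebase): heartbeats still go to every peer, but connection errors for maintenance-mode nodes are demoted to debug level instead of error.

```python
# Illustrative sketch: keep heartbeating nodes in maintenance mode so they
# can catch up on the raft log when they return, but log their expected
# connection failures at debug level to avoid noise.

import logging

logger = logging.getLogger("rpc")

def report_conn_error(node_id, err, maintenance_set):
    """Route a connection error to the appropriate log level."""
    if node_id in maintenance_set:
        # Expected while the node is intentionally down: demote to debug.
        logger.debug("connect to %s failed (maintenance): %s", node_id, err)
    else:
        logger.error("connect to %s failed: %s", node_id, err)

def send_heartbeats(peers, maintenance_set, send):
    """Heartbeat every peer, including maintenance-mode nodes, swallowing
    connection errors so one down node doesn't abort the round."""
    for peer in peers:
        try:
            send(peer)
        except ConnectionError as err:
            report_conn_error(peer, err, maintenance_set)
```

Keeping the maintenance set alongside whatever membership state the RPC layer already tracks avoids an extra lookup on the hot path.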