
redpanda: reassign replicas off of nodes which have been unavailable for a specific amount of time #3237

Open
rkruze opened this issue Dec 12, 2021 · 3 comments
Labels
area/controller, kind/enhance (New feature or request)

Comments

rkruze (Contributor) commented Dec 12, 2021

Who is this for and what problem do they have today?

Today, when a node goes down in a Redpanda cluster, all of the leaderships it held are transferred to other nodes. However, unless the node is decommissioned via an admin API call, the replicas assigned to that node are not moved to other nodes in the cluster.

This feature request is for a cluster setting such that, after a node has been unavailable for more than X amount of time, its replicas are reassigned to other nodes in the cluster so the affected partitions are re-replicated back to full replication.

What are the success criteria?

Success would be that if a node goes down and stays down longer than X amount of time (specified by a cluster config, with a default of something like 5 minutes), the replicas assigned to that node are reassigned to nodes which are still up in the cluster.
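
To make the proposal concrete, here is a minimal sketch (illustrative only, not Redpanda code) of the kind of check the controller could run: track each node's last heartbeat, read the timeout from cluster config on every pass so it can be changed live, and flag nodes that have been unreachable past the threshold. The names `node_unavailable_timeout`, `node_health`, and `nodes_needing_reassignment` are hypothetical.

```cpp
#include <chrono>
#include <iostream>
#include <unordered_map>
#include <vector>

using monotonic = std::chrono::steady_clock;

struct node_health {
    monotonic::time_point last_heartbeat{};
    bool in_maintenance = false; // never auto-rebalance away from a node in maintenance
};

// Hypothetical cluster config; read on every pass so an operator can change it
// live (e.g. lengthen it ahead of planned maintenance).
std::chrono::seconds node_unavailable_timeout{300}; // default of ~5 minutes

std::vector<int> nodes_needing_reassignment(
  const std::unordered_map<int, node_health>& nodes, monotonic::time_point now) {
    std::vector<int> result;
    for (const auto& [node_id, health] : nodes) {
        if (health.in_maintenance) {
            continue; // maintenance mode is a promise the outage is temporary
        }
        if (now - health.last_heartbeat > node_unavailable_timeout) {
            result.push_back(node_id); // candidate for replica reassignment
        }
    }
    return result;
}

int main() {
    auto now = monotonic::now();
    std::unordered_map<int, node_health> nodes;
    nodes[1] = {now, false};                            // healthy
    nodes[2] = {now - std::chrono::minutes(12), false}; // down too long
    nodes[3] = {now - std::chrono::minutes(12), true};  // down, but in maintenance
    for (int id : nodes_needing_reassignment(nodes, now)) {
        std::cout << "node " << id
                  << " exceeded the unavailability timeout; schedule replica reassignment\n";
    }
}
```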

Why is solving this problem impactful?

This feature will help to automate cluster management and enable things like auto-scaling groups.

Additional notes

This feature is very similar to https://www.cockroachlabs.com/docs/v21.2/architecture/replication-layer#membership-changes-rebalance-repair

JIRA Link: CORE-800

jcsp added the area/controller and kind/enhance (New feature or request) labels on Dec 13, 2021
jcsp (Contributor) commented Dec 13, 2021

A couple of thoughts from dealing with this kind of thing in storage systems:

  • The default timeout should be long enough for typical hardware/OS servicing (i.e. long enough to shut down, click around in iDRAC/iLO to install some new firmware, and start back up, which is more like 10-20 minutes than 5 minutes on a lot of servers).
  • Should have a well-defined interaction with maintenance mode (Add per-node maintenance mode #3020) - this might mean that when a node is in maintenance mode, we never auto-rebalance data away from it, because we take maintenance mode as a promise that the outage is temporary.
  • Need a Prometheus stat for "node has been down too long" that can be alerted on, and a direct CLI equivalent of what we do on the timeout, for users who want a "human in the loop" version of this flow.
  • Should do some queuing of partitions when moving them: prioritize completely moving a smaller number of partitions rather than partially moving all of them (a rough sketch follows after this list).
  • Needs a clear way to view progress and cancel, as data movement can be a very long-running operation.
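
A rough sketch of the queuing point above (illustrative, not Redpanda code): drain the smallest partitions first so a bounded number of moves finish completely, and keep the remainder queued where a "pending" metric could count them. The identifiers and sizes are made up for the example.

```cpp
#include <cstdint>
#include <iostream>
#include <queue>
#include <string>
#include <vector>

struct partition_move {
    std::string ntp;     // namespace/topic/partition identifier
    uint64_t size_bytes; // approximate data to transfer
};

struct smallest_first {
    bool operator()(const partition_move& a, const partition_move& b) const {
        return a.size_bytes > b.size_bytes; // min-heap on size
    }
};

int main() {
    std::priority_queue<partition_move, std::vector<partition_move>, smallest_first> pending;
    pending.push({"kafka/orders/0", 8ull << 30});
    pending.push({"kafka/clicks/3", 512ull << 20});
    pending.push({"kafka/logs/7", 2ull << 30});

    // Only a few moves run concurrently; the rest stay queued (and countable).
    const size_t max_concurrent_moves = 2;
    size_t started = 0;
    while (!pending.empty() && started < max_concurrent_moves) {
        std::cout << "start move: " << pending.top().ntp
                  << " (" << pending.top().size_bytes << " bytes)\n";
        pending.pop();
        ++started;
    }
    std::cout << "moves still pending: " << pending.size() << "\n";
}
```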

rkruze (Contributor, Author) commented Dec 14, 2021

The default timeout should be long enough for typical hardware/OS servicing (i.e. long enough to shut down, click around in iDRAC/iLO to install some new firmware, and start back up, which is more like 10-20 minutes than 5 minutes on a lot of servers).

On the point of timeouts, what I have seen work well is the ability to adjust the timeout dynamically via a config option, for cases where you know maintenance or updates are about to occur.

Needs a clear way to view progress and cancel, as data movement can be a very long-running operation.

Agreed with this; we should also have a metric exposing the queue of replica reassignments, perhaps something like a "Queue Replication Pending" gauge.

We should also have a way to configure a dynamic bandwidth throttle for this operation; a rough sketch follows below. You don't want a node going offline to cause more issues in the cluster. Ideally, if the cluster has Shadow Indexing enabled, the replacement replicas would replicate from the object store rather than from the other nodes.
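
As a sketch of what a dynamic throttle could look like (again illustrative, not Redpanda code), a token bucket whose fill rate can be changed at runtime would let an operator or an adaptive policy dial recovery bandwidth up or down without a restart. `recovery_throttle` and its knobs are hypothetical names.

```cpp
#include <algorithm>
#include <atomic>
#include <chrono>
#include <cstdint>
#include <iostream>

class recovery_throttle {
public:
    explicit recovery_throttle(uint64_t rate_bytes_per_sec)
      : _rate(rate_bytes_per_sec)
      , _tokens(rate_bytes_per_sec) // start with a full bucket
      , _last_refill(std::chrono::steady_clock::now()) {}

    // The rate can be changed at any time, e.g. by an operator or an adaptive policy.
    void set_rate(uint64_t rate_bytes_per_sec) { _rate.store(rate_bytes_per_sec); }

    // Returns how many of the requested bytes may be sent right now.
    uint64_t grant(uint64_t requested_bytes) {
        refill();
        uint64_t granted = std::min(requested_bytes, _tokens);
        _tokens -= granted;
        return granted;
    }

private:
    void refill() {
        auto now = std::chrono::steady_clock::now();
        std::chrono::duration<double> elapsed = now - _last_refill;
        _last_refill = now;
        uint64_t rate = _rate.load();
        // Cap the bucket at one second's worth of tokens to bound bursts.
        _tokens = std::min<uint64_t>(
          _tokens + static_cast<uint64_t>(elapsed.count() * rate), rate);
    }

    std::atomic<uint64_t> _rate;
    uint64_t _tokens;
    std::chrono::steady_clock::time_point _last_refill;
};

int main() {
    recovery_throttle throttle(100ull << 20); // start at ~100 MiB/s
    std::cout << "granted: " << throttle.grant(1ull << 20) << " bytes\n";
    throttle.set_rate(10ull << 20); // node under pressure: dial recovery down to ~10 MiB/s
    std::cout << "granted: " << throttle.grant(1ull << 20) << " bytes\n";
}
```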

scallister commented Dec 14, 2021

We should also have a way to configure a dynamic bandwidth throttle for this operation.

What would this look like? I'm imagining it might be something like an upper bound on total bandwidth for the node, combining regular traffic bandwidth and replication bandwidth, and backing off on replication as that upper bound is approached.
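
One possible shape for that, purely as a sketch with made-up names and numbers: keep a single node-wide cap, measure client traffic, and give replication whatever headroom remains, down to a small floor so recovery always makes some progress.

```cpp
#include <algorithm>
#include <cstdint>
#include <iostream>

// Given a node-wide bandwidth cap and the currently measured client traffic,
// return how much bandwidth replication/reassignment may use, never dropping
// below a small floor so recovery always moves forward.
uint64_t replication_budget(uint64_t node_cap_bps, uint64_t measured_client_bps,
                            uint64_t floor_bps) {
    uint64_t headroom = node_cap_bps > measured_client_bps
                          ? node_cap_bps - measured_client_bps
                          : 0;
    return std::max(headroom, floor_bps);
}

int main() {
    const uint64_t cap_bps = 500ull << 20;  // 500 MiB/s total for the node
    const uint64_t floor_bps = 5ull << 20;  // always allow >= 5 MiB/s of recovery
    // Quiet node: most of the cap is available for re-replication.
    std::cout << replication_budget(cap_bps, 100ull << 20, floor_bps) << " bytes/s\n";
    // Busy node: replication backs off to the floor.
    std::cout << replication_budget(cap_bps, 495ull << 20, floor_bps) << " bytes/s\n";
}
```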
