[mimir-distributed] Avoid series resharding during rolling ingester restart #1313

Logiraptor · 2022-04-27T19:28:42Z

As ingesters are restarted, their series are resharded onto other ingesters. During a rolling upgrade, this can lead to a lot of churn as series are resharded across virtually every ingester.

Since the ingesters in this chart are deployed via a StatefulSet, we know their hostname and storage are stable across restarts. This means we could feasibly run with the following settings:

-distributor.extend-writes=false
-ingester.unregister-on-shutdown=false

Per the description of -ingester.unregister-on-shutdown: if false, ingesters are not unregistered on shutdown and are left in the ring in the LEAVING state. Setting it to false prevents series resharding during ingester rollouts, but requires you to:

Either manually forget ingesters on scale down or invoke the /shutdown endpoint
Ensure the ingester ID is preserved during rollouts

This is how we run Grafana Cloud, so it would make sense to me to bake this into the Helm Chart as well.

09jvilla · 2022-05-16T14:34:27Z

Adding some more context to the discussion:

We believe setting both of the above flags to false should reduce the chance of failure during rolling ingester restarts. Specifically, without these settings applied, a rolling restart of all ingesters can cause a cascading failure as active series are resharded onto the remaining healthy ingesters. In an extreme case, the system can become overloaded because every ingester ends up holding an open chunk in the TSDB head block for every active series.

Setting both flags to false eliminates the risk outlined above by allowing the system to fall below the configured replication factor temporarily. This relies on ingesters running with a stable hostname and storage, which you get automatically when the ingesters are deployed as a Kubernetes StatefulSet. The Helm chart deploys them this way, so as long as you are using it you should be able to apply the recommended settings.
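To make the resharding effect concrete, here is a minimal sketch of a toy hash ring in Python. It is not Mimir's ring implementation: the hashing scheme, the token count, and the names `token`, `build_ring`, `replica_set` and `simulate_restart` are all assumptions made for illustration. It only models the idea described above: when the restarting ingester is removed from the ring, its series are pushed onto another ingester, whereas when it stays registered and writes are not extended, the affected series are simply written to fewer replicas for a while.

```python
import hashlib
from bisect import bisect_right


def token(key):
    """Map a string to a 32-bit position on the toy ring."""
    return int.from_bytes(hashlib.sha256(key.encode()).digest()[:4], "big")


def build_ring(ingesters, tokens_per_ingester=64):
    """Return a sorted list of (token, ingester) pairs."""
    return sorted(
        (token(f"{name}/{i}"), name)
        for name in ingesters
        for i in range(tokens_per_ingester)
    )


def replica_set(ring, series, rf=3):
    """Walk clockwise from the series token and pick rf distinct ingesters."""
    tokens = [t for t, _ in ring]
    start = bisect_right(tokens, token(series))
    owners = []
    for offset in range(len(ring)):
        _, name = ring[(start + offset) % len(ring)]
        if name not in owners:
            owners.append(name)
            if len(owners) == rf:
                break
    return owners


def simulate_restart(ingesters, restarting, series, unregister, rf=3):
    """Count series written to an ingester outside their usual replica set
    while `restarting` is down."""
    full_ring = build_ring(ingesters)
    if unregister:
        # Modelled as: the ingester's tokens are removed from the ring.
        ring_during = build_ring([i for i in ingesters if i != restarting])
    else:
        # Modelled as: the tokens stay in the ring, so the ring is unchanged.
        ring_during = full_ring
    moved = 0
    for s in series:
        before = set(replica_set(full_ring, s, rf))
        # Writes are not extended: the unavailable ingester is skipped,
        # and no extra ingester is picked to replace it.
        during = set(replica_set(ring_during, s, rf)) - {restarting}
        if during - before:
            moved += 1
    return moved


if __name__ == "__main__":
    names = [f"ingester-{i}" for i in range(6)]
    series = [f"series-{i}" for i in range(500)]
    for unregister in (True, False):
        moved = simulate_restart(names, "ingester-0", series, unregister)
        print(f"unregister={unregister}: {moved} series gained a new ingester")
```

In this model the count is always 0 when the ingester stays registered, and it equals the number of series that had the restarting ingester in their replica set when it is unregistered. Repeating that for every ingester in a rollout is the churn this issue is about.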

09jvilla · 2022-05-16T14:35:24Z

As we evaluate this, we're discussing the possibility of removing -distributor.extend-writes completely. Will link to the issue in the Mimir repo when we have it.

edit -- here's the issue for deprecating extend-writes:
grafana/mimir#1854

09jvilla · 2022-05-16T14:36:49Z

Related issue in Cortex filed by @bboreham about extend-writes:
cortexproject/cortex#1290

Related extend-writes issue in Mimir: grafana/mimir#92

09jvilla · 2022-05-31T23:08:48Z

We set -distributor.extend-writes to false in Mimir 2.1 and we'll be removing it (leaving it permanently set to false) in 2 releases.

Now the question is: do we set -ingester.unregister-on-shutdown=false in the Helm chart? We may not want to do this in Mimir itself, since there we can't make assumptions about stable hostname and storage, but we should be able to do it in the Helm chart, where the ingesters are deployed as a StatefulSet.

09jvilla · 2022-06-01T20:03:46Z

For those following along, we've decided that yes, we're comfortable setting -ingester.unregister-on-shutdown=false in the Helm chart. PR incoming.

09jvilla mentioned this issue May 16, 2022

Proposal: deprecate -distributor.extend-writes and keep it always disabled grafana/mimir#1854

Closed

Logiraptor mentioned this issue Jun 1, 2022

[helm] disable unregister on shutdown grafana/mimir#1994

Merged

56quarters closed this as completed in grafana/mimir#1994 Jun 2, 2022

liam-howe-maersk mentioned this issue Sep 1, 2023

Unregister ingesters on shutdown unless an update is being rolled out grafana/mimir#5901

Closed
