[mimir-distributed] Avoid series resharding during rolling ingester restart #1313
Comments
Adding some more context to the discussion: we believe setting both of the above flags to false should reduce the chance of failure during rolling ingester restarts. Without these settings, a rolling restart of all ingesters can cause a cascading failure as active series are resharded onto the remaining healthy ingesters. In an extreme case, the system can become overloaded because every ingester ends up holding an open chunk in the TSDB head block for every active series. Setting both flags to false eliminates this risk by allowing the system to fall below the configured replication factor temporarily. This relies on ingesters running with a stable hostname and stable storage, which you get automatically when deploying the ingesters as a Kubernetes StatefulSet. The Helm chart does this for you, so as long as you are using it you should be able to apply the recommended settings.
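To make the failure mode described above concrete, here is a rough toy simulation in Python. It is not Mimir's ring implementation: the `ToyRing` class, the `rolling_restart` helper, and the `unregister` and `extend_writes` parameters are all hypothetical names chosen to mirror the two behaviours discussed in this thread, and the ingester and series counts are arbitrary. It counts how many (series, ingester) pairs exist after a rolling restart, assuming an ingester keeps every series it has received for the duration of the rollout.

```python
import hashlib
from bisect import bisect_right


def _hash(key: str) -> int:
    # Stable 64-bit hash so the toy ring is deterministic across runs.
    return int.from_bytes(hashlib.sha256(key.encode()).digest()[:8], "big")


class ToyRing:
    """Toy token ring. An illustration only, not Mimir's implementation."""

    def __init__(self, ingesters, tokens_per_ingester=64, replication_factor=3):
        self.rf = replication_factor
        self.state = {name: "ACTIVE" for name in ingesters}
        self.tokens = sorted(
            (_hash(f"{name}-{i}"), name)
            for name in ingesters
            for i in range(tokens_per_ingester)
        )
        self._keys = [token for token, _ in self.tokens]

    def replicas(self, series, extend_writes):
        """Return the ACTIVE ingesters that would receive a write for `series`."""
        start = bisect_right(self._keys, _hash(series))
        seen, targets, wanted = set(), [], self.rf
        n = len(self.tokens)
        for offset in range(n):
            if len(seen) >= wanted:
                break
            _, name = self.tokens[(start + offset) % n]
            if name in seen or name not in self.state:
                continue  # token owner already visited, or unregistered
            seen.add(name)
            if self.state[name] == "ACTIVE":
                targets.append(name)
            elif extend_writes:
                wanted += 1  # walk on to one extra ingester to compensate
        return targets


def rolling_restart(unregister, extend_writes, n_ingesters=6, n_series=500):
    """Restart ingesters one at a time; return (pairs before, pairs after)."""
    names = [f"ingester-{i}" for i in range(n_ingesters)]
    series = [f"series-{i}" for i in range(n_series)]
    ring = ToyRing(names)
    held = {name: set() for name in names}  # series each ingester has in memory

    def write_all():
        for s in series:
            for name in ring.replicas(s, extend_writes):
                held[name].add(s)

    write_all()
    baseline = sum(len(v) for v in held.values())
    for name in names:
        if unregister:
            del ring.state[name]          # ingester leaves the ring entirely
        else:
            ring.state[name] = "LEAVING"  # ingester stays in the ring
        write_all()
        ring.state[name] = "ACTIVE"       # same hostname comes back
    return baseline, sum(len(v) for v in held.values())


if __name__ == "__main__":
    print("unregister + extend:", rolling_restart(True, True))
    print("stay in ring, no extend:", rolling_restart(False, False))
```

In this toy model, keeping the restarting ingester in the ring and not extending writes means every write lands on a subset of the original replica set, so no ingester picks up series it did not already hold. With both behaviours enabled, each restart pushes some series onto additional ingesters, and those extra copies accumulate over the rollout.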
As we evaluate this, we're discussing the possibility of removing edit -- here's the issue for deprecating extend-writes:
Related issue in Cortex filed by @bboreham about extend-writes: Related extend-writes issue in Mimir: grafana/mimir#92
We set Now the question is, do we set
For those following along, we've decided that yes we're comfortable setting
As ingesters are restarted, their series are resharded onto other ingesters. During a rolling upgrade, this can lead to a lot of churn as series are resharded across virtually every ingester.
Since the ingesters in this chart are deployed via a StatefulSet, we know the hostname and storage are stable across restarts. This means we could feasibly run with the following settings:
If false, ingesters are not unregistered on shutdown and left in the ring with the LEAVING state. Setting to false prevents series resharding during ingesters rollouts, but requires to:
This is how we run Grafana Cloud, so it would make sense to me to bake this into the Helm Chart as well.