[mimir-distributed] Avoid series resharding during rolling ingester restart #1313

Closed
Logiraptor opened this issue Apr 27, 2022 · 5 comments · Fixed by grafana/mimir#1994

@Logiraptor
Contributor

As ingesters are restarted, their series are resharded onto other ingesters. During a rolling upgrade, this can lead to a lot of churn as series are resharded across virtually every ingester.

Since the ingesters in this chart are deployed via a StatefulSet, we know the hostname and storage are stable across restarts. This means we could feasibly run with the following settings:

-distributor.extend-writes=false
-ingester.unregister-on-shutdown=false

If false, ingesters are not unregistered on shutdown and are left in the ring in the LEAVING state. Setting this to false prevents series resharding during ingester rollouts, but requires you to:

  1. Either manually forget ingesters on scale down or invoke the /shutdown endpoint
  2. Ensure ingester ID is preserved during rollouts

This is how we run Grafana Cloud, so it would make sense to me to bake this into the Helm Chart as well.
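
To make the resharding effect concrete, here is a toy model of a token ring. It is only a sketch: the hashing scheme, the 64 tokens per ingester, the ingester and series names, and the single-owner assignment (no replication) are all assumptions made for illustration, and this is not Mimir's actual ring implementation. It compares a rolling restart where each ingester's tokens are removed from the ring with one where they stay in place.

```python
import hashlib
from bisect import bisect_right


def token(key: str) -> int:
    """Map a string to a 64-bit position on the ring."""
    return int.from_bytes(hashlib.sha256(key.encode()).digest()[:8], "big")


def build_ring(ingesters, tokens_per_ingester=64):
    """Return the ring as a sorted list of (token, ingester) pairs."""
    return sorted(
        (token(f"{name}/{i}"), name)
        for name in ingesters
        for i in range(tokens_per_ingester)
    )


def assign(ring, series):
    """Map each series to the ingester owning the next token clockwise."""
    positions = [t for t, _ in ring]
    return {
        s: ring[bisect_right(positions, token(s)) % len(ring)][1]
        for s in series
    }


def rolling_restart(ingesters, series, unregister_on_shutdown):
    """Restart ingesters one at a time; count series that change owner."""
    full_ring = build_ring(ingesters)
    baseline = assign(full_ring, series)
    moved, receivers = 0, set()
    for restarting in ingesters:
        if unregister_on_shutdown:
            # Tokens of the restarting ingester disappear from the ring.
            ring = [(t, n) for t, n in full_ring if n != restarting]
        else:
            # The entry stays in the ring, so token ownership is unchanged.
            ring = full_ring
        current = assign(ring, series)
        for s in series:
            if current[s] != baseline[s]:
                moved += 1
                receivers.add(current[s])
    return moved, receivers


ingesters = [f"ingester-{i}" for i in range(5)]  # hypothetical names
series = [f"series-{i}" for i in range(1000)]    # hypothetical series
print(rolling_restart(ingesters, series, unregister_on_shutdown=True))
print(rolling_restart(ingesters, series, unregister_on_shutdown=False))
```

In this model, unregistering on shutdown moves every series to a different ingester once over the course of the rollout, while keeping the ring entries in place moves none.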

@09jvilla
Contributor

Adding some more context to the discussion:

We believe setting both of the above flags to false should reduce the chance of failure during rolling ingester restarts. Specifically, without these settings, a rolling restart of all ingesters can cause a cascading failure as active series are resharded onto healthy ingesters. In an extreme case, the system can become overloaded as every ingester ends up holding an open chunk in the TSDB head block for every active series.

Setting both flags to false eliminates the risk outlined above by allowing the system to fall below the configured replication factor temporarily. This relies on ingesters running with stable hostnames and storage, which you get automatically when the ingesters are deployed as a Kubernetes StatefulSet. The Helm chart does this for you, so as long as you are using it you should be able to apply the recommended settings.
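
As a rough illustration of why temporarily falling below the replication factor can be tolerable, the sketch below assumes a majority-quorum write rule (floor(RF/2) + 1 replicas must accept a write) and a replication factor of 3. Neither number is stated in this thread, so treat both as assumptions, and the function names are hypothetical.

```python
def quorum(replication_factor: int) -> int:
    """Smallest majority of replicas (assumed write rule)."""
    return replication_factor // 2 + 1


def write_succeeds(replication_factor: int, unavailable: int) -> bool:
    """True if enough replicas remain reachable to form a quorum."""
    reachable = replication_factor - unavailable
    return reachable >= quorum(replication_factor)


# One ingester restarting at a time, with an assumed replication factor of 3.
print(write_succeeds(3, unavailable=1))
# Two replicas of the same series unavailable at once.
print(write_succeeds(3, unavailable=2))
```

Under these assumptions, restarting one ingester at a time leaves a majority of replicas reachable for every series, which is why writes are not extended to an additional ingester during the restart.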

@09jvilla
Contributor

09jvilla commented May 16, 2022

As we evaluate this, we're discussing the possibility of removing -distributor.extend-writes completely. Will link to the issue in the Mimir repo when we have it.

edit -- here's the issue for deprecating extend-writes:
grafana/mimir#1854

@09jvilla
Contributor

09jvilla commented May 16, 2022

Related issue in Cortex filed by @bboreham about extend-writes:
cortexproject/cortex#1290

Related extend-writes issue in Mimir: grafana/mimir#92

@09jvilla
Contributor

We set -distributor.extend-writes to false in Mimir 2.1 and we'll be removing it (leaving it permanently set to false) in 2 releases.

Now the question is: do we set -ingester.unregister-on-shutdown=false in the Helm chart? We may not want to do this in Mimir itself, since we may not be able to assume stable hostnames and storage there, but we should be able to do it in the Helm chart, where the ingesters are deployed as a StatefulSet.

@09jvilla
Contributor

09jvilla commented Jun 1, 2022

For those following along: we've decided that yes, we're comfortable setting -ingester.unregister-on-shutdown=false in the Helm chart. PR incoming.
