Proposal: deprecate -distributor.extend-writes and keep it always disabled #1854

Closed
pracucci opened this issue May 16, 2022 · 4 comments · Fixed by #1856

@pracucci
Collaborator

-distributor.extend-writes (defaults to true) is a setting used by distributors and rulers to find the set of ingesters a series should be written to. The documentation says:

Try writing to an additional ingester in the presence of an ingester not in the ACTIVE state. It is useful to disable this along with -ingester.ring.unregister-on-shutdown=false in order to not spread samples to extra ingesters during rolling restarts with consistent naming.

In Mimir jsonnet we currently disable extend-writes when ingesters are configured to not unregister on shutdown. For that use case, we already know it's safe to disable it.

Question: is it safe to also disable it in the case where ingesters do unregister on shutdown (the default)? To answer this question, we need to understand in which cases an ingester is not ACTIVE, and whether there can be more than 1 non-ACTIVE ingester in 1 replication set under normal conditions. If there's never more than 1 non-ACTIVE ingester in 1 replication set under normal conditions, then it's safe to keep extend-writes always disabled.

in 1 replication set

I'm mentioning it because the extend-writes setting affects only the ring.Get() call made by the distributor's Push(), which is used to build the replication set for 1 series. With RF=3 and quorum=2, we can tolerate up to 1 non-ACTIVE ingester in each replication set when writing series from distributors to ingesters.
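To make the quorum arithmetic explicit, here is a minimal, illustrative Go sketch of the tolerance per replication set (this is just the formula used in the reasoning above, not the dskit implementation):

```go
package main

import "fmt"

// minSuccess is the write quorum used in the reasoning above:
// floor(replicationFactor/2) + 1 successful writes are required.
// Illustrative sketch only, not the dskit implementation.
func minSuccess(replicationFactor int) int {
	return replicationFactor/2 + 1
}

func main() {
	rf := 3
	// With RF=3 the quorum is 2, so each per-series replication set
	// tolerates at most 1 non-ACTIVE (or failing) ingester on writes.
	fmt.Printf("RF=%d quorum=%d tolerated non-ACTIVE=%d\n",
		rf, minSuccess(rf), rf-minSuccess(rf))
}
```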

we need to understand in which cases an ingester is not ACTIVE, and whether there can be more than 1 non-ACTIVE ingester in 1 replication set under normal conditions

Cases coming to my mind:

  • Rolling update: 1 ingester at a time (when multi-zone is disabled) or multiple ingesters in the same zone (when multi-zone is enabled). In both cases, there should be at most 1 non-ACTIVE ingester per replication set. We're safe.
  • Scale up: multiple ingesters in the PENDING or JOINING state, in different zones, at the same time. This could happen, but it's no different from the current behaviour of Mimir deployed with jsonnet. I don't think it's a big deal for Mimir, because a new ingester stays in the PENDING/JOINING state for a very short time.
  • Scale down: to guarantee correct query results, we need to scale down 1 ingester at a time or, at most, 1 zone at a time (when multi-zone is enabled). We should be safe.

Proposal

I propose to deprecate it and keep it always disabled.
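For illustration, here is a generic sketch of what "deprecate and keep it always disabled" could look like at the flag-handling level. This is hypothetical code using the standard library flag package, not Mimir's actual flag registration:

```go
package main

import (
	"flag"
	"log"
)

func main() {
	// The flag stays registered so existing configurations keep parsing,
	// but its value no longer changes behaviour (hypothetical handling,
	// not Mimir's actual flag code).
	extendWrites := flag.Bool("distributor.extend-writes", false,
		"Deprecated: this option has no effect and will be removed.")
	flag.Parse()

	if *extendWrites {
		log.Println("-distributor.extend-writes is deprecated and ignored; writes are never extended")
	}
	// The write path then always behaves as if extend-writes were disabled.
}
```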

@johannaratliff
Contributor

With confidence from our cloud configuration, I support this.

@09jvilla
Contributor

09jvilla commented May 16, 2022

@pracucci I'll excuse myself from commenting on the correctness of the proposal since I'm a bit out of my depth, but roughly the logic makes sense to me.

As you're considering both the -ingester.unregister-on-shutdown=false and -ingester.unregister-on-shutdown=true cases, am I right to assume that you want to keep -ingester.unregister-on-shutdown=true as the default for the unregister flag?

If so, do you have any objection to overriding this default in the Helm chart (meaning, setting -ingester.unregister-on-shutdown=false in the Helm chart), since we can assume our Helm users deploy the ingesters as StatefulSets? See @Logiraptor's issue in the chart repo: grafana/helm-charts#1313


Both my comments are a bit out of scope for this issue, so we can take that discussion elsewhere if that's better (e.g., on the Helm chart issue I linked above). I don't want to sidetrack the simple yes/no on deprecating the extend-writes flag.

@pracucci
Collaborator Author

pracucci commented May 16, 2022

As you're considering both the -ingester.unregister-on-shutdown=false and -ingester.unregister-on-shutdown=true cases, am I right to assume that you want to keep -ingester.unregister-on-shutdown=true as the default for the unregister flag?

I would start a separate discussion about it (unrelated to this issue). The main problem with changing the default to -ingester.unregister-on-shutdown=false is that rolling updates wouldn't work if the instance ID (in the ring) is not preserved between restarts, because after a restart the ingester would start up with a different ID. Maybe not a common case in the real world, but I want all of us to think a bit more about it.

@pracucci pracucci self-assigned this May 17, 2022
@bboreham
Contributor

I have never understood what this feature was supposed to achieve. I believe it causes issues such as cortexproject/cortex#1290. Happy to see it removed, although it would be even better to understand it first.

cstyan added a commit to grafana/loki that referenced this issue Oct 27, 2022
With zone awareness enabled we allow restarting multiple ingesters
within a single zone at the same time, which means that for many more
write requests there are potentially only 2 of the 3 replicas for a
stream up at the time the write request is received. This led to a
large increase in 500s during rollouts.

My suspicion is that the 500s on rollouts were always related to the
following: the replica set ring code has something referred to as
"extending the replica set", which essentially adds ingesters in the
ring that are not in an ACTIVE state (for example, in LEAVING) to the
set of replicas considered valid for an operation. Whether instances
in a state other than ACTIVE are added can be controlled by using a
different operation type, which Mimir was already doing but we were
not. Here are the relevant functions:
[ring.Get](https://github.com/grafana/dskit/blob/8d6d914ef639c45eda4ab6f76448b115dfbc504c/ring/ring.go#L335-L403)
(particularly these lines
[1](https://github.com/grafana/dskit/blob/8d6d914ef639c45eda4ab6f76448b115dfbc504c/ring/ring.go#L381-L389)
and
[2](https://github.com/grafana/dskit/blob/8d6d914ef639c45eda4ab6f76448b115dfbc504c/ring/ring.go#L394-L397))
and
[replicationStrategy.Filter](https://github.com/grafana/dskit/blob/789ec0ca4a3b372335fad1c73147366a23ddc8b6/ring/replication_strategy.go#L22-L71)
(particularly these lines
[1](https://github.com/grafana/dskit/blob/789ec0ca4a3b372335fad1c73147366a23ddc8b6/ring/replication_strategy.go#L32-L36)
and
[2](https://github.com/grafana/dskit/blob/789ec0ca4a3b372335fad1c73147366a23ddc8b6/ring/replication_strategy.go#L54-L67))

The tl;dr of the above code snippets is that we have always required
a minSuccess of replicationFactor/2 + 1, except that replicationFactor
is overwritten with the number of instances in the set, which is
likely to be more than just 3 ingester replicas because of the replica
set extension. At least for zone-aware rollouts this seems to be the
case, because many ingesters will be in the LEAVING state. We can
avoid this by using `WriteNoExtend` operations instead of `Write`
operations
([here](https://github.com/grafana/dskit/blob/789ec0ca4a3b372335fad1c73147366a23ddc8b6/ring/ring.go#L83-L93))

In this PR I've hardcoded our call to the ring in `distributor.Push`
to always use `ring.WriteNoExtend`. IMO this is a better option than
allowing `ring.WriteNoExtend` to be used optionally and defaulting to
`ring.Write`. Extending the replication set on writes feels like a
footgun for write outages, and there's even some feeling internally
that allowing that extension is a bug.

For more background information see:
grafana/mimir#1854

Signed-off-by: Callum Styan <callumstyan@gmail.com>
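The arithmetic described in the commit message above can be illustrated with a small, self-contained Go sketch. It mirrors the described behaviour only; it is not the dskit code, and `ring.Write` / `ring.WriteNoExtend` appear only in the comments:

```go
package main

import "fmt"

// required mirrors the behaviour described in the commit message above:
// the quorum is computed from the size of the (possibly extended)
// replication set, not from the configured replication factor.
// Illustrative sketch only, not the dskit code.
func required(setSize int) int {
	return setSize/2 + 1
}

func main() {
	const rf = 3

	// ring.WriteNoExtend: the set stays at RF=3, 2 successes are needed,
	// so one LEAVING ingester per set is tolerated during a rollout.
	fmt.Printf("WriteNoExtend: need %d of %d\n", required(rf), rf)

	// ring.Write during a zone rollout: a LEAVING ingester is added, the
	// set grows to 4, and 3 successes are required even though only 2
	// replicas may actually be ACTIVE, hence the 500s described above.
	extended := rf + 1
	fmt.Printf("Write (extended): need %d of %d\n", required(extended), extended)
}
```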
lxwzy pushed a commit to lxwzy/loki that referenced this issue Nov 7, 2022
changhyuni pushed a commit to changhyuni/loki that referenced this issue Nov 8, 2022
Abuelodelanada pushed a commit to canonical/loki that referenced this issue Dec 1, 2022