Proposal: deprecate -distributor.extend-writes and keep it always disabled #1854

Closed
pracucci opened this issue May 16, 2022 · 4 comments · Fixed by #1856

@pracucci
Collaborator

-distributor.extend-writes (defaults to true) is a setting used by distributors and rulers to find the set of ingesters a series should be written to. The documentation says:

Try writing to an additional ingester in the presence of an ingester not in the ACTIVE state. It is useful to disable this along with -ingester.ring.unregister-on-shutdown=false in order to not spread samples to extra ingesters during rolling restarts with consistent naming.

In Mimir jsonnet we currently disable extend-writes when ingesters are configured to not unregister on shutdown. For that use case, we already know it's safe to disable it.

Question: is it safe to also disable it in the case where ingesters do unregister on shutdown (the default)? To answer this question, we need to understand in which cases an ingester is not ACTIVE, and whether there can be more than 1 non-ACTIVE ingester in 1 replication set under normal conditions. If there's never more than 1 non-ACTIVE ingester in 1 replication set under normal conditions, then it's safe to keep extend-writes always disabled.

in 1 replication set

I'm mentioning it because the extend-writes setting affects only the ring.Get() call made by the distributor's Push(), which is used to build the replication set for 1 series. With RF=3 and quorum=2, we can tolerate up to 1 non-ACTIVE ingester in each replication set when writing series from distributors to ingesters.
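To make the quorum arithmetic explicit, here is a minimal, illustrative Go sketch of the tolerance per replication set (this is just the formula used in the reasoning above, not the dskit implementation):

```go
package main

import "fmt"

// minSuccess is the write quorum used in the reasoning above:
// floor(replicationFactor/2) + 1 successful writes are required.
// Illustrative sketch only, not the dskit implementation.
func minSuccess(replicationFactor int) int {
	return replicationFactor/2 + 1
}

func main() {
	rf := 3
	// With RF=3 the quorum is 2, so each per-series replication set
	// tolerates at most 1 non-ACTIVE (or failing) ingester on writes.
	fmt.Printf("RF=%d quorum=%d tolerated non-ACTIVE=%d\n",
		rf, minSuccess(rf), rf-minSuccess(rf))
}
```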

we need to understand in which cases an ingester is not ACTIVE, and whether there can be more than 1 non-ACTIVE ingester in 1 replication set under normal conditions

Cases coming to my mind:

  • Rolling update: 1 ingester at a time (when multi-zone is disabled) or multiple ingesters in the same zone (when multi-zone is enabled). In both cases, there should be at most 1 non-ACTIVE ingester per replication set. We're safe.
  • Scale up: multiple ingesters in the PENDING or JOINING state, in different zones, at the same time. This could happen, but it's no different from the current behaviour of Mimir deployed with jsonnet. I don't think it's a big deal for Mimir, because a new ingester stays in the PENDING/JOINING state for a very short time.
  • Scale down: to guarantee correct query results, we need to scale down 1 ingester at a time or, at most, 1 zone at a time (when multi-zone is enabled). We should be safe.

Proposal

I propose to deprecate it and keep it always disabled.
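For illustration, here is a generic sketch of what "deprecate and keep it always disabled" could look like at the flag-handling level. This is hypothetical code using the standard library flag package, not Mimir's actual flag registration:

```go
package main

import (
	"flag"
	"log"
)

func main() {
	// The flag stays registered so existing configurations keep parsing,
	// but its value no longer changes behaviour (hypothetical handling,
	// not Mimir's actual flag code).
	extendWrites := flag.Bool("distributor.extend-writes", false,
		"Deprecated: this option has no effect and will be removed.")
	flag.Parse()

	if *extendWrites {
		log.Println("-distributor.extend-writes is deprecated and ignored; writes are never extended")
	}
	// The write path then always behaves as if extend-writes were disabled.
}
```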

@johannaratliff
Contributor

With confidence from our cloud configuration, I support this.

@09jvilla
Contributor

09jvilla commented May 16, 2022

@pracucci I'll excuse myself from commenting on the correctness of the proposal since I'm a bit out of my depth, but roughly the logic makes sense to me.

As you're considering both the -ingester.unregister-on-shutdown=false and -ingester.unregister-on-shutdown=true cases, am I right to assume that you want to keep -ingester.unregister-on-shutdown=true as the default for the unregister flag?

If so, do you have any objection to overriding this default in the Helm chart (meaning, setting -ingester.unregister-on-shutdown=false in the Helm chart), since we can assume our Helm users deploy the ingesters as StatefulSets? See @Logiraptor's issue in the chart repo: grafana/helm-charts#1313


Both my comments are a bit out of scope for this issue, so we can take that discussion elsewhere if that's better (e.g., on the Helm chart issue I linked above). I don't want to sidetrack the simple yes/no on deprecating the extend-writes flag.

@pracucci
Collaborator Author

pracucci commented May 16, 2022

As you're considering both the -ingester.unregister-on-shutdown=false and -ingester.unregister-on-shutdown=true cases, am I right to assume that you want to keep -ingester.unregister-on-shutdown=true as the default for the unregister flag?

I would start a separate discussion about it (unrelated to this issue). The main problem with changing the default to -ingester.unregister-on-shutdown=false is that rolling updates wouldn't work if the instance ID (in the ring) is not preserved between restarts, because after a restart the ingester would start up with a different ID. Maybe not a common case in the real world, but I want all of us to think a bit more about it.

@pracucci pracucci self-assigned this May 17, 2022
@bboreham
Contributor

I have never understood what this feature was supposed to achieve. I believe it causes issues such as cortexproject/cortex#1290. Happy to see it removed, although it would be even better to understand it first.

cstyan added a commit to grafana/loki that referenced this issue Oct 27, 2022
With zone awareness enabled we allow restarting multiple ingesters
within a single zone at the same time, which means that for many more
write requests there are potentially only 2 of the 3 replicas for a
stream up at the time the write request is received. This led to a
large increase in 500s during rollouts.

My suspicion is that the 500s on rollouts were always related to the
following: the replica set ring code has something referred to as
"extending the replica set", which essentially adds ingesters in the
ring that are not in an ACTIVE state (for example, in LEAVING) to the
set of replicas considered valid for an operation. Whether instances
in a state other than ACTIVE are added can be controlled by using a
different operation type, which Mimir was already doing but we were
not. Here are the relevant functions:
[ring.Get](https://github.com/grafana/dskit/blob/8d6d914ef639c45eda4ab6f76448b115dfbc504c/ring/ring.go#L335-L403)
(particularly these lines
[1](https://github.com/grafana/dskit/blob/8d6d914ef639c45eda4ab6f76448b115dfbc504c/ring/ring.go#L381-L389)
and
[2](https://github.com/grafana/dskit/blob/8d6d914ef639c45eda4ab6f76448b115dfbc504c/ring/ring.go#L394-L397))
and
[replicationStrategy.Filter](https://github.com/grafana/dskit/blob/789ec0ca4a3b372335fad1c73147366a23ddc8b6/ring/replication_strategy.go#L22-L71)
(particularly these lines
[1](https://github.com/grafana/dskit/blob/789ec0ca4a3b372335fad1c73147366a23ddc8b6/ring/replication_strategy.go#L32-L36)
and
[2](https://github.com/grafana/dskit/blob/789ec0ca4a3b372335fad1c73147366a23ddc8b6/ring/replication_strategy.go#L54-L67))

The tl;dr of the above code snippets is that we have always required
a minSuccess of replicationFactor/2 + 1, except that replicationFactor
is overwritten with the number of instances in the set, which is
likely to be more than just 3 ingester replicas because of the replica
set extension. At least for zone-aware rollouts this seems to be the
case, because many ingesters will be in the LEAVING state. We can
avoid this by using `WriteNoExtend` operations instead of `Write`
operations
([here](https://github.com/grafana/dskit/blob/789ec0ca4a3b372335fad1c73147366a23ddc8b6/ring/ring.go#L83-L93))

In this PR I've hardcoded our call to the ring in `distributor.Push`
to always use `ring.WriteNoExtend`. IMO this is a better option than
allowing `ring.WriteNoExtend` to be used optionally and defaulting to
`ring.Write`. Extending the replication set on writes feels like a
footgun for write outages, and there's even some feeling internally
that allowing that extension is a bug.

For more background information see:
grafana/mimir#1854

Signed-off-by: Callum Styan <callumstyan@gmail.com>
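The arithmetic described in the commit message above can be illustrated with a small, self-contained Go sketch. It mirrors the described behaviour only; it is not the dskit code, and `ring.Write` / `ring.WriteNoExtend` appear only in the comments:

```go
package main

import "fmt"

// required mirrors the behaviour described in the commit message above:
// the quorum is computed from the size of the (possibly extended)
// replication set, not from the configured replication factor.
// Illustrative sketch only, not the dskit code.
func required(setSize int) int {
	return setSize/2 + 1
}

func main() {
	const rf = 3

	// ring.WriteNoExtend: the set stays at RF=3, 2 successes are needed,
	// so one LEAVING ingester per set is tolerated during a rollout.
	fmt.Printf("WriteNoExtend: need %d of %d\n", required(rf), rf)

	// ring.Write during a zone rollout: a LEAVING ingester is added, the
	// set grows to 4, and 3 successes are required even though only 2
	// replicas may actually be ACTIVE, hence the 500s described above.
	extended := rf + 1
	fmt.Printf("Write (extended): need %d of %d\n", required(extended), extended)
}
```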
lxwzy pushed a commit to lxwzy/loki that referenced this issue Nov 7, 2022
changhyuni pushed a commit to changhyuni/loki that referenced this issue Nov 8, 2022
Abuelodelanada pushed a commit to canonical/loki that referenced this issue Dec 1, 2022