Hardcode ring.WriteNoExtend for distributor push operations #7517
Conversation
Signed-off-by: Callum Styan <callumstyan@gmail.com>
./tools/diff_coverage.sh ../loki-main/test_results.txt test_results.txt ingester,distributor,querier,querier/queryrange,iter,storage,chunkenc,logql,loki

Change in test coverage per package. Green indicates 0 or positive change, red indicates that test coverage for a package fell.

+ ingester 0%
+ distributor 0%
+ querier 0%
+ querier/queryrange 0%
+ iter 0%
+ storage 0%
+ chunkenc 0%
+ logql 0%
+ loki 0%
LGTM
It looks to me like this is an underlying bug in dskit which artificially inflates the replication factor via non-ready nodes. However, I also think using WriteNoExtend is an independently safer & better choice, as it reduces the risk of cascading failures due to overloading nodes when their neighbors in the ring die.
With zone awareness enabled we allow restarting multiple ingesters within a single zone at the same time, which means that for many more write requests only 2 of the 3 replicas for a stream are potentially up at the time the write request is received. This led to a large increase in 500s during rollouts.
My suspicion is that the 500s on rollouts were always related to the following: the replica set ring code has something referred to as "extending the replica set". This is essentially the addition of ingesters that are not in an ACTIVE state (for example, in LEAVING) to the set of replicas considered valid for an operation. Whether instances in states other than ACTIVE get added can be controlled by using a different operation type, which Mimir was using but we were not. Here are the relevant functions: [ring.Get](https://github.com/grafana/dskit/blob/8d6d914ef639c45eda4ab6f76448b115dfbc504c/ring/ring.go#L335-L403) (particularly these lines [1](https://github.com/grafana/dskit/blob/8d6d914ef639c45eda4ab6f76448b115dfbc504c/ring/ring.go#L381-L389) and [2](https://github.com/grafana/dskit/blob/8d6d914ef639c45eda4ab6f76448b115dfbc504c/ring/ring.go#L394-L397)) and [replicationStrategy.Filter](https://github.com/grafana/dskit/blob/789ec0ca4a3b372335fad1c73147366a23ddc8b6/ring/replication_strategy.go#L22-L71) (particularly these lines [1](https://github.com/grafana/dskit/blob/789ec0ca4a3b372335fad1c73147366a23ddc8b6/ring/replication_strategy.go#L32-L36) and [2](https://github.com/grafana/dskit/blob/789ec0ca4a3b372335fad1c73147366a23ddc8b6/ring/replication_strategy.go#L54-L67)).
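To make the distinction concrete, here is a hedged sketch of how dskit encodes "extend the replica set" in the operation type itself: `ring.NewOp` takes the set of healthy states plus an optional callback deciding whether an instance in a given state should grow the replica set. The variable names below are illustrative, not the exact upstream definitions of `ring.Write` and `ring.WriteNoExtend`, and the method names reflect the dskit API as I understand it at the linked revisions.

```go
package main

import (
	"fmt"

	"github.com/grafana/dskit/ring"
)

// writeOp approximates ring.Write: ACTIVE is the only healthy state, and any
// non-ACTIVE instance encountered on the token path asks the ring to add one
// more instance to the replica set (the "extension").
var writeOp = ring.NewOp([]ring.InstanceState{ring.ACTIVE}, func(s ring.InstanceState) bool {
	return s != ring.ACTIVE
})

// writeNoExtendOp approximates ring.WriteNoExtend: the same healthy state,
// but the nil callback means the replica set is never extended.
var writeNoExtendOp = ring.NewOp([]ring.InstanceState{ring.ACTIVE}, nil)

func main() {
	fmt.Println(writeOp.ShouldExtendReplicaSetOnState(ring.LEAVING))         // true
	fmt.Println(writeNoExtendOp.ShouldExtendReplicaSetOnState(ring.LEAVING)) // false
}
```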
The tl;dr of the above code snippets is that we seem to have always required a minSuccess of replicationFactor/2 + 1, except we overwrite replicationFactor with the number of instances in the set, which is likely always going to be more than just 3 ingester replicas because of the replica set extension. At least for zone aware rollouts this seems to be the case, because many ingesters will be in the LEAVING state. We can avoid this by just using `WriteNoExtend` operations instead of `Write` operations ([here](https://github.com/grafana/dskit/blob/789ec0ca4a3b372335fad1c73147366a23ddc8b6/ring/ring.go#L83-L93)).
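As a rough illustration of that arithmetic (a simplified sketch of the minSuccess rule described above, not the dskit code itself), extending a replication-factor-3 set raises the write quorum:

```go
package main

import "fmt"

// quorum applies the simplified rule: if the (possibly extended) replica set
// is larger than the replication factor, the replication factor is
// overwritten with the set size before the write quorum is computed.
func quorum(replicationFactor, instancesInSet int) (minSuccess, maxFailures int) {
	if instancesInSet > replicationFactor {
		replicationFactor = instancesInSet
	}
	minSuccess = replicationFactor/2 + 1
	maxFailures = instancesInSet - minSuccess
	return
}

func main() {
	// Without extension (WriteNoExtend): 3 instances, minSuccess stays at 2.
	fmt.Println(quorum(3, 3)) // 2 1

	// With extension (Write) during a zone-aware rollout, e.g. two extra
	// instances pulled into the set: replicationFactor is overwritten to 5
	// and minSuccess rises to 3.
	fmt.Println(quorum(3, 5)) // 3 2
}
```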
In this PR I've hardcoded our call to the ring in `distributor.Push` to always use `ring.WriteNoExtend`. IMO this is a better option than allowing `ring.WriteNoExtend` to be used optionally and defaulting to `ring.Write`. Extending the replication set on writes feels like a footgun for write outages, and there's even some feeling internally that allowing that extension is a bug. For more background information see: grafana/mimir#1854
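For reference, a minimal sketch of what the hardcoded call could look like; the function and variable names are illustrative rather than the exact Loki code, and only `ring.ReadRing.Get` and `ring.WriteNoExtend` are assumed from dskit:

```go
package distributor

import "github.com/grafana/dskit/ring"

// replicasForStream asks the ring for a stream's replica set using
// ring.WriteNoExtend, so instances that are not ACTIVE (e.g. LEAVING during a
// rollout) never extend the set and never inflate the write quorum.
func replicasForStream(r ring.ReadRing, streamHash uint32) (ring.ReplicationSet, error) {
	// The three buffer arguments let callers reuse allocations across calls;
	// nil is fine for a one-off lookup.
	return r.Get(streamHash, ring.WriteNoExtend, nil, nil, nil)
}
```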
Signed-off-by: Callum Styan <callumstyan@gmail.com>