Hardcode ring.WriteNoExtend for distributor push operations #7517
Conversation
Signed-off-by: Callum Styan <callumstyan@gmail.com>
./tools/diff_coverage.sh ../loki-main/test_results.txt test_results.txt ingester,distributor,querier,querier/queryrange,iter,storage,chunkenc,logql,loki

Change in test coverage per package. Green indicates 0 or positive change, red indicates that test coverage for a package fell.

+ ingester 0%
+ distributor 0%
+ querier 0%
+ querier/queryrange 0%
+ iter 0%
+ storage 0%
+ chunkenc 0%
+ logql 0%
+ loki 0%
LGTM
It looks to me like this is an underlying bug in dskit which artificially inflates the replication factor via non-ready nodes. However, I also think using WriteNoExtend is an independently safer & better choice, as it reduces the risk of cascading failures due to overloading nodes when their neighbors in the ring die.
With zone awareness enabled we allow restarting multiple ingesters within a single zone at the same time, which means that for many more write requests only 2 of the 3 replicas for a stream are potentially up at the time the write request is received. This led to a large increase in 500s during rollouts.
My suspicion is that the 500s on rollouts were always related to the following: the replica set ring code has something referred to as "extending the replica set". This is essentially the addition of ingesters that are not in an ACTIVE state (for example, in LEAVING) to the set of replicas considered valid for an operation. Whether instances in states other than ACTIVE get added can be controlled by using a different operation type, which Mimir was using but we were not. Here are the relevant functions: [ring.Get](https://github.com/grafana/dskit/blob/8d6d914ef639c45eda4ab6f76448b115dfbc504c/ring/ring.go#L335-L403) (particularly these lines [1](https://github.com/grafana/dskit/blob/8d6d914ef639c45eda4ab6f76448b115dfbc504c/ring/ring.go#L381-L389) and [2](https://github.com/grafana/dskit/blob/8d6d914ef639c45eda4ab6f76448b115dfbc504c/ring/ring.go#L394-L397)) and [replicationStrategy.Filter](https://github.com/grafana/dskit/blob/789ec0ca4a3b372335fad1c73147366a23ddc8b6/ring/replication_strategy.go#L22-L71) (particularly these lines [1](https://github.com/grafana/dskit/blob/789ec0ca4a3b372335fad1c73147366a23ddc8b6/ring/replication_strategy.go#L32-L36) and [2](https://github.com/grafana/dskit/blob/789ec0ca4a3b372335fad1c73147366a23ddc8b6/ring/replication_strategy.go#L54-L67)).
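To make the distinction concrete, here is a hedged sketch of how dskit encodes "extend the replica set" in the operation type itself: `ring.NewOp` takes the set of healthy states plus an optional callback deciding whether an instance in a given state should grow the replica set. The variable names below are illustrative, not the exact upstream definitions of `ring.Write` and `ring.WriteNoExtend`, and the method names reflect the dskit API as I understand it at the linked revisions.

```go
package main

import (
	"fmt"

	"github.com/grafana/dskit/ring"
)

// writeOp approximates ring.Write: ACTIVE is the only healthy state, and any
// non-ACTIVE instance encountered on the token path asks the ring to add one
// more instance to the replica set (the "extension").
var writeOp = ring.NewOp([]ring.InstanceState{ring.ACTIVE}, func(s ring.InstanceState) bool {
	return s != ring.ACTIVE
})

// writeNoExtendOp approximates ring.WriteNoExtend: the same healthy state,
// but the nil callback means the replica set is never extended.
var writeNoExtendOp = ring.NewOp([]ring.InstanceState{ring.ACTIVE}, nil)

func main() {
	fmt.Println(writeOp.ShouldExtendReplicaSetOnState(ring.LEAVING))         // true
	fmt.Println(writeNoExtendOp.ShouldExtendReplicaSetOnState(ring.LEAVING)) // false
}
```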
The tl;dr of the above code snippets is that we seem to have always required a minSuccess of replicationFactor/2 + 1, except we overwrite replicationFactor with the number of instances in the set, which is likely always going to be more than just 3 ingester replicas because of the replica set extension. At least for zone aware rollouts this seems to be the case, because many ingesters will be in the LEAVING state. We can avoid this by just using `WriteNoExtend` operations instead of `Write` operations ([here](https://github.com/grafana/dskit/blob/789ec0ca4a3b372335fad1c73147366a23ddc8b6/ring/ring.go#L83-L93)).
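As a rough illustration of that arithmetic (a simplified sketch of the minSuccess rule described above, not the dskit code itself), extending a replication-factor-3 set raises the write quorum:

```go
package main

import "fmt"

// quorum applies the simplified rule: if the (possibly extended) replica set
// is larger than the replication factor, the replication factor is
// overwritten with the set size before the write quorum is computed.
func quorum(replicationFactor, instancesInSet int) (minSuccess, maxFailures int) {
	if instancesInSet > replicationFactor {
		replicationFactor = instancesInSet
	}
	minSuccess = replicationFactor/2 + 1
	maxFailures = instancesInSet - minSuccess
	return
}

func main() {
	// Without extension (WriteNoExtend): 3 instances, minSuccess stays at 2.
	fmt.Println(quorum(3, 3)) // 2 1

	// With extension (Write) during a zone-aware rollout, e.g. two extra
	// instances pulled into the set: replicationFactor is overwritten to 5
	// and minSuccess rises to 3.
	fmt.Println(quorum(3, 5)) // 3 2
}
```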
In this PR I've hardcoded our call to the ring in `distributor.Push` to always use `ring.WriteNoExtend`. IMO this is a better option than allowing `ring.WriteNoExtend` to be used optionally and defaulting to `ring.Write`. Extending the replication set on writes feels like a footgun for write outages, and there's even some feeling internally that allowing that extension is a bug. For more background information see: grafana/mimir#1854
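For reference, a minimal sketch of what the hardcoded call could look like; the function and variable names are illustrative rather than the exact Loki code, and only `ring.ReadRing.Get` and `ring.WriteNoExtend` are assumed from dskit:

```go
package distributor

import "github.com/grafana/dskit/ring"

// replicasForStream asks the ring for a stream's replica set using
// ring.WriteNoExtend, so instances that are not ACTIVE (e.g. LEAVING during a
// rollout) never extend the set and never inflate the write quorum.
func replicasForStream(r ring.ReadRing, streamHash uint32) (ring.ReplicationSet, error) {
	// The three buffer arguments let callers reuse allocations across calls;
	// nil is fine for a one-off lookup.
	return r.Get(streamHash, ring.WriteNoExtend, nil, nil, nil)
}
```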
Signed-off-by: Callum Styan <callumstyan@gmail.com>