
Hardcode ring.WriteNoExtend for distributor push operations #7517

Merged
merged 1 commit into main from 20221025-write-hardcode on Oct 27, 2022

Conversation

cstyan
Contributor

@cstyan cstyan commented Oct 25, 2022

With zone awareness enabled we allow restarting multiple ingesters within a single zone at the same time, which means that for many more write requests there are potentially only 2 of the 3 replicas for a stream up at the time the write request is received. This led to a large increase in 500s during rollouts.

My suspicion is that the 500s on rollouts were always related to the following: the replica set ring code has something referred to as "extending the replica set", which essentially means adding ingesters that are not in an ACTIVE state (for example, ones in LEAVING) to the set of replicas considered valid for an operation. Whether instances in a state other than ACTIVE get added can be controlled by using a different operation type, which Mimir was already doing but we were not. The relevant functions are [ring.Get](https://github.com/grafana/dskit/blob/8d6d914ef639c45eda4ab6f76448b115dfbc504c/ring/ring.go#L335-L403) (particularly these lines [1](https://github.com/grafana/dskit/blob/8d6d914ef639c45eda4ab6f76448b115dfbc504c/ring/ring.go#L381-L389) and [2](https://github.com/grafana/dskit/blob/8d6d914ef639c45eda4ab6f76448b115dfbc504c/ring/ring.go#L394-L397)) and [replicationStrategy.Filter](https://github.com/grafana/dskit/blob/789ec0ca4a3b372335fad1c73147366a23ddc8b6/ring/replication_strategy.go#L22-L71) (particularly these lines [1](https://github.com/grafana/dskit/blob/789ec0ca4a3b372335fad1c73147366a23ddc8b6/ring/replication_strategy.go#L32-L36) and [2](https://github.com/grafana/dskit/blob/789ec0ca4a3b372335fad1c73147366a23ddc8b6/ring/replication_strategy.go#L54-L67)).
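
For reference, the only difference between the two operation types is whether they extend the replica set when they hit a non-ACTIVE instance. The snippet below is a sketch of equivalent ops built with dskit's exported `ring.NewOp`; it mirrors the linked definitions rather than copying them verbatim:

```go
// Package ringops is an illustrative sketch, not the actual dskit code; see
// the linked ring.go lines for the real Write / WriteNoExtend definitions.
package ringops

import "github.com/grafana/dskit/ring"

var (
	// writeOp mirrors dskit's Write: only ACTIVE instances are healthy, and the
	// replica set is extended whenever an instance is in any other state
	// (e.g. LEAVING), so the "missing" replica is written somewhere else.
	writeOp = ring.NewOp([]ring.InstanceState{ring.ACTIVE}, func(s ring.InstanceState) bool {
		return s != ring.ACTIVE
	})

	// writeNoExtendOp mirrors dskit's WriteNoExtend: identical, except that it
	// never extends the replica set.
	writeNoExtendOp = ring.NewOp([]ring.InstanceState{ring.ACTIVE}, nil)
)
```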

The tl;dr of the above code snippets is that we have always required minSuccess = replicationFactor/2 + 1 successful writes, except that replicationFactor is overwritten to be the number of instances in the (possibly extended) replica set, which is likely always more than just the 3 configured ingester replicas because of the replica set extension. At least for zone-aware rollouts this seems to be the case, since many ingesters will be in the LEAVING state. We can avoid this by using `WriteNoExtend` operations instead of `Write` operations ([here](https://github.com/grafana/dskit/blob/789ec0ca4a3b372335fad1c73147366a23ddc8b6/ring/ring.go#L83-L93)).
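
To make the arithmetic concrete, here is a small self-contained sketch (not the actual dskit code) of the quorum calculation the linked `Filter` lines perform; the instance counts are hypothetical:

```go
package main

import "fmt"

// quorum mirrors the calculation in the linked replicationStrategy.Filter
// lines: the replication factor is bumped up to the size of the (possibly
// extended) replica set before computing the required number of successes.
func quorum(replicationFactor, instancesInSet int) (minSuccess, maxErrors int) {
	if instancesInSet > replicationFactor {
		replicationFactor = instancesInSet
	}
	minSuccess = replicationFactor/2 + 1
	maxErrors = instancesInSet - minSuccess
	return minSuccess, maxErrors
}

func main() {
	// RF=3, no extension: 2 of 3 writes must succeed.
	fmt.Println(quorum(3, 3)) // 2 1

	// RF=3, but the set was extended to 5 because two ingesters are LEAVING:
	// now 3 writes must succeed, so the quorum got harder, not easier.
	fmt.Println(quorum(3, 5)) // 3 2
}
```

With RF=3 and no extension a write tolerates one failed ingester; once the set is extended to 5, three successes are required, so the rollout itself eats into the error budget.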

In this PR I've hardcoded our call to the ring in `distributor.Push` to always use `ring.WriteNoExtend`. IMO this is a better option than allowing `ring.WriteNoExtend` to be used optionally and defaulting to `ring.Write`: extending the replication set on writes feels like a footgun for write outages, and there is even some feeling internally that allowing that extension is a bug.
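
Conceptually the change is just swapping the operation passed to the ring lookup; the sketch below shows the shape of the call, with surrounding variable names that are illustrative rather than the exact Loki code:

```go
// Illustrative sketch of the ring lookup in distributor.Push; the names
// d.ingestersRing, streamHash, and descs are assumptions, not the exact code.

// Before: ring.Write lets dskit extend the replica set past the configured RF
// when it encounters non-ACTIVE ingesters.
// replicationSet, err := d.ingestersRing.Get(streamHash, ring.Write, descs[:0], nil, nil)

// After: ring.WriteNoExtend keeps the set at the configured replication factor.
replicationSet, err := d.ingestersRing.Get(streamHash, ring.WriteNoExtend, descs[:0], nil, nil)
if err != nil {
	return nil, err
}
```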

For more background information see: grafana/mimir#1854

Signed-off-by: Callum Styan callumstyan@gmail.com

Signed-off-by: Callum Styan <callumstyan@gmail.com>
@grafanabot
Collaborator

./tools/diff_coverage.sh ../loki-main/test_results.txt test_results.txt ingester,distributor,querier,querier/queryrange,iter,storage,chunkenc,logql,loki

Change in test coverage per package. Green indicates 0 or positive change, red indicates that test coverage for a package fell.

+           ingester	0%
+        distributor	0%
+            querier	0%
+ querier/queryrange	0%
+               iter	0%
+            storage	0%
+           chunkenc	0%
+              logql	0%
+               loki	0%

@cstyan cstyan marked this pull request as ready for review October 26, 2022 23:26
@cstyan cstyan requested a review from a team as a code owner October 26, 2022 23:26
Contributor

@JordanRushing JordanRushing left a comment

LGTM

Member

@owen-d owen-d left a comment


It looks to me like this is an underlying bug in dskit which artificially inflates the replication factor via non-ready nodes. However, I also think using WriteNoExtend is an independently safer & better choice as it reduces the risk of cascading failures due to overloading nodes when their neighbors in the ring die.

@cstyan cstyan merged commit 46b40a5 into main Oct 27, 2022
@cstyan cstyan deleted the 20221025-write-hardcode branch October 27, 2022 20:31
lxwzy pushed a commit to lxwzy/loki that referenced this pull request Nov 7, 2022
changhyuni pushed a commit to changhyuni/loki that referenced this pull request Nov 8, 2022
Abuelodelanada pushed a commit to canonical/loki that referenced this pull request Dec 1, 2022