Investigate and fix unbalanced series between ingesters (no shuffle sharding) #4736

Closed · pracucci opened this issue Apr 14, 2023 · 20 comments

@pracucci (Collaborator)

While running Mimir in production (and Cortex before it) we've always observed somewhat unbalanced series counts between ingesters. This imbalance exists whether shuffle sharding is enabled or disabled, although it is even more pronounced when shuffle sharding is enabled.

For example, the following Mimir cluster shows a 25% spread between the ingesters with the lowest and highest number of series:

[image: Screenshot 2023-04-14 at 15 24 57]

This issue tracks the investigation and discussion of how to fix it in the case where shuffle sharding is not used. Shuffle sharding adds extra complexity to the fix, but I believe we first have to address the "no shuffle sharding" case and then move to shuffle sharding (because shuffle sharding was designed on the expectation that series for a given tenant are evenly sharded between the ingesters of its shard).

@pracucci (Collaborator Author)

This comment is a redacted copy-paste of an internal ticket at Grafana Labs.

Why are series unbalanced between ingesters?

In the past I've analysed the correlation between the number of series per ingester and the number of tokens registered in the ring, and I found none. There are two reasons for this:

  1. The function used to compute the % of registered tokens in the ring was wrong (fix here)
  2. Assuming no shuffle sharding, to have a correlation between the number of series in the ingesters and the owned tokens %, we need to compute the latter taking into account zone-awareness and RF=3

So, I've computed the % of tokens owned by each ingester taking into account zone-awareness and RF=3, and there's an almost perfect correlation with the number of series (no shuffle sharding) per ingester:

[image: Screenshot 2023-04-13 at 15 19 27]

Summary: series are uneven because token ranges are unevenly spaced once ownership is computed taking into account zone-awareness and RF=3.
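
For reference, a rough sketch of how this ownership can be computed, under a simplified model of the ring (ringToken and ownedTokensPercentage are illustrative names, not dskit API). Each range is credited to the instance owning the token that ends it, plus the first instance of each other zone found walking the ring clockwise, up to RF distinct zones:

import "math"

type ringToken struct {
    token    uint32
    instance string
    zone     string
}

// ownedTokensPercentage returns, per instance, the percentage of the token
// space it owns once the replication factor rf and zone-awareness are taken
// into account. tokens must be sorted by token value.
func ownedTokensPercentage(tokens []ringToken, rf int) map[string]float64 {
    owned := make(map[string]uint64)
    for i, t := range tokens {
        // Size of the range ending at this token (wraps around the ring).
        prev := tokens[(i+len(tokens)-1)%len(tokens)].token
        size := uint64(t.token - prev) // unsigned subtraction wraps correctly

        // Walk clockwise from this token and credit the range to the first
        // instance found in each of rf distinct zones.
        zonesSeen := make(map[string]bool, rf)
        for j := 0; len(zonesSeen) < rf && j < len(tokens); j++ {
            r := tokens[(i+j)%len(tokens)]
            if !zonesSeen[r.zone] {
                zonesSeen[r.zone] = true
                owned[r.instance] += size
            }
        }
    }
    perc := make(map[string]float64, len(owned))
    for instance, size := range owned {
        perc[instance] = 100 * float64(size) / float64(uint64(math.MaxUint32)+1)
    }
    return perc
}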

The spread of series per-ingester vs owned tokens

We observe a 24% spread between the ingesters with the lowest and highest number of series of a user with shuffle sharding disabled. The same 24% spread is measured in the number of tokens owned per ingester taking into account zone-awareness and RF=3, and a similar spread of 22% is measured in the number of tokens registered in the ring:

[image: Screenshot 2023-04-13 at 15 21 06]

Summary: zone-awareness and RF=3 do not add more spread beyond the initial one caused by uneven token ranges registered in the ring.

Experiment with a simulated ring with perfectly evenly spaced tokens

I've built a ring with perfectly evenly spaced tokens and then run the same analysis. Tokens are still perfectly spaced even after applying RF=3 and zone-awareness:

[image: Screenshot 2023-04-13 at 15 30 03]

Summary: this confirms, from the opposite direction, that "zone-awareness and RF=3 do not add more spread beyond the initial one caused by uneven token ranges registered in the ring".

Conclusion

  • Series are unevenly distributed between ingesters even when shuffle-sharding is disabled
  • When shuffle sharding is disabled, uneven distribution of series is caused by uneven token ranges registered in the ring
  • Token ranges registered in the ring should be evenly distributed not only between all ingesters but also among the ingesters of each zone
  • Shuffle sharding exacerbates this problem (out of the scope of this issue)


@pracucci (Collaborator Author)

pracucci commented Apr 14, 2023

How much does the "number of tokens per ingester" impact the spread between ingesters?

I've run a simulation to check how the "tokens ownership spread" (the difference in % between the tokens owned by the lowest and the highest ingester) changes as the number of tokens per ingester changes (a rough sketch of the simulation follows the results below):

[image: Screenshot 2023-04-14 at 15 33 24]

How the spread changes (lower is better):

  • 64 tokens per ingester: 52%
  • 128 tokens per ingester: 41%
  • 256 tokens per ingester: 32%
  • 512 tokens per ingester: 24%
  • 1024 tokens per ingester: 17%
  • 2048 tokens per ingester: 12%
  • 4096 tokens per ingester: 9%
  • 8192 tokens per ingester: 6%
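
For illustration, a rough sketch of this kind of simulation, assuming a single zone and RF=1 (simulateSpread is a hypothetical name, not the actual tool):

import (
    "math/rand"
    "sort"
)

// simulateSpread registers tokensPerIngester random tokens for each of
// numIngesters ingesters and returns the ownership spread (max-min)/max.
func simulateSpread(numIngesters, tokensPerIngester int) float64 {
    type ownedToken struct {
        token uint32
        owner int
    }
    tokens := make([]ownedToken, 0, numIngesters*tokensPerIngester)
    for owner := 0; owner < numIngesters; owner++ {
        for i := 0; i < tokensPerIngester; i++ {
            tokens = append(tokens, ownedToken{rand.Uint32(), owner})
        }
    }
    sort.Slice(tokens, func(i, j int) bool { return tokens[i].token < tokens[j].token })

    // Each token owns the range between its predecessor and itself.
    owned := make([]uint64, numIngesters)
    for i, t := range tokens {
        prev := tokens[(i+len(tokens)-1)%len(tokens)].token
        owned[t.owner] += uint64(t.token - prev) // wraps correctly for the first range
    }
    minOwned, maxOwned := owned[0], owned[0]
    for _, o := range owned {
        if o < minOwned {
            minOwned = o
        }
        if o > maxOwned {
            maxOwned = o
        }
    }
    return float64(maxOwned-minOwned) / float64(maxOwned)
}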


@pracucci (Collaborator Author)

Is crypto/rand better than math/rand?

No.

ring.GenerateTokens() uses math/rand, so I've re-run the simulation replacing it with crypto/rand, and the spread I got is the same.
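
For completeness, this is the kind of swap that was tested: drawing a token from crypto/rand instead of a math/rand source (randomTokenCrypto is a hypothetical helper, not ring.GenerateTokens itself):

import (
    crand "crypto/rand"
    "encoding/binary"
)

// randomTokenCrypto draws a single random token from crypto/rand.
func randomTokenCrypto() uint32 {
    var buf [4]byte
    if _, err := crand.Read(buf[:]); err != nil {
        panic(err)
    }
    return binary.BigEndian.Uint32(buf[:])
}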

@duricanikolic (Contributor)

This weekend I played a bit with the tool created by @pracucci and extended it in the following way:

  • the first ingester of each zone is assigned its tokens randomly, but they are "almost" equidistant from one another (here)
  • the tokens of all other ingesters are distributed using this strategy: at each step the biggest ring range from the same zone is split in 2 parts, and a new token is added in the middle (unless already taken by an ingester from another zone, in which case the first available one is used). Token infos, containing tokens and range sizes, are connected via a doubly-linked list, and a priority queue is used for choosing the next one (here).

Then I did some analyses to see:

  • A1: how tokens get distributed to ingesters during their registration, fixing the number of ingesters and choosing different numbers of tokens (see the results), and
  • A2: fixing both the number of ingesters and the number of tokens, creating the distribution using the approach above, and then generating tokens randomly (to simulate timeseries to be handled) and seeing to which ingesters they would be assigned.

I didn't get perfect results, but I am glad I did this exercise. In analysis A1 I calculated the "spreads" both per-zone and for all ingesters. The latter is pretty bad, but the former is actually good: depending on the combination of number of ingesters and number of tokens per ingester we actually get either spread 0 or spread 0.5. The question is: which spreads should we be more concerned about, per-zone or in total? I guess the answer is both. This is also what Cassandra tries to achieve, and I would like to try to apply an algorithm like theirs (which is pretty complicated, but I studied it a lot in the last days) and see how it works. The good thing with this approach is that we don't need a huge number of tokens per ingester.
In analysis A2 I simulated different scenarios (by changing the number of ingesters, tokens, zone-awareness, replication factor and number of series). Again the results are not perfect at all, because there are big differences in spreads, but if I am not mistaken there are usually only 2-3 groups of spreads to which the majority of ingesters belong. The "best" analysis that I got was when I chose 512 ingesters (I was tired and didn't realize that 512%3 != 0), 64 tokens per ingester, zone-awareness enabled, RF=3 and 10M series: the spread was only 4.5%. I don't know whether it was just a lucky case, but the result is this one, and the analysis took ages to complete.

Some thoughts:

  • maybe we should try to choose candidate tokens the Cassandra way (by selecting, at each step, the best one that minimizes the spread in its zone and in general)
  • should we try Murmur3Partitioner for choosing the tokens randomly?

pracucci self-assigned this Apr 18, 2023
@krajorama (Contributor)

Increasing the number of tokens can mitigate the effect, especially if the current number of tokens is relatively small (64).
Feedback from the Grafana Mimir community call: the migration is non-trivial in large clusters due to grafana/dskit#79. We need to prioritize this.

@duricanikolic (Contributor)

duricanikolic commented May 12, 2023

Inspired by the analysis done by @pracucci, I have spent some time researching how other companies have faced similar problems. I was particularly curious about the solution provided by Cassandra (details). As a result of that research I have built a prototype for building better-balanced rings, both with and without zone-awareness. The prototype is just a simulator and does NOT use the same dskit structs Mimir does (e.g., ring.Desc, InstanceDesc). The reason is that at the moment I started writing the prototype I didn't have enough knowledge of dskit. I will try to keep this post as short as possible, and I am happy to share more details.

Problem Description

Starting from a ring of tokens and from a set of instances with a certain token ownership, we want to enrich the ring with additional tokens and assign them to a new instance in such a way that token ownership of the new set of instances is as close as possible to the previous one. The main challenges here are:

  1. How do we choose new tokens?
  2. How do we calculate the token ownership?
  3. How do we evaluate which token distribution is better than another one?

Each token is owned by exactly one instance, and each instance owns the same fixed number of tokens ($tokensPerInstance$).

Current Approach (Mimir-like)

In the current approach (analyses done so far) the 3 challenges are addressed like this:

  1. For each new instance we choose $tokensPerInstance$ random tokens.
  2. Each token “owns” the range between its predecessor in the ring and itself. Each instance owns all the ranges owned by all its tokens.
  3. We compare different token distributions by using the “token ownership spread”, i.e., the difference in % between instances with the lowest and the highest token ownership.

Proposed Approach (Cassandra-like)

The approach used in my prototype addresses the 3 challenges like this:

  1. For each new instance, $tokensPerInstance$ tokens are chosen from a set of candidates, where each candidate is taken from the middle between 2 existing tokens. For example, if the ring contains tokens 10, 110 and 210, the candidates would be 60, 160 and 260 (see the sketch below).
  2. Each token T “owns” the range between the predecessor (barrier) of the furthest token whose replica ends in T and T itself (replica start). Each instance owns all the ranges owned by all its tokens. Calculation of replicas depends on the zone-awareness and/or the replication factor of the ring. Have a look at an example of token ownership in a small ring (more details available in the draft).
  3. Comparison between different distributions is done via the standard deviation from the optimal token distribution, i.e., from $\frac{RF * tokenSpace}{instancesCount}$.
    [image: TokenOwnership]
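
A sketch of the candidate computation from step 1, assuming a sorted ring of uint32 tokens (candidates is an illustrative name):

// candidates returns one candidate token per ring range: the midpoint
// between each pair of adjacent tokens.
func candidates(ringTokens []uint32) []uint32 {
    out := make([]uint32, 0, len(ringTokens))
    for i, t := range ringTokens {
        next := ringTokens[(i+1)%len(ringTokens)]
        // next-t wraps correctly for the last range thanks to unsigned arithmetic.
        out = append(out, t+(next-t)/2)
    }
    return out
}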

Evaluation of an insertion of a candidate token

When a candidate token is inserted in a ring of tokens, it might affect the ownership of other tokens in the ring, as well as of the instances that own the affected tokens. The change of a token is calculated as the difference between the standard deviation of that token's ownership in the original ring enriched with the candidate token and the standard deviation of its ownership in the original ring. The change of an instance is calculated as the sum of the changes of all the tokens owned by that instance. The sum of the changes of all tokens and instances represents the evaluation of the candidate token.

Adding a new instance to the ring

As stated above, for each new instance the $tokensPerInstance$ most appropriate candidate tokens are chosen, placed in the ring and assigned to the instance. For each of the $tokensPerInstance$ tokens the following is done (a simplified sketch follows the list):

  1. We calculate all possible candidates in an existing ring by selecting a candidate token in the middle between each pair of existing tokens in the ring.
  2. For each candidate, we calculate the change it gives rise to and put all the changes in a priority queue (min queue).
  3. We select the token giving rise to the minimum change, remove it from the priority queue, and insert it in the ring.
  4. We repeat the process $tokensPerInstance$ times.
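
Putting it together, a simplified sketch of this loop, using a plain linear scan instead of the priority queue (addInstance and evaluate are illustrative names; candidates is the helper sketched earlier):

import (
    "math"

    "golang.org/x/exp/slices"
)

// addInstance picks, tokensPerInstance times, the candidate token whose
// insertion causes the minimum change, and inserts it in the ring.
// evaluate is assumed to implement the evaluation described above.
func addInstance(ring []uint32, tokensPerInstance int, evaluate func(ring []uint32, candidate uint32) float64) []uint32 {
    for n := 0; n < tokensPerInstance; n++ {
        cands := candidates(ring)
        best, bestChange := cands[0], math.Inf(1)
        for _, c := range cands {
            if change := evaluate(ring, c); change < bestChange {
                best, bestChange = c, change
            }
        }
        ring = append(ring, best)
        slices.Sort(ring) // keep the ring sorted for the next iteration
    }
    return ring
}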

Preliminary Results

Ring generation

In this phase I wanted to generate different rings of 66 instances and compare them. I considered 3 scenarios:

  1. zone-awareness disabled with RF = 1,
  2. zone-awareness disabled with RF = 3, and
  3. zone-awareness enabled with RF = 3.

In each scenario I generated rings with different numbers of tokens per instance (4, 16, 64, 128, 256 and 512), first by using random token selection (current approach) and then by using candidate token selection (new approach). For simplicity, I am sharing the results for scenario 3 only, and I am happy to share the rest of them. The optimal token ownership in this case is $\frac{3 * max(uint32)}{66} = 195225786.136$.

The first chart compares the standard deviations (averaged over 15 iterations) of both approaches for different numbers of tokens per instance. It can be seen that with a low number of tokens the standard deviation is much lower in the new approach, while for a higher number of tokens the two approaches are close to each other. For example, the precision reached with 256 tokens in the current approach is similar to the one reached with 64 tokens in the new approach, and the precisions with 128 tokens are similar.
[image: StDevCompRF3ZAE]

You can also see the token ownership distributions of the 4 rings mentioned above (64 and 128 tokens per instance, built with the 2 approaches). The new approach builds much more balanced rings compared to the current one.
[image: RingsWith64Tokens]
[image: RingsWith128Tokens]

Timeseries distributions

I have then used the 4 rings generated above to simulate the distribution of 10M random keys (simulating 10M timeseries). The charts below show how many of the 10M tokens, replicated 3 times with zone-awareness enabled, are owned by each instance. In this case, the optimal token ownership is $\frac{3 * 10M}{66} = 454545.45$. Again, we can see that the new approach distributes series in a more balanced way (standard deviations are 8.31% and 11.41% respectively).
[image: TimeSeriesDistribution64Tokens]
[image: TimeSeriesDistribution128Tokens]

The final chart I am sharing here is the distribution of 10M timeseries in rings of 513 instances with 64 and 128 tokens each, generated with the current and the new approach. We can notice that, with more tokens in the ring, the distribution becomes a bit more chaotic, but the standard deviation from the optimal token ownership ($\frac{3 * 10M}{513} = 58479.53$) in the new approach (2.76% and 3.69%) is lower than in the current approach (12.69% and 9.03%).
[image: TimeSeriesDistribution513Inst64Tokens]
[image: TimeSeriesDistribution513Inst128Token]

A prototype for building better balanced rings
More details about the new approach (draft)

@duricanikolic (Contributor)

duricanikolic commented May 13, 2023

Changes of ring topology

I have done an additional test: I started from a generated ring of 66 instances with a certain number of tokens per instance, and I collected all the standard deviations from the current optimal ownership while:

  1. first adding 66 instances one-by-one,
  2. and then removing the added 66 instances one-by-one in LIFO order.

This was done for rings generated with different numbers of tokens per instance, and for both approaches. This somewhat simulates the regular scaling up and down of instances. The results are shown in the following chart.

Note: next week I want to simulate removals of random instances (from a random zone) and the subsequent addition of another instance in the same zone.

  • The x-axis represents the number of instances in the ring: 66, 68a (meaning 68 instances after an instance is added), ..., 132a (meaning 132 after an instance is added), 130r (meaning 130 after an instance is removed), ..., 66r (meaning 66 after an instance is removed).
  • The y-axis represents the standard deviation from the optimal token ownership (when having the number of instances on the x-axis)
  • We can see that, as expected, in all analysed cases, the standard deviation after removal of an instance is equal to the standard deviation before its addition, if we add and remove instances in the same order.

[image: StDevByAddingAndRemoving]

@duricanikolic (Contributor)

As mentioned in @pracucci's feedback, I am adding some additional comparisons:

  • comparison of token ownership in rings of 66 instances with 512 tokens each built by using the candidate selection approach and by random token selection
    [image: RingsWith512Tokens]
  • comparison of distributions of 10M timeseries to 66 instances with 512 tokens each in the rings built by using the candidate selection approach and by random token selection
    [image: TimeSeriesDistribution66Inst512Tokens]
  • comparison of distributions of 10M timeseries to 513 instances with 512 tokens each in the rings built by using the candidate selection approach and by random token selection
    [image: TimeSeriesDistributions513Inst512Tokens]

@duricanikolic (Contributor)

Spread comparison

If, instead of using the standard deviation from the optimal token distribution, we compare the spread, i.e., the difference between the ingesters with the highest and the lowest token ownership ($\frac{max - min}{max}$), we obtain the following results.

Note: with a higher number of tokens per instance (128+) the random token selection approach gives a lower spread. It is, though, worth noting that the candidate selection approach picks candidates so as to minimize the difference between the standard deviation from the optimal token ownership before and after adding a new token, so this comparison is not entirely fair. Maybe it makes sense to define and analyze another approach in which candidates are selected based on the spread.

  • comparison of spread in rings of 66 instances with variable tokens each built by using the candidate selection approach and by random token selection
    [image: SpreadComparisonRF3ZAE]
  • comparison of spread of distributions of 10M timeseries in rings of 66 instances with variable tokens each built by using the candidate selection approach and by random token selection
    [image: SpreadComparsion10MSeries66InstRF3ZAE]
  • comparison of spread of distributions of 10M timeseries in rings of 513 instances with variable tokens each built by using the candidate selection approach and by random token selection
    [image: SpreadComparison10MSeries513InstRF3ZAE]

@colega (Contributor)

colega commented May 23, 2023

Hey, hi, can you explain why we are talking about zone-awareness & replication factor here?

Maybe I'm missing the point, but I think that it only makes the conversation harder, without adding a lot of value to the problem statement.

I think we should consider just one zone, say zone-a, and replication to one ingester in that zone when talking about unbalanced series. The tokens that zone-b and zone-c have assigned should matter little to none when talking about unbalanced series in zone-a.

The only point where all zones matter is when designing an algorithm to choose the tokens, as unfortunately (or fortunately, if we keep adding features that use the ring) we use the same token space, so when deciding which tokens a new instance should take it should:

  • filter out the tokens of different zones
  • consider the tokens of different zones to avoid collisions, having to find a consecutive non-colliding token if that happens (this can be mitigated by starting each zone's tokens with the zone-index offset * max estimated ingesters per zone).

@colega (Contributor)

colega commented May 23, 2023

I had a chat with Yuri and had some ideas that I want to write down.

My thoughts flow

In the simulations I see here, there are iterations based on different randomness seeds. This introduces an extra dimension of the problem to reason about, and also non-deterministic behaviour in production.

Why do we need randomness? Can we use the same seed all the time?

Let's say we want 512 tokens per ingester. First I thought: can we make ingester-zone-a-0 take the tokens 0, MaxUint32/512, 2*(MaxUint32/512), etc., so we already start from a known situation? Then the rest of the ingesters base their decision on an algorithm that we choose (see @duricanikolic's investigation on the best algorithm; that's not the point here).

Then I thought, well that is still suboptimal, I see two issues:

  • When we add 10 more replicas, they will all try to take the same tokens, and since we use memberlist, they will take the same tokens, but then they'll realize the collision. So we'll need an observer like the one @bboreham is proposing
  • If we solve the previous one, the fact that we take decisions based on what we see in the ring still makes it non-deterministic as the order of startup of ingesters matters.
    • What if ingester-zone-a-10 starts before ingester-zone-a-11, but then we shut down ingester-zone-a-11? Is that still an optimal distribution? Will ingester-zone-a-11 get the same optimal tokens the next time it starts?

But, it's easy for us to make each ingester know its index (0, 1, 2...) and zone (a, b, c...). (I know it's easy in Kubernetes, and I guess it's easy on bare-metal setups with Ansible, etc.)

So, what if the ingesters don't base their decision on what they see in the ring, but on what should be in the ring instead? When ingester-zone-a-2 starts, it doesn't need to check which tokens belong to ingester-zone-a-1: it can assume they're the ones assigned by the known algorithm, and just take the tokens that the algorithm tells it to. This removes the race condition between the propagation of tokens, and it also removes the overhead of passing more tokens around.

But wait, the whole point of ingesters registering the tokens in the ring is so that distributors know which series belong to each ingester, right? So, we still need to pass the tokens to the distributors. Well no: we only have to tell the distributors that we're ingester-zone-a-134 and we own 512 tokens; the distributor can infer our tokens because they're well known. We don't need to put the tokens in the ring anymore.

TL;DR, my proposal

Write a function that assigns the optimal c tokens to the ingester with index idx in a given zone:

import "math/rand"

func tokens(c, idx, zone int) []uint32 {
    // Whatever is the optimal distribution.
    // Tested that it doesn't generate collisions for 0 <= zone < 10 and 0 <= idx < 10000.
    // Naive approach, which of course will collide.
    s := rand.NewSource(int64(idx))
    out := make([]uint32, c)
    for i := 0; i < c; i++ {
        out[i] = uint32(s.Int63() + int64(zone)) // offset each zone by 1
    }
    return out
}

Now, each ingester registers itself in the ring with its index, zone and number of tokens. All ingesters should have the same number of tokens (otherwise we would have to deal with a migration process, as what is optimal for 128 tokens may not be optimal for 512).

Each distributor reads the available ingesters and rebuilds the ring in-memory for the given set of ingesters.
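
For example, a sketch of that reconstruction, reusing the deterministic tokens() function above (ingesterDesc and rebuildRing are illustrative names):

// ingesterDesc is an illustrative stand-in for what an ingester would
// register in the ring: just its zone, index and token count.
type ingesterDesc struct {
    zone, idx, tokenCount int
}

// rebuildRing rebuilds the whole token -> ingester mapping in memory from
// the registered descriptors, without the tokens ever crossing the wire.
func rebuildRing(ingesters []ingesterDesc) map[uint32]ingesterDesc {
    ring := make(map[uint32]ingesterDesc, len(ingesters))
    for _, ing := range ingesters {
        for _, t := range tokens(ing.tokenCount, ing.idx, ing.zone) {
            ring[t] = ing
        }
    }
    return ring
}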

This way we remove all the randomness from this process, we make the messages lightweight and easy to understand, and the simulation will be much easier to follow.

@duricanikolic (Contributor)

I was thinking about generating the tokens for the first instance of each zone like this, and then adding the tokens of all consecutive instances in the middle between 2 existing tokens of the same zone:

import (
    "math"

    "golang.org/x/exp/slices"
)

func tokens(tokensPerInstanceCount, zonesCount int) [][]uint32 {
    tokensPerZone := make([][]uint32, 0, zonesCount)
    for z := 0; z < zonesCount; z++ {
        tokensPerZone = append(tokensPerZone, make([]uint32, 0, tokensPerInstanceCount))
        for t := 0; t < tokensPerInstanceCount; t++ {
            // Evenly space this zone's tokens over the ring, offset by the zone index z;
            // the -1 keeps the first token (z == 0, t == 0) within the uint32 range.
            token := uint32(math.Pow(2, 32)*(1.0-(float64(zonesCount*t)+float64(z))/float64(zonesCount*tokensPerInstanceCount)) - 1)
            tokensPerZone[z] = append(tokensPerZone[z], token)
        }
        slices.Sort(tokensPerZone[z])
    }
    return tokensPerZone
}

@duricanikolic (Contributor)

duricanikolic commented Jun 13, 2023

Introduction of the Spread Minimizing Approach

The Cassandra-like approach has been abandoned, and we have opted for a different one, called the spread minimizing approach. Assuming that each instance owns tokensPerInstance tokens and that there are zonesCount zones, given an instance id instanceID and the id zoneID of the zone this instance belongs to, we build the tokens for that instance assuming that the tokens of all other instances with ids lower than the given one are present, as follows (a sketch of the first case follows the list):

  • if instanceID == 0, we evenly distribute the tokensPerInstance tokens in the ring, such that for each token token, token%zonesCount == zoneID.
  • if instanceID != 0, we first calculate the optimal token ownership optimalInstanceOwnership of the instances in the ring as $\frac{2^{32}}{instanceID + 1}$, and then we try to assign that ownership to the new instance. This is done in tokensPerInstance iterations, one for each of the tokens that will be assigned to instanceID. At the beginning of the process, the ownership of instanceID is 0. At each iteration i, we calculate the ownership tokenOwnership of the i$^{th}$ token as the highest multiple of zonesCount less than or equal to $\frac{optimalInstanceOwnership - ownership}{tokensPerInstance - i}$. Then we determine the token with the highest ownership (say, covering the range [a, b]) of the instance with the highest ownership, and reduce both ownerships by tokenOwnership by placing a new token at position a + tokenOwnership.
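
A minimal sketch of the instanceID == 0 case (firstInstanceTokens is an illustrative name, not the dskit implementation): tokens are spaced by a multiple of zonesCount, so adding the zoneID offset preserves the token % zonesCount == zoneID invariant.

import "math"

// firstInstanceTokens evenly distributes tokensPerInstance tokens over the
// ring so that token % zonesCount == zoneID holds for every token.
func firstInstanceTokens(tokensPerInstance, zonesCount, zoneID int) []uint32 {
    // Distance between consecutive tokens, rounded down to a multiple of
    // zonesCount so that adding the zoneID offset preserves the invariant.
    tokenDistance := (uint64(math.MaxUint32) / uint64(tokensPerInstance) / uint64(zonesCount)) * uint64(zonesCount)
    tokens := make([]uint32, 0, tokensPerInstance)
    for t := 0; t < tokensPerInstance; t++ {
        tokens = append(tokens, uint32(uint64(t)*tokenDistance)+uint32(zoneID))
    }
    return tokens
}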

Some important properties of the spread minimizing approach are:

  • for each token of the tokens calculated for instance instanceID and zone zoneID, token % zonesCount == zoneID.
  • the approach is idempotent, i.e., if applied multiple times to the same input, it will always produce the same result.
  • it guarantees consistency, i.e., by adding/removing 1 instance per zone, the number of series that are moved to another instance is at most seriesCount/instancesCount, where seriesCount and instancesCount are the total number of series and the total number of instances after adding/removing respectively.
  • spreads in different zones are equal

@duricanikolic (Contributor)

Preliminary Analyses (dedicated cells) of the Spread Minimizing Approach

We have done some preliminary analyses of the spread, calculated as $\frac{max - min}{max}$, where $min$ and $max$ are the minimal and maximal ownership of instances in the ring. We have analyzed the following rings, whose instances own 512 tokens each:

  • the random token rings, i.e., rings generated by selecting 512 tokens randomly (rt approach)
  • the real rings, i.e., rings built by using the real tokens assigned to the ingesters on some selected prod cells (rr approach)
  • the rings built by using the spread minimizing approach (sm approach)

The analyses were done on some of our prod cells. The results can be seen in the following charts, which compare the spread of the 3 rings zone-by-zone, with shuffle sharding enabled (left) and shuffle sharding disabled (right).

Test 1

[image: tested-cell-01]

Test 2

[image: tested-cell-02]

Test 3

[image: tested-cell-03]

@duricanikolic (Contributor)

Shuffle sharding on a big prod cell

In the case of shared cells with shuffle sharding active, we don't expect huge improvements. Nevertheless, we have analyzed the token distribution on a big prod cell and performed some experiments (using the rt, rr and sm approaches introduced in the previous comment). In all experiments, the real shard sizes of the cell's tenants were used.

Spread comparison by changing the minimum shard size

Currently, the minimum shard size is 3. We have performed analyses that show how the spread changes when the minimum shard size is set to 3 (current default), 6, 9, 12, 15, 120, 126, 240, 480, 486, 492, 498 and 504 (which effectively means shuffle sharding is disabled).
In all the analyses the spread minimizing approach brings some improvement. As expected, the higher the minimum shard size, the lower the spread. Nevertheless, the spread cannot be drastically reduced.

[image: ShuffleShardByMinShardSizes]

Spread comparison by disabling shuffle sharding of big tenants

We have performed an additional test: we have compared the spreads obtained by disabling shuffle sharding for all tenants of the big cell having a max-global-series-per-user limit above a certain value (4M, 6M and 8M, before replication). The results are shown in the following chart. Note: MAX represents the current default, i.e., shuffle sharding always enabled.

[image: SpreadWithShardSize0WhenTimeseriesCountLimitReachedBeforeReplication]

@duricanikolic (Contributor)

Out-of-order instance registration

We have analyzed how the spread in the ring built by the spread minimizing approach changes when the instances are registered with the ring in a random, rather than ascending, order. We have calculated the worst spreads that we could get in rings with 3, 5, 10, 20, 30, 50, 100, 150 and 200 instances per zone, when we add 3, 5 or 7 instances per zone in a random order. The permutations of new ids that give the worst spread, both when we add and when we remove instances, were determined (see the results at the bottom of this comment). In some cases the spread is not as good as when the instances are registered in ascending order. Nevertheless, once all 3, 5 or 7 instances are added or removed, the spread goes back to the values that we have always seen (below 1%).

For example, in the case of adding 3 instances (with ids 3, 4, 5) to a zone where there are currently 3 instances (with ids 0, 1, 2), the permutation [4, 3, 5] gives the worst spread (29.51%). This spread is encountered when instance 4 is added, but once all 3 instances are added, the spread becomes low (0.195%) again.

Q: What does this high spread mean for us? Instance 4 will be in the ring without instances 3 and 5 only for a limited (hopefully very short) time, so the unbalanced distribution due to a random registration of instances is not permanent.
A by @pracucci: Although the unbalanced distribution is not permanent, the ingester will suffer such an imbalance for up to 3h, because of all the series written to instance 4 in the short period of time before the other ingesters are registered. These series stay in-memory until removed from the TSDB head, which can take up to 3h (the TSDB head is compacted every 2h, for samples with timestamps between -3h and -1h ago).
Q: How can we fix this?
A by @pracucci: Maybe we can force ingesters to register in the ring only after all previous ingesters have been registered?

[image: WorstSpreadComparison]

@duricanikolic (Contributor)

In this PR the SpreadMinimizingTokenGenerator, which follows the previously introduced approach, has been added to the dskit repository.
We wanted to check, for some of our prod cells, how many tokens already present in the existing rings would collide with the tokens generated by SpreadMinimizingTokenGenerator for the same configuration (number of tokens per instance, number of instances per zone and number of zones). This is something we will need to be aware of during a future migration from RandomTokenGenerator (currently in use) to SpreadMinimizingTokenGenerator. The following table summarizes the results; a sketch of the collision check follows the table. Note: the last column does NOT show percentage (%) but per-mille (‰).

cell #repeated tokens #total tokens ‰repeated tokens
cell 1 0 3072 0‰
cell 2 0 13824 0‰
cell 3 2 115200 0.01736111111‰
cell 4 0 9216 0‰
cell 5 1 53760 0.01860119048‰
cell 6 0 15360 0‰
cell 7 3 64512 0.04650297619‰
cell 8 7 165888 0.04219714506‰
cell 9 11 184320 0.05967881944‰
cell 10 0 23040 0‰
cell 11 4 133632 0.02993295019‰
cell 12 0 9216 0‰
cell 13 2 81408 0.02456761006‰
cell 14 6 122880 0.048828125‰
cell 15 0 12288 0‰
cell 16 12 258048 0.04650297619‰
cell 17 8 213504 0.03747002398‰
cell 18 3 82944 0.03616898148‰
cell 19 0 1536 0‰
cell 20 0 4608 0‰
cell 21 7 184320 0.03797743056‰
cell 22 0 56832 0‰
cell 23 1 16896 0.05918560606‰
cell 24 0 56832 0‰
cell 25 0 1536 0‰
cell 26 0 4608 0‰
cell 27 0 9216 0‰
cell 28 0 1536 0‰
cell 29 0 1536 0‰
cell 30 0 1536 0‰
cell 31 0 1536 0‰
cell 32 0 1536 0‰
cell 33 0 1536 0‰
cell 34 0 1536 0‰

[image: RepeatedTokensPerInstance]
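
A sketch of the collision check itself (repeatedTokens is a hypothetical helper, not dskit code):

// repeatedTokens counts how many generated tokens are already present
// in the existing ring.
func repeatedTokens(existing, generated []uint32) int {
    // Index the tokens already registered in the ring.
    seen := make(map[uint32]struct{}, len(existing))
    for _, t := range existing {
        seen[t] = struct{}{}
    }
    count := 0
    for _, t := range generated {
        if _, ok := seen[t]; ok {
            count++
        }
    }
    return count
}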

@duricanikolic (Contributor)

duricanikolic commented Jun 22, 2023

How can we migrate to the SpreadMinimizingTokenGenerator?

Proposal 1 (working)

A possible way to migrate ingesters from the default, random token generation strategy to the new, spread-minimizing token generation strategy is the following. The steps need to be done zone-by-zone. Suppose we are migrating zone-x; this is what needs to be done:

  1. Set the ingester PodDisruptionBudget to 0 max unavailable replicas in a PR.
  2. Exclude zone-x from distributors and rulers in a PR by setting ingester.ring.excluded-zones: zone-x. Before merging the PR, ensure there are no unhealthy (restarting) ingester pods.
  3. Shut down all the ingesters from zone-x by applying the shutdown-ingester script to all the ingesters from zone-x (this PR introduced the possibility to run shutdown-ingesters concurrently in order to speed up the shutdown process).
  4. (optional) Delete the tokens files (found at the path ingester.ring.tokens-file-path) from all ingesters in zone-x.
  5. Apply spread-minimizing configuration on ingesters in zone-x in a PR, by setting:
       ingester.ring.tokens-file-path: ''
       ingester.ring.token-generation-strategy: spread-minimizing
       ingester.ring.spread-minimizing-zones: zone-a,zone-b,zone-c
    
  6. After step 5 is deployed, restart all ingester pods belonging to zone-x.
      kubectl rollout restart statefulset ingester-zone-x -n <namespace>
    
  7. Revert step 2.
  8. Revert step 1.

Proposal 2 (not working because of automated downscaling)

Note: this proposal is possible only if there is no downscaling process (triggered by automated downscaling) going on.

Similarly to Proposal 1, the steps need to be done zone-by-zone. Suppose we are migrating zone-x; this is what needs to be done:

  1. Exclude zone-x from distributors and rulers in a PR by setting ingester.ring.excluded-zones: zone-x. In the same PR set ingester_autoscaling_min_time_between_zones_downscale to a small value (e.g., 1m).
  2. In a PR downscale the ingesters from zone-x to 0.
  3. (optional) Delete the tokens files (found at the path ingester.ring.tokens-file-path) from all ingesters in zone-x.
  4. Apply spread-minimizing configuration on ingesters in zone-x in a PR, by setting:
       ingester.ring.tokens-file-path: ''
       ingester.ring.token-generation-strategy: spread-minimizing
       ingester.ring.spread-minimizing-zones: zone-a,zone-b,zone-c
    
  5. In a PR, scale the ingesters from zone-x back to the number of replicas they had after step 1.
  6. Revert step 1.

@duricanikolic (Contributor)

@pracucci do you agree we should close this issue, since spread-minimizing tokens fix the problem?

@pracucci (Collaborator Author)

> @pracucci do you agree we should close this issue, since spread-minimizing tokens fix the problem?

I do!
