Reactor stalls in leadership balancer with large node + partition counts #4172
Comments
AMAZING bug report.
Thanks @travisdowns, this is not surprising, but I expected the partition count to be much larger when we encountered this. I think there are two avenues here:
If someone has the bandwidth to replace the algorithm (and there might be one available via the Fenzo library port which we use in the rebalance) then that is great and we should do it. But as long as the memory utilization isn't high, (2) will certainly be the quickest path to making progress. We're likely to replace the balancing algorithm at some point, either due to runtime complexity or finding that it provides poor balancing in some scenarios, whichever comes first.
@dotnwat - yeah I agree about the two options. I am replacing the algorithm: it should make the same decisions as before but be much faster (about 200x faster at the size mentioned in the question). That is, the quadratic behavior isn't inherent if we reorganize the loops. I thought this would be less work than (2), but that's mostly due to my lack of experience with (2). Just to be clear about (2): are you suggesting using the …? I think as long as the algorithm is something like … WDYT?
Add additional test cases to the existing leader balancer test to cover some edge cases as a precursor to changing the balancer implementation. Issue redpanda-data#4172.
Tests the raw CPU cost of a large balance operation, of about 100k groups and 1152 cores (shards). This is a precursor to modifying the greedy leader balancer to improve performance. Issue redpanda-data#4172.
In the existing leader balancer code, we loop over all shards twice in a nested loop, which is quadratic in the number of shards. For a large number of shards (say > 1000), this can stall the reactor for more than a second, which at best may cause reduced performance for other requests hitting this shard and at worst may cause timeouts which destabilize the cluster.

This change switches to an algorithm that is linear in the number of shards but should make the same decisions as the existing one: we loop over the shards from highest to lowest load, and for each raft group (partition) led by the current shard we check whether we can move the leader to a lower-load shard. We track the best candidate seen so far and, at the end of handling each shard, if any candidate has been found which improves the error, we stop the search and return it. This does not always find the *best* movement (e.g., the next-highest-loaded shard may have a better movement because it leads a group with a replica on a less-loaded shard, with the difference large enough to outweigh its lower load relative to the highest-loaded shard), but it should come close in most practical situations, and it is the same logic as we are currently using.

Performance results, using 72 nodes x 16 shards x 80 groups per shard:

| version  | test           | median  |
|----------|----------------|---------|
| original | bench_movement | 1.663s  |
| original | bench_all      | 1.600s  |
| updated  | bench_movement | 3.393ms |
| updated  | bench_all      | 6.653ms |

Here "original" is before this change and "updated" is with it. The `bench_movement` test covers the `find_movement` function, which had the quadratic behavior, while `bench_all` also includes construction of the balancer object, which was insignificant before but now takes about half the time. Speedup is 490x for these inputs for `find_movement`, or 250x if you count initial construction as well.

I've checked by varying the parameters that performance appears linear in the number of shards and linear in the number of partitions per shard. So in the usual case of scaling out a deployment by adding more shards while keeping the partitions per shard roughly constant, this balancing cost should increase only linearly, giving us a lot of headroom before performance again becomes a problem.

Fixes redpanda-data#4172.
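For illustration, here is a minimal C++20 sketch of the linear pass described in that commit message. All types and helpers (`shard`, `movement`, `best_replica_target`, `error_delta`) are hypothetical simplifications for exposition, not Redpanda's actual `leader_balancer` types:

```cpp
#include <algorithm>
#include <optional>
#include <vector>

// Hypothetical simplified model of balancer state (not the real Redpanda types).
struct shard {
    int id;
    double load;                 // e.g., number of leaders hosted on this shard
    std::vector<int> led_groups; // raft groups whose leader is on this shard
};

struct movement {
    int group;
    int from;
    int to;
    double improvement; // reduction in the balance error
};

// One linear pass over the shards, highest load first. For each group led by
// the current shard, consider moving its leader to a less-loaded shard holding
// a replica of that group, tracking the best candidate seen so far. Stop at
// the first shard that yields any error-improving movement.
std::optional<movement> find_movement(
  std::vector<shard>& shards,
  // hypothetical helper: least-loaded shard holding a replica of group g,
  // other than the current leader; nullopt if none is less loaded
  auto&& best_replica_target,
  // hypothetical helper: error reduction from moving g's leader from -> to
  auto&& error_delta) {
    std::sort(shards.begin(), shards.end(), [](const shard& a, const shard& b) {
        return a.load > b.load;
    });
    for (const auto& from : shards) {
        std::optional<movement> best;
        for (int g : from.led_groups) {
            if (auto to = best_replica_target(g, from.id)) {
                double d = error_delta(g, from.id, *to);
                if (d > 0 && (!best || d > best->improvement)) {
                    best = movement{g, from.id, *to, d};
                }
            }
        }
        if (best) {
            return best; // same greedy decision order as the quadratic version
        }
    }
    return std::nullopt;
}
```

The key property is that each (shard, group) pair is visited once, so the work is proportional to the total number of groups rather than to shards × shards × groups as in the nested-loop version.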
Version & Environment
Redpanda version: 21.11.10
What went wrong?
Reactor is stalling for around 1 second or more on the controller when calculating leadership rebalances on a cluster with 72 nodes x 16 shards and tens of thousands of partitions in one topic.
What should have happened instead?
No reactor stalls.
How to reproduce the issue?
You can create a cluster of the size described above, but an easier way is to modify the `leader_balancer_test` to use 72 nodes, 16 shards per node, and 80 groups per shard, and time `find_movement` when no movement occurs (e.g., when the starting state is balanced). It will take ~1 second; a rough micro-demo of the loop cost follows below.
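If you don't want to modify the test itself, a self-contained micro-demo of the loop shape gives a feel for the cost at this scale. This is not the real balancer code; the per-iteration arithmetic is just a stand-in for the `error()`/`find()` work:

```cpp
#include <chrono>
#include <cstdio>

int main() {
    constexpr long shards = 72L * 16; // 1152 shards total
    constexpr long groups = 80;       // groups led per shard
    volatile double sink = 0;         // prevent the loops from being optimized away

    auto start = std::chrono::steady_clock::now();
    // Same iteration structure as the quadratic find_movement: a from/to loop
    // over all shard pairs, with an inner loop over the groups of each shard.
    for (long from = 0; from < shards; ++from) {
        for (long to = 0; to < shards; ++to) {
            for (long g = 0; g < groups; ++g) {
                // stand-in for the per-group error()/find() work
                sink = sink + from * 1e-9 + to * 1e-9 + (double)g;
            }
        }
    }
    auto elapsed = std::chrono::steady_clock::now() - start;
    auto ms = std::chrono::duration_cast<std::chrono::milliseconds>(elapsed).count();
    std::printf("%ld iterations took %lld ms\n", shards * shards * groups, (long long)ms);
    return 0;
}
```

The toy per-iteration work here is cheaper than the real `error()`/`find()` calls, so expect this to run somewhat faster than the observed stall; the point is the ~10^8 iteration count, which at roughly 10 ns per iteration lands around a second.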
Additional information

Here's a typical stack from a stall. This one is in the `pow()` call in the error function (which calculates how much better a given movement is), but other top-of-stacks are possible too, such as one in the map `find()` call which checks, for the given {from, to} pair and the given group, that the `to` node is in the group's replica set.

The underlying problem is that the rebalance check is quadratic in the number of shards (and also linear in the number of groups per shard): we do two nested loops (the from and to loops) over all the shards, and inside that we effectively loop over each group. So for 72 x 16 shards that's over 1,000,000 from/to checks, plus another factor of 80 (groups per shard) for the innermost work, approaching 100,000,000 total iterations, so 1 second is not surprising (~10 ns per innermost iteration).

Inside the `error()` function, called from the innermost loop, we again iterate over all shards, but I don't think this makes the whole thing O(n^3), because this path is gated by the check that from -> to is valid for the group, so we can enter it at most once for every partition replica in the system (which is still a lot: 300,000 times for 100k partitions at r=3).
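Spelling out the arithmetic above (the 10 ns figure is the rough per-iteration estimate implied by the ~1 s stall):

```math
\begin{aligned}
(72 \times 16)^2 \times 80 &\approx 1.06 \times 10^{8} \text{ innermost iterations} \;\Rightarrow\; 1.06 \times 10^{8} \times 10\,\text{ns} \approx 1.06\,\text{s} \\
10^{5} \text{ partitions} \times 3 \text{ replicas} &= 3 \times 10^{5} \text{ possible entries into } \text{error()}
\end{aligned}
```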