
Reactor stalls in leadership balancer with large node + partition counts #4172

Closed
travisdowns opened this issue Apr 3, 2022 · 3 comments · Fixed by #4218
Assignees
Labels
kind/bug Something isn't working

Comments

@travisdowns
Member

Version & Environment

Redpanda version: 21.11.10

What went wrong?

The reactor stalls for around 1 second or more on the controller when calculating leadership rebalances on a cluster with 72 nodes x 16 shards and 10,000s of partitions in one topic.

What should have happened instead?

No reactor stalls.

How to reproduce the issue?

You can create a cluster of the size described above, but an easier way is to modify the leader_balancer_test to use 72 nodes, 16 shards per node, 80 groups per shard and time find_movement when no movement occurs (e.g., when the starting state is balanced). It will take ~1 second.

Additional information

Here's a typical stack from a stall:


{redpanda} 0x30122ca: auto cluster::greedy_balanced_shards::error() const::'lambda'(auto, auto const&)::operator()<double, std::__1::pair<model::broker_shard const, absl::lts_20210324::btree_map<detail::base_named_type<long, raft::raft_group_id_type, std::__1::integral_constant<bool, true> >, std::__1::vector<model::broker_shard, std::__1::allocator<model::broker_shard> >, std::__1::less<detail::base_named_type<long, raft::raft_group_id_type, std::__1::integral_constant<bool, true> > >, std::__1::allocator<std::__1::pair<detail::base_named_type<long, raft::raft_group_id_type, std::__1::integral_constant<bool, true> > const, std::__1::vector<model::broker_shard, std::__1::allocator<model::broker_shard> > > > > > >(auto, auto const&) const at /var/lib/buildkite-agent/builds/buildkite-amd64-builders-i-0bdc88d0f74d549f3-1/redpanda/redpanda/vbuild/release/clang/../../../src/v/cluster/scheduling/leader_balancer_greedy.h:48
 (inlined by) auto std::__1::accumulate<absl::lts_20210324::container_internal::raw_hash_set<absl::lts_20210324::container_internal::NodeHashMapPolicy<model::broker_shard, absl::lts_20210324::btree_map<detail::base_named_type<long, raft::raft_group_id_type, std::__1::integral_constant<bool, true> >, std::__1::vector<model::broker_shard, std::__1::allocator<model::broker_shard> >, std::__1::less<detail::base_named_type<long, raft::raft_group_id_type, std::__1::integral_constant<bool, true> > >, std::__1::allocator<std::__1::pair<detail::base_named_type<long, raft::raft_group_id_type, std::__1::integral_constant<bool, true> > const, std::__1::vector<model::broker_shard, std::__1::allocator<model::broker_shard> > > > > >, absl::lts_20210324::hash_internal::Hash<model::broker_shard>, std::__1::equal_to<model::broker_shard>, std::__1::allocator<std::__1::pair<model::broker_shard const, absl::lts_20210324::btree_map<detail::base_named_type<long, raft::raft_group_id_type, std::__1::integral_constant<bool, true> >, std::__1::vector<model::broker_shard, std::__1::allocator<model::broker_shard> >, std::__1::less<detail::base_named_type<long, raft::raft_group_id_type, std::__1::integral_constant<bool, true> > >, std::__1::allocator<std::__1::pair<detail::base_named_type<long, raft::raft_group_id_type, std::__1::integral_constant<bool, true> > const, std::__1::vector<model::broker_shard, std::__1::allocator<model::broker_shard> > > > > > > >::const_iterator, double, cluster::greedy_balanced_shards::error() const::'lambda'(auto, auto const&)>(auto, auto, auto, cluster::greedy_balanced_shards::error() const::'lambda'(auto, auto const&)) at /vectorized/llvm/bin/../include/c++/v1/numeric:184
 (inlined by) cluster::greedy_balanced_shards::error() const at /var/lib/buildkite-agent/builds/buildkite-amd64-builders-i-0bdc88d0f74d549f3-1/redpanda/redpanda/vbuild/release/clang/../../../src/v/cluster/scheduling/leader_balancer_greedy.h:43
 (inlined by) cluster::greedy_balanced_shards::adjusted_error(model::broker_shard const&, model::broker_shard const&) const at /var/lib/buildkite-agent/builds/buildkite-amd64-builders-i-0bdc88d0f74d549f3-1/redpanda/redpanda/vbuild/release/clang/../../../src/v/cluster/scheduling/leader_balancer_greedy.h:60
 (inlined by) cluster::greedy_balanced_shards::find_movement(absl::lts_20210324::flat_hash_set<detail::base_named_type<long, raft::raft_group_id_type, std::__1::integral_constant<bool, true> >, absl::lts_20210324::hash_internal::Hash<detail::base_named_type<long, raft::raft_group_id_type, std::__1::integral_constant<bool, true> > >, std::__1::equal_to<detail::base_named_type<long, raft::raft_group_id_type, std::__1::integral_constant<bool, true> > >, std::__1::allocator<detail::base_named_type<long, raft::raft_group_id_type, std::__1::integral_constant<bool, true> > > > const&) const at /var/lib/buildkite-agent/builds/buildkite-amd64-builders-i-0bdc88d0f74d549f3-1/redpanda/redpanda/vbuild/release/clang/../../../src/v/cluster/scheduling/leader_balancer_greedy.h:122
 (inlined by) cluster::leader_balancer::balance() at /var/lib/buildkite-agent/builds/buildkite-amd64-builders-i-0bdc88d0f74d549f3-1/redpanda/redpanda/vbuild/release/clang/../../../src/v/cluster/scheduling/leader_balancer.cc:230
{redpanda} 0x301a1ce: cluster::leader_balancer::trigger_balance()::$_0::operator()() const::'lambda'()::operator()() const at /var/lib/buildkite-agent/builds/buildkite-amd64-builders-i-0bdc88d0f74d549f3-1/redpanda/redpanda/vbuild/release/clang/../../../src/v/cluster/scheduling/leader_balancer.cc:139
 (inlined by) seastar::future<seastar::bool_class<seastar::stop_iteration_tag> > seastar::futurize<seastar::future<seastar::bool_class<seastar::stop_iteration_tag> > >::invoke<cluster::leader_balancer::trigger_balance()::$_0::operator()() const::'lambda'()&>(cluster::leader_balancer::trigger_balance()::$_0::operator()() const::'lambda'()&) at /vectorized/include/seastar/core/future.hh:2135
 (inlined by) auto seastar::futurize_invoke<cluster::leader_balancer::trigger_balance()::$_0::operator()() const::'lambda'()&>(cluster::leader_balancer::trigger_balance()::$_0::operator()() const::'lambda'()&) at /vectorized/include/seastar/core/future.hh:2166
 (inlined by) seastar::internal::repeater<cluster::leader_balancer::trigger_balance()::$_0::operator()() const::'lambda'()>::run_and_dispose() at /vectorized/include/seastar/core/loop.hh:71
{redpanda} 0x38c4024: seastar::reactor::run_tasks(seastar::reactor::task_queue&) at /v/build/v_deps_build/seastar-prefix/src/seastar/src/core/reactor.cc:2230

This one is in the pow() in the error function (which calculates how much better a given movement is), but other top-of-stacks are possible too, like this one:

 (inlined by) absl::lts_20210324::container_internal::raw_hash_set<absl::lts_20210324::container_internal::FlatHashSetPolicy<detail::base_named_type<long, raft::raft_group_id_type, std::__1::integral_constant<bool, true> > >, absl::lts_20210324::hash_internal::Hash<detail::base_named_type<long, raft::raft_group_id_type, std::__1::integral_constant<bool, true> > >, std::__1::equal_to<detail::base_named_type<long, raft::raft_group_id_type, std::__1::integral_constant<bool, true> > >, std::__1::allocator<detail::base_named_type<long, raft::raft_group_id_type, std::__1::integral_constant<bool, true> > > >::iterator absl::lts_20210324::container_internal::raw_hash_set<absl::lts_20210324::container_internal::FlatHashSetPolicy<detail::base_named_type<long, raft::raft_group_id_type, std::__1::integral_constant<bool, true> > >, absl::lts_20210324::hash_internal::Hash<detail::base_named_type<long, raft::raft_group_id_type, std::__1::integral_constant<bool, true> > >, std::__1::equal_to<detail::base_named_type<long, raft::raft_group_id_type, std::__1::integral_constant<bool, true> > >, std::__1::allocator<detail::base_named_type<long, raft::raft_group_id_type, std::__1::integral_constant<bool, true> > > >::find<detail::base_named_type<long, raft::raft_group_id_type, std::__1::integral_constant<bool, true> > >(detail::base_named_type<long, raft::raft_group_id_type, std::__1::integral_constant<bool, true> > const&, unsigned long) at /vectorized/include/absl/container/internal/raw_hash_set.h:1373
 (inlined by) absl::lts_20210324::container_internal::raw_hash_set<absl::lts_20210324::container_internal::FlatHashSetPolicy<detail::base_named_type<long, raft::raft_group_id_type, std::__1::integral_constant<bool, true> > >, absl::lts_20210324::hash_internal::Hash<detail::base_named_type<long, raft::raft_group_id_type, std::__1::integral_constant<bool, true> > >, std::__1::equal_to<detail::base_named_type<long, raft::raft_group_id_type, std::__1::integral_constant<bool, true> > >, std::__1::allocator<detail::base_named_type<long, raft::raft_group_id_type, std::__1::integral_constant<bool, true> > > >::const_iterator absl::lts_20210324::container_internal::raw_hash_set<absl::lts_20210324::container_internal::FlatHashSetPolicy<detail::base_named_type<long, raft::raft_group_id_type, std::__1::integral_constant<bool, true> > >, absl::lts_20210324::hash_internal::Hash<detail::base_named_type<long, raft::raft_group_id_type, std::__1::integral_constant<bool, true> > >, std::__1::equal_to<detail::base_named_type<long, raft::raft_group_id_type, std::__1::integral_constant<bool, true> > >, std::__1::allocator<detail::base_named_type<long, raft::raft_group_id_type, std::__1::integral_constant<bool, true> > > >::find<detail::base_named_type<long, raft::raft_group_id_type, std::__1::integral_constant<bool, true> > >(detail::base_named_type<long, raft::raft_group_id_type, std::__1::integral_constant<bool, true> > const&, unsigned long) const at /vectorized/include/absl/container/internal/raw_hash_set.h:1391
 (inlined by) absl::lts_20210324::container_internal::raw_hash_set<absl::lts_20210324::container_internal::FlatHashSetPolicy<detail::base_named_type<long, raft::raft_group_id_type, std::__1::integral_constant<bool, true> > >, absl::lts_20210324::hash_internal::Hash<detail::base_named_type<long, raft::raft_group_id_type, std::__1::integral_constant<bool, true> > >, std::__1::equal_to<detail::base_named_type<long, raft::raft_group_id_type, std::__1::integral_constant<bool, true> > >, std::__1::allocator<detail::base_named_type<long, raft::raft_group_id_type, std::__1::integral_constant<bool, true> > > >::const_iterator absl::lts_20210324::container_internal::raw_hash_set<absl::lts_20210324::container_internal::FlatHashSetPolicy<detail::base_named_type<long, raft::raft_group_id_type, std::__1::integral_constant<bool, true> > >, absl::lts_20210324::hash_internal::Hash<detail::base_named_type<long, raft::raft_group_id_type, std::__1::integral_constant<bool, true> > >, std::__1::equal_to<detail::base_named_type<long, raft::raft_group_id_type, std::__1::integral_constant<bool, true> > >, std::__1::allocator<detail::base_named_type<long, raft::raft_group_id_type, std::__1::integral_constant<bool, true> > > >::find<detail::base_named_type<long, raft::raft_group_id_type, std::__1::integral_constant<bool, true> > >(detail::base_named_type<long, raft::raft_group_id_type, std::__1::integral_constant<bool, true> > const&) const at /vectorized/include/absl/container/internal/raw_hash_set.h:1395
 (inlined by) bool absl::lts_20210324::container_internal::raw_hash_set<absl::lts_20210324::container_internal::FlatHashSetPolicy<detail::base_named_type<long, raft::raft_group_id_type, std::__1::integral_constant<bool, true> > >, absl::lts_20210324::hash_internal::Hash<detail::base_named_type<long, raft::raft_group_id_type, std::__1::integral_constant<bool, true> > >, std::__1::equal_to<detail::base_named_type<long, raft::raft_group_id_type, std::__1::integral_constant<bool, true> > >, std::__1::allocator<detail::base_named_type<long, raft::raft_group_id_type, std::__1::integral_constant<bool, true> > > >::contains<detail::base_named_type<long, raft::raft_group_id_type, std::__1::integral_constant<bool, true> > >(detail::base_named_type<long, raft::raft_group_id_type, std::__1::integral_constant<bool, true> > const&) const at /vectorized/include/absl/container/internal/raw_hash_set.h:1400
 (inlined by) cluster::greedy_balanced_shards::find_movement(absl::lts_20210324::flat_hash_set<detail::base_named_type<long, raft::raft_group_id_type, std::__1::integral_constant<bool, true> >, absl::lts_20210324::hash_internal::Hash<detail::base_named_type<long, raft::raft_group_id_type, std::__1::integral_constant<bool, true> > >, std::__1::equal_to<detail::base_named_type<long, raft::raft_group_id_type, std::__1::integral_constant<bool, true> > >, std::__1::allocator<detail::base_named_type<long, raft::raft_group_id_type, std::__1::integral_constant<bool, true> > > > const&) const at /var/lib/buildkite-agent/builds/buildkite-amd64-builders-i-0bdc88d0f74d549f3-1/redpanda/redpanda/vbuild/release/clang/../../../src/v/cluster/scheduling/leader_balancer_greedy.h:106

This one is in the map find() call, which checks, for the given {from, to} pair and the given group, that the to node is in the group's replica set.
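
For context, the first stack suggests that error() is an accumulate over the whole shard index with a pow() per shard. A plausible shape, as a guess for illustration rather than the actual Redpanda code:

```cpp
// Guessed shape of error() based on the stack trace above (std::accumulate
// over the shard -> groups index, with a pow() per shard); not the actual
// leader_balancer_greedy.h code.
#include <cmath>
#include <numeric>
#include <unordered_map>
#include <vector>

using shard_id = int;
// Stand-in for the real index: shard -> groups currently led by that shard.
using index_type = std::unordered_map<shard_id, std::vector<int>>;

double error(const index_type& index, double ideal_leaders_per_shard) {
    // Sum of squared deviations from the ideal per-shard leader count.
    // Every call walks the entire shard index, so calling this from the
    // innermost loop of find_movement is expensive.
    return std::accumulate(
      index.begin(), index.end(), 0.0, [&](double acc, const auto& shard) {
          double leaders = static_cast<double>(shard.second.size());
          return acc + std::pow(leaders - ideal_leaders_per_shard, 2);
      });
}
```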

The underlying problem is that the rebalance check is quadratic in the number of shards (and also linear in the number of groups per shard): we do two nested loops (the from and to loops) over all the shards, and inside that we effectively loop over each group. So for 72 x 16 = 1,152 shards that's over 1,000,000 from/to pairs, and with another factor of 80 (groups per shard) for the innermost work we approach 100,000,000 total iterations, so 1 second is not surprising (about 10 ns per innermost iteration). Inside the error() function, called from the innermost loop, we again iterate over all shards, but I don't think this makes it O(n^3) overall, because it is gated by the check that from -> to is valid for the group, so we can enter it at most once for every partition replica in the system (which is still a lot: 300,000 times for 100k partitions and r=3).
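
To make the arithmetic concrete, here is a standalone model of the loop structure (not the actual code) that just counts the innermost iterations for 72 x 16 shards and 80 groups per shard:

```cpp
// Standalone model of the search's loop structure, not the actual
// leader_balancer_greedy.h code: two nested loops over all shards, with
// per-group work inside, to show where the ~100M iterations come from.
#include <cstddef>
#include <cstdint>
#include <iostream>

int main() {
    constexpr std::size_t nodes = 72, shards_per_node = 16, groups_per_shard = 80;
    constexpr std::size_t shards = nodes * shards_per_node; // 1,152

    std::uint64_t innermost = 0;
    for (std::size_t from = 0; from < shards; ++from) {     // the "from" loop
        for (std::size_t to = 0; to < shards; ++to) {       // the "to" loop
            if (from == to) {
                continue;
            }
            for (std::size_t g = 0; g < groups_per_shard; ++g) {
                // Stand-in for the real innermost work: replica-set
                // lookups, contains() checks, error()/adjusted_error().
                ++innermost;
            }
        }
    }

    // 1,152 * 1,151 * 80 ~= 106 million innermost iterations; at ~10 ns of
    // real work each (hash lookups, error()), that is about 1 second.
    std::cout << innermost << " innermost iterations\n";
}
```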

@travisdowns travisdowns added the kind/bug Something isn't working label Apr 3, 2022
@travisdowns travisdowns self-assigned this Apr 3, 2022
@emaxerrno
Contributor

AMAZING bug report.

@dotnwat
Member

dotnwat commented Apr 4, 2022

Thanks @travisdowns, this is not surprising, but I expected the partition count to be much larger when we encountered this. I think there are two avenues here:

  1. Use a more efficient algorithm
  2. Futurize the algorithm and run the background fiber at the lowest priority

If someone has the bandwidth to replace the algorithm (and there might be one available via the Fenzo library port which we use in the rebalance) then that is great and we should do it.

But as long as the memory utilization isn't high, 2 will certainly be the quickest path to making progress. We're likely to replace the balancing algorithm at some point anyway, either due to runtime complexity or finding that it provides poor balancing in some scenarios, whichever comes first.

@travisdowns
Member Author

travisdowns commented Apr 4, 2022

@dotnwat - yeah I agree about the two options.

I am replacing the algorithm: it should make the same decisions as before but be much faster (about 200x faster at the size mentioned in the issue). That is, the quadratic behavior isn't inherent if we reorganize the loops.

I thought this would be less work than (2), but that's mostly due to my lack of experience with (2). Just to be clear about (2): are you suggesting using the seastar::thread approach, so that the algorithm only has to pause occasionally (but keeps its stack), or the full-monty continuation-based approach where we can yield back down to the reactor at some points in the loop?
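
For what it's worth, here is a minimal sketch of what the seastar::thread flavor of (2) could look like; balance_step() is a hypothetical stand-in for one bounded chunk of the scan, not an existing function:

```cpp
// Hypothetical sketch of option (2) using a stackful seastar::thread fiber:
// the loop yields back to the reactor whenever other tasks are waiting, so
// no single task runs for a whole second. balance_step() is a stand-in for
// one bounded slice of the find_movement work.
#include <seastar/core/future.hh>
#include <seastar/core/thread.hh>

namespace demo {

// Hypothetical: perform one bounded chunk of work, return false when done.
inline bool balance_step() {
    static int remaining = 1000;
    return --remaining > 0;
}

inline seastar::future<> balance_in_thread() {
    return seastar::async([] {
        while (balance_step()) {
            // Yield to the reactor if this fiber has been running long
            // enough that other tasks are queued behind it.
            if (seastar::thread::should_yield()) {
                seastar::thread::yield();
            }
        }
    });
}

} // namespace demo
```

Running that fiber under a lower-priority scheduling group (e.g. via seastar::with_scheduling_group) would cover the "lowest priority" part of option 2.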

I think as long as the algorithm is something like O(s * log(s) + p), where s is the total number of shards and p is the number of partitions (or the total number of partition replicas, but we can assume these have a constant relationship, e.g., for r=3), we are probably fine for 100k partitions and 1000s of cores.

WDYT?

travisdowns added a commit to travisdowns/redpanda that referenced this issue Apr 6, 2022
Add additional test cases to the existing leader balancer test to cover
some edge cases as a precursor to changing the balancer implementation.

Issue redpanda-data#4172.
travisdowns added a commit to travisdowns/redpanda that referenced this issue Apr 6, 2022
Tests the raw CPU cost of a large balance operation, of about 100k
partitions and 1152 cores (shards).

This is a precursor to modifying the greedy leader balancer to improve
performance.

Issue redpanda-data#4172.
travisdowns added a commit to travisdowns/redpanda that referenced this issue Apr 6, 2022
In the existing leader balancer code, we loop over all shards twice in
a nested loop, which is quadratic in the number of shards.

For a large number of shards (say > 1000), this can stall the reactor for
more than a second, which at best may cause reduced performance for
other requests hitting this shard and at worst may cause timeouts which
destabilize the cluster.

This change switches to an algorithm that is linear in the number of
shards but should make the same decisions as the existing one:

We loop over the shards from highest to lowest load, and for each
raft group (partition) led by a given shard we check whether we can move
the leader to a lower-load shard. We track the best candidate seen so
far, and at the end of handling each shard, if any candidate has been
found which improves the error, we stop the search and return it.

This does not always find the *best* movement (e.g., the next-highest
loaded shard may have a better movement, since it leads a group which
has a replica on a less-loaded shard, with the difference enough to
outweigh the first shard's higher load), but it should come close in
most practical situations, and it is the same logic as we are currently
using.

Performance results:

Using 72 nodes x 16 shards x 80 groups per shard:

version     test              median
original    bench_movement    1.663s
original    bench_all         1.600s
updated     bench_movement    3.393ms
updated     bench_all         6.653ms

Here "orignal" is before this change and "updated" is with this change.

The bench_movement test covers the find_movement function, which had the
quadratic behavior, while "all" also includes the construction of the
balancer object, which was insignificant before but now takes
about half the time.

Speedup is 490x for these inputs for find_movement, or 250x if you
consider initial construction as well.

I've checked by varying the parameters that performance appears linear
in the number of shards and linear in the number of partitions
per shard. So in the usual case of scaling out a deployment by adding
more shards and keeping the partitions per shard roughly constant,
this balancing cost should increase only linearly, giving us a
lot of headroom before performance again becomes a problem.

Fixes redpanda-data#4172.
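
For reference, here is a self-contained sketch of the loop structure described in the commit message above. The types and the load-gap "error" model are stand-ins for illustration, not the actual leader_balancer_greedy implementation (which uses error()/adjusted_error() and the set of raft groups passed to find_movement, as seen in the stack traces above):

```cpp
// Sketch of the single-pass search described above: sort shards by load
// (descending), then for each shard consider moving each group it leads to
// one of that group's lower-load replica shards, keeping the best candidate,
// and stop at the first shard that yields any improvement.
#include <algorithm>
#include <cstddef>
#include <optional>
#include <unordered_map>
#include <vector>

struct group_info {
    int id;
    std::vector<std::size_t> replica_shards; // shards holding a replica of this group
};

struct shard_info {
    std::size_t shard;
    double load;                  // e.g. number of leaders currently on this shard
    std::vector<group_info> led;  // groups whose leader is currently on this shard
};

struct movement {
    int group;
    std::size_t from;
    std::size_t to;
    double improvement;
};

std::optional<movement> find_movement(std::vector<shard_info> shards) {
    // O(s) lookup table: shard id -> current load.
    std::unordered_map<std::size_t, double> load;
    for (const auto& s : shards) {
        load.emplace(s.shard, s.load);
    }

    // Highest-load shards first: O(s log s).
    std::sort(shards.begin(), shards.end(), [](const auto& a, const auto& b) {
        return a.load > b.load;
    });

    for (const auto& from : shards) {
        std::optional<movement> best;
        for (const auto& g : from.led) {
            for (std::size_t to : g.replica_shards) {
                // Simplified "error" model: the improvement is the load gap
                // between donor and recipient shards.
                double gap = from.load - load.at(to);
                if (gap > 0 && (!best || gap > best->improvement)) {
                    best = movement{g.id, from.shard, to, gap};
                }
            }
        }
        // Stop as soon as some shard yields an improving movement.
        if (best) {
            return best;
        }
    }
    return std::nullopt; // already balanced: nothing to move
}
```

With the sort up front and a single pass over the groups each shard leads, the worst case (an already balanced cluster, where no improving move exists) is roughly O(s log s + p) rather than O(s^2 * g), matching the complexity target discussed in the comments above.
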
This issue was closed.