Reactor stalls in leadership balancer with large node + partition counts #4172
Comments
AMAZING bug report.
Thanks @travisdowns, this is not surprising, but I expected the partition count to be much larger when we encountered this. I think there are two avenues here:
If someone has the bandwidth to replace the algorithm (and there might be one available via the Fenzo library port which we use in the rebalance) then that is great and we should do it. But as long as the memory utilization isn't high, (2) will certainly be the quickest path to making progress. We're likely to replace the balancing algorithm at some point, either due to runtime complexity or finding that it provides poor balancing in some scenarios, whichever comes first.
@dotnwat - yeah I agree about the two options. I am replacing the algorithm: it should make the same decisions as before but be much faster (about 200x faster at the size mentioned in the question). That is, the quadratic behavior isn't inherent if we reorganize the loops. I thought this would be less work than (2), but that's mostly due to my lack of experience with (2). Just to be clear about (2): are you suggesting using the …? I think as long as the algorithm is something like … WDYT?
Add additional test cases to the existing leader balancer test to cover some edge cases as a precursor to changing the balancer implementation. Issue redpanda-data#4172.
Tests the raw CPU cost of a large balance operation, of about 100k groups and 1152 cores (shards). This is a precursor to modifying the greedy leader balancer to improve performance. Issue redpanda-data#4172.
In the existing leader balancer code, we loop over all shards twice in a nested loop, which is quadratic in the number of shards. For a large number of shards (say > 1000), this can stall the reactor for more than a second, which at best may cause reduced performance for other requests hitting this shard and at worst may cause timeouts which destabilize the cluster.

This change switches to an algorithm that is linear in the number of shards but should make the same decisions as the existing one: we loop over the shards from highest to lowest load, and for each raft group (partition) led by the current shard we check whether we can move the leader to a lower-load shard. We track the best candidate seen so far and, at the end of handling each shard, if any candidate has been found which improves the error, we stop the search and return it. This does not always find the *best* movement (e.g., the next-highest-loaded shard may have a better movement because it leads a group with a replica on a less-loaded shard, with the difference large enough to outweigh its lower load relative to the highest-loaded shard), but it should come close in most practical situations, and it is the same logic as we are currently using.

Performance results, using 72 nodes x 16 shards x 80 groups per shard:

| version  | test           | median  |
|----------|----------------|---------|
| original | bench_movement | 1.663s  |
| original | bench_all      | 1.600s  |
| updated  | bench_movement | 3.393ms |
| updated  | bench_all      | 6.653ms |

Here "original" is before this change and "updated" is with it. The `bench_movement` test covers the `find_movement` function, which had the quadratic behavior, while `bench_all` also includes construction of the balancer object, which was insignificant before but now takes about half the time. Speedup is 490x for these inputs for `find_movement`, or 250x if you count initial construction as well.

I've checked by varying the parameters that performance appears linear in the number of shards and linear in the number of partitions per shard. So in the usual case of scaling out a deployment by adding more shards while keeping the partitions per shard roughly constant, this balancing cost should increase only linearly, giving us a lot of headroom before performance again becomes a problem.

Fixes redpanda-data#4172.
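For illustration, here is a minimal C++20 sketch of the linear pass described in that commit message. All types and helpers (`shard`, `movement`, `best_replica_target`, `error_delta`) are hypothetical simplifications for exposition, not Redpanda's actual `leader_balancer` types:

```cpp
#include <algorithm>
#include <optional>
#include <vector>

// Hypothetical simplified model of balancer state (not the real Redpanda types).
struct shard {
    int id;
    double load;                 // e.g., number of leaders hosted on this shard
    std::vector<int> led_groups; // raft groups whose leader is on this shard
};

struct movement {
    int group;
    int from;
    int to;
    double improvement; // reduction in the balance error
};

// One linear pass over the shards, highest load first. For each group led by
// the current shard, consider moving its leader to a less-loaded shard holding
// a replica of that group, tracking the best candidate seen so far. Stop at
// the first shard that yields any error-improving movement.
std::optional<movement> find_movement(
  std::vector<shard>& shards,
  // hypothetical helper: least-loaded shard holding a replica of group g,
  // other than the current leader; nullopt if none is less loaded
  auto&& best_replica_target,
  // hypothetical helper: error reduction from moving g's leader from -> to
  auto&& error_delta) {
    std::sort(shards.begin(), shards.end(), [](const shard& a, const shard& b) {
        return a.load > b.load;
    });
    for (const auto& from : shards) {
        std::optional<movement> best;
        for (int g : from.led_groups) {
            if (auto to = best_replica_target(g, from.id)) {
                double d = error_delta(g, from.id, *to);
                if (d > 0 && (!best || d > best->improvement)) {
                    best = movement{g, from.id, *to, d};
                }
            }
        }
        if (best) {
            return best; // same greedy decision order as the quadratic version
        }
    }
    return std::nullopt;
}
```

The key property is that each (shard, group) pair is visited once, so the work is proportional to the total number of groups rather than to shards × shards × groups as in the nested-loop version.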
Version & Environment
Redpanda version: 21.11.10
What went wrong?
Reactor is stalling for around 1 second or more on the controller when calculating leadership rebalances on a cluster with 72 nodes x 16 shards and tens of thousands of partitions in one topic.
What should have happened instead?
No reactor stalls.
How to reproduce the issue?
You can create a cluster of the size described above, but an easier way is to modify the `leader_balancer_test` to use 72 nodes, 16 shards per node, and 80 groups per shard, and time `find_movement` when no movement occurs (e.g., when the starting state is balanced). It will take ~1 second; a rough micro-demo of the loop cost follows below.
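If you don't want to modify the test itself, a self-contained micro-demo of the loop shape gives a feel for the cost at this scale. This is not the real balancer code; the per-iteration arithmetic is just a stand-in for the `error()`/`find()` work:

```cpp
#include <chrono>
#include <cstdio>

int main() {
    constexpr long shards = 72L * 16; // 1152 shards total
    constexpr long groups = 80;       // groups led per shard
    volatile double sink = 0;         // prevent the loops from being optimized away

    auto start = std::chrono::steady_clock::now();
    // Same iteration structure as the quadratic find_movement: a from/to loop
    // over all shard pairs, with an inner loop over the groups of each shard.
    for (long from = 0; from < shards; ++from) {
        for (long to = 0; to < shards; ++to) {
            for (long g = 0; g < groups; ++g) {
                // stand-in for the per-group error()/find() work
                sink = sink + from * 1e-9 + to * 1e-9 + (double)g;
            }
        }
    }
    auto elapsed = std::chrono::steady_clock::now() - start;
    auto ms = std::chrono::duration_cast<std::chrono::milliseconds>(elapsed).count();
    std::printf("%ld iterations took %lld ms\n", shards * shards * groups, (long long)ms);
    return 0;
}
```

The toy per-iteration work here is cheaper than the real `error()`/`find()` calls, so expect this to run somewhat faster than the observed stall; the point is the ~10^8 iteration count, which at roughly 10 ns per iteration lands around a second.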
Additional information

Here's a typical stack from a stall. This one is in the `pow()` call in the error function (which calculates how much better a given movement is), but other top-of-stacks are possible too, such as one in the map `find()` call which checks, for the given {from, to} pair and the given group, that the `to` node is in the group's replica set.

The underlying problem is that the rebalance check is quadratic in the number of shards (and also linear in the number of groups per shard): we do two nested loops (the from and to loops) over all the shards, and inside that we effectively loop over each group. So for 72 x 16 shards that's over 1,000,000 from/to checks, plus another factor of 80 (groups per shard) for the innermost work, approaching 100,000,000 total iterations, so 1 second is not surprising (~10 ns per innermost iteration).

Inside the `error()` function, called from the innermost loop, we again iterate over all shards, but I don't think this makes the whole thing O(n^3), because this path is gated by the check that from -> to is valid for the group, so we can enter it at most once for every partition replica in the system (which is still a lot: 300,000 times for 100k partitions at r=3).
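Spelling out the arithmetic above (the 10 ns figure is the rough per-iteration estimate implied by the ~1 s stall):

```math
\begin{aligned}
(72 \times 16)^2 \times 80 &\approx 1.06 \times 10^{8} \text{ innermost iterations} \;\Rightarrow\; 1.06 \times 10^{8} \times 10\,\text{ns} \approx 1.06\,\text{s} \\
10^{5} \text{ partitions} \times 3 \text{ replicas} &= 3 \times 10^{5} \text{ possible entries into } \text{error()}
\end{aligned}
```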