
CI Failure (TimeoutError on _node_leadership_balanced) in ManyPartitionsTest.test_many_partitions #10507

Closed
ztlpn opened this issue May 2, 2023 · 24 comments · Fixed by #11919
Labels: ci-failure, ci-ignore (Automatic ci analysis tools ignore this issue), do-not-reopen, kind/bug (Something isn't working), performance

Comments

ztlpn commented May 2, 2023

https://buildkite.com/redpanda/vtools/builds/7354#0187d8ed-2013-4bf1-b94f-97a50e0d3c11

(arm, CDT)

Module: rptest.scale_tests.many_partitions_test
Class:  ManyPartitionsTest
Method: test_many_partitions
test_id:    rptest.scale_tests.many_partitions_test.ManyPartitionsTest.test_many_partitions
status:     FAIL
run time:   10 minutes 49.762 seconds


    TimeoutError('')
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/ducktape/tests/runner_client.py", line 135, in run
    data = self.run_test()
  File "/usr/local/lib/python3.10/dist-packages/ducktape/tests/runner_client.py", line 227, in run_test
    return self.test_context.function(self.test)
  File "/home/ubuntu/redpanda/tests/rptest/services/cluster.py", line 49, in wrapped
    r = f(self, *args, **kwargs)
  File "/home/ubuntu/redpanda/tests/rptest/scale_tests/many_partitions_test.py", line 877, in test_many_partitions
    self._test_many_partitions(compacted=False)
  File "/home/ubuntu/redpanda/tests/rptest/scale_tests/many_partitions_test.py", line 1064, in _test_many_partitions
    self._single_node_restart(scale, topic_names, n_partitions)
  File "/home/ubuntu/redpanda/tests/rptest/scale_tests/many_partitions_test.py", line 528, in _single_node_restart
    wait_until(
  File "/usr/local/lib/python3.10/dist-packages/ducktape/utils/util.py", line 57, in wait_until
    raise TimeoutError(err_msg() if callable(err_msg) else err_msg) from last_exception
ducktape.errors.TimeoutError
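
For context, the wait that times out in `_single_node_restart` is the standard ducktape `wait_until` pattern around the leadership-balance check. A minimal sketch follows; the predicate call, timeout and backoff values are illustrative, not the test's real ones:

```python
from ducktape.utils.util import wait_until

# Minimal sketch of the failing wait (illustrative values only).
# wait_until polls the predicate every backoff_sec seconds and raises
# ducktape.errors.TimeoutError (the error seen in the traceback above)
# if it never returns True within timeout_sec.
def wait_for_leadership_balance(test, timeout_sec=300, backoff_sec=10):
    wait_until(
        lambda: test._node_leadership_balanced(),  # predicate from the test
        timeout_sec=timeout_sec,
        backoff_sec=backoff_sec,
        err_msg="Waiting for leadership balance after restart")
```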
ztlpn added the kind/bug, ci-failure labels May 2, 2023
ztlpn commented May 2, 2023

ManyPartitionsTest.test_many_partitions_compacted failed in that CDT run with the same error.

bharathv commented May 3, 2023

https://buildkite.com/redpanda/vtools/builds/7366#0187de12-d734-4f34-96e9-01140abecda9

FAIL test: ManyPartitionsTest.test_many_partitions (1/2 runs)
  failure at 2023-05-03T04:38:39.867Z: TimeoutError('')
      on (arm64, VM) in job https://buildkite.com/redpanda/vtools/builds/7366#0187de12-d734-4f34-96e9-01140abecda9
FAIL test: ManyPartitionsTest.test_many_partitions_compacted (1/2 runs)
  failure at 2023-05-03T04:38:39.867Z: TimeoutError('')
      on (arm64, VM) in job https://buildkite.com/redpanda/vtools/builds/7366#0187de12-d734-4f34-96e9-01140abecda9

BenPope commented May 8, 2023

https://buildkite.com/redpanda/vtools/builds/7435#0187f7d2-7690-4a0d-bff2-fb903d8deff3

FAIL test: ManyPartitionsTest.test_many_partitions (2/6 runs)
  failure at 2023-05-08T05:20:40.490Z: TimeoutError('')
      on (arm64, VM) in job https://buildkite.com/redpanda/vtools/builds/7435#0187f7d2-7690-4a0d-bff2-fb903d8deff3
FAIL test: ManyPartitionsTest.test_many_partitions_compacted (3/6 runs)
  failure at 2023-05-08T05:20:40.490Z: TimeoutError('')
      on (arm64, VM) in job https://buildkite.com/redpanda/vtools/builds/7435#0187f7d2-7690-4a0d-bff2-fb903d8deff3

abhijat commented May 30, 2023

https://buildkite.com/redpanda/vtools/builds/7827#0188691e-0212-4567-a5bb-e59ae97dd98d

[INFO  - 2023-05-29 20:35:20,828 - runner_client - log - lineno:278]: RunnerClient: rptest.scale_tests.many_partitions_test.ManyPartitionsTest.test_many_partitions_compacted: Summary: TimeoutError('')
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/ducktape/tests/runner_client.py", line 135, in run
    data = self.run_test()
  File "/usr/local/lib/python3.10/dist-packages/ducktape/tests/runner_client.py", line 227, in run_test
    return self.test_context.function(self.test)
  File "/home/ubuntu/redpanda/tests/rptest/services/cluster.py", line 49, in wrapped
    r = f(self, *args, **kwargs)
  File "/home/ubuntu/redpanda/tests/rptest/scale_tests/many_partitions_test.py", line 879, in test_many_partitions_compacted
    self._test_many_partitions(compacted=True)
  File "/home/ubuntu/redpanda/tests/rptest/scale_tests/many_partitions_test.py", line 1072, in _test_many_partitions
    self._single_node_restart(scale, topic_names, n_partitions)
  File "/home/ubuntu/redpanda/tests/rptest/scale_tests/many_partitions_test.py", line 534, in _single_node_restart
    wait_until(
  File "/usr/local/lib/python3.10/dist-packages/ducktape/utils/util.py", line 57, in wait_until
    raise TimeoutError(err_msg() if callable(err_msg) else err_msg) from last_exception
ducktape.errors.TimeoutError

michael-redpanda self-assigned this May 31, 2023
michael-redpanda commented:

Fired up an ARM CDT instance and was able to reproduce the issue. It appears that on ARM, leadership balancing is very slow and the test times out. It's not a case where adding a little more time to the timeout would address it; the leadership balancing effectively stalls.

Attempted to re-run with raft logging set to 'debug', but then the test passed.

andijcr commented Jun 5, 2023

Similar failures for ManyPartitionsTest.test_many_partitions and ManyPartitionsTest.test_many_partitions_compacted in https://buildkite.com/redpanda/vtools/builds/7919#01888802-8884-46fd-ae0e-980a77943ce5

michael-redpanda commented:

Created #11242 in response to triaging this issue.

michael-redpanda removed their assignment Jun 12, 2023
StephanDollberg commented Jun 30, 2023

Started looking into this. As Michael posted above, the timeout happens because leadership balancing is slow.

I took a profile during that period:

[profile screenshot]

We see that we are very busy on the fetch path (there is a background consumer/producer going on). Also, looking at metrics, basically all nodes and shards are at 100% reactor utilization.

This issue was created on the 2nd of May. On the 27th of April we merged: f23f4ba (use a separate fetch scheduling group).

Rerunning the test with use_fetch_scheduler_group=false makes balancing quite fast and the test passes:

tests/results/2023-06-30--012/ManyPartitionsTest/test_many_partitions/1/test_log.info:[INFO - 2023-06-30 15:06:13,062 - many_partitions_test - _single_node_restart - lineno:538]: Leaderships balanced in 56.49 seconds
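
For reference, a minimal sketch of how such a rerun could be driven from an rptest-style test. This assumes `use_fetch_scheduler_group` is a plain cluster config property and that the Redpanda service object exposes a `set_cluster_config` helper as used elsewhere in rptest; both are assumptions, and the actual rerun may have been done differently (e.g. baking the property into the node config):

```python
# Hedged sketch: flip the cluster config property before the restart/balance
# phase, then re-run the test. Assumes RedpandaService.set_cluster_config()
# accepts a dict of property name -> value.
def disable_fetch_scheduler_group(redpanda):
    redpanda.set_cluster_config({"use_fetch_scheduler_group": False})
```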

In the metrics we can also see that the fetch group is quite starved, as is the main group. On the left is a run using the extra fetch group, on the right one with it disabled (fetch is obviously zero in that case, but note the reduced starvation for the main group):

[metrics screenshot: scheduling group starvation with the extra fetch group enabled (left) vs disabled (right)]

So while this would be an easy fix, there are things I don't understand yet and will need to confirm:

  • Does replication run in the main group or the raft group (I will need to compare raft group starvation as well)? Is replication simply being starved now that the fetch group gets a guaranteed quota? Without the extra group, all the fetch work isn't as hot in the profile, which I think confirms that it doesn't get as much priority then.
  • Presumably the background producer suffers in throughput with this change?
  • I would maybe expect the balancing to take twice as long, given the fetch and main scheduling groups use the same weight, but it takes more like 10 times as long. There are lots of timeouts in the logs, so possibly we make only very little progress.

Note I also looked at the Intel case. CPU utilization doesn't go as high there and the fetch group isn't as starved. Equally, in a profile the fetch functions aren't as hot. A perf annotate of the ARM profile reveals that we are mostly just in abseil hashmaps or comparing ntps, so possibly the Intel instances handle those better. I did confirm that we are using the abseil hashmap ARM NEON-optimized paths, but possibly we are a bit behind there in terms of flags (it can't be as bad as on x86).

cc @travisdowns as you added the fetch group.

StephanDollberg commented:

To answer some of my questions:

Does replication run in the main group or the raft group (will need to compare raft group starvation as well)?

Normal replication on produce does run in the main group. Recovery (when the node comes back online) does run in the raft group.

The raft group is starved a bit in general (not as much as main and fetch), though its execution time is a lot lower than main and fetch. There isn't much difference with regard to the raft group between the two cases (with and without the extra fetch group).

[metrics screenshots: raft group starvation and execution time, with and without the extra fetch group]

Nevertheless I will try again tomorrow and double the weight of the raft group.

As mentioned above, there are lots of RPC timeouts in the logs and it looks like even raft heartbeats fail at some point. This leads to nodes being temporarily "muted", which causes leadership balancing to take a lot longer as well.

Presumably the background producer suffers in throughput with this change?

It's hard to tell, as kgo-repeater doesn't log anything in this regard, but looking at node exporter tx/rx metrics the difference doesn't seem to be big.

I did also try moving the leader_balancer to a separate scheduling group (using the cluster group), but that doesn't make any difference (which makes sense, as there isn't much work happening in there).

Overall it looks to me like the extra scheduling group increases overall CPU utilization, pushing it to 100%, and that then has follow-on effects.

StephanDollberg commented:

Nevertheless I will try again tomorrow and double the weight of the raft group.

This makes no difference. I had a look at the scheduling code in seastar and understand this better now. Starve time will go up as long as there is a single task in the queue, but it's not weighted by task count in any sense. This explains why starve time is high while queue size and runtime are very low for the raft group.

So this is purely about the main vs fetch group.
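
To make the "shares don't help an idle group" point concrete, here is a toy Python model of proportional-share scheduling. It is not seastar code, just a sketch under the assumption that CPU time is split by shares among groups that currently have runnable work: a group whose queue is almost always empty, like the raft group here, finishes its small demand whatever its shares are, so doubling them changes nothing; the contention is entirely between main and fetch.

```python
# Toy model of proportional-share CPU scheduling (not seastar code).
# groups: {name: (shares, demand_ms)}. Returns the CPU ms each group actually
# uses out of total_ms of wall time, splitting time by shares among groups
# that still have runnable work.
def split_cpu(groups, total_ms=1000.0, rounds=100):
    remaining = {n: d for n, (_, d) in groups.items()}
    used = {n: 0.0 for n in groups}
    budget = total_ms
    for _ in range(rounds):
        active = [n for n in groups if remaining[n] > 1e-9]
        if not active or budget <= 1e-9:
            break
        total_shares = sum(groups[n][0] for n in active)
        spent = 0.0
        for n in active:
            take = min(budget * groups[n][0] / total_shares, remaining[n])
            remaining[n] -= take
            used[n] += take
            spent += take
        budget -= spent
    return used

# Heavy main/fetch demand, tiny raft demand, equal shares:
print(split_cpu({"main": (1000, 800), "fetch": (1000, 800), "raft": (1000, 50)}))
# Doubling raft's shares: raft still gets its 50ms, main/fetch split unchanged.
print(split_cpu({"main": (1000, 800), "fetch": (1000, 800), "raft": (2000, 50)}))
```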

StephanDollberg commented Jul 4, 2023

Just did a high-partition-count OMB test; things to notice:

  • The amount of time executed in fetch goes up a lot (lots more smaller fetch requests and extra work per partition)
  • It seems to go up more for ARM than for Intel

This I think confirms what we are seeing in the test.

Unfortunately I don't have a metrics screenshot, as /metrics exporting kind of stops working at high partition count (we need a bigger Prometheus instance or more aggregation).

StephanDollberg added a commit to StephanDollberg/redpanda that referenced this issue Jul 6, 2023
This partially reverts 9a93a9c

While the original motivation isn't invalidated, we have now found a
counterexample where the extra fetch group makes things worse overall.

`ManyPartitionTest` fails on ARM with the extra group but passes
without it. With the group in use, CPU util hits 100% and grinds everything
to a halt.

Fetch seems to be a lot slower on ARM. Hence, with the guaranteed share
of the extra group the whole system gets affected and hits CPU limits.

Because this is incredibly hard to reason about and it wasn't the core
fetch optimization, we decided to revert back to keeping it disabled by
default.

We still keep the option around as it might be useful in corner cases.

Fixes redpanda-data#10507
StephanDollberg commented:

Further fetch Intel vs ARM data:

fetch_plan_bench on i3en.xlarge (Intel, top) vs. is4gen.xlarge (ARM, bottom); the ARM run is roughly 1.6x slower per fetch plan:

test                                      iterations      median         mad         min         max      allocs       tasks        inst
fetch_plan_fixture.test_fetch_plan          14000000    76.561ns     0.027ns    76.384ns    76.650ns       1.040       0.000         0.0
fetch_plan_fixture.test_fetch_plan           9000000   120.246ns     0.022ns   119.959ns   120.348ns       1.040       0.000       659.7

StephanDollberg commented:

We discussed this in the team and have decided to change back to disabling the group by default.

Fundamentally, it's hard to reason about the scheduling groups, and it wasn't the core fetch optimization from Travis' changes.

vbotbuildovich pushed a commit to vbotbuildovich/redpanda that referenced this issue Jul 11, 2023 (same revert commit message as above; cherry picked from commit 6d1223d)
mmaslankaprv pushed a commit to mmaslankaprv/redpanda that referenced this issue Jul 20, 2023 (same revert commit message as above; cherry picked from commit 6d1223d)
abhijat commented Sep 1, 2023

https://buildkite.com/redpanda/redpanda/builds/36183#018a4d7d-4ccf-4b1d-93f0-cbf6217a7526

====================================================================================================
test_id:    rptest.scale_tests.many_partitions_test.ManyPartitionsTest.test_many_partitions
status:     FAIL
run time:   10 minutes 16.595 seconds


    TimeoutError('Waiting for leadership balance after restart')
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/ducktape/tests/runner_client.py", line 135, in run
    data = self.run_test()
  File "/usr/local/lib/python3.10/dist-packages/ducktape/tests/runner_client.py", line 227, in run_test
    return self.test_context.function(self.test)
  File "/home/ubuntu/redpanda/tests/rptest/services/cluster.py", line 82, in wrapped
    r = f(self, *args, **kwargs)
  File "/home/ubuntu/redpanda/tests/rptest/scale_tests/many_partitions_test.py", line 887, in test_many_partitions
    self._test_many_partitions(compacted=False)
  File "/home/ubuntu/redpanda/tests/rptest/scale_tests/many_partitions_test.py", line 1077, in _test_many_partitions
    self._single_node_restart(scale, topic_names, n_partitions)
  File "/home/ubuntu/redpanda/tests/rptest/scale_tests/many_partitions_test.py", line 533, in _single_node_restart
    wait_until(
  File "/usr/local/lib/python3.10/dist-packages/ducktape/utils/util.py", line 57, in wait_until
    raise TimeoutError(err_msg() if callable(err_msg) else err_msg) from last_exception
ducktape.errors.TimeoutError: Waiting for leadership balance after restart
====================================================================================================
test_id:    rptest.scale_tests.many_partitions_test.ManyPartitionsTest.test_many_partitions_compacted
status:     FAIL
run time:   9 minutes 49.257 seconds


    TimeoutError('Waiting for leadership balance after restart')
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/ducktape/tests/runner_client.py", line 135, in run
    data = self.run_test()
  File "/usr/local/lib/python3.10/dist-packages/ducktape/tests/runner_client.py", line 227, in run_test
    return self.test_context.function(self.test)
  File "/home/ubuntu/redpanda/tests/rptest/services/cluster.py", line 82, in wrapped
    r = f(self, *args, **kwargs)
  File "/home/ubuntu/redpanda/tests/rptest/scale_tests/many_partitions_test.py", line 883, in test_many_partitions_compacted
    self._test_many_partitions(compacted=True)
  File "/home/ubuntu/redpanda/tests/rptest/scale_tests/many_partitions_test.py", line 1077, in _test_many_partitions
    self._single_node_restart(scale, topic_names, n_partitions)
  File "/home/ubuntu/redpanda/tests/rptest/scale_tests/many_partitions_test.py", line 533, in _single_node_restart
    wait_until(
  File "/usr/local/lib/python3.10/dist-packages/ducktape/utils/util.py", line 57, in wait_until
    raise TimeoutError(err_msg() if callable(err_msg) else err_msg) from last_exception
ducktape.errors.TimeoutError: Waiting for leadership balance after restart

Seems to have recurred.

abhijat reopened this Sep 1, 2023
StephanDollberg added a commit that referenced this issue Sep 1, 2023
…idth

So far we had not been giving a target bandwidth to the background
kgo-repeater traffic.

As a result, the clients would just spam as fast as possible and make the
test flaky and unreliable, as this might overload RP.

Especially on ARM this causes leadership to not stabilize.

Note we also change the final soak phase to run for two minutes instead
of for a number of bytes, and calculate the byte count from the expected
bandwidth. This again makes the test more predictable, independent of
where it's running.

Fixes #10507
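
The gist of the fix, as a hedged sketch: cap the background kgo-repeater traffic at a target rate and size the final soak phase by time, deriving the expected byte count from that rate so the phase length no longer depends on how fast the hardware happens to be. The constant names below and how they would be plumbed into the repeater are illustrative, not the actual parameters added in the PR:

```python
# Illustrative only; not the real parameters from the PR.
TARGET_PRODUCE_BPS = 50 * 2**20   # hypothetical target bandwidth for kgo-repeater
SOAK_SECONDS = 120                # "run for two minutes" per the commit message

# Previously the soak phase ran until a fixed number of bytes had moved, which
# takes wildly different wall-clock time on different hardware. Deriving the
# byte target from the rate keeps the phase length predictable:
expected_soak_bytes = TARGET_PRODUCE_BPS * SOAK_SECONDS
```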
StephanDollberg added a commit that referenced this issue Sep 1, 2023 (same commit message as above)
rystsov added the do-not-reopen, ci-ignore labels Sep 1, 2023
rystsov commented Sep 1, 2023

pandatriage produces a data/conflicting-open-issues.json file as one of its reports. When it isn't empty, there are issues related to the same test that point to the same build. That could mean a duplicate, but most likely it's an artifact of the umbrella ci-issue anti-pattern, where we dump every failure related to a given test under the same issue; pandatriage uses info from an issue to link failures with issues, and umbrella tickets mess with that. I'll be closing the conflicting issues, leaving a comment, and marking them with the do-not-reopen label. If you want to reopen this issue, just create a new issue and link the closed one.

rystsov closed this as completed Sep 1, 2023
vbotbuildovich pushed a commit to vbotbuildovich/redpanda that referenced this issue Sep 5, 2023 (same commit message as above; cherry picked from commit 02338f5)

vbotbuildovich pushed another commit to vbotbuildovich/redpanda that referenced this issue Sep 5, 2023 (same commit message; cherry picked from commit 02338f5)