
CI Failure (TimeoutError on _node_leadership_balanced) in ManyPartitionsTest.test_many_partitions #10507

Closed
ztlpn opened this issue May 2, 2023 · 24 comments · Fixed by #11919
Labels: ci-failure, ci-ignore (Automatic ci analysis tools ignore this issue), do-not-reopen, kind/bug (Something isn't working), performance

Comments

ztlpn commented May 2, 2023

https://buildkite.com/redpanda/vtools/builds/7354#0187d8ed-2013-4bf1-b94f-97a50e0d3c11

(arm, CDT)

Module: rptest.scale_tests.many_partitions_test
Class:  ManyPartitionsTest
Method: test_many_partitions
test_id:    rptest.scale_tests.many_partitions_test.ManyPartitionsTest.test_many_partitions
status:     FAIL
run time:   10 minutes 49.762 seconds


    TimeoutError('')
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/ducktape/tests/runner_client.py", line 135, in run
    data = self.run_test()
  File "/usr/local/lib/python3.10/dist-packages/ducktape/tests/runner_client.py", line 227, in run_test
    return self.test_context.function(self.test)
  File "/home/ubuntu/redpanda/tests/rptest/services/cluster.py", line 49, in wrapped
    r = f(self, *args, **kwargs)
  File "/home/ubuntu/redpanda/tests/rptest/scale_tests/many_partitions_test.py", line 877, in test_many_partitions
    self._test_many_partitions(compacted=False)
  File "/home/ubuntu/redpanda/tests/rptest/scale_tests/many_partitions_test.py", line 1064, in _test_many_partitions
    self._single_node_restart(scale, topic_names, n_partitions)
  File "/home/ubuntu/redpanda/tests/rptest/scale_tests/many_partitions_test.py", line 528, in _single_node_restart
    wait_until(
  File "/usr/local/lib/python3.10/dist-packages/ducktape/utils/util.py", line 57, in wait_until
    raise TimeoutError(err_msg() if callable(err_msg) else err_msg) from last_exception
ducktape.errors.TimeoutError
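
For context, the wait that times out in `_single_node_restart` is the standard ducktape `wait_until` pattern around the leadership-balance check. A minimal sketch follows; the predicate call, timeout and backoff values are illustrative, not the test's real ones:

```python
from ducktape.utils.util import wait_until

# Minimal sketch of the failing wait (illustrative values only).
# wait_until polls the predicate every backoff_sec seconds and raises
# ducktape.errors.TimeoutError (the error seen in the traceback above)
# if it never returns True within timeout_sec.
def wait_for_leadership_balance(test, timeout_sec=300, backoff_sec=10):
    wait_until(
        lambda: test._node_leadership_balanced(),  # predicate from the test
        timeout_sec=timeout_sec,
        backoff_sec=backoff_sec,
        err_msg="Waiting for leadership balance after restart")
```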
ztlpn added the kind/bug, ci-failure labels May 2, 2023
ztlpn commented May 2, 2023

ManyPartitionsTest.test_many_partitions_compacted failed in that CDT run with the same error.

bharathv commented May 3, 2023

https://buildkite.com/redpanda/vtools/builds/7366#0187de12-d734-4f34-96e9-01140abecda9

FAIL test: ManyPartitionsTest.test_many_partitions (1/2 runs)
  failure at 2023-05-03T04:38:39.867Z: TimeoutError('')
      on (arm64, VM) in job https://buildkite.com/redpanda/vtools/builds/7366#0187de12-d734-4f34-96e9-01140abecda9
FAIL test: ManyPartitionsTest.test_many_partitions_compacted (1/2 runs)
  failure at 2023-05-03T04:38:39.867Z: TimeoutError('')
      on (arm64, VM) in job https://buildkite.com/redpanda/vtools/builds/7366#0187de12-d734-4f34-96e9-01140abecda9

BenPope commented May 8, 2023

https://buildkite.com/redpanda/vtools/builds/7435#0187f7d2-7690-4a0d-bff2-fb903d8deff3

FAIL test: ManyPartitionsTest.test_many_partitions (2/6 runs)
  failure at 2023-05-08T05:20:40.490Z: TimeoutError('')
      on (arm64, VM) in job https://buildkite.com/redpanda/vtools/builds/7435#0187f7d2-7690-4a0d-bff2-fb903d8deff3
FAIL test: ManyPartitionsTest.test_many_partitions_compacted (3/6 runs)
  failure at 2023-05-08T05:20:40.490Z: TimeoutError('')
      on (arm64, VM) in job https://buildkite.com/redpanda/vtools/builds/7435#0187f7d2-7690-4a0d-bff2-fb903d8deff3

abhijat commented May 30, 2023

https://buildkite.com/redpanda/vtools/builds/7827#0188691e-0212-4567-a5bb-e59ae97dd98d

[INFO  - 2023-05-29 20:35:20,828 - runner_client - log - lineno:278]: RunnerClient: rptest.scale_tests.many_partitions_test.ManyPartitionsTest.test_many_partitions_compacted: Summary: TimeoutError('')
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/ducktape/tests/runner_client.py", line 135, in run
    data = self.run_test()
  File "/usr/local/lib/python3.10/dist-packages/ducktape/tests/runner_client.py", line 227, in run_test
    return self.test_context.function(self.test)
  File "/home/ubuntu/redpanda/tests/rptest/services/cluster.py", line 49, in wrapped
    r = f(self, *args, **kwargs)
  File "/home/ubuntu/redpanda/tests/rptest/scale_tests/many_partitions_test.py", line 879, in test_many_partitions_compacted
    self._test_many_partitions(compacted=True)
  File "/home/ubuntu/redpanda/tests/rptest/scale_tests/many_partitions_test.py", line 1072, in _test_many_partitions
    self._single_node_restart(scale, topic_names, n_partitions)
  File "/home/ubuntu/redpanda/tests/rptest/scale_tests/many_partitions_test.py", line 534, in _single_node_restart
    wait_until(
  File "/usr/local/lib/python3.10/dist-packages/ducktape/utils/util.py", line 57, in wait_until
    raise TimeoutError(err_msg() if callable(err_msg) else err_msg) from last_exception
ducktape.errors.TimeoutError

michael-redpanda self-assigned this May 31, 2023
michael-redpanda commented:

Fired up an ARM CDT instance and was able to reproduce the issue. It appears that on ARM, leadership balancing is very slow and the test times out. It's not a case where adding a little more time to the timeout would address it; the leadership balancing effectively stalls.

Attempted to re-run with raft logging set to 'debug', but then the test passed.

andijcr commented Jun 5, 2023

Similar failures for ManyPartitionsTest.test_many_partitions and ManyPartitionsTest.test_many_partitions_compacted in https://buildkite.com/redpanda/vtools/builds/7919#01888802-8884-46fd-ae0e-980a77943ce5

michael-redpanda commented:

Created #11242 in response to triaging this issue.

michael-redpanda removed their assignment Jun 12, 2023
StephanDollberg commented Jun 30, 2023

Started looking into this. As Michael posted above, the timeout happens because leadership balancing is slow.

I took a profile during that period:

[profile screenshot]

We see that we are very busy on the fetch path (there is a background consumer/producer going on). Also, looking at metrics, basically all nodes and shards are at 100% reactor utilization.

This issue was created on the 2nd of May. On the 27th of April we merged: f23f4ba (use a separate fetch scheduling group).

Rerunning the test with use_fetch_scheduler_group=false makes balancing quite fast and the test passes:

tests/results/2023-06-30--012/ManyPartitionsTest/test_many_partitions/1/test_log.info:[INFO - 2023-06-30 15:06:13,062 - many_partitions_test - _single_node_restart - lineno:538]: Leaderships balanced in 56.49 seconds
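
For reference, a minimal sketch of how such a rerun could be driven from an rptest-style test. This assumes `use_fetch_scheduler_group` is a plain cluster config property and that the Redpanda service object exposes a `set_cluster_config` helper as used elsewhere in rptest; both are assumptions, and the actual rerun may have been done differently (e.g. baking the property into the node config):

```python
# Hedged sketch: flip the cluster config property before the restart/balance
# phase, then re-run the test. Assumes RedpandaService.set_cluster_config()
# accepts a dict of property name -> value.
def disable_fetch_scheduler_group(redpanda):
    redpanda.set_cluster_config({"use_fetch_scheduler_group": False})
```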

In the metrics we can also see that the fetch group is quite starved, as is the main group. On the left is a run using the extra fetch group, on the right one with it disabled (fetch is obviously zero in that case, but note the reduced starvation for the main group):

[metrics screenshot: scheduling group starvation with the extra fetch group enabled (left) vs disabled (right)]

So while this would be an easy fix, there are things I don't understand yet and will need to confirm:

  • Does replication run in the main group or the raft group (I will need to compare raft group starvation as well)? Is replication simply being starved now that the fetch group gets a guaranteed quota? Without the extra group, all the fetch work isn't as hot in the profile, which I think confirms that it doesn't get as much priority then.
  • Presumably the background producer suffers in throughput with this change?
  • I would maybe expect the balancing to take twice as long, given the fetch and main scheduling groups use the same weight, but it takes more like 10 times as long. There are lots of timeouts in the logs, so possibly we make only very little progress.

Note I also looked at the Intel case. CPU utilization doesn't go as high there and the fetch group isn't as starved. Equally, in a profile the fetch functions aren't as hot. A perf annotate of the ARM profile reveals that we are mostly just in abseil hashmaps or comparing ntps, so possibly the Intel instances handle those better. I did confirm that we are using the abseil hashmap ARM NEON-optimized paths, but possibly we are a bit behind there in terms of flags (it can't be as bad as on x86).

cc @travisdowns as you added the fetch group.

StephanDollberg commented:

To answer some of my questions:

Does replication run in the main group or the raft group (will need to compare raft group starvation as well)?

Normal replication on produce does run in the main group. Recovery (when the node comes back online) does run in the raft group.

The raft group is starved a bit in general (not as much as main and fetch), though its execution time is a lot lower than main and fetch. There isn't much difference with regard to the raft group between the two cases (with and without the extra fetch group).

[metrics screenshots: raft group starvation and execution time, with and without the extra fetch group]

Nevertheless I will try again tomorrow and double the weight of the raft group.

As mentioned above, there are lots of RPC timeouts in the logs and it looks like even raft heartbeats fail at some point. This leads to nodes being temporarily "muted", which causes leadership balancing to take a lot longer as well.

Presumably the background producer suffers in throughput with this change?

It's hard to tell, as kgo-repeater doesn't log anything in this regard, but looking at node exporter tx/rx metrics the difference doesn't seem to be big.

I did also try moving the leader_balancer to a separate scheduling group (using the cluster group), but that doesn't make any difference (which makes sense, as there isn't much work happening in there).

Overall it looks to me like the extra scheduling group increases overall CPU utilization, pushing it to 100%, and that then has follow-on effects.

StephanDollberg commented:

Nevertheless I will try again tomorrow and double the weight of the raft group.

This makes no difference. I had a look at the scheduling code in seastar and understand this better now. Starve time will go up as long as there is a single task in the queue, but it's not weighted by task count in any sense. This explains why starve time is high while queue size and runtime are very low for the raft group.

So this is purely about the main vs fetch group.
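
To make the "shares don't help an idle group" point concrete, here is a toy Python model of proportional-share scheduling. It is not seastar code, just a sketch under the assumption that CPU time is split by shares among groups that currently have runnable work: a group whose queue is almost always empty, like the raft group here, finishes its small demand whatever its shares are, so doubling them changes nothing; the contention is entirely between main and fetch.

```python
# Toy model of proportional-share CPU scheduling (not seastar code).
# groups: {name: (shares, demand_ms)}. Returns the CPU ms each group actually
# uses out of total_ms of wall time, splitting time by shares among groups
# that still have runnable work.
def split_cpu(groups, total_ms=1000.0, rounds=100):
    remaining = {n: d for n, (_, d) in groups.items()}
    used = {n: 0.0 for n in groups}
    budget = total_ms
    for _ in range(rounds):
        active = [n for n in groups if remaining[n] > 1e-9]
        if not active or budget <= 1e-9:
            break
        total_shares = sum(groups[n][0] for n in active)
        spent = 0.0
        for n in active:
            take = min(budget * groups[n][0] / total_shares, remaining[n])
            remaining[n] -= take
            used[n] += take
            spent += take
        budget -= spent
    return used

# Heavy main/fetch demand, tiny raft demand, equal shares:
print(split_cpu({"main": (1000, 800), "fetch": (1000, 800), "raft": (1000, 50)}))
# Doubling raft's shares: raft still gets its 50ms, main/fetch split unchanged.
print(split_cpu({"main": (1000, 800), "fetch": (1000, 800), "raft": (2000, 50)}))
```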

StephanDollberg commented Jul 4, 2023

Just did a high-partition-count OMB test; things to notice:

  • The amount of time executed in fetch goes up a lot (lots more smaller fetch requests and extra work per partition)
  • It seems to go up more for ARM than for Intel

This I think confirms what we are seeing in the test.

Unfortunately I don't have a metrics screenshot, as /metrics exporting kind of stops working at high partition count (we need a bigger Prometheus instance or more aggregation).

StephanDollberg added a commit to StephanDollberg/redpanda that referenced this issue Jul 6, 2023
This partially reverts 9a93a9c

While the original motivation isn't invalidated, we have now found a
counterexample where the extra fetch group makes things worse overall.

`ManyPartitionTest` fails on ARM with the extra group but passes
without it. With the group in use, CPU util hits 100% and grinds everything
to a halt.

Fetch seems to be a lot slower on ARM. Hence, with the guaranteed share
of the extra group the whole system gets affected and hits CPU limits.

Because this is incredibly hard to reason about and it wasn't the core
fetch optimization, we decided to revert back to keeping it disabled by
default.

We still keep the option around as it might be useful in corner cases.

Fixes redpanda-data#10507
StephanDollberg commented:

Further fetch Intel vs ARM data:

fetch_plan_bench on i3en.xlarge (Intel, top) vs. is4gen.xlarge (ARM, bottom); the ARM run is roughly 1.6x slower per fetch plan:

test                                      iterations      median         mad         min         max      allocs       tasks        inst
fetch_plan_fixture.test_fetch_plan          14000000    76.561ns     0.027ns    76.384ns    76.650ns       1.040       0.000         0.0
fetch_plan_fixture.test_fetch_plan           9000000   120.246ns     0.022ns   119.959ns   120.348ns       1.040       0.000       659.7

StephanDollberg commented:

We discussed this in the team and have decided to change back to disabling the group by default.

Fundamentally, it's hard to reason about the scheduling groups, and it wasn't the core fetch optimization from Travis' changes.

vbotbuildovich pushed a commit to vbotbuildovich/redpanda that referenced this issue Jul 11, 2023 (same revert commit message as above; cherry picked from commit 6d1223d)
mmaslankaprv pushed a commit to mmaslankaprv/redpanda that referenced this issue Jul 20, 2023 (same revert commit message as above; cherry picked from commit 6d1223d)
abhijat commented Sep 1, 2023

https://buildkite.com/redpanda/redpanda/builds/36183#018a4d7d-4ccf-4b1d-93f0-cbf6217a7526

====================================================================================================
test_id:    rptest.scale_tests.many_partitions_test.ManyPartitionsTest.test_many_partitions
status:     FAIL
run time:   10 minutes 16.595 seconds


    TimeoutError('Waiting for leadership balance after restart')
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/ducktape/tests/runner_client.py", line 135, in run
    data = self.run_test()
  File "/usr/local/lib/python3.10/dist-packages/ducktape/tests/runner_client.py", line 227, in run_test
    return self.test_context.function(self.test)
  File "/home/ubuntu/redpanda/tests/rptest/services/cluster.py", line 82, in wrapped
    r = f(self, *args, **kwargs)
  File "/home/ubuntu/redpanda/tests/rptest/scale_tests/many_partitions_test.py", line 887, in test_many_partitions
    self._test_many_partitions(compacted=False)
  File "/home/ubuntu/redpanda/tests/rptest/scale_tests/many_partitions_test.py", line 1077, in _test_many_partitions
    self._single_node_restart(scale, topic_names, n_partitions)
  File "/home/ubuntu/redpanda/tests/rptest/scale_tests/many_partitions_test.py", line 533, in _single_node_restart
    wait_until(
  File "/usr/local/lib/python3.10/dist-packages/ducktape/utils/util.py", line 57, in wait_until
    raise TimeoutError(err_msg() if callable(err_msg) else err_msg) from last_exception
ducktape.errors.TimeoutError: Waiting for leadership balance after restart
====================================================================================================
test_id:    rptest.scale_tests.many_partitions_test.ManyPartitionsTest.test_many_partitions_compacted
status:     FAIL
run time:   9 minutes 49.257 seconds


    TimeoutError('Waiting for leadership balance after restart')
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/ducktape/tests/runner_client.py", line 135, in run
    data = self.run_test()
  File "/usr/local/lib/python3.10/dist-packages/ducktape/tests/runner_client.py", line 227, in run_test
    return self.test_context.function(self.test)
  File "/home/ubuntu/redpanda/tests/rptest/services/cluster.py", line 82, in wrapped
    r = f(self, *args, **kwargs)
  File "/home/ubuntu/redpanda/tests/rptest/scale_tests/many_partitions_test.py", line 883, in test_many_partitions_compacted
    self._test_many_partitions(compacted=True)
  File "/home/ubuntu/redpanda/tests/rptest/scale_tests/many_partitions_test.py", line 1077, in _test_many_partitions
    self._single_node_restart(scale, topic_names, n_partitions)
  File "/home/ubuntu/redpanda/tests/rptest/scale_tests/many_partitions_test.py", line 533, in _single_node_restart
    wait_until(
  File "/usr/local/lib/python3.10/dist-packages/ducktape/utils/util.py", line 57, in wait_until
    raise TimeoutError(err_msg() if callable(err_msg) else err_msg) from last_exception
ducktape.errors.TimeoutError: Waiting for leadership balance after restart

Seems to have recurred.

abhijat reopened this Sep 1, 2023
StephanDollberg added a commit that referenced this issue Sep 1, 2023
…idth

So far we had not been giving a target bandwidth to the background
kgo-repeater traffic.

As a result, the clients would just spam as fast as possible and make the
test flaky and unreliable, as this might overload RP.

Especially on ARM this causes leadership to not stabilize.

Note we also change the final soak phase to run for two minutes instead
of for a number of bytes, and calculate the byte count from the expected
bandwidth. This again makes the test more predictable, independent of
where it's running.

Fixes #10507
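
The gist of the fix, as a hedged sketch: cap the background kgo-repeater traffic at a target rate and size the final soak phase by time, deriving the expected byte count from that rate so the phase length no longer depends on how fast the hardware happens to be. The constant names below and how they would be plumbed into the repeater are illustrative, not the actual parameters added in the PR:

```python
# Illustrative only; not the real parameters from the PR.
TARGET_PRODUCE_BPS = 50 * 2**20   # hypothetical target bandwidth for kgo-repeater
SOAK_SECONDS = 120                # "run for two minutes" per the commit message

# Previously the soak phase ran until a fixed number of bytes had moved, which
# takes wildly different wall-clock time on different hardware. Deriving the
# byte target from the rate keeps the phase length predictable:
expected_soak_bytes = TARGET_PRODUCE_BPS * SOAK_SECONDS
```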
StephanDollberg added a commit that referenced this issue Sep 1, 2023 (same commit message as above)
rystsov added the do-not-reopen, ci-ignore labels Sep 1, 2023
rystsov commented Sep 1, 2023

pandatriage produces a data/conflicting-open-issues.json file as one of its reports. When it isn't empty, there are issues related to the same test that point to the same build. That could mean a duplicate, but most likely it's an artifact of the umbrella ci-issue anti-pattern, where we dump every failure related to a given test under the same issue; pandatriage uses info from an issue to link failures with issues, and umbrella tickets mess with that. I'll be closing the conflicting issues, leaving a comment, and marking them with the do-not-reopen label. If you want to reopen this issue, just create a new issue and link the closed one.

rystsov closed this as completed Sep 1, 2023
vbotbuildovich pushed a commit to vbotbuildovich/redpanda that referenced this issue Sep 5, 2023 (same commit message as above; cherry picked from commit 02338f5)

vbotbuildovich pushed another commit to vbotbuildovich/redpanda that referenced this issue Sep 5, 2023 (same commit message; cherry picked from commit 02338f5)