CI Failure (TimeoutError on _node_leadership_balanced) in ManyPartitionsTest.test_many_partitions #10507
Comments
ManyPartitionsTest.test_many_partitions_compacted failed in that cdt run with the same error. |
https://buildkite.com/redpanda/vtools/builds/7366#0187de12-d734-4f34-96e9-01140abecda9 |
https://buildkite.com/redpanda/vtools/builds/7435#0187f7d2-7690-4a0d-bff2-fb903d8deff3 |
Another on arm https://buildkite.com/redpanda/vtools/builds/7762#01884f5f-ba84-48c9-9cdc-cd37059b63da And a similar error with https://buildkite.com/redpanda/vtools/builds/7827#0188691e-0212-4567-a5bb-e59ae97dd98d |
Fired up an ARM CDT instance and was able to reproduce the issue. It appears that on ARM, leadership balancing is very slow and the test times out. It's not a case where adding a bit more time to the timeout would address it; the leadership balancing effectively stalls. Attempted to re-run with raft logging set to 'debug', but then the test passed... |
similar failures for |
Created #11242 in response to triaging this issue. |
ManyPartitionsTest.test_many_partitions |
Started looking into this. As Michael posted above, the timeout happens because leadership balancing is slow. I took a profile during that period: we see that we are very busy on the fetch path (there is a background consumer/producer going on). Also, looking at metrics, basically all nodes and shards are at 100% reactor util. This issue was created on the 2nd of May. On the 27th of April we merged f23f4ba (use a separate fetch scheduling group). Rerunning the test with
In the metrics we can also see that the fetch group is quite starved, as is main. Left is a run using the extra fetch group and right is one with it disabled (fetch is obviously zero in that case, but note the lower starvation for the main group). So while this would be an easy fix, there are things I don't understand yet and will need to confirm further:
Note I also looked at the Intel case. CPU utilization doesn't go as high there and the fetch group isn't as starved. Equally, in a profile the fetch functions aren't as hot. A perf annotate in the ARM profile reveals that we are mostly just in abseil hashmaps or comparing. cc @travisdowns as you added the fetch group. |
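For context on what f23f4ba introduced, here is a minimal Seastar-level sketch of a dedicated fetch scheduling group. This is not Redpanda's actual code; the group name, the share count, and `handle_fetch_request` are placeholders for this sketch:

```cpp
#include <seastar/core/future.hh>
#include <seastar/core/scheduling.hh>
#include <seastar/core/with_scheduling_group.hh>

// Stand-in for the real fetch handler; here it just completes immediately.
static seastar::future<> handle_fetch_request() {
    return seastar::make_ready_future<>();
}

// Create a dedicated scheduling group for fetch work (name and share count
// made up for this sketch) and run the fetch handling inside it, so the
// reactor accounts that CPU time to a separate "fetch" queue instead of the
// main queue and arbitrates between the two by their shares.
seastar::future<> run_fetch_in_own_group() {
    return seastar::create_scheduling_group("fetch", 100).then(
        [](seastar::scheduling_group fetch_sg) {
            return seastar::with_scheduling_group(fetch_sg, [] {
                return handle_fetch_request();
            });
        });
}
```

The shares only set relative weights between groups; once every shard is already at 100% reactor utilization, giving fetch its guaranteed share just shifts where the starvation shows up, which is what the metrics above suggest.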
This makes no difference. Had a look at the scheduling code in Seastar and understand this better now. Starve time goes up as long as there is at least a single task in the queue, but it's not weighted by task count in any sense. This explains why starve time is high while queue size and runtime are very low for the raft group. So this is purely about the main vs. fetch group. |
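As a toy illustration of the accounting described above (this is not Seastar's implementation, just a model of the described behaviour: one waiting task is enough to accumulate starve time, and the number of queued tasks doesn't weight it):

```cpp
#include <chrono>
#include <deque>
#include <functional>

// Toy model only: starve time accumulates while a queue has >= 1 waiting
// task and is not being run; the queue length itself plays no role.
struct toy_task_queue {
    std::deque<std::function<void()>> tasks;
    std::chrono::steady_clock::duration starve_time{0};
    std::chrono::steady_clock::time_point waiting_since{};
    bool waiting = false;

    void push(std::function<void()> t) {
        tasks.push_back(std::move(t));
        if (!waiting) { // a single pending task starts the "starving" clock
            waiting = true;
            waiting_since = std::chrono::steady_clock::now();
        }
    }

    // Called when the scheduler picks this queue to run.
    void run_all() {
        if (waiting) {
            starve_time += std::chrono::steady_clock::now() - waiting_since;
            waiting = false;
        }
        while (!tasks.empty()) {
            tasks.front()();
            tasks.pop_front();
        }
    }
};
```

Under this model a queue like raft can show a large starve time alongside a tiny queue size and runtime: one pending task is enough to keep the clock running.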
Just did a high partition count OMB test, things to notice:
This, I think, confirms what we are seeing in the test. Unfortunately I don't have a metrics screenshot, as /metrics exporting pretty much stops working at high partition counts (we'd need a bigger Prometheus instance or more aggregation). |
This partially reverts 9a93a9c. While the original motivation isn't invalidated, we have now found a counter-example where the extra fetch group makes things worse overall. `ManyPartitionsTest` fails on ARM with the extra group but passes without it. With the group in use, CPU util hits 100% and grinds everything to a halt. Fetch seems to be a lot slower on ARM. Hence, with the guaranteed share of the extra group, the whole system gets affected and hits CPU limits. Because this is incredibly hard to reason about and it wasn't the core fetch optimization, we decided to revert back to keeping it disabled by default. We still keep the option around as it might be useful in corner cases. Fixes redpanda-data#10507
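A rough sketch of the resulting shape, with an assumed option name (`use_fetch_scheduler_group` is a placeholder here, not necessarily the real property, and the config plumbing is omitted): the dedicated group is only used when explicitly enabled, otherwise fetch stays in the default group.

```cpp
#include <seastar/core/scheduling.hh>

// Sketch only. "use_fetch_scheduler_group" stands in for whatever the real
// cluster property is called; reading the config is not shown here.
seastar::scheduling_group pick_fetch_group(
    seastar::scheduling_group dedicated_fetch_sg,
    bool use_fetch_scheduler_group) {
    // Default is now "off": fetch work runs in the main (default) group,
    // which is what the revert restores. The dedicated group remains
    // available for the corner cases where opting in still helps.
    return use_fetch_scheduler_group
        ? dedicated_fetch_sg
        : seastar::default_scheduling_group();
}
```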
Further fetch Intel vs ARM data: fetch_plan_bench - i3en.xlarge (top) vs. is4gen.xlarge (bottom):
|
We discussed this in the team and have decided to go back to disabling the group by default. Fundamentally, it's hard to reason about the scheduling groups, and it wasn't the core fetch optimisation from Travis' changes. |
This partially reverts 9a93a9c. While the original motivation isn't invalidated, we have now found a counter-example where the extra fetch group makes things worse overall. `ManyPartitionsTest` fails on ARM with the extra group but passes without it. With the group in use, CPU util hits 100% and grinds everything to a halt. Fetch seems to be a lot slower on ARM. Hence, with the guaranteed share of the extra group, the whole system gets affected and hits CPU limits. Because this is incredibly hard to reason about and it wasn't the core fetch optimization, we decided to revert back to keeping it disabled by default. We still keep the option around as it might be useful in corner cases. Fixes redpanda-data#10507 (cherry picked from commit 6d1223d)
https://buildkite.com/redpanda/redpanda/builds/36183#018a4d7d-4ccf-4b1d-93f0-cbf6217a7526 seems to have re-occurred |
…idth So far we had not been giving a target bandwidth to the background kgo-repeater traffic. As a result, the clients would just spam as fast as possible and make the test flaky and unreliable, as this might overload RP. Especially on Arm this causes leadership to not stabilize. Note we also change the final soak phase to run for two minutes instead of for a fixed number of bytes, calculating the byte count from the expected bandwidth. This again makes the test more predictable, independent of where it's running. Fixes #10507
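Back-of-the-envelope, with placeholder numbers (not the values the test actually uses): once the soak phase has a fixed duration, the byte target follows directly from the target bandwidth.

```cpp
#include <cstdint>
#include <cstdio>

int main() {
    // Placeholder numbers for illustration only; the real test picks its
    // own target bandwidth based on hardware and node count.
    const std::uint64_t target_bandwidth = 100ull * 1024 * 1024; // bytes/s (100 MiB/s)
    const std::uint64_t soak_seconds = 120;                      // the "two minutes" soak
    const std::uint64_t soak_bytes = target_bandwidth * soak_seconds;
    std::printf("soak target: %llu bytes\n",
                static_cast<unsigned long long>(soak_bytes));
    return 0;
}
```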
|
…idth So far we had not been giving a target bandwidth to the background kgo-repeater traffic. As a result, the clients would just spam as fast as possible and make the test flaky and unreliable, as this might overload RP. Especially on Arm this causes leadership to not stabilize. Note we also change the final soak phase to run for two minutes instead of for a fixed number of bytes, calculating the byte count from the expected bandwidth. This again makes the test more predictable, independent of where it's running. Fixes redpanda-data#10507 (cherry picked from commit 02338f5)
https://buildkite.com/redpanda/vtools/builds/7354#0187d8ed-2013-4bf1-b94f-97a50e0d3c11
(arm, CDT)