Timeout in ManyPartitionsTest (ManyPartitionsTest.test_many_partitions) #4373

Closed
jcsp opened this issue Apr 21, 2022 · 9 comments · Fixed by #4529

Comments

@jcsp
Contributor

jcsp commented Apr 21, 2022

The test is timing out while trying to produce data; it expects to achieve an average produce rate of 10MB/s.
https://buildkite.com/redpanda/redpanda/builds/9213#4986e508-a6ea-4f58-8497-0c20ee249cf4

I'm going to check the log to make sure the cluster is really progressing, but this is probably a case where we need to lower our performance expectations a bit further for the docker test environment, and earmark this test for nightly EC2 runs instead of docker runs.
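For reference, a rough sketch of the kind of throughput floor being violated here; the helper names below are hypothetical, not the actual ManyPartitionsTest code:

import time

# Hypothetical sketch: produce a fixed volume of data and require an average
# rate of at least 10MB/s. `producer.produce_bytes` is an assumed blocking
# helper, not a real test API.
EXPECT_BYTES_PER_SEC = 10 * 1024 * 1024  # the 10MB/s floor mentioned above

def check_produce_rate(producer, total_bytes):
    start = time.time()
    producer.produce_bytes(total_bytes)
    elapsed = time.time() - start
    rate = total_bytes / elapsed
    assert rate >= EXPECT_BYTES_PER_SEC, (
        f"produce rate {rate / 1e6:.1f} MB/s is below the expected "
        f"{EXPECT_BYTES_PER_SEC / 1e6:.1f} MB/s")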

@jcsp
Contributor Author

jcsp commented Apr 21, 2022

The last produce messages that each node sees are:

TRACE 2022-04-20 17:24:06,908 [shard 1] kafka - produce.cc:525 - handling produce request {transactional_id={nullopt} acks=-1 timeout_ms=5000 topics={{name={scale_000000} partitions={{partition_index=676 records={batch 131160 v2_format true valid_crc true}}}}}}
TRACE 2022-04-20 17:24:06,908 [shard 0] kafka - produce.cc:525 - handling produce request {transactional_id={nullopt} acks=-1 timeout_ms=5000 topics={{name={scale_000000} partitions={{partition_index=933 records={batch 131160 v2_format true valid_crc true}}}}}}
TRACE 2022-04-20 17:23:57,778 [shard 1] kafka - produce.cc:525 - handling produce request {transactional_id={nullopt} acks=-1 timeout_ms=5000 topics={{name={scale_000000} partitions={{partition_index=707 records={batch 131160 v2_format true valid_crc true}}, {partition_index=110 records={batch 131160 v2_format true valid_crc true}}, {partition_index=462 records={batch 131160 v2_format true valid_crc true}}}}}}

But the timeout is all the way forward at 17:27:04,555, nearly three minutes later. So this isn't a simple slowness issue.
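(For reference, the gap computed from the timestamps quoted above:)

from datetime import datetime

# Gap between the last logged produce request and the test timeout.
fmt = "%Y-%m-%d %H:%M:%S,%f"
last_produce = datetime.strptime("2022-04-20 17:24:06,908", fmt)
timeout_at = datetime.strptime("2022-04-20 17:27:04,555", fmt)
print(timeout_at - last_produce)  # 0:02:57.647000 -- nearly three minutes idle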

@jcsp jcsp changed the title Timeout in ManyPartitionsTest Timeout in ManyPartitionsTest (ManyPartitionsTest.test_many_partitions) May 3, 2022
@jcsp
Contributor Author

jcsp commented May 3, 2022

In the interim between the original failure and the latest failures of this test, the logging was reduced from trace to info, so there is less information available in the latest failures than in the original ones. In the instances I've looked at, it has always been the n=1023,n_topics=1 variant of the test that fails, which supports the possibility that this results from an overloaded test system.

I'm going to set up a looping run of this test on clustered ducktape with trace logs enabled, to get a better sense of whether this is a real issue or a ghost from an overloaded docker environment.
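Roughly along these lines; the ducktape invocation, test path, and parameter JSON below are assumptions for illustration, not the exact command, and trace logging would be enabled however the clustered environment expects:

import subprocess

# Rough soak-loop sketch for the failing n=1023,n_topics=1 variant.
CMD = [
    "ducktape", "tests/rptest/scale_tests/many_partitions_test.py",
    "--parameters", '{"n": 1023, "n_topics": 1}',
]

for i in range(100):
    print(f"--- iteration {i} ---")
    if subprocess.run(CMD).returncode != 0:
        print(f"failure on iteration {i}; keeping logs for triage")
        break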

@Lazin
Contributor

Lazin commented May 4, 2022

@dotnwat
Member

dotnwat commented May 4, 2022

New occurrence.

Exact same message as in #4522 but it looks like that was closed as a duplicate of this one.

@twmb
Contributor

twmb commented May 5, 2022

@dotnwat
Member

dotnwat commented May 5, 2022

Moving this out of the milestone. Appears to be your friendly, routine CI failure (that says hi a lot).

@jcsp
Contributor Author

jcsp commented May 6, 2022

We just need to get review on 👉 #4529

This issue was closed.