Timeout in ManyPartitionsTest (ManyPartitionsTest.test_many_partitions) #4373

Closed
jcsp opened this issue Apr 21, 2022 · 9 comments · Fixed by #4529

Comments

@jcsp
Contributor

jcsp commented Apr 21, 2022

The test is timing out while trying to produce data; it expects to achieve an average produce rate of 10MB/s.
https://buildkite.com/redpanda/redpanda/builds/9213#4986e508-a6ea-4f58-8497-0c20ee249cf4

I'm going to check the log to make sure the cluster is really progressing, but this is probably a case where we need to lower our performance expectations a bit further for the docker test environment, and earmark this test for nightly EC2 runs instead of docker runs.
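For reference, a rough sketch of the kind of throughput floor being violated here; the helper names below are hypothetical, not the actual ManyPartitionsTest code:

import time

# Hypothetical sketch: produce a fixed volume of data and require an average
# rate of at least 10MB/s. `producer.produce_bytes` is an assumed blocking
# helper, not a real test API.
EXPECT_BYTES_PER_SEC = 10 * 1024 * 1024  # the 10MB/s floor mentioned above

def check_produce_rate(producer, total_bytes):
    start = time.time()
    producer.produce_bytes(total_bytes)
    elapsed = time.time() - start
    rate = total_bytes / elapsed
    assert rate >= EXPECT_BYTES_PER_SEC, (
        f"produce rate {rate / 1e6:.1f} MB/s is below the expected "
        f"{EXPECT_BYTES_PER_SEC / 1e6:.1f} MB/s")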

@jcsp
Contributor Author

jcsp commented Apr 21, 2022

The last produce messages that each node sees are:

TRACE 2022-04-20 17:24:06,908 [shard 1] kafka - produce.cc:525 - handling produce request {transactional_id={nullopt} acks=-1 timeout_ms=5000 topics={{name={scale_000000} partitions={{partition_index=676 records={batch 131160 v2_format true valid_crc true}}}}}}
TRACE 2022-04-20 17:24:06,908 [shard 0] kafka - produce.cc:525 - handling produce request {transactional_id={nullopt} acks=-1 timeout_ms=5000 topics={{name={scale_000000} partitions={{partition_index=933 records={batch 131160 v2_format true valid_crc true}}}}}}
TRACE 2022-04-20 17:23:57,778 [shard 1] kafka - produce.cc:525 - handling produce request {transactional_id={nullopt} acks=-1 timeout_ms=5000 topics={{name={scale_000000} partitions={{partition_index=707 records={batch 131160 v2_format true valid_crc true}}, {partition_index=110 records={batch 131160 v2_format true valid_crc true}}, {partition_index=462 records={batch 131160 v2_format true valid_crc true}}}}}}

But the timeout is all the way forward at 17:27:04,555, nearly three minutes later. So this isn't a simple slowness issue.
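(For reference, the gap computed from the timestamps quoted above:)

from datetime import datetime

# Gap between the last logged produce request and the test timeout.
fmt = "%Y-%m-%d %H:%M:%S,%f"
last_produce = datetime.strptime("2022-04-20 17:24:06,908", fmt)
timeout_at = datetime.strptime("2022-04-20 17:27:04,555", fmt)
print(timeout_at - last_produce)  # 0:02:57.647000 -- nearly three minutes idle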

@jcsp jcsp changed the title Timeout in ManyPartitionsTest Timeout in ManyPartitionsTest (ManyPartitionsTest.test_many_partitions) May 3, 2022
@jcsp
Contributor Author

jcsp commented May 3, 2022

In the interim between the original failure and the latest failures of this test, the logging was reduced from trace to info, so there is less information available in the latest failures than in the original ones. In the instances I've looked at, it has always been the n=1023,n_topics=1 variant of the test that fails, which supports the possibility that this results from an overloaded test system.

I'm going to set up a looping run of this test on clustered ducktape with trace logs enabled, to get a better sense of whether this is a real issue or a ghost from an overloaded docker environment.
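Roughly along these lines; the ducktape invocation, test path, and parameter JSON below are assumptions for illustration, not the exact command, and trace logging would be enabled however the clustered environment expects:

import subprocess

# Rough soak-loop sketch for the failing n=1023,n_topics=1 variant.
CMD = [
    "ducktape", "tests/rptest/scale_tests/many_partitions_test.py",
    "--parameters", '{"n": 1023, "n_topics": 1}',
]

for i in range(100):
    print(f"--- iteration {i} ---")
    if subprocess.run(CMD).returncode != 0:
        print(f"failure on iteration {i}; keeping logs for triage")
        break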

@Lazin
Contributor

Lazin commented May 4, 2022

@dotnwat
Member

dotnwat commented May 4, 2022

New occurrence.

Exact same message as in #4522 but it looks like that was closed as a duplicate of this one.

@twmb
Contributor

twmb commented May 5, 2022

@dotnwat
Member

dotnwat commented May 5, 2022

Moving this out of the milestone. Appears to be your friendly, routine CI failure (that says hi a lot).

@jcsp
Contributor Author

jcsp commented May 6, 2022

We just need to get review on 👉 #4529

This issue was closed.