Failure in ConsumerOffsetsMigrationTest.test_cluster_is_available_during_upgrade_without_group_topic #4848

Closed
dimitriscruz opened this issue May 20, 2022 · 5 comments

Build: https://buildkite.com/redpanda/redpanda/builds/10332#6cce2894-07b2-4a29-a557-277cc40e8beb

FAIL test: ConsumerOffsetsMigrationTest.test_cluster_is_available_during_upgrade_without_group_topic (1/48 runs)
  failure at 2022-05-20T07:59:23.647Z: TimeoutError('Cluster membership did not stabilize')
      in job https://buildkite.com/redpanda/redpanda/builds/10332#6cce2894-07b2-4a29-a557-277cc40e8beb

Error:

test_id:    rptest.tests.consumer_offsets_migration_test.ConsumerOffsetsMigrationTest.test_cluster_is_available_during_upgrade_without_group_topic
status:     FAIL
run time:   1 minute 20.305 seconds

TimeoutError('Cluster membership did not stabilize')
Traceback (most recent call last):
  File "/usr/local/lib/python3.9/dist-packages/ducktape/tests/runner_client.py", line 135, in run
    data = self.run_test()
  File "/usr/local/lib/python3.9/dist-packages/ducktape/tests/runner_client.py", line 227, in run_test
    return self.test_context.function(self.test)
  File "/root/tests/rptest/services/cluster.py", line 35, in wrapped
    r = f(self, *args, **kwargs)
  File "/root/tests/rptest/tests/consumer_offsets_migration_test.py", line 160, in test_cluster_is_available_during_upgrade_without_group_topic
    self.redpanda.start()
  File "/root/tests/rptest/services/redpanda.py", line 614, in start
    wait_until(lambda: {n
  File "/usr/local/lib/python3.9/dist-packages/ducktape/utils/util.py", line 58, in wait_until
    raise TimeoutError(err_msg() if callable(err_msg) else err_msg) from last_exception
ducktape.errors.TimeoutError: Cluster membership did not stabilize
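
For context, the failing call is the membership-stabilization wait inside RedpandaService.start (redpanda.py line 614 above, truncated in the traceback). A minimal sketch of that pattern, assuming a caller-supplied brokers_reported_by callable and a 30-second default; both stand in for whatever the real service wrapper does and are not the actual redpanda.py code:

from ducktape.utils.util import wait_until


def wait_for_stable_membership(nodes, brokers_reported_by, timeout_sec=30):
    """Block until every node reports the same broker set, else raise TimeoutError.

    brokers_reported_by(node) is a hypothetical callable standing in for the
    admin API query the real RedpandaService makes on each node.
    """
    expected = {n.account.hostname for n in nodes}

    wait_until(
        lambda: all(brokers_reported_by(n) == expected for n in nodes),
        timeout_sec=timeout_sec,
        backoff_sec=1,
        err_msg="Cluster membership did not stabilize")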

andrwng commented Jul 5, 2022

Just chiming in here to say that I looked into this a while ago but haven't followed up yet. It looks like Redpanda is taking a while to start up with 5 brokers and 16 group topic partitions at a replication factor of 3, so the membership wait times out at 30 seconds.

Most other tests don't start with so many partitions, so this isn't really an issue elsewhere. I'm not sure there was ever a time when this test wasn't flaky; I'll do a bit more digging to determine whether we should start with fewer group topic partitions or just bump the timeout.
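
For illustration, the two options map onto knobs roughly like these in a ducktape test; this is a sketch, not the actual test code. group_topic_partitions is the Redpanda cluster property behind the 16 group topic partitions, while the start_timeout_sec keyword is purely hypothetical and only marks where a longer timeout would go:

from rptest.tests.redpanda_test import RedpandaTest


class ConsumerOffsetsMigrationTest(RedpandaTest):
    def __init__(self, test_context):
        super().__init__(
            test_context=test_context,
            num_brokers=5,
            # Option 1: create the group topic with fewer partitions so that
            # five brokers have less raft state to settle before membership
            # stabilizes.
            extra_rp_conf={"group_topic_partitions": 4})

    def test_cluster_is_available_during_upgrade_without_group_topic(self):
        # Option 2: allow more than the default 30 s for membership to
        # stabilize. This keyword argument is an assumption of the sketch,
        # not necessarily what RedpandaService.start accepts.
        self.redpanda.start(start_timeout_sec=60)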


andrwng commented Jul 28, 2022

This looks like it will actually be fixed by #5681. I noticed the truncation did take 1 minute to complete in at least one instance of this failure.


dotnwat commented Jul 28, 2022

This looks like it will actually be fixed by #5681. I noticed the truncation did take 1 minute to complete in at least one instance of this failure.

oh it's great you noticed that magic 1 minute timeout! wdyt @mmaslankaprv can we close this?


andrwng commented Jul 28, 2022

Yeah, almost exactly. From https://ci-artifacts.dev.vectorized.cloud/redpanda/6cce2894-07b2-4a29-a557-277cc40e8beb/vbuild/ducktape/results/2022-05-20--001/ConsumerOffsetsMigrationTest/test_cluster_is_available_during_upgrade_without_group_topic/72/RedpandaService-0-140304473353520/docker-rp-12/redpanda.log:

INFO  2022-05-20 07:23:43,755 [shard 0] raft - [group_id:0, {redpanda/controller/0}] consensus.cc:1635 - Truncating log in term: 1, Request previous log index: 22 is earlier than log end offset: 23. Truncating to: 23
DEBUG 2022-05-20 07:23:43,757 [shard 0] storage - readers_cache.cc:189 - {redpanda/controller/0} - evicting reader truncate 23
TRACE 2022-05-20 07:23:43,758 [shard 0] storage - readers_cache.cc:88 - {redpanda/controller/0} - trying to get reader for: {start_offset:{17}, max_offset:{22}, min_bytes:0, max_bytes:18446744073709551615, type_filter:nullopt, first_timestamp:nullopt}
TRACE 2022-05-20 07:23:43,758 [shard 0] storage - readers_cache.cc:117 - {redpanda/controller/0} - reader cache miss for: {start_offset:{17}, max_offset:{22}, min_bytes:0, max_bytes:18446744073709551615, type_filter:nullopt, first_timestamp:nullopt}
...
TRACE 2022-05-20 07:24:41,419 [shard 0] storage-gc - disk_log_impl.cc:589 - [{redpanda/controller/0}] skipped log deletion, internal topic
TRACE 2022-05-20 07:24:43,402 [shard 0] storage - readers_cache.cc:274 - {redpanda/controller/0} - removing reader: [0,23] lower_bound: 23
TRACE 2022-05-20 07:24:43,405 [shard 0] raft - [group_id:0, {redpanda/controller/0}] configuration_manager.cc:45 - Truncating configurations at 23
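
The roughly one-minute gap is just the spread between the first and last timestamps in the excerpt; checking that arithmetic:

from datetime import datetime

fmt = "%Y-%m-%d %H:%M:%S,%f"
# Timestamps copied from the log excerpt above.
truncation_started = datetime.strptime("2022-05-20 07:23:43,755", fmt)
truncation_finished = datetime.strptime("2022-05-20 07:24:43,405", fmt)

gap = truncation_finished - truncation_started
print(f"controller log truncation took {gap.total_seconds():.2f}s")  # 59.65s, just under a minute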

I can re-run the test a few dozen times to check that the fix works.


andrwng commented Jul 28, 2022

I ran this locally 50 times and didn't see any failures.

andrwng closed this as completed Jul 28, 2022