Failure in ConsumerOffsetsMigrationTest.test_cluster_is_available_during_upgrade_without_group_topic #4848

Closed
dimitriscruz opened this issue May 20, 2022 · 5 comments

Build: https://buildkite.com/redpanda/redpanda/builds/10332#6cce2894-07b2-4a29-a557-277cc40e8beb

FAIL test: ConsumerOffsetsMigrationTest.test_cluster_is_available_during_upgrade_without_group_topic (1/48 runs)
  failure at 2022-05-20T07:59:23.647Z: TimeoutError('Cluster membership did not stabilize')
      in job https://buildkite.com/redpanda/redpanda/builds/10332#6cce2894-07b2-4a29-a557-277cc40e8beb

Error:

test_id:    rptest.tests.consumer_offsets_migration_test.ConsumerOffsetsMigrationTest.test_cluster_is_available_during_upgrade_without_group_topic
status:     FAIL
run time:   1 minute 20.305 seconds

TimeoutError('Cluster membership did not stabilize')
Traceback (most recent call last):
  File "/usr/local/lib/python3.9/dist-packages/ducktape/tests/runner_client.py", line 135, in run
    data = self.run_test()
  File "/usr/local/lib/python3.9/dist-packages/ducktape/tests/runner_client.py", line 227, in run_test
    return self.test_context.function(self.test)
  File "/root/tests/rptest/services/cluster.py", line 35, in wrapped
    r = f(self, *args, **kwargs)
  File "/root/tests/rptest/tests/consumer_offsets_migration_test.py", line 160, in test_cluster_is_available_during_upgrade_without_group_topic
    self.redpanda.start()
  File "/root/tests/rptest/services/redpanda.py", line 614, in start
    wait_until(lambda: {n
  File "/usr/local/lib/python3.9/dist-packages/ducktape/utils/util.py", line 58, in wait_until
    raise TimeoutError(err_msg() if callable(err_msg) else err_msg) from last_exception
ducktape.errors.TimeoutError: Cluster membership did not stabilize
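
For context, the failing call is the membership-stabilization wait inside RedpandaService.start (redpanda.py line 614 above, truncated in the traceback). A minimal sketch of that pattern, assuming a caller-supplied brokers_reported_by callable and a 30-second default; both stand in for whatever the real service wrapper does and are not the actual redpanda.py code:

from ducktape.utils.util import wait_until


def wait_for_stable_membership(nodes, brokers_reported_by, timeout_sec=30):
    """Block until every node reports the same broker set, else raise TimeoutError.

    brokers_reported_by(node) is a hypothetical callable standing in for the
    admin API query the real RedpandaService makes on each node.
    """
    expected = {n.account.hostname for n in nodes}

    wait_until(
        lambda: all(brokers_reported_by(n) == expected for n in nodes),
        timeout_sec=timeout_sec,
        backoff_sec=1,
        err_msg="Cluster membership did not stabilize")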

andrwng commented Jul 5, 2022

Just chiming in here to say that I looked into this a while ago but haven't followed up yet. It looks like Redpanda is taking a while to start up with 5 brokers and 16 group topic partitions at a replication factor of 3, so the membership wait times out at 30 seconds.

Most other tests don't start with so many partitions, so this isn't really an issue elsewhere. I'm not sure there was ever a time when this test wasn't flaky; I'll do a bit more digging to determine whether we should start with fewer group topic partitions or just bump the timeout.
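
For illustration, the two options map onto knobs roughly like these in a ducktape test; this is a sketch, not the actual test code. group_topic_partitions is the Redpanda cluster property behind the 16 group topic partitions, while the start_timeout_sec keyword is purely hypothetical and only marks where a longer timeout would go:

from rptest.tests.redpanda_test import RedpandaTest


class ConsumerOffsetsMigrationTest(RedpandaTest):
    def __init__(self, test_context):
        super().__init__(
            test_context=test_context,
            num_brokers=5,
            # Option 1: create the group topic with fewer partitions so that
            # five brokers have less raft state to settle before membership
            # stabilizes.
            extra_rp_conf={"group_topic_partitions": 4})

    def test_cluster_is_available_during_upgrade_without_group_topic(self):
        # Option 2: allow more than the default 30 s for membership to
        # stabilize. This keyword argument is an assumption of the sketch,
        # not necessarily what RedpandaService.start accepts.
        self.redpanda.start(start_timeout_sec=60)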


andrwng commented Jul 28, 2022

This looks like it will actually be fixed by #5681. I noticed the truncation did take 1 minute to complete in at least one instance of this failure.


dotnwat commented Jul 28, 2022

This looks like it will actually be fixed by #5681. I noticed the truncation did take 1 minute to complete in at least one instance of this failure.

oh it's great you noticed that magic 1 minute timeout! wdyt @mmaslankaprv can we close this?


andrwng commented Jul 28, 2022

Yeah, almost exactly. From https://ci-artifacts.dev.vectorized.cloud/redpanda/6cce2894-07b2-4a29-a557-277cc40e8beb/vbuild/ducktape/results/2022-05-20--001/ConsumerOffsetsMigrationTest/test_cluster_is_available_during_upgrade_without_group_topic/72/RedpandaService-0-140304473353520/docker-rp-12/redpanda.log:

INFO  2022-05-20 07:23:43,755 [shard 0] raft - [group_id:0, {redpanda/controller/0}] consensus.cc:1635 - Truncating log in term: 1, Request previous log index: 22 is earlier than log end offset: 23. Truncating to: 23
DEBUG 2022-05-20 07:23:43,757 [shard 0] storage - readers_cache.cc:189 - {redpanda/controller/0} - evicting reader truncate 23
TRACE 2022-05-20 07:23:43,758 [shard 0] storage - readers_cache.cc:88 - {redpanda/controller/0} - trying to get reader for: {start_offset:{17}, max_offset:{22}, min_bytes:0, max_bytes:18446744073709551615, type_filter:nullopt, first_timestamp:nullopt}
TRACE 2022-05-20 07:23:43,758 [shard 0] storage - readers_cache.cc:117 - {redpanda/controller/0} - reader cache miss for: {start_offset:{17}, max_offset:{22}, min_bytes:0, max_bytes:18446744073709551615, type_filter:nullopt, first_timestamp:nullopt}
...
TRACE 2022-05-20 07:24:41,419 [shard 0] storage-gc - disk_log_impl.cc:589 - [{redpanda/controller/0}] skipped log deletion, internal topic
TRACE 2022-05-20 07:24:43,402 [shard 0] storage - readers_cache.cc:274 - {redpanda/controller/0} - removing reader: [0,23] lower_bound: 23
TRACE 2022-05-20 07:24:43,405 [shard 0] raft - [group_id:0, {redpanda/controller/0}] configuration_manager.cc:45 - Truncating configurations at 23
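
The roughly one-minute gap is just the spread between the first and last timestamps in the excerpt; checking that arithmetic:

from datetime import datetime

fmt = "%Y-%m-%d %H:%M:%S,%f"
# Timestamps copied from the log excerpt above.
truncation_started = datetime.strptime("2022-05-20 07:23:43,755", fmt)
truncation_finished = datetime.strptime("2022-05-20 07:24:43,405", fmt)

gap = truncation_finished - truncation_started
print(f"controller log truncation took {gap.total_seconds():.2f}s")  # 59.65s, just under a minute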

I can re-run the test a few dozen times to check that the fix works.


andrwng commented Jul 28, 2022

I ran this locally 50 times and didn't see any failures.

andrwng closed this as completed Jul 28, 2022