Failure in ControllerUpgradeTest test_updating_cluster_when_executing_operations #5378

Closed

jcsp opened this issue Jul 7, 2022 · 7 comments · Fixed by #5282, #5734, #5742 or #6218

@jcsp
Contributor

jcsp commented Jul 7, 2022

The AdminOperationsFuzzer retries operations 5 times before giving up. In this instance it is retrying a configuration update 5 times: the fuzzer starts trying just as the first node in the cluster is restarted, and retries every 1 second for 5 seconds. During this period the updates continue to fail, which is a little unexpected (one would expect a fresh controller leader election to be quicker than this), but not inconceivable given the slow/noisy runtime environment.

I think this is a case where the AdminOperationsFuzzer needs a higher number of retries (or a longer delay between them): in this environment it is not really a failure for the cluster to go 5 seconds without being able to service a controller write.

https://buildkite.com/redpanda/redpanda/builds/12215#0181d6c3-6183-4b8d-8897-60c2ad205a02
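
A minimal sketch of the retry shape described above (names, signature, and defaults are illustrative assumptions, not the actual AdminOperationsFuzzer code):

import time

def execute_with_retries(op, attempts=5, delay_s=1.0):
    # Retry `op` up to `attempts` times, sleeping `delay_s` between tries.
    # These names and defaults are assumptions for illustration only.
    last_err = None
    for _ in range(attempts):
        try:
            return op()
        except Exception as e:  # e.g. no controller leader elected yet
            last_err = e
            time.sleep(delay_s)
    raise last_err

Raising attempts or delay_s would give the cluster more than ~5 seconds to re-elect a controller leader during a rolling restart, which is the change suggested above.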

@LenaAn
Contributor

LenaAn commented Jul 28, 2022

This happened just now; my branch is rebased onto fresh dev: https://buildkite.com/redpanda/redpanda/builds/13246#0182447c-878b-49aa-968d-7643cdab143b

@andrwng
Contributor

andrwng commented Jul 28, 2022

Previously, the test failed because some operations did not complete within the timeout. It looks like we're hitting a new error now:

[INFO  - 2022-07-28 11:54:16,194 - runner_client - log - lineno:278]: RunnerClient: rptest.tests.controller_upgrade_test.ControllerUpgradeTest.test_updating_cluster_when_executing_operations: FAIL: <BadLogLines nodes=docker-rp-2(2) example="ERROR 2022-07-28 11:51:58,825 [shard 1] r/heartbeat - heartbeat_manager.cc:284 - cannot find consensus group:69">
Traceback (most recent call last):                                                                   
  File "/usr/local/lib/python3.10/dist-packages/ducktape/tests/runner_client.py", line 135, in run   
    data = self.run_test()                                                                           
  File "/usr/local/lib/python3.10/dist-packages/ducktape/tests/runner_client.py", line 227, in run_test
    return self.test_context.function(self.test)                                                     
  File "/root/tests/rptest/services/cluster.py", line 48, in wrapped                                 
    self.redpanda.raise_on_bad_logs(allow_list=log_allow_list)                                       
  File "/root/tests/rptest/services/redpanda.py", line 1121, in raise_on_bad_logs                    
    raise BadLogLines(bad_lines)                                                                     
rptest.services.utils.BadLogLines: <BadLogLines nodes=docker-rp-2(2) example="ERROR 2022-07-28 11:51:58,825 [shard 1] r/heartbeat - heartbeat_manager.cc:284 - cannot find consensus group:69">

I'll dig in.

@VadimPlh
Contributor

Again:
https://buildkite.com/redpanda/redpanda/builds/13296#01824894-8ea1-4d18-b783-dd53fa11a4c0

The error is:

<BadLogLines nodes=docker-rp-13(1) example="ERROR 2022-07-29 06:55:36,935 [shard 0] rpc - Service handler threw an exception: seastar::condition_variable_timed_out (Condition variable timed out)">

@abhijat
Contributor

abhijat commented Aug 2, 2022

Seen on #5757

https://buildkite.com/redpanda/redpanda/builds/13415#018259ea-dff0-4d4e-8c72-89c30874ce5f

test_id: rptest.tests.controller_upgrade_test.ControllerUpgradeTest.test_updating_cluster_when_executing_operations
status: FAIL
run time: 5 minutes 0.884 seconds

<BadLogLines nodes=docker-rp-19(1) example="ERROR 2022-08-01 15:46:09,378 [shard 1] r/heartbeat - heartbeat_manager.cc:284 - cannot find consensus group:19">

Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/ducktape/tests/runner_client.py", line 135, in run
    data = self.run_test()
  File "/usr/local/lib/python3.10/dist-packages/ducktape/tests/runner_client.py", line 227, in run_test
    return self.test_context.function(self.test)
  File "/root/tests/rptest/services/cluster.py", line 48, in wrapped
    self.redpanda.raise_on_bad_logs(allow_list=log_allow_list)
  File "/root/tests/rptest/services/redpanda.py", line 1121, in raise_on_bad_logs
    raise BadLogLines(bad_lines)
rptest.services.utils.BadLogLines:

#5742 should fix this, as it downgrades the error to debug level.

@r-vasquez
Contributor

Seen in #5779

test_id:    rptest.tests.controller_upgrade_test.ControllerUpgradeTest.test_updating_cluster_when_executing_operations
status:     FAIL
run time:   4 minutes 38.694 seconds

<BadLogLines nodes=docker-rp-8(1) example="ERROR 2022-08-02 22:25:37,980 [shard 1] r/heartbeat - heartbeat_manager.cc:285 - cannot find consensus group:29">
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/ducktape/tests/runner_client.py", line 135, in run
    data = self.run_test()
  File "/usr/local/lib/python3.10/dist-packages/ducktape/tests/runner_client.py", line 227, in run_test
    return self.test_context.function(self.test)
  File "/root/tests/rptest/services/cluster.py", line 48, in wrapped
    self.redpanda.raise_on_bad_logs(allow_list=log_allow_list)
  File "/root/tests/rptest/services/redpanda.py", line 1121, in raise_on_bad_logs
    raise BadLogLines(bad_lines)
rptest.services.utils.BadLogLines: <BadLogLines nodes=docker-rp-8(1) example="ERROR 2022-08-02 22:25:37,980 [shard 1] r/heartbeat - heartbeat_manager.cc:285 - cannot find consensus group:29">


@jcsp
Contributor Author

jcsp commented Aug 4, 2022

This is still open until #5742 merges.

@andrwng
Contributor

andrwng commented Aug 24, 2022

Saw this again in the vtools build:
https://ci-artifacts.dev.vectorized.cloud/vtools/0182d0d9-4e23-4c19-a141-97d516db894d/vbuild/ducktape/results/2022-08-24--001/report.html

Will put up a patch to allow-list the log line, since this is an upgrade test.
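
For illustration, a sketch of the kind of allow-listing such a patch would do; the regex and helper here are assumptions, with the real change presumably feeding a pattern like this into the test's log_allow_list that raise_on_bad_logs checks in the tracebacks above:

import re

# Hypothetical allow-list entry for the heartbeat error seen during the
# rolling upgrade; the pattern used by the actual patch may differ.
RESTART_LOG_ALLOW_LIST = [
    re.compile(r"r/heartbeat - .* cannot find consensus group"),
]

def is_allowed(log_line: str) -> bool:
    # Conceptually what the bad-log check does with an allow list: a line
    # matching any allowed pattern is not treated as a test failure.
    return any(p.search(log_line) for p in RESTART_LOG_ALLOW_LIST)

line = ('ERROR 2022-07-28 11:51:58,825 [shard 1] r/heartbeat - '
        'heartbeat_manager.cc:284 - cannot find consensus group:69')
assert is_allowed(line)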
