Failure in ControllerUpgradeTest test_updating_cluster_when_executing_operations #5378

Closed

jcsp opened this issue Jul 7, 2022 · 7 comments · Fixed by #5282, #5734, #5742 or #6218

@jcsp
Contributor

jcsp commented Jul 7, 2022

The AdminOperationsFuzzer retries operations 5 times before giving up. In this instance it is retrying a configuration update 5 times: the fuzzer starts trying just as the first node in the cluster is restarted, and retries every 1 second for 5 seconds. During this period the updates continue to fail, which is a little unexpected (one would expect a fresh controller leader election to be quicker than this), but not inconceivable given the slow/noisy runtime environment.

I think this is a case where the AdminOperationsFuzzer needs a higher number of retries (or a longer delay between them): in this environment it is not really a failure for the cluster to go 5 seconds without being able to service a controller write.

https://buildkite.com/redpanda/redpanda/builds/12215#0181d6c3-6183-4b8d-8897-60c2ad205a02
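
A minimal sketch of the retry shape described above (names, signature, and defaults are illustrative assumptions, not the actual AdminOperationsFuzzer code):

import time

def execute_with_retries(op, attempts=5, delay_s=1.0):
    # Retry `op` up to `attempts` times, sleeping `delay_s` between tries.
    # These names and defaults are assumptions for illustration only.
    last_err = None
    for _ in range(attempts):
        try:
            return op()
        except Exception as e:  # e.g. no controller leader elected yet
            last_err = e
            time.sleep(delay_s)
    raise last_err

Raising attempts or delay_s would give the cluster more than ~5 seconds to re-elect a controller leader during a rolling restart, which is the change suggested above.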

@LenaAn
Contributor

LenaAn commented Jul 28, 2022

This happened just now; my branch is rebased onto fresh dev: https://buildkite.com/redpanda/redpanda/builds/13246#0182447c-878b-49aa-968d-7643cdab143b

@andrwng
Contributor

andrwng commented Jul 28, 2022

Previously, the test failed because some operations did not complete within the timeout. It looks like we're hitting a new error now:

[INFO  - 2022-07-28 11:54:16,194 - runner_client - log - lineno:278]: RunnerClient: rptest.tests.controller_upgrade_test.ControllerUpgradeTest.test_updating_cluster_when_executing_operations: FAIL: <BadLogLines nodes=docker-rp-2(2) example="ERROR 2022-07-28 11:51:58,825 [shard 1] r/heartbeat - heartbeat_manager.cc:284 - cannot find consensus group:69">
Traceback (most recent call last):                                                                   
  File "/usr/local/lib/python3.10/dist-packages/ducktape/tests/runner_client.py", line 135, in run   
    data = self.run_test()                                                                           
  File "/usr/local/lib/python3.10/dist-packages/ducktape/tests/runner_client.py", line 227, in run_test
    return self.test_context.function(self.test)                                                     
  File "/root/tests/rptest/services/cluster.py", line 48, in wrapped                                 
    self.redpanda.raise_on_bad_logs(allow_list=log_allow_list)                                       
  File "/root/tests/rptest/services/redpanda.py", line 1121, in raise_on_bad_logs                    
    raise BadLogLines(bad_lines)                                                                     
rptest.services.utils.BadLogLines: <BadLogLines nodes=docker-rp-2(2) example="ERROR 2022-07-28 11:51:58,825 [shard 1] r/heartbeat - heartbeat_manager.cc:284 - cannot find consensus group:69">

I'll dig in.

@VadimPlh
Contributor

Again:
https://buildkite.com/redpanda/redpanda/builds/13296#01824894-8ea1-4d18-b783-dd53fa11a4c0

The error is:

<BadLogLines nodes=docker-rp-13(1) example="ERROR 2022-07-29 06:55:36,935 [shard 0] rpc - Service handler threw an exception: seastar::condition_variable_timed_out (Condition variable timed out)">

@abhijat
Contributor

abhijat commented Aug 2, 2022

Seen on #5757

https://buildkite.com/redpanda/redpanda/builds/13415#018259ea-dff0-4d4e-8c72-89c30874ce5f

test_id: rptest.tests.controller_upgrade_test.ControllerUpgradeTest.test_updating_cluster_when_executing_operations
status: FAIL
run time: 5 minutes 0.884 seconds

<BadLogLines nodes=docker-rp-19(1) example="ERROR 2022-08-01 15:46:09,378 [shard 1] r/heartbeat - heartbeat_manager.cc:284 - cannot find consensus group:19">

Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/ducktape/tests/runner_client.py", line 135, in run
    data = self.run_test()
  File "/usr/local/lib/python3.10/dist-packages/ducktape/tests/runner_client.py", line 227, in run_test
    return self.test_context.function(self.test)
  File "/root/tests/rptest/services/cluster.py", line 48, in wrapped
    self.redpanda.raise_on_bad_logs(allow_list=log_allow_list)
  File "/root/tests/rptest/services/redpanda.py", line 1121, in raise_on_bad_logs
    raise BadLogLines(bad_lines)
rptest.services.utils.BadLogLines:

#5742 should fix this, as it downgrades the error to debug level.

@r-vasquez
Contributor

Seen in #5779

test_id:    rptest.tests.controller_upgrade_test.ControllerUpgradeTest.test_updating_cluster_when_executing_operations
status:     FAIL
run time:   4 minutes 38.694 seconds

<BadLogLines nodes=docker-rp-8(1) example="ERROR 2022-08-02 22:25:37,980 [shard 1] r/heartbeat - heartbeat_manager.cc:285 - cannot find consensus group:29">
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/ducktape/tests/runner_client.py", line 135, in run
    data = self.run_test()
  File "/usr/local/lib/python3.10/dist-packages/ducktape/tests/runner_client.py", line 227, in run_test
    return self.test_context.function(self.test)
  File "/root/tests/rptest/services/cluster.py", line 48, in wrapped
    self.redpanda.raise_on_bad_logs(allow_list=log_allow_list)
  File "/root/tests/rptest/services/redpanda.py", line 1121, in raise_on_bad_logs
    raise BadLogLines(bad_lines)
rptest.services.utils.BadLogLines: <BadLogLines nodes=docker-rp-8(1) example="ERROR 2022-08-02 22:25:37,980 [shard 1] r/heartbeat - heartbeat_manager.cc:285 - cannot find consensus group:29">


@jcsp
Contributor Author

jcsp commented Aug 4, 2022

This is still open until #5742 merges.

@andrwng
Contributor

andrwng commented Aug 24, 2022

Saw this again in the vtools build:
https://ci-artifacts.dev.vectorized.cloud/vtools/0182d0d9-4e23-4c19-a141-97d516db894d/vbuild/ducktape/results/2022-08-24--001/report.html

Will put up a patch to allow-list the log line, since this is an upgrade test.
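
For illustration, a sketch of the kind of allow-listing such a patch would do; the regex and helper here are assumptions, with the real change presumably feeding a pattern like this into the test's log_allow_list that raise_on_bad_logs checks in the tracebacks above:

import re

# Hypothetical allow-list entry for the heartbeat error seen during the
# rolling upgrade; the pattern used by the actual patch may differ.
RESTART_LOG_ALLOW_LIST = [
    re.compile(r"r/heartbeat - .* cannot find consensus group"),
]

def is_allowed(log_line: str) -> bool:
    # Conceptually what the bad-log check does with an allow list: a line
    # matching any allowed pattern is not treated as a test failure.
    return any(p.search(log_line) for p in RESTART_LOG_ALLOW_LIST)

line = ('ERROR 2022-07-28 11:51:58,825 [shard 1] r/heartbeat - '
        'heartbeat_manager.cc:284 - cannot find consensus group:69')
assert is_allowed(line)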
