
Failure in MaintenanceTest.test_maintenance_sticky.use_rpk=True #4772

Closed

dimitriscruz opened this issue May 17, 2022 · 9 comments · Fixed by #5426

@dimitriscruz
Contributor

Build (v22.1.x): https://buildkite.com/redpanda/redpanda/builds/10184#2e1a4495-7ef5-4a77-9d1c-0d1002b71ed8

FAIL test: MaintenanceTest.test_maintenance_sticky.use_rpk=True (1/3 runs)
  failure at 2022-05-16T14:38:18.499Z: AssertionError('`rpk cluster maintenance status` has changed: [\'Request\', \'error,\', \'trying\', \'another\', \'node:\', \'request\', \'failed:\', \'Service\', \'Unavailable,\', \'body:\', \'"{\\\\"message\\\\":\', \'\\\\"Unable\', \'to\', \'get\', \'cluster\', \'health:\', \'Currently\', \'there\', \'is\', \'no\', \'leader\', \'controller\', \'elected\', \'in\', \'the\', \'cluster\\\\",\', \'\\\\"code\\\\":\', \'503}"\']')
      in job https://buildkite.com/redpanda/redpanda/builds/10184#2e1a4495-7ef5-4a77-9d1c-0d1002b71ed8

Error

test_id:    rptest.tests.maintenance_test.MaintenanceTest.test_maintenance_sticky.use_rpk=True
status:     FAIL
run time:   9 minutes 1.238 seconds

AssertionError('`rpk cluster maintenance status` has changed: [\'Request\', \'error,\', \'trying\', \'another\', \'node:\', \'request\', \'failed:\', \'Service\', \'Unavailable,\', \'body:\', \'"{\\\\"message\\\\":\', \'\\\\"Unable\', \'to\', \'get\', \'cluster\', \'health:\', \'Currently\', \'there\', \'is\', \'no\', \'leader\', \'controller\', \'elected\', \'in\', \'the\', \'cluster\\\\",\', \'\\\\"code\\\\":\', \'503}"\']')

Traceback (most recent call last):
  File "/usr/local/lib/python3.9/dist-packages/ducktape/tests/runner_client.py", line 135, in run
    data = self.run_test()
  File "/usr/local/lib/python3.9/dist-packages/ducktape/tests/runner_client.py", line 227, in run_test
    return self.test_context.function(self.test)
  File "/usr/local/lib/python3.9/dist-packages/ducktape/mark/_mark.py", line 476, in wrapper
    return functools.partial(f, *args, **kwargs)(*w_args, **w_kwargs)
  File "/root/tests/rptest/services/cluster.py", line 35, in wrapped
    r = f(self, *args, **kwargs)
  File "/root/tests/rptest/tests/maintenance_test.py", line 225, in test_maintenance_sticky
    self._verify_cluster(None, False)
  File "/root/tests/rptest/tests/maintenance_test.py", line 175, in _verify_cluster
    wait_until(
  File "/usr/local/lib/python3.9/dist-packages/ducktape/utils/util.py", line 53, in wait_until
    raise e
  File "/usr/local/lib/python3.9/dist-packages/ducktape/utils/util.py", line 44, in wait_until
    if condition():
  File "/root/tests/rptest/tests/maintenance_test.py", line 176, in <lambda>
    lambda: self._verify_maintenance_status(node, expect),
  File "/root/tests/rptest/tests/maintenance_test.py", line 87, in _verify_maintenance_status
    statuses = self.rpk.cluster_maintenance_status()
  File "/root/tests/rptest/clients/rpk.py", line 553, in cluster_maintenance_status
    return list(filter(None, map(parse, output.splitlines())))
  File "/root/tests/rptest/clients/rpk.py", line 528, in parse
    assert len(
AssertionError: `rpk cluster maintenance status` has changed: ['Request', 'error,', 'trying', 'another', 'node:', 'request', 'failed:', 'Service', 'Unavailable,', 'body:', '"{\\"message\\":', '\\"Unable', 'to', 'get', 'cluster', 'health:', 'Currently', 'there', 'is', 'no', 'leader', 'controller', 'elected', 'in', 'the', 'cluster\\",', '\\"code\\":', '503}"']
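
For context on where the assert comes from: the rpk wrapper in rptest/clients/rpk.py parses the tabular output of `rpk cluster maintenance status` by whitespace-splitting each line and asserting on the column count, so when rpk prints an error string instead of the table (here a 503 because no controller leader was elected), the split produces the wrong number of tokens and the assert fires. Below is a minimal sketch of that parsing pattern; the column names and count are illustrative assumptions, not the exact ones in rpk.py.

```python
# Sketch of the parsing pattern behind RpkTool.cluster_maintenance_status.
# Column names/count here are illustrative; the real parser lives in
# rptest/clients/rpk.py (the assert around line 528 in the traceback).
from collections import namedtuple

MaintenanceStatus = namedtuple(
    "MaintenanceStatus",
    ["node_id", "draining", "finished", "errors", "partitions", "eligible",
     "transferring", "failed"])


def parse_maintenance_status(output: str):
    def parse(line: str):
        if not line.strip() or line.startswith("NODE-ID"):
            return None  # skip blank lines and the header row
        parts = line.split()
        # This is the check that fired: an rpk error message ("Request
        # error, trying another node: ...") splits into a different number
        # of tokens than the status table, so the column-count assert fails.
        assert len(parts) == len(MaintenanceStatus._fields), \
            f"`rpk cluster maintenance status` has changed: {parts}"
        return MaintenanceStatus(*parts)

    return list(filter(None, map(parse, output.splitlines())))
```
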
@dimitriscruz added the kind/bug and ci-failure labels on May 17, 2022
@dotnwat self-assigned this on May 17, 2022
@jcsp
Contributor

jcsp commented Jul 8, 2022

This one is a TimeoutError, but it's in the part of the test that waits for leadership, so it's conceivably the same issue:
https://buildkite.com/redpanda/redpanda/builds/12286#0181dc4d-53e3-40ee-b939-2a2e58085864

@jcsp
Contributor

jcsp commented Jul 8, 2022

Hmm, this just failed twice on dev right after I merged #5159, so maybe that wasn't a coincidence.

https://buildkite.com/redpanda/redpanda/builds/12297#0181dd0a-fbe1-4f73-915e-5e5e8dd3ad0c

That said, that particular PR shouldn't have made the leader balancer any more aggressive; it was about throttling it.

@jcsp
Contributor

jcsp commented Jul 8, 2022

Looking at a failure log, it seems like the leader balancer is trying to move leaderships to a node in maintenance mode. That fails, those groups get muted, and then when the test expects them to be migrated after maintenance mode is over, that doesn't happen because we're still in the mute period.

It could be that the test used to work because leader balance movements weren't fast enough to all run through and trigger mutes right away.
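
To make the suspected interaction concrete, here is a toy model only (the names, timings, and bookkeeping are illustrative, not Redpanda's actual C++ balancer logic): a failed transfer to a node in maintenance mode mutes the group for a fixed window, so even after maintenance ends, those groups are skipped until the mute expires and the test's wait_until times out first.

```python
# Toy model of the suspected interaction; MUTE_TIMEOUT_S and the mute
# bookkeeping are illustrative assumptions, not Redpanda's implementation.
import time

MUTE_TIMEOUT_S = 300  # hypothetical mute window after a failed transfer

muted_until = {}  # group_id -> wall-clock time when the mute expires


def try_transfer(group_id, target_node, nodes_in_maintenance):
    now = time.time()
    if muted_until.get(group_id, 0) > now:
        return False  # still muted: the balancer skips this group entirely
    if target_node in nodes_in_maintenance:
        # A transfer to a node in maintenance mode fails, and the failure
        # mutes the group for the whole mute window.
        muted_until[group_id] = now + MUTE_TIMEOUT_S
        return False
    return True

# The test ends maintenance mode and immediately expects leadership to
# move back, but groups muted during maintenance stay untouched until
# their mute window expires, so the wait_until() in _verify_cluster
# times out first.
```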

@ballard26 self-assigned this on Jul 8, 2022
jcsp added a commit to jcsp/redpanda that referenced this issue Jul 8, 2022
The real fix will be to make the leader balancer aware
of maintenance mode, but the test has become much more
unstable since recent leader balancer changes to do
more movements concurrently, so for the moment just
run the test with the leader balancer disabled.

Related: redpanda-data#4772
jcsp added a commit to jcsp/redpanda that referenced this issue Jul 8, 2022
The real fix will be to make the leader balancer aware
of maintenance mode, but the test has become much more
unstable since recent leader balancer changes to do
more movements concurrently, so it's worth mitigating
that.

The workaround is to set a short mute timeout so that
muting nodes has no real effect, and a short idle timeout
so that post-maintenance leader movements happen promptly.

Related: redpanda-data#4772
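
If the mitigation is applied on the test side, it would presumably look something like overriding the balancer timeouts when the test cluster is built. The sketch below is a guess based on the commit message, not a copy of the actual change in #5426: the cluster property names (leader_balancer_mute_timeout, leader_balancer_idle_timeout), their units, and the values are all assumptions.

```python
# Sketch only: property names, units (assumed to be milliseconds), and
# values are assumptions based on the commit message above, not the fix.
from rptest.tests.redpanda_test import RedpandaTest  # base class used elsewhere in rptest


class MaintenanceTest(RedpandaTest):
    def __init__(self, ctx, *args, **kwargs):
        super().__init__(
            ctx,
            *args,
            extra_rp_conf={
                # Short mute timeout so that muting a group after a failed
                # transfer has no lasting effect on the test.
                'leader_balancer_mute_timeout': 10000,
                # Short idle timeout so post-maintenance leader movements
                # happen promptly rather than waiting for the next long tick.
                'leader_balancer_idle_timeout': 10000,
            },
            **kwargs)
```
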
BenPope pushed a commit to BenPope/redpanda that referenced this issue Jul 13, 2022
The real fix will be to make the leader balancer aware
of maintenance mode, but the test has become much more
unstable since recent leader balancer changes to do
more movements concurrently, so it's worth mitigating
that.

The workaround is to set a short mute timeout so that
muting nodes has no real effect, and a short idle timeout
so that post-maintenance leader movements happen promptly.

Related: redpanda-data#4772
@LenaAn
Contributor

LenaAn commented Jul 19, 2022

@twmb
Contributor

twmb commented Jul 20, 2022

@twmb
Contributor

twmb commented Jul 20, 2022

@twmb
Contributor

twmb commented Jul 20, 2022

@LenaAn
Contributor

LenaAn commented Jul 21, 2022

@rystsov
Contributor

rystsov commented Jul 22, 2022
