test: test_maintenance_sticky: unable find controller leader #4566

Closed
dotnwat opened this issue May 4, 2022 · 2 comments · Fixed by #4625

@dotnwat
Member

dotnwat commented May 4, 2022

* https://buildkite.com/redpanda/redpanda/builds/9727#e03f53ba-c231-44b4-a2e9-6b62faaedbfe
* #4517

Appears to be related to #3615: on ARM we are encountering more and more delays, and here the delay results in a controller leadership election that doesn't complete within the period during which rpk retries on 503.

test_id:    rptest.tests.maintenance_test.MaintenanceTest.test_maintenance_sticky.use_rpk=True
--
  | status:     FAIL
  | run time:   8 minutes 34.030 seconds
  |  
  |  
  | RpkException('command /var/lib/buildkite-agent/builds/buildkite-arm64-builders-i-05a93799c220159ae-1/redpanda/redpanda/vbuild/release/clang/dist/local/redpanda/bin/rpk --api-urls docker-rp-4:9644,docker-rp-31:9644,docker-rp-35:9644 cluster maintenance status returned 1, output: Request error, trying another node: request   failed: Service Unavailable, body: "{\\"message\\": \\"Unable to get cluster health: Currently there is no leader controller elected in the cluster\\", \\"code\\": 503}"\nRequest error, trying another node: request   failed: Service Unavailable, body: "{\\"message\\": \\"Unable to get cluster health: Currently there is no leader controller elected in the cluster\\", \\"code\\": 503}"\n', 'unable to request brokers: request   failed: Service Unavailable, body: "{\\"message\\": \\"Unable to get cluster health: Currently there is no leader controller elected in the cluster\\", \\"code\\": 503}"\n')
  | Traceback (most recent call last):
  | File "/usr/local/lib/python3.9/dist-packages/ducktape/tests/runner_client.py", line 135, in run
  | data = self.run_test()
  | File "/usr/local/lib/python3.9/dist-packages/ducktape/tests/runner_client.py", line 227, in run_test
  | return self.test_context.function(self.test)
  | File "/usr/local/lib/python3.9/dist-packages/ducktape/mark/_mark.py", line 476, in wrapper
  | return functools.partial(f, *args, **kwargs)(*w_args, **w_kwargs)
  | File "/root/tests/rptest/services/cluster.py", line 35, in wrapped
  | r = f(self, *args, **kwargs)
  | File "/root/tests/rptest/tests/maintenance_test.py", line 225, in test_maintenance_sticky
  | self._verify_cluster(None, False)
  | File "/root/tests/rptest/tests/maintenance_test.py", line 175, in _verify_cluster
  | wait_until(
  | File "/usr/local/lib/python3.9/dist-packages/ducktape/utils/util.py", line 53, in wait_until
  | raise e
  | File "/usr/local/lib/python3.9/dist-packages/ducktape/utils/util.py", line 44, in wait_until
  | if condition():
  | File "/root/tests/rptest/tests/maintenance_test.py", line 176, in <lambda>
  | lambda: self._verify_maintenance_status(node, expect),
  | File "/root/tests/rptest/tests/maintenance_test.py", line 87, in _verify_maintenance_status
  | statuses = self.rpk.cluster_maintenance_status()
  | File "/root/tests/rptest/clients/rpk.py", line 552, in cluster_maintenance_status
  | output = self._execute(cmd)
  | File "/root/tests/rptest/clients/rpk.py", line 488, in _execute
  | raise RpkException(
  | rptest.clients.rpk.RpkException: RpkException<command /var/lib/buildkite-agent/builds/buildkite-arm64-builders-i-05a93799c220159ae-1/redpanda/redpanda/vbuild/release/clang/dist/local/redpanda/bin/rpk --api-urls docker-rp-4:9644,docker-rp-31:9644,docker-rp-35:9644 cluster maintenance status returned 1, output: Request error, trying another node: request   failed: Service Unavailable, body: "{\"message\": \"Unable to get cluster health: Currently there is no leader controller elected in the cluster\", \"code\": 503}"
  | Request error, trying another node: request   failed: Service Unavailable, body: "{\"message\": \"Unable to get cluster health: Currently there is no leader controller elected in the cluster\", \"code\": 503}"
  | error: unable to request brokers: request   failed: Service Unavailable, body: "{\"message\": \"Unable to get cluster health: Currently there is no leader controller elected in the cluster\", \"code\": 503}"
  | >
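The traceback shows the `RpkException` escaping from the condition passed to `wait_until`, so a single run of 503s fails the test even though a later poll might succeed. One possible mitigation, shown here only as a minimal sketch and not as the change made in #4625, is a polling helper that treats selected exceptions as "not ready yet". The names `wait_until_tolerant` and `TransientError` are hypothetical and are not part of ducktape or rptest.

```python
import time


class TransientError(Exception):
    """Hypothetical stand-in for an error such as an RpkException caused by a 503."""


def wait_until_tolerant(condition, timeout_sec, backoff_sec=0.5,
                        is_transient=lambda e: isinstance(e, TransientError)):
    """Poll `condition` until it returns True or `timeout_sec` elapses.

    Exceptions accepted by `is_transient` count as "not ready yet";
    any other exception propagates immediately.
    """
    deadline = time.monotonic() + timeout_sec
    last_error = None
    while True:
        try:
            if condition():
                return
            last_error = None
        except Exception as e:
            if not is_transient(e):
                raise
            last_error = e
        if time.monotonic() >= deadline:
            # Chain the last transient error (if any) for easier debugging.
            raise TimeoutError(
                f"condition not met within {timeout_sec}s") from last_error
        time.sleep(backoff_sec)
```

In a test this would wrap the status check, with `is_transient` deciding which failures (for example, "no leader controller elected") are worth waiting out.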
@dotnwat dotnwat added kind/bug Something isn't working ci-failure labels May 4, 2022
@dotnwat dotnwat changed the title test: test_maintenance_sticky: unable find controller leader test: test_maintenance_sticky: unable find controller leader (ARM slow?) May 4, 2022
@piyushredpanda
Contributor

Could @rystsov look into this? IIRC he recently debugged a similar ARM issue.

@jcsp
Contributor

jcsp commented May 9, 2022

This doesn't look ARM-specific; it was also seen on amd64 here:
https://buildkite.com/redpanda/redpanda/builds/9892#742a766b-0702-47ef-beb4-64caf3e9f317

jcsp added a commit to jcsp/redpanda that referenced this issue May 9, 2022
If controller leader election is a little slow, rpk will
see 503s and retry.  It prints output about these request
errors, which would previously trip up the output parsing.

Fixes redpanda-data#4566
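
As an illustration of the parsing problem the commit message describes, the sketch below drops rpk's retry notices before the remaining output is parsed. It is not the code from the referenced commit: the function name `strip_retry_notices` is hypothetical, and the only detail taken from this issue is the `Request error, trying another node` prefix visible in the failure output.

```python
# Prefix of the retry notices seen in the failure output above.
RETRY_NOTICE_PREFIX = "Request error, trying another node"


def strip_retry_notices(output):
    """Return the non-empty lines of `output` that are not retry notices."""
    return [
        line for line in output.splitlines()
        if line.strip() and not line.startswith(RETRY_NOTICE_PREFIX)
    ]


# Placeholder output: the two table rows are invented for the example
# and do not reproduce rpk's real column layout.
sample = (
    "Request error, trying another node: request failed: Service Unavailable\n"
    "Request error, trying another node: request failed: Service Unavailable\n"
    "COL-A  COL-B\n"
    "1      false\n"
)
rows = strip_retry_notices(sample)
```

Filtering by prefix keeps the parser independent of how many nodes rpk tried before one answered.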
@jcsp jcsp assigned jcsp and unassigned rystsov May 9, 2022
vbotbuildovich pushed a commit to vbotbuildovich/redpanda that referenced this issue May 10, 2022
If controller leader election is a little slow, rpk will
see 503s and retry.  It prints output about these request
errors, which would previously trip up the output parsing.

Fixes redpanda-data#4566

(cherry picked from commit e0e6cd4)
@jcsp jcsp changed the title test: test_maintenance_sticky: unable find controller leader (ARM slow?) test: test_maintenance_sticky: unable find controller leader May 11, 2022
jcsp added a commit to jcsp/redpanda that referenced this issue Jun 1, 2022
If controller leader election is a little slow, rpk will
see 503s and retry.  It prints output about these request
errors, which would previously trip up the output parsing.

Fixes redpanda-data#4566

(cherry picked from commit e0e6cd4)
This issue was closed.