
Failure in PartitionBalancerTest.test_unavailable_nodes #5471

Closed
r-vasquez opened this issue Jul 14, 2022 · 15 comments · Fixed by #5541
Comments

@r-vasquez (Contributor):

Build: https://buildkite.com/redpanda/redpanda/builds/12554#0181fda7-1a9d-47cd-86fb-775767011acb

test_id:    rptest.tests.partition_balancer_test.PartitionBalancerTest.test_unavailable_nodes
status:     FAIL
run time:   6 minutes 33.620 seconds

NodeCrash([(<ducktape.cluster.cluster.ClusterNode object at 0x7ff7d826b0a0>, 'Redpanda process unexpectedly stopped')])
Traceback (most recent call last):
  File "/root/tests/rptest/services/cluster.py", line 35, in wrapped
    r = f(self, *args, **kwargs)
  File "/root/tests/rptest/tests/partition_balancer_test.py", line 126, in test_unavailable_nodes
    self.wait_until_ready()
  File "/root/tests/rptest/tests/partition_balancer_test.py", line 86, in wait_until_ready
    return self.wait_until_status(
  File "/root/tests/rptest/tests/partition_balancer_test.py", line 79, in wait_until_status
    return wait_until_result(
  File "/root/tests/rptest/util.py", line 110, in wait_until_result
    wait_until(wrapped_condition, *args, **kwargs)
  File "/root/tests/rptest/util.py", line 68, in wait_until
    raise e
  File "/root/tests/rptest/util.py", line 61, in wait_until
    if condition():
  File "/root/tests/rptest/util.py", line 97, in wrapped_condition
    cond = condition()
  File "/root/tests/rptest/tests/partition_balancer_test.py", line 71, in check
    status = admin.get_partition_balancer_status(timeout=1)
  File "/root/tests/rptest/services/admin.py", line 702, in get_partition_balancer_status
    return self._request("GET",
  File "/root/tests/rptest/services/admin.py", line 334, in _request
    r.raise_for_status()
  File "/usr/local/lib/python3.9/dist-packages/requests/models.py", line 941, in raise_for_status
    raise HTTPError(http_error_msg, response=self)
requests.exceptions.HTTPError: 500 Server Error: Internal Server Error for url: http://docker-rp-6:9644/v1/cluster/partition_balancer/status

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/lib/python3.9/dist-packages/ducktape/tests/runner_client.py", line 135, in run
    data = self.run_test()
  File "/usr/local/lib/python3.9/dist-packages/ducktape/tests/runner_client.py", line 227, in run_test
    return self.test_context.function(self.test)
  File "/root/tests/rptest/services/cluster.py", line 38, in wrapped
    self.redpanda.raise_on_crash()
  File "/root/tests/rptest/services/redpanda.py", line 1030, in raise_on_crash
    raise NodeCrash(crashes)
rptest.services.utils.NodeCrash: <NodeCrash docker-rp-5: Redpanda process unexpectedly stopped>
@VladLazar (Contributor):

@ZeDRoman (Contributor):

Reason for the failure:
We are trying to get the status from node-rp-1, which is not the controller, so node-rp-1 proxies the request to node-rp-9, which was the controller. But node-rp-9 is turned off in the test, and node-rp-1's leader table hasn't been updated yet, so we get this error.

To fix it, we can add retries to the status request.
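Roughly, the retry could look like this on the test side, using ducktape's wait_until (a minimal sketch; the helper name and timeouts are illustrative, not the exact change that landed in #5541):

```python
import requests
from ducktape.utils.util import wait_until


def get_status_with_retries(admin, timeout_sec=30, backoff_sec=2):
    """Poll the partition balancer status endpoint until a request succeeds,
    retrying through transient HTTP errors (e.g. while the queried node still
    believes a stopped node is the controller)."""
    result = {}

    def attempt():
        try:
            result["status"] = admin.get_partition_balancer_status(timeout=1)
            return True
        except requests.exceptions.HTTPError:
            # The queried node may still route the request to a stopped
            # controller; treat the error as transient and try again.
            return False

    wait_until(attempt,
               timeout_sec=timeout_sec,
               backoff_sec=backoff_sec,
               err_msg="partition balancer status was not available")
    return result["status"]
```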

@piyushredpanda (Contributor):

@ZeDRoman : Thanks, but why the crash in RP?

@ZeDRoman (Contributor) commented Jul 19, 2022:

> @ZeDRoman : Thanks, but why the crash in RP?

The last crash in the log happens because we don't restart a node if the test fails in the middle of the run. We stop nodes to test rebalancing, so if the test fails before we start the node again, we will see that error. The error will not appear if there is no other failure.

@piyushredpanda (Contributor):

Thanks, @ZeDRoman : so the "crash" here is what we did to bring Redpanda down to test rebalancing. Is that correct?
I wonder if all test failures will then show up as an RP crash? If yes, is there a way to structure the test so that this is not the case (seeing an RP crash in a test would raise alarms very quickly)?

@ZeDRoman (Contributor):

> so the "crash" here is what we did to bring Redpanda down to test rebalancing. Is that correct?

Yes. We stop the nodes ourselves to trigger rebalancing.

> I wonder if all test failures will then show up as an RP crash? If yes, is there a way to structure the test so that this is not the case (seeing an RP crash in a test would raise alarms very quickly)?

As far as I know, every test will currently show that log if a Redpanda node is down at the end of the test run. This check is part of our implementation of the Redpanda service in ducktape; I haven't seen a way to avoid it.
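For reference, that check is the test wrapper in rptest visible in the traceback above: cluster.py calls self.redpanda.raise_on_crash() while handling the test's failure. A simplified sketch of the pattern, assuming only the names that appear in the traceback:

```python
import functools


def cluster(f):
    """Simplified sketch of the rptest test wrapper: run the test body and,
    if it fails, check every Redpanda node for unexpected process exits.
    A node the test stopped deliberately but never restarted therefore also
    surfaces as a NodeCrash."""
    @functools.wraps(f)
    def wrapped(self, *args, **kwargs):
        try:
            return f(self, *args, **kwargs)
        except BaseException:
            # From the traceback: redpanda.py raises NodeCrash here if any
            # Redpanda process stopped unexpectedly during the test, which
            # then masks the original test failure.
            self.redpanda.raise_on_crash()
            raise
    return wrapped
```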

@piyushredpanda (Contributor):

Let's chat with the team on some ideas to make this better.

For this ticket: so it is a test issue, and you plan to open a PR that retries the status request, yes?

@ZeDRoman (Contributor) commented Jul 19, 2022:

> For this ticket: so it is a test issue, and you plan to open a PR that retries the status request, yes?

Yes, I will create a PR soon.

@VadimPlh (Contributor):

@ztlpn (Contributor) commented Jul 25, 2022:

As per the discussion with @mmaslankaprv this morning, the missing_node_rpc_client error can be a consequence of the node trying to make an rpc request to itself. This can happen if the node is a provisional controller leader (it hasn't yet committed the first batch of its leadership term).

We need to check for this situation (leader_id == self, but is_leader == false) in admin_server.cc and return cluster::errc::no_leader_controller.

@ajfabbri (Contributor):

Seen in CI failures today here.

@ztlpn (Contributor) commented Aug 10, 2022:

[DEBUG - 2022-08-09 03:07:42,301 - admin - _request - lineno:303]: Dispatching GET http://docker-rp-13:9644/v1/cluster/partition_balancer/status
[WARNING - 2022-08-09 03:07:43,271 - admin - _request - lineno:321]: Response 500: {"message": "Unexpected error: rpc::errc::disconnected_endpoint(node down)", "code": 500}

There was a problem completing rpc requests around this time. The problem is rather inexplicable, because both nodes were online and rpc requests timed out only in one direction, so my bet is on the rpc bug conjectured in #5608 (comment).

@ztlpn (Contributor) commented Aug 10, 2022:

BTW the test will be "fixed" once #5916 merges (the admin server will return 503, which is a retryable error code).
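The practical difference for the test client is that a 503 can be treated as transient and retried, while other 5xx responses should still fail the check immediately. A minimal sketch of that decision (the helper name is illustrative):

```python
import requests


def is_retryable(exc: requests.exceptions.HTTPError) -> bool:
    # 503 Service Unavailable signals "no controller leader known yet, try
    # again later"; other server errors are treated as real failures.
    return exc.response is not None and exc.response.status_code == 503
```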

@ztlpn (Contributor) commented Aug 15, 2022:

The admin server will now return 503, and the deeper rpc problems are tracked in #6005. Going to close.

ztlpn closed this as completed on Aug 15, 2022.