
Failure in PartitionBalancerTest.test_unavailable_nodes #5471

Closed
r-vasquez opened this issue Jul 14, 2022 · 15 comments · Fixed by #5541
Comments

@r-vasquez (Contributor):

Build: https://buildkite.com/redpanda/redpanda/builds/12554#0181fda7-1a9d-47cd-86fb-775767011acb

test_id:    rptest.tests.partition_balancer_test.PartitionBalancerTest.test_unavailable_nodes
status:     FAIL
run time:   6 minutes 33.620 seconds

NodeCrash([(<ducktape.cluster.cluster.ClusterNode object at 0x7ff7d826b0a0>, 'Redpanda process unexpectedly stopped')])
Traceback (most recent call last):
  File "/root/tests/rptest/services/cluster.py", line 35, in wrapped
    r = f(self, *args, **kwargs)
  File "/root/tests/rptest/tests/partition_balancer_test.py", line 126, in test_unavailable_nodes
    self.wait_until_ready()
  File "/root/tests/rptest/tests/partition_balancer_test.py", line 86, in wait_until_ready
    return self.wait_until_status(
  File "/root/tests/rptest/tests/partition_balancer_test.py", line 79, in wait_until_status
    return wait_until_result(
  File "/root/tests/rptest/util.py", line 110, in wait_until_result
    wait_until(wrapped_condition, *args, **kwargs)
  File "/root/tests/rptest/util.py", line 68, in wait_until
    raise e
  File "/root/tests/rptest/util.py", line 61, in wait_until
    if condition():
  File "/root/tests/rptest/util.py", line 97, in wrapped_condition
    cond = condition()
  File "/root/tests/rptest/tests/partition_balancer_test.py", line 71, in check
    status = admin.get_partition_balancer_status(timeout=1)
  File "/root/tests/rptest/services/admin.py", line 702, in get_partition_balancer_status
    return self._request("GET",
  File "/root/tests/rptest/services/admin.py", line 334, in _request
    r.raise_for_status()
  File "/usr/local/lib/python3.9/dist-packages/requests/models.py", line 941, in raise_for_status
    raise HTTPError(http_error_msg, response=self)
requests.exceptions.HTTPError: 500 Server Error: Internal Server Error for url: http://docker-rp-6:9644/v1/cluster/partition_balancer/status

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/lib/python3.9/dist-packages/ducktape/tests/runner_client.py", line 135, in run
    data = self.run_test()
  File "/usr/local/lib/python3.9/dist-packages/ducktape/tests/runner_client.py", line 227, in run_test
    return self.test_context.function(self.test)
  File "/root/tests/rptest/services/cluster.py", line 38, in wrapped
    self.redpanda.raise_on_crash()
  File "/root/tests/rptest/services/redpanda.py", line 1030, in raise_on_crash
    raise NodeCrash(crashes)
rptest.services.utils.NodeCrash: <NodeCrash docker-rp-5: Redpanda process unexpectedly stopped>
@VladLazar (Contributor):

@ZeDRoman (Contributor):

Reason for the failure:
We are trying to get the status from node-rp-1, which is not the controller, so node-rp-1 proxies the request to node-rp-9, which was the controller. But node-rp-9 is turned off in the test, and node-rp-1's leader table hasn't been updated yet, so we get this error.

To fix it, we can add retries to the status request.
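Roughly, the retry could look like this on the test side, using ducktape's wait_until (a minimal sketch; the helper name and timeouts are illustrative, not the exact change that landed in #5541):

```python
import requests
from ducktape.utils.util import wait_until


def get_status_with_retries(admin, timeout_sec=30, backoff_sec=2):
    """Poll the partition balancer status endpoint until a request succeeds,
    retrying through transient HTTP errors (e.g. while the queried node still
    believes a stopped node is the controller)."""
    result = {}

    def attempt():
        try:
            result["status"] = admin.get_partition_balancer_status(timeout=1)
            return True
        except requests.exceptions.HTTPError:
            # The queried node may still route the request to a stopped
            # controller; treat the error as transient and try again.
            return False

    wait_until(attempt,
               timeout_sec=timeout_sec,
               backoff_sec=backoff_sec,
               err_msg="partition balancer status was not available")
    return result["status"]
```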

@piyushredpanda (Contributor):

@ZeDRoman : Thanks, but why the crash in RP?

@ZeDRoman (Contributor) commented Jul 19, 2022:

> @ZeDRoman : Thanks, but why the crash in RP?

The last crash in the log happens because we don't restart a node if the test fails in the middle of the run. We stop nodes to test rebalancing, so if the test fails before we start the node again, we will see that error. The error will not appear if there is no other failure.

@piyushredpanda (Contributor):

Thanks, @ZeDRoman : so the "crash" here is what we did to bring Redpanda down to test rebalancing. Is that correct?
I wonder if all test failures will then show up as an RP crash? If yes, is there a way to structure the test so that this is not the case (seeing an RP crash in a test would raise alarms very quickly)?

@ZeDRoman (Contributor):

> so the "crash" here is what we did to bring Redpanda down to test rebalancing. Is that correct?

Yes. We stop the nodes ourselves to trigger rebalancing.

> I wonder if all test failures will then show up as an RP crash? If yes, is there a way to structure the test so that this is not the case (seeing an RP crash in a test would raise alarms very quickly)?

As far as I know, every test will currently show that log if a Redpanda node is down at the end of the test run. This check is part of our implementation of the Redpanda service in ducktape; I haven't seen a way to avoid it.
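For reference, that check is the test wrapper in rptest visible in the traceback above: cluster.py calls self.redpanda.raise_on_crash() while handling the test's failure. A simplified sketch of the pattern, assuming only the names that appear in the traceback:

```python
import functools


def cluster(f):
    """Simplified sketch of the rptest test wrapper: run the test body and,
    if it fails, check every Redpanda node for unexpected process exits.
    A node the test stopped deliberately but never restarted therefore also
    surfaces as a NodeCrash."""
    @functools.wraps(f)
    def wrapped(self, *args, **kwargs):
        try:
            return f(self, *args, **kwargs)
        except BaseException:
            # From the traceback: redpanda.py raises NodeCrash here if any
            # Redpanda process stopped unexpectedly during the test, which
            # then masks the original test failure.
            self.redpanda.raise_on_crash()
            raise
    return wrapped
```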

@piyushredpanda (Contributor):

Let's chat with the team on some ideas to make this better.

For this ticket: so it is a test issue, and you plan to open a PR that retries the status request, yes?

@ZeDRoman (Contributor) commented Jul 19, 2022:

> For this ticket: so it is a test issue, and you plan to open a PR that retries the status request, yes?

Yes, I will create a PR soon.

@VadimPlh (Contributor):

@ztlpn (Contributor) commented Jul 25, 2022:

As per the discussion with @mmaslankaprv this morning, the missing_node_rpc_client error can be a consequence of the node trying to make an rpc request to itself. This can happen if the node is a provisional controller leader (it hasn't yet committed the first batch of its leadership term).

We need to check for this situation (leader_id == self, but is_leader == false) in admin_server.cc and return cluster::errc::no_leader_controller.

@ajfabbri (Contributor):

Seen in CI failures today here.

@ztlpn (Contributor) commented Aug 10, 2022:

[DEBUG - 2022-08-09 03:07:42,301 - admin - _request - lineno:303]: Dispatching GET http://docker-rp-13:9644/v1/cluster/partition_balancer/status
[WARNING - 2022-08-09 03:07:43,271 - admin - _request - lineno:321]: Response 500: {"message": "Unexpected error: rpc::errc::disconnected_endpoint(node down)", "code": 500}

There was a problem completing rpc requests around this time. The problem is rather inexplicable, because both nodes were online and rpc requests timed out only in one direction, so my bet is on the rpc bug conjectured in #5608 (comment).

@ztlpn (Contributor) commented Aug 10, 2022:

BTW the test will be "fixed" once #5916 merges (the admin server will return 503, which is a retryable error code).
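The practical difference for the test client is that a 503 can be treated as transient and retried, while other 5xx responses should still fail the check immediately. A minimal sketch of that decision (the helper name is illustrative):

```python
import requests


def is_retryable(exc: requests.exceptions.HTTPError) -> bool:
    # 503 Service Unavailable signals "no controller leader known yet, try
    # again later"; other server errors are treated as real failures.
    return exc.response is not None and exc.response.status_code == 503
```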

@ztlpn (Contributor) commented Aug 15, 2022:

The admin server will now return 503, and the deeper rpc problems are tracked in #6005. Going to close.

ztlpn closed this as completed on Aug 15, 2022.