
Failure in PartitionBalancerTest.test_full_nodes (check_no_replicas_on_node) #6810

Closed
BenPope opened this issue Oct 18, 2022 · 9 comments · Fixed by #6832
Comments

BenPope (Member) commented Oct 18, 2022

Occurred in 13/22 runs in the last 24h.

https://buildkite.com/redpanda/redpanda/builds/16852#0183e99e-0e8a-4ea7-8459-eadfa388a35d

Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/ducktape/tests/runner_client.py", line 135, in run
    data = self.run_test()
  File "/usr/local/lib/python3.10/dist-packages/ducktape/tests/runner_client.py", line 227, in run_test
    return self.test_context.function(self.test)
  File "/root/tests/rptest/services/cluster.py", line 35, in wrapped
    r = f(self, *args, **kwargs)
  File "/root/tests/rptest/tests/partition_balancer_test.py", line 511, in test_full_nodes
    self.check_no_replicas_on_node(ns.cur_failure.node)
  File "/root/tests/rptest/tests/partition_balancer_test.py", line 183, in check_no_replicas_on_node
    assert self.redpanda.idx(node) not in node2pc
AssertionError

Could be related: #5968
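
For context, a minimal sketch of what the failing check verifies (hypothetical helper names; only redpanda.idx(node) and node2pc appear in the actual traceback): the test builds a map from node index to the number of partition replicas placed on that node and asserts that the node the balancer was supposed to drain no longer appears in it.

# Hypothetical sketch of the failing check, not the real test code.
def check_no_replicas_on_node(self, node):
    # Map node index -> number of partition replicas currently placed there,
    # e.g. built from the partition assignments reported by the Admin API.
    node2pc = self._partition_count_per_node()
    self.logger.info(f"partitions per node: {node2pc}")

    # After the balancer reacts to the failure / full disk, the affected node
    # should hold no replicas at all, so its index must be absent from the map.
    assert self.redpanda.idx(node) not in node2pc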

@BenPope BenPope added kind/bug Something isn't working ci-failure labels Oct 18, 2022
@BenPope BenPope changed the title CI Failure CI Failure : check_no_replicas_on_node Oct 18, 2022
@jcsp jcsp changed the title CI Failure : check_no_replicas_on_node Failure in PartitionBalancerTest.test_full_nodes (check_no_replicas_on_node) Oct 18, 2022
jcsp (Contributor) commented Oct 18, 2022

This was unmuted here, but there was a big gap between the tests on that PR running and it merging, so something presumably merged in the interim.
#6641

This PR touched the test recently, although in a relatively superficial way: #6641

jcsp added a commit to jcsp/redpanda that referenced this issue Oct 18, 2022
This is too frequent of a failure to tolerate on the
tip of dev.

It was recently re-enabled in redpanda-data#6653

Related: redpanda-data#6810
jcsp (Contributor) commented Oct 18, 2022

This is happening very frequently, so disabling the test: #6811

ztlpn (Contributor) commented Oct 18, 2022

I think this is arm-only, so my guess is that something is wrong with arm runners.

jcsp (Contributor) commented Oct 18, 2022

This does have failures on amd64 as well, but they came from an issue with #6757, where ValueError('max_workers must be greater than 0') is emitted when test_full_nodes tries to skip itself. @graphcareful could you do a one-liner to make the trim_logs function drop out if it sees zero nodes on the redpanda instance?
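
The ValueError itself comes from Python's concurrent.futures: ThreadPoolExecutor raises ValueError('max_workers must be greater than 0') when constructed with max_workers=0, which is what happens if the log-trimming helper fans out over an empty node list. A minimal sketch of the suggested guard (the shape of trim_logs here is an assumption, not the real helper):

from concurrent.futures import ThreadPoolExecutor

def trim_logs(redpanda, trim_one_node):
    # Hypothetical shape of the helper: run the per-node trimming callable
    # over every node of the redpanda instance in parallel.
    nodes = redpanda.nodes
    if not nodes:
        # Suggested guard: with zero nodes, ThreadPoolExecutor(max_workers=0)
        # would raise ValueError("max_workers must be greater than 0").
        return
    with ThreadPoolExecutor(max_workers=len(nodes)) as executor:
        # list() forces evaluation so worker exceptions surface here.
        list(executor.map(trim_one_node, nodes))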

ztlpn (Contributor) commented Oct 18, 2022

Producer started at 07:09:30:

time="2022-10-18T07:09:30Z" level=info msg="Producing 19000 messages (102400 bytes)"

but even before that something was eating up half a gig of space on the /var/lib/redpanda volume:

DEBUG 2022-10-18 07:09:27,548 [shard 0] cluster - partition_balancer_planner.cc:138 - node 1: bytes used: 562839552, bytes total: 2147483648, used ratio: 0.2621
DEBUG 2022-10-18 07:09:27,548 [shard 0] cluster - partition_balancer_planner.cc:138 - node 2: bytes used: 562315264, bytes total: 2147483648, used ratio: 0.2618
DEBUG 2022-10-18 07:09:27,548 [shard 0] cluster - partition_balancer_planner.cc:138 - node 5: bytes used: 562249728, bytes total: 2147483648, used ratio: 0.2618
DEBUG 2022-10-18 07:09:27,548 [shard 0] cluster - partition_balancer_planner.cc:138 - node 3: bytes used: 562356224, bytes total: 2147483648, used ratio: 0.2619
DEBUG 2022-10-18 07:09:27,548 [shard 0] cluster - partition_balancer_planner.cc:138 - node 4: bytes used: 562315264, bytes total: 2147483648, used ratio: 0.2618

By contrast, at the beginning of an amd64 run we have:

DEBUG 2022-09-19 16:15:51,231 [shard 0] cluster - partition_balancer_planner.cc:137 - node 3: bytes used: 203169792, bytes total: 2147483648, used ratio: 0.09461
DEBUG 2022-09-19 16:15:51,231 [shard 0] cluster - partition_balancer_planner.cc:137 - node 5: bytes used: 203169792, bytes total: 2147483648, used ratio: 0.09461
DEBUG 2022-09-19 16:15:51,231 [shard 0] cluster - partition_balancer_planner.cc:137 - node 2: bytes used: 203169792, bytes total: 2147483648, used ratio: 0.09461
DEBUG 2022-09-19 16:15:51,231 [shard 0] cluster - partition_balancer_planner.cc:137 - node 4: bytes used: 203169792, bytes total: 2147483648, used ratio: 0.09461
DEBUG 2022-09-19 16:15:51,231 [shard 0] cluster - partition_balancer_planner.cc:137 - node 1: bytes used: 203169792, bytes total: 2147483648, used ratio: 0.09461

hmm
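
For reference, the used ratios in those log lines are simply bytes used divided by bytes total, and the arm baseline is roughly 340 MiB higher than the amd64 one before the producer even starts:

# Reproducing the ratios from the partition_balancer_planner.cc log lines above.
arm_used, amd_used, total = 562_839_552, 203_169_792, 2_147_483_648  # bytes
print(f"{arm_used / total:.4f}")                    # 0.2621, matches the arm run
print(f"{amd_used / total:.4f}")                    # 0.0946, matches the amd64 run
print(f"{(arm_used - amd_used) / 2**20:.0f} MiB")   # ~343 MiB extra baseline usage on arm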

ztlpn (Contributor) commented Oct 19, 2022

From Solonas:

we had a look yesterday; it doesn't look like something is wrong with the instance (or filesystem)

BenPope (Member, Author) commented Oct 19, 2022

ValueError('max_workers must be greater than 0'):
https://buildkite.com/redpanda/redpanda/builds/16878#0183ebb7-feb4-485c-abf5-03aa8be3ebd2

ztlpn (Contributor) commented Oct 21, 2022

The discrepancy is already there for empty filesystems. Here is df output for arm:

Filesystem     1K-blocks   Used Available Use% Mounted on
/dev/md0p18    450981836 483104 450498732   1% /var/lib/redpanda

and for amd64:

Filesystem     1K-blocks  Used Available Use% Mounted on
/dev/md0p6      64460032 97856  64362176   1% /var/lib/redpanda

The likely reason is that the arm partition is simply bigger (430 GiB vs. 61 GiB), and for a bigger partition the filesystem needs more space for its bookkeeping data.

Roman is working on making the test a little bit smarter in choosing the amount of produced data.
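
A minimal sketch of that kind of adjustment, assuming a hypothetical helper (the real change is presumably in #6832, which closed this issue): derive the amount of data to produce from the disk usage the nodes actually report, instead of assuming an empty filesystem.

def bytes_to_produce(reported_used, reported_total,
                     target_used_ratio=0.9,
                     replication_factor=3, num_nodes=5):
    # Hypothetical helper: fill the cluster up to target_used_ratio of the
    # reported capacity, accounting for whatever space the filesystem already
    # consumes for bookkeeping (the arm vs. amd64 baseline difference above).
    to_fill_per_node = target_used_ratio * reported_total - reported_used
    # Every produced byte is stored on replication_factor nodes across the cluster.
    return max(0, int(to_fill_per_node * num_nodes / replication_factor))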

travisdowns (Member) commented:

The assert self.redpanda.idx(node) not in node2pc variant seems to be occurring on the 22.2.x branch at least:

https://buildkite.com/redpanda/redpanda/builds/23044#018640f3-782b-4ce9-afd3-26711887aa40

... despite the fix apparently having been backported there in #7353.

This issue was closed.