
Failure in PartitionBalancerTest.test_full_nodes (check_no_replicas_on_node) #6810

Closed
BenPope opened this issue Oct 18, 2022 · 9 comments · Fixed by #6832
Comments

BenPope (Member) commented Oct 18, 2022

Occurred in 13/22 runs in the last 24h.

https://buildkite.com/redpanda/redpanda/builds/16852#0183e99e-0e8a-4ea7-8459-eadfa388a35d

Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/ducktape/tests/runner_client.py", line 135, in run
    data = self.run_test()
  File "/usr/local/lib/python3.10/dist-packages/ducktape/tests/runner_client.py", line 227, in run_test
    return self.test_context.function(self.test)
  File "/root/tests/rptest/services/cluster.py", line 35, in wrapped
    r = f(self, *args, **kwargs)
  File "/root/tests/rptest/tests/partition_balancer_test.py", line 511, in test_full_nodes
    self.check_no_replicas_on_node(ns.cur_failure.node)
  File "/root/tests/rptest/tests/partition_balancer_test.py", line 183, in check_no_replicas_on_node
    assert self.redpanda.idx(node) not in node2pc
AssertionError

Could be related: #5968
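
For context, a minimal sketch of what the failing check verifies (hypothetical helper names; only redpanda.idx(node) and node2pc appear in the actual traceback): the test builds a map from node index to the number of partition replicas placed on that node and asserts that the node the balancer was supposed to drain no longer appears in it.

# Hypothetical sketch of the failing check, not the real test code.
def check_no_replicas_on_node(self, node):
    # Map node index -> number of partition replicas currently placed there,
    # e.g. built from the partition assignments reported by the Admin API.
    node2pc = self._partition_count_per_node()
    self.logger.info(f"partitions per node: {node2pc}")

    # After the balancer reacts to the failure / full disk, the affected node
    # should hold no replicas at all, so its index must be absent from the map.
    assert self.redpanda.idx(node) not in node2pc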

@BenPope BenPope added kind/bug Something isn't working ci-failure labels Oct 18, 2022
@BenPope BenPope changed the title CI Failure CI Failure : check_no_replicas_on_node Oct 18, 2022
@jcsp jcsp changed the title CI Failure : check_no_replicas_on_node Failure in PartitionBalancerTest.test_full_nodes (check_no_replicas_on_node) Oct 18, 2022
jcsp (Contributor) commented Oct 18, 2022

This was unmuted here, but there was a big gap between the tests on that PR running and it merging, so something presumably merged in the interim.
#6641

This PR touched the test recently, although in a relatively superficial way: #6641

jcsp added a commit to jcsp/redpanda that referenced this issue Oct 18, 2022
This is too frequent of a failure to tolerate on the
tip of dev.

It was recently re-enabled in redpanda-data#6653

Related: redpanda-data#6810
jcsp (Contributor) commented Oct 18, 2022

This is happening very frequently, so disabling the test: #6811

ztlpn (Contributor) commented Oct 18, 2022

I think this is arm-only, so my guess is that something is wrong with arm runners.

jcsp (Contributor) commented Oct 18, 2022

This does have failures on amd64 as well, but they came from an issue with #6757, where ValueError('max_workers must be greater than 0') is emitted when test_full_nodes tries to skip itself. @graphcareful could you do a one-liner to make the trim_logs function drop out if it sees zero nodes on the redpanda instance?
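
The ValueError itself comes from Python's concurrent.futures: ThreadPoolExecutor raises ValueError('max_workers must be greater than 0') when constructed with max_workers=0, which is what happens if the log-trimming helper fans out over an empty node list. A minimal sketch of the suggested guard (the shape of trim_logs here is an assumption, not the real helper):

from concurrent.futures import ThreadPoolExecutor

def trim_logs(redpanda, trim_one_node):
    # Hypothetical shape of the helper: run the per-node trimming callable
    # over every node of the redpanda instance in parallel.
    nodes = redpanda.nodes
    if not nodes:
        # Suggested guard: with zero nodes, ThreadPoolExecutor(max_workers=0)
        # would raise ValueError("max_workers must be greater than 0").
        return
    with ThreadPoolExecutor(max_workers=len(nodes)) as executor:
        # list() forces evaluation so worker exceptions surface here.
        list(executor.map(trim_one_node, nodes))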

ztlpn (Contributor) commented Oct 18, 2022

Producer started at 07:09:30:

time="2022-10-18T07:09:30Z" level=info msg="Producing 19000 messages (102400 bytes)"

but even before that something was eating up half a gig of space on the /var/lib/redpanda volume:

DEBUG 2022-10-18 07:09:27,548 [shard 0] cluster - partition_balancer_planner.cc:138 - node 1: bytes used: 562839552, bytes total: 2147483648, used ratio: 0.2621
DEBUG 2022-10-18 07:09:27,548 [shard 0] cluster - partition_balancer_planner.cc:138 - node 2: bytes used: 562315264, bytes total: 2147483648, used ratio: 0.2618
DEBUG 2022-10-18 07:09:27,548 [shard 0] cluster - partition_balancer_planner.cc:138 - node 5: bytes used: 562249728, bytes total: 2147483648, used ratio: 0.2618
DEBUG 2022-10-18 07:09:27,548 [shard 0] cluster - partition_balancer_planner.cc:138 - node 3: bytes used: 562356224, bytes total: 2147483648, used ratio: 0.2619
DEBUG 2022-10-18 07:09:27,548 [shard 0] cluster - partition_balancer_planner.cc:138 - node 4: bytes used: 562315264, bytes total: 2147483648, used ratio: 0.2618

By contrast, at the beginning of an amd64 run we have:

DEBUG 2022-09-19 16:15:51,231 [shard 0] cluster - partition_balancer_planner.cc:137 - node 3: bytes used: 203169792, bytes total: 2147483648, used ratio: 0.09461
DEBUG 2022-09-19 16:15:51,231 [shard 0] cluster - partition_balancer_planner.cc:137 - node 5: bytes used: 203169792, bytes total: 2147483648, used ratio: 0.09461
DEBUG 2022-09-19 16:15:51,231 [shard 0] cluster - partition_balancer_planner.cc:137 - node 2: bytes used: 203169792, bytes total: 2147483648, used ratio: 0.09461
DEBUG 2022-09-19 16:15:51,231 [shard 0] cluster - partition_balancer_planner.cc:137 - node 4: bytes used: 203169792, bytes total: 2147483648, used ratio: 0.09461
DEBUG 2022-09-19 16:15:51,231 [shard 0] cluster - partition_balancer_planner.cc:137 - node 1: bytes used: 203169792, bytes total: 2147483648, used ratio: 0.09461

hmm
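
For reference, the used ratios in those log lines are simply bytes used divided by bytes total, and the arm baseline is roughly 340 MiB higher than the amd64 one before the producer even starts:

# Reproducing the ratios from the partition_balancer_planner.cc log lines above.
arm_used, amd_used, total = 562_839_552, 203_169_792, 2_147_483_648  # bytes
print(f"{arm_used / total:.4f}")                    # 0.2621, matches the arm run
print(f"{amd_used / total:.4f}")                    # 0.0946, matches the amd64 run
print(f"{(arm_used - amd_used) / 2**20:.0f} MiB")   # ~343 MiB extra baseline usage on arm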

ztlpn (Contributor) commented Oct 19, 2022

From Solonas:

we had a look yesterday; it doesn't look like something is wrong with the instance (or filesystem)

BenPope (Member, Author) commented Oct 19, 2022

ValueError('max_workers must be greater than 0'):
https://buildkite.com/redpanda/redpanda/builds/16878#0183ebb7-feb4-485c-abf5-03aa8be3ebd2

ztlpn (Contributor) commented Oct 21, 2022

The discrepancy is already there for empty filesystems. Here is df output for arm:

Filesystem     1K-blocks   Used Available Use% Mounted on
/dev/md0p18    450981836 483104 450498732   1% /var/lib/redpanda

and for amd64:

Filesystem     1K-blocks  Used Available Use% Mounted on
/dev/md0p6      64460032 97856  64362176   1% /var/lib/redpanda

The likely reason is that the arm partition is simply bigger (430 GiB vs. 61 GiB), and for a bigger partition the filesystem needs more space for its bookkeeping data.

Roman is working on making the test a little bit smarter in choosing the amount of produced data.
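
A minimal sketch of that kind of adjustment, assuming a hypothetical helper (the real change is presumably in #6832, which closed this issue): derive the amount of data to produce from the disk usage the nodes actually report, instead of assuming an empty filesystem.

def bytes_to_produce(reported_used, reported_total,
                     target_used_ratio=0.9,
                     replication_factor=3, num_nodes=5):
    # Hypothetical helper: fill the cluster up to target_used_ratio of the
    # reported capacity, accounting for whatever space the filesystem already
    # consumes for bookkeeping (the arm vs. amd64 baseline difference above).
    to_fill_per_node = target_used_ratio * reported_total - reported_used
    # Every produced byte is stored on replication_factor nodes across the cluster.
    return max(0, int(to_fill_per_node * num_nodes / replication_factor))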

travisdowns (Member) commented:

The assert self.redpanda.idx(node) not in node2pc variant seems to be occurring on the 22.2.x branch at least:

https://buildkite.com/redpanda/redpanda/builds/23044#018640f3-782b-4ce9-afd3-26711887aa40

... despite the fix apparently having been backported there in #7353.

This issue was closed.