CI Failure in PartitionBalancerTest.test_fuzz_admin_ops #5950

Closed
BenPope opened this issue Aug 11, 2022 · 8 comments

BenPope commented Aug 11, 2022

Version & Environment

Redpanda version: dev

https://buildkite.com/redpanda/redpanda/builds/13918#01828650-7bd0-4ba1-af35-0b9b4ce7f959

What went wrong?

CI Failure

What should have happened instead?

CI Success

How to reproduce the issue?

???

Additional information

Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/ducktape/tests/runner_client.py", line 135, in run
    data = self.run_test()
  File "/usr/local/lib/python3.10/dist-packages/ducktape/tests/runner_client.py", line 227, in run_test
    return self.test_context.function(self.test)
  File "/root/tests/rptest/services/cluster.py", line 35, in wrapped
    r = f(self, *args, **kwargs)
  File "/root/tests/rptest/tests/partition_balancer_test.py", line 369, in test_fuzz_admin_ops
    assert self.redpanda.idx(node) not in node2partition_count
AssertionError
BenPope added the kind/bug (Something isn't working) and ci-failure labels Aug 11, 2022

BenPope commented Aug 11, 2022

This one looks very similar:

Module: rptest.tests.partition_balancer_test
Class:  PartitionBalancerTest
Method: test_unavailable_nodes

https://buildkite.com/redpanda/redpanda/builds/13918#01828650-7bd3-4c4e-a58c-0d79399d2070

The stack trace is different, but it looks like it might be the same underlying cause:

Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/ducktape/tests/runner_client.py", line 135, in run
    data = self.run_test()
  File "/usr/local/lib/python3.10/dist-packages/ducktape/tests/runner_client.py", line 227, in run_test
    return self.test_context.function(self.test)
  File "/root/tests/rptest/services/cluster.py", line 35, in wrapped
    r = f(self, *args, **kwargs)
  File "/root/tests/rptest/tests/partition_balancer_test.py", line 225, in test_unavailable_nodes
    self.check_no_replicas_on_node(node)
  File "/root/tests/rptest/tests/partition_balancer_test.py", line 137, in check_no_replicas_on_node
    assert self.redpanda.idx(node) not in node2pc
AssertionError

BenPope commented Aug 11, 2022

https://buildkite.com/redpanda/redpanda/builds/13959#01828b64-8232-4ed7-86a9-3e7b577fa83d

Module: rptest.tests.partition_balancer_test
Class:  PartitionBalancerTest
Method: test_maintenance_mode
Arguments:
{
  "kill_same_node": false
}

ztlpn commented Aug 11, 2022

The problem (once again) was that the controller raft group had an election while we were waiting for the unavailability timeout to expire, and the election reset the timeout. This is expected behaviour, and the test code should be more robust in handling it; I'll think about how to fix this.
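
A minimal sketch of how the test wait could tolerate that, assuming hypothetical helpers controller_leader_changed_since() and node_partition_counts() (these are not the real methods in partition_balancer_test.py): restart the locally tracked timer whenever a controller election is observed, instead of asserting at a fixed point in time.

import time

def wait_until_node_drained(self, node, overall_timeout_sec=300, backoff_sec=5):
    """Wait for all replicas to move off `node`, tolerating controller
    elections that restart the unavailability timer on the Redpanda side."""
    deadline = time.time() + overall_timeout_sec
    wait_started = time.time()
    while time.time() < deadline:
        # Hypothetical helper: has the controller leader changed since we started waiting?
        if self.controller_leader_changed_since(wait_started):
            # A new controller leader restarts the unavailability timer,
            # so reset the local expectation instead of failing the assertion.
            wait_started = time.time()
        # Hypothetical helper returning {node_id: partition_count}.
        if self.redpanda.idx(node) not in self.node_partition_counts():
            return
        time.sleep(backoff_sec)
    raise TimeoutError(f"node {self.redpanda.idx(node)} still has replicas")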

jcsp commented Aug 12, 2022

Seen an alternative failure mode (TimeoutError after 25 minutes) here:
https://buildkite.com/redpanda/redpanda/builds/14027#01828e47-1be2-4da5-8a52-3565c0d43c1f

That is on a PR branch that changes redpanda startup to happen in parallel rather than serially, so it's possible that has something to do with it, but all other tests passed.

Given that the usual runtime of the test when it passes is more like 5 minutes, it seems like we have an excessive timeout in here somewhere. ducktape times out internally after 30 minutes of running a test, so our timeouts need to be well within that.
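
One way to keep internal waits bounded is to give each one an explicit cap well under that 30-minute limit, e.g. via ducktape's wait_until; the predicate name and the 10-minute figure below are illustrative, not taken from the test:

from ducktape.utils.util import wait_until

# Cap the wait well inside ducktape's 30-minute per-test timeout so a stuck
# balancer fails with a descriptive error instead of a late, generic timeout.
wait_until(lambda: balancer_is_quiescent(),  # illustrative predicate
           timeout_sec=600,
           backoff_sec=5,
           err_msg="partition balancer did not reach a quiescent state")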

jcsp added a commit to jcsp/redpanda that referenced this issue Aug 12, 2022
This can drop out earlier when there is an error,
rather than waiting in vain for the execution count
to reach a target that it never will.

Related: redpanda-data#5950
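
A rough sketch of that idea (self.executed and self.errors are placeholder names, not the fields used in the PR): keep polling for the target execution count, but raise as soon as an error has been recorded.

import time

def wait_for_executions(self, target, timeout_sec=300, backoff_sec=2):
    """Wait for `target` admin operations to complete, but drop out early
    if any operation has already failed."""
    deadline = time.time() + timeout_sec
    while time.time() < deadline:
        if self.errors:
            # No point waiting for a count that will never be reached.
            raise RuntimeError(f"admin operation failed: {self.errors[0]}")
        if self.executed >= target:
            return
        time.sleep(backoff_sec)
    raise TimeoutError(f"only {self.executed}/{target} operations executed")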
BenPope commented Aug 12, 2022

https://buildkite.com/redpanda/redpanda/builds/14049#01829076-c4f5-4770-8d90-33099a3f8078

Module: rptest.tests.partition_balancer_test
Class:  PartitionBalancerTest
Method: test_fuzz_admin_ops

https://buildkite.com/redpanda/redpanda/builds/14050#0182907f-ebac-4341-a9fb-e8577b90f1d5

Module: rptest.tests.partition_balancer_test
Class:  PartitionBalancerTest
Method: test_unavailable_nodes

VladLazar commented
https://buildkite.com/redpanda/redpanda/builds/14018#01828dfb-4a16-42dd-8d8c-5e1e67a37de7

Module: rptest.tests.partition_balancer_test
Class:  PartitionBalancerTest
Method: test_maintenance_mode
Arguments:
{
  "kill_same_node": false
}

vbotbuildovich pushed a commit to vbotbuildovich/redpanda that referenced this issue Aug 15, 2022
This can drop out earlier when there is an error,
rather than waiting in vain for the execution count
to reach a target that it never will.

Related: redpanda-data#5950
(cherry picked from commit e0ff6b9)
ztlpn added a commit that referenced this issue Aug 15, 2022
More robust waiting for the quiescent state in partition balancer tests

ztlpn commented Aug 15, 2022

fixed by #6007

ztlpn closed this as completed Aug 15, 2022