Failure in NodesDecommisioningTest.test_decommissioning_working_node #2388

Closed
jcsp opened this issue Sep 22, 2021 · 7 comments

jcsp commented Sep 22, 2021

https://buildkite.com/vectorized/vtools/builds/306#3d5fa10a-5652-48fe-8e90-28832108c106

http://ci-artifacts.dev.vectorized.cloud/vtools/3d5fa10a-5652-48fe-8e90-28832108c106/vbuild/ducktape/results/2021-09-22--001/report.html

From the test log, it looks like the partitions were correctly moved away from the node, but the node itself remained in the list of brokers and the test timed out waiting for it to go away.
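For context, the kind of wait that timed out here can be sketched as below. This is a minimal illustration under stated assumptions, not the actual ducktape test code: `wait_until_removed`, the `list_brokers` callable, and the default timeout and backoff values are hypothetical names chosen for the example.

```python
import time


def wait_until_removed(list_brokers, node_id, timeout_sec=120, backoff_sec=2,
                       clock=time.monotonic, sleep=time.sleep):
    """Poll until node_id no longer appears in the broker list.

    list_brokers is a hypothetical callable returning an iterable of broker
    ids; the real test would query the cluster instead.
    Returns True if the node disappeared before the timeout, else False.
    """
    deadline = clock() + timeout_sec
    while True:
        # Success: the decommissioned node is gone from the broker list.
        if node_id not in set(list_brokers()):
            return True
        # Give up once the deadline has passed.
        if clock() >= deadline:
            return False
        sleep(backoff_sec)
```

With a helper like this, the failure described above corresponds to the function returning `False`: the partitions have moved, but the broker id is still listed when the deadline expires.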

jcsp added the kind/bug, area/raft, and ci-failure labels Sep 22, 2021
jcsp added a commit to jcsp/redpanda that referenced this issue Sep 22, 2021
Pending redpanda-data#2388

Signed-off-by: John Spray <jcs@vectorized.io>
jcsp added a commit to jcsp/redpanda that referenced this issue Sep 23, 2021
Pending redpanda-data#2388

Signed-off-by: John Spray <jcs@vectorized.io>

jcsp commented Sep 23, 2021

This looks like a genuine redpanda bug.

On a passing run of the test, the time between "changing node {} membership state to: draining" and "decommissioning finished, removing node {} from cluster" is only 10 seconds.

On this failure, we're never getting the second message, and the timeout is 120 seconds.

I do notice that the leadership balancer is doing transfers at the same time as the decommissioning is going on.
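The gap between the two membership log messages quoted above can be measured with a small script. This is a rough sketch, not tooling from the repository: the timestamp layout in `TS_RE` is an assumption about the log format, the `{}` placeholder is assumed to render as an integer node id, and `decommission_duration` is a hypothetical helper name.

```python
import re
from datetime import datetime

# Assumed timestamp layout, e.g. "2021-09-22 10:15:30,123"; adjust to the real logs.
TS_RE = re.compile(r"(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}),(\d{3})")
# Message texts are taken from the comment above; "{}" is assumed to be a node id.
DRAINING_RE = re.compile(r"changing node (\d+) membership state to: draining")
FINISHED_RE = re.compile(
    r"decommissioning finished, removing node (\d+) from cluster")


def _timestamp(line):
    """Parse the assumed timestamp from a log line, or return None."""
    m = TS_RE.search(line)
    if m is None:
        return None
    base = datetime.strptime(m.group(1), "%Y-%m-%d %H:%M:%S")
    return base.replace(microsecond=int(m.group(2)) * 1000)


def decommission_duration(lines, node_id):
    """Seconds from 'draining' to 'decommissioning finished' for node_id.

    Returns None if the finish message never appears after draining started,
    which is the situation seen in the failing run.
    """
    started = None
    for line in lines:
        ts = _timestamp(line)
        if ts is None:
            continue
        m = DRAINING_RE.search(line)
        if m and int(m.group(1)) == node_id and started is None:
            started = ts
            continue
        m = FINISHED_RE.search(line)
        if m and int(m.group(1)) == node_id and started is not None:
            return (ts - started).total_seconds()
    return None
```

Running this over the controller log of a passing and a failing run would make the comparison explicit: a number of seconds in the first case, `None` in the second.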


jcsp commented Sep 23, 2021

This test is now disabled in dev; the PR that fixes it should re-enable the test too.


jcsp commented Sep 28, 2021

This is hard to reproduce: it has not failed in several days of nightly test-staging runs. The original failure still looks like an authentic issue.


jcsp commented Sep 29, 2021


jcsp commented Sep 29, 2021

Created #2478 for what is believed to be the underlying bug.

jcsp assigned jcsp and unassigned mmaslankaprv Sep 29, 2021
jcsp assigned ztlpn and unassigned jcsp Dec 2, 2021

jcsp commented Dec 2, 2021

Should be fixed by #3125


jcsp commented Dec 2, 2021

Fixed via #3125 via #2478

jcsp closed this as completed Dec 2, 2021