CI Failure "Node 6 decommissioning stopped making progress" in random_node_operations_test.RandomNodeOperationsTest.test_node_operations #9052
Seems like a legit decommission hang: it's stuck with a non-empty allocator loop (allocator_empty: false) and no subsequent re-allocations.
It is not obvious to me what that stuck allocation is.
sev/medium: legit decommission hang (UX problem). (Workaround: restart the partition leader broker.)
I've looked into the details of this problem; nothing obvious is blocking the allocator.
Occurred on (amd64, container) in job https://buildkite.com/redpanda/redpanda/builds/24692#0186c50a-2f7b-440c-ab52-d2843790a604
Could be an issue in the test: https://github.com/mmaslankaprv/redpanda/blob/35fe04328688e68e158f22bd1a53afe1b4261f54/tests/rptest/utils/node_operations.py#L150 may need
I doubt it; things are genuinely stuck (based on logging) with a pending allocation unit. I think the issue is the following: it seems like a race condition where add, decommission, and topic delete all run at the same time. Consider the following series of commands that run very close to each other.
With cancel_update we swap the current assignment to reflect the state of the original move, but we never swap the replicas of the in_progress operation, so I think the delete deallocation logic is tripping and deallocating fewer units than it should.
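A minimal Python model of the suspected bookkeeping race described above (the names and data structures here are invented for illustration; they are not Redpanda's actual internals). The idea: a move tracks its previous replicas in an in-progress record so a topic delete can free both the current and the previous allocation units; if cancel_update swaps the current assignment back but never swaps the in-progress replica set, a concurrent delete frees the same set twice and leaks the move's target units.

```python
# Hypothetical model of the suspected bug; all names are illustrative.
allocated = set()  # allocation units currently held by the allocator

def allocate(replicas):
    allocated.update(replicas)

def deallocate(replicas):
    allocated.difference_update(replicas)

# Partition starts on nodes {0, 1, 2}, then a move to {3, 4, 5} begins:
# the new target units are allocated and the old set is remembered.
allocate({0, 1, 2})
current = {3, 4, 5}
allocate(current)
in_progress = {"previous": {0, 1, 2}}

# cancel_update swaps the current assignment back to the original move
# state, but never swaps in_progress["previous"] (the suspected bug:
# after the swap it should hold {3, 4, 5}).
current = {0, 1, 2}

# A concurrent topic delete frees current plus the in-progress previous
# replicas -- which are now the same set, so units 3, 4, 5 leak and the
# allocator loop stays non-empty (allocator_empty: false).
deallocate(current)
deallocate(in_progress["previous"])
print(sorted(allocated))  # leaked allocation units: [3, 4, 5]
```

With disjoint old and new replica sets the leak is exactly the move's target nodes; this matches the symptom of a pending allocation unit with no subsequent re-allocations.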
And on v22.3.14-rc3 CDT: https://buildkite.com/redpanda/vtools/builds/6711#0186e2a3-fb06-4a7a-8288-539d4b760777
Seems to have occurred again: https://buildkite.com/redpanda/vtools/builds/7155#01879366-c29c-4c54-8d3c-d18a47ffc63b
Let me dig in more in the morning; the linked logs look like things are working as intended: look up the pid, then kill it, and exit when there is no more matching pid.
@rockwotj The problem is not that redpanda is not being killed, but that it is not started again. There is a concurrent thread that sometimes runs

@piyushredpanda no, we shouldn't revert, because that PR fixes another issue with redpanda process detection; we should think of a better way.
One ugly alternative could be to rename this redpanda process to escape pgrep. A softlink works too.
We can also look up the full path and skip the process if --version is in the full command. Why is a concurrent thread checking for versions? Can we not do that?
I opened a PR to exclude --version commands from pid detection |
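The fix discussed above can be sketched as follows. This is a hypothetical helper (the function name and the `(pid, cmdline)` input shape are mine, not the actual test code): when scanning process command lines for a running redpanda broker, skip any invocation that is just a `--version` probe, so a concurrent version check is not mistaken for the broker process.

```python
def find_redpanda_pids(proc_lines):
    """Return pids of real redpanda broker processes.

    proc_lines: iterable of (pid, cmdline) pairs, e.g. parsed from
    `ps -eo pid,args` output.
    """
    pids = []
    for pid, cmdline in proc_lines:
        if "redpanda" not in cmdline:
            continue
        if "--version" in cmdline.split():
            continue  # version probe, not the broker itself
        pids.append(pid)
    return pids

# Example: a running broker plus a concurrent version check.
procs = [
    (101, "/opt/redpanda/bin/redpanda --redpanda-cfg /etc/redpanda.yaml"),
    (202, "/opt/redpanda/bin/redpanda --version"),
]
print(find_redpanda_pids(procs))  # [101]
```

The same effect is achievable with `pgrep -f` plus filtering on the full command line; filtering on exact arguments rather than renaming the binary keeps the process detection fix from the earlier PR intact.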
Yeah, I wish I knew the exact sequence... I think the test checks the version to write the config file before starting redpanda, but in this case starting is skipped, so that's a bit strange. I guess we'll need better debug logging if this resurfaces. Here is the old issue for reference: #8753
$ egrep "partition_balancer|redpanda/controller/0}] vote_stm.cc" RedpandaService-0-139865974157136/docker-rp-19/redpanda.log
Different node number but presumably the same issue here: https://ci-artifacts.dev.vectorized.cloud/redpanda/32529/01892670-5744-43e2-8c71-4969d9cdaf4d/vbuild/ducktape/results/2023-07-05--001/report.html
According to
The above linked builds are from September. Closing. |
https://buildkite.com/redpanda/redpanda/builds/23623#01867359-b3a6-4354-b2c7-7a018c31dd5a