Fix for hang on node joining cluster in NodeOperationFuzzyTest.test_node_operations #8559
Conversation
Taking a step back, this change somewhat reduces our test coverage for mixed versions, e.g. we'll never test adding an old node to the cluster, since nodes will be new upon restarting. It'd be nice to preserve that kind of coverage where we can
Force-pushed from 41f6095 to 6576986
Modified the solution to instead never fully upgrade the cluster.

/ci-repeat 2
This solution should be enough to avoid the troubles reported in the issue.
As far as test coverage is concerned, I think we'd get a little more by comparing the nodes in redpanda.started_nodes() with upgraded_nodes in each iteration of the execution loop, and, if the former is a subset of the latter (and we haven't already done so), upgrading all nodes.
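The suggested subset check could be sketched roughly as follows. This is a hypothetical illustration only: the helper name and the shape of the node sets are assumptions, not the actual test's code.

```python
# Hypothetical sketch of the reviewer's suggestion: each iteration of the
# fuzz loop checks whether every currently started node has already been
# upgraded; if so (and we haven't done it yet), finish the upgrade explicitly.

def maybe_finish_upgrade(started_nodes, upgraded_nodes, fully_upgraded):
    """Return True if the cluster should now be fully upgraded.

    started_nodes / upgraded_nodes are sets of node names; fully_upgraded
    tracks whether the full upgrade has already been performed.
    """
    if fully_upgraded:
        return False
    # If the started nodes are a subset of the upgraded ones, the cluster is
    # effectively fully upgraded already, so complete the upgrade explicitly.
    return set(started_nodes) <= set(upgraded_nodes)

# All running nodes already upgraded -> trigger the full upgrade once.
print(maybe_finish_upgrade({"n1", "n2", "n3"}, {"n1", "n2", "n3", "n4"}, False))  # True
# A non-upgraded node is still running -> keep mixed-version coverage.
print(maybe_finish_upgrade({"n1", "n5"}, {"n1", "n2"}, False))  # False
```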
Force-pushed from 6576986 to b4e78aa
CI failure seems to be an instance of #8589.
Force-pushed from b4e78aa to 645ed64
Rebased onto dev.
/ci-repeat 2

/ci-repeat 1
Force-pushed from 645ed64 to 8f6b975
/ci-repeat 1
@graphcareful is this good to go? I ran into this issue again yesterday.
I'm fine with the fix here, but it seems like the repeats are unhappy. I recall @graphcareful mentioning some other flakiness, though it wasn't clear whether that was caused by this PR or not.
Taking another look now
Upon further investigation, it looks like the most recent CI failures involve assertions that didn't exist when the initial fix was written. There have since been edits to this test and other supporting shared utils, and an assertion in one of those is firing.
All of the failures share the same assertion. The logic I introduced will only be enabled in the case
- The node_operations_fuzzy_test.test_node_operations test was failing due to the cluster eventually fully upgrading and having nodes with older versions of rp restart and attempt to join.
- The fix is to have one node never restart so that the cluster will always stay in a partially upgraded state.
- Fixes: redpanda-data#6320
Force-pushed from 8f6b975 to 6e093f8
/ci-repeat 1

/ci-repeat 1
The issue observed is also seen in dev: #9052
The pull request is not merged yet. Cancelling backport... |
Test no longer exists, closing |
The test is failing because the cluster has already been fully upgraded; when a node then joins with an older version of rp, its join request is rejected.
This happens because the test randomly adds/recommissions/decommissions nodes. This particular test run starts with a 5-node cluster, 3 of which are upgraded to the HEAD version and 2 of which are on the previous version.
This randomness is why the test sometimes passes. During the test, random operations are performed on the cluster; if the 2 nodes with older versions happen to be decommissioned, the cluster becomes fully upgraded. When those 2 nodes are then reused for ADD operations, a node with an older version of rp ends up attempting to join.
The fix is to never fully upgrade the cluster: one of the non-upgraded nodes is taken out of consideration for having decommission requests sent to it.
Fixes: #6320
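The exclusion described above could look roughly like the following. This is a hypothetical sketch with made-up names (`pick_decommission_target`, `node-*`), not the test's actual helpers.

```python
import random

def pick_decommission_target(active_nodes, pinned_node):
    """Randomly choose a node to decommission, never the pinned one.

    The pinned node stays on the old version and is never decommissioned,
    so the cluster can never become fully upgraded during the test.
    """
    candidates = [n for n in active_nodes if n != pinned_node]
    return random.choice(candidates) if candidates else None

# node-4 is one of the non-upgraded nodes; it is excluded from selection.
pinned = "node-4"
target = pick_decommission_target(["node-1", "node-2", "node-3", "node-4"], pinned)
```

Keeping the exclusion in the selection helper (rather than filtering after the fact) guarantees every random decommission operation respects the invariant.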
Backports Required
Release Notes