
Fix for hang on node joining cluster in NodeOperationFuzzyTest.test_node_operations #8559

Conversation

graphcareful
Contributor

@graphcareful graphcareful commented Feb 1, 2023

  • The test fails when the cluster has already been fully upgraded and a node running an older version of rp then tries to join; that join request is rejected.

  • This happens because the test randomly adds/recommissions/decommissions nodes. The test starts with a 5-node cluster, 3 of which are upgraded to the HEAD version while 2 remain on the previous version.

  • The randomness is why the test sometimes passes. Random operations are performed on the cluster during the run; if both older-version nodes happen to be decommissioned, the cluster becomes fully upgraded. When those 2 nodes are later reused for ADD operations, a node with an older version of rp attempts to join and is rejected.

  • The fix is to never fully upgrade the cluster: one of the non-upgraded nodes is taken out of consideration for decommission requests (see the sketch below).
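
A minimal, hypothetical sketch of the approach (node ids, variable names, and pick_decommission_target are illustrative, not the actual rptest code):

```python
import random

# Node ids 1-5, with 1-3 upgraded to HEAD and 4-5 still on the previous version.
all_nodes = [1, 2, 3, 4, 5]
upgraded_nodes = {1, 2, 3}

# Pin one non-upgraded node: it is never offered as a decommission target, so
# the cluster can never end up fully upgraded and an old-version node can
# always rejoin after being stopped and re-added.
non_upgraded = [n for n in all_nodes if n not in upgraded_nodes]
pinned_old_node = random.choice(non_upgraded)

def pick_decommission_target(active_nodes):
    """Choose a node to decommission, never the pinned old-version node."""
    candidates = [n for n in active_nodes if n != pinned_old_node]
    return random.choice(candidates)

print(pick_decommission_target(all_nodes))
```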

Fixes: #6320

Backports Required

  • none - not a bug fix
  • none - issue does not exist in previous branches
  • none - papercut/not impactful enough to backport
  • v22.3.x
  • v22.2.x
  • v22.1.x

Release Notes

  • none

@graphcareful graphcareful changed the title rptest: Cluster eventually upgrade in nodeops_test Fix for hang on node joining cluster in NodeOperationFuzzyTest.test_node_operations Feb 1, 2023
andrwng
andrwng previously approved these changes Feb 1, 2023
Contributor

@andrwng andrwng left a comment


Taking a step back, this change somewhat reduces our test coverage for mixed versions, e.g. we'll never test adding an old node to the cluster, since nodes will be new upon restarting. It'd be nice to preserve that kind of coverage where we can

@graphcareful
Contributor Author

Modified the solution to instead never fully upgrade the cluster

@graphcareful
Contributor Author

/ci-repeat 2
skip-units
dt-repeat=25
tests/rptest/scale_tests/node_operations_fuzzy_test.py

andrwng
andrwng previously approved these changes Feb 2, 2023
Contributor

@andrwng andrwng left a comment


This solution should be enough to avoid the troubles reported in the issue.

As far as test coverage is concerned, I think we'd get a little more by comparing the nodes in redpanda.started_nodes() with upgraded_nodes in each iteration of the execution loop, and if the former is a subset of the latter (and if we haven't already), upgrade all nodes.
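
A rough sketch of that suggestion, using plain Python sets in place of the real ducktape/rptest node objects (maybe_finish_upgrade is a hypothetical helper, not the actual test code):

```python
def maybe_finish_upgrade(started_nodes, upgraded_nodes, already_done, upgrade_all):
    """Run once per iteration of the execution loop: when every started node is
    already on the new version, upgrade the remaining nodes so that later ADD
    operations cannot re-introduce an old-version node."""
    if already_done:
        return True
    if set(started_nodes) <= set(upgraded_nodes):
        upgrade_all()
        return True
    return False

# Example with dummy node ids: all started nodes are already upgraded,
# so the remaining (stopped) nodes get upgraded too.
done = False
done = maybe_finish_upgrade({1, 2, 3}, {1, 2, 3, 4}, done, lambda: print("upgrade rest"))
```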

tests/rptest/scale_tests/node_operations_fuzzy_test.py (outdated review comment, resolved)
@graphcareful
Contributor Author

CI failure seems to be an instance of: #8589

@graphcareful
Contributor Author

Rebased dev

@graphcareful
Contributor Author

/ci-repeat 2
skip-units
dt-repeat=25
tests/rptest/scale_tests/node_operations_fuzzy_test.py

andrwng
andrwng previously approved these changes Feb 3, 2023
@mmaslankaprv
Member

/ci-repeat 1
skip-units
dt-repeat=25
tests/rptest/scale_tests/node_operations_fuzzy_test.py

@graphcareful
Contributor Author

/ci-repeat 1
skip-units
dt-repeat=25
tests/rptest/scale_tests/node_operations_fuzzy_test.py

@bharathv
Contributor

@graphcareful is this good to go? I ran into this issue again yesterday.

andrwng
andrwng previously approved these changes Feb 22, 2023
@andrwng
Contributor

andrwng commented Feb 22, 2023

I'm fine with the fix here, but it seems like the repeats are unhappy. I recall @graphcareful mentioning some other flakiness, though it wasn't clear whether it was caused by this PR or not.

@graphcareful
Contributor Author

Taking another look now

@graphcareful
Contributor Author

graphcareful commented Feb 22, 2023

Upon further investigation, it looks like the most recent CI failures trip assertions that didn't exist when the initial fix was written. There have since been edits to this test and to other supporting shared utils, and an assertion in one of those is now firing.

[INFO:2023-02-17 17:03:49,034]: RunnerClient: rptest.scale_tests.node_operations_fuzzy_test.NodeOperationFuzzyTest.test_node_operations.enable_failures=False.num_to_upgrade=0.compacted_topics=False: Summary: AssertionError('Node 9 decommissioning stopped making progress')
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/ducktape/tests/runner_client.py", line 135, in run
    data = self.run_test()
  File "/usr/local/lib/python3.10/dist-packages/ducktape/tests/runner_client.py", line 227, in run_test
    return self.test_context.function(self.test)
  File "/usr/local/lib/python3.10/dist-packages/ducktape/mark/_mark.py", line 476, in wrapper
    return functools.partial(f, *args, **kwargs)(*w_args, **w_kwargs)
  File "/root/tests/rptest/services/cluster.py", line 35, in wrapped
    r = f(self, *args, **kwargs)
  File "/root/tests/rptest/scale_tests/node_operations_fuzzy_test.py", line 187, in test_node_operations
    executor.execute_operation(op)
  File "/root/tests/rptest/utils/node_operations.py", line 395, in execute_operation
    self.wait_for_removed(node_id)
  File "/root/tests/rptest/utils/node_operations.py", line 258, in wait_for_removed
    waiter.wait_for_removal()
  File "/root/tests/rptest/utils/node_operations.py", line 176, in wait_for_removal
    assert self._made_progress(
AssertionError: Node 9 decommissioning stopped making progress

All of the failures share the same assertion. The logic I introduced is only enabled when num_to_upgrade is > 0, yet these failures are observed both when num_to_upgrade is 0 and when it is greater than 0. It's possible the fix works but there is another, unrelated bug here.
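
For reference, a simplified, hypothetical sketch of how the new logic is gated (not the exact diff), which is why failures at num_to_upgrade=0 cannot come from this change:

```python
def choose_pinned_node(num_to_upgrade, non_upgraded_nodes):
    # The pinning logic only applies to mixed-version runs.
    if num_to_upgrade > 0:
        # Keep one old-version node out of the decommission pool.
        return non_upgraded_nodes[0]
    return None  # single-version run: nothing to pin

print(choose_pinned_node(0, [4, 5]))  # None -> the fix is inert when num_to_upgrade=0
```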

- The node_operations_fuzzy_test.test_node_operations test was failing
due to the cluster eventually becoming fully upgraded, after which nodes
with older versions of rp would restart and attempt to join.

- The fix is to have one node never restart so that the cluster will
always stay in a partially upgraded state.

- Fixes: redpanda-data#6320
@graphcareful
Contributor Author

/ci-repeat 1
skip-units
dt-repeat=25
tests/rptest/scale_tests/node_operations_fuzzy_test.py

@graphcareful
Contributor Author

/ci-repeat 1
skip-units
dt-repeat=1
tests/rptest/scale_tests/node_operations_fuzzy_test.py

@graphcareful
Contributor Author

The issue observed is also seen on dev: #9052

@vbotbuildovich
Collaborator

The pull request is not merged yet. Cancelling backport...

Workflow run logs.

@graphcareful
Contributor Author

Test no longer exists, closing

This pull request was closed.

Successfully merging this pull request may close these issues.

[23.1.x] CI Failure (Rejecting join request) in NodeOperationFuzzyTest.test_node_operations
5 participants