
Fix for hang on node joining cluster in NodeOperationFuzzyTest.test_node_operations #8559

Conversation

graphcareful
Contributor

@graphcareful graphcareful commented Feb 1, 2023

  • The test fails when the cluster has already been fully upgraded and a node running an older version of rp then tries to join; that join request is rejected.

  • This happens because the test randomly adds/recommissions/decommissions nodes. The test starts with a 5-node cluster, 3 of which are upgraded to the HEAD version while 2 remain on the previous version.

  • The randomness is why the test sometimes passes. Random operations are performed on the cluster during the run; if both older-version nodes happen to be decommissioned, the cluster becomes fully upgraded. When those 2 nodes are later reused for ADD operations, a node with an older version of rp attempts to join and is rejected.

  • The fix is to never fully upgrade the cluster: one of the non-upgraded nodes is taken out of consideration for decommission requests (see the sketch below).
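
A minimal, hypothetical sketch of the approach (node ids, variable names, and pick_decommission_target are illustrative, not the actual rptest code):

```python
import random

# Node ids 1-5, with 1-3 upgraded to HEAD and 4-5 still on the previous version.
all_nodes = [1, 2, 3, 4, 5]
upgraded_nodes = {1, 2, 3}

# Pin one non-upgraded node: it is never offered as a decommission target, so
# the cluster can never end up fully upgraded and an old-version node can
# always rejoin after being stopped and re-added.
non_upgraded = [n for n in all_nodes if n not in upgraded_nodes]
pinned_old_node = random.choice(non_upgraded)

def pick_decommission_target(active_nodes):
    """Choose a node to decommission, never the pinned old-version node."""
    candidates = [n for n in active_nodes if n != pinned_old_node]
    return random.choice(candidates)

print(pick_decommission_target(all_nodes))
```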

Fixes: #6320

Backports Required

  • none - not a bug fix
  • none - issue does not exist in previous branches
  • none - papercut/not impactful enough to backport
  • v22.3.x
  • v22.2.x
  • v22.1.x

Release Notes

  • none

@graphcareful graphcareful changed the title rptest: Cluster eventually upgrade in nodeops_test Fix for hang on node joining cluster in NodeOperationFuzzyTest.test_node_operations Feb 1, 2023
andrwng
andrwng previously approved these changes Feb 1, 2023
Contributor

@andrwng andrwng left a comment


Taking a step back, this change somewhat reduces our test coverage for mixed versions, e.g. we'll never test adding an old node to the cluster, since nodes will be new upon restarting. It'd be nice to preserve that kind of coverage where we can

@graphcareful
Contributor Author

Modified the solution to instead never fully upgrade the cluster

@graphcareful
Contributor Author

/ci-repeat 2
skip-units
dt-repeat=25
tests/rptest/scale_tests/node_operations_fuzzy_test.py

andrwng
andrwng previously approved these changes Feb 2, 2023
Contributor

@andrwng andrwng left a comment


This solution should be enough to avoid the troubles reported in the issue.

As far as test coverage is concerned, I think we'd get a little more by comparing the nodes in redpanda.started_nodes() with upgraded_nodes in each iteration of the execution loop, and if the former is a subset of the latter (and if we haven't already), upgrade all nodes.
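
A rough sketch of that suggestion, using plain Python sets in place of the real ducktape/rptest node objects (maybe_finish_upgrade is a hypothetical helper, not the actual test code):

```python
def maybe_finish_upgrade(started_nodes, upgraded_nodes, already_done, upgrade_all):
    """Run once per iteration of the execution loop: when every started node is
    already on the new version, upgrade the remaining nodes so that later ADD
    operations cannot re-introduce an old-version node."""
    if already_done:
        return True
    if set(started_nodes) <= set(upgraded_nodes):
        upgrade_all()
        return True
    return False

# Example with dummy node ids: all started nodes are already upgraded,
# so the remaining (stopped) nodes get upgraded too.
done = False
done = maybe_finish_upgrade({1, 2, 3}, {1, 2, 3, 4}, done, lambda: print("upgrade rest"))
```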

tests/rptest/scale_tests/node_operations_fuzzy_test.py (outdated review comment, resolved)
@graphcareful
Contributor Author

CI failure seems to be an instance of: #8589

@graphcareful
Contributor Author

Rebased dev

@graphcareful
Contributor Author

/ci-repeat 2
skip-units
dt-repeat=25
tests/rptest/scale_tests/node_operations_fuzzy_test.py

andrwng
andrwng previously approved these changes Feb 3, 2023
@mmaslankaprv
Member

/ci-repeat 1
skip-units
dt-repeat=25
tests/rptest/scale_tests/node_operations_fuzzy_test.py

@graphcareful
Contributor Author

/ci-repeat 1
skip-units
dt-repeat=25
tests/rptest/scale_tests/node_operations_fuzzy_test.py

@bharathv
Contributor

@graphcareful is this good to go? I ran into this issue again yesterday.

andrwng
andrwng previously approved these changes Feb 22, 2023
@andrwng
Contributor

andrwng commented Feb 22, 2023

I'm fine with the fix here, but it seems like the repeats are unhappy. I recall @graphcareful mentioning some other flakiness, though it wasn't clear whether it was caused by this PR or not.

@graphcareful
Contributor Author

Taking another look now

@graphcareful
Contributor Author

graphcareful commented Feb 22, 2023

Upon further investigation, it looks like the most recent CI failures trip assertions that didn't exist when the initial fix was written. There have since been edits to this test and to other supporting shared utils, and an assertion in one of those is now firing.

[INFO:2023-02-17 17:03:49,034]: RunnerClient: rptest.scale_tests.node_operations_fuzzy_test.NodeOperationFuzzyTest.test_node_operations.enable_failures=False.num_to_upgrade=0.compacted_topics=False: Summary: AssertionError('Node 9 decommissioning stopped making progress')
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/ducktape/tests/runner_client.py", line 135, in run
    data = self.run_test()
  File "/usr/local/lib/python3.10/dist-packages/ducktape/tests/runner_client.py", line 227, in run_test
    return self.test_context.function(self.test)
  File "/usr/local/lib/python3.10/dist-packages/ducktape/mark/_mark.py", line 476, in wrapper
    return functools.partial(f, *args, **kwargs)(*w_args, **w_kwargs)
  File "/root/tests/rptest/services/cluster.py", line 35, in wrapped
    r = f(self, *args, **kwargs)
  File "/root/tests/rptest/scale_tests/node_operations_fuzzy_test.py", line 187, in test_node_operations
    executor.execute_operation(op)
  File "/root/tests/rptest/utils/node_operations.py", line 395, in execute_operation
    self.wait_for_removed(node_id)
  File "/root/tests/rptest/utils/node_operations.py", line 258, in wait_for_removed
    waiter.wait_for_removal()
  File "/root/tests/rptest/utils/node_operations.py", line 176, in wait_for_removal
    assert self._made_progress(
AssertionError: Node 9 decommissioning stopped making progress

All of the failures share the same assertion. The logic I introduced is only enabled when num_to_upgrade is > 0, yet these failures are observed both when num_to_upgrade is 0 and when it is greater than 0. It's possible the fix works but there is another, unrelated bug here.
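
For reference, a simplified, hypothetical sketch of how the new logic is gated (not the exact diff), which is why failures at num_to_upgrade=0 cannot come from this change:

```python
def choose_pinned_node(num_to_upgrade, non_upgraded_nodes):
    # The pinning logic only applies to mixed-version runs.
    if num_to_upgrade > 0:
        # Keep one old-version node out of the decommission pool.
        return non_upgraded_nodes[0]
    return None  # single-version run: nothing to pin

print(choose_pinned_node(0, [4, 5]))  # None -> the fix is inert when num_to_upgrade=0
```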

- The node_operations_fuzzy_test.test_node_operations test was failing
due to the cluster eventually becoming fully upgraded, after which nodes
with older versions of rp would restart and attempt to join.

- The fix is to have one node never restart so that the cluster will
always stay in a partially upgraded state.

- Fixes: redpanda-data#6320
@graphcareful
Contributor Author

/ci-repeat 1
skip-units
dt-repeat=25
tests/rptest/scale_tests/node_operations_fuzzy_test.py

@graphcareful
Contributor Author

/ci-repeat 1
skip-units
dt-repeat=1
tests/rptest/scale_tests/node_operations_fuzzy_test.py

@graphcareful
Contributor Author

The issue observed is also seen on dev: #9052

@vbotbuildovich
Collaborator

The pull request is not merged yet. Cancelling backport...

Workflow run logs.

@graphcareful
Contributor Author

Test no longer exists, closing

This pull request was closed.

Successfully merging this pull request may close these issues.

[23.1.x] CI Failure (Rejecting join request) in NodeOperationFuzzyTest.test_node_operations
5 participants