tests: adds support for node decommission, leadership transfer to context managers #4610

abhijat · 2022-05-06T13:29:06Z

Cover letter

Adds support for randomly performing leadership transfers and node decommissions while code runs inside a Python context. The node decommission is reversed at the end of the context: the node is restarted and re-introduced to the cluster with a new node id.

The context can be used as:

with random_decommissions(self.redpanda) as ctx:
    wait_for_segments_removal(redpanda=self.redpanda,
                              topic=self.topic,
                              partition_idx=0,
                              count=6)

and

# leadership transfer requires a target topic
with random_leadership_transfers(self.redpanda, self.topics[0]) as ctx:
    wait_for_segments_removal(redpanda=self.redpanda,
                              topic=self.topic,
                              partition_idx=0,
                              count=6)

The new tests that use these options together with franz-go based producers and consumers are excluded from the CI run and can be run manually.

Adds contexts for leadership transfer and node decommission to the disruptive actions context managers.

Node decommission: the forward action decommissions a random node in the cluster using the admin API. The reverse action restarts the node and adds it back to the cluster with a new node id.

Leadership transfer: a test topic that has already been created is supplied to this context. A specific partition (0) of this topic is the target of a leadership transfer to some other node that holds a replica of the partition. There is no reverse action and no restore action for this operation, because it does not reduce or increase the cluster capacity the way a process kill or node decommission does.
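The forward/reverse pattern described in this commit message can be sketched roughly as follows. This is a minimal illustration and not the code in this PR: `FakeCluster`, `DisruptiveAction`, `Decommission`, `LeadershipTransfer`, and `disruption` are hypothetical names, the cluster is an in-memory stand-in rather than the real admin API, and the forward action runs once on entry instead of at random times in the background.

```python
import random
from contextlib import contextmanager


class FakeCluster:
    """In-memory stand-in for a cluster (hypothetical, not RedpandaService)."""
    def __init__(self, node_ids):
        self.node_ids = set(node_ids)
        self.next_id = max(node_ids) + 1

    def decommission(self, node_id):
        self.node_ids.remove(node_id)

    def rejoin_with_new_id(self):
        # A restarted node comes back under a fresh node id.
        new_id = self.next_id
        self.next_id += 1
        self.node_ids.add(new_id)
        return new_id


class DisruptiveAction:
    """Hypothetical base class: a forward action with an optional reverse."""
    def __init__(self):
        self.times_triggered = 0

    def forward(self):
        raise NotImplementedError

    def reverse(self):
        pass  # default: nothing to undo


class Decommission(DisruptiveAction):
    def __init__(self, cluster, rng):
        super().__init__()
        self.cluster, self.rng, self.pending = cluster, rng, 0

    def forward(self):
        victim = self.rng.choice(sorted(self.cluster.node_ids))
        self.cluster.decommission(victim)
        self.pending += 1
        self.times_triggered += 1

    def reverse(self):
        # Restore capacity: one rejoin per node removed.
        while self.pending:
            self.cluster.rejoin_with_new_id()
            self.pending -= 1


class LeadershipTransfer(DisruptiveAction):
    """No reverse(): moving leadership does not change cluster capacity."""
    def __init__(self, replicas, leader, rng):
        super().__init__()
        self.replicas, self.leader, self.rng = list(replicas), leader, rng

    def forward(self):
        candidates = [r for r in self.replicas if r != self.leader]
        self.leader = self.rng.choice(candidates)
        self.times_triggered += 1


@contextmanager
def disruption(action):
    action.forward()
    try:
        yield action
    finally:
        action.reverse()  # runs even if the body raises
```

Putting the reverse action in a `finally` block means the cluster capacity is restored even when the code under test fails inside the context.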

abhijat · 2022-05-20T06:18:05Z

The error is an instance of #4807



The following tests FAILED:
  39 - test_cluster_rpunit (Failed)

*** No errors detected
*** 1 abandoned failed future(s) detected
Failing the test because fail was requested by --fail-on-abandoned-failed-futures
Test Exit code 3

abhijat · 2022-05-24T13:16:48Z

The error is an instance of #4634

https://buildkite.com/redpanda/redpanda/builds/10456#67021836-3a5b-4ff2-8870-da9acf6552cf

NyaliaLui · 2022-05-24T13:50:11Z

tests/rptest/tests/e2e_shadow_indexing_test.py

+        ctx.assert_actions_triggered()
+
+    @cluster(num_nodes=6)
+    def test_write_with_leadership_transfers(self):


Why are the test_write_with_* tests necessary when the additions to franz_go_verifiable_test also do writes?

The tests added to the franz-go ecosystem are long running, so they are skipped in CI runs. They were used for manual testing, and once we have some kind of nightly run in place for these long-running tests, the disruptive tests can be added there.

The tests in e2e module such as test_write_with_leadership_transfers are fast and are part of the CI run, so they are kept separate. They are expected to run and pass with each CI build.

Would be good to add a docstring to this test to explain that its purpose is to be a mini version of the more full-powered tests in scale_tests/

added docstring to the test suite

jcsp · 2022-05-30T08:56:35Z

tests/rptest/scale_tests/franz_go_verifiable_test.py

+
+class FranzGoVerifiableWithSiAndDisruptions(FranzGoVerifiableBase):
+    MSG_SIZE = 100000
+    PRODUCE_COUNT = 20000


This produce may not run for long enough for any disruptive actions to happen: 100 KB * 20,000 messages is only 2 GB, which is a few seconds of IO when running on dedicated EC2 nodes.

Suggest increasing the amount of data enough to get at least a couple of minutes of runtime. It might also be good for the context decorators to assert that they performed at least one action while running, so that people writing tests can't accidentally use them across short time periods and create the illusion that they're testing failures when really they're not.

I kept the message size at 100 KB after a discussion with @Lazin and @ztlpn about realistic message sizes in redpanda, but I see the point about ensuring that some actions are triggered. I will try increasing the produce count by a factor of 100 and test locally.
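The arithmetic behind this exchange can be checked with a short calculation. `MSG_SIZE` and `PRODUCE_COUNT` come from the diff above. The throughput figure is only an assumption, chosen so that 2 GB takes "a few seconds"; the real rate on dedicated EC2 nodes is not stated in this thread.

```python
MSG_SIZE = 100_000        # bytes, from the diff
PRODUCE_COUNT = 20_000    # messages, from the diff

# ASSUMPTION: sustained produce throughput in bytes/second (not from the PR).
ASSUMED_THROUGHPUT = 500_000_000


def total_bytes(msg_size, count):
    return msg_size * count


def runtime_seconds(msg_size, count, throughput):
    return total_bytes(msg_size, count) / throughput


def min_produce_count(msg_size, throughput, target_runtime_s):
    """Smallest message count that keeps the producer busy for the target time."""
    needed = throughput * target_runtime_s
    return -(-needed // msg_size)  # integer ceiling division


current = runtime_seconds(MSG_SIZE, PRODUCE_COUNT, ASSUMED_THROUGHPUT)
scaled = runtime_seconds(MSG_SIZE, PRODUCE_COUNT * 100, ASSUMED_THROUGHPUT)
two_minutes = min_produce_count(MSG_SIZE, ASSUMED_THROUGHPUT, 120)
print(f"current: {current:.0f}s, x100: {scaled:.0f}s, "
      f"count for 120s: {two_minutes}")
```

Under that assumed rate, the original settings would finish in about 4 seconds, while the proposed factor of 100 would run for several minutes, comfortably above the suggested couple of minutes.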

We do have an assertion in the test that an action was triggered here. I did not put it in the context manager itself, to leave control with the user.

But it might be useful to assert on exiting the context (perhaps on the existence of a flag), will explore this.
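One possible shape for the opt-in check on exit is sketched below. Only the method name `assert_actions_triggered` appears in this PR's diff; the `ActionTracker` class, the `require_action` flag, and the `record` method are hypothetical names used for illustration.

```python
class ActionTracker:
    """Hypothetical context manager that can insist at least one action ran."""

    def __init__(self, require_action=False):
        self.require_action = require_action
        self.times_triggered = 0

    def record(self):
        # Would be called by the disruptive action each time it fires.
        self.times_triggered += 1

    def assert_actions_triggered(self):
        assert self.times_triggered > 0, \
            "no disruptive action ran while the context was active"

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc, tb):
        # Check only on a clean exit so a failure in the body is not masked.
        if exc_type is None and self.require_action:
            self.assert_actions_triggered()
        return False  # never swallow exceptions from the body
```

With the flag off, the caller keeps control and can still call `assert_actions_triggered()` explicitly, which matches the current behaviour described above. With the flag on, a context that exits without any action fails the test automatically.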

jcsp · 2022-05-30T08:57:53Z

tests/rptest/scale_tests/franz_go_verifiable_test.py

+                # of restarts with lots of partitions, and current high
+                # metric counts make that sometimes cause reactor stalls
+                # during shutdown on debug builds.
+                'disable_metrics': True,


You can strip out this extra_rp_conf -- it became redundant when these tests were moved from tests/ to scale_tests/ , they should all work with defaults now.

removed, I assume just this one setting has to be removed and not the entire dict?

Sorry, I was vague -- I meant the disable_metrics and also the timeout settings. I'm not sure about the replication count setting, it might also be possible to remove that if the topics created during the test have explicit replication counts

Thanks, I'll try removing them

jcsp · 2022-05-30T08:59:27Z

tests/rptest/test_suite_quick.yml

@@ -17,3 +17,4 @@ quick:
  - tests/wasm_identity_test.py
  - tests/wasm_partition_movement_test.py
  - tests/wasm_redpanda_failure_recovery_test.py
+  - tests/franz_go_verifiable_test.py::FranzGoVerifiableWithSiAndDisruptions


This shouldn't be needed any more -- it's pointing to tests/franz_go_verifiable_test.py but that is now in scale_tests, so implicitly not run as part of this suite.

thanks for the explanation, removed from the file

jcsp · 2022-05-30T09:00:17Z

tests/rptest/tests/e2e_shadow_indexing_test.py

+        ctx.assert_actions_triggered()
+
+    @cluster(num_nodes=6)
+    def test_write_with_leadership_transfers(self):


Would be good to add a docstring to this test to explain that its purpose is to be a mini version of the more full-powered tests in scale_tests/

node decommission and leadership transfer tests are added to SI e2e tests.

abhijat · 2022-07-18T09:57:46Z

Closing this PR for now, as there are some concerns about how useful this set of changes is, given the non-deterministic nature of the failures injected. Also, no tests are planned to use these changes at this time.

abhijat force-pushed the support-node-decom-leadership-transfer branch from 4656c8d to ed19024 Compare May 20, 2022 04:28

github-actions bot added area/build area/k8s area/redpanda area/rpk labels May 20, 2022

abhijat force-pushed the support-node-decom-leadership-transfer branch from ed19024 to 4c1c2b3 Compare May 20, 2022 05:08

github-actions bot removed area/k8s area/build area/redpanda area/rpk labels May 20, 2022

abhijat marked this pull request as ready for review May 20, 2022 05:11

abhijat requested review from dotnwat and NyaliaLui as code owners May 20, 2022 05:11

abhijat force-pushed the support-node-decom-leadership-transfer branch from 4c1c2b3 to e866652 Compare May 20, 2022 05:12

abhijat added the ci-repeat-5 label (repeat tests 5x concurrently to check for flakey tests; self-cancelling) May 20, 2022

vbotbuildovich removed the ci-repeat-5 label (repeat tests 5x concurrently to check for flakey tests; self-cancelling) May 20, 2022

abhijat force-pushed the support-node-decom-leadership-transfer branch from e866652 to 2e89096 Compare May 24, 2022 08:17

NyaliaLui reviewed May 24, 2022

jcsp reviewed May 30, 2022

abhijat force-pushed the support-node-decom-leadership-transfer branch from 2e89096 to 0cf9124 Compare May 31, 2022 07:44

tests: adds tests for new disruptive operations

27d25b8

node decommission and leadership transfer tests are added to SI e2e tests.

abhijat force-pushed the support-node-decom-leadership-transfer branch from 0cf9124 to 27d25b8 Compare May 31, 2022 07:46

abhijat closed this Jul 18, 2022
