Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

tests: adds support for node decommission, leadership transfer to context managers #4610

Conversation

abhijat
Copy link
Contributor

@abhijat abhijat commented May 6, 2022

Cover letter

Adds support to randomly perform leadership transfer and node decommission while code is in a python context. The node decommission operation is reversed at the end of a context, where the node is restarted and re-introduced to the cluster with a node id.

The context can be used as:

with random_decommissions(self.redpanda) as ctx:
            wait_for_segments_removal(redpanda=self.redpanda,
                                      topic=self.topic,
                                      partition_idx=0,
                                      count=6)

and

# leadership transfer requires a target topic
with random_leadership_transfers(self.redpanda, self.topics[0]) as ctx:
    wait_for_segments_removal(redpanda=self.redpanda,
                              topic=self.topic,
                              partition_idx=0,
                              count=6)

The new tests which use these options and also use franz-go based producers and consumers are excluded from CI run and can be tested manually.

adds contexts for leadership transfer and node decommission to the
disruptive actions context managers.

node decommission: the forward action decommissions a random node
in cluster using the admin api. the reverse action restarts the
node and adds it back to cluster with a new node id.

leadership transfer: a test topic is supplied to this context
which has already been created. a specific partition (0) of this
topic is the target of a leadership transfer to some other node
which holds replicas of the partition. there is no reverse action
and no restore action for this operation, as it does not reduce or
increase the cluster capacity like process kill or node decommission.
@abhijat abhijat force-pushed the support-node-decom-leadership-transfer branch from ed19024 to 4c1c2b3 Compare May 20, 2022 05:08
@abhijat abhijat marked this pull request as ready for review May 20, 2022 05:11
@abhijat abhijat force-pushed the support-node-decom-leadership-transfer branch from 4c1c2b3 to e866652 Compare May 20, 2022 05:12
@abhijat
Copy link
Contributor Author

abhijat commented May 20, 2022

error instance of #4807



The following tests FAILED:
--
  | 39 - test_cluster_rpunit (Failed)



*** No errors detected
--
  | *** 1 abandoned failed future(s) detected
  | Failing the test because fail was requested by --fail-on-abandoned-failed-futures
  | Test Exit code 3

@abhijat abhijat added the ci-repeat-5 repeat tests 5x concurrently to check for flakey tests; self-cancelling label May 20, 2022
@vbotbuildovich vbotbuildovich removed the ci-repeat-5 repeat tests 5x concurrently to check for flakey tests; self-cancelling label May 20, 2022
@abhijat abhijat force-pushed the support-node-decom-leadership-transfer branch from e866652 to 2e89096 Compare May 24, 2022 08:17
@abhijat
Copy link
Contributor Author

abhijat commented May 24, 2022

ctx.assert_actions_triggered()

@cluster(num_nodes=6)
def test_write_with_leadership_transfers(self):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is the test_write_with_* tests necessary when the additions to franz_go_verifiable_test also do writes?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The tests added to the franz-go ecosystem are are long running and take more time, so they are skipped in CI runs,they were used while doing manual testing, and once we have some kind of nightly run in place for these long running tests the distruptive tests can be added there.

The tests in e2e module such as test_write_with_leadership_transfers are fast and are part of the CI run, so they are kept separate. They are expected to run and pass with each CI build.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would be good to add a docstring to this comment to explain that it's purpose is to be a mini version of the more full-powered tests in scale_tests/

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

added docstring to the test suite


class FranzGoVerifiableWithSiAndDisruptions(FranzGoVerifiableBase):
MSG_SIZE = 100000
PRODUCE_COUNT = 20000
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This produce may not be running for long enough for any disruptive actions to happen: 100KiB * 20,000 messages is only 2GB, which is a few seconds of IO when running on dedicated EC2 nodes.

Suggest increasing the size to long enough to get at least a couple of minutes of runtime, and maybe it would also be good for the context decorators to assert that they did at least one action while running, so that people writing tests can't accidentally use them across short time periods + create the illusion that they're testing failures when really they're not.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I kept the message size at 100kb after a discussion with @Lazin and @ztlpn where we discussed realistic message sizes in redpanda, but I see the point to ensure some actions are triggered, I will try to increase produce count by a factor of 100 and test locally.

We do have assertion in test that action was triggered here, I did not put it in the context manager itself to give control to the user.

But it might be useful to assert on exiting the context (perhaps on the existence of a flag), will explore this.

# of restarts with lots of partitions, and current high
# metric counts make that sometimes cause reactor stalls
# during shutdown on debug builds.
'disable_metrics': True,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You can strip out this extra_rp_conf -- it became redundant when these tests were moved from tests/ to scale_tests/ , they should all work with defaults now.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

removed, I assume just this one setting has to be removed and not the entire dict?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, I was vague -- I meant the disable_metrics and also the timeout settings. I'm not sure about the replication count setting, it might also be possible to remove that if the topics created during the test have explicit replication counts

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, I'll try removing them

@@ -17,3 +17,4 @@ quick:
- tests/wasm_identity_test.py
- tests/wasm_partition_movement_test.py
- tests/wasm_redpanda_failure_recovery_test.py
- tests/franz_go_verifiable_test.py::FranzGoVerifiableWithSiAndDisruptions
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This shouldn't be needed any more -- it's pointing to tests/franz_go_verifiable_test.py but that is now in scale_tests, so implicitly not run as part of this suite.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks for the explanation, removed from the file

ctx.assert_actions_triggered()

@cluster(num_nodes=6)
def test_write_with_leadership_transfers(self):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would be good to add a docstring to this comment to explain that it's purpose is to be a mini version of the more full-powered tests in scale_tests/

@abhijat abhijat force-pushed the support-node-decom-leadership-transfer branch from 2e89096 to 0cf9124 Compare May 31, 2022 07:44
node decommission and leadership transfer tests are added to SI
e2e tests.
@abhijat abhijat force-pushed the support-node-decom-leadership-transfer branch from 0cf9124 to 27d25b8 Compare May 31, 2022 07:46
@abhijat
Copy link
Contributor Author

abhijat commented Jul 18, 2022

closing this PR for now as there are some concerns around how useful this set of changes is due to the non deterministic nature of failures injected. also there are no tests planning to use these changes at this time.

@abhijat abhijat closed this Jul 18, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants