-
Notifications
You must be signed in to change notification settings - Fork 577
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
tests: adds support for node decommission, leadership transfer to context managers #4610
tests: adds support for node decommission, leadership transfer to context managers #4610
Conversation
4656c8d
to
ed19024
Compare
adds contexts for leadership transfer and node decommission to the disruptive actions context managers. node decommission: the forward action decommissions a random node in cluster using the admin api. the reverse action restarts the node and adds it back to cluster with a new node id. leadership transfer: a test topic is supplied to this context which has already been created. a specific partition (0) of this topic is the target of a leadership transfer to some other node which holds replicas of the partition. there is no reverse action and no restore action for this operation, as it does not reduce or increase the cluster capacity like process kill or node decommission.
ed19024
to
4c1c2b3
Compare
4c1c2b3
to
e866652
Compare
error instance of #4807
|
e866652
to
2e89096
Compare
ctx.assert_actions_triggered() | ||
|
||
@cluster(num_nodes=6) | ||
def test_write_with_leadership_transfers(self): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why is the test_write_with_*
tests necessary when the additions to franz_go_verifiable_test also do writes?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The tests added to the franz-go ecosystem are are long running and take more time, so they are skipped in CI runs,they were used while doing manual testing, and once we have some kind of nightly run in place for these long running tests the distruptive tests can be added there.
The tests in e2e module such as test_write_with_leadership_transfers
are fast and are part of the CI run, so they are kept separate. They are expected to run and pass with each CI build.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would be good to add a docstring to this comment to explain that it's purpose is to be a mini version of the more full-powered tests in scale_tests/
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
added docstring to the test suite
|
||
class FranzGoVerifiableWithSiAndDisruptions(FranzGoVerifiableBase): | ||
MSG_SIZE = 100000 | ||
PRODUCE_COUNT = 20000 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This produce may not be running for long enough for any disruptive actions to happen: 100KiB * 20,000 messages is only 2GB, which is a few seconds of IO when running on dedicated EC2 nodes.
Suggest increasing the size to long enough to get at least a couple of minutes of runtime, and maybe it would also be good for the context decorators to assert that they did at least one action while running, so that people writing tests can't accidentally use them across short time periods + create the illusion that they're testing failures when really they're not.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I kept the message size at 100kb after a discussion with @Lazin and @ztlpn where we discussed realistic message sizes in redpanda, but I see the point to ensure some actions are triggered, I will try to increase produce count by a factor of 100 and test locally.
We do have assertion in test that action was triggered here, I did not put it in the context manager itself to give control to the user.
But it might be useful to assert on exiting the context (perhaps on the existence of a flag), will explore this.
# of restarts with lots of partitions, and current high | ||
# metric counts make that sometimes cause reactor stalls | ||
# during shutdown on debug builds. | ||
'disable_metrics': True, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You can strip out this extra_rp_conf -- it became redundant when these tests were moved from tests/ to scale_tests/ , they should all work with defaults now.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
removed, I assume just this one setting has to be removed and not the entire dict?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry, I was vague -- I meant the disable_metrics and also the timeout settings. I'm not sure about the replication count setting, it might also be possible to remove that if the topics created during the test have explicit replication counts
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks, I'll try removing them
tests/rptest/test_suite_quick.yml
Outdated
@@ -17,3 +17,4 @@ quick: | |||
- tests/wasm_identity_test.py | |||
- tests/wasm_partition_movement_test.py | |||
- tests/wasm_redpanda_failure_recovery_test.py | |||
- tests/franz_go_verifiable_test.py::FranzGoVerifiableWithSiAndDisruptions |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This shouldn't be needed any more -- it's pointing to tests/franz_go_verifiable_test.py but that is now in scale_tests, so implicitly not run as part of this suite.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
thanks for the explanation, removed from the file
ctx.assert_actions_triggered() | ||
|
||
@cluster(num_nodes=6) | ||
def test_write_with_leadership_transfers(self): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would be good to add a docstring to this comment to explain that it's purpose is to be a mini version of the more full-powered tests in scale_tests/
2e89096
to
0cf9124
Compare
node decommission and leadership transfer tests are added to SI e2e tests.
0cf9124
to
27d25b8
Compare
closing this PR for now as there are some concerns around how useful this set of changes is due to the non deterministic nature of failures injected. also there are no tests planning to use these changes at this time. |
Cover letter
Adds support to randomly perform leadership transfer and node decommission while code is in a python context. The node decommission operation is reversed at the end of a context, where the node is restarted and re-introduced to the cluster with a node id.
The context can be used as:
and
The new tests which use these options and also use franz-go based producers and consumers are excluded from CI run and can be tested manually.