Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

tests: Add random action injector #4404

Merged
merged 2 commits into from
May 7, 2022

Conversation

abhijat
Copy link
Contributor

@abhijat abhijat commented Apr 25, 2022

Cover letter

Adds random action injector utilities, aimed to be used for e2e tests. The failures could be process failures, node decommission, leadership transfer etc, controlled by a context manager.

The action injector runs on a thread and periodically introduces changes on randomly selected nodes in the cluster.

Features

  • The new context can be used as following
with random_process_kills(self.redpanda):
    do_something()

Changes in force push

  • clean up unused code based on review comments
  • simplify test cases based on review comments

Changes in force push

  • clean up result handling

Changes in force push:

  • resolve conflicts with upstream dev

Changes in force push

  • fold everything back into two logical commits
  • remove changes which were only import reordering

Changes in force push

  • system restoration code removed
  • fix assertion to only check if action was triggered by test

@abhijat abhijat marked this pull request as draft April 25, 2022 05:17
@abhijat abhijat force-pushed the add-random-failure-injector branch 2 times, most recently from d95c822 to 4911872 Compare April 26, 2022 11:50
@abhijat abhijat marked this pull request as ready for review April 26, 2022 15:22
@abhijat abhijat added the ci-repeat-5 repeat tests 5x concurrently to check for flakey tests; self-cancelling label Apr 26, 2022
graphcareful
graphcareful previously approved these changes Apr 26, 2022
Copy link
Contributor

@graphcareful graphcareful left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Really nice job, LGTM

Copy link
Contributor

@jcsp jcsp left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Structure looks great.

Some comments - I think I might have commented on code that isn't used yet (e.g. the decom/recom actions)

tests/rptest/services/redpanda.py Outdated Show resolved Hide resolved
tests/rptest/tests/e2e_shadow_indexing_test.py Outdated Show resolved Hide resolved
tests/rptest/services/action_injector.py Outdated Show resolved Hide resolved
tests/rptest/services/action_injector.py Outdated Show resolved Hide resolved
tests/rptest/services/action_injector.py Outdated Show resolved Hide resolved
tests/rptest/services/action_injector.py Outdated Show resolved Hide resolved
tests/rptest/services/action_injector.py Outdated Show resolved Hide resolved
@jcsp
Copy link
Contributor

jcsp commented Apr 26, 2022

Thank you for adding release notes, but those are usually used for customer-facing release documentation, so we generally don't mention changes to our internal test code.

tests/rptest/services/action_injector.py Outdated Show resolved Hide resolved
tests/rptest/services/action_injector.py Outdated Show resolved Hide resolved
tests/rptest/services/action_injector.py Outdated Show resolved Hide resolved
tests/rptest/services/action_injector.py Outdated Show resolved Hide resolved
tests/rptest/services/action_injector.py Outdated Show resolved Hide resolved
tests/rptest/services/action_injector.py Outdated Show resolved Hide resolved
tests/rptest/services/action_injector.py Outdated Show resolved Hide resolved
tests/rptest/services/action_injector.py Outdated Show resolved Hide resolved
tests/rptest/services/action_injector.py Outdated Show resolved Hide resolved
@vbotbuildovich vbotbuildovich removed the ci-repeat-5 repeat tests 5x concurrently to check for flakey tests; self-cancelling label Apr 27, 2022
@abhijat abhijat force-pushed the add-random-failure-injector branch from 1fa228d to bfff79c Compare April 27, 2022 07:06
@abhijat
Copy link
Contributor Author

abhijat commented Apr 27, 2022

@jcsp @rystsov I will focus this PR on the action which kills processes. I have reduced the NodeDecommission and LeadershipTransfer actions to skeleton classes, and will focus on concrete implementation for them in subsequent PRs after more research based on the comments here.

If it is better that I should remove the skeleton classes altogether, please let me know.

@abhijat abhijat force-pushed the add-random-failure-injector branch from 31df1a8 to eb9ecb7 Compare April 27, 2022 08:31
@abhijat abhijat added the ci-repeat-5 repeat tests 5x concurrently to check for flakey tests; self-cancelling label Apr 27, 2022
@vbotbuildovich vbotbuildovich removed the ci-repeat-5 repeat tests 5x concurrently to check for flakey tests; self-cancelling label Apr 27, 2022
@graphcareful
Copy link
Contributor

One comment about our PR flow, when iterating we don't push commits on top of already reviewed commits, we just edit/rebase existing commits. If you rebase with --keep-base reviewers will be able to see the diff of the PR between force pushes.

jcsp
jcsp previously approved these changes Apr 27, 2022
Copy link
Contributor

@jcsp jcsp left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM when CI passes: I like the overall structure + am okay with having the not-yet-implemented actions in there as stubs.

Needs a re-check from @rystsov

Copy link
Contributor

@rystsov rystsov left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's better to remove the code we don't use. Code has gravity so the more we push the harder it change later; also it creates more load on the reviewer and if we never follow up it's instant dead code.

Also please follow @graphcareful advice and rewrite the history, we tend to care about the shape of the commit history - https://github.com/redpanda-data/redpanda/blob/dev/CONTRIBUTING.md#commit-history

tests/rptest/services/action_injector.py Outdated Show resolved Hide resolved
tests/rptest/services/action_injector.py Outdated Show resolved Hide resolved
tests/rptest/services/action_injector.py Outdated Show resolved Hide resolved
tests/rptest/services/action_injector.py Outdated Show resolved Hide resolved
tests/rptest/services/action_injector.py Outdated Show resolved Hide resolved
tests/rptest/services/action_injector.py Outdated Show resolved Hide resolved
tests/rptest/services/action_injector.py Outdated Show resolved Hide resolved
tests/rptest/services/action_injector.py Outdated Show resolved Hide resolved
tests/rptest/services/admin.py Outdated Show resolved Hide resolved
tests/rptest/tests/e2e_shadow_indexing_test.py Outdated Show resolved Hide resolved
@abhijat
Copy link
Contributor Author

abhijat commented Apr 28, 2022

One comment about our PR flow, when iterating we don't push commits on top of already reviewed commits, we just edit/rebase existing commits. If you rebase with --keep-base reviewers will be able to see the diff of the PR between force pushes.

@graphcareful I have used force push with --keep-base for the latest commit editing the last approved commit, please let me know if the history looks more acceptable now.

I have also added links to description of force push in the cover letter after discussing with Evgeny so it is easier for reviewers.

@abhijat abhijat force-pushed the add-random-failure-injector branch 2 times, most recently from 9f90dc6 to 3d95f85 Compare May 5, 2022 09:46
@abhijat abhijat requested a review from rystsov May 5, 2022 09:47
@abhijat abhijat force-pushed the add-random-failure-injector branch from 3d95f85 to 3d6686f Compare May 5, 2022 09:52
@abhijat abhijat added the ci-repeat-5 repeat tests 5x concurrently to check for flakey tests; self-cancelling label May 5, 2022
@abhijat
Copy link
Contributor Author

abhijat commented May 6, 2022

@rystsov please review, I have cleaned up the restoration code and replaced log with boolean that the test can use to assert internal state of the thread.

@vbotbuildovich vbotbuildovich removed the ci-repeat-5 repeat tests 5x concurrently to check for flakey tests; self-cancelling label May 6, 2022
rystsov
rystsov previously approved these changes May 6, 2022
Copy link
Contributor

@rystsov rystsov left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good!

tests/rptest/services/action_injector.py Outdated Show resolved Hide resolved
@rystsov
Copy link
Contributor

rystsov commented May 6, 2022

Some of the tests are failing, please for each failing test check if it's a know flaky test if it's then add a link to this build to the flaky issue and comment this PR to reassure that the failing tests aren't related to this PR.

For new failing tests you should investigate if it's related to this PR. If they are - fix the PR, if they aren't then you should open ci-failure tagged issues.

@abhijat
Copy link
Contributor Author

abhijat commented May 6, 2022

new issues #4601 and #4602 for ci failures seen

a new context manager is added which runs a background thread,
injecting actions into a redpanda cluster, and optionally reversing
them.
@abhijat
Copy link
Contributor Author

abhijat commented May 6, 2022

@rystsov fixed the name and added links to CI failures

@abhijat
Copy link
Contributor Author

abhijat commented May 6, 2022

Copy link
Contributor

@jcsp jcsp left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is good to go from my pov

@abhijat abhijat merged commit d7595a6 into redpanda-data:dev May 7, 2022
@abhijat
Copy link
Contributor Author

abhijat commented May 7, 2022

/backport v22.1.x

@vbotbuildovich
Copy link
Collaborator

Failed to run cherry-pick command. see workflow
I executed the below command:

git cherry-pick -x 551045ec62d06c97825e804602aa2478f9d0f382 889747877f0cdc0c16ef82e9b5499a83e595c912

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants