tests: update ManyPartitionsTest #5816

Merged on Aug 11, 2022 (21 commits)

Conversation

jcsp
Contributor

@jcsp jcsp commented Aug 3, 2022

Cover letter

Updates to ManyPartitionsTest:

  • Make it dynamically select size based on environment: this is useful for developers but also for running on different instance types without having to manually adjust things.
  • Configure retention rules so that the test can run as long as it likes without filling disk.
  • Add a compacted topic test
  • Add a test that runs an OMB workload against a system with many partitions
  • Use new kgo-repeater traffic generator
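As a rough illustration of what "dynamically select size based on environment" could look like, here is a minimal sketch that derives a partition count from the resources of the nodes under test. The `NodeResources` type, the `pick_partition_count` function, and both per-partition limits are hypothetical values chosen for the example; they are not taken from the PR or from Redpanda's actual limits.

```python
from dataclasses import dataclass


@dataclass
class NodeResources:
    """Resources of one broker node (hypothetical helper type)."""
    memory_bytes: int
    cpu_cores: int


def pick_partition_count(nodes, replication_factor=3,
                         partitions_per_core=1000,
                         bytes_per_partition=4 * 1024 * 1024):
    """Return a partition count the cluster can plausibly sustain.

    Both limits are illustrative assumptions, not real Redpanda limits.
    """
    total_cores = sum(n.cpu_cores for n in nodes)
    total_memory = sum(n.memory_bytes for n in nodes)
    # Every partition has `replication_factor` replicas spread over the cluster,
    # so the usable budget is divided by the replication factor.
    by_cpu = total_cores * partitions_per_core // replication_factor
    by_memory = total_memory // (bytes_per_partition * replication_factor)
    # The scarcer resource decides; never return fewer than one partition.
    return max(1, min(by_cpu, by_memory))
```

With this shape, the same test body can run on a small local docker cluster or on large dedicated instances, and only the computed scale changes.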

Other improvements:

  • The KgoRepeaterService comes with a sweet python context manager for running part of a test with background traffic.
  • Check for XFS on dedicated nodes: this saves developer time if they've accidentally broken a /var/lib/redpanda symlink to ephemeral storage
  • Fix for NodeCrash printing
  • Many fixes/improvements to test helpers.
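The context-manager pattern for background traffic can be sketched as follows. `FakeRepeater` and `background_traffic` are hypothetical stand-ins written for this example; they do not reproduce the real `KgoRepeaterService` API, only the general idea of starting load on entry and guaranteeing it stops on exit.

```python
import threading
import time
from contextlib import contextmanager


class FakeRepeater:
    """Stand-in traffic generator; not the real KgoRepeaterService."""

    def __init__(self):
        self.messages = 0
        self._stop = threading.Event()
        self._thread = None

    def start(self):
        self._stop.clear()
        self._thread = threading.Thread(target=self._run, daemon=True)
        self._thread.start()

    def _run(self):
        # Pretend to produce one message per loop until asked to stop.
        while not self._stop.is_set():
            self.messages += 1
            time.sleep(0.001)

    def stop(self):
        self._stop.set()
        self._thread.join()


@contextmanager
def background_traffic(repeater):
    """Run traffic for the duration of the `with` body, then stop it."""
    repeater.start()
    try:
        yield repeater
    finally:
        # Stop even if the test body raises, so no load leaks into later steps.
        repeater.stop()
```

A test would wrap only the steps that need load, for example `with background_traffic(repeater): restart_nodes()`, where `restart_nodes` is whatever the test is exercising.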

Fixes #5389

Backport Required

  • not a bug fix
  • papercut/not impactful enough to backport
  • v22.2.x
  • v22.1.x
  • v21.11.x

UX changes

None

Release notes

  • none

jcsp added 21 commits August 8, 2022 14:28
This was outputting the number; it should have been
outputting the message.
The go dependencies are generally the fastest to build
and should not get held up behind other things:
- Move OMB (Java build) further up
- Split `kaf` install from unrelated non-go stuff.
- Move client-swarm build before go test utils
So that it shows the node name properly
This is for the benefit of scale tests, which would like
to reduce their per-partition outputs to reflect how
a user would configure the system, and to reduce any
overhead from emitting millions of lines.
This wraps the new `kgo-repeater` traffic generator
for scalable load generation.
It is helpful to print the error right at the point
of failure, rather than after the (potentially long
running) backtrace decode & log search jobs.

It'll get printed again later as well, but this way
I can search from the start of the file for the exception
name, and jump straight to the timestamp of the failure.
This is a nasty failure mode where we deploy fresh
packages and accidentally wipe out our /var/lib/redpanda
symlink, resulting in running tests on very slow drives.
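One way such a filesystem check could work is to look up the mount that contains the data directory and compare its type to `xfs`. The sketch below is an assumption about the approach, not the PR's implementation: `filesystem_type` is a hypothetical helper, it takes the text of `/proc/mounts` as an argument so it can be exercised without a real node, and the `/mnt/data` mount point in the usage note is made up.

```python
import os


def filesystem_type(path, mounts_text):
    """Return the filesystem type of the mount that contains `path`.

    `mounts_text` is the content of /proc/mounts. The caller should resolve
    symlinks first (e.g. with os.path.realpath) so that a broken or missing
    symlink shows up as the root filesystem's type.
    """
    path = os.path.normpath(path)
    best_mount, best_type = "", None
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) < 3:
            continue
        mount_point, fs_type = fields[1], fields[2]
        prefix = mount_point.rstrip("/") + "/"
        if path == mount_point or path.startswith(prefix):
            # Prefer the longest (most specific) matching mount point.
            if len(mount_point) > len(best_mount):
                best_mount, best_type = mount_point, fs_type
    return best_type
```

For example, with `/` on ext4 and a hypothetical `/mnt/data` on XFS, a data directory that resolves under `/mnt/data` reports `xfs`, while one that has fallen back to the root disk reports `ext4` and the test can fail early.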
This is an efficiency/quality of life improvement for
working with tests that start larger numbers of nodes.

Leave the default as serial startup, because it makes logs
easier to read.
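A minimal sketch of optional parallel startup with a serial default, assuming a per-node start callable. `start_nodes` and its `parallel` flag are hypothetical names for this example and may not match the helper added in the PR.

```python
from concurrent.futures import ThreadPoolExecutor


def start_nodes(nodes, start_one, parallel=False):
    """Start every node, serially by default or concurrently on request."""
    if not parallel:
        # Serial startup keeps log output in a predictable order.
        for node in nodes:
            start_one(node)
        return
    with ThreadPoolExecutor(max_workers=max(1, len(nodes))) as pool:
        # list() waits for every start and re-raises the first exception.
        list(pool.map(start_one, nodes))
```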
This is useful if a test is running longer than
you expected and you'd like to know how far through
it is without doing your own calculation of message counts.
When using this function to query leadership for partitions,
it is not necessary to exclude partitions just because
they failed to get some metadata from the leader (e.g. NOT_LEADER
errors for offsets during transient leadership change).

Add a `tolerant` flag that permits returning partially populated
RpkPartition results that just show the leader of a partition.
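The behaviour of such a flag can be sketched like this. The row format (dicts with `id`, `leader`, `high_watermark`, and an optional `error`) and the `PartitionInfo` type are hypothetical simplifications for the example; they are not the real rpk output or the `RpkPartition` class.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class PartitionInfo:
    id: int
    leader: Optional[int]
    high_watermark: Optional[int] = None


def parse_partitions(rows, tolerant=False):
    """Turn raw per-partition rows into PartitionInfo objects.

    Strict mode drops rows that carry an error. Tolerant mode keeps them as
    partially populated results that only report the leader.
    """
    result = []
    for row in rows:
        if row.get("error"):
            if not tolerant:
                continue  # strict: skip partitions with incomplete metadata
            result.append(PartitionInfo(row["id"], row.get("leader")))
        else:
            result.append(
                PartitionInfo(row["id"], row["leader"], row["high_watermark"]))
    return result
```

A caller that only needs to know who leads each partition would pass `tolerant=True` and ignore the offset fields.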
The default mode is rather expensive for high partition counts,
and complicates handling systems in transient states when one
or more of the partitions is likely to be undergoing leadership
movement and therefore have NOT_LEADER errors etc in the
default per-partition output.

When all we want to know is the group's state, this lets
us get that.
This enables:
- Running on different instance types without
  hacking the test
- Running on local docker while developing the
  test itself.
I think this is a bug with the workload generator (or, less likely,
a problem with franz-go).  It is usually only a few consumers that disappear
from the group, so it doesn't hurt the validity of the overall scale
test, and we can hunt it down separately.
@jcsp jcsp marked this pull request as ready for review August 8, 2022 13:30
@jcsp jcsp requested review from rishabh96b, ballard26 and travisdowns and removed request for a team, dotnwat, NyaliaLui, mmaslankaprv, ztlpn, VadimPlh and rishabh96b August 8, 2022 13:30
tests/rptest/services/utils.py (resolved)
tests/rptest/services/redpanda.py (resolved)
tests/rptest/services/redpanda.py (resolved)
tests/rptest/services/redpanda.py (resolved)
tests/rptest/services/kgo_repeater_service.py (resolved)
@travisdowns
Member

travisdowns commented Aug 11, 2022

I made it through it! Looks like a nice scale test is shaping up.

I had a few miscellaneous questions and suggestions, nothing major.

@travisdowns travisdowns self-requested a review August 11, 2022 02:51
Member

@travisdowns travisdowns left a comment

LGTM to merge in this state. Apart from some idle questions which don't need to be addressed in the code, the remaining things were all minor nits or style fixes that can be considered optional.

@jcsp
Contributor Author

jcsp commented Aug 11, 2022

Thanks for making it through a lengthy set of commits.

All the comments I've silently marked resolved are addressed in #5970

@jcsp
Contributor Author

jcsp commented Aug 11, 2022

CI failures were apparently a transient issue: https://redpandadata.slack.com/archives/C02LZGSS66M/p1659967842254759

jcsp added a commit to jcsp/redpanda that referenced this pull request Aug 11, 2022
jcsp added a commit to jcsp/redpanda that referenced this pull request Aug 11, 2022
@jcsp jcsp merged commit c037f52 into redpanda-data:dev Aug 11, 2022
@jcsp jcsp deleted the scale-test-update branch August 11, 2022 18:57
jcsp added a commit to jcsp/redpanda that referenced this pull request Aug 11, 2022
@jcsp
Contributor Author

jcsp commented Aug 15, 2022

/backport v22.2.x

jcsp added a commit to vbotbuildovich/redpanda that referenced this pull request Aug 15, 2022
This is followup from PR
redpanda-data#5816

(cherry picked from commit 1689a72)
Linked issue: Scale+compaction tests + make spill_key_index memory size dynamic