Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Reduce duration of partition_movement_test from 25min to 8min #5238

Draft
wants to merge 7 commits into
base: dev
Choose a base branch
from

Conversation

rystsov
Copy link
Contributor

@rystsov rystsov commented Jun 27, 2022

Cover letter

  • reduces duration of partition_movement_test from 25min to 8min
  • introduce a workload generator / consistency checker to verified producer / consumers

The new verifier is based on the verifiers we use in chaos tests. The key features of this verifier:

Release notes

  • none

@rystsov rystsov requested review from a team, dotnwat and NyaliaLui as code owners June 27, 2022 02:25
@rystsov rystsov requested review from ivotron, jcsp, mmaslankaprv, andrewhsu and andrwng and removed request for a team June 27, 2022 02:25
@rystsov rystsov force-pushed the test-workloads branch 5 times, most recently from d019eec to 90a704e Compare June 27, 2022 21:30
Copy link
Contributor

@jcsp jcsp left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Most of the changes focussed on the test itself are fine, but I worry that we are repeating the same mistake of adding new workloads rather than extending/adapting existing ones.

I'm totally on board with removing KafProducer, but when it comes to replacements, we already have FranzGoVerifiableProducer which does pretty similar stuff to the new workloads added here (and I think the Java VerifiableProducer is not so different either?). If we were consolidating one one language I could understand, but the new workloads are in a mixture of languages too, so I'm left kind of uncertain about the motivation/direction.

tests/docker/Dockerfile Outdated Show resolved Hide resolved
tests/docker/Dockerfile Show resolved Hide resolved
tests/rptest/remote_scripts/control/rw_alive.sh Outdated Show resolved Hide resolved
tests/rptest/remote_scripts/w_workload.py Outdated Show resolved Hide resolved
tests/rptest/remote_scripts/w_workload.py Outdated Show resolved Hide resolved
tests/rptest/services/w_workload.py Outdated Show resolved Hide resolved
tests/rptest/remote_scripts/w_workload.py Outdated Show resolved Hide resolved
@NyaliaLui
Copy link
Contributor

@rystsov What's the status on this PR? Does it need review or are we waiting for some changes?

@rystsov rystsov force-pushed the test-workloads branch 4 times, most recently from d7084cb to d62a62e Compare July 14, 2022 19:39
@rystsov rystsov requested a review from jcsp July 14, 2022 19:39
@rystsov
Copy link
Contributor Author

rystsov commented Jul 14, 2022

@NyaliaLui I've addressed the comments and it waits for review

@rystsov
Copy link
Contributor Author

rystsov commented Jul 14, 2022

Most of the changes focused on the test itself are fine, but I worry that we are repeating the same mistake of adding new workloads rather than extending/adapting existing ones.

@jcsp The new verifier follows a different approach to testing so extending the existing verifiers would require more work. Once this PR is merged I'll push for giving up on VerifiableProducer, VerifiableConsumer and KafProducer. I've described the key features of this verifier in the cover letter.

Copy link
Contributor

@NyaliaLui NyaliaLui left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like where this is going and the fact that the tool has helped detect a problem adds to the rw_worloads value.

tests/rptest/services/rw_workload.py Outdated Show resolved Hide resolved
@mmedenjak mmedenjak added the kind/enhance New feature or request label Jul 20, 2022
@rystsov rystsov force-pushed the test-workloads branch 2 times, most recently from 97d6305 to 43f2b83 Compare September 1, 2022 00:22
NyaliaLui
NyaliaLui previously approved these changes Sep 1, 2022
Copy link
Contributor

@NyaliaLui NyaliaLui left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Y'all the PR was ready to merge a while ago, I fixed the conflicts and it's ready to go. IMHO we may do an exception and let it go in despite the blockers. They aren't caused by the PR and the PR introduces a verifier to solve the "Consumer failed to consume up to offsets" ci-failure. What do you think?

LGTM

Moving test debs which change rarely up the docker file to
prevent long rebuilds caused by frequently changed debs in
the head of the dockerfile before them
we don't use kafka-verifier, removing the dead code
it's wasteful to download the same dependencies over and over
introducing a verifiers project to unite the java projects to
share debs
rw_verifier is a replacement for verified producer and
consumer services. rw_verifier is a workload generator
and an online linear and incremental consistency checker
capable of running for hours without an overhead.

The validation happens as the write / read go. No need
to pass written/read data to the test over ssh for vali-
dation.

The app exposes control and metrics over http. So there
is no need to eyeball the number of records:
 - start the workload
 - use the api to check that the app doing a progress at
   key points of the test
 - stop the app when the need without waiting for the
   remaining messages to pass
duration of partition_movement_test.py falls down from
14 minutes 15.077 seconds to 12 minutes 39.137 seconds
duration of partition_movement_test.py falls down from
12 minutes 39.137 seconds to 8 minutes 53.753 seconds
Copy link
Member

@dotnwat dotnwat left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Generally looks good to me.

  • I think several files need a copyright header
  • The set of features / benefits listed in the cover letter are impressive, but it's not clear why "Reduce duration of partition_movement_test from 25min to 8min" this is true. Is there some insight into the time savings?

tests/rptest/tests/partition_movement_test.py Show resolved Hide resolved
@rystsov
Copy link
Contributor Author

rystsov commented Dec 21, 2022

but it's not clear why "Reduce duration of partition_movement_test from 25min to 8min" this is true. Is there some insight into the time savings?

it was true back in summer but now the benefits should be smaller, the major boost came from optimizing wait_until which was put in a separate PR since them the second best boost came from using logic like:

  • start continuous workload
  • await n updates
  • perform partition movement
  • await n updates
  • stop the workload

instead of:

  • start workload & schedule N operations (where N is big enough)
  • perform partition movement
  • wait until the workload is stopped

I'll change the description to avoid the confusion


static volatile long started_s = -1;
static int workload_id_gen = 0;
static synchronized int get_workload_id() { return workload_id_gen++; }
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can we stick to the same naming convention for all functions/methods i.e. camelCase ?


HashMap<Integer, Queue<Long>> oids;
HashMap<Long, OpInfo> ops;
HashMap<Integer, Semaphore> rsems;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

would it be possible to make those variable names more descriptive ?

ConsumerRecords<String, String> records
= consumer.poll(Duration.ofMillis(10000));
var it = records.iterator();
while (it.hasNext()) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

for this to be equivalent with existing VerifiableConsumer it would have to commit offsets, tolerate errors (redeliveries) and use consumer groups

Copy link
Contributor Author

@rystsov rystsov Dec 22, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree that RWVerifier doesn't cover consumer groups. But I don't think full equivalence is the right goal, instead of having the verifier we should have a set of specialized verifiers aimed at a particular set of features. With a single verifier we tend not to question how good it is and what it actually covers resulting in a false sense of security.

For example the pair of VerifiableConsumer & VerifiableProducer:

  • don't check if the consumed order matches the written order
  • ignore time travel
  • ignore when a consumer reads something what was never written
  • ignore indecisive errors on produce
  • has broken idempotency support (during rebalance a message is allowed to be consumed twice but the validation logic raise an error if there is any duplicate)
  • doesn't validate if a partition is assigned to multiple clients at the same time

RWVerifier isn't affected by all but the last point. And it already found a problem which VerifiableConsumer & VerifiableProducer missed.

Instead of adding weak support of consumer groups to RWVerifier we should add another verifier exclusively targeting the consumer groups unaffected by the existing flaws.

Copy link
Contributor Author

@rystsov rystsov Dec 22, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What I want to have an example of how to write remove services without relying on piping the output via ssh so that when there is a new PR introducing an end_to_end based test I could point to a RWVerifier and ask the author to do something similar.

What do you think of copying partition_movement_test into partition_movement_rw_test and marking original deprecated until we introduce a verifier with the consumer group support? It sounds ugly but postponing this PR also has cost, we'll continue using the regular approach and we'll depend on it more and more.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

got it, thank you for explaining all the details i think there is no need to create duplicated tests the one we have now is enough, if the RWVerifier is going to be a goto tool to write EndToEnd tests what do you think about making it a little bit better documented in terms of usage/output and its code cleaned up so that the author of the new features will have an easier job.

@dotnwat
Copy link
Member

dotnwat commented Dec 24, 2022

it was true back in summer but now the benefits should be smaller, the major boost came from optimizing wait_until

ahh thanks. i remember that wait_until pr now. thanks!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area/tests kind/enhance New feature or request
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

6 participants