Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ducky: pin to the latest version #6003

Closed
wants to merge 3 commits into from

Conversation

rystsov
Copy link
Contributor

@rystsov rystsov commented Aug 12, 2022

Cover letter

The latest version improved performance of wait_until and reduced the full test run by 30%. See details there

Faster wait_until dormant flakiness hidden by extra sleep. See this comment

Backport Required

  • not a bug fix
  • papercut/not impactful enough to backport
  • v22.2.x
  • v22.1.x
  • v21.11.x

UX changes

Not a user visible change

Release notes

  • none

@rystsov
Copy link
Contributor Author

rystsov commented Aug 13, 2022

The change made wait_until faster and revealed previously hidden flakiness. I've analyzed the new failures and they are all caused by a pre-existing condition, since this PR only increased sensibility I don't think that fixing the issues is a prerequisite for merging this PR.

@dotnwat dotnwat requested a review from jcsp August 13, 2022 18:11
@mmedenjak mmedenjak added kind/enhance New feature or request area/build labels Aug 14, 2022
@jcsp
Copy link
Contributor

jcsp commented Aug 15, 2022

If this change is making failures more deterministic then that is good, but please let's not merge it until those tests are either fixed or marked ok_to_fail.

jcsp added a commit to jcsp/redpanda that referenced this pull request Aug 15, 2022
The wait_for_version_sync was correct for
waiting for the configuration to propagate
across the cluster, but it was _not_ correct
for waiting for the configuration status to
be symmetric on all nodes (i.e. for all nodes
to know the status of all other nodes)

For test cases that query the status via
arbitrary notes, a stricter wait is needed.

This issue becomes visible once wait_until is
improved to avoid spurious extra sleeps, in
redpanda-data#6003

Fixes redpanda-data#6010
jcsp added a commit to jcsp/redpanda that referenced this pull request Aug 15, 2022
The wait_for_version_sync was correct for
waiting for the configuration to propagate
across the cluster, but it was _not_ correct
for waiting for the configuration status to
be symmetric on all nodes (i.e. for all nodes
to know the status of all other nodes)

For test cases that query the status via
arbitrary notes, a stricter wait is needed.

This issue becomes visible once wait_until is
improved to avoid spurious extra sleeps, in
redpanda-data#6003

Fixes redpanda-data#6010
jcsp added a commit to jcsp/redpanda that referenced this pull request Aug 16, 2022
The wait_for_version_sync was correct for
waiting for the configuration to propagate
across the cluster, but it was _not_ correct
for waiting for the configuration status to
be symmetric on all nodes (i.e. for all nodes
to know the status of all other nodes)

For test cases that query the status via
arbitrary notes, a stricter wait is needed.

This issue becomes visible once wait_until is
improved to avoid spurious extra sleeps, in
redpanda-data#6003

Fixes redpanda-data#6010
jcsp added a commit to jcsp/redpanda that referenced this pull request Aug 16, 2022
The wait_for_version_sync was correct for
waiting for the configuration to propagate
across the cluster, but it was _not_ correct
for waiting for the configuration status to
be symmetric on all nodes (i.e. for all nodes
to know the status of all other nodes)

For test cases that query the status via
arbitrary notes, a stricter wait is needed.

This issue becomes visible once wait_until is
improved to avoid spurious extra sleeps, in
redpanda-data#6003

Fixes redpanda-data#6010
jcsp added a commit to jcsp/redpanda that referenced this pull request Aug 16, 2022
The wait_for_version_sync was correct for
waiting for the configuration to propagate
across the cluster, but it was _not_ correct
for waiting for the configuration status to
be symmetric on all nodes (i.e. for all nodes
to know the status of all other nodes)

For test cases that query the status via
arbitrary notes, a stricter wait is needed.

This issue becomes visible once wait_until is
improved to avoid spurious extra sleeps, in
redpanda-data#6003

Fixes redpanda-data#6010
vbotbuildovich pushed a commit to vbotbuildovich/redpanda that referenced this pull request Aug 17, 2022
The wait_for_version_sync was correct for
waiting for the configuration to propagate
across the cluster, but it was _not_ correct
for waiting for the configuration status to
be symmetric on all nodes (i.e. for all nodes
to know the status of all other nodes)

For test cases that query the status via
arbitrary notes, a stricter wait is needed.

This issue becomes visible once wait_until is
improved to avoid spurious extra sleeps, in
redpanda-data#6003

Fixes redpanda-data#6010

(cherry picked from commit b7d4f2c)
@andrwng andrwng mentioned this pull request Aug 17, 2022
5 tasks
@rystsov rystsov added the ci-repeat-5 repeat tests 5x concurrently to check for flakey tests; self-cancelling label Aug 18, 2022
@vbotbuildovich vbotbuildovich removed the ci-repeat-5 repeat tests 5x concurrently to check for flakey tests; self-cancelling label Aug 18, 2022
@rystsov
Copy link
Contributor Author

rystsov commented Aug 19, 2022

ClusterConfigTest.test_restart #6095

@rystsov
Copy link
Contributor Author

rystsov commented Aug 19, 2022

PrefixTruncateRecoveryUpgradeTest.test_recover_during_upgrade #5589

@rystsov
Copy link
Contributor Author

rystsov commented Aug 19, 2022

BLOCKER: RpkClusterTest.test_upload_and_query_cluster_license_rpk #6096

@rystsov
Copy link
Contributor Author

rystsov commented Aug 19, 2022

BLOCKER: OffsetForLeaderEpochTest.test_offset_for_leader_epoch #6097

@rystsov
Copy link
Contributor Author

rystsov commented Aug 19, 2022

BLOCKER: OffsetForLeaderEpochTest.test_offset_for_leader_epoch #6098

@rystsov
Copy link
Contributor Author

rystsov commented Aug 19, 2022

BLOCKER: EndToEndTopicRecovery.test_restore #6099

@rystsov
Copy link
Contributor Author

rystsov commented Aug 19, 2022

PartitionBalancerTest.test_full_nodes - #5884

@rystsov rystsov added the ci-repeat-5 repeat tests 5x concurrently to check for flakey tests; self-cancelling label Aug 19, 2022
@vbotbuildovich vbotbuildovich removed the ci-repeat-5 repeat tests 5x concurrently to check for flakey tests; self-cancelling label Aug 19, 2022
@rystsov rystsov added the ci-repeat-5 repeat tests 5x concurrently to check for flakey tests; self-cancelling label Aug 19, 2022
@rystsov
Copy link
Contributor Author

rystsov commented Aug 31, 2022

BLOCKER: TopicRecoveryTest.* #6275

@rystsov rystsov added the ci-repeat-10 repeat tests 10x concurrently to check for flakey tests; self-cancelling label Aug 31, 2022
@vbotbuildovich vbotbuildovich removed the ci-repeat-10 repeat tests 10x concurrently to check for flakey tests; self-cancelling label Aug 31, 2022
@rystsov rystsov added the ci-repeat-10 repeat tests 10x concurrently to check for flakey tests; self-cancelling label Aug 31, 2022
@vbotbuildovich vbotbuildovich removed the ci-repeat-10 repeat tests 10x concurrently to check for flakey tests; self-cancelling label Aug 31, 2022
@rystsov rystsov added the ci-repeat-10 repeat tests 10x concurrently to check for flakey tests; self-cancelling label Sep 7, 2022
@vbotbuildovich vbotbuildovich removed the ci-repeat-10 repeat tests 10x concurrently to check for flakey tests; self-cancelling label Sep 7, 2022
@rystsov
Copy link
Contributor Author

rystsov commented Sep 7, 2022

BLOCKER: EndToEndTopicRecovery.test_restore_with_aborted_tx #6328

@rystsov
Copy link
Contributor Author

rystsov commented Sep 7, 2022

BLOCKER: TxFeatureFlagTest.test_disabling_transactions_after_they_being_used #6329

@rystsov rystsov added the ci-repeat-10 repeat tests 10x concurrently to check for flakey tests; self-cancelling label Sep 7, 2022
@vbotbuildovich vbotbuildovich removed the ci-repeat-10 repeat tests 10x concurrently to check for flakey tests; self-cancelling label Sep 7, 2022
@rystsov rystsov added the ci-repeat-10 repeat tests 10x concurrently to check for flakey tests; self-cancelling label Sep 12, 2022
@vbotbuildovich vbotbuildovich removed the ci-repeat-10 repeat tests 10x concurrently to check for flakey tests; self-cancelling label Sep 12, 2022
The latest version improved performance of wait_until and reduced
the full test run by 30%. See details in

redpanda-data/ducktape#16
@rystsov
Copy link
Contributor Author

rystsov commented Sep 26, 2022

BLOCKER: PrefixTruncateRecoveryUpgradeTest.test_recover_during_upgrade #5589

@twmb twmb removed their request for review September 26, 2022 22:20
Copy link
Member

@BenPope BenPope left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice

Comment on lines +84 to +85
self.logger.exception("%s: something is wrong" %
self.who_am_i(node))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should this be conditional on !self._stopping.is_set()?

@rystsov rystsov closed this Oct 28, 2022
@BenPope
Copy link
Member

BenPope commented Oct 28, 2022

Oh. Why?

@jcsp
Copy link
Contributor

jcsp commented Oct 31, 2022

I saw that there was a commit updating ducktape here

commit b23e0e189c602488eb2b532c5e3e77f9d97b9de7
Author: ZeDRoman <ilgovskiy@mail.ru>
Date:   Wed Oct 19 11:22:43 2022 +0200

    ducktape: update duckatape version

diff --git a/tests/setup.py b/tests/setup.py
index 279ac777c..e8cebee2d 100644
--- a/tests/setup.py
+++ b/tests/setup.py
@@ -13,7 +13,7 @@ setup(
     package_data={'': ['*.md']},
     include_package_data=True,
     install_requires=[
-        'ducktape@git+https://github.com/redpanda-data/ducktape.git@7ba83f32d5f265aad955a31096dcc0c97d96c0ce',
+        'ducktape@git+https://github.com/redpanda-data/ducktape.git@7cbc42a2b8c0dac1fae1db7cc06a2372b35db3b6',
         'prometheus-client==0.9.0', 'pyyaml==5.3.1', 'kafka-python==2.0.2',
         'crc32c==2.2', 'confluent-kafka==1.7.0', 'zstandard==0.15.2',
         'xxhash==2.0.2', 'protobuf==3.19.3', 'fastavro==1.4.9',

Would have been nicer to have a comment explaining why this PR was closed though, I was initially confused.

@rystsov last time you updated this PR, were there tests that were still unstable with the wait_until change? I guess any tests that were in that state will now be unstable on dev too.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area/tests kind/enhance New feature or request
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants