[v22.2x] Backport #6175 #6285 #8963 #6246 #6501 #6091 #6419 #6157 #6124 #9456
Merged: piyushredpanda merged 25 commits into redpanda-data:v22.2.x from BenPope:backport-pr8963-v22.2.x-v4 on Mar 16, 2023
"NOT_COORDINATOR: This is not the correct coordinator" is a retryable transient error. Let the wrapper retry. (cherry picked from commit 74d2923)
This check had been made permissive to work around a franz-go bug. Reinstate it, and keep some debug code which was added to make any failures here specifically say which workers disappeared. Fixes redpanda-data#5959 (cherry picked from commit bd28cfd)
This is one of the cases that rpk doesn't retry internally, it can happen when you try to describe a group shortly after stopping a node. (cherry picked from commit 00d87c7)
This is a bit different from other failures that are retried inside rpk.py. There's probably an underlying rpk bug to be fixed here in the error handling, but this makes the test handle rpk's current behavior, which can be a very slow timeout when the whole cluster was just restarted. (cherry picked from commit c169347)
franz-go never replies with an error that starts with "Kafka replied that..."; it prints "broker replied that...". Modifying this condition ensures that when these rpk errors are observed, the test client retries automatically. Fixes: redpanda-data#8750 (cherry picked from commit 0359767)
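A minimal sketch of the kind of retry condition described in the commit messages above. The names `RpkError`, `is_retryable`, and `run_with_retries` are hypothetical and are not taken from rpk.py; only the marker strings "broker replied that" and "NOT_COORDINATOR" come from the commit messages.

```python
import time

# Substrings treated as transient. Only these two markers appear in the
# commit messages; the real wrapper may check for others.
RETRYABLE_MARKERS = ("broker replied that", "NOT_COORDINATOR")


class RpkError(Exception):
    """Hypothetical stand-in for the error raised when an rpk call fails."""


def is_retryable(message: str) -> bool:
    """Return True if the error text contains a known transient marker."""
    return any(marker in message for marker in RETRYABLE_MARKERS)


def run_with_retries(fn, attempts: int = 5, backoff_s: float = 0.5):
    """Call fn(), retrying on transient RpkError up to `attempts` times."""
    for attempt in range(1, attempts + 1):
        try:
            return fn()
        except RpkError as e:
            # Re-raise on the last attempt or for non-transient errors.
            if attempt == attempts or not is_retryable(str(e)):
                raise
            time.sleep(backoff_s)
```

Matching on the "broker replied that" prefix rather than "Kafka replied that" is the point of the fix: a condition that never matches means the retry path is never taken.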
This commit makes the KafkaCliConsumer stop procedure two phase. We first try to terminate it gracefully and allow it to leave the consumer group. If that doesn't work we SIGKILL the process to ensure we don't end up with a stray consumer like in redpanda-data#6490. (cherry picked from commit 7315002)
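The two-phase stop can be sketched for a local process as follows. This is an illustration under assumptions, not the KafkaCliConsumer code: the real service stops a process on a remote test node, and the function name `stop_two_phase` and the grace period are hypothetical.

```python
import subprocess


def stop_two_phase(proc: subprocess.Popen, grace_s: float = 5.0) -> str:
    """Stop a process gracefully first, then forcefully if it does not exit."""
    if proc.poll() is not None:
        return "already-exited"
    # Phase 1: SIGTERM (on POSIX) gives the process a chance to clean up,
    # for example to leave its consumer group.
    proc.terminate()
    try:
        proc.wait(timeout=grace_s)
        return "terminated"
    except subprocess.TimeoutExpired:
        # Phase 2: SIGKILL (on POSIX) so that no stray process is left behind.
        proc.kill()
        proc.wait()
        return "killed"
```

The return value records which phase ended the process, which a test could log to spot consumers that routinely ignore the graceful request.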
Controlling max buffered records enables better throughput, same for mb_per_worker. (cherry picked from commit 91caadd)
This was missing a factor of 2 to make it more tolerant to noise. (cherry picked from commit d2e0c6a)
...to encourage a little more throughput. The default was to send single-event batches (cherry picked from commit ab1e743)
Enable re-using ports rather than just ticking upwards. This avoids the need for a super long open port range on AWS instances. (cherry picked from commit 9283551)
This is _not_ for running them in docker on CI. It's for developers who make test changes to be able to run a miniaturized version of the test to check for breakage on their workstation. (cherry picked from commit 41fd8c5)
These got broken by kgo-verifier interface changes. (cherry picked from commit 469940e)
This gave weird output like progress: 100.00% ProduceStatus<0 0 0 0 0/0/0> ...because it was calculating the percentage properly but printing the value of parent._status before updating it. (cherry picked from commit 19ebd83)
…eTest more robust These tests both had a similar defect. They produced much less data than comments indicated, and relied on all produces being complete before consumers started. Because old kgo-verifier code didn't output any status from the producer until about 5 seconds in, the small ~500MB produce sizes would be complete quite reliably by the first time the test checked the status, before the consumer was started. The consumer would then see the full offset range when it started, and read the whole lot in one pass. This change does not fix the underlying flaw (that the tests hardly write any data in tests that are meant to be done under load), but it fixes the way the tests can now fail because kgo-verifier is snappier at indicating status. Fixing the tests to fulfil their intended purpose and run partition movement/balancer under load is tracked by: redpanda-data#6245 (cherry picked from commit 5f378cc)
(cherry picked from commit 2e30447)
This was relying on fast propagation of controller writes between nodes, relative to the execution of the test. Fixes redpanda-data#6011 (cherry picked from commit 1db031c)
Fixed the semantics of get_broker: 404 is a determinate error (the requested resource doesn't exist), while 503 is a service-level error; when we return 503, the admin interface doesn't create a false impression. Fixes redpanda-data#6016 (cherry picked from commit 7432633)
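From a client's point of view, the distinction might be handled as in the sketch below. It assumes that a caller wants to retry on 503 and accept 404 as a final answer, which is an inference from the commit message rather than something the commit states. The function names and the `fetch_status` callable are hypothetical.

```python
import time
from http import HTTPStatus


def classify_get_broker(status_code: int) -> str:
    """Map an HTTP status from a broker lookup to a client-side verdict."""
    if status_code == HTTPStatus.OK:
        return "found"
    if status_code == HTTPStatus.NOT_FOUND:
        # Determinate: the requested broker does not exist.
        return "absent"
    if status_code == HTTPStatus.SERVICE_UNAVAILABLE:
        # Service-level: the node cannot answer yet, so ask again later.
        return "retry"
    return "error"


def wait_for_broker_verdict(fetch_status, attempts: int = 10,
                            backoff_s: float = 0.5) -> str:
    """Poll fetch_status() until the answer is something other than 'retry'."""
    for _ in range(attempts):
        verdict = classify_get_broker(fetch_status())
        if verdict != "retry":
            return verdict
        time.sleep(backoff_s)
    return "retry"
```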
Aborts should be propagated as the standard ss::abort_requested_exception type, which is understood by handlers to be ignored silently, as it occurs during normal shutdown. Timeouts remain a specific exception type in offset_monitor, and in locations that used to catch + swallow both aborts and timeouts, timeouts are logged at WARN severity, as they are not necessarily indicative of a fault, but may indicate a system not operating at its best. Fixes: redpanda-data#5154 (cherry picked from commit 927ea66)
/cdt |
BenPope force-pushed the backport-pr8963-v22.2.x-v4 branch from 39f8fd5 to 63cfd53 (March 15, 2023 11:04)
wait_timed_out is permitted in CHAOS_ALLOW_LIST because chaos-style tests may well induce timeouts. (cherry picked from commit ea2abf6)
Print the missing node that caused the assertion to fail. This should make it easier to debug test failures as there's no need to correlate with other logs to figure out the missing node. (cherry picked from commit 0579d0e)
This patch introduces a few changes to make the ClusterMetricsTests more resilient: * use admin API utilities for waiting on stable controller leadership * wait on metrics from the controller before deciding they're not present (cherry picked from commit f9eafdb)
This commit adds a method to the Redpanda service that allows for fetching samples for multiple metric families: "metrics_samples". Previously, call sites that required multiple metric families had to make separate calls to "metrics_sample" for each metric family required. This was inconvenient for waiting on a set of metrics to become available. (cherry picked from commit c8ffa67)
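The idea of collecting several metric families in one pass can be sketched as below. This is not the signature of the real "metrics_samples" method; `collect_metric_families` and `all_families_present` are hypothetical helpers, and the parser assumes plain Prometheus-style exposition lines without trailing timestamps.

```python
def collect_metric_families(exposition_text: str, families: list[str]) -> dict[str, list[float]]:
    """Scan the metrics text once and gather sample values per requested family."""
    found: dict[str, list[float]] = {name: [] for name in families}
    for line in exposition_text.splitlines():
        line = line.strip()
        # Skip blank lines and "# HELP" / "# TYPE" comment lines.
        if not line or line.startswith("#"):
            continue
        # Assumed layout: name{labels} value
        name_and_labels, _, value = line.rpartition(" ")
        metric_name = name_and_labels.split("{", 1)[0]
        if metric_name in found:
            found[metric_name].append(float(value))
    return found


def all_families_present(samples: dict[str, list[float]]) -> bool:
    """True once every requested family has at least one sample."""
    return all(samples[name] for name in samples)
```

A test can then poll a single predicate such as `all_families_present` instead of waiting on each metric family separately.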
This commit updates the tests in ClusterMetricsTest to fetch all the cluster metric families at once. This change makes the tests significantly faster (approx 10 times) as we now wait on the full set of metrics to become available instead of waiting on each individual metric. (cherry picked from commit c0a838b)
Added ability to recognize an instance of a Kafka cli consumer background thread service. Signed-off-by: Michal Maslanka <michal@redpanda.com> (cherry picked from commit 4368a21)
A race condition in the tests caused them to fail when one of the consumers consumed all of the messages before the others joined the group. As the test is based on the assumption that the consumers will consume from roughly the same number of partitions, it failed. Fixed the race condition by starting the producer after all of the consumers are started. This way we are certain that the consumers are group members before any messages appear in the partitions. Fixes: redpanda-data#5885 Fixes: redpanda-data#5952 Signed-off-by: Michal Maslanka <michal@redpanda.com> (cherry picked from commit a7ae054)
BenPope force-pushed the backport-pr8963-v22.2.x-v4 branch from 63cfd53 to f4329e8 (March 15, 2023 13:36)
/cdt |
VladLazar approved these changes on Mar 15, 2023
Looks sensible to me if the tests are happy.
Failures:
|
This was referenced Mar 16, 2023
This pull request was closed.
Backport of PR #6175
Fixes: #6173
Backport of PR #6285 (see also #8793)
Fixes: #5959
Backport of PR #8963
Fixes: #9285
Backport of PR #6501
Fixes #6490
Backport of PR #6246
Backport of PR #6091
Fixes: #6011
Backport of PR #6419
Fixes: #5154
Backport of PR #6157
Fixes: #6140
Backport of PR #6124
Fixes: #6016
Backport of PR #6429
Fixes: #5885
Fixes: #5952