[v22.2x] Backport #6175 #6285 #8963 #6246 #6501 #6091 #6419 #6157 #6124 #9456

Merged

Conversation

BenPope
Member

@BenPope BenPope commented Mar 15, 2023

Backport of PR #6175
Fixes: #6173

Backport of PR #6285 (see also #8793)
Fixes: #5959

Backport of PR #8963
Fixes: #9285

Backport of PR #6501
Fixes: #6490

Backport of PR #6246

Backport of PR #6091
Fixes: #6011

Backport of PR #6419
Fixes: #5154

Backport of PR #6157
Fixes: #6140

Backport of PR #6124
Fixes: #6016

Backport of PR #6429
Fixes: #5885
Fixes: #5952

bharathv and others added 18 commits March 15, 2023 08:16
"NOT_COORDINATOR: This is not the correct coordinator" is a retryable
transient error. Let the wrapper retry.

(cherry picked from commit 74d2923)
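
As a rough illustration of the retry behaviour described in this commit, here is a minimal sketch. It assumes the wrapper surfaces failures as an exception carrying the error text; `CommandError`, `RETRYABLE_MARKERS`, and `retry_transient` are hypothetical names for illustration, not the actual rpk.py interface.

```python
import time

# Hypothetical list of substrings that mark an error as transient.
# The NOT_COORDINATOR text is quoted from the commit message above.
RETRYABLE_MARKERS = ("NOT_COORDINATOR",)


class CommandError(Exception):
    """Stand-in for whatever exception the CLI wrapper raises."""


def retry_transient(fn, attempts=5, backoff_s=1.0, sleep=time.sleep):
    """Call fn(), retrying only when the error text looks transient."""
    last_error = None
    for attempt in range(attempts):
        try:
            return fn()
        except CommandError as e:
            if not any(marker in str(e) for marker in RETRYABLE_MARKERS):
                raise  # unknown error: surface it immediately
            last_error = e
            if attempt < attempts - 1:
                sleep(backoff_s)  # pause before the next attempt
    raise last_error
```

Anything that does not match a known transient marker is re-raised on the first occurrence, so genuine failures are not hidden behind retries.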
This check had been made permissive to work around a franz-go
bug.

Reinstate it, and keep some debug code which was added to make
any failures here specifically say which workers disappeared.

Fixes redpanda-data#5959

(cherry picked from commit bd28cfd)
This is one of the cases that rpk doesn't retry internally,
it can happen when you try to describe a group shortly
after stopping a node.

(cherry picked from commit 00d87c7)
This is a bit different from other failures that are
retried inside rpk.py.  There's probably an underlying
rpk bug to be fixed here in the error handling, but
this makes the test handle rpk's current behavior, which
can be a very slow timeout when the whole cluster was
just restarted.

(cherry picked from commit c169347)
- franz-go never replies with an error that starts with
"Kafka replied that.."; it prints "broker replied that..." instead.
Modifying this condition ensures that the test client automatically
retries when these rpk errors are observed.

- Fixes: redpanda-data#8750

(cherry picked from commit 0359767)
This commit makes the KafkaCliConsumer stop procedure two-phase.
We first try to terminate it gracefully and allow it to leave the
consumer group. If that doesn't work we SIGKILL the process to ensure
we don't end up with a stray consumer like in redpanda-data#6490.

(cherry picked from commit 7315002)
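
The two-phase stop described above can be sketched for a local child process as follows. This is an assumption-laden illustration using Python's `subprocess` module; the real KafkaCliConsumer runs on remote test nodes, and `stop_two_phase` and `grace_s` are hypothetical names.

```python
import subprocess
import sys


def stop_two_phase(proc: subprocess.Popen, grace_s: float = 5.0) -> str:
    """Stop a child process gracefully if possible, forcefully otherwise.

    Returns "already-exited", "graceful", or "killed".
    """
    if proc.poll() is not None:
        return "already-exited"
    proc.terminate()  # phase 1: SIGTERM on POSIX, gives the process a chance to clean up
    try:
        proc.wait(timeout=grace_s)
        return "graceful"
    except subprocess.TimeoutExpired:
        proc.kill()  # phase 2: SIGKILL on POSIX, cannot be ignored
        proc.wait()  # reap the process so it does not linger
        return "killed"


if __name__ == "__main__":
    child = subprocess.Popen([sys.executable, "-c", "import time; time.sleep(60)"])
    print(stop_two_phase(child, grace_s=2.0))
```

The grace period is what lets a consumer leave its group cleanly; the forced kill only guarantees that no stray process survives the test.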
Controlling max buffered records enables better throughput,
same for mb_per_worker.

(cherry picked from commit 91caadd)
This was missing a factor of 2 to make it more tolerant
to noise.

(cherry picked from commit d2e0c6a)
...to encourage a little more throughput.  The default
was to send single-event batches.

(cherry picked from commit ab1e743)
Enable re-using ports rather than just ticking
upwards.  This avoids the need for a super long
open port range on AWS instances.

(cherry picked from commit 9283551)
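
To illustrate the difference between reusing ports and "just ticking upwards", here is a minimal sketch of a port pool that hands released ports back out. `PortPool` and its methods are hypothetical names; the actual allocator in the test services may look quite different.

```python
class PortPool:
    """Hands out ports from a fixed range and reuses released ones."""

    def __init__(self, first: int, last: int):
        # Stored in descending order so that pop() yields the lowest port first.
        self._free = list(range(last, first - 1, -1))
        self._in_use = set()

    def acquire(self) -> int:
        if not self._free:
            raise RuntimeError("no free ports left in range")
        port = self._free.pop()
        self._in_use.add(port)
        return port

    def release(self, port: int) -> None:
        if port not in self._in_use:
            raise ValueError(f"port {port} was not acquired from this pool")
        self._in_use.remove(port)
        self._free.append(port)  # the next acquire() will reuse this port
```

With reuse, the size of the open range only has to cover the number of ports in use at the same time, not the total number ever allocated during a run.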
This is _not_ for running them in docker on CI.  It's for developers
who make test changes to be able to run a miniaturized version of
the test to check for breakage on their workstation.

(cherry picked from commit 41fd8c5)
These got broken by kgo-verifier interface changes.

(cherry picked from commit 469940e)
This gave weird output like
progress: 100.00% ProduceStatus<0 0 0 0 0/0/0>

...because it was calculating the percentage
properly but printing the value of parent._status
before updating it.

(cherry picked from commit 19ebd83)
…eTest more robust

These tests both had a similar defect.  They produced much less data
than comments indicated, and relied on all produces being complete
before consumers started.

Because old kgo-verifier code didn't output any status from producer
until about 5 seconds in, the small ~500MB produce sizes would be
complete quite reliably by the first time the test checked the status,
before the consumer was started.  The consumer would then see the full
offset range when it started, and read the whole lot in one pass.

This change does not fix the underlying flaw (that the tests hardly
write any data in tests that are meant to be done under load), but it
fixes the way the tests can now fail because kgo-verifier is
snappier at indicating status.

Fixing the tests to fulfil their intended purpose and run partition
movement/balancer under load is tracked by:
redpanda-data#6245

(cherry picked from commit 5f378cc)
This was relying on fast propagation of controller
writes between nodes, relative to the execution of
the test.

Fixes redpanda-data#6011

(cherry picked from commit 1db031c)
Fixed the semantics of get_broker: 404 is a determinate error (the
requested resource doesn't exist), while 503 is a service-level error;
returning it means the admin interface doesn't create a false impression.

fixes redpanda-data#6016

(cherry picked from commit 7432633)
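
The distinction between the two status codes can be sketched as below. This is only an illustration of the semantics: the real handler is part of the broker's admin API, and the `state_available` flag and function name are assumptions made for the example.

```python
from http import HTTPStatus


def get_broker_status(known_brokers, node_id, state_available):
    """Pick a status code for a hypothetical get_broker-style lookup.

    state_available is an assumed flag meaning "this node can currently
    answer the question authoritatively".
    """
    if not state_available:
        # Service-level problem: we cannot say whether the broker exists.
        return HTTPStatus.SERVICE_UNAVAILABLE
    if node_id not in known_brokers:
        # Determinate answer: the requested resource does not exist.
        return HTTPStatus.NOT_FOUND
    return HTTPStatus.OK
```

A client can then treat 503 as "try again later" and 404 as a final answer.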
Aborts should be propagated as the standard
ss::abort_requested_exception type which is understood
by handlers to be ignored silently, as it occurs during
normal shutdown.

Timeouts remain a specific exception type in offset_monitor.
In locations that used to catch and swallow both aborts and
timeouts, timeouts are now logged at WARN severity: they are
not necessarily indicative of a fault, but may indicate
a system not operating at its best.

Fixes: redpanda-data#5154
(cherry picked from commit 927ea66)
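
The handling policy described in this commit (aborts ignored silently, timeouts logged at WARN) can be sketched in Python as an analogy. The actual change is in the C++ code base and uses ss::abort_requested_exception; the exception classes and helper below are hypothetical stand-ins.

```python
import logging


class AbortRequested(Exception):
    """Stand-in for an abort raised during normal shutdown."""


class WaitTimedOut(Exception):
    """Stand-in for the monitor-specific timeout exception."""


def run_tolerating_shutdown(fn, logger: logging.Logger):
    """Run fn(), treating aborts as silent and timeouts as warnings."""
    try:
        return fn()
    except AbortRequested:
        # Expected during shutdown: nothing to report.
        return None
    except WaitTimedOut as e:
        # Not necessarily a fault, but worth surfacing to operators.
        logger.warning("wait timed out: %s", e)
        return None
```

Any other exception type propagates unchanged, so only these two expected conditions are handled here.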
@BenPope
Member Author

BenPope commented Mar 15, 2023

/cdt
rp_version=build

jcsp and others added 6 commits March 15, 2023 13:36
wait_timed_out is permitted in CHAOS_ALLOW_LIST because chaos-style
tests may well induce timeouts.

(cherry picked from commit ea2abf6)
Print the missing node that caused the assertion to fail. This should
make it easier to debug test failures as there's no need to correlate
with other logs to figure out the missing node.

(cherry picked from commit 0579d0e)
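
A minimal sketch of an assertion that names the missing nodes, in the spirit of the change above. The function and argument names are hypothetical, not taken from the test suite.

```python
def assert_all_nodes_present(expected, observed):
    """Fail with a message that lists exactly which nodes are missing."""
    missing = sorted(set(expected) - set(observed))
    assert not missing, f"missing nodes: {missing} (observed: {sorted(observed)})"
```

Putting the set difference into the assertion message means the failure is self-explanatory in the test report.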
This patch introduces a few changes to make the ClusterMetricsTests
more resilient:
* use admin API utilities for waiting on stable controller leadership
* wait on metrics from the controller before deciding they're not
  present

(cherry picked from commit f9eafdb)
This commit adds a method to the Redpanda service that allows for fetching
samples for multiple metric families: "metrics_samples". Previously,
call sites that required multiple metric families had to make separate
calls to "metrics_sample" for each metric family required. This was
inconvenient for waiting on a set of metrics to become available.

(cherry picked from commit c8ffa67)
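
The idea of fetching several metric families in one pass can be sketched as follows. This does not reproduce the real "metrics_samples" signature; it assumes the scrape has already been parsed into `(family, labels, value)` tuples, and `samples_for_families` is a hypothetical name.

```python
def samples_for_families(scrape, families):
    """Group samples by family in a single pass over one scrape.

    Returns a dict mapping family name to a list of (labels, value) pairs,
    or None if any requested family has no samples yet.
    """
    found = {name: [] for name in families}
    for family, labels, value in scrape:
        if family in found:
            found[family].append((labels, value))
    if any(not samples for samples in found.values()):
        return None  # at least one family is still missing
    return found
```

Returning None while any family is missing makes the function convenient as a wait condition: a caller can poll it until the full set of metrics is available instead of waiting on each family in turn.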
This commit updates the tests in ClusterMetricsTest to fetch all the
cluster metric families at once. This change makes the tests
significantly faster (approx 10 times) as we now wait on the full set of
metrics to become available instead of waiting on each individual
metric.

(cherry picked from commit c0a838b)
Added the ability to recognize an instance of a Kafka cli consumer
background thread service.

Signed-off-by: Michal Maslanka <michal@redpanda.com>
(cherry picked from commit 4368a21)
A race condition in the tests caused them to fail when one of the
consumers consumed all of the messages before the others joined the
group. Because the test assumes that the consumers will consume from
roughly the same number of partitions, it failed. Fixed the race
condition by starting the producer after all of the consumers are
started. This way we are certain that the consumers are group members
before any messages appear in the partitions.

Fixes: redpanda-data#5885
Fixes: redpanda-data#5952

Signed-off-by: Michal Maslanka <michal@redpanda.com>
(cherry picked from commit a7ae054)
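
The ordering fix described above can be sketched like this. The consumer and producer objects, `group_member_count`, and `wait_until` are hypothetical stand-ins for the test's services and helpers; only the ordering is the point.

```python
import time


def wait_until(condition, timeout_s, interval_s=0.1):
    """Poll condition() until it returns True or the timeout expires."""
    deadline = time.monotonic() + timeout_s
    while True:
        if condition():
            return
        if time.monotonic() >= deadline:
            raise TimeoutError("condition not met within timeout")
        time.sleep(interval_s)


def start_consumers_then_producer(consumers, producer, group_member_count,
                                  timeout_s=30.0):
    """Start every consumer, wait for full group membership, then produce."""
    for consumer in consumers:
        consumer.start()
    # Do not produce until every consumer has joined the group; otherwise
    # an early consumer could drain all partitions on its own.
    wait_until(lambda: group_member_count() == len(consumers), timeout_s)
    producer.start()
```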
@BenPope
Member Author

BenPope commented Mar 15, 2023

/cdt
rp_version=build

Contributor

@VladLazar VladLazar left a comment


Looks sensible to me if the tests are happy.

@BenPope
Member Author

BenPope commented Mar 15, 2023

Failures:

This pull request was closed.
8 participants