[v22.2x] Backport #6175 #6285 #8963 #6246 #6501 #6091 #6419 #6157 #6124 #9456

BenPope · 2023-03-15T08:22:34Z

Backport of PR #6175
Fixes: #6173

Backport of PR #6285 (see also #8793)
Fixes: #5959

Backport of PR #8963
Fixes: #9285

Backport of PR #6501
Fixes: #6490

Backport of PR #6246

Backport of PR #6091
Fixes: #6011

Backport of PR #6419
Fixes: #5154

Backport of PR #6157
Fixes: #6140

Backport of PR #6124
Fixes: #6016

Backport of PR #6429
Fixes: #5885
Fixes: #5952

NOT_COORDINATOR ("This is not the correct coordinator") is a retryable transient error. Let the wrapper retry. (cherry picked from commit 74d2923)

This check had been made permissive to work around a franz-go bug. Reinstate it, and keep some debug code which was added to make any failures here specifically say which workers disappeared. Fixes redpanda-data#5959 (cherry picked from commit bd28cfd)

This is one of the cases that rpk doesn't retry internally; it can happen when you try to describe a group shortly after stopping a node. (cherry picked from commit 00d87c7)

This is a bit different from other failures that are retried inside rpk.py. There's probably an underlying rpk bug to be fixed here in the error handling, but this makes the test handle rpk's current behavior, which can be a very slow timeout when the whole cluster was just restarted. (cherry picked from commit c169347)

- franz-go never replies with an error that starts with "Kafka replied that.."; it prints "broker replied that...". Modifying this condition ensures that the test client automatically retries when these rpk errors are observed.
- Fixes: redpanda-data#8750

(cherry picked from commit 0359767)
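The retry behavior described in the commit messages above can be sketched roughly as follows. This is a minimal illustration, not the code in rpk.py: `RpkError`, `RETRYABLE_MARKERS`, `is_retryable`, and `retry_transient` are hypothetical names, and the marker strings are assumptions lifted from the commit messages and titles in this PR.

```python
import time


class RpkError(Exception):
    """Hypothetical stand-in for the error a test wrapper raises when rpk fails."""


# Substrings assumed to mark transient failures. They are taken from the
# commit messages in this PR, not from the real rpk.py.
RETRYABLE_MARKERS = (
    "NOT_COORDINATOR",
    "broker replied that",
    "connection refused",
)


def is_retryable(message):
    """Return True if the error text contains any transient-failure marker."""
    lowered = message.lower()
    return any(marker.lower() in lowered for marker in RETRYABLE_MARKERS)


def retry_transient(fn, attempts=5, backoff_sec=0.0):
    """Call fn(), retrying on a retryable RpkError up to `attempts` times."""
    for attempt in range(1, attempts + 1):
        try:
            return fn()
        except RpkError as e:
            # Give up on the last attempt or on errors that are not transient.
            if attempt == attempts or not is_retryable(str(e)):
                raise
            time.sleep(backoff_sec)
```

Matching on a substring is brittle, which is exactly the failure mode the commit fixes: a marker that the client library never emits silently disables the retry.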

This commit makes the KafkaCliConsumer stop procedure two-phase. We first try to terminate it gracefully and allow it to leave the consumer group. If that doesn't work, we SIGKILL the process to ensure we don't end up with a stray consumer like in redpanda-data#6490. (cherry picked from commit 7315002)
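A two-phase stop of this kind might look like the sketch below. It assumes a local child process managed through `subprocess`, which stands in for the remote process the real service controls; `stop_two_phase` and `grace_sec` are hypothetical names.

```python
import subprocess


def stop_two_phase(proc, grace_sec=10.0):
    """Stop a child process: SIGTERM first, SIGKILL if it does not exit in time.

    Returns True if the process exited during the grace period.
    """
    if proc.poll() is not None:
        return True  # already exited
    proc.terminate()  # SIGTERM on POSIX: gives the consumer a chance to leave its group
    try:
        proc.wait(timeout=grace_sec)
        return True
    except subprocess.TimeoutExpired:
        proc.kill()  # SIGKILL on POSIX: guarantees no stray process is left
        proc.wait()  # reap the process so it does not linger as a zombie
        return False
```

Returning whether the graceful phase succeeded lets a caller log the forced kill, which is useful when chasing stray consumers.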

Controlling max buffered records enables better throughput, same for mb_per_worker. (cherry picked from commit 91caadd)

This was missing a factor of 2 to make it more tolerant to noise. (cherry picked from commit d2e0c6a)

...to encourage a little more throughput. The default was to send single-event batches (cherry picked from commit ab1e743)

Enable re-using ports rather than just ticking upwards. This avoids the need for a super long open port range on AWS instances. (cherry picked from commit 9283551)
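One way to reuse ports instead of only counting upwards is a small pool that prefers released ports. The sketch below is an assumption about the general technique, not the KgoVerifierService implementation; `PortPool` and the port numbers in any usage are hypothetical.

```python
import heapq


class PortPool:
    """Hypothetical allocator that reuses released ports before taking new ones."""

    def __init__(self, first_port, last_port):
        self._next = first_port  # next never-used port
        self._last = last_port   # inclusive upper bound of the open range
        self._released = []      # min-heap of ports returned by finished services
        self._in_use = set()

    def allocate(self):
        """Return the lowest released port, or the next unused one."""
        if self._released:
            port = heapq.heappop(self._released)
        elif self._next <= self._last:
            port = self._next
            self._next += 1
        else:
            raise RuntimeError("no free ports in range")
        self._in_use.add(port)
        return port

    def release(self, port):
        """Return a port to the pool so a later service can reuse it."""
        if port not in self._in_use:
            raise ValueError(f"port {port} is not allocated")
        self._in_use.remove(port)
        heapq.heappush(self._released, port)
```

With reuse, the range only needs to be as wide as the peak number of concurrent services, not the total number started during a run.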

This is _not_ for running them in docker on CI. It's for developers who make test changes to be able to run a miniaturized version of the test to check for breakage on their workstation. (cherry picked from commit 41fd8c5)

These got broken by kgo-verifier interface changes. (cherry picked from commit 469940e)

This gave weird output like progress: 100.00% ProduceStatus<0 0 0 0 0/0/0> ...because it was calculating the percentage properly but printing the value of parent._status before updating it. (cherry picked from commit 19ebd83)

…eTest more robust These tests both had a similar defect. They produced much less data than comments indicated, and relied on all produces being complete before consumers started. Because old kgo-verifier code didn't output any status from producer until about 5 seconds in, the small ~500MB produce sizes would be complete quite reliably by the first time the test checked the status, before the consumer was started. The consumer would then see the full offset range when it started, and read the whole lot in one pass. This change does not fix the underlying flaw (that the tests hardly write any data in tests that are meant to be done under load), but it fixes the way the tests can now fail because kgo-verifier is snappier at indicating status. Fixing the tests to fulfil their intended purpose and run partition movement/balancer under load is tracked by: redpanda-data#6245 (cherry picked from commit 5f378cc)

(cherry picked from commit 2e30447)

This was relying on fast propagation of controller writes between nodes, relative to the execution of the test. Fixes redpanda-data#6011 (cherry picked from commit 1db031c)
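A common way to stop relying on fast propagation is to poll every node until they all report the expected state. The sketch below illustrates that pattern under stated assumptions: `wait_until_all_agree` and the `read_state` callable are hypothetical, and the actual test may use a different helper.

```python
import time


def wait_until_all_agree(nodes, read_state, expected, timeout_sec=10.0, backoff_sec=0.1):
    """Poll read_state(node) for every node until all return `expected`.

    Raises TimeoutError naming the nodes that still disagree.
    """
    deadline = time.monotonic() + timeout_sec
    while True:
        lagging = [n for n in nodes if read_state(n) != expected]
        if not lagging:
            return
        if time.monotonic() >= deadline:
            raise TimeoutError(f"nodes not yet at {expected!r}: {lagging}")
        time.sleep(backoff_sec)
```

Naming the lagging nodes in the error keeps a timeout debuggable without cross-referencing other logs.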

Fixed the semantics of get_broker: 404 is a determinate error (the requested resource doesn't exist), while 503 is a service-level error, and returning it means the admin interface doesn't create a false impression. Fixes redpanda-data#6016 (cherry picked from commit 7432633)

Aborts should be propagated as the standard ss::abort_requested_exception type, which handlers understand and ignore silently, as it occurs during normal shutdown. Timeouts remain a specific exception type in offset_monitor. In locations that used to catch + swallow both aborts and timeouts, timeouts are logged at WARN severity, as they are not necessarily indicative of a fault, but may indicate a system not operating at its best. Fixes: redpanda-data#5154 (cherry picked from commit 927ea66)

BenPope · 2023-03-15T11:01:23Z

/cdt
rp_version=build

wait_timed_out is permitted in CHAOS_ALLOW_LIST because chaos-style tests may well induce timeouts. (cherry picked from commit ea2abf6)

Print the missing node that caused the assertion to fail. This should make it easier to debug test failures as there's no need to correlate with other logs to figure out the missing node. (cherry picked from commit 0579d0e)

This patch introduces a few changes to make the ClusterMetricsTests more resilient:
* use admin API utilities for waiting on stable controller leadership
* wait on metrics from the controller before deciding they're not present

(cherry picked from commit f9eafdb)

This commit adds a method to the Redpanda service that allows for fetching samples for multiple metric families: "metrics_samples". Previously, call sites that required multiple metric families had to make separate calls to "metrics_sample" for each metric family required. This was inconvenient for waiting on a set of metrics to become available. (cherry picked from commit c8ffa67)

This commit updates the tests in ClusterMetricsTest to fetch all the cluster metric families at once. This change makes the tests significantly faster (approx 10 times) as we now wait on the full set of metrics to become available instead of waiting on each individual metric. (cherry picked from commit c0a838b)
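The idea of fetching several metric families from one scrape can be sketched as below. This is not the signature of the real "metrics_samples" method, which is not shown in this PR: the `scrape` callable, the return shape, and the family names used in any example are assumptions for illustration.

```python
def metrics_samples(scrape, families):
    """Return {family: samples} from a single scrape, or None if any family is missing.

    `scrape` is a hypothetical callable returning {family_name: [sample, ...]}.
    """
    scraped = scrape()  # one request covers every requested family
    if any(family not in scraped for family in families):
        return None  # caller can keep waiting until the full set is available
    return {family: scraped[family] for family in families}
```

Because a missing family yields `None`, a test can wait on the whole set with one polling loop instead of one loop per metric family.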

Added ability to recognize an instance of a Kafka cli consumer background thread service. Signed-off-by: Michal Maslanka <michal@redpanda.com> (cherry picked from commit 4368a21)

A race condition in the tests caused them to fail when one of the consumers consumed all of the messages before the others joined the group. Because the test assumes that the consumers will consume from roughly the same number of partitions, it failed. Fixed the race condition by starting the producer after all of the consumers are started. This way we are certain that the consumers are group members before any messages appear in the partitions. Fixes: redpanda-data#5885 Fixes: redpanda-data#5952 Signed-off-by: Michal Maslanka <michal@redpanda.com> (cherry picked from commit a7ae054)

BenPope · 2023-03-15T13:37:03Z

/cdt
rp_version=build

VladLazar

Looks sensible to me if the tests are happy.

BenPope · 2023-03-15T19:54:50Z

Failures:

ManyPartitionsTest.test_many_partitions - inconclusive
TestReadReplicaService.test_simple_end_to_end - Maybe Timeout in PartitionBalancerTest.test_fuzz_admin_ops #7671 - Fix not backported
EndToEndTopicRecovery.test_restore: AssertionError: produced 100000, consumed 99972
NodeOperationFuzzyTest.test_node_operations: Do not allow decommissioned node rejoining the cluster #8547 - Fix not backported
TimeQueryTest.test_timequery: CI Failure (EndpointConnectionError, DNS) in TimeQueryTest.test_timequery.cloud_storage #8804
ClusterConfigTest.test_invalid_settings_forced: CI Failure (AssertionError) in ClusterConfigTest.test_invalid_settings_forced #8801
ClusterConfigTest.test_restart: CI Failure in ClusterConfigTest.test_restart #6095 - Fix not backported
TestReadReplicaService.test_produce_is_forbidden: Maybe Timeout in PartitionBalancerTest.test_fuzz_admin_ops #7671 - Fix not backported

bharathv and others added 18 commits March 15, 2023 08:16

tests: Retry group describe on NOT_COORDINATOR err

f7d8863

NOT_COORDINATOR ("This is not the correct coordinator") is a retryable transient error. Let the wrapper retry. (cherry picked from commit 74d2923)

tests: tolerate connection refused in rpk group describe

dc26579

This is one of the cases that rpk doesn't retry internally; it can happen when you try to describe a group shortly after stopping a node. (cherry picked from commit 00d87c7)

tests: more options for kgorepeaterservice

6afe08b

Controlling max buffered records enables better throughput, same for mb_per_worker. (cherry picked from commit 91caadd)

test: correct a bandwidth factor on ManyPartitionsTest

8bf1de1

This was missing a factor of 2 to make it more tolerant to noise. (cherry picked from commit d2e0c6a)

tests: increase batching in manypartitionstest

d0a59fc

...to encourage a little more throughput. The default was to send single-event batches (cherry picked from commit ab1e743)

tests: improve port allocation in KgoVerifierService

ef6cf44

Enable re-using ports rather than just ticking upwards. This avoids the need for a super long open port range on AWS instances. (cherry picked from commit 9283551)

tests: fix partition movement/balancer scale tests

253109c

These got broken by kgo-verifier interface changes. (cherry picked from commit 469940e)

tests: fix progress output from KgoVerifierProducer

6e2a98e

This gave weird output like progress: 100.00% ProduceStatus<0 0 0 0 0/0/0> ...because it was calculating the percentage properly but printing the value of parent._status before updating it. (cherry picked from commit 19ebd83)

tests: improved logging in KgoVerifierService

1c553ae

(cherry picked from commit 2e30447)

tests: make FeaturesMultiNodeTest more robust

4497b76

This was relying on fast propagation of controller writes between nodes, relative to the execution of the test. Fixes redpanda-data#6011 (cherry picked from commit 1db031c)

github-actions bot added the area/redpanda label Mar 15, 2023

BenPope requested review from jcsp, VladLazar and mmaslankaprv March 15, 2023 08:39

BenPope force-pushed the backport-pr8963-v22.2.x-v4 branch from 39f8fd5 to 63cfd53 Compare March 15, 2023 11:04

jcsp and others added 6 commits March 15, 2023 13:36

tests: update log allow lists for offset_monitor exceptions

dfd1eea

wait_timed_out is permitted in CHAOS_ALLOW_LIST because chaos-style tests may well induce timeouts. (cherry picked from commit ea2abf6)

tests/redpanda: print node failing assertion

ac5875e

Print the missing node that caused the assertion to fail. This should make it easier to debug test failures as there's no need to correlate with other logs to figure out the missing node. (cherry picked from commit 0579d0e)

tests: added extra context to kafka cli consumer log entries

a7b1fe1

Added ability to recognize an instance of a Kafka cli consumer background thread service. Signed-off-by: Michal Maslanka <michal@redpanda.com> (cherry picked from commit 4368a21)

BenPope force-pushed the backport-pr8963-v22.2.x-v4 branch from 63cfd53 to f4329e8 Compare March 15, 2023 13:36

piyushredpanda mentioned this pull request Mar 15, 2023

[v22.2.x] Backport #6175 #6285 #8963 #6246 #6501 #9440

Closed

VladLazar approved these changes Mar 15, 2023

BenPope mentioned this pull request Mar 15, 2023

[v22.2.x] Backport #6175 #6285 #8963 #9429

Closed

piyushredpanda added this to the v22.2.11 milestone Mar 15, 2023

piyushredpanda merged commit ce7d3ee into redpanda-data:v22.2.x Mar 16, 2023

This was referenced Mar 16, 2023

[v22.3.x] Fix for CI issue in ManyPartitionsTest where DescribeGroups request fails #9302

Merged

[v22.3.x] tests: reinstate exact consumer count checks in KgoRepeaterService #9497

Merged

This pull request was closed.
