[v22.2.x] Backport #6175 #6285 #8963 #6246 #6501 #6091 #6419 #6157 #6124 #9455

BenPope · 2023-03-15T08:20:35Z

Backport of PR #6175
Fixes: #6173

Backport of PR #6285 (see also #8793)
Fixes: #5959

Backport of PR #8963
Fixes: #9285

Backport of PR #6501
Fixes #6490

Backport of #6246

Backport of PR #6091
Fixes: #6011

Backport of PR #6419
Fixes: #5154

Backport of PR #6157
Fixes: #6140

Backport of PR #6124
Fixes: #6016

Signed-off-by: Michal Maslanka <michal@redpanda.com> (cherry picked from commit 29f3a1d)

When the node count changes, the unevenness error increases. Made the error value independent of the number of nodes, to prevent a situation in which rebalancing is interrupted when another node is added to the cluster. Signed-off-by: Michal Maslanka <michal@redpanda.com> (cherry picked from commit 30a5fbc)

[V22.2.x] Backport of redpanda-data#6814, redpanda-data#7367, redpanda-data#7388

Signed-off-by: Ben Pope <ben@redpanda.com> (cherry picked from commit 1ecca56)

Signed-off-by: Ben Pope <ben@redpanda.com> (cherry picked from commit 438bc30)

Signed-off-by: Ben Pope <ben@redpanda.com> (cherry picked from commit db04a44)

…54-v22.2.x-260 [v22.2.x] Fixed logging abort_requested_exception with error severity

Our recommended clocksource is tsc, which is only present on x86 architectures. On arch != amd, rpk will show that the tuner is not available and print "Clocksource setting not available for this architecture" instead of "Preferred clocksource 'tsc' not available". (cherry picked from commit 41be394)

Clocksource tuner is not supported for ARM, so we need to adjust the output. (cherry picked from commit cef42f4)

In do_transfer_leadership(), if a follower is still not caught up after prepare_transfer_leadership() is done, a `timeout` was returned. However, it's not really a timeout, it's a flap (we thought recovery was done but it wasn't). This commit changes it to `exponential_backoff` so that the admin API returns a 503 (please retry) rather than a 504 (we couldn't do it in time). (cherry picked from commit f62b7e7)
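The distinction between the two results can be sketched as a small mapping. This is an illustration only: `TransferError` and `http_status_for` are hypothetical names, and the real admin API is C++ code whose structure is not shown in this PR.

```python
from enum import Enum
from http import HTTPStatus


class TransferError(Enum):
    # Hypothetical error codes, named after the ones in the commit message.
    SUCCESS = "success"
    TIMEOUT = "timeout"
    EXPONENTIAL_BACKOFF = "exponential_backoff"


def http_status_for(err: TransferError) -> HTTPStatus:
    """Map a leadership-transfer result onto an HTTP status (assumed mapping)."""
    if err is TransferError.SUCCESS:
        return HTTPStatus.OK
    if err is TransferError.EXPONENTIAL_BACKOFF:
        # Transient condition: the follower was not caught up, retry later.
        return HTTPStatus.SERVICE_UNAVAILABLE
    if err is TransferError.TIMEOUT:
        # The operation genuinely did not finish in time.
        return HTTPStatus.GATEWAY_TIMEOUT
    return HTTPStatus.INTERNAL_SERVER_ERROR
```

A client that sees `SERVICE_UNAVAILABLE` can simply retry after a delay, whereas `GATEWAY_TIMEOUT` says the attempt ran out of time.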

…46-7403-v22.2.x-874 [v22.2.x] rpk clocksource tuner only enabled for amd

…xes-to-v22.2.x-553 [v22.2.x] treewide: Small cleanups for llvm-15

Reason for condition is legit and the tombstones will be replicated the next time nodes start, so lowering log message severity to WARN. Error text is logged for other error codes (cherry picked from commit afccf5d)

…to a common private method this change is used in the next commit

the test will cause the generation of remote segments with only configuration batches, and then test that the topic can be restored

…ad size for the next commit, where stream_stats::size_bytes is used

…ipping over configuration-only segments

When retention.bytes is set, download_log_with_capped_size is used to not exceed this size limit. Now the function counts toward this limit only those segments that contain some topic data. Previously the segments to download were precomputed from offset_meta keeping a total_size sum. Now the sum is done while downloading segments, to account for segments that contain no data and that will be discarded.
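A minimal sketch of the accounting described above, in Python rather than the C++ of the actual change. `Segment`, `data_bytes` and `download_with_cap` are hypothetical names, and the stopping rule (stop before the cap would be exceeded) is an assumption.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass
class Segment:
    name: str
    data_bytes: int  # bytes of topic data; 0 for a configuration-only segment


def download_with_cap(segments: Iterable[Segment], cap_bytes: int) -> List[str]:
    """Keep segments in order while the data-bearing total stays within cap_bytes."""
    kept: List[str] = []
    total = 0
    for seg in segments:
        if seg.data_bytes == 0:
            # Configuration-only segment: discarded, and not counted toward the cap.
            continue
        if total + seg.data_bytes > cap_bytes:
            break
        kept.append(seg.name)
        total += seg.data_bytes
    return kept
```

Because the total is accumulated during the loop, a run of data-less segments no longer consumes the budget that a precomputed sum would have charged them.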

It's unclear why, but in a case where this threw from _worker, the test ended up hanging during teardown. We don't need to throw from _worker: it is neater to capture the exception and promptly raise it in the test body via condition_met. Fixes redpanda-data#7426 (cherry picked from commit 70c280d)
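The capture-then-raise pattern can be sketched as follows. `Worker` and the local `wait_until` are stand-ins written for this example, not the classes from the test suite; only the names `_worker` and `condition_met` come from the commit message.

```python
import threading
import time
from typing import Callable, Optional


class Worker:
    """Background worker that records a failure instead of raising from its thread."""

    def __init__(self, work: Callable[[], None]) -> None:
        self._work = work
        self._error: Optional[BaseException] = None
        self._done = threading.Event()
        self._thread = threading.Thread(target=self._worker, daemon=True)

    def start(self) -> None:
        self._thread.start()

    def _worker(self) -> None:
        try:
            self._work()
        except Exception as e:
            # Capture the exception rather than letting it escape the thread.
            self._error = e
        finally:
            self._done.set()

    def condition_met(self) -> bool:
        # Read the completion flag first: the error is always stored before it is set.
        done = self._done.is_set()
        if self._error is not None:
            # Re-raised on the caller's thread, i.e. in the test body.
            raise self._error
        return done


def wait_until(condition: Callable[[], bool], timeout_sec: float,
               backoff_sec: float = 0.01) -> None:
    """Poll condition() until it returns True or the timeout expires."""
    deadline = time.monotonic() + timeout_sec
    while time.monotonic() < deadline:
        if condition():
            return
        time.sleep(backoff_sec)
    raise TimeoutError("condition not met in time")
```

Polling `condition_met` from the test body means a worker failure surfaces as an ordinary test exception, with teardown running on the main thread as usual.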

[v22.2.x] cloud_storage: fix partition recovery with data-less segments

Redpanda suggests:

```
INFO 2022-11-23 15:00:11,965 [shard 0] main - application.cc:543 - Node configuration properties:
INFO 2022-11-23 15:00:11,965 [shard 0] main - application.cc:544 - (use `rpk config set <cfg> <value>` to change)
```

When `rpk config set` is used, the following log appears:

```
Command "set" is deprecated, use "rpk redpanda config set" instead
```

Signed-off-by: Ben Pope <ben@redpanda.com> (cherry picked from commit bfbfe88)

Signed-off-by: Michal Maslanka <michal@redpanda.com> (cherry picked from commit 35f0977)

Originally it was expected that we would not hit this path on a healthy system, but in practice we do from time to time when running against AWS S3. Fixes: redpanda-data#7208 (cherry picked from commit b2a8a46)

Fixes: redpanda-data#6357 Signed-off-by: Michal Maslanka <michal@redpanda.com> (cherry picked from commit 51fb17b)

…2.2.x [v22.2.x] redpanda: Recommend rpk config set

[v22.2.x] Backport of redpanda-data#6244, 6488

The http client in cloud_roles::make_request was not closed when an exception was thrown. RAII might usually be a better fit for ensuring the client is closed, but the client's stop method returns a future, which cannot be returned from or blocked on in a destructor. This change introduces a helper modeled after ss::with_file that accepts and owns an http client and closes it after a user-supplied operation has finished. It also adds a helper that wraps making an http request with catching and logging of common http call errors. (cherry picked from commit 024ba8e)
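The shape of such a helper can be illustrated with a Python asyncio analogue. This is not the Seastar code from the change: `FakeHttpClient` and `with_client` are hypothetical names, chosen only to show an owner that awaits an asynchronous stop on both the success and the error path.

```python
import asyncio
from typing import Awaitable, Callable, TypeVar

T = TypeVar("T")


class FakeHttpClient:
    """Hypothetical client whose stop() is asynchronous, so a destructor cannot call it."""

    def __init__(self) -> None:
        self.stopped = False

    async def stop(self) -> None:
        self.stopped = True


async def with_client(client: FakeHttpClient,
                      op: Callable[[FakeHttpClient], Awaitable[T]]) -> T:
    """Own `client` for the duration of `op` and stop it afterwards, even on error."""
    try:
        return await op(client)
    finally:
        # Runs whether op returned or raised; the exception still propagates.
        await client.stop()
```

The caller passes the operation as a function, so there is no code path on which the client is used and then left open.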

(cherry picked from commit dc5c9ef) The automated cherry pick failed in redpanda-data#7015. Adjusted the http imposter to bring in only the part of the API required by the tests in this backport.

Signed-off-by: Michal Maslanka <michal@redpanda.com> (cherry picked from commit 32dd67a)

Signed-off-by: Michal Maslanka <michal@redpanda.com> (cherry picked from commit 5daebdb)

…08-v22.2.x-574 [v22.2.x] archival: downgrade upload failure from ERR to WARN

This was missing a factor of 2 to make it more tolerant to noise. (cherry picked from commit d2e0c6a)

...to encourage a little more throughput. The default was to send single-event batches (cherry picked from commit ab1e743)

Enable re-using ports rather than just ticking upwards. This avoids the need for a super long open port range on AWS instances. (cherry picked from commit 9283551)
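One way to re-use ports instead of ticking upwards is a small free-list allocator. The sketch below is an assumption about the general technique, not the KgoVerifierService code; `PortAllocator` and its methods are hypothetical.

```python
class PortAllocator:
    """Hand out ports from a fixed range and recycle them when released."""

    def __init__(self, first: int, last: int) -> None:
        # Stored in descending order so that pop() yields the lowest free port.
        self._free = list(range(last, first - 1, -1))
        self._in_use = set()

    def allocate(self) -> int:
        if not self._free:
            raise RuntimeError("no free ports in range")
        port = self._free.pop()
        self._in_use.add(port)
        return port

    def release(self, port: int) -> None:
        # Ignore ports that were never handed out or were already released.
        if port in self._in_use:
            self._in_use.remove(port)
            self._free.append(port)
```

With recycling, the range only needs to cover the number of services alive at the same time, rather than every service started during a test run.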

This is _not_ for running them in docker on CI. It's for developers who make test changes to be able to run a miniaturized version of the test to check for breakage on their workstation. (cherry picked from commit 41fd8c5)

These got broken by kgo-verifier interface changes. (cherry picked from commit 469940e)

This gave weird output like `progress: 100.00% ProduceStatus<0 0 0 0 0/0/0>` because it was calculating the percentage properly but printing the value of parent._status before updating it. (cherry picked from commit 19ebd83)

…eTest more robust

These tests both had a similar defect. They produced much less data than comments indicated, and relied on all produces being complete before consumers started. Because old kgo-verifier code didn't output any status from the producer until about 5 seconds in, the small ~500MB produce sizes would be complete quite reliably by the first time the test checked the status, before the consumer was started. The consumer would then see the full offset range when it started, and read the whole lot in one pass. This change does not fix the underlying flaw (that the tests hardly write any data in tests that are meant to be done under load), but it fixes the way the tests can now fail because kgo-verifier is snappier at indicating status. Fixing the tests to fulfil their intended purpose and run partition movement/balancer under load is tracked by: redpanda-data#6245 (cherry picked from commit 5f378cc)

(cherry picked from commit 2e30447)

This was relying on fast propagation of controller writes between nodes, relative to the execution of the test. Fixes redpanda-data#6011 (cherry picked from commit 1db031c)

Fixed the semantics of get_broker: 404 is a determinate error (the requested resource doesn't exist), while 503 is a service-level error, so returning it from the admin interface doesn't create a false impression. Fixes redpanda-data#6016 (cherry picked from commit 7432633)

Aborts should be propagated as the standard ss::abort_requested_exception type, which is understood by handlers to be ignored silently, as it occurs during normal shutdown. Timeouts remain a specific exception type in offset_monitor, and in locations that used to catch and swallow both aborts and timeouts, timeouts are now logged at WARN severity, as they are not necessarily indicative of a fault, but may indicate a system not operating at its best. Fixes: redpanda-data#5154 (cherry picked from commit 927ea66)

wait_timed_out is permitted in CHAOS_ALLOW_LIST because chaos-style tests may well induce timeouts. (cherry picked from commit ea2abf6)

Print the missing node that caused the assertion to fail. This should make it easier to debug test failures, as there's no need to correlate with other logs to figure out the missing node. (cherry picked from commit 0579d0e)

This patch introduces a few changes to make the ClusterMetricsTests more resilient: * use admin API utilities for waiting on stable controller leadership * wait on metrics from the controller before deciding they're not present (cherry picked from commit f9eafdb)

This commit adds a method to the Redpanda service that allows for fetching samples for multiple metric families: "metrics_samples". Previously, call sites that required multiple metric families had to make separate calls to "metrics_sample" for each metric family required. This was inconvenient for waiting on a set of metrics to become available. (cherry picked from commit c8ffa67)

This commit updates the tests in ClusterMetricsTest to fetch all the cluster metric families at once. This change makes the tests significantly faster (approx 10 times) as we now wait on the full set of metrics to become available instead of waiting on each individual metric. (cherry picked from commit c0a838b)
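The idea of fetching every family from a single scrape can be sketched like this. The signature is a simplification invented for the example: the `scrape` callable and the metric family names used in the usage note are hypothetical, and the real "metrics_samples" method on the Redpanda service may differ.

```python
from typing import Callable, Dict, List, Optional

Scrape = Callable[[], Dict[str, List[float]]]


def metrics_samples(scrape: Scrape,
                    families: List[str]) -> Optional[Dict[str, List[float]]]:
    """Return samples for all requested families from one scrape, or None if any is missing."""
    scraped = scrape()
    if any(not scraped.get(family) for family in families):
        # At least one family has no samples yet; the caller can retry.
        return None
    return {family: scraped[family] for family in families}
```

A test can then wait on one condition, for example `metrics_samples(scrape, ["brokers", "topics"]) is not None`, instead of waiting on each family in turn.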

BenPope · 2023-03-15T08:20:54Z

Wrong base

mmaslankaprv and others added 30 commits November 21, 2022 12:42

admin_server: renamed on demand rebalance handler method

f22b742

Signed-off-by: Michal Maslanka <michal@redpanda.com> (cherry picked from commit 29f3a1d)

Merge pull request redpanda-data#7389 from mmaslankaprv/v22.2.x

14307bb

[V22.2.x] Backport of redpanda-data#6814, redpanda-data#7367, redpanda-data#7388

http/tests: libc++ doesn't support uniform_int_distribution<char>

93c2e51

Signed-off-by: Ben Pope <ben@redpanda.com> (cherry picked from commit 1ecca56)

cloud_storage/tests: warning: variable 'ix' set but not used

ca3b296

Signed-off-by: Ben Pope <ben@redpanda.com> (cherry picked from commit 438bc30)

cluster/tests: warning: variable 'capacity' set but not used

391b9a5

Signed-off-by: Ben Pope <ben@redpanda.com> (cherry picked from commit db04a44)

Merge pull request redpanda-data#7348 from vbotbuildovich/backport-56…

1a1b192

…54-v22.2.x-260 [v22.2.x] Fixed logging abort_requested_exception with error severity

tests: change expected output for tune list in arm

fda3a8d

Clocksource tuner is not supported for ARM, so we need to adjust the output. (cherry picked from commit cef42f4)

Merge pull request redpanda-data#7443 from vbotbuildovich/backport-64…

cf263d0

…46-7403-v22.2.x-874 [v22.2.x] rpk clocksource tuner only enabled for amd

Merge pull request redpanda-data#7409 from vbotbuildovich/backport-fi…

d6af155

…xes-to-v22.2.x-553 [v22.2.x] treewide: Small cleanups for llvm-15

k/group: lower log error sev to DEBUG for errc::shutting_down

9c800a4

Reason for condition is legit and the tombstones will be replicated the next time nodes start, so lowering log message severity to WARN. Error text is logged for other error codes (cherry picked from commit afccf5d)

tests/e2e_topic_recovery_test: factored out repeated helper function …

4f9fccd

…to a common private method this change is used in the next commit

tests/e2e_topic_recovery_test: added test... to check issue/6413

15d93d2

the test will cause the generation of remote segments with only configuration batches, and then test that the topic can be restored

v/cloud_storage: reused struct cloud_storage::stream_stats for downlo…

470353a

…ad size for the next commit, where stream_stats::size_bytes is used

Merge pull request redpanda-data#7465 from andijcr/backport/7157

362c2ff

[v22.2.x] cloud_storage: fix partition recovery with data-less segments

tests/end_to_end: fixed enabling si in end to end test

8423c84

Signed-off-by: Michal Maslanka <michal@redpanda.com> (cherry picked from commit 35f0977)

archival: downgrade upload failure from ERR to WARN

651fd6a

Originally it was expected that we would not hit this path on a healthy system, but in practice we do from time to time when running against AWS S3. Fixes: redpanda-data#7208 (cherry picked from commit b2a8a46)

tests: fixed setting si settings in read replica tests

8abb268

Fixes: redpanda-data#6357 Signed-off-by: Michal Maslanka <michal@redpanda.com> (cherry picked from commit 51fb17b)

Merge pull request redpanda-data#7476 from BenPope/backport-pr7466-v2…

ad6cf1e

…2.2.x [v22.2.x] redpanda: Recommend rpk config set

Merge pull request redpanda-data#7483 from ztlpn/v22.2.x-bp

2740c23

[v22.2.x] Backport of redpanda-data#6244, 6488

cloud_roles: test for client closing cleanly

ad470c8

(cherry picked from commit dc5c9ef) The automated cherry pick failed in redpanda-data#7015. Adjusted the http imposter to bring in only the part of the API required by the tests in this backport.

k/describe_configs: do not report replication factor and partition count

ce3bad3

Signed-off-by: Michal Maslanka <michal@redpanda.com> (cherry picked from commit 32dd67a)

tests: updated describe topics test not to rely on redpanda specifics

cd2f776

Signed-off-by: Michal Maslanka <michal@redpanda.com> (cherry picked from commit 5daebdb)

Merge pull request redpanda-data#7482 from vbotbuildovich/backport-72…

e47eeeb

…08-v22.2.x-574 [v22.2.x] archival: downgrade upload failure from ERR to WARN

jcsp and others added 16 commits March 15, 2023 08:16

test: correct a bandwidth factor on ManyPartitionsTest

8bf1de1

This was missing a factor of 2 to make it more tolerant to noise. (cherry picked from commit d2e0c6a)

tests: increase batching in manypartitionstest

d0a59fc

...to encourage a little more throughput. The default was to send single-event batches (cherry picked from commit ab1e743)

tests: improve port allocation in KgoVerifierService

ef6cf44

Enable re-using ports rather than just ticking upwards. This avoids the need for a super long open port range on AWS instances. (cherry picked from commit 9283551)

tests: fix partition movement/balancer scale tests

253109c

These got broken by kgo-verifier interface changes. (cherry picked from commit 469940e)

tests: fix progress output from KgoVerifierProducer

6e2a98e

This gave weird output like `progress: 100.00% ProduceStatus<0 0 0 0 0/0/0>` because it was calculating the percentage properly but printing the value of parent._status before updating it. (cherry picked from commit 19ebd83)

tests: improved logging in KgoVerifierService

1c553ae

(cherry picked from commit 2e30447)

tests: make FeaturesMultiNodeTest more robust

4497b76

This was relying on fast propagation of controller writes between nodes, relative to the execution of the test. Fixes redpanda-data#6011 (cherry picked from commit 1db031c)

tests: update log allow lists for offset_monitor exceptions

e0a34fc

wait_timed_out is permitted in CHAOS_ALLOW_LIST because chaos-style tests may well induce timeouts. (cherry picked from commit ea2abf6)

tests/redpanda: print node failing assertion

93861d0

Print the missing node that caused the assertion to fail. This should make it easier to debug test failures, as there's no need to correlate with other logs to figure out the missing node. (cherry picked from commit 0579d0e)

BenPope requested review from a team, twmb, r-vasquez and gene-redpanda as code owners March 15, 2023 08:20

BenPope requested review from ivotron and removed request for a team March 15, 2023 08:20

BenPope closed this Mar 15, 2023

github-actions bot added area/build area/docs area/k8s area/redpanda area/rpk labels Mar 15, 2023

This pull request was closed.
