[v22.2.x] Backport #6175 #6285 #8963 #6246 #6501 #6091 #6419 #6157 #6124 #9455

Closed
wants to merge 714 commits into from

Conversation

BenPope
Member

@BenPope BenPope commented Mar 15, 2023

Backport of PR #6175
Fixes: #6173

Backport of PR #6285 (see also #8793)
Fixes: #5959

Backport of PR #8963
Fixes: #9285

Backport of PR #6501
Fixes #6490

Backport of #6246

Backport of PR #6091
Fixes: #6011

Backport of PR #6419
Fixes: #5154

Backport of PR #6157
Fixes: #6140

Backport of PR #6124
Fixes: #6016

mmaslankaprv and others added 30 commits November 21, 2022 12:42
Signed-off-by: Michal Maslanka <michal@redpanda.com>
(cherry picked from commit 29f3a1d)
When the node count changes, the unevenness error increases. Made the
error value independent of the number of nodes, to prevent rebalancing
from being interrupted when another node is added to the cluster.

Signed-off-by: Michal Maslanka <michal@redpanda.com>
(cherry picked from commit 30a5fbc)
Signed-off-by: Ben Pope <ben@redpanda.com>
(cherry picked from commit 1ecca56)
Signed-off-by: Ben Pope <ben@redpanda.com>
(cherry picked from commit 438bc30)
Signed-off-by: Ben Pope <ben@redpanda.com>
(cherry picked from commit db04a44)
…54-v22.2.x-260

[v22.2.x] Fixed logging abort_requested_exception with error severity
Our recommended clocksource is tsc, which is only present for x86 architectures

In arch != amd rpk will show that the tuner is not available and print:
 Clocksource setting not available for this architecture

Instead of:
 Preferred clocksource 'tsc' not available

(cherry picked from commit 41be394)
Clocksource tuner is not supported for ARM, so we need to adjust the output.

(cherry picked from commit cef42f4)
In do_transfer_leadership(), if a follower is still not caught up
after prepare_transfer_leadership() is done, a `timeout` was returned.
However, it's not really a timeout, it's a flap (we thought recovery
was done but it's not). This commit changes it to `exponential_backoff`
so that the admin API returns a 503 (please retry) for that rather than
a 504 (we couldn't do it in time).

(cherry picked from commit f62b7e7)
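
The distinction drawn in this commit message can be sketched as a mapping from result code to HTTP status. This is an illustrative Python sketch, not the admin server code: the `RaftErrc` enum and `admin_status_for` function are hypothetical names, and only the pairing of `timeout` with 504 and `exponential_backoff` with 503 is taken from the message above.

```python
from enum import Enum
from http import HTTPStatus


class RaftErrc(Enum):
    # Hypothetical subset of result codes, named after the commit message.
    SUCCESS = 0
    TIMEOUT = 1
    EXPONENTIAL_BACKOFF = 2


def admin_status_for(err: RaftErrc) -> HTTPStatus:
    """Map a leadership-transfer result to the HTTP status described above."""
    if err is RaftErrc.SUCCESS:
        return HTTPStatus.OK
    if err is RaftErrc.EXPONENTIAL_BACKOFF:
        # The follower turned out not to be caught up: ask the caller to retry.
        return HTTPStatus.SERVICE_UNAVAILABLE  # 503
    if err is RaftErrc.TIMEOUT:
        # The operation genuinely did not finish in time.
        return HTTPStatus.GATEWAY_TIMEOUT  # 504
    return HTTPStatus.INTERNAL_SERVER_ERROR
```

A client that treats 503 as retryable and 504 as a hard failure would then retry the flap case automatically.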
…46-7403-v22.2.x-874

[v22.2.x] rpk clocksource tuner only enabled for amd
…xes-to-v22.2.x-553

[v22.2.x] treewide: Small cleanups for llvm-15
Reason for condition is legit and the tombstones will be replicated
the next time nodes start, so lowering log message severity to WARN.
Error text is logged for other error codes

(cherry picked from commit afccf5d)
…to a common private method

this change is used in the next commit
the test will cause the generation of remote segments with only
configuration batches, and then test that the topic can be restored
…ad size

for the next commit, where stream_stats::size_bytes is used
…ipping over configuration-only segments

When retention.bytes is set, download_log_with_capped_size is used to not exceed this size limit.
Now the function counts toward this limit only those segments that contain some topic data.
Previously the segments to download were precomputed from offset_meta keeping a total_size sum.
Now sum is done while downloading segments, to account for segments that contain no data and that will be discarded.
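
The accounting change described above might look roughly like the following. This is a Python sketch under stated assumptions, not the C++ implementation: the `Segment` type, its `has_data` flag, the iteration order, and the choice to stop before the cap would be exceeded are all hypothetical. Only the idea of summing sizes during the download loop and skipping data-less segments comes from the commit message.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass
class Segment:
    # Hypothetical stand-in for a remote segment.
    name: str
    size_bytes: int
    has_data: bool  # False for configuration-only segments


def download_capped(segments: Iterable[Segment], retention_bytes: int) -> List[str]:
    """Walk the segments in order, keeping those that fit under the cap."""
    kept: List[str] = []
    total = 0
    for seg in segments:
        if not seg.has_data:
            # Discarded after download, so it does not count toward the limit.
            continue
        if total + seg.size_bytes > retention_bytes:
            break
        total += seg.size_bytes
        kept.append(seg.name)
    return kept
```

With a precomputed total, a large configuration-only segment could use up the budget even though it contributes no topic data; summing inside the loop avoids that.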
It's unclear why, but in a case where this threw from
_worker, the test ended up hanging during teardown.

We don't need to throw from _worker: it is neater
to capture the exception and promptly raise it in
the test body via condition_met.

Fixes redpanda-data#7426

(cherry picked from commit 70c280d)
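
The pattern described in this commit message, capturing the exception in the worker and raising it from the test body, can be sketched as follows. The `Worker` class here is hypothetical; only the `_worker` and `condition_met` names come from the message.

```python
import threading
from typing import Callable, Optional


class Worker:
    """Hypothetical background worker whose failure is surfaced by the caller."""

    def __init__(self, fn: Callable[[], None]):
        self._fn = fn
        self._error: Optional[BaseException] = None
        self._done = threading.Event()
        self._thread = threading.Thread(target=self._worker, daemon=True)

    def start(self) -> None:
        self._thread.start()

    def join(self) -> None:
        self._thread.join()

    def _worker(self) -> None:
        try:
            self._fn()
        except Exception as e:
            # Capture instead of throwing out of the thread.
            self._error = e
        finally:
            self._done.set()

    def condition_met(self) -> bool:
        # Called from the test body: re-raise promptly if the worker failed.
        if self._error is not None:
            raise self._error
        return self._done.is_set()
```

Because the thread always exits cleanly, teardown has nothing left to wait on, and the failure still shows up in the test that polls `condition_met`.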
[v22.2.x] cloud_storage: fix partition recovery with data-less segments
Redpanda suggests:
```
INFO  2022-11-23 15:00:11,965 [shard 0] main - application.cc:543 - Node configuration properties:
INFO  2022-11-23 15:00:11,965 [shard 0] main - application.cc:544 - (use `rpk config set <cfg> <value>` to change)
```
When the `rpk config set` is used the following log appears:
```
Command "set" is deprecated, use "rpk redpanda config set" instead
```

Signed-off-by: Ben Pope <ben@redpanda.com>
(cherry picked from commit bfbfe88)
Signed-off-by: Michal Maslanka <michal@redpanda.com>
(cherry picked from commit 35f0977)
Originally it was expected that we would not hit this path on
a healthy system, but in practice we do from time to time
when running against AWS S3.

Fixes: redpanda-data#7208
(cherry picked from commit b2a8a46)
Fixes: redpanda-data#6357

Signed-off-by: Michal Maslanka <michal@redpanda.com>
(cherry picked from commit 51fb17b)
…2.2.x

[v22.2.x] redpanda: Recommend rpk config set
The http client in cloud_roles::make_request was not closed when an
exception was thrown. RAII might usually be a better fit for ensuring the
client is closed, but the client's stop method returns a future, which
cannot be returned from or blocked on in a destructor.

This change introduces a helper modeled after ss::with_file to accept
and own an http client and close it after a user-supplied operation is
finished.

It also adds a helper to wrap making an http request with catching and
logging common http call errors.

(cherry picked from commit 024ba8e)
(cherry picked from commit dc5c9ef)
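
The shape of such a helper can be illustrated with an asyncio analogue. This is a Python sketch, not the Seastar code: `FakeClient` and `with_client` are hypothetical names, and the coroutine-based `stop()` stands in for a stop method that returns a future.

```python
import asyncio
from typing import Awaitable, Callable, TypeVar

T = TypeVar("T")


class FakeClient:
    # Hypothetical client whose shutdown is asynchronous.
    def __init__(self) -> None:
        self.stopped = False

    async def stop(self) -> None:
        self.stopped = True


async def with_client(
    client: FakeClient, op: Callable[[FakeClient], Awaitable[T]]
) -> T:
    """Own `client` for the duration of `op`, then stop it even if `op` raises."""
    try:
        return await op(client)
    finally:
        # Asynchronous cleanup that a destructor could not wait for.
        await client.stop()
```

The helper takes over the client's lifetime, so callers cannot forget the close on the error path.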

automated cherry pick failed in
redpanda-data#7015

adjusted the http imposter to bring in only the required part of the api
for the tests in backport.
Signed-off-by: Michal Maslanka <michal@redpanda.com>
(cherry picked from commit 32dd67a)
Signed-off-by: Michal Maslanka <michal@redpanda.com>
(cherry picked from commit 5daebdb)
…08-v22.2.x-574

[v22.2.x] archival: downgrade upload failure from ERR to WARN
jcsp and others added 16 commits March 15, 2023 08:16
This was missing a factor of 2 to make it more tolerant
to noise.

(cherry picked from commit d2e0c6a)
...to encourage a little more throughput.  The default
was to send single-event batches

(cherry picked from commit ab1e743)
Enable re-using ports rather than just ticking
upwards.  This avoids the need for a super long
open port range on AWS instances.

(cherry picked from commit 9283551)
This is _not_ for running them in docker on CI.  It's for developers
who make test changes to be able to run a miniaturized version of
the test to check for breakage on their workstation.

(cherry picked from commit 41fd8c5)
These got broken by kgo-verifier interface changes.

(cherry picked from commit 469940e)
This gave weird output like
progress: 100.00% ProduceStatus<0 0 0 0 0/0/0>

...because it was calculating the percentage
properly but printing the value of parent._status
before updating it.

(cherry picked from commit 19ebd83)
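
The ordering bug described above can be shown in a few lines. This is a hypothetical Python sketch: the `ProduceStatus` fields and the `Producer.on_status` method are invented for illustration (and `total` is assumed to be positive); only the `_status` attribute name and the "update before printing" fix come from the message.

```python
from dataclasses import dataclass


@dataclass
class ProduceStatus:
    # Hypothetical fields; the real status carries more counters.
    sent: int = 0
    acked: int = 0


class Producer:
    def __init__(self, total: int):
        self.total = total  # assumed > 0
        self._status = ProduceStatus()

    def on_status(self, new_status: ProduceStatus) -> str:
        progress = 100.0 * new_status.acked / self.total
        # Update first, then format, so the percentage and the status agree.
        self._status = new_status
        return f"progress: {progress:.2f}% {self._status}"
```

Formatting `self._status` before the assignment would pair the new percentage with the previous, still-zeroed status.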
…eTest more robust

These tests both had a similar defect.  They produced much less data
than comments indicated, and relied on all produces being complete
before consumers started.

Because old kgo-verifier code didn't output any status from producer
until about 5 seconds in, the small ~500MB produce sizes would be
complete quite reliably by the first time the test checked the status,
before the consumer was started.  The consumer would then see the full
offset range when it started, and read the whole lot in one pass.

This change does not fix the underlying flaw (that the tests hardly
write any data in tests that are meant to be done under load), but it
fixes the way the tests can now fail because kgo-verifier is
snappier at indicating status.

Fixing the tests to fulfil their intended purpose and run partition
movement/balancer under load is tracked by:
redpanda-data#6245

(cherry picked from commit 5f378cc)
This was relying on fast propagation of controller
writes between nodes, relative to the execution of
the test.

Fixes redpanda-data#6011

(cherry picked from commit 1db031c)
Fixed the semantics of get_broker: 404 is a determinate error (the
requested resource doesn't exist), while 503 is a service-level error;
returning 503 keeps the admin interface from creating a false impression.

fixes redpanda-data#6016

(cherry picked from commit 7432633)
Aborts should be propagated as the standard
ss::abort_requested_exception type which is understood
by handlers to be ignored silently, as it occurs during
normal shutdown.

Timeouts remain a specific exception type in offset_monitor,
and in locations that used to catch + swallow both aborts
and timeouts, timeouts are logged at WARN severity, as they
are not necessarily indicative of a fault, but may indicate
a system not operating at its best.

Fixes: redpanda-data#5154
(cherry picked from commit 927ea66)
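
The handling policy in this commit message can be sketched as two separate exception paths. This is a Python illustration with hypothetical names (`AbortRequested`, `WaitTimedOut`, `run_guarded`); the real code uses `ss::abort_requested_exception` and a timeout type specific to offset_monitor.

```python
import logging
from typing import Callable

logger = logging.getLogger("offset_monitor_sketch")


class AbortRequested(Exception):
    """Stand-in for ss::abort_requested_exception (normal shutdown)."""


class WaitTimedOut(Exception):
    """Stand-in for the timeout type kept specific to offset_monitor."""


def run_guarded(op: Callable[[], None]) -> str:
    try:
        op()
        return "ok"
    except AbortRequested:
        # Expected during shutdown: ignore silently.
        return "aborted"
    except WaitTimedOut as e:
        # Not necessarily a fault, so WARN rather than ERROR.
        logger.warning("wait timed out: %s", e)
        return "timed_out"
```

Keeping the two types distinct is what lets the handler stay silent for one and log for the other.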
wait_timed_out is permitted in CHAOS_ALLOW_LIST because chaos-style
tests may well induce timeouts.

(cherry picked from commit ea2abf6)
Print the missing node that caused the assertion to fail. This should
make it easier to debug test failures as there's no need to correlate
with other logs to figure out the missing node.

(cherry picked from commit 0579d0e)
This patch introduces a few changes to make the ClusterMetricsTests
more resilient:
* use admin API utilities for waiting on stable controller leadership
* wait on metrics from the controller before deciding they're not
  present

(cherry picked from commit f9eafdb)
This commit adds a method to the Redpanda service that allows for fetching
samples for multiple metric families: "metrics_samples". Previously,
call sites that required multiple metric families had to make separate
calls to "metrics_sample" for each metric family required. This was
inconvenient for waiting on a set of metrics to become available.

(cherry picked from commit c8ffa67)
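
The convenience described above can be sketched as a single lookup over one scrape. This is a hypothetical Python sketch and not the signature of the service method: the `Scrape` type, the `metrics_samples_sketch` name, and the convention of returning `None` when any family is missing are all assumptions made for illustration.

```python
from typing import Dict, Iterable, List, Optional

# Hypothetical scrape result: metric family name -> sample values.
Scrape = Dict[str, List[float]]


def metrics_samples_sketch(
    scrape: Scrape, families: Iterable[str]
) -> Optional[Dict[str, List[float]]]:
    """Return samples for every requested family, or None if any is missing."""
    found: Dict[str, List[float]] = {}
    for name in families:
        samples = scrape.get(name)
        if not samples:
            # One missing family means the full set is not available yet.
            return None
        found[name] = samples
    return found
```

A caller can then poll one function until it returns a non-`None` value, rather than waiting on each family in turn.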
This commit updates the tests in ClusterMetricsTest to fetch all the
cluster metric families at once. This change makes the tests
significantly faster (approx 10 times) as we now wait on the full set of
metrics to become available instead of waiting on each individual
metric.

(cherry picked from commit c0a838b)
@BenPope BenPope requested review from ivotron and removed request for a team March 15, 2023 08:20
@BenPope
Member Author

BenPope commented Mar 15, 2023

Wrong base

This pull request was closed.