[v22.2.x] Backport #6175 #6285 #8963 #6246 #6501 #6091 #6419 #6157 #6124 #9455
Closed
Signed-off-by: Michal Maslanka <michal@redpanda.com> (cherry picked from commit 29f3a1d)
When the node count changes, the unevenness error increases. Made the error value independent of the number of nodes to prevent a situation in which rebalancing is interrupted when another node is added to the cluster. Signed-off-by: Michal Maslanka <michal@redpanda.com> (cherry picked from commit 30a5fbc)
Signed-off-by: Ben Pope <ben@redpanda.com> (cherry picked from commit 1ecca56)
Signed-off-by: Ben Pope <ben@redpanda.com> (cherry picked from commit 438bc30)
Signed-off-by: Ben Pope <ben@redpanda.com> (cherry picked from commit db04a44)
…54-v22.2.x-260 [v22.2.x] Fixed logging abort_requested_exception with error severity
Our recommended clocksource is tsc, which is only present on x86 architectures. On other architectures (arch != amd), rpk will show that the tuner is not available and print "Clocksource setting not available for this architecture" instead of "Preferred clocksource 'tsc' not available". (cherry picked from commit 41be394)
Clocksource tuner is not supported for ARM, so we need to adjust the output. (cherry picked from commit cef42f4)
In do_transfer_leadership(), if a follower was still not caught up after prepare_transfer_leadership() finished, a `timeout` was returned. However, it is not really a timeout, it is a flap (we thought recovery was done but it was not). This commit changes the result to `exponential_backoff` so that the admin API returns a 503 (please retry) rather than a 504 (we couldn't do it in time). (cherry picked from commit f62b7e7)
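The distinction drawn in this commit message can be sketched as follows. This is an illustrative Python model, not the Redpanda C++ code: the `TransferError` enum, `transfer_result`, and `http_status_for` are hypothetical names, and only the `timeout` to 504 and `exponential_backoff` to 503 mapping comes from the message above.

```python
from enum import Enum, auto


class TransferError(Enum):
    # Hypothetical names mirroring the error codes named in the commit message.
    SUCCESS = auto()
    TIMEOUT = auto()
    EXPONENTIAL_BACKOFF = auto()


def transfer_result(follower_caught_up: bool, deadline_exceeded: bool) -> TransferError:
    """Classify the outcome of a leadership transfer attempt (assumed logic)."""
    if deadline_exceeded:
        # We genuinely ran out of time.
        return TransferError.TIMEOUT
    if not follower_caught_up:
        # A flap: recovery looked done but the follower is behind again.
        return TransferError.EXPONENTIAL_BACKOFF
    return TransferError.SUCCESS


def http_status_for(err: TransferError) -> int:
    """Map the error code to the HTTP status the admin API would return."""
    if err is TransferError.SUCCESS:
        return 200
    if err is TransferError.EXPONENTIAL_BACKOFF:
        return 503  # service unavailable, the caller should retry
    if err is TransferError.TIMEOUT:
        return 504  # the operation could not complete in time
    raise ValueError(f"unhandled error code: {err}")
```

A client that sees 503 can retry with backoff, while 504 signals that the attempt itself took too long.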
…46-7403-v22.2.x-874 [v22.2.x] rpk clocksource tuner only enabled for amd
…xes-to-v22.2.x-553 [v22.2.x] treewide: Small cleanups for llvm-15
The condition has a legitimate cause and the tombstones will be replicated the next time nodes start, so the log message severity is lowered to WARN. The error text is logged for other error codes. (cherry picked from commit afccf5d)
…to a common private method this change is used in the next commit
the test will cause the generation of remote segments with only configuration batches, and then test that the topic can be restored
…ad size for the next commit, where stream_stats::size_bytes is used
…ipping over configuration-only segments When retention.bytes is set, download_log_with_capped_size is used so that the download does not exceed this size limit. The function now counts toward this limit only those segments that contain some topic data. Previously, the segments to download were precomputed from offset_meta while keeping a total_size sum. Now the sum is computed while downloading segments, to account for segments that contain no data and will be discarded.
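A minimal sketch of the accounting change described above, in Python rather than the C++ of the actual code. The `Segment` type, the newest-first iteration order, and the exact point at which the limit is checked are assumptions made for illustration; only the idea of summing sizes during download and skipping data-less segments comes from the commit message.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass
class Segment:
    name: str
    data_bytes: int  # bytes of topic data; 0 for a configuration-only segment


def download_with_capped_size(segments: Iterable[Segment], retention_bytes: int) -> List[Segment]:
    """Keep downloading segments until the running total reaches the cap.

    The total is accumulated while iterating (not precomputed), so segments
    without topic data are discarded and do not count toward the limit.
    """
    kept: List[Segment] = []
    total = 0
    for seg in segments:
        if total >= retention_bytes:
            break
        if seg.data_bytes == 0:
            # Configuration-only segment: discard it, leave the total unchanged.
            continue
        kept.append(seg)
        total += seg.data_bytes
    return kept
```

With a precomputed sum, a run of configuration-only segments could use up the budget without restoring any topic data; summing during the download avoids that.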
It's unclear why, but in a case where this threw from _worker, the test ended up hanging during teardown. We don't need to throw from _worker: it is neater to capture the exception and promptly raise it in the test body via condition_met. Fixes redpanda-data#7426 (cherry picked from commit 70c280d)
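The pattern described here, capturing the exception in the worker and raising it from the test body, can be sketched as below. The `BackgroundWorker` class is hypothetical; only the `_worker` and `condition_met` names are taken from the commit message, and the real test code may be structured differently.

```python
import threading


class BackgroundWorker:
    """Runs `work` on a thread and surfaces its failure via condition_met()."""

    def __init__(self, work):
        self._work = work
        self._error = None
        self._done = threading.Event()
        self._thread = threading.Thread(target=self._worker, daemon=True)

    def start(self):
        self._thread.start()

    def _worker(self):
        try:
            self._work()
        except Exception as e:
            # Do not let the exception escape the thread: remember it instead.
            self._error = e
        finally:
            self._done.set()

    def condition_met(self) -> bool:
        # Called from the test body (e.g. inside a polling loop), so the
        # failure is raised promptly on the main thread.
        if self._error is not None:
            raise self._error
        return self._done.is_set()
```

Because the exception is re-raised on the calling thread, the test fails in its own body and teardown proceeds normally.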
[v22.2.x] cloud_storage: fix partition recovery with data-less segments
Redpanda suggests:
```
INFO 2022-11-23 15:00:11,965 [shard 0] main - application.cc:543 - Node configuration properties:
INFO 2022-11-23 15:00:11,965 [shard 0] main - application.cc:544 - (use `rpk config set <cfg> <value>` to change)
```
When `rpk config set` is used, the following log appears:
```
Command "set" is deprecated, use "rpk redpanda config set" instead
```
Signed-off-by: Ben Pope <ben@redpanda.com> (cherry picked from commit bfbfe88)
Signed-off-by: Michal Maslanka <michal@redpanda.com> (cherry picked from commit 35f0977)
Originally it was expected that we would not hit this path on a healthy system, but in practice we do from time to time when running against AWS S3. Fixes: redpanda-data#7208 (cherry picked from commit b2a8a46)
Fixes: redpanda-data#6357 Signed-off-by: Michal Maslanka <michal@redpanda.com> (cherry picked from commit 51fb17b)
…2.2.x [v22.2.x] redpanda: Recommend rpk config set
[v22.2.x] Backport of redpanda-data#6244, 6488
The http client in cloud_roles::make_request was not closed when an exception was thrown. RAII would usually be a better fit for ensuring the client is closed, but the client's stop method returns a future, which cannot be returned from or blocked on in a destructor. This change introduces a helper modeled after ss::with_file that accepts and owns an http client and closes it after a user-supplied operation has finished. It also adds a helper that wraps making an http request with catching and logging of common http call errors. (cherry picked from commit 024ba8e)
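The shape of such a helper can be illustrated with a Python asyncio analogy. This is not the Seastar implementation: `with_client` and `FakeHttpClient` are hypothetical names, and the sketch only shows the idea of running a user-supplied operation and always awaiting the asynchronous stop afterwards, whether the operation succeeds or throws.

```python
import asyncio


class FakeHttpClient:
    """Stand-in client whose stop() is asynchronous, like the real one."""

    def __init__(self):
        self.stopped = False

    async def get(self, path: str) -> str:
        if path == "/fail":
            raise ConnectionError("connection refused")
        return f"200 OK {path}"

    async def stop(self) -> None:
        self.stopped = True


async def with_client(client, op):
    """Own `client`, run `op(client)`, and always await client.stop()."""
    try:
        return await op(client)
    finally:
        # Runs on both success and failure; a destructor could not await this.
        await client.stop()
```

A call such as `asyncio.run(with_client(FakeHttpClient(), lambda c: c.get("/ok")))` returns the response, and the client is stopped even when the operation raises.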
(cherry picked from commit dc5c9ef) The automated cherry pick failed in redpanda-data#7015. Adjusted the http imposter to bring in only the part of the API required for the tests in this backport.
Signed-off-by: Michal Maslanka <michal@redpanda.com> (cherry picked from commit 32dd67a)
Signed-off-by: Michal Maslanka <michal@redpanda.com> (cherry picked from commit 5daebdb)
…08-v22.2.x-574 [v22.2.x] archival: downgrade upload failure from ERR to WARN
This was missing a factor of 2 to make it more tolerant to noise. (cherry picked from commit d2e0c6a)
...to encourage a little more throughput. The default was to send single-event batches (cherry picked from commit ab1e743)
Enable re-using ports rather than just ticking upwards. This avoids the need for a super long open port range on AWS instances. (cherry picked from commit 9283551)
This is _not_ for running them in docker on CI. It's for developers who make test changes to be able to run a miniaturized version of the test to check for breakage on their workstation. (cherry picked from commit 41fd8c5)
These got broken by kgo-verifier interface changes. (cherry picked from commit 469940e)
This gave weird output like `progress: 100.00% ProduceStatus<0 0 0 0 0/0/0>` ...because it was calculating the percentage properly but printing the value of parent._status before updating it. (cherry picked from commit 19ebd83)
…eTest more robust These tests both had a similar defect. They produced much less data than comments indicated, and relied on all produces being complete before consumers started. Because old kgo-verifier code didn't output any status from the producer until about 5 seconds in, the small ~500MB produce sizes would be complete quite reliably by the first time the test checked the status, before the consumer was started. The consumer would then see the full offset range when it started, and read the whole lot in one pass. This change does not fix the underlying flaw (that the tests hardly write any data in tests that are meant to be done under load), but it fixes the way the tests can now fail because kgo-verifier is snappier at indicating status. Fixing the tests to fulfil their intended purpose and run partition movement/balancer under load is tracked by: redpanda-data#6245 (cherry picked from commit 5f378cc)
(cherry picked from commit 2e30447)
This was relying on fast propagation of controller writes between nodes, relative to the execution of the test. Fixes redpanda-data#6011 (cherry picked from commit 1db031c)
Fixed the semantics of get_broker: 404 is a determinate error (the requested resource doesn't exist), while 503 is a service-level error, so returning it from the admin interface doesn't create a false impression. Fixes redpanda-data#6016 (cherry picked from commit 7432633)
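The difference between the two status codes can be sketched as follows. This Python function is a hypothetical model of a `get_broker`-style handler: the condition that triggers the 503 (the broker list not being available yet) is an assumption made for illustration and is not taken from the commit.

```python
from typing import Dict, Optional, Tuple


def get_broker(brokers: Optional[Dict[int, str]], broker_id: int) -> Tuple[int, str]:
    """Return (http_status, body) for a broker lookup.

    `brokers` is None when the node cannot currently answer (assumed case).
    """
    if brokers is None:
        # Service-level error: we cannot say whether the broker exists.
        return 503, "broker list not available, please retry"
    if broker_id not in brokers:
        # Determinate error: the requested resource does not exist.
        return 404, f"broker {broker_id} not found"
    return 200, brokers[broker_id]
```

Returning 404 only when the answer is certain lets a caller treat it as final, and treat 503 as a reason to retry.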
Aborts should be propagated as the standard ss::abort_requested_exception type, which handlers understand and ignore silently, as it occurs during normal shutdown. Timeouts remain a specific exception type in offset_monitor, and in locations that used to catch and swallow both aborts and timeouts, timeouts are now logged at WARN severity, as they are not necessarily indicative of a fault but may indicate a system not operating at its best. Fixes: redpanda-data#5154 (cherry picked from commit 927ea66)
wait_timed_out is permitted in CHAOS_ALLOW_LIST because chaos-style tests may well induce timeouts. (cherry picked from commit ea2abf6)
Print the missing node that caused the assertion to fail. This should make it easier to debug test failures as there's no need to correlate with other logs to figure out the missing node. (cherry picked from commit 0579d0e)
This patch introduces a few changes to make the ClusterMetricsTests more resilient:
* use admin API utilities for waiting on stable controller leadership
* wait on metrics from the controller before deciding they're not present
(cherry picked from commit f9eafdb)
This commit adds a method to the Redpanda service that allows for fetching samples for multiple metric families: "metrics_samples". Previously, call sites that required multiple metric families had to make separate calls to "metrics_sample" for each metric family required. This was inconvenient for waiting on a set of metrics to become available. (cherry picked from commit c8ffa67)
This commit updates the tests in ClusterMetricsTest to fetch all the cluster metric families at once. This change makes the tests significantly faster (approx 10 times) as we now wait on the full set of metrics to become available instead of waiting on each individual metric. (cherry picked from commit c0a838b)
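The idea behind fetching all families at once can be sketched as below. Only the name "metrics_samples" comes from the commits above; the signature, the `scrape` callable, the `wait_for_metrics` helper, and the metric family names used in the example are assumptions and do not reflect the real service API.

```python
from typing import Callable, Dict, List, Optional

Scrape = Dict[str, List[float]]  # metric family name -> sample values


def metrics_samples(scrape: Callable[[], Scrape], families: List[str]) -> Optional[Scrape]:
    """Fetch samples for several families from a single scrape.

    Returns None if any requested family is missing, so callers can wait
    on the full set instead of on each family separately.
    """
    snapshot = scrape()
    if any(f not in snapshot for f in families):
        return None
    return {f: snapshot[f] for f in families}


def wait_for_metrics(scrape: Callable[[], Scrape], families: List[str], attempts: int = 10) -> Scrape:
    """Poll until every requested family is present (no sleep, for brevity)."""
    for _ in range(attempts):
        samples = metrics_samples(scrape, families)
        if samples is not None:
            return samples
    raise TimeoutError(f"metrics not available: {families}")
```

One wait over the whole set replaces a separate wait per family, which is where the reported speedup comes from.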
BenPope requested review from a team, twmb, r-vasquez and gene-redpanda as code owners March 15, 2023 08:20
Wrong base
This pull request was closed.
Backport of PR #6175
Fixes: #6173
Backport of PR #6285 (see also #8793)
Fixes: #5959
Backport of PR #8963
Fixes: #9285
Backport of PR #6501
Fixes #6490
Backport of #6246
Backport of PR #6091
Fixes: #6011
Backport of PR #6419
Fixes: #5154
Backport of PR #6157
Fixes: #6140
Backport of PR #6124
Fixes: #6016