More detailed partition reconfiguration tracking #10201
Conversation
Force-pushed 0cf1b95 to fc87bd2
@mmaslankaprv - what's the difference between "shard" and "core"? Can you use consistent naming?
@mmaslankaprv - this needs some
Force-pushed fc87bd2 to e282113
You are right, I changed it to be consistent.
Force-pushed e282113 to d82ad13
/ci-repeat 1
does this need a dedicated ducktape test?
@@ -2605,6 +2605,32 @@ admin_server::mark_transaction_expired_handler(
    });
}

ss::future<ss::json::json_return_type>
doesn't look like this needs to be a coroutine
I use `ss::future<>` here because in the next commit I actually leverage this function being a coroutine.
src/v/cluster/controller_api.cc
Outdated
    co_await ss::maybe_yield();
}

using ret_t = result<std::vector<ntp_reconciliation_state>>;

auto node_results = co_await ssx::parallel_transform(
Was the maybe_yield used because the set could be large? If that's true, then should concurrency be limited for the parallel transform?
The set of ntps can indeed be large, but here we are limited to the number of nodes, as the ntps are grouped. I think there is no need to limit concurrency here.
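For context, the reason concurrency stays bounded is that the fan-out of the parallel transform follows the per-node grouping, not the raw ntp list. A minimal standalone sketch of that grouping idea, with hypothetical names (`ntp`, `group_by_node`) that are not Redpanda's actual types:

```cpp
#include <map>
#include <string>
#include <utility>
#include <vector>

// Hypothetical partition identifier, standing in for model::ntp.
struct ntp {
    std::string topic;
    int partition;
};

// Bucket ntps by the node that hosts them, so a later parallel fan-out
// issues one request per node rather than one per ntp.
std::map<int, std::vector<ntp>> group_by_node(
  const std::vector<std::pair<int, ntp>>& assignments) {
    std::map<int, std::vector<ntp>> grouped;
    for (const auto& [node_id, n] : assignments) {
        grouped[node_id].push_back(n);
    }
    return grouped;
}
```

With this shape, the number of concurrent RPCs is capped by the cluster's node count even when the ntp set is very large.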
Signed-off-by: Michal Maslanka <michal@redpanda.com>
In order to provide a generic error code to express errors originating from outside of the cluster module (errors with a different category) or exceptions occurring in `controller_backend`, we introduce a separate error code. Signed-off-by: Michal Maslanka <michal@redpanda.com>
Force-pushed d82ad13 to 95d38c0
if (ec.category() == error_category()) {
    it->last_error = static_cast<errc>(ec.value());
} else {
    it->last_error = errc::partition_operation_failed;
why can't we just save the error code as is?
We need to send it over the RPC; we do not have a way to serialize the error category.
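The underlying constraint is that a `std::error_code` carries a pointer to a process-local `std::error_category`, which cannot travel over the wire. A minimal standalone sketch of the mapping in the diff above, assuming a hypothetical serializable `errc` enum (the real `cluster::errc` has many more values):

```cpp
#include <cstdint>
#include <system_error>

// Hypothetical serializable error enum, standing in for cluster::errc.
enum class errc : int16_t {
    success = 0,
    partition_operation_failed = 1,
};

// Only codes from our own category can be reproduced on the remote side;
// any foreign category collapses into one generic code, because the
// category object itself cannot be serialized over RPC.
errc to_serializable(std::error_code ec, const std::error_category& ours) {
    if (!ec) {
        return errc::success;
    }
    if (ec.category() == ours) {
        return static_cast<errc>(ec.value());
    }
    return errc::partition_operation_failed;
}
```

The receiving side can then rebuild a `std::error_code` from the integer value because both ends agree on the one shared category.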
src/v/redpanda/admin_server.cc
Outdated
size_t left_to_move = 0;
size_t already_moved = 0;
Not blocking this PR, but it would be really great to have these available as metrics, drilled down per node, to be able to see the dynamics.
Good idea, we will do it as a follow-up.
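As a rough sketch of what such a metric could aggregate, the two counters from the diff above can be tallied across in-flight reconfigurations. The struct and field names below are hypothetical, not the admin server's actual types:

```cpp
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical per-reconfiguration progress report.
struct reconfiguration_progress {
    std::size_t total_replicas;
    std::size_t moved_replicas;
};

// Sum up how many replica moves are still pending (left_to_move) and how
// many have already completed (already_moved) across all operations.
std::pair<std::size_t, std::size_t>
tally(const std::vector<reconfiguration_progress>& ops) {
    std::size_t left_to_move = 0;
    std::size_t already_moved = 0;
    for (const auto& op : ops) {
        already_moved += op.moved_replicas;
        left_to_move += op.total_replicas - op.moved_replicas;
    }
    return {left_to_move, already_moved};
}
```

Exporting these two sums per node, as suggested, would make the reconfiguration dynamics visible over time rather than only at query time.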
This is pretty cool. I only have a minor comment.
Signed-off-by: Michal Maslanka <michal@redpanda.com>
Added revision, last error, and retry count to the backend operation. This information will be used to track partition reconfiguration progress. Signed-off-by: Michal Maslanka <michal@redpanda.com>
Added `controller_api` that allows the caller to request the partition reconciliation state from all the replicas where a partition is currently hosted. The API returns a data structure containing the operations that are executed by `controller_backend` on all of the replicas. Signed-off-by: Michal Maslanka <michal@redpanda.com>
Force-pushed 0630381 to 69bbf8e
ci failure: #10163
The `/reconfigurations` endpoint didn't provide insight into the progress of partition reconfigurations. Added information that will allow the user to check the operation progress and additionally check the status of reconciliation on all replicas. Signed-off-by: Michal Maslanka <michal@redpanda.com>
Signed-off-by: Michal Maslanka <michal@redpanda.com>
Force-pushed 69bbf8e to d83a975
/backport v23.1.x
Failed to run cherry-pick command. I executed the below command:
/backport v22.2.x
Failed to run cherry-pick command. I executed the below command:
Please ignore the /backport v22.2.x command. I was testing out the backport bot.
[backport] [v23.1.x] More detailed partition reconfiguration tracking #10201
Enriched the /reconfiguration API with more information allowing users to track the progress of partition reconciliation. The API now returns a complete set of information related to the partition reconfiguration that is taking place. The API will now return the following JSON:
Fixes: https://github.com/redpanda-data/core-internal/issues/444
Backports Required
Release Notes
Improvements