rpk: introduce cluster partitions balancer-status #5798
Conversation
ready: The balancer is active but there is nothing to do.
starting: The balancer is starting but has not run yet.
in_progress: The balancer is active and is in the process of scheduling partition movements.
stalled: There are some violations, but for some reason, the balancer cannot make progress.
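The four states above can be sketched as a simple lookup table. This is an illustrative sketch, not rpk's actual code; the `describe` helper and its unknown-status fallback are invented for the example.

```python
# Illustrative sketch (not rpk source): the balancer status strings from the
# help text above, mapped to their descriptions.
BALANCER_STATES = {
    "ready": "The balancer is active but there is nothing to do.",
    "starting": "The balancer is starting but has not run yet.",
    "in_progress": "The balancer is active and is in the process of "
                   "scheduling partition movements.",
    "stalled": "There are some violations, but for some reason, "
               "the balancer cannot make progress.",
}

def describe(status: str) -> str:
    """Return a human-readable description for a reported status string."""
    return BALANCER_STATES.get(status, f"unknown status: {status!r}")

print(describe("in_progress"))
```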
Hmmm...this is a tough one, because the doc is essentially telling the customer that something is wrong, but we have no idea what that something is. Understand this could likely have many causes. Still, we should offer up some action for the user to take--should they check the config? Contact Redpanda Support (<--typically, measure of last resort in docs).
Maybe:
Violations have been detected and the balancer cannot correct them. <then include action user can/should take>
@ztlpn What action do you think the user can/should take after seeing stalled
status in the balancer status?
Yes this is a kind of catch-all status so the causes can vary. In the future we'll probably have to display more detailed error information. For now I can list some things that are reasonable to check:
- if there are enough healthy nodes to move partitions to (e.g. in a 3-node cluster no movements are possible, so we'll stall every time there is a violation)
- if the cluster has enough space (if all nodes are over 80% used disk space, we can't rebalance this out)
- if all partitions have quorum (e.g. if two of the three partition replicas are down, we can't move this partition)
- if there are any nodes in maintenance mode (we stop moving partitions if a node is in maintenance mode)
@Feediver1 should we add this to a section here, or should we refer to online docs?
I think the user experience should be the top criterion, so I'd recommend putting the info in the message - this saves customers from having to jump from the product over to the docs to look it up. We should also put this guidance in the docs (filed Issue ), but not require customers to jump to the docs to act on the stalled error msg.
Perhaps:
Violations have been detected and the balancer cannot correct them. Check the following:
- Are there enough healthy nodes to which to move partitions? For example, in a 3-node cluster no movements are possible, so you will see a stall every time there is a violation.
- Does the cluster have sufficient space? Nodes with over 80% of used disk space cannot be rebalanced.
- Do all partitions have quorum? If two of the three partition replicas are down, this partition cannot be moved.
- Are any nodes in maintenance mode? Partitions are not moved if a node is in maintenance mode.
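The checklist above can be expressed as a small diagnostic sketch. Everything here is hypothetical: the snapshot shape, field names, and `stall_reasons` helper are invented for illustration and are not an rpk or admin API schema.

```python
# Hypothetical sketch of the stalled-status checklist: given a toy snapshot of
# cluster state, report which stall conditions apply.
def stall_reasons(nodes, partitions, replication_factor=3):
    """nodes: dicts with 'healthy', 'disk_used_pct', 'maintenance'.
    partitions: dicts with 'replicas_total' and 'replicas_up'."""
    reasons = []
    healthy = [n for n in nodes if n["healthy"]]
    # With replication factor 3, a 3-node cluster has no spare node
    # to move a replica to.
    if len(healthy) <= replication_factor:
        reasons.append("not enough healthy nodes to move partitions to")
    if all(n["disk_used_pct"] > 80 for n in nodes):
        reasons.append("no node has less than 80% used disk space")
    # Quorum requires a strict majority of replicas to be up.
    if any(p["replicas_up"] * 2 <= p["replicas_total"] for p in partitions):
        reasons.append("some partitions lack quorum")
    if any(n["maintenance"] for n in nodes):
        reasons.append("a node is in maintenance mode")
    return reasons

# Toy example: a full 3-node cluster with one partition down to 1/3 replicas.
nodes = [{"healthy": True, "disk_used_pct": 85, "maintenance": False}
         for _ in range(3)]
partitions = [{"replicas_total": 3, "replicas_up": 1}]
print(stall_reasons(nodes, partitions))
```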
Nodes with over 80% of used disk space cannot be rebalanced.
Just to correct the phrasing, should be "if there are no nodes in the cluster with less than 80% of used disk space, rebalancing cannot proceed". Or "all nodes are with more than 80%". Not just a single node.
I just added this troubleshooting guide 👍
I also added info about stalled status in the new Continuous Data Balancing doc: https://deploy-preview-470--redpanda-documentation.netlify.app/docs/core/cluster-administration/continuous-data-balancing/#use-data-balancing-commands
@micheleRP We just updated the wording a little bit for the Stalled Status Troubleshooting. It might be worth changing this in the docs too. :D
@micheleRP We just updated the wording a little bit for the Stalled Status Troubleshooting. It might be worth changing this in the docs too. :D
Thanks, I've updated the info about the status command in the Continuous Data Balancing doc: https://deploy-preview-470--redpanda-documentation.netlify.app/docs/core/cluster-administration/continuous-data-balancing/#use-data-balancing-commands
Update: Added troubleshooting guide for when the status is stalled.
This command will query the cluster via the admin API to retrieve information about the partition auto balancer.
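A rough sketch of what the command does under the hood: query the Redpanda admin API for the balancer status. The endpoint path and default port here are assumptions based on the admin API's conventions, not taken from this PR; verify them against your version's API reference.

```python
# Sketch of querying the admin API for partition balancer status.
# The path below is an assumption; check your Redpanda admin API reference.
import json
from urllib.request import urlopen

def status_url(host="localhost", port=9644):
    # 9644 is Redpanda's conventional admin API port.
    return f"http://{host}:{port}/v1/cluster/partition_balancer/status"

def fetch_balancer_status(host="localhost", port=9644):
    """Fetch and decode the balancer status JSON document."""
    with urlopen(status_url(host, port)) as resp:
        return json.load(resp)

print(status_url())  # no network access needed just to build the URL
```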
Merging now -- tests passed previously and all that was updated was help text (a string)
Cover letter

Introducing rpk cluster partitions balancer-status: this command will query the cluster via the admin API to retrieve information about the partition auto balancer.

States:

Documentation: via the --help flag; we are also adding more documentation in the Admin API package, since rpk is being used as a library by Console and the k8s operator.

Fixes #5780

Backport Required

UX changes

This PR adds rpk cluster partitions balancer-status, which allows the user to query the cluster to get the partition auto balancer status instead of directly calling the admin API.

Release notes

rpk cluster partitions balancer-status