
rpk: introduce cluster partitions balancer-status #5798

Merged
merged 1 commit into redpanda-data:dev from rpk-partition-status
Aug 9, 2022

Conversation

@r-vasquez r-vasquez commented Aug 2, 2022

Cover letter

Introducing rpk cluster partitions balancer-status: this command queries the cluster via the Admin API to retrieve information about the partition auto balancer.
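
For context, below is a minimal Go sketch of the kind of request the command makes. The endpoint path /v1/cluster/partition_balancer/status, the default admin port 9644, and the generic decoding are illustrative assumptions, not taken from this PR:

package main

import (
	"encoding/json"
	"fmt"
	"net/http"
)

func main() {
	// Assumed endpoint path and default admin API port (9644); verify both
	// against your Redpanda version before relying on them.
	resp, err := http.Get("http://localhost:9644/v1/cluster/partition_balancer/status")
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()

	// Decode generically instead of asserting exact field names.
	var status map[string]any
	if err := json.NewDecoder(resp.Body).Decode(&status); err != nil {
		panic(err)
	}
	fmt.Printf("balancer status: %v\n", status["status"])
}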

States:

$ rpk cluster partitions balancer-status 
Status:                       off
Seconds Since Last Tick:      0
Current Reassignment Count:   0

$ rpk cluster partitions balancer-status 
Status:                       ready
Seconds Since Last Tick:      3
Current Reassignment Count:   0

$ rpk cluster partitions balancer-status 
Status:                       starting
Seconds Since Last Tick:      0
Current Reassignment Count:   0

$ rpk cluster partitions balancer-status 
Status:                       in_progress
Seconds Since Last Tick:      0
Current Reassignment Count:   36
Unavailable Nodes:            [0]
Over Disk Limit Nodes:        []

$ rpk cluster partitions balancer-status 
Status:                       stalled
Seconds Since Last Tick:      13
Current Reassignment Count:   0
Unavailable Nodes:            [0]
Over Disk Limit Nodes:        []

Documentation:

Documentation is provided via the --help flag, and we are also adding more documentation in the Admin API package, since rpk is used as a library by Console and the k8s operator.

$ rpk cluster partitions balancer-status --help 
Queries the cluster for partition balancer status:

If continuous partition balancing is enabled, redpanda will continuously
reassign partitions, both from unavailable nodes and from nodes using more
disk space than the configured limit.

This command can be used to monitor the partition balancer status.

FIELDS

    Status:                        Either off, ready, starting, in_progress, or
                                   stalled.
    Seconds Since Last Tick:       Seconds since the partition balancer last ran.
    Current Reassignment Count:    Current number of partition reassignments in
                                   progress.
    Unavailable Nodes:             The nodes that have been unavailable for
                                   longer than the timeout set by the
                                   "partition_autobalancing_node_availability_timeout_sec"
                                   cluster property.
    Over Disk Limit Nodes:         The nodes that have exceeded the disk usage
                                   threshold specified in the
                                   "partition_autobalancing_max_disk_usage_percent"
                                   cluster property.

BALANCER STATUS

    off:          The balancer is disabled.
    ready:        The balancer is active but there is nothing to do.
    starting:     The balancer is starting but has not run yet.
    in_progress:  The balancer is active and is in the process of scheduling
                  partition movements.
    stalled:      Violations have been detected and the balancer cannot correct
                  them.

STALLED BALANCER

A stalled balancer can occur for a few reasons and requires a bit of manual
investigation. A few areas to investigate:

* Are there enough healthy nodes to which to move partitions? For example, in
  a three-node cluster, no movements are possible for partitions with three
  replicas. You will see a stall every time there is a violation.

* Does the cluster have sufficient space? If all nodes in the cluster are
  utilizing more than 80% of their disk space, rebalancing cannot proceed.

* Do all partitions have quorum? If the majority of a partition's replicas are
  down, the partition cannot be moved.

* Are any nodes in maintenance mode? Partitions are not moved if any node is in
  maintenance mode.



Usage:
  rpk cluster partitions balancer-status [flags]

Flags:
  -h, --help   Help for balancer-status

Global Flags:
      --admin-api-tls-cert string         The certificate to be used for TLS authentication with the Admin API
      --admin-api-tls-enabled             Enable TLS for the Admin API (not necessary if specifying custom certs)
      --admin-api-tls-key string          The certificate key to be used for TLS authentication with the Admin API
      --admin-api-tls-truststore string   The truststore to be used for TLS communication with the Admin API
      --api-urls string                   Comma-separated list of admin API addresses (<IP>:<port>)
      --brokers strings                   Comma-separated list of broker ip:port pairs (e.g. --brokers '192.168.78.34:9092,192.168.78.35:9092,192.179.23.54:9092'). Alternatively, you may set the REDPANDA_BROKERS environment variable with the comma-separated list of broker addresses
      --config string                     Redpanda config file, if not set the file will be searched for in the default locations
      --password string                   SASL password to be used for authentication
      --sasl-mechanism string             The authentication mechanism to use. Supported values: SCRAM-SHA-256, SCRAM-SHA-512
      --tls-cert string                   The certificate to be used for TLS authentication with the broker
      --tls-enabled                       Enable TLS for the Kafka API (not necessary if specifying custom certs)
      --tls-key string                    The certificate key to be used for TLS authentication with the broker
      --tls-truststore string             The truststore to be used for TLS communication with the broker
      --user string                       SASL user to be used for authentication
  -v, --verbose                           Enable verbose logging (default: false)
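
For readers consuming rpk as a library, here is a hedged Go sketch of types mirroring the fields and states documented above. The struct, its JSON tags, and the needsAttention helper are hypothetical illustrations inferred from the help text, not the actual Admin API types added in this PR:

package main

import "fmt"

// BalancerStatus mirrors the fields the command prints. The JSON tags are
// guesses inferred from the help text and may not match the real admin API.
type BalancerStatus struct {
	Status                   string `json:"status"`
	SecondsSinceLastTick     int    `json:"seconds_since_last_tick"`
	CurrentReassignmentCount int    `json:"current_reassignments_count"`
	UnavailableNodes         []int  `json:"unavailable_nodes,omitempty"`
	OverDiskLimitNodes       []int  `json:"over_disk_limit_nodes,omitempty"`
}

// needsAttention encodes the BALANCER STATUS table above as a monitoring
// decision: only "stalled" (or an unrecognized state) warrants operator action.
func needsAttention(s BalancerStatus) bool {
	switch s.Status {
	case "off", "ready", "starting", "in_progress":
		return false
	case "stalled":
		return true
	default:
		return true // unknown status: surface it
	}
}

func main() {
	s := BalancerStatus{Status: "stalled", SecondsSinceLastTick: 13, UnavailableNodes: []int{0}}
	fmt.Println("needs attention:", needsAttention(s)) // true
}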

Fixes #5780

Backport Required

  • v22.2.x

UX changes

This PR adds rpk cluster partitions balancer-status, which allows the user to query the cluster for the partition auto balancer status instead of calling the admin API directly.

Release notes

  • rpk: now you can query the partition auto balancer status via rpk cluster partitions balancer-status

twmb previously approved these changes Aug 5, 2022
ready: The balancer is active but there is nothing to do.
starting: The balancer is starting but has not run yet.
in_progress: The balancer is active and is in the process of scheduling partition movements.
stalled: There are some violations, but for some reason, the balancer cannot make


Hmmm...this is a tough one, because the doc is essentially telling the customer that something is wrong, but we have no idea what that something is. I understand this could likely have many causes. Still, we should offer up some action for the user to take: should they check the config? Contact Redpanda Support? (Typically a measure of last resort in docs.)

Maybe:
Violations have been detected and the balancer cannot correct them. <then include action user can/should take>

Contributor Author


@ztlpn What action do you think the user can/should take after seeing stalled status in the balancer status?

ztlpn (Contributor)

Yes this is a kind of catch-all status so the causes can vary. In the future we'll probably have to display more detailed error information. For now I can list some things that are reasonable to check:

  • if there are enough healthy nodes to move partitions to (e.g. in a 3-node cluster no movements are possible, so we'll stall every time there is a violation; see the sketch below)
  • if the cluster has enough space (if all nodes are over 80% used disk space, we can't rebalance this out)
  • if all partitions have quorum (e.g. if two of the three partition replicas are down, we can't move this partition)
  • if there are any nodes in maintenance mode (we stop moving partitions if a node is in maintenance mode)
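
To make the first bullet concrete, here is a toy Go model of why a three-node cluster stalls for three-replica partitions; the helper and its name are illustrative, not part of rpk or Redpanda:

package main

import "fmt"

// canMoveReplica models the first check: a replica lost to an unavailable
// node can only be rebuilt on a healthy node that is outside the partition's
// current replica set, so the cluster needs at least replicationFactor
// healthy nodes in total.
func canMoveReplica(healthyNodes, replicationFactor int) bool {
	return healthyNodes >= replicationFactor
}

func main() {
	// Three-node cluster, three replicas, one node down: 2 < 3, so the
	// balancer stalls on every violation.
	fmt.Println(canMoveReplica(2, 3)) // false
	// Four-node cluster, same partition, one node down: 3 >= 3, movable.
	fmt.Println(canMoveReplica(3, 3)) // true
}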

Contributor

@Feediver1 should we add this to a section here, or should we refer to online docs?

Feediver1

I think the user experience should be the top criterion, so I would recommend putting the info in the message; this saves customers from having to jump from the product over to the docs to look it up. We should also put this guidance in the docs (filed an issue), but not require customers to jump to the docs to act on the stalled error msg.

Perhaps:
Violations have been detected and the balancer cannot correct them. Check the following:

  • Are there enough healthy nodes to which to move partitions? For example, in a 3-node cluster no movements are possible, so you will see a stall every time there is a violation.
  • Does the cluster have sufficient space? Nodes with over 80% of used disk space cannot be rebalanced.
  • Do all partitions have quorum? If two of the three partition replicas are down, this partition cannot be moved.
  • Are any nodes in maintenance mode? Partitions are not moved if a node is in maintenance mode.

Contributor

Nodes with over 80% of used disk space cannot be rebalanced.

Just to correct the phrasing: it should be "if there are no nodes in the cluster with less than 80% of used disk space, rebalancing cannot proceed", or "all nodes are at more than 80%". Not just a single node.

Contributor Author

I just added this troubleshooting guide 👍

Contributor Author

@micheleRP We just updated the wording a little bit for the Stalled Status Troubleshooting. It might be worth changing this in the docs too. :D

micheleRP

@micheleRP We just updated the wording a little bit for the Stalled Status Troubleshooting. It might be worth changing this in the docs too. :D

Thanks, I've updated the info about the status command in the Continuous Data Balancing doc: https://deploy-preview-470--redpanda-documentation.netlify.app/docs/core/cluster-administration/continuous-data-balancing/#use-data-balancing-commands

@r-vasquez

Update:

  • Rebase
  • Doc changes.

@r-vasquez

Update: Added a troubleshooting guide for when the status is stalled

@mmedenjak added the kind/enhance (New feature or request) label Aug 9, 2022
Commit message: This command will query the cluster via admin API to retrieve information about the partition auto balancer.
twmb commented Aug 9, 2022

Merging now -- tests passed previously and all that was updated was help text (a string)

@twmb twmb merged commit f91a04b into redpanda-data:dev Aug 9, 2022
@r-vasquez r-vasquez deleted the rpk-partition-status branch August 11, 2022 14:36