
rpk: introduce cluster partitions balancer-status #5798

Merged
merged 1 commit into redpanda-data:dev from rpk-partition-status
Aug 9, 2022

Conversation

@r-vasquez r-vasquez commented Aug 2, 2022

Cover letter

Introducing rpk cluster partitions balancer-status: this command queries the cluster via the Admin API to retrieve information about the partition auto balancer.
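
For context, below is a minimal Go sketch of the kind of request the command makes. The endpoint path /v1/cluster/partition_balancer/status, the default admin port 9644, and the generic decoding are illustrative assumptions, not taken from this PR:

package main

import (
	"encoding/json"
	"fmt"
	"net/http"
)

func main() {
	// Assumed endpoint path and default admin API port (9644); verify both
	// against your Redpanda version before relying on them.
	resp, err := http.Get("http://localhost:9644/v1/cluster/partition_balancer/status")
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()

	// Decode generically instead of asserting exact field names.
	var status map[string]any
	if err := json.NewDecoder(resp.Body).Decode(&status); err != nil {
		panic(err)
	}
	fmt.Printf("balancer status: %v\n", status["status"])
}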

States:

$ rpk cluster partitions balancer-status 
Status:                       off
Seconds Since Last Tick:      0
Current Reassignment Count:   0

$ rpk cluster partitions balancer-status 
Status:                       ready
Seconds Since Last Tick:      3
Current Reassignment Count:   0

$ rpk cluster partitions balancer-status 
Status:                       starting
Seconds Since Last Tick:      0
Current Reassignment Count:   0

$ rpk cluster partitions balancer-status 
Status:                       in_progress
Seconds Since Last Tick:      0
Current Reassignment Count:   36
Unavailable Nodes:            [0]
Over Disk Limit Nodes:        []

$ rpk cluster partitions balancer-status 
Status:                       stalled
Seconds Since Last Tick:      13
Current Reassignment Count:   0
Unavailable Nodes:            [0]
Over Disk Limit Nodes:        []

Documentation:

Documentation is provided via the --help flag, and we are also adding more documentation in the Admin API package, since rpk is used as a library by Console and the k8s operator.

$ rpk cluster partitions balancer-status --help 
Queries the cluster for partition balancer status:

If continuous partition balancing is enabled, redpanda will continuously
reassign partitions, both from unavailable nodes and from nodes using more
disk space than the configured limit.

This command can be used to monitor the partition balancer status.

FIELDS

    Status:                        Either off, ready, starting, in_progress, or
                                   stalled.
    Seconds Since Last Tick:       Seconds since the partition balancer last ran.
    Current Reassignment Count:    Current number of partition reassignments in
                                   progress.
    Unavailable Nodes:             The nodes that have been unavailable for
                                   longer than the timeout set by the
                                   "partition_autobalancing_node_availability_timeout_sec"
                                   cluster property.
    Over Disk Limit Nodes:         The nodes that have exceeded the disk usage
                                   threshold specified in the
                                   "partition_autobalancing_max_disk_usage_percent"
                                   cluster property.

BALANCER STATUS

    off:          The balancer is disabled.
    ready:        The balancer is active but there is nothing to do.
    starting:     The balancer is starting but has not run yet.
    in_progress:  The balancer is active and is in the process of scheduling
                  partition movements.
    stalled:      Violations have been detected and the balancer cannot correct
                  them.

STALLED BALANCER

A stalled balancer can occur for a few reasons and requires a bit of manual
investigation. A few areas to investigate:

* Are there enough healthy nodes to which to move partitions? For example, in
  a three-node cluster, no movements are possible for partitions with three
  replicas. You will see a stall every time there is a violation.

* Does the cluster have sufficient space? If all nodes in the cluster are
  utilizing more than 80% of their disk space, rebalancing cannot proceed.

* Do all partitions have quorum? If the majority of a partition's replicas are
  down, the partition cannot be moved.

* Are any nodes in maintenance mode? Partitions are not moved if any node is in
  maintenance mode.



Usage:
  rpk cluster partitions balancer-status [flags]

Flags:
  -h, --help   Help for balancer-status

Global Flags:
      --admin-api-tls-cert string         The certificate to be used for TLS authentication with the Admin API
      --admin-api-tls-enabled             Enable TLS for the Admin API (not necessary if specifying custom certs)
      --admin-api-tls-key string          The certificate key to be used for TLS authentication with the Admin API
      --admin-api-tls-truststore string   The truststore to be used for TLS communication with the Admin API
      --api-urls string                   Comma-separated list of admin API addresses (<IP>:<port>)
      --brokers strings                   Comma-separated list of broker ip:port pairs (e.g. --brokers '192.168.78.34:9092,192.168.78.35:9092,192.179.23.54:9092'). Alternatively, you may set the REDPANDA_BROKERS environment variable with the comma-separated list of broker addresses
      --config string                     Redpanda config file, if not set the file will be searched for in the default locations
      --password string                   SASL password to be used for authentication
      --sasl-mechanism string             The authentication mechanism to use. Supported values: SCRAM-SHA-256, SCRAM-SHA-512
      --tls-cert string                   The certificate to be used for TLS authentication with the broker
      --tls-enabled                       Enable TLS for the Kafka API (not necessary if specifying custom certs)
      --tls-key string                    The certificate key to be used for TLS authentication with the broker
      --tls-truststore string             The truststore to be used for TLS communication with the broker
      --user string                       SASL user to be used for authentication
  -v, --verbose                           Enable verbose logging (default: false)
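
For readers consuming rpk as a library, here is a hedged Go sketch of types mirroring the fields and states documented above. The struct, its JSON tags, and the needsAttention helper are hypothetical illustrations inferred from the help text, not the actual Admin API types added in this PR:

package main

import "fmt"

// BalancerStatus mirrors the fields the command prints. The JSON tags are
// guesses inferred from the help text and may not match the real admin API.
type BalancerStatus struct {
	Status                   string `json:"status"`
	SecondsSinceLastTick     int    `json:"seconds_since_last_tick"`
	CurrentReassignmentCount int    `json:"current_reassignments_count"`
	UnavailableNodes         []int  `json:"unavailable_nodes,omitempty"`
	OverDiskLimitNodes       []int  `json:"over_disk_limit_nodes,omitempty"`
}

// needsAttention encodes the BALANCER STATUS table above as a monitoring
// decision: only "stalled" (or an unrecognized state) warrants operator action.
func needsAttention(s BalancerStatus) bool {
	switch s.Status {
	case "off", "ready", "starting", "in_progress":
		return false
	case "stalled":
		return true
	default:
		return true // unknown status: surface it
	}
}

func main() {
	s := BalancerStatus{Status: "stalled", SecondsSinceLastTick: 13, UnavailableNodes: []int{0}}
	fmt.Println("needs attention:", needsAttention(s)) // true
}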

Fixes #5780

Backport Required

  • v22.2.x

UX changes

This PR adds rpk cluster partitions balancer-status, which allows the user to query the cluster for the partition auto balancer status instead of calling the admin API directly.

Release notes

  • rpk: now you can query the partition auto balancer status via rpk cluster partitions balancer-status

twmb previously approved these changes Aug 5, 2022
ready: The balancer is active but there is nothing to do.
starting: The balancer is starting but has not run yet.
in_progress: The balancer is active and is in the process of scheduling partition movements.
stalled: There are some violations, but for some reason, the balancer cannot make


Hmmm...this is a tough one, because the doc is essentially telling the customer that something is wrong, but we have no idea what that something is. I understand this could likely have many causes. Still, we should offer up some action for the user to take: should they check the config? Contact Redpanda Support? (Typically a measure of last resort in docs.)

Maybe:
Violations have been detected and the balancer cannot correct them. <then include action user can/should take>

Contributor Author


@ztlpn What action do you think the user can/should take after seeing stalled status in the balancer status?

ztlpn (Contributor)

Yes this is a kind of catch-all status so the causes can vary. In the future we'll probably have to display more detailed error information. For now I can list some things that are reasonable to check:

  • if there are enough healthy nodes to move partitions to (e.g. in a 3-node cluster no movements are possible, so we'll stall every time there is a violation; see the sketch below)
  • if the cluster has enough space (if all nodes are over 80% used disk space, we can't rebalance this out)
  • if all partitions have quorum (e.g. if two of the three partition replicas are down, we can't move this partition)
  • if there are any nodes in maintenance mode (we stop moving partitions if a node is in maintenance mode)
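
To make the first bullet concrete, here is a toy Go model of why a three-node cluster stalls for three-replica partitions; the helper and its name are illustrative, not part of rpk or Redpanda:

package main

import "fmt"

// canMoveReplica models the first check: a replica lost to an unavailable
// node can only be rebuilt on a healthy node that is outside the partition's
// current replica set, so the cluster needs at least replicationFactor
// healthy nodes in total.
func canMoveReplica(healthyNodes, replicationFactor int) bool {
	return healthyNodes >= replicationFactor
}

func main() {
	// Three-node cluster, three replicas, one node down: 2 < 3, so the
	// balancer stalls on every violation.
	fmt.Println(canMoveReplica(2, 3)) // false
	// Four-node cluster, same partition, one node down: 3 >= 3, movable.
	fmt.Println(canMoveReplica(3, 3)) // true
}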

Contributor

@Feediver1 should we add this to a section here, or should we refer to online docs?

Feediver1

I think the user experience should be the top criterion, so I would recommend putting the info in the message; this saves customers from having to jump from the product over to the docs to look it up. We should also put this guidance in the docs (filed an issue), but not require customers to jump to the docs to act on the stalled error msg.

Perhaps:
Violations have been detected and the balancer cannot correct them. Check the following:

  • Are there enough healthy nodes to which to move partitions? For example, in a 3-node cluster no movements are possible, so you will see a stall every time there is a violation.
  • Does the cluster have sufficient space? Nodes with over 80% of used disk space cannot be rebalanced.
  • Do all partitions have quorum? If two of the three partition replicas are down, this partition cannot be moved.
  • Are any nodes in maintenance mode? Partitions are not moved if a node is in maintenance mode.

Contributor

Nodes with over 80% of used disk space cannot be rebalanced.

Just to correct the phrasing: it should be "if there are no nodes in the cluster with less than 80% of used disk space, rebalancing cannot proceed", or "all nodes are at more than 80%". Not just a single node.

Contributor Author

I just added this troubleshooting guide 👍

Contributor Author

@micheleRP We just updated the wording a little bit for the Stalled Status Troubleshooting. It might be worth changing this in the docs too. :D

micheleRP

@micheleRP We just updated the wording a little bit for the Stalled Status Troubleshooting. It might be worth changing this in the docs too. :D

Thanks, I've updated the info about the status command in the Continuous Data Balancing doc: https://deploy-preview-470--redpanda-documentation.netlify.app/docs/core/cluster-administration/continuous-data-balancing/#use-data-balancing-commands

@r-vasquez

Update:

  • Rebase
  • Doc changes.

@r-vasquez

Update: Added a troubleshooting guide for when the status is stalled

@mmedenjak added the kind/enhance (New feature or request) label Aug 9, 2022
Commit message: This command will query the cluster via admin API to retrieve information about the partition auto balancer.
twmb commented Aug 9, 2022

Merging now -- tests passed previously and all that was updated was help text (a string)

@twmb twmb merged commit f91a04b into redpanda-data:dev Aug 9, 2022
@r-vasquez r-vasquez deleted the rpk-partition-status branch August 11, 2022 14:36