Partition autobalancer full disk test #5839

Merged

ztlpn
Contributor

@ztlpn ztlpn commented Aug 4, 2022

Cover letter

Add a test for partition balancer full disk node handling with the following scenario:

  • produce data and fill the cluster up to ~75%
  • kill a node
  • the partition balancer moves partitions to the remaining nodes, pushing them over 80%
  • start the killed node and check that the partition balancer moves partitions away from the full nodes
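
The arithmetic behind this scenario can be sketched as follows. This is only an illustration, not the test code: it assumes equally sized disks and that the killed node's data is spread evenly over the survivors, and the function name `usage_after_node_loss` and the cluster sizes are hypothetical. The ~75% fill level and the 80% threshold come from the description above.

```python
def usage_after_node_loss(num_nodes: int, fill_ratio: float) -> float:
    """Estimate the disk usage ratio of each surviving node after one node dies.

    Assumes equal disk sizes and that the dead node's data is spread evenly
    across the remaining nodes (a simplification of real replica placement).
    """
    if num_nodes < 2:
        raise ValueError("need at least two nodes")
    return fill_ratio + fill_ratio / (num_nodes - 1)


if __name__ == "__main__":
    # Hypothetical cluster sizes; the PR does not state the real one.
    for n in (4, 5, 6):
        ratio = usage_after_node_loss(n, 0.75)
        print(f"{n} nodes: survivors at {ratio:.1%}, over 80%: {ratio > 0.8}")
```

Under these assumptions, any cluster small enough for a test ends up above 80% on the surviving nodes, which is the state the last step of the scenario needs.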

To make this test possible, we introduce an environment variable that forces Redpanda to report a mock disk size in the health monitor.

This test uncovered a bug in the partition balancer planner: because the planner tries to move all replicas in a set from "bad" nodes in one reallocation, if we don't take previous reallocations into account, we might think that some node is still full even if we have already planned several moves away from it. This led to excessive movements being planned: a full node could become almost empty after a batch of reallocations was executed. To fix this, we use the "final" node disk usage (after all reallocations are finished) where appropriate.
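
A minimal sketch of the difference between current and "final" disk usage, assuming a toy planner that walks the replicas on a full node one by one. This is not the Redpanda planner (which is C++). The names `Node`, `plan_moves`, `make_cluster`, and all sizes are hypothetical and only illustrate the idea described above.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass
class Node:
    total_bytes: int
    used_bytes: int
    planned_delta: int = 0  # net bytes of already planned moves (negative = leaving)

    def current_ratio(self) -> float:
        return self.used_bytes / self.total_bytes

    def final_ratio(self) -> float:
        # Usage once every planned reallocation has finished.
        return (self.used_bytes + self.planned_delta) / self.total_bytes


def plan_moves(nodes: Dict[str, Node], replicas: List[Tuple[str, int]],
               max_ratio: float, use_final_usage: bool) -> List[Tuple[str, str, int]]:
    """Plan moves of (node_id, size) replicas away from nodes over max_ratio."""
    moves = []
    for source, size in replicas:
        node = nodes[source]
        ratio = node.final_ratio() if use_final_usage else node.current_ratio()
        if ratio <= max_ratio:
            continue  # source is not (or no longer) considered full
        # Pick the destination with the lowest final usage, excluding the source.
        dest = min((n for n in nodes if n != source),
                   key=lambda n: nodes[n].final_ratio())
        node.planned_delta -= size
        nodes[dest].planned_delta += size
        moves.append((source, dest, size))
    return moves


def make_cluster() -> Dict[str, Node]:
    # Hypothetical cluster: "a" is at 95%, "b" and "c" are nearly empty.
    return {"a": Node(100, 95), "b": Node(100, 10), "c": Node(100, 10)}


REPLICAS_ON_A = [("a", 10)] * 9  # nine replicas of 10 bytes each on node "a"
```

With `use_final_usage=False`, the sketch keeps seeing node "a" at 95% and plans to move all nine replicas, leaving it almost empty. With `use_final_usage=True`, it stops after two moves, once the planned usage of "a" drops to 75%.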

Backport Required

  • v22.2.x

UX changes

none

Release notes

  • none

@ztlpn
Contributor Author

ztlpn commented Aug 4, 2022

ci failure is #5713

@ztlpn ztlpn force-pushed the partition-autobalancer-test-full-disk branch from 957d994 to 55dfad0 Compare August 5, 2022 00:28
@ztlpn ztlpn requested a review from mmaslankaprv August 5, 2022 00:30
mmaslankaprv
mmaslankaprv previously approved these changes Aug 5, 2022
When moving partitions away from unavailable nodes it is desirable to
violate the disk usage ratio that we use to determine when we need to
move partitions away from a node - this will allow us to save data if we
have some free space left. But we shouldn't go to 100%, so we use the
storage_space_alert_free_threshold_percent config value to determine the
hard limit.
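
The soft and hard limits described in this commit message can be sketched as below. This is an assumption-laden illustration, not the planner code: the function `can_accept_replica`, its parameters, and the reading of the hard limit as "100% minus the free-space threshold" are hypothetical. Only the config name `storage_space_alert_free_threshold_percent` comes from the message itself.

```python
def can_accept_replica(used_bytes: int, total_bytes: int, replica_bytes: int,
                       soft_max_ratio: float, free_threshold_percent: float,
                       source_unavailable: bool) -> bool:
    """Decide whether a node may receive a replica of the given size.

    Assumption: the hard limit is derived from the free-space threshold as
    (100 - free_threshold_percent) / 100. Replicas rescued from an unavailable
    node may exceed the soft ratio, but never the hard limit.
    """
    final_ratio = (used_bytes + replica_bytes) / total_bytes
    hard_max_ratio = (100 - free_threshold_percent) / 100
    limit = hard_max_ratio if source_unavailable else soft_max_ratio
    return final_ratio <= limit
```

For example, with a soft ratio of 0.8 and an illustrative threshold of 5 percent, a node that would end up at 90% is rejected for an ordinary rebalancing move but accepted when the replica comes from an unavailable node.
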
Previously there was the following bug with moving partitions away from
nodes with full disks: because the planner tries to move all replicas
in a set from "bad" nodes in one reallocation, if we don't take into
account previous reallocations, we might consider that some node is
still full even if we have already planned several moves away from it.
This led to excessive movements being planned: a full node could become
almost empty after executing a batch of reallocations. In this commit we
use "final" node disk usage (after all reallocations are finished) where
appropriate.
@ztlpn
Contributor Author

ztlpn commented Aug 5, 2022

rebased on dev to resolve a merge conflict

@ztlpn ztlpn requested a review from mmaslankaprv August 5, 2022 09:50
@ztlpn
Contributor Author

ztlpn commented Aug 5, 2022

ci failure in the release build is #5276 and in the debug build is #5713

@ztlpn ztlpn merged commit f2702c3 into redpanda-data:dev Aug 5, 2022
@ztlpn ztlpn deleted the partition-autobalancer-test-full-disk branch November 27, 2023 13:22