Partition autobalancer full disk test #5839

ztlpn · 2022-08-04T13:54:18Z

Cover letter

Add a test for partition balancer full disk node handling with the following scenario:

Produce data and fill the cluster up to ~75%.
Kill a node.
The partition balancer moves partitions to the remaining nodes, pushing their disk usage over 80%.
Start the killed node and check that the partition balancer moves partitions away from the full nodes.
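A rough back-of-the-envelope check of why the remaining nodes end up over 80%. This is only a sketch: the cluster size is not stated in this PR, so the node count below is an assumption, as is the idea that data is spread evenly and that everything on the killed node is moved to the survivors.

```python
def usage_after_node_loss(nodes: int, usage_ratio: float) -> float:
    """Per-node disk usage ratio after one node's data is spread evenly
    over the remaining nodes (assumes equal disks and even distribution)."""
    if nodes < 2:
        raise ValueError("need at least two nodes to lose one")
    # Total data stays the same, but it now lives on one node fewer.
    return usage_ratio * nodes / (nodes - 1)


# Hypothetical 5-node cluster filled to 75%.
after = usage_after_node_loss(nodes=5, usage_ratio=0.75)
print(f"usage after losing one node: {after:.2%}")
```

Under these assumptions, any cluster smaller than 16 nodes crosses the 80% mark after losing one node.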

To make this test possible, we introduce an environment variable that forces redpanda to report a mock disk size in the health monitor.

This test uncovered a bug in the partition balancer planner. The planner tries to move all replicas in a set away from "bad" nodes in one reallocation, so if it doesn't take previous reallocations into account, it may still consider a node full even though several moves away from it have already been planned. This led to excessive movements being planned: a full node could end up almost empty after a batch of reallocations was executed. To fix this, we use the "final" node disk usage (after all reallocations are finished) where appropriate.
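To illustrate the difference, here is a simplified model of the bug and the fix. It is not the planner code from partition_balancer_planner.cc; the function name, parameters and numbers are all made up for the example. The only idea taken from this PR is that the "is this node still full?" check should use the usage the node will have once already-planned moves finish.

```python
def plan_moves_off_full_node(used, total, replicas, max_ratio, use_final_usage):
    """Pick replicas to move off a node until it is no longer over max_ratio.

    replicas: list of (name, size) pairs hosted on the node.
    use_final_usage: if True, subtract the size of already-planned moves
    before checking whether the node is still full (the fixed behaviour).
    Returns (planned replica names, node usage after all moves finish).
    """
    planned = []
    released = 0  # bytes that already-planned moves will free on this node
    for name, size in replicas:
        effective_used = used - released if use_final_usage else used
        if effective_used / total <= max_ratio:
            break  # node is (or will be) under the limit, stop planning
        planned.append(name)
        released += size
    return planned, used - released


replicas = [(f"p{i}", 10) for i in range(8)]

# Without accounting for planned moves the node always looks 85% full,
# so every replica gets scheduled and the node ends up almost empty.
print(plan_moves_off_full_node(85, 100, replicas, 0.8, use_final_usage=False))

# With "final" usage, one move is enough to get under 80%.
print(plan_moves_off_full_node(85, 100, replicas, 0.8, use_final_usage=True))
```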

Backport Required

v22.2.x

UX changes

none

Release notes

none

src/v/cluster/partition_balancer_planner.cc

ztlpn · 2022-08-04T22:20:38Z

ci failure is #5713

When moving partitions away from unavailable nodes, it is desirable to allow exceeding the disk usage ratio that we use to decide when partitions need to be moved away from a node. This lets us save data if we have some free space left. But we shouldn't go to 100%, so we use the storage_space_alert_free_threshold_percent config value to determine the hard limit.
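A minimal sketch of the two limits, for illustration only. It assumes the hard limit is computed as 100 minus storage_space_alert_free_threshold_percent, which is a guess at how the config value is applied; the function names and the example values (80% soft limit, 5% free-space threshold) are hypothetical.

```python
def max_used_percent(from_unavailable_node, soft_max_percent,
                     alert_free_threshold_percent):
    """Disk usage limit (in percent) for a node receiving a partition."""
    if from_unavailable_node:
        # Rescuing data from an unavailable node: allow going past the
        # soft limit, but always keep the alert free-space threshold free.
        return 100 - alert_free_threshold_percent
    return soft_max_percent


def fits(used_bytes, total_bytes, partition_bytes, limit_percent):
    """True if the node stays within limit_percent after taking the partition."""
    # Integer arithmetic avoids floating-point rounding at the boundary.
    return (used_bytes + partition_bytes) * 100 <= limit_percent * total_bytes


soft = max_used_percent(False, soft_max_percent=80, alert_free_threshold_percent=5)
hard = max_used_percent(True, soft_max_percent=80, alert_free_threshold_percent=5)

# A node at 78% can't take a 10% partition under the soft limit,
# but can when the partition comes from an unavailable node.
print(fits(78, 100, 10, soft), fits(78, 100, 10, hard))
```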

Previously there was the following bug with moving partitions away from nodes with full disks. The planner tries to move all replicas in a set away from "bad" nodes in one reallocation, so if it doesn't take previous reallocations into account, it may still consider a node full even though several moves away from it have already been planned. This led to excessive movements being planned: a full node could end up almost empty after a batch of reallocations was executed. In this commit we use the "final" node disk usage (after all reallocations are finished) where appropriate.

ztlpn · 2022-08-05T09:50:21Z

rebased on dev to resolve a merge conflict

ztlpn · 2022-08-05T12:07:40Z

ci failure in the release build is #5276 and in the debug build is #5713

ztlpn requested review from dotnwat, NyaliaLui, mmaslankaprv and VadimPlh as code owners August 4, 2022 13:54

github-actions bot added the area/redpanda label Aug 4, 2022

ztlpn force-pushed the partition-autobalancer-test-full-disk branch from 55b1d1e to 957d994 Compare August 4, 2022 13:56

mmaslankaprv reviewed Aug 4, 2022

src/v/cluster/partition_balancer_planner.cc (outdated, resolved)

mmaslankaprv reviewed Aug 4, 2022

src/v/cluster/partition_balancer_planner.cc (outdated, resolved)

mmaslankaprv reviewed Aug 4, 2022

src/v/cluster/partition_balancer_planner.cc (resolved)

ztlpn force-pushed the partition-autobalancer-test-full-disk branch from 957d994 to 55dfad0 Compare August 5, 2022 00:28

ztlpn requested a review from mmaslankaprv August 5, 2022 00:30

mmaslankaprv previously approved these changes Aug 5, 2022

ztlpn added 8 commits August 5, 2022 12:47

c/local_monitor: add ability to set mock disk size via env var

4bd1500

tests/end_to_end: add ability to set environment for redpanda

5eff865

tests: fix off-by-1 error in await_minimum_produced_records

849b684

c/partition_balancer: fix calculating node_released_disk_size map

8f9e4b8

tests/partition_balancer: full nodes test

d58ecbc

c/partition_balancer: rename planned_movement_disk_size

a484b37

ztlpn dismissed mmaslankaprv’s stale review via a484b37 August 5, 2022 09:48

ztlpn force-pushed the partition-autobalancer-test-full-disk branch from 55dfad0 to a484b37 Compare August 5, 2022 09:48

ztlpn requested a review from mmaslankaprv August 5, 2022 09:50

mmaslankaprv approved these changes Aug 5, 2022

ztlpn merged commit f2702c3 into redpanda-data:dev Aug 5, 2022

ztlpn deleted the partition-autobalancer-test-full-disk branch November 27, 2023 13:22
