
etcdserver: Implement running defrag if freeable space will exceed provided threshold (on boot) #12941

Merged: 1 commit, May 12, 2021

Conversation

@serathius (Member) commented May 10, 2021

Defragment is an expensive operation to execute while serving traffic, as it requires locking the database for multiple seconds. Instead of defragmenting during normal etcd operation, we can mitigate this cost by moving the process to server bootstrap. This also has the benefit of reducing maintenance cost for users who have not set up full automation to trigger defrag periodically, but still perform other operations such as upgrades.

This PR adds a new --experimental-bootstrap-defrag-threshold-megabytes flag to etcdserver that lets users set a disk size in megabytes. During bootstrap, if the disk space that would be freed by defrag is greater than the threshold, etcdserver will automatically execute defrag before starting to serve traffic.
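A minimal sketch of how such a bootstrap-time check could look, assuming etcd's backend.Backend interface (Size, SizeInUse, Defrag); the helper name and exact wiring are illustrative, not necessarily the PR's code:

    import (
        "go.etcd.io/etcd/server/v3/mvcc/backend"
        "go.uber.org/zap"
    )

    // maybeDefragBackend is a hypothetical helper: defrag at bootstrap only
    // when the reclaimable space exceeds the configured threshold.
    func maybeDefragBackend(lg *zap.Logger, be backend.Backend, thresholdMB uint64) error {
        thresholdBytes := int64(thresholdMB) * 1024 * 1024
        freeable := be.Size() - be.SizeInUse() // bytes a defrag would reclaim
        if freeable < thresholdBytes {
            lg.Info("skipping defragmentation, freeable space below threshold",
                zap.Int64("freeable-bytes", freeable),
                zap.Int64("threshold-bytes", thresholdBytes))
            return nil
        }
        return be.Defrag()
    }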

@ptabor

Resolved review comments (outdated): server/etcdmain/config.go, server/embed/config.go, server/embed/etcd.go
@serathius force-pushed the defrag branch 2 times, most recently from c136e85 to 53ee739 (May 10, 2021 14:20)
Resolved review comment (outdated): server/etcdserver/server.go
@serathius force-pushed the defrag branch 2 times, most recently from bdaadbf to 4307bdd (May 10, 2021 16:39)
@ptabor (Contributor) commented May 10, 2021

Thank you. This will be very helpful for infrequent defragmentation that does not impact tail latency of requests served by nodes in the cluster. If a member is updated or restarted (e.g. in response to a NO_SPACE alarm), it can be configured to perform an autohealing action such as 'defrag'.

This mitigation was discussed at the community meeting on July 30, 2020 in response to kubernetes/kubernetes#93280.

Marek: Please add a test (e.g. e2e) to at least guarantee that setting this flag does not crash the server.

@ptabor ptabor changed the title etcdserver: Implement running defrag if freeable space will exceed provided threshold. etcdserver: Implement running defrag if freeable space will exceed provided threshold (on boot) May 10, 2021
@gyuho (Contributor) commented May 10, 2021

> This will be very helpful for infrequent defragmentation that does not impact tail latency of requests served by nodes in the cluster.

Yes, defragging a live node is quite disruptive, even causing 'too many requests' errors.

Defrag on bootstrap, on the other hand, seems safe.

@serathius (Member, Author)

Added an e2e test.
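A hedged sketch of what such a smoke test might look like using etcd's embed package (the config field is the one this PR introduces; the test name and timeout are illustrative):

    import (
        "testing"
        "time"

        "go.etcd.io/etcd/server/v3/embed"
    )

    // TestBootstrapDefragFlagDoesNotCrash boots an embedded etcd with the
    // new threshold set and fails if the server never becomes ready.
    func TestBootstrapDefragFlagDoesNotCrash(t *testing.T) {
        cfg := embed.NewConfig()
        cfg.Dir = t.TempDir()
        cfg.ExperimentalBootstrapDefragThresholdMegabytes = 100
        e, err := embed.StartEtcd(cfg)
        if err != nil {
            t.Fatal(err)
        }
        defer e.Close()
        select {
        case <-e.Server.ReadyNotify():
            // server started and serves traffic despite the flag being set
        case <-time.After(30 * time.Second):
            t.Fatal("etcd did not become ready in time")
        }
    }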

Resolved review comment (outdated): server/etcdmain/config.go
@hexfusion (Contributor) left a comment

Overall this makes sense; while this could fairly trivially be added as an init container, it does have value as server runtime behavior.

@serathius force-pushed the defrag branch 3 times, most recently from b2cffb4 to 535cbb5 (May 11, 2021 11:25)
@hexfusion (Contributor)

Just a thought: do we have any concerns with race conditions and file locks? I am thinking about OSes that do not support flock. Should we consider gating this on supported arch(s)?

@ptabor (Contributor) commented May 11, 2021

> Just a thought: do we have any concerns with race conditions and file locks? I am thinking about OSes that do not support flock. Should we consider gating this on supported arch(s)?

What scenario do you envision?

My reasoning: I assume we are running this before the 'concurrent' code within etcd is initialized. We could even move it before ci.SetBackend(be) to make that even more guaranteed. So whatever concurrent process accessed the file directly would disrupt either the 'backup' or the main etcd's operations anyway.

@hexfusion (Contributor)

> > Just a thought: do we have any concerns with race conditions and file locks? I am thinking about OSes that do not support flock. Should we consider gating this on supported arch(s)?
>
> What scenario do you envision?
>
> My reasoning: I assume we are running this before the 'concurrent' code within etcd is initialized. We could even move it before ci.SetBackend(be) to make that even more guaranteed. So whatever concurrent process accessed the file directly would disrupt either the 'backup' or the main etcd's operations anyway.

Just high level (I have not dug into this yet), but if this code happens before the listeners, would anything block defrag from attempting to run against an already running etcd process? See the sketch after the list below.

  • etcd process running (graceful term)
  • new process starts
  • file lock race?
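For context, a minimal sketch of the locking behavior in question, assuming bbolt's Open timeout option (path and timeout are illustrative). bbolt takes an exclusive file lock when opening the db, so on platforms with flock support a second process cannot race a running etcd; on platforms without it, that protection would not apply, which is the concern above.

    import (
        "log"
        "time"

        bolt "go.etcd.io/bbolt"
    )

    func openExclusive(path string) *bolt.DB {
        // bbolt acquires an exclusive file lock on Open; a second process
        // blocks here and fails after the timeout instead of racing the
        // running etcd process for the same db file.
        db, err := bolt.Open(path, 0600, &bolt.Options{Timeout: time.Second})
        if err != nil {
            log.Fatalf("db is locked by another process: %v", err)
        }
        return db
    }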

Inline review thread on the threshold logging code:

    zap.String("experimental-bootstrap-defrag-threshold", humanize.Bytes(uint64(thresholdBytes))),
    )
    return nil
    }
A Contributor commented:

We probably also should log here for the non-skipping case.

A Contributor replied:

I used to think so, but be.Defrag internally has pretty decent logging already.

@xiang90 (Contributor) commented May 11, 2021

@serathius

The reclamation of disk space can be important, but I guess the most important role is to rearrange the on-disk pages and reduce the size of the freelist. Thus it can also help with write throughput/latency (by reducing random writes). So I am not sure the size factor is the most important one to configure. We could check how many small holes there are and let users configure that, but I do not really want to make the flag configuration too complicated either.

@ptabor (Contributor) commented May 11, 2021

@xiang90 I assume that size is a good proxy for the number of pages being cleaned up (size/4096), and thus for the number of entries being deleted from the free-pages list (see the back-of-envelope sketch below). From my perspective the goals are:

  • reducing the size of snapshots being sent during node recovery
  • reducing the size of the database being mmapped/mlocked (and so RAM consumption), mostly in situations where the DB had a temporary spike in usage but later shrank / got compacted
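A back-of-envelope illustration of that size-to-pages proxy, assuming bolt's default 4 KiB page size (the helper is purely illustrative):

    // approxFreedPages estimates how many free-pages-list entries a defrag
    // would remove, given total file size and bytes actually in use.
    func approxFreedPages(size, sizeInUse int64) int64 {
        const pageSize = 4096 // bolt's default page size
        return (size - sizeInUse) / pageSize
    }

    // Example: a 500 MiB file with 100 MiB in use has ~400 MiB freeable,
    // i.e. approxFreedPages(500<<20, 100<<20) == 102400 pages.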

@xiang90 (Contributor) commented May 11, 2021

@ptabor

  1. The DB tends to just grow in size over time. This is also true for Kubernetes use cases, where the cluster normally just keeps growing.
  2. When etcd does release some pages, it is probably because of compaction, and the DB will grow back soon.

Because of these two, defragging only at boot time might not be super effective for space reclamation or RAM savings.

If we really want to reduce the snapshot sending size, we'd better compact it before sending (do not send empty pages, as we do now?).

If we really want to save RAM, we need to be smarter about mmap control (do not map large holes) and page allocations.

Most of the issues we see in production for large clusters are a huge freelist (which increases search time) and random writes (multiple non-leaf branches need to be updated as well, compared to sequential writes).

@xiang90 (Contributor) commented May 11, 2021

> I assume that size is a good proxy for the number of pages being cleaned up (size/4096), and thus for the number of entries being deleted from the free-pages list.

Not entirely true when there are a lot of small holes vs. big holes. Should the items in the freelist be spans instead of individual pages? But I guess most of the time it can be a good indicator.

@ptabor (Contributor) commented May 12, 2021

@xiang90

  • We are thinking about an in-place defrag in the bolt layer that would take care of fragmentation on long-running servers. It would mark the 'last' pages as 'dirty' and let them be moved to the beginning of the file in the background, without needing to block all ongoing RW & RO transactions as the current 'deep' defrag does.
  • The full defrag on boot would be an additional level of protection.
