
etcdserver: Implement running defrag if freeable space will exceed provided threshold (on boot) #12941

Merged: 1 commit, May 12, 2021

Conversation

@serathius (Member) commented May 10, 2021

Defragment is an expensive operation to execute while serving traffic, as it requires locking the database for multiple seconds. Instead of defragmenting during normal etcd operation, we can mitigate this cost by moving the process to server bootstrap. This also has the benefit of reducing maintenance cost for users who have not set up full automation to trigger defrag periodically, but still perform other operations such as upgrades.

This PR adds a new --experimental-bootstrap-defrag-threshold-megabytes flag to etcdserver that lets users set a disk size in megabytes. During bootstrap, if the disk space that would be freed by defrag is greater than the threshold, etcdserver will automatically execute defrag before starting to serve traffic.
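A minimal sketch of how such a bootstrap-time check could look, assuming etcd's backend.Backend interface (Size, SizeInUse, Defrag); the helper name and exact wiring are illustrative, not necessarily the PR's code:

    import (
        "go.etcd.io/etcd/server/v3/mvcc/backend"
        "go.uber.org/zap"
    )

    // maybeDefragBackend is a hypothetical helper: defrag at bootstrap only
    // when the reclaimable space exceeds the configured threshold.
    func maybeDefragBackend(lg *zap.Logger, be backend.Backend, thresholdMB uint64) error {
        thresholdBytes := int64(thresholdMB) * 1024 * 1024
        freeable := be.Size() - be.SizeInUse() // bytes a defrag would reclaim
        if freeable < thresholdBytes {
            lg.Info("skipping defragmentation, freeable space below threshold",
                zap.Int64("freeable-bytes", freeable),
                zap.Int64("threshold-bytes", thresholdBytes))
            return nil
        }
        return be.Defrag()
    }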

@ptabor

Resolved review comments (outdated): server/etcdmain/config.go, server/embed/config.go, server/embed/etcd.go
@serathius force-pushed the defrag branch 2 times, most recently from c136e85 to 53ee739 (May 10, 2021 14:20)
Resolved review comment (outdated): server/etcdserver/server.go
@serathius force-pushed the defrag branch 2 times, most recently from bdaadbf to 4307bdd (May 10, 2021 16:39)
@ptabor (Contributor) commented May 10, 2021

Thank you. This will be very helpful for infrequent defragmentation that does not impact tail latency of requests served by nodes in the cluster. If a member is updated or restarted (e.g. in response to a NO_SPACE alarm), it can be configured to perform an autohealing action such as 'defrag'.

This mitigation was discussed at the community meeting on July 30, 2020 in response to kubernetes/kubernetes#93280.

Marek: Please add a test (e.g. e2e) to at least guarantee that setting this flag does not crash the server.

@ptabor ptabor changed the title etcdserver: Implement running defrag if freeable space will exceed provided threshold. etcdserver: Implement running defrag if freeable space will exceed provided threshold (on boot) May 10, 2021
@gyuho (Contributor) commented May 10, 2021

> This will be very helpful for infrequent defragmentation that does not impact tail latency of requests served by nodes in the cluster.

Yes, defragging a live node is quite disruptive, even causing 'too many requests' errors.

Defrag on bootstrap, on the other hand, seems safe.

@serathius (Member, Author)

Added an e2e test.
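A hedged sketch of what such a smoke test might look like using etcd's embed package (the config field is the one this PR introduces; the test name and timeout are illustrative):

    import (
        "testing"
        "time"

        "go.etcd.io/etcd/server/v3/embed"
    )

    // TestBootstrapDefragFlagDoesNotCrash boots an embedded etcd with the
    // new threshold set and fails if the server never becomes ready.
    func TestBootstrapDefragFlagDoesNotCrash(t *testing.T) {
        cfg := embed.NewConfig()
        cfg.Dir = t.TempDir()
        cfg.ExperimentalBootstrapDefragThresholdMegabytes = 100
        e, err := embed.StartEtcd(cfg)
        if err != nil {
            t.Fatal(err)
        }
        defer e.Close()
        select {
        case <-e.Server.ReadyNotify():
            // server started and serves traffic despite the flag being set
        case <-time.After(30 * time.Second):
            t.Fatal("etcd did not become ready in time")
        }
    }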

Resolved review comment (outdated): server/etcdmain/config.go
@hexfusion (Contributor) left a comment

Overall this makes sense; while this could fairly trivially be added as an init container, it does have value as server runtime behavior.

@serathius force-pushed the defrag branch 3 times, most recently from b2cffb4 to 535cbb5 (May 11, 2021 11:25)
@hexfusion (Contributor)

Just a thought: do we have any concerns with race conditions and file locks? I am thinking about OSes that do not support flock. Should we consider gating this on supported arch(s)?

@ptabor (Contributor) commented May 11, 2021

> Just a thought: do we have any concerns with race conditions and file locks? I am thinking about OSes that do not support flock. Should we consider gating this on supported arch(s)?

What scenario do you envision?

My reasoning: I assume we are running this before the 'concurrent' code within etcd is initialized. We could even move it before ci.SetBackend(be) to make that even more guaranteed. So whatever concurrent process accessed the file directly would disrupt either the 'backup' or the main etcd's operations anyway.

@hexfusion (Contributor)

> > Just a thought: do we have any concerns with race conditions and file locks? I am thinking about OSes that do not support flock. Should we consider gating this on supported arch(s)?
>
> What scenario do you envision?
>
> My reasoning: I assume we are running this before the 'concurrent' code within etcd is initialized. We could even move it before ci.SetBackend(be) to make that even more guaranteed. So whatever concurrent process accessed the file directly would disrupt either the 'backup' or the main etcd's operations anyway.

Just high level (I have not dug into this yet), but if this code happens before the listeners, would anything block defrag from attempting to run against an already running etcd process? See the sketch after the list below.

  • etcd process running (graceful term)
  • new process starts
  • file lock race?
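For context, a minimal sketch of the locking behavior in question, assuming bbolt's Open timeout option (path and timeout are illustrative). bbolt takes an exclusive file lock when opening the db, so on platforms with flock support a second process cannot race a running etcd; on platforms without it, that protection would not apply, which is the concern above.

    import (
        "log"
        "time"

        bolt "go.etcd.io/bbolt"
    )

    func openExclusive(path string) *bolt.DB {
        // bbolt acquires an exclusive file lock on Open; a second process
        // blocks here and fails after the timeout instead of racing the
        // running etcd process for the same db file.
        db, err := bolt.Open(path, 0600, &bolt.Options{Timeout: time.Second})
        if err != nil {
            log.Fatalf("db is locked by another process: %v", err)
        }
        return db
    }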

Inline review thread on the threshold logging code:

    zap.String("experimental-bootstrap-defrag-threshold", humanize.Bytes(uint64(thresholdBytes))),
    )
    return nil
    }
A Contributor commented:

We probably also should log here for the non-skipping case.

A Contributor replied:

I used to think so, but be.Defrag internally has pretty decent logging already.

@xiang90 (Contributor) commented May 11, 2021

@serathius

The reclamation of disk space can be important, but I guess the most important role is to rearrange the on-disk pages and reduce the size of the freelist. Thus it can also help with write throughput/latency (by reducing random writes). So I am not sure the size factor is the most important one to configure. We could check how many small holes there are and let users configure that, but I do not really want to make the flag configuration too complicated either.

@ptabor (Contributor) commented May 11, 2021

@xiang90 I assume that size is a good proxy for the number of pages being cleaned up (size/4096), and thus for the number of entries being deleted from the free-pages list (see the back-of-envelope sketch below). From my perspective the goals are:

  • reducing the size of snapshots being sent during node recovery
  • reducing the size of the database being mmapped/mlocked (and so RAM consumption), mostly in situations where the DB had a temporary spike in usage but later shrank / got compacted
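A back-of-envelope illustration of that size-to-pages proxy, assuming bolt's default 4 KiB page size (the helper is purely illustrative):

    // approxFreedPages estimates how many free-pages-list entries a defrag
    // would remove, given total file size and bytes actually in use.
    func approxFreedPages(size, sizeInUse int64) int64 {
        const pageSize = 4096 // bolt's default page size
        return (size - sizeInUse) / pageSize
    }

    // Example: a 500 MiB file with 100 MiB in use has ~400 MiB freeable,
    // i.e. approxFreedPages(500<<20, 100<<20) == 102400 pages.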

@xiang90 (Contributor) commented May 11, 2021

@ptabor

  1. The DB tends to just grow in size over time. This is also true for Kubernetes use cases, where the cluster normally just keeps growing.
  2. When etcd does release some pages, it is probably because of compaction, and the DB will grow back soon.

Because of these two, defragging only at boot time might not be super effective for space reclamation or RAM savings.

If we really want to reduce the snapshot sending size, we'd better compact it before sending (do not send empty pages, as we do now?).

If we really want to save RAM, we need to be smarter about mmap control (do not map large holes) and page allocations.

Most of the issues we see in production for large clusters are a huge freelist (which increases search time) and random writes (multiple non-leaf branches need to be updated as well, compared to sequential writes).

@xiang90 (Contributor) commented May 11, 2021

> I assume that size is a good proxy for the number of pages being cleaned up (size/4096), and thus for the number of entries being deleted from the free-pages list.

Not entirely true when there are a lot of small holes vs. big holes. Should the items in the freelist be spans instead of individual pages? But I guess most of the time it can be a good indicator.

@ptabor (Contributor) commented May 12, 2021

@xiang90

  • We are thinking about an in-place defrag in the bolt layer that would take care of fragmentation on long-running servers. It would mark the 'last' pages as 'dirty' and let them be moved to the beginning of the file in the background, without needing to block all ongoing RW & RO transactions as the current 'deep' defrag does.
  • The full defrag on boot would be an additional level of protection.
