Memory Pressure / Load Shedding #8897

Open
vmg opened this issue Sep 28, 2021 · 1 comment
Comments

@vmg
Collaborator

vmg commented Sep 28, 2021

Here's a tracking issue for a feature that @rbranson suggested and which I'm actively researching.

The problem statement is as follows: a high volume of large requests per second can cause the heap of vtgate to grow significantly, large enough to force the kernel OOM killer to terminate us. Ideally, we'd like to track our memory usage so that we can throttle incoming requests and start shedding load, keeping the total size of our heap under a pre-configured limit.

I've been doing some research and testing and, as I anticipated, this is a complex feature to implement. The core problem in practice is the tradeoff between the frequency of memory measurements and their practical utility. To put it another way: bursts of incoming requests can be so large (and hence the memory growth so sudden) that most methods of measuring memory usage become too CPU-intensive when run frequently enough to detect the growth before our process is OOM-killed.

Survey of memory usage measurement methods

  • runtime.ReadMemStats (https://pkg.go.dev/runtime#ReadMemStats): this gives us very accurate measurements of our heap size, but it is prohibitively expensive to run in practice because every call to ReadMemStats stops the world inside the runtime. To get reasonable measurements we'd need to run it every 10ms, which would block our runtime 100 times per second. That's untenable. Even if we could fix the runtime to not stop the world for this call, collecting the heap stats isn't free either, because it aggregates the stats for all the different size classes in our allocator. In short, this is a very expensive call that does far more than we need it to.

  • Reading from procfs: there are two main approaches here, either the per-process memory stats under /proc/$PID, or the cgroup memory limits. The CPU usage of constantly polling the procfs is bearable, but the delay can be significant: the procfs is not updated in real time, and we're polling it in a loop, so the two delays add up.

  • Using the Linux cgroup notification API: cgroup.event_control lets us register an eventfd that will automatically trigger on memory.pressure_level events for our cgroup. This is very compelling because the memory pressure notification fires with very little delay, and it doesn't involve actively polling the procfs, so CPU usage is minimal. There are two issues here: first, this is Linux-only -- which as far as I'm concerned is a perfectly OK trade-off for production systems. The year of Vitess on Microsoft Windows NT Server 2020 is not yet upon us. Secondly, and most importantly, this is not a trivial thing to set up inside a Kubernetes pod, a use case we're very interested in handling. By default, the processes running inside a pod have read access to the cgroup procfs, but in order to enable the notification API, we'd need to write to cgroup.event_control. Can we give write access to only the cgroup.event_control path with a K8s rule? It's not clear to me, because I'm kube-allergic and my eyes get swollen from looking at YAML. (A rough sketch of the eventfd setup follows this list.)

  • Using the runtime.SetMaxHeap API: but wait, there's no such API? It turns out that memory backpressure is a hard problem to solve and there are other people working on it. This CL from Google (https://go-review.googlesource.com/c/go/+/227767/) implements an experimental SetMaxHeap API that does essentially everything we need: it lets us configure a maximum size for our application's heap, it actively tunes the garbage collector to collect more often as the heap approaches that maximum, and it emits an event through a channel so that application code can respond to memory pressure by shedding load / slowing down incoming requests. This is a patch to the Go runtime that hasn't been merged yet, although it is currently used internally in several services at Google. (A hedged sketch of how it could be wired up also follows this list.)
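For reference, here's a rough sketch of what the cgroup (v1) eventfd registration could look like from Go. This is not Vitess code: the paths assume a cgroup v1 hierarchy mounted at /sys/fs/cgroup/memory, and it uses golang.org/x/sys/unix for the eventfd syscall.

```go
// Sketch only: registers an eventfd with the cgroup v1 memory controller so
// the kernel signals us when memory.pressure_level reaches "medium".
package memwatch

import (
	"fmt"
	"os"

	"golang.org/x/sys/unix"
)

func watchMemoryPressure(cgroupDir string, onPressure func()) error {
	// The eventfd that the kernel will signal on pressure events.
	efd, err := unix.Eventfd(0, unix.EFD_CLOEXEC)
	if err != nil {
		return err
	}

	// File descriptor for the pressure level we want to watch.
	pressure, err := os.Open(cgroupDir + "/memory.pressure_level")
	if err != nil {
		return err
	}

	// Registration requires *write* access to cgroup.event_control, which is
	// the problematic part inside a Kubernetes pod.
	reg := fmt.Sprintf("%d %d medium", efd, pressure.Fd())
	if err := os.WriteFile(cgroupDir+"/cgroup.event_control", []byte(reg), 0o600); err != nil {
		return err
	}

	go func() {
		buf := make([]byte, 8) // eventfd reads are always 8 bytes
		for {
			if _, err := unix.Read(efd, buf); err != nil {
				return
			}
			onPressure()
		}
	}()
	return nil
}
```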
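And, purely for illustration, a hedged sketch of how the experimental API might be wired in. SetMaxHeap is not part of any released Go version; the package and signature below are assumptions based on CL 227767 and may not match the rebased patchset exactly.

```go
// Sketch only: requires a Go toolchain that carries the SetMaxHeap patch.
// Assumed signature (from CL 227767, may differ in the 1.17 rebase):
//     func SetMaxHeap(bytes uintptr, notify chan<- struct{}) uintptr
package memwatch

import "runtime/debug"

func enableMaxHeap(limitBytes uintptr, shedLoad func()) {
	notify := make(chan struct{}, 1)
	debug.SetMaxHeap(limitBytes, notify) // experimental, not in stock Go

	go func() {
		// The runtime signals on this channel when the heap approaches the
		// configured limit and the GC is being pushed harder: shed load.
		for range notify {
			shedLoad()
		}
	}()
}
```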

Proposed approach

With these options in mind, I think we can make some key decisions here:

  • Actively polling for memory usage is simply unfeasible. It uses too much CPU and it's not accurate enough. Any approach that involves polling must be discarded.

  • Receiving memory pressure notifications from the cgroup API is compelling, although the required write access to cgroup.event_control is a blocker in Kubernetes environments.

  • Simply reacting to the memory pressure notifications from cgroup and shedding load is not enough: in order for the process to survive the near-kill, we also need to increase GC frequency accordingly. Fortunately this is doable: runtime/debug.SetGCPercent can be tweaked at runtime. This was not possible in older versions of Go, because changing the GC percent would immediately trigger a GC run, something that is prohibitive when fine-tuning it based on notifications, but that issue has been fixed (runtime/debug: don't STW GC on SetGCPercent golang/go#19076, golang/go@227fff2) in recent Go releases. (A sketch of this app-space approach follows this list.)

  • Listening for the cgroup events and tuning GC frequency can be accomplished fully in application space with no changes to the Go runtime. However, runtime.SetMaxHeap also does this much more efficiently and accurately. Since the Go maintainers are looking for production feedback on this API, I believe we should implement it in Vitess and perform gradual testing where possible.

    • The ideal approach would be making the memory notifications subsystem pluggable, so we can switch between the cgroup/external GC tuning approach and SetMaxHeap, and evaluate the accuracy and performance differences between the two.

    • In order to implement the SetMaxHeap API in Vitess, we'd need to build Vitess with a special version of the Go compiler whose runtime supports the API. That's not great! Obviously, it means that the SetMaxHeap backend must be optional and hidden behind a feature flag.

    • Since testing Vitess with the new API will require a custom build, it'll be reserved for internal testing by bleeding edge users who have been experiencing these memory pressure issues in production (starting with us at PlanetScale) -- but that's perfectly OK. If the new API provides good results, we'll report back to the Golang authors and hopefully help with the push to mainline it into the next Go release. If it doesn't provide good results, we'll just remove it altogether and fall back to the app-space approach.

    • In order to ensure we're on the bleeding edge, and since Vitess now requires Golang 1.17+ to compile, I've had to rebase the SetMaxHeap patchset on top of 1.17: https://github.com/vmg/go/pull/new/maxheap-1.17 (the backport was non-trivial but it was a good crash course on Golang GC internals).
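
To make the app-space variant concrete, here's a minimal sketch of reacting to pressure notifications by tightening the GC and flagging the request path to shed load. The pressure/relief channels are hypothetical and would be fed by whichever notification backend (cgroup eventfd or otherwise) gets plugged in.

```go
// Sketch only: tightens GC while under memory pressure, then restores the
// previous target. The notification channels are hypothetical placeholders.
package memwatch

import (
	"runtime/debug"
	"sync/atomic"
)

// shedding is checked by the request path to decide whether to reject or
// queue incoming work while we're close to the memory limit.
var shedding int32

func underPressure() bool { return atomic.LoadInt32(&shedding) == 1 }

func tuneGC(pressure, relief <-chan struct{}) {
	for {
		<-pressure
		// Collect far more aggressively near the limit. Per golang/go#19076,
		// changing the GC percent no longer triggers an immediate STW GC.
		prev := debug.SetGCPercent(10)
		atomic.StoreInt32(&shedding, 1)

		<-relief
		atomic.StoreInt32(&shedding, 0)
		debug.SetGCPercent(prev)
	}
}
```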

That's it for now. I'll report back later this week comparing the two potential implementations.

cc @deepthi @rbranson



@deepthi
Member

deepthi commented Oct 8, 2021

Could we possibly do something like what https://github.com/grosser/preoomkiller is doing to monitor memory usage at the cgroup level? It IS clearly a polling solution, but it's possibly lighter-weight than polling the runtime or proc, and it requires only read access in k8s.
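
For comparison, a rough sketch of that preoomkiller-style polling, assuming a cgroup v1 hierarchy mounted at /sys/fs/cgroup/memory (only read access is needed):

```go
// Sketch only: polls the cgroup v1 memory accounting files and calls
// onPressure when usage crosses a fraction of the limit. The detection delay
// is bounded below by the polling interval.
package memwatch

import (
	"bytes"
	"os"
	"strconv"
	"time"
)

func readUint(path string) (uint64, error) {
	raw, err := os.ReadFile(path)
	if err != nil {
		return 0, err
	}
	return strconv.ParseUint(string(bytes.TrimSpace(raw)), 10, 64)
}

func pollCgroup(interval time.Duration, fraction float64, onPressure func()) {
	const base = "/sys/fs/cgroup/memory/"
	for range time.Tick(interval) {
		usage, err1 := readUint(base + "memory.usage_in_bytes")
		limit, err2 := readUint(base + "memory.limit_in_bytes")
		if err1 != nil || err2 != nil || limit == 0 {
			continue
		}
		if float64(usage) >= fraction*float64(limit) {
			onPressure()
		}
	}
}
```

Note that when no limit is configured, memory.limit_in_bytes reports a very large sentinel value, so a caller would likely want to fall back to an explicitly configured ceiling in that case.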
