
OOM due to underestimated memory use in metadata handler #4804

Closed
travisdowns opened this issue May 18, 2022 · 0 comments · Fixed by #5346
Labels: area/controller, DW, kind/bug (Something isn't working)
travisdowns commented May 18, 2022

Version & Environment

Redpanda version: dev f75ceed

What went wrong?

Redpanda node continually runs out of memory at start when joining a loaded cluster. An underlying cause is that the size of metadata requests (and probably other types) is underestimated by several orders of magnitude in our memory throttling logic. This results in many requests executing in parallel, exhausting the per-shard memory.

The crashes are correlated with (a) startup or (b) another node being killed because in those cases the metadata requests pile up as we are trying to either (a) get the first metadata refresh or (b) figuring out who the new controller is, and many requests pile up on the refresh semaphore and then are suddenly uncorked all at once, causing a temporary and perhaps fatal spike in memory use.

What should have happened instead?

Our Kafka request logic should throttle requests to an appropriate concurrency level to avoid OOM.

How to reproduce the issue?

  1. Connect many consumers to a cluster.
  2. Kill a node and restart it.
  3. Observe (with added logging) the concurrency and memory usage for metadata requests.

Additional information

In reserve_request_units we calculate an estimated "memory size" of the request. This estimate is based on the over-the-wire size of the request: specifically we estimate we need request_size * 2 + 8000 bytes. In the case of metadata requests, this request is about 26 bytes on the wire, so we estimate that 8052 bytes are needed and "reserve" that many units from the memory semaphore.
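A minimal sketch of the shape of that estimate (not the actual `reserve_request_units` code; the helper name and constant layout here are illustrative only):

```cpp
#include <cstddef>

constexpr size_t fixed_overhead = 8000;

// Hypothetical helper mirroring the formula described above:
// reserve request_size * 2 + 8000 bytes from the memory semaphore.
constexpr size_t estimate_request_memory(size_t request_size) {
    return request_size * 2 + fixed_overhead;
}

// A ~26-byte metadata request yields the 8052-byte estimate mentioned above.
static_assert(estimate_request_memory(26) == 8052);
```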

However, the actual memory size to handle the request is not related to the on-the-wire size, but rather the size of the response. With many partitions, the response can be large: at least 100 bytes per partition just in "working space" alone, plus the size of the on-the-wire response which is similar. For 20k partitions that's probably in the range of 4 MB (we see memory allocation failures for single allocations in this range). So this is about 500x larger than the estimate and the memory semaphore provides no real protection (e.g., with 600M allocated to Kafka requests, we should let at most 150 requests run concurrently, but the existing throttle will let ~75,000 requests go in parallel).
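A back-of-envelope check of those numbers (the per-partition byte count and shard budget are the assumptions stated above, not measured values):

```cpp
#include <cstddef>
#include <cstdio>

int main() {
    constexpr size_t bytes_per_partition = 200;      // assumed: ~100 B working space + similar wire size
    constexpr size_t partitions = 20'000;
    constexpr size_t actual_response = bytes_per_partition * partitions; // ~4 MB
    constexpr size_t estimated = 26 * 2 + 8000;      // 8052 bytes, per the formula above
    constexpr size_t shard_budget = 600'000'000;     // ~600 MB allocated to Kafka requests

    std::printf("actual ~%zu bytes vs estimated %zu (about %zux too low)\n",
                actual_response, estimated, actual_response / estimated);
    std::printf("allowed concurrency: %zu with the estimate vs %zu if sized by the response\n",
                shard_budget / estimated, shard_budget / actual_response);
    return 0;
}
```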

It is not clear how best to fix this. A better estimate would solve the problem, but very early in the request handling flow it is not clear we can make a much better estimate: in this example we haven't built the topics table at all as we are starting up, so the requests are already in progress before we know there are many partitions. As a workaround, we can add a second semaphore inside the metadata handler: in this location we can make a better estimate of the size of the response (before we start building it) and limit the total concurrency to a response-size-aware value.
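A minimal, self-contained sketch of that workaround. All names here (`memory_units_sem`, `estimate_metadata_response_size`, `handle_metadata_request`) are hypothetical, and the blocking semaphore stands in for Redpanda's future-based one; it only illustrates gating concurrency on a response-size-aware estimate:

```cpp
#include <condition_variable>
#include <cstddef>
#include <mutex>

// A semaphore counted in bytes rather than permits: callers reserve an
// estimated number of bytes up front and release them when done.
class memory_units_sem {
public:
    explicit memory_units_sem(size_t bytes) : available_(bytes) {}

    void acquire(size_t bytes) {
        std::unique_lock lk(m_);
        cv_.wait(lk, [&] { return available_ >= bytes; });
        available_ -= bytes;
    }
    void release(size_t bytes) {
        std::lock_guard lk(m_);
        available_ += bytes;
        cv_.notify_all();
    }

private:
    std::mutex m_;
    std::condition_variable cv_;
    size_t available_;
};

// Assumption for illustration: ~200 bytes of working space plus wire
// bytes per partition in the response.
size_t estimate_metadata_response_size(size_t partition_count) {
    return partition_count * 200;
}

// The handler reserves units sized by the *response* estimate before it
// starts building the response, so concurrency adapts to response size.
void handle_metadata_request(memory_units_sem& sem, size_t partition_count) {
    const size_t estimate = estimate_metadata_response_size(partition_count);
    sem.acquire(estimate);
    // ... build and send the metadata response ...
    sem.release(estimate);
}
```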

@travisdowns travisdowns added kind/bug Something isn't working area/controller labels May 18, 2022
@travisdowns travisdowns self-assigned this May 18, 2022
@travisdowns travisdowns added the DW label Jun 15, 2022
travisdowns added a commit that referenced this issue Jul 15, 2022
Per-handler memory estimation, more accurate estimate for metadata handler

Currently we estimate that metadata requests take 8000 + rsize * 2 bytes
of memory to process, where rsize is the size of the request. Since
metadata requests are very small, this ends up being roughly 8000 bytes.

However, metadata requests which return information about every
partition and replica may easily be several MBs in size.

To fix this for metadata requests specifically, we use a new more
conservative estimate which uses the current topic and partition
configuration to give an upper bound on the size.

The remainder of this series sets up this change and also prepares
for a more comprehensive change where we will allow a "second
chance" allocation from the memory semaphore.

Fixes: #4804
Fixes: #5278
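A sketch of the kind of per-handler upper bound the commit message describes, derived from topic/partition configuration rather than the request size. The byte constants and names are assumptions for illustration; the real values and interface live in #5346:

```cpp
#include <cstddef>
#include <vector>

struct topic_summary {
    size_t partition_count;
    size_t replication_factor;
};

// Upper bound on the memory needed to build a full metadata response,
// based on the current topic/partition configuration.
size_t metadata_memory_estimate(const std::vector<topic_summary>& topics) {
    constexpr size_t per_topic_bytes = 512;     // assumed: topic name, error codes, framing
    constexpr size_t per_partition_bytes = 128; // assumed: per-partition metadata
    constexpr size_t per_replica_bytes = 16;    // assumed: replica/ISR node ids
    size_t total = 8000;                        // keep the old fixed overhead as a floor
    for (const auto& t : topics) {
        total += per_topic_bytes
               + t.partition_count
                     * (per_partition_bytes + t.replication_factor * per_replica_bytes);
    }
    return total;
}
```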