
Per-handler memory estimation, more accurate estimate for metadata handler #5346

Merged: 20 commits, Jul 15, 2022

Conversation

Member

@travisdowns travisdowns commented Jul 5, 2022

Currently we estimate that metadata requests take 8000 + rsize * 2 bytes
of memory to process, where rsize is the size of the request. Since
metadata requests are very small, this ends up being roughly 8000 bytes.

However, metadata requests which return information about every
partition and replica may easily be several MBs in size.

To fix this for metadata requests specifically, we use a new more
conservative estimate which uses the current topic and partition
configuration to give an upper bound on the size.
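
As a rough illustration of the two estimates being compared (a minimal sketch; the function names are hypothetical, not the actual redpanda code):

#include <cstddef>

// Old rule of thumb: fixed overhead plus twice the request size. For a tiny
// metadata request this comes out to roughly 8000 bytes.
constexpr size_t default_memory_estimate(size_t request_size) {
    return 8000 + request_size * 2;
}

// New idea: bound the response by the partition count rather than the request
// size (200 bytes per partition matches the constant discussed later in this PR).
constexpr size_t metadata_memory_estimate(size_t partition_count) {
    constexpr size_t bytes_per_partition = 200;
    return partition_count * bytes_per_partition;
}

// e.g. 50,000 partitions -> ~10 MB reserved, versus ~8 KB under the old rule.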

The remainder of this series sets up this change and also prepares
for a more comprehensive change where we will allow a "second
chance" allocation from the memory semaphore.

Fixes: #4804
Fixes: #5278

@mmedenjak mmedenjak added the kind/bug, area/controller, and DW labels and removed the area/redpanda label Jul 6, 2022
@travisdowns travisdowns marked this pull request as ready for review July 11, 2022 05:57
@travisdowns travisdowns changed the title Td 4804 oom fix Per-handler memory estimation, more convervative estimte for metadata handler Jul 11, 2022
Contributor

@jcsp jcsp left a comment

This is a nice change: clearly there are degrees of sophistication we can try to use for our guesses here, but having a big estimate for metadata requests is a good place to start.

The use of dynamic dispatch for handlers looks like a net win as well, although it does make me wish we had a microbenchmark to provide confidence that the code size reduction is beneficial enough to cancel out any overhead from using virtual functions -- I wonder if you tested that by hand?

// metadata response.
constexpr size_t bytes_per_partition = 200;

return max_clusterwide_partitions * bytes_per_partition;
Contributor

Can we get a reference to topic_table into the request processing path so that we can use the real partition count?

Member Author

"Probably" - but I explicitly wanted to avoid this. This early request processing should be as simple as possible, so use information we have statically available which is just the request size at this point.

At least in the scenario where we saw this problem, the topic metadata wasn't necessarily populated yet because the system was just starting up, so in some cases you wouldn't have a reliable estimate (though I guess you could fall back to the conservative estimate at that point). More generally this seems subject to data staleness: among other things, the OOM was also seen in cases where the controller was killed and requests piled up waiting for a new controller to be elected, although in that case the old topic table would probably be good enough.

If this were the final change, I would perhaps be in favor of this: but it is only a stopgap towards what I think should be the final solution: a two phase allocation of units where we allocate a small number of units up front to account for initial processing, then the "full amount" at the point during processing that we know the final amount, e.g., once we've queried the topic table and before we start serializing the result. I'll raise this solution up for more discussion, however.

Contributor

the topic metadata wasn't necessarily populated yet because the system was just starting up

That sounds strange: we don't start listening on the Kafka port until the controller has finished replay. The exception to that is when we are adding a new node, or a node is coming back after being offline for some time, but even in those cases the window for clients to act on a stale partition count is small (and very fixable, all we need is a way for nodes in that state to learn the controller high water mark via internal rpc).

it is only a stopgap towards what I think should be the final solution: a two phase allocation of units where we allocate a small number of units up front to account for initial processing, then the "full amount" at the point during processing that we know the final amount

I was thinking of the two phase allocation the other way around: that we would initially allocate the worst case number of units, and then release some when we knew enough about the request to do so. If we do it the other way (small allocation then big allocation), we risk deadlock (or just reduced throughput) from too many concurrent requests taking their small phase 1 units, then there not being enough units available to allow any of them to proceed into phase 2.

So I guess the key question here is whether the end state is going to be the small->big procedure (in which case the end state does not require the total partition count, it calculates its big amount from the request's partition list), or the end state is going to be the big->small procedure (in which case the end state will still require the total partition count, so we should plumb it in now)

Member Author

@travisdowns travisdowns Jul 12, 2022

That sounds strange: we don't start listening on the Kafka port until the controller

The server was up and accepting kafka connections, but the metadata requests were blocked on refreshing metadata from the controller (which is what caused requests to pile up over time). As I recall there had never been a successful metadata dissemination to this node at the time so as far as the node was concerned there were ~0 partitions, and by the time the dissemination completed the requests were already long past the initial part of the connection and were blocked on the metadata refresh semaphore.

Regardless of the specific details of that case, doing the estimation separately is always subject to races? You do the estimation and then some time later prepare the response and the number of partitions may have changed arbitrarily in the meantime.

I was thinking of the two phase allocation the other way around: that we would initially allocate the worst case number of units, and then release some when we knew enough about the request to do so.

Yes, I recall. The primary complaint (not from me) was the possible performance impact due to reduced concurrency. Though see below also.

If we do it the other way (small allocation then big allocation), we risk deadlock (or just reduced throughput) from too many concurrent requests taking their small phase 1 units, then there not being enough units available to allow any of them to proceed into phase 2.

Yes, I am aware. I don't know if this is the right place for a detailed review of this final state, but I do consider this case by including (1) a portion of the units which is guaranteed to be only for phase 2 and (2) prioritization of the units semaphore so in-progress requests always get first shot at the units. It was the realization that we need to do (2) that made me want to do this stopgap first before proceeding with the full fix.

So I guess the key question here is whether the end state is going to be the small->big procedure

My primary complaint about big -> small is that I don't think it is very general. It happens to work for a specific type of metadata request (the "empty" request which means "give me everything") which has a known max size and where the size is fairly easy to calculate based on cached metadata, but in general you might have to do a lot more work to estimate the size: e.g., any metadata request with a topic list (though an overestimate would at least work) or other types of requests. At this point we don't even have access to the body of the request (though we could change that): we've read only 4 bytes (the size).

In the limit, you have to more or less do the request twice: once to estimate the size and then once again to really do it. The small -> big approach just kind of works "naturally" in that you don't have a separate estimation phase: you'll always have some point where you know the size of the request as you are processing it (at least, one can hope, though there is the question of whether you know it before you use the memory, which is usually possible but often not how the code is currently written).

If it were only this one type of metadata request, I think this would be fine, but this is really a problem with a lot of request types, which I think favors small -> big if we can convince ourselves of deadlock freedom. Note that small -> big can really be something like small -> big -> small too, e.g., we can free units after we leave the high-memory part of the processing such as in the other comment (though to avoid deadlock there should only be 1 increase after the initial one).
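
To make the small -> big shape concrete, here is a minimal single-threaded sketch (the memory_budget type and its methods are hypothetical, not the redpanda semaphore code) of the two ingredients mentioned above: a slice of the budget reserved for phase 2, and phase-2 acquisitions getting access to that reserve:

#include <cstddef>

// Hypothetical sketch: phase-1 (up-front) grabs may not dip into the reserved
// slice, while phase-2 grabs from in-flight requests may.
class memory_budget {
public:
    memory_budget(size_t total, size_t reserved_for_phase2)
      : _available(total), _reserved(reserved_for_phase2) {}

    // Small up-front allocation when the request arrives.
    bool try_acquire_phase1(size_t bytes) {
        if (_available < _reserved + bytes) {
            return false;
        }
        _available -= bytes;
        return true;
    }

    // "Full amount" allocation once the real size is known; may use the reserve.
    bool try_acquire_phase2(size_t bytes) {
        if (_available < bytes) {
            return false;
        }
        _available -= bytes;
        return true;
    }

    void release(size_t bytes) { _available += bytes; }

private:
    size_t _available;
    size_t _reserved;
};

The real path would use an async semaphore with waiters rather than boolean try-acquires, but the reserved slice plus priority for in-flight requests is what keeps a crowd of phase-1 holders from starving everyone's phase 2.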

Member Author

@travisdowns travisdowns Jul 12, 2022

Though I wonder if we can decouple the "full fix" discussion from the "stopgap" one? Whichever way we go, I don't think this change really predisposes us either way. Most of the changes are just related to allowing per-handler estimates at all, and neither method of plumbing the units around is implemented here (well really, big -> small doesn't need any plumbing of units, it just needs a richer estimator, like a second handle-type call to do the estimation, while small -> big needs the units put into the request context so they can be adjusted).

Contributor

@jcsp jcsp Jul 12, 2022

As I recall there had never been a successful metadata dissemination to this node at the time so as far as the node was concerned there were ~0 partitions,

Ah, so this sounds like it could have been avoided by consulting topics_table (populated before kafka api comes up by controller replay) rather than any of the metadata dissemination state which is fuzzier.

Anyway. I guess this boils down to "is hardcoded 40k good enough for 22.2". I think what I'm missing is how hard it would actually be to wire through topics_table: if it's super onerous then I can accept hardcoding a magic number, but can you maybe try it and see how invasive the patch is? and/or for concerns about races, we could have a hardcoded lower bound that says "even if you didn't see partitions yet, pretend like there are at least 10k" or somesuch.

Maybe a compromise is to make it a config property rather than hardcoding it -- I was thinking of having one of these anyway, to encode the "here's how big we ever tested" number. That way folks doing scale testing beyond 40k can at least dial it up without recompiling redpanda.

Member Author

Ah, so this sounds like it could have been avoided by consulting topics_table (populated before kafka api comes up by controller replay) rather than any of the metadata dissemination state which is fuzzier.

I believe this was the first startup ever of the node so I suspect the controller state is at least initially empty, though maybe you said above that on first join we don't start accepting connections until controller state has been recovered/replayed from another node?

Anyway. I guess this boils down to "is hardcoded 40k good enough for 22.2". I think what I'm missing is how hard it would actually be to wire through topics_table: if it's super onerous then I can accept hardcoding a magic number, but can you maybe try it and see how invasive the patch is?

This should be fairly easy and yes I can do that.

we could have a hardcoded lower bound that says "even if you didn't see partitions yet, pretend like there are at least 10k" or somesuch.

Yes, though the race isn't only for the no-partitions case, it can happen at any point: from jumping from 20k to 100k partitions or whatever. I think this is a downside of big -> small in general: any estimate can always become out of date between the estimation point and when the request is processed, and even though this window might be small with constant requests we would expect it to be hit by some requests with high probability when the conditions change (though a fix for this could be to fail requests that are found to be unexpectedly large rather than forging ahead into a possible OOM).

Maybe a compromise is to make it a config property rather than hardcoding it -- I was thinking of having one of these anyway, to encode the "here's how big we ever tested" number. That way folks doing scale testing beyond 40k can at least dial it up without recompiling redpanda.

Also a good suggestion though I'll try your other suggestion first.

Member Author

@travisdowns travisdowns Jul 14, 2022

I have updated this PR to use an estimate based on the actual (worst-case) topic & partition metadata, plus the number of brokers.

// We cannot make a precise estimate of the size of a metadata response by
// examining only the size of the request (nor even by examining the entire
// request) since the response depends on the number of partitions in the
// cluster. Instead, we return a conservative estimate based on the soft
Contributor

Does it require much more plumbing to also implement the return of unneeded units once the request is read? It's not essential for robustness, but I'd like to have confidence that the code structure here is sufficient for adding that.

Member Author

There are a few places that we can release units, but I think "mostly yes".

Specifically, the main place we can release units is when the response processing has been completed (the handler is done) and we just need to finish writing out the response: at this point we know the remaining memory needed is simply the encoded size of the response (plus perhaps some padding due to allocation buckets, though we hardly handle this anywhere currently), so we can reduce the units to that amount.

This can be done generically w/o help from the handler (and in fact is quite easy, I just omitted it here out of an abundance of caution given that we'd like to get this into 22.2).

The other place would be inside the request processing, e.g., at certain points where it is known that large intermediate structures have been cleaned up, we can free units. This can be part of the dynamic two-phase allocation as described above and would be up to each handler. To do this, we need to pass the units into somewhere the handlers can access them, e.g., the request context, which is where I was heading with much of this plumbing. I don't actually plumb it into the request context yet as it is not used in this stopgap fix.
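
A minimal sketch of that first release point (the session_resources shape and shrink_to here are illustrative, not the actual class): once the handler is done, shrink the reservation down to just the encoded response bytes and signal the rest back to the memory semaphore:

#include <cstddef>

// Illustrative only: track how many bytes this request has reserved, and give
// back everything above what is still needed to finish writing the response.
struct session_resources {
    size_t reserved_bytes;

    size_t shrink_to(size_t still_needed) {
        size_t to_return = reserved_bytes > still_needed
                             ? reserved_bytes - still_needed
                             : 0;
        reserved_bytes -= to_return;
        return to_return; // caller returns this amount to the shared semaphore
    }
};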

@jcsp jcsp changed the title Per-handler memory estimation, more convervative estimte for metadata handler Per-handler memory estimation, more convervative estimate for metadata handler Jul 11, 2022
@travisdowns travisdowns changed the title Per-handler memory estimation, more convervative estimate for metadata handler Per-handler memory estimation, more conservative estimate for metadata handler Jul 11, 2022
* to handlers, generally const objects with static storage duration
* obtained from handler_for_key.
*/
using any_handler = const any_handler_t*;
Member

nit: I am wondering if we can just name this handler; this way, according to the Liskov substitution principle, each request_handler inheriting from any_handler is a handler. (This is a matter of taste, I guess.)

Member Author

@travisdowns travisdowns Jul 14, 2022

I like this name, but the problem is we already have (even before these changes) a kafka::handler symbol which is the helper template for single-pass handlers, so we can create one like:

using fetch_handler = handler<fetch_api, 4, 11>;

Both uses of the symbol can't co-exist. Would you like to rename that helper to something else?

It is also possible I did not fully understand the suggestion.

Member

so maybe handler_base or handler_interface?

Member Author

@mmaslankaprv - OK, I tried out your original suggestion of handler: to do this I renamed the existing template helper to single_stage_helper which is consistent with the other helper with the same purpose for 2-stage: two_stage_helper.

This freed up kafka::handler to be used for the pointer to the handler interface. I've pushed those changes here.
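
For readers following along, the shape under discussion is roughly the following (an illustrative sketch only; the real interface has more methods and different signatures, and request_context here is just a stand-in):

#include <cstddef>

struct request_context; // stand-in for the real context type
using api_key = short;  // simplified; the real key is a named type

// Abstract handler interface: one const instance per API, with static storage
// duration, looked up by key and passed around by pointer.
struct handler_interface {
    virtual api_key key() const = 0;
    virtual size_t memory_estimate(size_t request_size,
                                   const request_context& ctx) const = 0;
    virtual void handle(request_context& ctx) const = 0;
    virtual ~handler_interface() = default;
};

// After the rename discussed above, the pointer alias gets the short name:
using handler = const handler_interface*;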

@mmaslankaprv
Member

This mostly looks good, if we are concerned about the virtual dispatch overhead we may replace any_handler with a concept and then generate the dispatch switch()...case with template parameter pack expansion
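
For what that alternative could look like, a small sketch (the handler types and dispatch_by_key are hypothetical stand-ins; the real request_types list and concept would be richer), using a fold expression to play the role of the generated switch/case:

#include <cstdint>
#include <cstdio>

// Stand-ins for real handler types, each exposing a compile-time api_key.
struct metadata_handler { static constexpr int16_t api_key = 3; };
struct fetch_handler    { static constexpr int16_t api_key = 1; };

// A visitor whose call operator is instantiated per handler type, so the
// handler-specific work is statically dispatched once the key matches.
struct print_key_visitor {
    template<typename Handler>
    void operator()() const { std::printf("matched key %d\n", Handler::api_key); }
};

// Fold expression over the handler pack: behaves like a generated switch,
// short-circuiting on the first matching key; returns false if none match.
template<typename... Handlers, typename Fn>
bool dispatch_by_key(int16_t key, Fn&& fn) {
    return ((key == Handlers::api_key
               ? (fn.template operator()<Handlers>(), true)
               : false)
            || ...);
}

int main() {
    return dispatch_by_key<metadata_handler, fetch_handler>(3, print_key_visitor{}) ? 0 : 1;
}

The fold short-circuits like the old switch did, and the visitor body is still instantiated per handler type, so nothing is virtually dispatched.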

Contributor

@graphcareful graphcareful left a comment

A few questions and small comments, but super nice job

std::optional<any_handler> handler_for_key(kafka::api_key key) {
static constexpr auto lut = make_lut(request_types{});
if (key >= (short)0 && key < (short)lut.size()) {
if (auto handler = lut.at(key)) {
Contributor

you can prefer operator[] here instead of calling .at because you've already performed the bounds checking.

Member Author

It's true - I did it this way to quiet a linter warning against ever using [] with non-constant bounds, which wasn't smart enough to notice the earlier bounds check. I'll switch to [], linter be damned.

Contributor

// NO LINT hehe

Contributor

also you can mark this as noexcept too

Member Author

Good idea, added.
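
Putting both suggestions together, the lookup reads roughly like this (reassembled from the excerpt above and relying on the same surrounding definitions; the trailing nullopt return is assumed since the excerpt is cut off):

std::optional<any_handler> handler_for_key(kafka::api_key key) noexcept {
    static constexpr auto lut = make_lut(request_types{});
    if (key >= (short)0 && key < (short)lut.size()) {
        // bounds already checked above, so [] avoids the extra check in at()
        if (auto handler = lut[key]) {
            return handler;
        }
    }
    return std::nullopt;
}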

src/v/kafka/server/handlers/handler.h (outdated comment thread, resolved)
size_estimate += sizeof(kafka::metadata_response_topic);
size_estimate += tp_ns.tp().size();

using partition = kafka::metadata_response_partition;
Contributor

Super nice! Just a thought here. Metadata response v9 supports returning the list of brokers too. In large installations with thousands of brokers we might want to reserve extra memory for that as well. Opinions? In either case we only support this API at v7 anyway, so this would for sure be future work.

Member Author

Yeah I think even earlier versions supported that? I know returning the brokers is a key function of this API.

I was including this in the 10,000-byte slush fund, assuming something like 100 bytes per broker * 100 brokers max, but sure, I can add brokers into this calculation too.

Member Author

(it looks like brokers are unconditionally returned in all versions of metadata response)

Contributor

Ah yes, not sure why I said v9, I guess I was looking at something wrong. Here's the source:

    { "name": "Brokers", "type": "[]MetadataResponseBroker", "versions": "0+",
      "about": "Each broker in the response.", "fields": [

Member Author

Makes sense. In any case, I added an estimate of 200 bytes per broker, plus the size of the broker struct itself. The 200 bytes is for the hostname and rack, if any.
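
Roughly, the estimator described here adds per-topic, per-partition, and per-broker contributions on top of a fixed slush amount; a simplified, self-contained sketch (topic_info and the exact constants are illustrative, not the real types from the diff):

#include <cstddef>
#include <string>
#include <vector>

struct topic_info {
    std::string name;
    size_t partition_count;
};

size_t estimate_metadata_response_size(
  const std::vector<topic_info>& topics, size_t broker_count) {
    constexpr size_t per_partition_bytes = 200; // per-partition overhead guess
    constexpr size_t per_broker_bytes = 200;    // hostname + rack guess
    size_t size_estimate = 10000;               // slush for the fixed parts

    for (const auto& t : topics) {
        size_estimate += t.name.size();                      // topic name bytes
        size_estimate += per_partition_bytes * t.partition_count;
    }
    size_estimate += per_broker_bytes * broker_count;
    return size_estimate;
}

With tens of thousands of partitions this lands in the multi-MB range mentioned in the PR description, rather than the old ~8 KB guess.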

@travisdowns
Member Author

This mostly looks good, if we are concerned about the virtual dispatch overhead we may replace any_handler with a concept and then generate the dispatch switch()...case with template parameter pack expansion

@mmaslankaprv I forgot to add why I introduced the generic any_handler at all: because the existing way of using template methods generated a lot of code bloat: requests.cc dropped from 12 MB to 5 MB in size after this change. Just the one vlog call which logged the request was responsible for several MB of bloat, since the same call gets instantiated for all 30+ handlers even though it does the same thing.

Since I was adding even more handler-specific logic here, I didn't want to propagate this pattern, and I think "bad" virtual-method based polymorphism is natural here: after all, we already had a big switch to dispatch to the right template-based process function (since fundamentally the handler needs to be chosen dynamically) and we'd need it again in any place we want to do handler-specific logic. Now we can do the handler lookup once and pass the handler around, and we just have 1 virtual function call on each invocation, which has a similar cost to the switch on a call-vs-call basis (though the template approach can have fewer total calls since once you are inside the template you don't pay any more cost). The request handling path looks like it is already many 1000s of instructions at least, so I think the cost of a virtual call or two here is very small.

So in fact generating specialized per-handler code is exactly what I was trying to avoid here :).

@graphcareful
Contributor


I actually looked into Michal's comment here, realizing you did move a big conditional, and still wanted to look into the drawbacks of virtual method dispatch. As a preface, I do think the benefits outweigh the possible drawbacks, so I'm in favor of what you have here.

I was reading that since the compiler cannot know what address in the vtable is going to be called, it cannot make optimizations around the indirect call it must make. Do you know more about this? Just curious.

@travisdowns
Member Author

I was reading that since the compiler cannot know what address in the vtable is going to be called, it cannot make optimizations around the indirect call it must make. Do you know more about this? Just curious.

Thanks for your comment @graphcareful .

Right, the primary optimization it can't make is that it can't know the target of the call, so it can't inline the call or do any other IPA it might otherwise do. The key question is whether this reflects "fundamental" dynamism or not. That is, could the compiler conceivably know the target? At a high level here the answer is "no" - the API key on which we base our behavior comes over the wire, so the compiler is never going to know statically the path the request processing will take.

In the code prior to this change, this was reflected in the big switch where we move from "dynamic" type-erased handling of the request to template-based distinct per-handler instantiation. This switch is replaced by the handler_for_key method and is probably slightly worse than that method (in particular because it has to generate a lot of code for all the copies of the do_process call). The compiler can't optimize away this switch either.

So just replacing this switch with a virtual function is more or less a wash for this 1 call to handle. One big difference shows up, though, when you want to make N calls to the handler for things that vary per handler (in this case N is 2: the existing handle call and the new memory_estimate call). In the virtual case it is two virtual calls now, plus the handler lookup. In the switch case it could be two switch lookups + dispatch, or you could move the template<typename Request> stuff up one level higher to the (grand, etc.) parent function which calls all N functions, so you only have one switch at the top level and the compiler doesn't need to do any dynamic dispatch for the N handler calls since the type is known. So it is an advantage in the micro-sense if you don't consider the other impacts.

To do this here even for N=2 you'd have to make a ton of code be under the template and the code size would just explode: many MB of code probably, to duplicate the request flow for every handler type. To evaluate whether it's worth it, I think you'd want to know some order-of-magnitude costs: a virtual call is just an indirect function call, which is not much more expensive than a regular function call and is on the order of 1 ns. Full requests are probably 10 us at least (?), so 4 orders of magnitude more than a virtual call. Even a single icache miss due to code bloat probably costs 100x more. The key is that the bulk of the request is still compiled with full visibility inside the handle method specific to the request type: there aren't fine-grained virtual calls inside the handling of a single request, just 2 at the "front end", before we start the handler. The meat is unaffected.

Sorry, these are kind of hand-wavy napkin estimates: I'd like to soon be in a place where we can actually measure this stuff, but until then I think you can say with high confidence that this won't make any measurable difference, and if anything I'd put my money on "slightly faster".

It also has compile-time benefits: the requests/connection_context code doesn't need to know about each handler type any more, just the generic any_handler, so it can stop including the header for every handler, etc. (this is made possible by your request_types type_list, very handy).
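
In miniature, the two shapes being weighed look something like this (all names are hypothetical; the real flow is vastly larger, which is exactly the code-size point):

#include <cstddef>
#include <cstdio>

struct request { size_t size = 0; };

inline void log_request(const request&) { std::puts("request"); }
inline void reserve_memory(size_t) { /* take units from the memory semaphore */ }

// (a) Template the whole flow: one switch/lookup at the top, full static
// knowledge inside, but the entire path is instantiated once per handler type.
template<typename Handler>
void process_request_templated(request& r) {
    log_request(r);                              // instantiated ~45 times
    reserve_memory(Handler::memory_estimate(r)); // statically dispatched
    Handler::handle(r);
}

// (b) Virtual front end: the surrounding flow is compiled once; a couple of
// indirect calls happen per request, and each handler's handle() body is still
// compiled with full visibility into its own request type.
struct handler_iface {
    virtual size_t memory_estimate(const request&) const = 0;
    virtual void handle(request&) const = 0;
    virtual ~handler_iface() = default;
};

void process_request_virtual(const handler_iface* h, request& r) {
    log_request(r);                        // compiled once
    reserve_memory(h->memory_estimate(r)); // indirect call
    h->handle(r);
}

Shape (a) is what inflated requests.cc to the object sizes quoted above; shape (b) pays a couple of indirect calls per request instead.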

@graphcareful
Contributor

It also has compile-time benefits: the requests/connection_context code doesn't need to know about each handler type any more, just the generic any_handler, so it can stop including the header for every handler, etc. (this is made possible by your request_types type_list, very handy).

Awesome comment, thanks a bunch for the insights

static constexpr api_version min_supported = api_version(0);
static constexpr api_version max_supported = api_version(7);
static process_result_stages handle(request_context, ss::smp_service_group);
static constexpr auto despam_interval = std::chrono::minutes(5);
Member Author

despam_interval has moved inside the implementation; there was no reason, I think, to have it here in the public interface of the handler.

@@ -21,6 +21,7 @@

namespace kafka {

template<>
Member Author

These two-stage handlers now use the helper template, so these methods are template specializations, just as they are in the single-stage case.

@travisdowns travisdowns changed the title Per-handler memory estimation, more conservative estimate for metadata handler Per-handler memory estimation, more accurate estimate for metadata handler Jul 14, 2022
The behavior of process_next_response is worth clarifying as the
returned future does not necessarily wait for all enqueued responses
to be finished before resolving.

We also rename the method to better reflect its purpose.
Currently we release resources after the response is enqueued
in connection_context and response processing is called, but it
may not have been sent at this point as we require in-order responses
but second-stage processing may happen out of order.

In this change, we instead tunnel the resource object through to
the place where the response is written, and release it there.

Fixes redpanda-data#5278.
This lets us share it with the request processing code which
would also like to do type list based manipulation of the
request types.
We already had concepts for one-phase and two-phase handlers,
and this concept is simply the union of those two handler
concepts, i.e., "any" type of handler.
We wish to reclaim the name handler for the generic handler interface
introduced in the next change.
This is a polymorphic handler class (abstract class with virtual
methods), as well as concrete instantiations of the interface for each
of the existing handlers, which can be looked up by API key.

This lets us treat handlers generically without resorting to template
functions which must be specialized for each handler type.

This reduces code bloat significantly as we do not duplicate code paths
for our ~45 handler types.

For example, requests.cc.o drops from ~11 MB to ~5 MB after it is
switched to the any_handler approach.
The handler_interface already gets good functional coverage as it is
added to the core request path in requests.cc, but we also include
this unit test with basic coverage.
Preceding changes in this series introduced a runtime polymorphic
handler class, and this change switches most of the request handling
to use it.

In particular, we replace the large switch on API key which dispatches
to a template method to a lookup of the handler method and virtual
dispatch.

Some handlers that need special processing like the authentication/SASL
related ones still use the old approach for now.
Currently we use a single memory estimate for all kafka request types,
but different API calls may use wildly different amounts of memory.

This change allows each handler to perform an API-specific calculation
instead.
In connection_context, we now use the handler-specific initial memory
use estimate, rather than a single estimate for every handler type.
The session_resources type was a private member of
connection_context, but as we want to use it more
broadly, move it out as a standalone public class.

Additionally, pass it by shared_pointer in preparation
for later changes which will feed it into requests.
Single-stage handlers have a handler template which means that handler
objects can be declared in a single line specifying their api object,
min and max API versions.

This change extends this nice concept to two-stage handlers as well.
Passing the connection context to the estimator allows the
estimator to use the various subsystems to estimate the memory
use of a given request.
Currently we estimate that metadata requests take 8000 + rsize * 2 bytes
of memory to process, where rsize is the size of the request. Since
metadata requests are very small, this ends up being roughly 8000 bytes.

However, metadata requests which return information about every
partition and replica may easily be several MBs in size.

To fix this for metadata requests specifically, we use a new more
conservative estimate which uses the current topic and partition
configuration to give an upper bound on the size.
Prior to this change, we used only the topic and partition data to estimate
the size of the metadata response. Now, we also include the approximate
size of the broker metadata portion of the response, which may
be important if the list of brokers is very large or they have very long
hostnames.
We already check the bounds so the cpp core guideline presumably
does not apply and we don't want to pay the price for the additional
bounds check inside at().
graphcareful previously approved these changes Jul 14, 2022
Contributor

@graphcareful graphcareful left a comment

LGTM sweet

@travisdowns
Member Author

@travisdowns travisdowns merged commit 0b8e0c1 into redpanda-data:dev Jul 15, 2022

Successfully merging this pull request may close these issues.

Request resources may be released at the wrong time
OOM due to underestimated memory use in metadata handler