
High CPU when consuming from 1 partition out of many in topic #4410

Closed
travisdowns opened this issue Apr 25, 2022 · 4 comments · Fixed by #4443
Assignees
Labels
area/redpanda kind/bug Something isn't working

Comments

@travisdowns
Member

Version & Environment

Redpanda version: dev fb0529d

What went wrong?

When consuming from 1 (or a few) partitions in a topic with 20k partitions, high CPU use is observed in simple_fetch_planner::create_plan.

What should have happened instead?

Less CPU use; at a minimum, CPU use should not be strongly correlated with the total number of partitions in the topic (it is reasonable to expect it to vary with the number of partitions actually mentioned in the fetch, however).

How to reproduce the issue?

  1. Create a 20k partition topic.
  2. Use a test app to consume slowly from this empty topic.
  3. Profile redpanda with perf record.

Additional information

The CPU time is largely being spent in this check that the partition/topic exists at all. It is slow because contains() must iterate over the partitions in the topic (20k, here) in order to check whether the specified partition exists. For a partition at the end of the list this means examining the entire list (note that this also means unbalanced behavior: nodes handling partitions near the start of the list will incur significantly less CPU use here than those handling partitions later in the list).

As a quick fix, we could move this check down inside the shard_for check immediately above, since if the partition does not exist it also will not appear in the shard mapping (modulo races, which this code already doesn't avoid).
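For illustration, a minimal self-contained sketch of that reordering (placeholder types, not Redpanda's actual metadata_cache or shard table): the cheap hash-based shard lookup runs first, and the O(n) scan only runs on the failure path to pick the right error code.

```cpp
// Sketch only: stand-in types that mimic the shapes involved, not real Redpanda code.
#include <algorithm>
#include <optional>
#include <string>
#include <unordered_map>
#include <vector>

struct ntp { std::string topic; int partition; };

// Stand-in for metadata_cache: a linear scan over all partitions of a topic.
struct metadata_cache {
    std::unordered_map<std::string, std::vector<int>> topics;
    bool contains(const ntp& n) const {
        auto it = topics.find(n.topic);
        if (it == topics.end()) return false;
        const auto& parts = it->second; // 20k entries in the reported case
        return std::find(parts.begin(), parts.end(), n.partition) != parts.end();
    }
};

// Stand-in for the shard table: a hash lookup keyed by topic/partition.
struct shard_table {
    std::unordered_map<std::string, unsigned> shards;
    std::optional<unsigned> shard_for(const ntp& n) const {
        auto it = shards.find(n.topic + "/" + std::to_string(n.partition));
        if (it == shards.end()) return std::nullopt;
        return it->second;
    }
};

enum class errc { none, not_leader, unknown_topic_or_partition };

// Reordered check: the O(n) contains() only runs when the shard lookup fails,
// purely to distinguish "unknown partition" from "not leader here".
errc plan_partition(const metadata_cache& md, const shard_table& st, const ntp& n) {
    auto shard = st.shard_for(n);
    if (!shard) {
        return md.contains(n) ? errc::not_leader
                              : errc::unknown_topic_or_partition;
    }
    return errc::none; // proceed with planning the fetch on *shard
}
```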

@travisdowns travisdowns added the kind/bug Something isn't working label Apr 25, 2022
@travisdowns travisdowns self-assigned this Apr 25, 2022
@emaxerrno
Contributor

emaxerrno commented Apr 25, 2022

I like your idea, but random thoughts:

@travisdowns i bet this is commonplace

The CPU time is largely being spent in this check that the partition/topic exists at all. It is slow because contains()

I wonder if we could have a core-local cache of ntp -> uint64 (xxhash64) that checks against a (compressed/roaring) bitmap or something, so querying even larger topics stays fast.

maybe the right model, instead of elevating every query to the ntp, is to say: for this ns && this topic, evaluate all of these partitions

.... i think your fix makes sense btw, but i also think we may have this as a structural problem in multiple places. cc: @mmaslankaprv
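As a rough illustration of the core-local cache idea floated above (a hypothetical type; std::hash stands in for xxhash64 and a plain hash set stands in for a compressed/roaring bitmap):

```cpp
// Hypothetical sketch: a per-core existence cache keyed by a 64-bit hash of the
// ntp (namespace/topic/partition). With a 64-bit key, hash collisions are rare
// but possible, so a real implementation would need to account for them.
#include <cstdint>
#include <functional>
#include <string>
#include <unordered_set>

class ntp_existence_cache {
public:
    static uint64_t key(const std::string& ns, const std::string& topic, int32_t p) {
        // xxhash64 over the serialized ntp in a real version; std::hash here.
        return std::hash<std::string>{}(ns + "/" + topic + "/" + std::to_string(p));
    }
    void add(const std::string& ns, const std::string& topic, int32_t p) {
        _known.insert(key(ns, topic, p));
    }
    void remove(const std::string& ns, const std::string& topic, int32_t p) {
        _known.erase(key(ns, topic, p));
    }
    // O(1) expected, independent of how many partitions the topic has.
    bool contains(const std::string& ns, const std::string& topic, int32_t p) const {
        return _known.count(key(ns, topic, p)) > 0;
    }

private:
    std::unordered_set<uint64_t> _known; // per-core, updated on metadata changes
};
```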

travisdowns added a commit to travisdowns/redpanda that referenced this issue Apr 25, 2022
metadata_cache.contains() is a slow call with many partitions because
it scans the entire partition list looking for the specified one.

On the fetch planning path we can avoid this cost simply by
moving the check inside a shard_for check we do immediately
after the metadata check: in a stable system the shard_for check will
fail any time the metadata call will fail.

Issue redpanda-data#4410
@travisdowns
Member Author

travisdowns commented Apr 26, 2022

@senior7515 - you read our minds, haha. I discussed this earlier with @mmaslankaprv and we agree it is a structural problem (also applying to produce and other paths), and a better fix is to have an index on these lists so that when we need to query them we don't need to do an exhaustive iteration. Initially I think it can be something very simple like a hash map, and we can optimize from there to reach higher partition counts.
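For illustration, a minimal sketch of what such an index could look like, assuming a per-topic partition vector like today's (placeholder types, not the real metadata_cache internals):

```cpp
// Placeholder types: the point is keeping a partition-id -> position index
// alongside the existing vector so lookups stop being O(#partitions).
#include <cstddef>
#include <cstdint>
#include <unordered_map>
#include <vector>

struct partition_metadata { int32_t id; /* leader, replicas, ... */ };

class topic_metadata {
public:
    void add_partition(partition_metadata pm) {
        _index.emplace(pm.id, _partitions.size());
        _partitions.push_back(std::move(pm));
    }
    // O(1) expected instead of scanning the whole vector.
    const partition_metadata* find(int32_t pid) const {
        auto it = _index.find(pid);
        return it == _index.end() ? nullptr : &_partitions[it->second];
    }
    bool contains(int32_t pid) const { return _index.count(pid) > 0; }

private:
    std::vector<partition_metadata> _partitions;  // existing list
    std::unordered_map<int32_t, size_t> _index;   // new index into it
};
```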

The suggestion I had at the end was just a very quick patch to unblock certain use cases where 20k partitions are being used: if we go with that PR it won't close this issue.

BTW, the time spent in contains is > 20x higher than anywhere else:

[perf profile screenshot: time spent in contains()]

travisdowns added a commit to travisdowns/redpanda that referenced this issue Apr 26, 2022
@emaxerrno
Contributor

@senior7515 - you read our minds, haha. I discussed this earlier with @mmaslankaprv and we agree it is a structural

nice :)

problem (also applying to produce and other paths), and a better fix is to have an index on these lists so that when we need to query them we don't need to do an exhaustive iteration. Initially I think it can be something very simple like a hash map, and we can optimize from there to reach higher partition counts.

yeah, figured some simple index makes sense, like a

node_hash<uint64_t, std::vector<uint32_t>> sorta thing, for a query of the partitions of this namespace::topic that belong to this core.

Or what structure were you all thinking?
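A rough sketch of that structure, assuming node_hash refers to something like absl::node_hash_map and using std::hash as a stand-in for the real 64-bit topic hash:

```cpp
// Hypothetical core-local index: hashed namespace::topic -> partition ids
// hosted on this core, so a fetch for one topic never walks the global list.
#include <cstdint>
#include <functional>
#include <string>
#include <vector>

#include <absl/container/node_hash_map.h>

class core_topic_index {
public:
    static uint64_t topic_key(const std::string& ns, const std::string& topic) {
        return std::hash<std::string>{}(ns + "::" + topic);
    }
    void add(const std::string& ns, const std::string& topic, uint32_t partition) {
        _by_topic[topic_key(ns, topic)].push_back(partition);
    }
    // All partitions of ns::topic that live on this core, or nullptr if none.
    const std::vector<uint32_t>* partitions_on_core(const std::string& ns,
                                                    const std::string& topic) const {
        auto it = _by_topic.find(topic_key(ns, topic));
        return it == _by_topic.end() ? nullptr : &it->second;
    }

private:
    absl::node_hash_map<uint64_t, std::vector<uint32_t>> _by_topic;
};
```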

The suggestion I had at the end was just a very quick patch to unblock certain use cases where 20k partitions are being used: if we go with that PR it won't close this issue.

BTW, the time spent in contains is > 20x higher than anywhere else:

🤯 - awesome find. interesting to see the next bottleneck on this.

@travisdowns
Member Author

travisdowns commented Apr 26, 2022

@senior7515 well I had my very weak "hide it behind another check" fix ready to go, but @mmaslankaprv has a draft fix which is much better and fixes this same problem all over the place, along the lines of what you suggested:

#4443

interesting to see the next bottleneck on this.

Interestingly, we find that librdkafka actually has some scalability bottlenecks with many partitions in a similar vein: even when a consumer consumes from only a single partition, there are some operations that iterate over all partitions in the topic, which show up as a big hot spot with 20k partitions.
