
Library does not scale with multiple cores #531

Open
bIgBV opened this issue Apr 21, 2021 · 11 comments

@bIgBV

bIgBV commented Apr 21, 2021

As demonstrated by the benchmarks in this reddit post, the rust_tonic_mt benchmark falls behind in performance as the number of threads is increased.

The likely cause is that a big portion of the shared state sits behind this Mutex.
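To make the suspected pattern concrete, here is a minimal sketch (illustrative types only, not h2's actual `Store`/`Streams` structures) of how a single process-wide lock around the stream store serializes threads even when they touch unrelated streams:

```rust
use std::collections::HashMap;
use std::sync::{Arc, Mutex};

// Stand-in for per-stream state such as flow-control windows.
struct StreamState {
    send_window: i32,
}

#[derive(Clone)]
struct SharedStreams {
    // One big lock around the whole store: the suspected bottleneck.
    inner: Arc<Mutex<HashMap<u32, StreamState>>>,
}

impl SharedStreams {
    fn reserve_capacity(&self, stream_id: u32, amount: i32) {
        // Touching one stream still locks *all* streams, so threads
        // operating on different streams contend on the same mutex.
        let mut store = self.inner.lock().unwrap();
        if let Some(stream) = store.get_mut(&stream_id) {
            stream.send_window += amount;
        }
    }
}
```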

@bIgBV
Author

bIgBV commented Apr 21, 2021

A couple of things to try:

  • Replace the std::sync::Mutex with one from parking_lot (see the sketch after this list)
  • Update hyper to keep all streams from a connection on a single thread.
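For the first bullet, the swap is mostly mechanical; a hedged sketch, using an illustrative `Store` type rather than h2's real one:

```rust
use parking_lot::Mutex;
use std::collections::HashMap;
use std::sync::Arc;

// Illustrative stand-in for the shared stream store.
type Store = HashMap<u32, i32>;

fn bump_send_window(store: &Arc<Mutex<Store>>, stream_id: u32, amount: i32) {
    // parking_lot's lock() returns the guard directly: no poisoning
    // and no unwrap(), with a smaller and generally faster lock.
    let mut guard = store.lock();
    *guard.entry(stream_id).or_insert(0) += amount;
}
```

Note that swapping the lock implementation only reduces per-acquisition overhead; it does not shorten how long the lock is held or reduce how often it is taken.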

@seanmonstar
Member

A bigger effort would be to replace the single massive lock with per-stream locks. There might still need to be a large lock, but the goal would be to avoid keeping it locked for long, or for most operations.

It's not exactly the same situation here, but grpc-go made a similar change a couple of years ago to reduce contention: grpc/grpc-go#1962
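A rough sketch of that direction (illustrative structures, not h2's internals): keep a short-lived outer lock only for inserting, removing, and looking up streams, and move the hot-path state behind a per-stream lock:

```rust
use std::collections::HashMap;
use std::sync::{Arc, Mutex};

struct StreamState {
    send_window: i32,
}

struct Streams {
    // The outer lock is held only briefly, to look up, insert, or remove streams.
    by_id: Mutex<HashMap<u32, Arc<Mutex<StreamState>>>>,
}

impl Streams {
    fn reserve_capacity(&self, stream_id: u32, amount: i32) {
        // Clone the Arc and release the outer lock before doing per-stream work.
        let stream = {
            let map = self.by_id.lock().unwrap();
            map.get(&stream_id).cloned()
        };
        if let Some(stream) = stream {
            // Only this stream's lock is held; other streams stay uncontended.
            stream.lock().unwrap().send_window += amount;
        }
    }
}
```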

@bIgBV
Author

bIgBV commented Apr 30, 2021

Some initial measurements from replacing the lock in `streams.rs` with one from parking_lot:

https://gist.github.com/bIgBV/4d6d76773a948734ebef1367ef5221d5

@w41ter

w41ter commented Sep 7, 2022

@bIgBV It seems the results for parking_lot and the original implementation are similar?

@notgull
Contributor

notgull commented Sep 7, 2022

The libstd Mutex was recently replaced with a new implementation that is both much smaller and significantly faster. There is much less to lose now with per-stream locking.

@jeffutter

Resurrecting this old issue, but I think I'm hitting this bottleneck fairly acutely. I'm experimenting with using tonic to build something like a load-balancing proxy between grpc streams. I have X clients connecting over Y connections each with Z streams. I then load balance the requests (mostly 1-1 request-response type requests) across I connections each with J streams to K downstream servers.

I was seeing fairly disappointing performance. If I have the external clients hit the backends directly, requests take ~200μs at a certain load level. With the proxy in play it's closer to 1ms. I started digging into this bottleneck and found this GitHub issue.

To isolate the problem further, I removed the server component and built a little client implementation (named pummel) that hammers the backend with requests across I connections each with J streams. With any appreciable amount of concurrency, the performance shows similar characteristics to the proxy when compared to our external clients (they happen to be written in Elixir).

In profiling pummel I see this lock using a significant amount of CPU time:

[profiler screenshot]

If I'm reading this correctly, over 11% of the CPU time is dedicated to this mutex.

Currently, this is all running in a single Tokio runtime. I can configure the number of grpc connections and streams used, so I may play with ideas like starting a separate Tokio runtime per core or having more connections with fewer streams in hopes of reducing contention on this lock.
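For reference, a hedged sketch of the "separate Tokio runtime per core" experiment mentioned above (the helper `run_client_for_core` is hypothetical): each OS thread drives its own current-thread runtime and owns its own connections, so no connection state is shared across cores:

```rust
use std::thread;

fn main() {
    let cores = thread::available_parallelism().map(|n| n.get()).unwrap_or(4);

    let handles: Vec<_> = (0..cores)
        .map(|core| {
            thread::spawn(move || {
                // One single-threaded runtime per OS thread.
                let rt = tokio::runtime::Builder::new_current_thread()
                    .enable_all()
                    .build()
                    .expect("failed to build runtime");
                rt.block_on(run_client_for_core(core));
            })
        })
        .collect();

    for handle in handles {
        handle.join().unwrap();
    }
}

async fn run_client_for_core(_core: usize) {
    // Hypothetical: open this core's own connections here and run the request
    // loop, so each connection's streams lock is never contended across cores.
}
```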

I don't really have any suggestions on how to improve this at the moment. Just wanted to share my findings. I'm glad to do any further testing if anyone has any ideas on how to improve this.

@seanmonstar
Member

@jeffutter thanks for the excellent write-up! A way forward would be to do what I suggested: make per-stream locks so we only need to lock the stream store infrequently, when adding or removing a stream.

@jeffutter

jeffutter commented Nov 6, 2023

@seanmonstar Yeah. I think that would help my specific use case greatly, since I create all of the streams up-front and re-use them for many requests, so the global lock wouldn't be taken mid-work. I might try to take a stab at making that change in my free time, although it'll probably take me a while to get up to speed on h2 internals.
In the meantime, if anyone gives that a try or has any other ideas, I'd be glad to test them out.

@jeffutter

@seanmonstar I’ve been reading through the h2 source code, that grpc-go issue, and the HTTP/2 spec. I’d like to take a stab at this. I’ll admit I’m new to h2 and HTTP/2 in any capacity beyond that of a user, so it’ll probably take me a bit to ramp up.

My understanding is that ultimately only one Frame can be written to the underlying IO at a time, so there needs to be a single buffer of Frames to send, or a set of buffers plus some mechanism to choose which one to take the next frame from. Currently all of the Frames get put in the SendBuffer on the Streams. It looks like each stream has its own pending_send deque for its own frames. So, architecturally, do you see those components remaining the same, with the idea here being to break up some of the state in the Store (and maybe some of the Actions) so that it can be tracked on the stream itself?
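For what it's worth, a loose sketch of the "set of per-stream buffers plus a picker" idea described above; these are illustrative types, not h2's SendBuffer/Store:

```rust
use std::collections::{HashMap, VecDeque};

// Stand-in for an HTTP/2 frame.
struct Frame;

struct StreamSendQueue {
    pending_send: VecDeque<Frame>,
}

struct Sender {
    streams: HashMap<u32, StreamSendQueue>,
    // Round-robin schedule so one busy stream can't starve the others.
    schedule: VecDeque<u32>,
}

impl Sender {
    /// Pick the next frame to hand to the single task that writes to the IO.
    fn next_frame(&mut self) -> Option<Frame> {
        while let Some(id) = self.schedule.pop_front() {
            if let Some(queue) = self.streams.get_mut(&id) {
                if let Some(frame) = queue.pending_send.pop_front() {
                    // Keep the stream scheduled if it still has frames queued.
                    if !queue.pending_send.is_empty() {
                        self.schedule.push_back(id);
                    }
                    return Some(frame);
                }
            }
        }
        None
    }
}
```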

Let me know if that’s making any sense 🙃 or if you have any other suggestions as to how you’d go about implementing this.

Also, if you have any general resources for understanding HTTP/2 streams and flow control beyond the spec I’d love to read up more there too.

Thanks again for any help here. Hopefully with a bit of guidance I can help find a solution.

Noah-Kennedy pushed a commit that referenced this issue Apr 9, 2024
This PR adds a simple benchmark to measure the performance impact of changes, e.g. a potential fix for this issue: #531

The benchmark is simple: have a client send `100_000` requests to a server and wait for a response. 

Output:
```
cargo bench
H2 running in current-thread runtime at 127.0.0.1:5928:
Overall: 353ms.
Fastest: 91ms
Slowest: 315ms
Avg    : 249ms
H2 running in multi-thread runtime at 127.0.0.1:5929:
Overall: 533ms.
Fastest: 88ms
Slowest: 511ms
Avg    : 456ms
```
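For context, a rough outline of what such a benchmark loop can look like (illustrative only, not the code added by the commit; `send_one_request` is a hypothetical placeholder for issuing one request over an established connection and awaiting the response):

```rust
use std::time::{Duration, Instant};

/// Spawn `clients` concurrent tasks, each sending `per_client` requests, and
/// report the overall wall time plus the fastest and slowest task.
async fn bench(clients: usize, per_client: usize) {
    let overall = Instant::now();

    let tasks: Vec<_> = (0..clients)
        .map(|_| {
            tokio::spawn(async move {
                let start = Instant::now();
                for _ in 0..per_client {
                    send_one_request().await;
                }
                start.elapsed()
            })
        })
        .collect();

    let mut durations = Vec::with_capacity(clients);
    for task in tasks {
        durations.push(task.await.unwrap());
    }

    let total: Duration = durations.iter().sum();
    println!("Overall: {:?}", overall.elapsed());
    println!("Fastest: {:?}", durations.iter().min().unwrap());
    println!("Slowest: {:?}", durations.iter().max().unwrap());
    println!("Avg    : {:?}", total / clients as u32);
}

async fn send_one_request() {
    // Hypothetical: acquire a stream, send the request, await the response.
}
```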
Noah-Kennedy self-assigned this Jun 30, 2024
@Noah-Kennedy
Contributor

@seanmonstar FYI I'm working on this now

@howardjohn
Contributor

I think the problem is probably pretty well understood from the information above, but if it helps, I collected some traces that I think showcase the problem well.

Context: we have a bunch of incoming connections which we forward over a shared h2 connection (one h2 stream per downstream connection). Each row is a thread, and shows what is currently executing.

Here you can see that while Connection::poll is running, writes on the streams are blocked trying to acquire the lock needed to call reserve_capacity:
[trace screenshot]

A similar picture shows all work in the system blocked: one thread is writing out on the stream, and the rest are blocked waiting for the mutex to free up:
[trace screenshot]
