Reduce concurrency bottlenecks in the table and block caches #172
Conversation
petermattis
commented
Jul 3, 2019
- Shard tableCache and cache.Cache
- Update tableCacheNode.refCount atomically.
Force-pushed from 0768b6d to c3c68fc
I've added a 3rd commit which implements the idea mentioned in https://github.com/petermattis/pebble/issues/11#issuecomment-508116863 for
Force-pushed from 9ac3500 to 7137a7d
The sharding of the table and block caches takes the same strategy: create num-cpu shards and select the shard based on the key. Each shard is configured with size/num-shards capacity. The table and block cache mutexes were showing up prominently on blocking profiles and were demonstrably limiting performance on concurrent scan workloads. On a 16 core machine, a scan-only workload previously showed:

```
  N     Before      After
  1    137,016    160,719 ( +17%)
 10    433,436    968,601 (+123%)
100    416,035    909,192 (+118%)
```

The first column (N) is the number of concurrent workers. Note that at 100 concurrent workers the performance was somewhat "jumpy", which is something that deserves further investigation.
Also update tableCacheShard.mu.releasing atomically, though the field is still present in tableCacheShard.mu because it is decremented while holding that lock. This removes the need to grab tableCacheShard.mu in the common case when closing an iterator.
Updating the tableCacheShard LRU lists requires mutually exclusive access to the list. Grabbing and releasing tableCacheShard.mu for each access shows up prominently on blocking profiles. There is a known technique for avoiding this overhead: record the access in a per-thread data structure and then batch apply the buffered accesses when the buffer is full. For a highly concurrent scan workload, this results in a significant improvement in both throughput and stability of the throughput numbers.

Here are numbers when running with 100 concurrent scanners before this change:

```
____optype__elapsed_____ops(total)___ops/sec(cum)
  scan_100   120.0s      109476226       912303.4
```

After:

```
____optype__elapsed_____ops(total)___ops/sec(cum)
  scan_100   120.0s      151736009      1264384.6
```

That's a 39% speedup. Even better, the instantaneous numbers during the run were stable. See #11
The Clock-PRO algorithm doesn't require exclusive access to internal state on hits. Switch to using a RWMutex, which reduces a source of contention in cached workloads. For a concurrent scan workload, performance before this commit was:

```
____optype__elapsed_____ops(total)___ops/sec(cum)
  scan_100   120.0s      151736009      1264384.6
```

After:

```
____optype__elapsed_____ops(total)___ops/sec(cum)
  scan_100   120.0s      170326751      1419354.7
```

That's a 12% improvement in throughput. See #11
Force-pushed from 502bcfb to 5689eb2
Reviewed 3 of 3 files at r1, 1 of 1 files at r2, 3 of 3 files at r3, 1 of 1 files at r4, 3 of 3 files at r5.
Reviewable status: complete! all files reviewed, all discussions resolved (waiting on @ajkr)
Reviewable status: all files reviewed, 1 unresolved discussion (waiting on @petermattis)
table_cache.go, line 207 at r5 (raw file):
```go
// exclusively.
hits := c.hitsPool.Get().(*tableCacheHits)
hits.flushLocked()
```
I like this way of batching updates to the LRU list until a miss occurs (or 64 hits). MyRocks really could've used something like this too.
Reviewable status: all files reviewed, 1 unresolved discussion (waiting on @petermattis)
table_cache.go, line 207 at r5 (raw file):
Previously, ajkr (Andrew Kryczka) wrote…
I like this way of batching updates to the LRU list until a miss occurs (or 64 hits). MyRocks really could've used something like this too.
Oh, this is the table cache. They needed it for the block cache as they use max_open_files==-1. Seems you are avoiding the problem in the block cache, too, though I'm not familiar with the clock-pro algorithm yet.
Reviewable status: all files reviewed, 1 unresolved discussion (waiting on @ajkr)
table_cache.go, line 207 at r5 (raw file):
Previously, ajkr (Andrew Kryczka) wrote…
Oh, this is the table cache. They needed it for the block cache as they use max_open_files==-1. Seems you are avoiding the problem in the block cache, too, though I'm not familiar with the clock-pro algorithm yet.
The same idea should work for the block cache, it just isn't needed there. The idea comes from https://drive.google.com/file/d/0B8oWjCpZGTn3YVhCc2dZSU5DNFBJRlBZa0s1STdaUWR5emxj/view.