Reduce concurrency bottlenecks in the table and block caches #172

Merged 5 commits into master from pmattis/shard-caches on Jul 9, 2019

Conversation

petermattis
Collaborator

  • Shard tableCache and cache.Cache
  • Update tableCacheNode.refCount atomically.

@petermattis
Collaborator Author

This change is Reviewable

@petermattis
Collaborator Author

I've added a 3rd commit which implements the idea mentioned in https://github.com/petermattis/pebble/issues/11#issuecomment-508116863 for tableCache. The results were even better than my hacky experiment suggested.

@petermattis petermattis force-pushed the pmattis/shard-caches branch 2 times, most recently from 9ac3500 to 7137a7d on July 3, 2019 at 22:13
The sharding of the table and block caches takes the same strategy:
create num-CPU shards and select the shard based on the key. Each shard
is configured with size/num-shards capacity.
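
As a rough illustration of that strategy, here is a minimal Go sketch that hashes each key to one of num-CPU shards, each guarding its own map with its own mutex and holding an equal slice of the total capacity. The names (shardedCache, cacheShard, shardFor) and the FNV hash are illustrative assumptions, not Pebble's actual types:

package cache

import (
	"hash/fnv"
	"runtime"
	"sync"
)

// cacheShard holds a slice of the total capacity behind its own mutex.
type cacheShard struct {
	mu       sync.Mutex
	capacity int64
	entries  map[string][]byte
}

// shardedCache spreads keys across runtime.GOMAXPROCS(0) shards so that
// concurrent lookups of different keys mostly contend on different mutexes.
type shardedCache struct {
	shards []*cacheShard
}

func newShardedCache(size int64) *shardedCache {
	n := runtime.GOMAXPROCS(0)
	c := &shardedCache{shards: make([]*cacheShard, n)}
	for i := range c.shards {
		c.shards[i] = &cacheShard{
			capacity: size / int64(n), // size/num-shards capacity per shard
			entries:  make(map[string][]byte),
		}
	}
	return c
}

// shardFor hashes the key to pick a shard.
func (c *shardedCache) shardFor(key string) *cacheShard {
	h := fnv.New32a()
	h.Write([]byte(key))
	return c.shards[h.Sum32()%uint32(len(c.shards))]
}

func (c *shardedCache) Get(key string) ([]byte, bool) {
	s := c.shardFor(key)
	s.mu.Lock()
	defer s.mu.Unlock()
	v, ok := s.entries[key]
	return v, ok
}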

The table and block cache mutexes were showing up prominently on blocking
profiles and were demonstrably limiting performance on concurrent scan
workloads. On a 16-core machine, a scan-only workload showed:

      N     Before      After
      1    137,016    160,719 ( +17%)
     10    433,436    968,601 (+123%)
    100    416,035    909,192 (+118%)

The first column (N) is the number of concurrent workers. Note that at
100 concurrent workers the performance was somewhat "jumpy", which
deserves further investigation.
Also update tableCacheShard.mu.releasing atomically, though the field is
still present in tableCacheShard.mu because it is decremented while
holding that lock. This removes the need to grab tableCacheShard.mu in
the common case when closing an iterator.
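
A minimal sketch of the atomic reference counting the two messages above describe: closing an iterator issues a single atomic decrement, and only the final release touches the shard mutex. The method names and the exact release bookkeeping are assumptions for illustration, though the field layout (tableCacheNode.refCount, tableCacheShard.mu.releasing) follows the commit messages:

package cache

import (
	"sync"
	"sync/atomic"
)

type tableCacheNode struct {
	refCount int32 // updated atomically; no shard mutex needed on the common path
}

type tableCacheShard struct {
	mu struct {
		sync.Mutex
		releasing int32 // bumped atomically, decremented while holding the mutex
	}
}

// unrefNode is the common path when closing an iterator: a single atomic
// decrement. Only the final release has to coordinate with the shard.
func (s *tableCacheShard) unrefNode(n *tableCacheNode) {
	if atomic.AddInt32(&n.refCount, -1) == 0 {
		atomic.AddInt32(&s.mu.releasing, 1)
		s.releaseNode(n)
	}
}

func (s *tableCacheShard) releaseNode(n *tableCacheNode) {
	s.mu.Lock()
	defer s.mu.Unlock()
	// The counter is decremented under s.mu, as the commit message notes;
	// the atomic op keeps it consistent with the lock-free increments.
	atomic.AddInt32(&s.mu.releasing, -1)
	// ... closing the node's underlying sstable reader would happen here.
}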
Updating the tableCacheShard LRU lists requires mutually exclusive
access to the list. Grabbing and releasing tableCacheShard.mu for each
access shows up prominently on blocking profiles. There is a known
technique for avoiding this overhead: record the access in a per-thread
data structure and then batch apply the buffered accesses when the
buffer is full. For a highly concurrent scan workload, this results in a
significant improvement in both throughput and stability of the
throughput numbers. Here are numbers when running with 100 concurrent
scanners before this change:

____optype__elapsed_____ops(total)___ops/sec(cum)
  scan_100   120.0s      109476226       912303.4

After:

____optype__elapsed_____ops(total)___ops/sec(cum)
  scan_100   120.0s      151736009      1264384.6

That's a 39% speedup. Even better, the instantaneous numbers during the
run were stable.

See #11
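
Below is a rough sketch of the hit-batching technique described above, assuming buffers of 64 hits drawn from a sync.Pool. The names tableCacheHits, hitsPool, and flushLocked mirror the snippet quoted later in the review, but the surrounding structure is illustrative; the real code also flushes the buffer when a miss forces the mutex to be taken anyway:

package cache

import (
	"container/list"
	"sync"
)

const hitBufferSize = 64

// tableCacheHits buffers LRU accesses so the shard mutex is taken once per
// batch rather than once per hit.
type tableCacheHits struct {
	hits  []*list.Element
	shard *tableCacheShard
}

type tableCacheShard struct {
	mu       sync.Mutex
	lru      *list.List
	hitsPool *sync.Pool
}

func newTableCacheShard() *tableCacheShard {
	s := &tableCacheShard{lru: list.New()}
	s.hitsPool = &sync.Pool{
		New: func() interface{} {
			return &tableCacheHits{
				hits:  make([]*list.Element, 0, hitBufferSize),
				shard: s,
			}
		},
	}
	return s
}

// recordHit buffers an LRU access without touching the shard mutex; the
// buffer is applied only once it fills up.
func (s *tableCacheShard) recordHit(e *list.Element) {
	h := s.hitsPool.Get().(*tableCacheHits)
	h.hits = append(h.hits, e)
	if len(h.hits) == cap(h.hits) {
		s.mu.Lock()
		h.flushLocked()
		s.mu.Unlock()
	}
	s.hitsPool.Put(h)
}

// flushLocked applies the buffered accesses to the LRU list. The caller must
// hold s.mu.
func (h *tableCacheHits) flushLocked() {
	for _, e := range h.hits {
		h.shard.lru.MoveToFront(e)
	}
	h.hits = h.hits[:0]
}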
The Clock-PRO algorithm doesn't require exclusive access to internal
state on hits. Switch to using an RWMutex, which reduces a source of
contention in cached workloads.

For a concurrent scan workload, performance before this commit was:

____optype__elapsed_____ops(total)___ops/sec(cum)
  scan_100   120.0s      151736009      1264384.6

After:

____optype__elapsed_____ops(total)___ops/sec(cum)
  scan_100   120.0s      170326751      1419354.7

That's a 12% improvement in throughput.

See #11
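
A minimal sketch of the read/write-lock split described above: because Clock-PRO only needs to set a per-entry reference bit on a hit, lookups can take the read lock and mark the bit atomically, while inserts (and evictions, which mutate the clock state) take the write lock. The clockCache type here is an illustrative stand-in, not Pebble's actual cache.Cache:

package cache

import (
	"sync"
	"sync/atomic"
)

type entry struct {
	val        []byte
	referenced int32 // Clock-PRO reference bit, set atomically on hit
}

type clockCache struct {
	mu      sync.RWMutex
	entries map[string]*entry
}

func newClockCache() *clockCache {
	return &clockCache{entries: make(map[string]*entry)}
}

// Get takes only the read lock on the hit path, so concurrent readers do not
// contend with one another.
func (c *clockCache) Get(key string) ([]byte, bool) {
	c.mu.RLock()
	e, ok := c.entries[key]
	c.mu.RUnlock()
	if !ok {
		return nil, false
	}
	atomic.StoreInt32(&e.referenced, 1)
	return e.val, true
}

// Set takes the write lock because it mutates the map and the clock state.
func (c *clockCache) Set(key string, val []byte) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.entries[key] = &entry{val: val}
	// ... eviction via the clock hand would also happen here.
}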
Contributor

@Ryanfsdf Ryanfsdf left a comment


:lgtm:

Reviewed 3 of 3 files at r1, 1 of 1 files at r2, 3 of 3 files at r3, 1 of 1 files at r4, 3 of 3 files at r5.
Reviewable status: :shipit: complete! all files reviewed, all discussions resolved (waiting on @ajkr)

Contributor

@ajkr ajkr left a comment


Reviewable status: all files reviewed, 1 unresolved discussion (waiting on @petermattis)


table_cache.go, line 207 at r5 (raw file):

		// exclusively.
		hits := c.hitsPool.Get().(*tableCacheHits)
		hits.flushLocked()

I like this way of batching updates to the LRU list until a miss occurs (or 64 hits). MyRocks really could've used something like this too.

Contributor

@ajkr ajkr left a comment


Reviewable status: all files reviewed, 1 unresolved discussion (waiting on @petermattis)


table_cache.go, line 207 at r5 (raw file):

Previously, ajkr (Andrew Kryczka) wrote…

I like this way of batching updates to the LRU list until a miss occurs (or 64 hits). MyRocks really could've used something like this too.

Oh this is table cache. They needed it for block cache as they use max_open_files==-1. Seems you are avoiding the problem in block cache, too, though I'm not familiar with the clock-pro algorithm yet.

Collaborator Author

@petermattis petermattis left a comment


Reviewable status: all files reviewed, 1 unresolved discussion (waiting on @ajkr)


table_cache.go, line 207 at r5 (raw file):

Previously, ajkr (Andrew Kryczka) wrote…

Oh this is table cache. They needed it for block cache as they use max_open_files==-1. Seems you are avoiding the problem in block cache, too, though I'm not familiar with the clock-pro algorithm yet.

The same idea should work for the block cache, it just isn't needed there. The idea comes from https://drive.google.com/file/d/0B8oWjCpZGTn3YVhCc2dZSU5DNFBJRlBZa0s1STdaUWR5emxj/view.

@petermattis petermattis merged commit a3902b0 into master Jul 9, 2019
@petermattis petermattis deleted the pmattis/shard-caches branch July 9, 2019 12:50