perf: faster table/block cache #11

Closed
petermattis opened this issue Aug 3, 2018 · 3 comments · Fixed by #523

@petermattis
Collaborator

The existing block cache implementation uses the Clock-PRO replacement policy. The block cache is important for performance: the raw speed of accessing it is a significant component of read performance, and its concurrency directly affects concurrent read performance. There are three areas worth exploring here:

  1. Should the block cache be placed outside of the visibility of the Go GC? This could be accomplished by implementing our own hash table on top of arena-allocated memory.
  2. Improve the concurrency of the block cache, either via sharding or lock-free approaches (a rough sharding sketch follows this list).
  3. Use a better cache replacement policy such as TinyLFU. See also Caffeine, a high-performance Java caching library.
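
For illustration only, the sharding approach from item 2 could look roughly like the following. All names here (ShardedCache, shard, key, numShards) are hypothetical, this is not Pebble's cache API, and a real cache would also need size accounting and eviction:

package cache

import "sync"

const numShards = 16

type key struct {
	fileNum uint64
	offset  uint64
}

type shard struct {
	mu     sync.RWMutex
	blocks map[key][]byte
}

// ShardedCache spreads blocks across independently locked shards, so readers
// and writers only contend when they hash to the same shard.
type ShardedCache struct {
	shards [numShards]shard
}

func NewShardedCache() *ShardedCache {
	c := &ShardedCache{}
	for i := range c.shards {
		c.shards[i].blocks = make(map[key][]byte)
	}
	return c
}

func (c *ShardedCache) shardFor(k key) *shard {
	// Mix the file number and offset so blocks spread across shards.
	h := k.fileNum*0x9e3779b97f4a7c15 ^ k.offset
	return &c.shards[h%numShards]
}

func (c *ShardedCache) Get(k key) ([]byte, bool) {
	s := c.shardFor(k)
	s.mu.RLock()
	defer s.mu.RUnlock()
	b, ok := s.blocks[k]
	return b, ok
}

func (c *ShardedCache) Set(k key, b []byte) {
	s := c.shardFor(k)
	s.mu.Lock()
	defer s.mu.Unlock()
	s.blocks[k] = b
}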
@petermattis petermattis changed the title Faster block cache perf: faster block cache Aug 11, 2018
@olekukonko

I believe the dgraph team wants to implement a Caffeine port: https://blog.dgraph.io/post/caching-in-go/

@petermattis petermattis changed the title perf: faster block cache perf: faster table/block cache Jul 3, 2019
@petermattis petermattis self-assigned this Jul 3, 2019
@petermattis
Collaborator Author

Pebble, like RocksDB, contains two primary caches: the block cache and the table cache. The table cache (implemented by tableCache) caches sstable.Readers. Performance experiments currently show lock contention in the tableCache under concurrent read workloads. The tableCache internally maintains an LRU list, and manipulating that list is the part that requires mutual exclusion. The ristretto project is looking to use the same technique as Caffeine and BP-Wrapper: record page accesses in a lock-free (or thread-local) structure and then apply them in batch. Some hacky experimentation in this direction showed a 25% speedup on a concurrent scan workload.

For tableCache.findNode, we'd add a fast-path check for the case where the table is already in the cache, using an RWMutex instead of a Mutex. Something like:

func (c *tableCache) findNode(meta *fileMetadata) *tableCacheNode {
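	// Fast path: if the table is already in the cache, take a reference
	// under the read lock and return without blocking other readers.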
	c.mu.RLock()
	if n := c.mu.nodes[meta.fileNum]; n != nil {
		atomic.AddInt32(&n.refCount, 1)
		c.mu.RUnlock()
		// TODO(peter): record access of `n`.
		return n
	}
	c.mu.RUnlock()

	c.mu.Lock()
	defer c.mu.Unlock()
	// Handle miss: we need to check c.mu.nodes again, since another
	// goroutine may have inserted the node after we released the read lock...
}
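
To sketch the batched-access idea (the BP-Wrapper/Caffeine technique mentioned above), something like the following could sit alongside findNode. accessBuffer, flushThreshold, and moveToFront are hypothetical names, and a real implementation would keep one buffer per goroutine (or per P) so that appends to the buffer don't themselves contend:

type accessBuffer struct {
	pending []*tableCacheNode
}

const flushThreshold = 64

// recordAccess buffers the access locally; only when the buffer fills do we
// take the shard mutex and apply all of the buffered LRU updates in one batch.
func (c *tableCache) recordAccess(buf *accessBuffer, n *tableCacheNode) {
	buf.pending = append(buf.pending, n)
	if len(buf.pending) < flushThreshold {
		return
	}
	c.mu.Lock()
	for _, m := range buf.pending {
		c.moveToFront(m) // hypothetical: bump m to the front of the LRU list
	}
	c.mu.Unlock()
	buf.pending = buf.pending[:0]
}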

petermattis added a commit that referenced this issue Jul 3, 2019
Updating the tableCacheShard LRU lists requires mutually exclusive
access to the list. Grabbing and releasing tableCacheShard.mu for each
access shows up prominently on blocking profiles. There is a known
technique for avoiding this overhead: record the access in a per-thread
data structure and then batch apply the buffered accesses when the
buffer is full. For a highly concurrent scan workload, this results in a
significant improvement in both throughput and stability of the
throughput numbers. Here are numbers when running with 100 concurrent
scanners before this change:

____optype__elapsed_____ops(total)___ops/sec(cum)
  scan_100   120.0s      109476226       912303.4

After:

____optype__elapsed_____ops(total)___ops/sec(cum)
  scan_100   120.0s      151736009      1264384.6

That's a 39% speedup. Even better, the instantaneous numbers during the
run were stable.

See #11
petermattis added a commit that referenced this issue Jul 3, 2019
The Clock-PRO algorithm doesn't require exclusive access to internal
state on hits. Switch to using an RWMutex, which reduces a source of
contention in cached workloads.

For a concurrent scan workload, performance before this commit was:

____optype__elapsed_____ops(total)___ops/sec(cum)
  scan_100   120.0s      151736009      1264384.6

After:

____optype__elapsed_____ops(total)___ops/sec(cum)
  scan_100   120.0s      170326751      1419354.7

That's a 12% improvement in throughput.

See #11
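
(For context: Clock-PRO can record a hit by marking the entry as referenced rather than moving it in a shared list, which is why the hit path can run entirely under a read lock. A minimal sketch with hypothetical types; entry and clockShard are not Pebble's actual shard structure:)

package cache

import (
	"sync"
	"sync/atomic"
)

type entry struct {
	referenced int32 // set to 1 on access, cleared by the clock hand
	val        []byte
}

type clockShard struct {
	mu     sync.RWMutex
	blocks map[uint64]*entry
}

// get serves a hit under the read lock: recording the access is an atomic
// store on the entry rather than a mutation of shared list state.
func (s *clockShard) get(k uint64) ([]byte, bool) {
	s.mu.RLock()
	defer s.mu.RUnlock()
	e, ok := s.blocks[k]
	if !ok {
		return nil, false
	}
	atomic.StoreInt32(&e.referenced, 1)
	return e.val, true
}
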
@petermattis
Collaborator Author

Should the block cache be placed outside of the visibility of the Go GC? This could be accomplished by implementing our own hash table on top of arena-allocated memory.

This is being done in #523.

petermattis added a commit that referenced this issue Feb 6, 2020
Use C malloc/free for the bulk of cache allocations. This required
elevating `cache.Value` to a public citizen of the cache package. A
distinction is made between manually managed memory and automatically
managed memory. Weak handles can only be made from values stored in
automatically managed memory. Note that weak handles are only used for
the index, filter, and range-del blocks, so the number of weak handles
is O(num-tables).

A finalizer is set on `*allocCache` and `*Cache` in order to ensure that
any outstanding manually allocated memory is released when these objects
are collected.

When `invariants` are enabled, finalizers are also set on `*Value` and
sstable iterators to ensure that we're not leaking manually managed
memory.

Fixes #11
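
As a rough illustration of the manual-memory pattern (not Pebble's actual implementation, which sets the finalizers on *allocCache and *Cache rather than on each value), a value backed by C-allocated memory with an explicit release and a finalizer safety net might look like this; manualValue and its methods are hypothetical:

package cache

// #include <stdlib.h>
import "C"

import (
	"runtime"
	"unsafe"
)

// manualValue is a hypothetical stand-in for a cache value whose backing
// buffer lives outside the Go heap and is invisible to the Go GC.
type manualValue struct {
	ptr unsafe.Pointer
	buf []byte
}

func newManualValue(n int) *manualValue {
	v := &manualValue{ptr: C.malloc(C.size_t(n))}
	v.buf = unsafe.Slice((*byte)(v.ptr), n)
	// Safety net: if the value is collected without an explicit release,
	// free the C allocation so it isn't leaked.
	runtime.SetFinalizer(v, (*manualValue).release)
	return v
}

// release frees the manual allocation and clears the finalizer.
func (v *manualValue) release() {
	if v.ptr != nil {
		C.free(v.ptr)
		v.ptr, v.buf = nil, nil
		runtime.SetFinalizer(v, nil)
	}
}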