perf: faster table/block cache #11

Closed
petermattis opened this issue Aug 3, 2018 · 3 comments · Fixed by #523

@petermattis
Collaborator

The existing block cache implementation uses the Clock-PRO replacement policy. The block cache is important for performance: the raw speed of accessing it is a significant component of read performance, and its concurrency directly affects concurrent read performance. There are three areas worth exploring here:

  1. Should the block cache be placed outside of the visibility of the Go GC? This could be accomplished by implementing our own hash table on top of arena-allocated memory.
  2. Improve the concurrency of the block cache, either via sharding or lock-free approaches (a rough sharding sketch follows this list).
  3. Use a better cache replacement policy such as TinyLFU. See also Caffeine, a high-performance Java caching library.
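
For illustration only, the sharding approach from item 2 could look roughly like the following. All names here (ShardedCache, shard, key, numShards) are hypothetical, this is not Pebble's cache API, and a real cache would also need size accounting and eviction:

package cache

import "sync"

const numShards = 16

type key struct {
	fileNum uint64
	offset  uint64
}

type shard struct {
	mu     sync.RWMutex
	blocks map[key][]byte
}

// ShardedCache spreads blocks across independently locked shards, so readers
// and writers only contend when they hash to the same shard.
type ShardedCache struct {
	shards [numShards]shard
}

func NewShardedCache() *ShardedCache {
	c := &ShardedCache{}
	for i := range c.shards {
		c.shards[i].blocks = make(map[key][]byte)
	}
	return c
}

func (c *ShardedCache) shardFor(k key) *shard {
	// Mix the file number and offset so blocks spread across shards.
	h := k.fileNum*0x9e3779b97f4a7c15 ^ k.offset
	return &c.shards[h%numShards]
}

func (c *ShardedCache) Get(k key) ([]byte, bool) {
	s := c.shardFor(k)
	s.mu.RLock()
	defer s.mu.RUnlock()
	b, ok := s.blocks[k]
	return b, ok
}

func (c *ShardedCache) Set(k key, b []byte) {
	s := c.shardFor(k)
	s.mu.Lock()
	defer s.mu.Unlock()
	s.blocks[k] = b
}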
@petermattis petermattis changed the title Faster block cache perf: faster block cache Aug 11, 2018
@olekukonko

I believe the dgraph team wants to implement a Caffeine port: https://blog.dgraph.io/post/caching-in-go/

@petermattis petermattis changed the title perf: faster block cache perf: faster table/block cache Jul 3, 2019
@petermattis petermattis self-assigned this Jul 3, 2019
@petermattis
Collaborator Author

Pebble, like RocksDB, contains two primary caches: the block cache and the table cache. The table cache (implemented by tableCache) caches sstable.Readers. Performance experiments currently show lock contention in the tableCache under concurrent read workloads. The tableCache internally maintains an LRU list, and manipulating that list is the part that requires mutual exclusion. The ristretto project is looking to use the same technique as Caffeine and BP-Wrapper: record page accesses in a lock-free (or thread-local) structure and then apply them in batch. Some hacky experimentation in this direction showed a 25% speedup on a concurrent scan workload.

For tableCache.findNode, we'd add a fast-path check for the case where the table is already in the cache, using an RWMutex instead of a Mutex. Something like:

func (c *tableCache) findNode(meta *fileMetadata) *tableCacheNode {
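	// Fast path: if the table is already in the cache, take a reference
	// under the read lock and return without blocking other readers.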
	c.mu.RLock()
	if n := c.mu.nodes[meta.fileNum]; n != nil {
		atomic.AddInt32(&n.refCount, 1)
		c.mu.RUnlock()
		// TODO(peter): record access of `n`.
		return n
	}
	c.mu.RUnlock()

	c.mu.Lock()
	defer c.mu.Unlock()
	// Handle miss: we need to check c.mu.nodes again, since another
	// goroutine may have inserted the node after we released the read lock...
}
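
To sketch the batched-access idea (the BP-Wrapper/Caffeine technique mentioned above), something like the following could sit alongside findNode. accessBuffer, flushThreshold, and moveToFront are hypothetical names, and a real implementation would keep one buffer per goroutine (or per P) so that appends to the buffer don't themselves contend:

type accessBuffer struct {
	pending []*tableCacheNode
}

const flushThreshold = 64

// recordAccess buffers the access locally; only when the buffer fills do we
// take the shard mutex and apply all of the buffered LRU updates in one batch.
func (c *tableCache) recordAccess(buf *accessBuffer, n *tableCacheNode) {
	buf.pending = append(buf.pending, n)
	if len(buf.pending) < flushThreshold {
		return
	}
	c.mu.Lock()
	for _, m := range buf.pending {
		c.moveToFront(m) // hypothetical: bump m to the front of the LRU list
	}
	c.mu.Unlock()
	buf.pending = buf.pending[:0]
}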

petermattis added a commit that referenced this issue Jul 3, 2019
Updating the tableCacheShard LRU lists requires mutually exclusive
access to the list. Grabbing and releasing tableCacheShard.mu for each
access shows up prominently on blocking profiles. There is a known
technique for avoiding this overhead: record the access in a per-thread
data structure and then batch apply the buffered accesses when the
buffer is full. For a highly concurrent scan workload, this results in a
significant improvement in both throughput and stability of the
throughput numbers. Here are numbers when running with 100 concurrent
scanners before this change:

____optype__elapsed_____ops(total)___ops/sec(cum)
  scan_100   120.0s      109476226       912303.4

After:

____optype__elapsed_____ops(total)___ops/sec(cum)
  scan_100   120.0s      151736009      1264384.6

That's a 39% speedup. Even better, the instantaneous numbers during the
run were stable.

See #11
petermattis added a commit that referenced this issue Jul 3, 2019
The Clock-PRO algorithm doesn't require exclusive access to internal
state on hits. Switch to using an RWMutex, which reduces a source of
contention in cached workloads.

For a concurrent scan workload, performance before this commit was:

____optype__elapsed_____ops(total)___ops/sec(cum)
  scan_100   120.0s      151736009      1264384.6

After:

____optype__elapsed_____ops(total)___ops/sec(cum)
  scan_100   120.0s      170326751      1419354.7

That's a 12% improvement in throughput.

See #11
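
(For context: Clock-PRO can record a hit by marking the entry as referenced rather than moving it in a shared list, which is why the hit path can run entirely under a read lock. A minimal sketch with hypothetical types; entry and clockShard are not Pebble's actual shard structure:)

package cache

import (
	"sync"
	"sync/atomic"
)

type entry struct {
	referenced int32 // set to 1 on access, cleared by the clock hand
	val        []byte
}

type clockShard struct {
	mu     sync.RWMutex
	blocks map[uint64]*entry
}

// get serves a hit under the read lock: recording the access is an atomic
// store on the entry rather than a mutation of shared list state.
func (s *clockShard) get(k uint64) ([]byte, bool) {
	s.mu.RLock()
	defer s.mu.RUnlock()
	e, ok := s.blocks[k]
	if !ok {
		return nil, false
	}
	atomic.StoreInt32(&e.referenced, 1)
	return e.val, true
}
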
@petermattis
Collaborator Author

Should the block cache be placed outside of the visibility of the Go GC? This could be accomplished by implementing our own hash table on top of arena-allocated memory.

This is being done in #523.

petermattis added a commit that referenced this issue Feb 6, 2020
Use C malloc/free for the bulk of cache allocations. This required
elevating `cache.Value` to a public citizen of the cache package. A
distinction is made between manually managed memory and automatically
managed memory. Weak handles can only be made from values stored in
automatically managed memory. Note that weak handles are only used for
the index, filter, and range-del blocks, so the number of weak handles
is O(num-tables).

A finalizer is set on `*allocCache` and `*Cache` in order to ensure that
any outstanding manually allocated memory is released when these objects
are collected.

When `invariants` are enabled, finalizers are also set on `*Value` and
sstable iterators to ensure that we're not leaking manually managed
memory.

Fixes #11
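
As a rough illustration of the manual-memory pattern (not Pebble's actual implementation, which sets the finalizers on *allocCache and *Cache rather than on each value), a value backed by C-allocated memory with an explicit release and a finalizer safety net might look like this; manualValue and its methods are hypothetical:

package cache

// #include <stdlib.h>
import "C"

import (
	"runtime"
	"unsafe"
)

// manualValue is a hypothetical stand-in for a cache value whose backing
// buffer lives outside the Go heap and is invisible to the Go GC.
type manualValue struct {
	ptr unsafe.Pointer
	buf []byte
}

func newManualValue(n int) *manualValue {
	v := &manualValue{ptr: C.malloc(C.size_t(n))}
	v.buf = unsafe.Slice((*byte)(v.ptr), n)
	// Safety net: if the value is collected without an explicit release,
	// free the C allocation so it isn't leaked.
	runtime.SetFinalizer(v, (*manualValue).release)
	return v
}

// release frees the manual allocation and clears the finalizer.
func (v *manualValue) release() {
	if v.ptr != nil {
		C.free(v.ptr)
		v.ptr, v.buf = nil, nil
		runtime.SetFinalizer(v, nil)
	}
}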