
internal/cache: move the bulk of allocations off the Go heap #523

Merged
merged 6 commits on Feb 12, 2020

Conversation


@petermattis petermattis commented Feb 6, 2020

Use C malloc/free for the bulk of cache allocations. This required
elevating cache.Value to a public citizen of the cache package. A
distinction is made between manually managed memory and automatically
managed memory. Weak handles can only be made from values stored in
automatically managed memory. Note that weak handles are only used for
the index, filter, and range-del blocks, so the number of weak handles
is O(num-tables).

A finalizer is set on *allocCache and *Cache in order to ensure that
any outstanding manually allocated memory is released when these objects
are collected.

When invariants are enabled, finalizers are also set on *Value and
sstable iterators to ensure that we're not leaking manually managed
memory.
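
For illustration, the manual allocation path amounts to a thin cgo wrapper around malloc/free, along the lines of the following sketch (names and details are illustrative rather than the exact code in this change):

```go
// Package manual provides byte slices whose backing memory comes from C
// malloc rather than the Go heap. Every New must be paired with a Free.
package manual

// #include <stdlib.h>
import "C"

import "unsafe"

// New returns an n-byte slice backed by C-allocated, zeroed memory.
func New(n int) []byte {
	if n == 0 {
		return nil
	}
	ptr := C.calloc(C.size_t(n), 1)
	if ptr == nil {
		panic("pebble: out of memory")
	}
	return unsafe.Slice((*byte)(ptr), n)
}

// Free releases a slice returned by New. It frees the full capacity, so a
// slice that was truncated (len < cap) is still released correctly. The slice
// must not be used after Free returns.
func Free(b []byte) {
	if cap(b) != 0 {
		b = b[:cap(b)]
		C.free(unsafe.Pointer(&b[0]))
	}
}
```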

Fixes #11


@petermattis petermattis force-pushed the pmattis/manual branch 3 times, most recently from 6b02fcd to d20a26a on February 6, 2020 17:18
@petermattis petermattis left a comment

Reviewable status: 0 of 18 files reviewed, 1 unresolved discussion (waiting on @itsbilal and @sumeerbhola)


sstable/reader.go, line 1243 at r1 (raw file):

	i, err := newRawBlockIter(bytes.Compare, data)
	b.Release()

This PR has a couple of actual bug fixes, which the manually managed block cache memory uncovered. This is an example. We were releasing the cache handle and then proceeding to use the data via the raw block iter. After the cache handle is released, the memory could be reused!
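
As a self-contained illustration of the hazard (hypothetical types, not the actual reader.go code): once a handle over manually managed memory is released, its bytes may be reused, so the data has to be consumed or copied first.

```go
type handle struct{ buf []byte }

func (h *handle) Get() []byte { return h.buf }

// Release returns the buffer to the allocator; afterwards the bytes may be
// reused by an unrelated allocation (modeled here by just dropping them).
func (h *handle) Release() { h.buf = nil }

func useBlock(h *handle) []byte {
	data := h.Get()
	// Wrong: h.Release(); return data -- the bytes may be overwritten once the
	// handle has been released. Copy (or finish consuming) the data first.
	out := append([]byte(nil), data...)
	h.Release()
	return out
}
```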

@petermattis petermattis force-pushed the pmattis/manual branch 2 times, most recently from 6897693 to 9c572de on February 6, 2020 19:04
@petermattis

I added a second commit which moves the memTable arena memory into manually managed (C) memory. This was the next largest source of Pebble memory use.

@petermattis petermattis left a comment

Reviewable status: 0 of 20 files reviewed, 6 unresolved discussions (waiting on @itsbilal and @sumeerbhola)


internal/cache/alloc.go, line 78 at r2 (raw file):

	}
	c.rnd.Seed(uint64(time.Now().UnixNano()))
	runtime.SetFinalizer(c, freeAllocCache)

This is the only required use of SetFinalizer. I wonder if the allocCache is necessary anymore. Previously, it massively reduced GC pressure. But GC pressure is no longer a concern. I'm going to add a TODO.
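
For reference, the backstop looks roughly like the following (a sketch: the bufs field and the freeAllocCache body are assumptions based on the description above, and manual.Free stands in for the C free path; imports elided):

```go
// allocCache keeps a stash of manually allocated buffers for reuse. If the
// allocCache itself is garbage collected while still holding buffers, the
// finalizer returns them to C rather than leaking them.
type allocCache struct {
	bufs [][]byte
}

func newAllocCache() *allocCache {
	c := &allocCache{}
	runtime.SetFinalizer(c, freeAllocCache)
	return c
}

func freeAllocCache(c *allocCache) {
	for i := range c.bufs {
		manual.Free(c.bufs[i]) // hypothetical manual.Free wrapping C free
	}
	c.bufs = nil
}
```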


internal/cache/clockpro.go, line 72 at r2 (raw file):

	// Is buf automatically managed (i.e. it will be reclaimed by GC) or manually
	// managed?
	auto bool

TODO(peter): An alternative would be to set refs to a large positive or negative value (e.g. 1<<30) so it will never get released.


internal/cache/clockpro.go, line 74 at r2 (raw file):

	auto bool

	traces []string

TODO(peter): Figure out a way to move this under a build tag. No need to bloat the Value struct normally.


internal/cache/clockpro.go, line 130 at r2 (raw file):

}

func (v *Value) trace(msg string) {

TODO(peter): move this under a build tag.


internal/cache/clockpro.go, line 736 at r2 (raw file):

}

func checkValue(obj interface{}) {

TODO(peter): move this under a build tag.

@petermattis petermattis force-pushed the pmattis/manual branch 2 times, most recently from b5bc2e9 to e16cf94 on February 7, 2020 16:09
@petermattis

A nice side-effect of the second commit is much lower peak Go memory usage in CRDB unit tests.

@sumeerbhola sumeerbhola left a comment

This is neat. I haven't read all the code yet, but sending some early comments.

Reviewed 4 of 18 files at r1, 6 of 8 files at r3.
Reviewable status: 10 of 25 files reviewed, 9 unresolved discussions (waiting on @itsbilal, @petermattis, and @sumeerbhola)


internal/cache/clockpro.go, line 135 at r4 (raw file):

	if old := e.getValue(); old != nil {
		if old.release() {
			if !old.auto {

this seems a bit odd: IIUC, auto values are only used for weak handles, which means their refcount should never become 0, so why should this need to be checked? Is it because entry.setValue() can be called before the WeakHandle is constructed?

Also, IIUC, there are two sources of references to Value:

  • Handle has a reference to Handle.value
  • The cache has a reference to entry.Val. This is the one being released here. And the reference to the new value has already been acquired before calling setValue().

Is that correct? If so, a code comment would be helpful.

I think it would be cleaner to make Value.release() call allocFree() if the reference count has gone to zero. The contract would be that a caller of Value.release() stops using the Value after it returns.

I think this module could use a high-level comment explaining how the memory management works: how alloc.go produces only manually managed slices, how we track which values are auto and which are manual, why we need auto, etc.


internal/cache/clockpro.go, line 150 at r4 (raw file):

// Handle provides a strong reference to an entry in the cache. The reference
// does not pin the entry in the cache, but it does prevent the underlying byte
// slice from being reused.

Would the following comment be accurate:
When both are non-nil, value is initialized to the *Value corresponding to entry.val. Over the lifetime of the Handle, value stays unchanged but entry.val may point to some other Value.


internal/cache/clockpro.go, line 170 at r4 (raw file):

	if h.value != nil {
		if h.value.release() {
			if !h.value.auto {

same question here about needing to check auto.


internal/cache/clockpro.go, line 173 at r4 (raw file):

				allocFree(h.value.buf)
			}
			h.value.buf = nil

this setting to nil is not needed for correctness, yes?


internal/cache/clockpro.go, line 189 at r4 (raw file):

	}
	h.value.makeWeak()
	return (*WeakHandle)(h.entry)

Assuming I understood the invariant correctly in my earlier comment, when Weak() is called it is possible that h.entry.val is already not pointing to h.value so incrementing the refcount for it isn't necessary (though doing so is harmless, and there is always the possibility of a race there). Is that correct?
If so, could you add a comment.


internal/cache/value.go, line 62 at r4 (raw file):

// Truncate the buffer to the specified length. The buffer length should not be
// changed once the value has been added to the cache. Instead, a new Value

why this "buffer length should not be changed" comment? It seems that manual.Free() does the right thing by using the capacity of the slice.


internal/cache/value.go, line 70 at r4 (raw file):

func (v *Value) makeWeak() {
	if !v.auto {
		panic("pebble: cannot make auto Value into a weak Value")

s/auto/manual/

@petermattis petermattis left a comment

Reviewable status: 8 of 25 files reviewed, 9 unresolved discussions (waiting on @itsbilal and @sumeerbhola)


internal/cache/clockpro.go, line 135 at r4 (raw file):

Previously, sumeerbhola wrote…

this seems a bit odd: IIUC, auto values are only used for weak handles, which means their refcount should never become 0, so why should this need to be checked? Is it because entry.setValue() can be called before the WeakHandle is constructed?

Also, IIUC, there are two sources of references to Value:

  • Handle has a reference to Handle.value
  • The cache has a reference to entry.Val. This is the one being released here. And the reference to the new value has already been acquired before calling setValue().

Is that correct? If so, a code comment would be helpful.

I think it would be cleaner to make Value.release() call allocFree() if the reference count has gone to zero. The contract would be that a caller of Value.release() stops using the Value after it returns.

I think this module could use a high-level comment explaining how the memory management works: how alloc.go produces only manually managed slices, how we track which values are auto and which are manual, why we need auto, etc.

Agreed on the need for a high-level comment explaining the manual memory management. See the new comment above Cache.

You're right about looking at old.auto being confusing. There was a time in this PR when auto values could be returned to the cache, but that is no longer true. Note that Weak may never have been called on an auto value inserted into the cache, so we can actually hit this code path in practice. This makes me suspect that the right thing to do is to remove the Value.auto field and mark auto values by setting a very large negative reference count. I've gone ahead and done that.
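
Roughly, the new scheme looks like this (a sketch; the exact field and method names may differ from what ends up in the code):

```go
// Automatically managed (Go-allocated) Values are marked by pinning their
// reference count to a huge negative sentinel; release() can then never see
// them reach zero, so their buffer is never manually freed.
const autoRefs = -(1 << 30)

type Value struct {
	buf  []byte
	refs int32
}

func (v *Value) makeAuto()    { atomic.StoreInt32(&v.refs, autoRefs) }
func (v *Value) manual() bool { return atomic.LoadInt32(&v.refs) >= 0 }

// release reports whether the caller should free the manually managed buffer.
// For auto values the count stays far below zero.
func (v *Value) release() bool {
	return atomic.AddInt32(&v.refs, -1) == 0
}
```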


internal/cache/clockpro.go, line 150 at r4 (raw file):

Previously, sumeerbhola wrote…

Would the following comment be accurate:
When both are non-nil, value is initialized to the *Value corresponding to entry.val. Over the lifetime of the Handle, value stays unchanged but entry.val may point to some other Value.

Yes. I added something along these lines with slightly different wording.


internal/cache/clockpro.go, line 170 at r4 (raw file):

Previously, sumeerbhola wrote…

same question here about needing to check auto.

Same answer. I'm going to see about getting rid of this, but it is needed for correctness right now.


internal/cache/clockpro.go, line 173 at r4 (raw file):

Previously, sumeerbhola wrote…

this setting to nil is not needed for correctness, yes?

It is needed for the leak checking facility to work. I've added a comment about this.


internal/cache/clockpro.go, line 189 at r4 (raw file):

Previously, sumeerbhola wrote…

Assuming I understood the invariant correctly in my earlier comment, when Weak() is called it is possible that h.entry.val is already not pointing to h.value so incrementing the refcount for it isn't necessary (though doing so is harmless, and there is always the possibility of a race there). Is that correct?
If so, could you add a comment.

Comment added.


internal/cache/value.go, line 62 at r4 (raw file):

Previously, sumeerbhola wrote…

why this "buffer length should not be changed" comment? It seems that manual.Free() does the right thing by using the capacity of the slice.

Once the value has been added to the cache, there may be concurrent readers of the Value. I've expanded the comment to mention this.


internal/cache/value.go, line 70 at r4 (raw file):

Previously, sumeerbhola wrote…

s/auto/manual/

Egads! Done.

@petermattis petermattis force-pushed the pmattis/manual branch 2 times, most recently from 7e534d8 to d3f2287 on February 7, 2020 18:59
@petermattis petermattis left a comment

Reviewable status: 6 of 26 files reviewed, 10 unresolved discussions (waiting on @itsbilal and @sumeerbhola)


docs/memory.md, line 8 at r7 (raw file):

Block Cache. MemTables buffer data that has been written to the WAL
but not yet flushed to an SSTable. The Block Cache provides a cache of
uncompressed SSTable data blocks.

@sumeerbhola I had already started writing this when you asked for the high-level comment on how the cache handles manual memory management. Could be the basis for a blog post in the future.

@petermattis petermattis force-pushed the pmattis/manual branch 4 times, most recently from 70f0a40 to b7514d8 on February 8, 2020 17:51
@petermattis

petermattis commented Feb 8, 2020

I've added a few more commits to this PR extending it to encompass more cache memory allocations. Specifically, we now manually manage the memory for the Value structs themselves, by allocating space for them at the start of the underlying buffer. And we manually manage the memory for entry structs which hold manually managed values (i.e. the vast majority of values in the cache).
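
To illustrate the Value change, the layout is roughly the following (a sketch: manual.New/manual.Free and the exact Value fields here are assumptions):

```go
// A single manual (C) allocation holds the Value header followed by n bytes of
// block data, so neither the struct nor its buffer lives on the Go heap.
//
//	[ Value header | n bytes of data ]
const valueHeaderSize = int(unsafe.Sizeof(Value{}))

func newManualValue(n int) *Value {
	b := manual.New(valueHeaderSize + n) // hypothetical manual.New (C malloc)
	v := (*Value)(unsafe.Pointer(&b[0]))
	v.buf = b[valueHeaderSize:]
	v.refs = 1
	return v
}

func freeManualValue(v *Value) {
	// Reconstruct the original allocation (header + data) and return it to C.
	b := unsafe.Slice((*byte)(unsafe.Pointer(v)), valueHeaderSize+cap(v.buf))
	manual.Free(b)
}
```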

The motivation for this extra work was noticing a performance problem on sysbench/oltp_update_index when running on top of this PR. Specifically, there was a drop in performance compared to letting Go manage the memory. The reason for this performance drop was subtle: the smaller Go heap was causing Go GC to run much more frequently. And while the heap was smaller, the number of pointers in the heap was the same. More frequent GC with the same work per GC equals performance impact. A heap profile showed that 50-60% of objects in the heap were cache.entry and cache.Value structs. Due to the beefy machines running the test, the cache is configured to 17GB. The heap profile showed ~1.2m entry and Value objects combined. Getting the bulk of both of these objects out of the Go heap massively reduces the work the GC has to do.

The only remaining pointer-heavy structures in the cache are the shard.blocks and shard.files maps. It may be interesting to experiment with providing a custom map implementation. I've experimented with Robin Hood hashing in the past. It seems feasible to use that approach, which would allow manually managing the memory backing these maps.

@petermattis petermattis force-pushed the pmattis/manual branch 2 times, most recently from a292d45 to e1474d5 on February 9, 2020 14:43
@petermattis

The only remaining pointer-heavy structures in the cache are the shard.blocks and shard.files maps. It may be interesting to experiment with providing a custom map implementation. I've experimented with Robin Hood hashing in the past. It seems feasible to use that approach, which would allow manually managing the memory backing these maps.

I've added a commit which implements the above. This PR has now spiraled out of control. Let me know if you want it to be broken up into more digestible pieces.

@petermattis petermattis force-pushed the pmattis/manual branch 2 times, most recently from 1a6c510 to f47d186 on February 10, 2020 19:19
@petermattis

Rebased to fix a merge conflict with #524. I'm going to hold off on making any more changes to this PR until it is reviewed.

@sumeerbhola sumeerbhola left a comment

I've taken a look at all the cache related code. Just have some minor comments.

Reviewed 3 of 18 files at r1, 6 of 9 files at r2, 4 of 15 files at r8, 5 of 6 files at r10, 1 of 5 files at r11, 5 of 6 files at r17, 2 of 5 files at r18, 1 of 1 files at r19, 3 of 5 files at r21.
Reviewable status: 25 of 35 files reviewed, 16 unresolved discussions (waiting on @itsbilal, @petermattis, and @sumeerbhola)


internal/cache/alloc.go, line 36 at r14 (raw file):

// allocNew allocates a slice of size n. The use of sync.Pool provides a
// per-cpu cache of allocCache structures to allocate from.

Can you add a file-level comment that allocNew() allocates non-garbage-collected memory, so every call must be paired with allocFree()?


internal/cache/clockpro.go, line 178 at r21 (raw file):

	// Acquire a reference for the cache. We'll be adjusting this below if the
	// value is not actually added to the cache.

aren't values created with a refcount of 1? isn't that the reference for the cache?


internal/cache/entry.go, line 48 at r21 (raw file):

// there are other pointers to the entry and shard which will keep them
// alive. In particular, every entry in the cache is referenced by the
// shard.blocks map. And every shard is referenced by the Cache.shards map.

is there a tracing mode that will tell us that the pointers stored in an entry point to other entries that have been reclaimed by Go GC?
Say a global map[*entry]int where the value is a refcount, and is incremented and decremented whenever an entry starts/stops pointing to another entry and also tracks the other pointers to an entry.


internal/cache/entry_normal.go, line 68 at r21 (raw file):

func (c *entryAllocCache) free(e *entry) {
	c.entries = append(c.entries, e)

is there no limit to the number of entry structs the cache will keep?


internal/cache/value_normal.go, line 18 at r14 (raw file):

//
// This is the definition of Value that is used in normal builds.
type Value struct {

isn't this the same as value_notracing.go -- why this duplication?


sstable/reader.go, line 1086 at r21 (raw file):

	// The value will be "freed" when the iterator is closed, so make a copy
	// which will outlast the lifetime of the iterator.
	newValue := make([]byte, len(value))

was this also a bug?

@petermattis petermattis left a comment

TFTR!

Reviewable status: 12 of 29 files reviewed, 16 unresolved discussions (waiting on @itsbilal and @sumeerbhola)


internal/cache/alloc.go, line 36 at r14 (raw file):

Previously, sumeerbhola wrote…

Can you add a file-level comment that allocNew() allocates non-garbage-collected memory, so every call must be paired with allocFree()?

I didn't see a good place to leave a file-level comment, but I extended the comment here to indicate allocFree MUST be called.


internal/cache/clockpro.go, line 178 at r21 (raw file):

Previously, sumeerbhola wrote…

aren't values created with a refcount of 1? isn't that the reference for the cache?

Yes, values are created with a reference count of 1. I've been thinking of that reference count as being for the handle. I've moved this comment below to where this reference count is being transferred to the handle, and moved the acquire so it is only done if the value is being added to the cache.


internal/cache/entry.go, line 48 at r21 (raw file):

Previously, sumeerbhola wrote…

is there a tracing mode that will tell us that the pointers stored in an entry point to other entries that have been reclaimed by Go GC?
Say a global map[*entry]int where the value is a refcount, and is incremented and decremented whenever an entry starts/stops pointing to another entry and also tracks the other pointers to an entry.

I don't have something like you're describing, but there is shard.metaCheck which looks through the shard maps to try and find dangling references to an entry. Is that sufficient?


internal/cache/entry_normal.go, line 68 at r21 (raw file):

Previously, sumeerbhola wrote…

is there no limit to the number of entry structs the cache will keep?

Correct. My thinking was that the objects in the sync.Pool will be gc'd eventually and the memory usage isn't significant.


internal/cache/value_normal.go, line 18 at r14 (raw file):

Previously, sumeerbhola wrote…

isn't this the same as value_notracing.go -- why this duplication?

Yes, it is identical. No reason for the duplication other than my not figuring out earlier how to get rid of it. Done.


sstable/reader.go, line 1086 at r21 (raw file):

Previously, sumeerbhola wrote…

was this also a bug?

Sort of. This was something I found when trying to use manual memory management for weak handles. With auto handles, we'd never try to reuse the cached block. Also note, this code path is only used for testing as pebble.DB.Get uses iterators in order to handle merge records and range tombstones. Seems worth keeping, though.

@petermattis petermattis left a comment

Reviewable status: 12 of 29 files reviewed, 15 unresolved discussions (waiting on @itsbilal and @sumeerbhola)


internal/cache/entry_normal.go, line 68 at r21 (raw file):

Previously, petermattis (Peter Mattis) wrote…

Correct. My thinking was that the objects in the sync.Pool will be gc'd eventually and the memory usage isn't significant.

I thought about this more last night and decided to add a limit (128). I think there could be scenarios where the sync.Pool is not clearing the caches and large numbers of entries are piling up.
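
Something like the following sketch (entryAllocCache and entry are from the existing code; newEntry/freeEntry are stand-ins for the manual allocation path):

```go
const entryAllocCacheLimit = 128

func (c *entryAllocCache) alloc() *entry {
	n := len(c.entries)
	if n == 0 {
		return newEntry() // hypothetical: fresh (manually allocated) entry
	}
	e := c.entries[n-1]
	c.entries = c.entries[:n-1]
	return e
}

func (c *entryAllocCache) free(e *entry) {
	if len(c.entries) >= entryAllocCacheLimit {
		freeEntry(e) // hypothetical: return the manual allocation immediately
		return
	}
	*e = entry{} // clear for reuse
	c.entries = append(c.entries, e)
}
```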

@sumeerbhola sumeerbhola left a comment

Reviewed 1 of 21 files at r15, 1 of 21 files at r22, 1 of 7 files at r23, 1 of 5 files at r24, 1 of 1 files at r26, 5 of 5 files at r27.
Reviewable status: 22 of 29 files reviewed, 18 unresolved discussions (waiting on @itsbilal, @petermattis, and @sumeerbhola)


docs/memory.md, line 44 at r27 (raw file):

Attempting to adjust `GOGC` to account for the significant amount of
memory used by the Block Cache is fraught. What value should be used?
`10%`? `20%`? Should the setting be tuned dynamically? Rather than

Do we know the memory limit when the node starts and do we have the block cache as a percentage of that memory limit?
If the latter were 0.8 and we knew that it would be consumed by the cache, it seems like (1/0.8 - 1) * 100 would work as the percentage. And we could possibly set it dynamically based on the current live bytes in the cache (say B) and the memory limit L. Something like:
f = B/L
min(1/f - 1, 2) * 100

Is it worth running one benchmark both with this PR and the simple change and looking at the difference in memory and cpu (especially if you are thinking of a blog post)?


internal/cache/robin_hood.go, line 243 at r27 (raw file):

}

func (m *robinHoodMap) Delete(k key) {

given the context this is used, I am assuming there isn't a need to shrink the capacity if deletes are leaving too many empty slots?


internal/cache/robin_hood_test.go, line 13 at r27 (raw file):

	"golang.org/x/exp/rand"
)

do you have a benchmark comparing the two?

@petermattis petermattis left a comment

Reviewable status: 22 of 29 files reviewed, 18 unresolved discussions (waiting on @itsbilal and @sumeerbhola)


docs/memory.md, line 44 at r27 (raw file):

Previously, sumeerbhola wrote…

Do we know the memory limit when the node starts and do we have the block cache as a percentage of that memory limit?
If the latter were 0.8 and we knew that it would be consumed by the cache, it seems like (1/0.8 - 1) * 100 would work as the percentage. And we could possibly set it dynamically based on the current live bytes in the cache (say B) and the memory limit L. Something like:
f = B/L
min(1/f - 1, 2) * 100

Is it worth running one benchmark both with this PR and the simple change and looking at the difference in memory and cpu (especially if you are thinking of a blog post)?

We know the block cache size when the node starts, but there is no hard memory limit. We can see the memory limit of the machine / container, but it isn't necessarily ok to use all of that memory. And we also need to leave memory for SQL/other usage in the system. I could expand here on why I think tweaking the GOGC percent is fraught. Let's consider the sysbench/oltp_update_index workload. The block cache is configured at 17GB and gets mostly filled. If we dynamically lowered GOGC we'd run into exactly the situation that the first commit in this PR experienced: more frequent GCs with more work to be done. This is going to be a pure drag on performance compared to manually managing memory. There is also the "ballast trick" (mentioned in golang/go#23044) which attempts a similar effect to dynamically changing GOGC by performing large "ballast" allocations that are otherwise unused.


internal/cache/robin_hood.go, line 243 at r27 (raw file):

Previously, sumeerbhola wrote…

given the context this is used, I am assuming there isn't a need to shrink the capacity if deletes are leaving too many empty slots?

That's my expectation as well. Having this map be mostly empty doesn't seem harmful. If the cache grew to a certain size at some point, it is likely to grow to that size again.

I wonder if there is a fraction m.count / m.size at which it makes sense to shrink. I'm not sure what it is for Robin Hood hashing. Maybe the internets know.


internal/cache/robin_hood_test.go, line 13 at r27 (raw file):

Previously, sumeerbhola wrote…

do you have a benchmark comparing the two?

https://github.com/petermattis/maptoy. The implementation here has diverged a little bit from the maptoy implementation (a few bug fixes), but I expect the numbers are fairly similar. I just didn't port that over. Even if robinHoodMap was slightly slower, the manually managed memory makes it beneficial.

Should I port the benchmark?

@petermattis petermattis left a comment

Reviewable status: 20 of 29 files reviewed, 18 unresolved discussions (waiting on @itsbilal and @sumeerbhola)


internal/cache/robin_hood_test.go, line 13 at r27 (raw file):

Previously, petermattis (Peter Mattis) wrote…

https://github.com/petermattis/maptoy. The implementation here has diverged a little bit from the maptoy implementation (a few bug fixes), but I expect the numbers are fairly similar. I just didn't port that over. Even if robinHoodMap was slightly slower, the manually managed memory makes it beneficial.

Should I port the benchmark?

I ported the benchmark.

@sumeerbhola sumeerbhola left a comment

Reviewable status: 20 of 29 files reviewed, 21 unresolved discussions (waiting on @itsbilal, @petermattis, and @sumeerbhola)


docs/memory.md, line 44 at r27 (raw file):

Previously, petermattis (Peter Mattis) wrote…

We know the block cache size when the node starts, but there is no hard memory limit. We can see the memory limit of the machine / container, but it isn't necessarily ok to use all of that memory. And we also need to leave memory for SQL/other usage in the system. I could expand here on why I think tweaking the GOGC percent is fraught. Let's consider the sysbench/oltp_update_index workload. The block cache is configured at 17GB and gets mostly filled. If we dynamically lowered GOGC we'd run into exactly the situation that the first commit in this PR experienced: more frequent GCs with more work to be done. This is going to be a pure drag on performance compared to manually managing memory. There is also the "ballast trick" (mentioned in golang/go#23044) which attempts a similar effect to dynamically changing GOGC by performing large "ballast" allocations that are otherwise unused.

I don't quite understand. If we know the limit on (other usage + block cache) that is the L in the earlier equation.


internal/cache/entry.go, line 48 at r28 (raw file):

// there are other pointers to the entry and shard which will keep them
// alive. In particular, every entry in the cache is referenced by the
// shard.blocks map. And every shard is referenced by the Cache.shards map.

this is the entries map now.

I was thinking about this PR again, and I am wondering if this is getting too intricate and complicated.
Given that Go does not allow us to construct a reffed_ptr that would follow the RAII idiom, due to the lack of control over assignment operators and no destructors, we have to remember to accurately ref and unref in the code, which seems bug prone.

If we could adjust GOGC to make the large heap problem go away, it seems the remaining problem is the number of pointers in the internal data-structures of the cache. Can that be "fixed" by changing all the cache internal data-structures to use a uintptr instead of pointers and rely on the entries map to keep things alive? Making the cache remove all the uintptrs for a pointer that it was evicting is a more localized change -- it doesn't matter if someone outside the cache is keeping that memory alive. So all memory would be Go allocated.


internal/cache/robin_hood.go, line 23 at r28 (raw file):

var hashSeed = uint64(time.Now().UnixNano())

// Fibonacci hash: https://probablydance.com/2018/06/16/fibonacci-hashing-the-optimization-that-the-world-forgot-or-a-better-alternative-to-integer-modulo/

I didn't read the full article, but I am assuming the following is a mapping of it to our context?

size_t index_for_hash(size_t hash)
{
    hash ^= hash >> shift_amount;
    return (11400714819323198485llu * hash) >> shift_amount;
}

What is the motivation for the `... | 1)`? I am guessing you want the output to always be odd, but by doing so isn't the difference between consecutive even and odd inputs lost, causing hash collisions?


internal/cache/robin_hood.go, line 35 at r28 (raw file):

type robinHoodEntry struct {
	key   key
	value *entry

can this also be a Go pointer? If yes, I think it is worth repeating here the comment from entry.go.

Use C malloc/free for the bulk of cache allocations. This required
elevating `cache.Value` to a public citizen of the cache package. A
distinction is made between manually managed memory and automatically
managed memory. Weak handles can only be made from values stored in
automatically managed memory. Note that weak handles are only used for
the index, filter, and range-del blocks, so the number of weak handles
is O(num-tables).

A finalizer is set on `*allocCache` and `*Cache` in order to ensure that
any outstanding manually allocated memory is released when these objects
are collected.

When `invariants` are enabled, finalizers are also set on `*Value` and
sstable iterators to ensure that we're not leaking manually managed
memory.

Fixes #11
For manual Values, allocate the `Value` at the start of the backing
buffer. For large block caches, the number of Values can reach 500k-1m
or higher, and consume a significant fraction of the total allocated
objects in the Go heap.
Each entry in the cache has an associated `entry` struct. For large
block caches, the number of entries can reach 500k-1m or higher, and
consume a significant fraction of the total allocated objects in the Go
heap. For manually managed values, the lifetime of the associated
`entry` is known and the `*entry` never escapes outside of the cache
package. This allows us to manually manage the memory for these entries,
which are the majority of entries in the cache.
The "invariants" build tag causes different code paths to be
utilized. Add a no-invariants test configuration for testing the
invariant-less code paths.

Change the race builds to not specify the "invariants" tag as doing so
causes them to frequently flake on the CI machines. This appears to be
due to the size of the machines (RAM available) and not something more
serious.
Add `robinHoodMap` which is a hash map implemented using Robin Hood
hashing. `robinHoodMap` is specialized to use `key` and `*entry` as the
key and value for the map respectively. The memory for the backing array
is manually allocated. This lowers GC pressure by moving the allocated
memory out of the Go heap, and by moving all of `*entry` pointers out of
the Go heap.

Use `robinHoodMap` for both the `shard.blocks` and `shard.files` maps.

Old is Go-map, new is RobinHood-map.

name              old time/op  new time/op  delta
MapInsert-16       144ns ± 5%   101ns ± 8%  -29.88%  (p=0.000 n=10+9)
MapLookupHit-16    137ns ± 1%    47ns ± 1%  -65.58%  (p=0.000 n=8+10)
MapLookupMiss-16  89.9ns ± 4%  48.6ns ± 3%  -45.95%  (p=0.000 n=10+9)
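
For readers unfamiliar with the scheme: Robin Hood hashing records, for each bucket, the distance from the key's "home" bucket, and insertion displaces elements that are closer to home than the incoming key. That invariant lets lookups terminate early. A minimal Go-heap-backed sketch of the lookup side (illustrative only; the real robinHoodMap is specialized to key/*entry and keeps its bucket array in manually allocated memory):

```go
type rhBucket struct {
	key  uint64
	val  *entry
	dist uint32 // displacement from the key's home bucket
	used bool
}

type rhMap struct {
	buckets []rhBucket          // length is a power of two
	hash    func(uint64) uint64 // maps a key to its home bucket index
}

func (m *rhMap) get(k uint64) (*entry, bool) {
	mask := uint64(len(m.buckets) - 1)
	i := m.hash(k) & mask
	for dist := uint32(0); ; dist++ {
		b := &m.buckets[i]
		if !b.used || dist > b.dist {
			// Any element this far from its home bucket would have displaced b
			// during insertion, so the key cannot be present.
			return nil, false
		}
		if b.key == k {
			return b.val, true
		}
		i = (i + 1) & mask
	}
}
```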
@petermattis petermattis left a comment

Reviewable status: 20 of 29 files reviewed, 21 unresolved discussions (waiting on @itsbilal, @petermattis, and @sumeerbhola)


internal/cache/entry.go, line 48 at r28 (raw file):

this is the entries map now.

Fixed.

Given that Go does not allow us to construct a reffed_ptr that would follow the RAII idiom, due to the lack of control over assignment operators and no destructors, we have to remember to accurately ref and unref in the code, which seems bug prone.

It certainly is a bit awkward in Go to do this manual memory management, yet we have to do something similar for file descriptors in Go. So it is possible to get this right. I'm also feeling fairly good about the correctness due to the checks that take place under the "invariants" build tag. Without those checks I'd be very scared of this PR. With them, I feel pretty good given that the number of places we need to track references is limited.

If we could adjust GOGC to make the large heap problem go away, it seems the remaining problem is the number of pointers in the internal data-structures of the cache.

I have no evidence this is possible. Only speculation from myself, and murmurings on the internets. My suspicion is that we could get some benefit from dynamically adjusting GOGC, but we'll be fighting with those heuristics for a long time. In contrast, I think this PR is a bit of near term pain to get right, but then it will be solid and can generally be ignored.

I'll also reiterate another problem I think I alluded to earlier, it would be fairly obnoxious from a usability perspective for Pebble, a library, to muck with GOGC, a process-wide control.

Other systems built on GC'd languages have placed caches in "off-heap" memory. Cassandra does this, for instance.

Can that be "fixed" by changing all the cache internal data-structures to use a uintptr instead of pointers and rely on the entries map to keep things alive? Making the cache remove all the uintptrs for a pointer that it was evicting is a more localized change -- it doesn't matter if someone outside the cache is keeping that memory alive. So all memory would be Go allocated.

If Go has allocated the *entry pointer such that we need an entries map to keep it referenced and alive, then the GC will have to traverse all of the entries in the map. Using uintptrs for all of the internal pointers would reduce the number of pointers in the heap by a constant factor, but you'd still have O(num-entries) pointers at minimum. Would that reduction be sufficient? Possibly, though it seems like it can't be better (from a performance perspective) than the approach in this PR.


internal/cache/robin_hood.go, line 23 at r28 (raw file):

Previously, sumeerbhola wrote…

I didn't read the full article, but I am assuming the following is a mapping of it to our context?

size_t index_for_hash(size_t hash)
{
    hash ^= hash >> shift_amount;
    return (11400714819323198485llu * hash) >> shift_amount;
}

What is the motivation for the `... | 1)`? I am guessing you want the output to always be odd, but by doing so isn't the difference between consecutive even and odd inputs lost, causing hash collisions?

The motivation was to avoid returning a zero hash value, but I don't think that is problematic anymore. I've removed the | 1.
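
For reference, the quoted index_for_hash corresponds to roughly the following in Go (sketch only; shift is derived from the table size, and there is no trailing `| 1` anymore):

```go
// Fibonacci hashing: multiply by 2^64/phi and keep the top bits; shift is
// 64 - log2(number of buckets).
func robinHoodIndex(h uint64, shift uint32) uint64 {
	h ^= h >> shift
	return (11400714819323198485 * h) >> shift
}
```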


internal/cache/robin_hood.go, line 35 at r28 (raw file):

Previously, sumeerbhola wrote…

can this also be a Go pointer? If yes, I think it is worth repeating here the comment from entry.go.

Yes, this can be a Go allocated object.

@petermattis petermattis left a comment

Reviewable status: 14 of 29 files reviewed, 21 unresolved discussions (waiting on @itsbilal and @sumeerbhola)


docs/memory.md, line 44 at r27 (raw file):

Previously, sumeerbhola wrote…

I don't quite understand. If we know the limit on (other usage + block cache) that is the L in the earlier equation.

I'm also not following the equation you gave above. Values of GOGC above 100 make the GC less aggressive and cause the heap to grow. Values below 100 make the GC run more frequently. I think a heuristic would need to keep GOGC in the range [20,100], depending on how full the block cache is.

I'm mildly interested to know whether tweaking GOGC dynamically is a beneficial technique, but as I mentioned elsewhere, this is really speculative. I've never done this. I've never seen any Go code which does this. And a search right now doesn't reveal anyone recommending this (the subject is silent, though I know I saw mention of this at some point). I did find golang/go#32105 which was a recently fixed deadlock bug with SetGCPercent.

@sumeerbhola sumeerbhola left a comment

Reviewable status: 14 of 29 files reviewed, 21 unresolved discussions (waiting on @itsbilal, @petermattis, and @sumeerbhola)


docs/memory.md, line 44 at r27 (raw file):

I'm also not following the equation you gave above.

it should have said
min(1/f - 1, 1) * 100
But anyway that was a rough idea -- it would need some refinement.
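
Spelled out, the rough idea is something like the following (purely illustrative, with the same caveats):

```go
// Purely illustrative: given live block cache bytes and an assumed overall
// memory limit (0 < cacheBytes <= memLimit), compute a GOGC percentage per the
// rough formula above.
func gogcFor(cacheBytes, memLimit float64) int {
	f := cacheBytes / memLimit
	return int(math.Min(1/f-1, 1) * 100)
}

// e.g. debug.SetGCPercent(gogcFor(liveCacheBytes, memLimit))
```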


internal/cache/entry.go, line 48 at r28 (raw file):

I'll also reiterate another problem I think I alluded to earlier, it would be fairly obnoxious from a usability perspective for Pebble, a library, to muck with GOGC, a process-wide control.

That is a fair point.

@petermattis

Ok, in we go!
