Merge pull request #523 from cockroachdb/pmattis/manual

internal/cache: move the bulk of allocations off the Go heap

petermattis committed Feb 12, 2020
2 parents d91aa94 + 0f2704f, commit c39589c

Showing 29 changed files with 1,757 additions and 311 deletions.
11 changes: 5 additions & 6 deletions .travis.yml
@@ -17,7 +17,11 @@ matrix:
   - name: "go1.13.x-linux-race"
     go: 1.13.x
     os: linux
-    script: make testrace
+    script: make testrace TAGS=
+  - name: "go1.13.x-linux-no-invariants"
+    go: 1.13.x
+    os: linux
+    script: make test TAGS=
   - name: "go1.13.x-darwin"
     go: 1.13.x
     os: osx
@@ -26,11 +26,6 @@ matrix:
     go: 1.13.x
     os: windows
     script: go test ./...
-  - name: "go1.13.x-freebsd"
-    go: 1.13.x
-    os: linux
-    # NB: "env: GOOS=freebsd" does not have the desired effect.
-    script: GOOS=freebsd go build -v ./...
 
 notifications:
   email:
6 changes: 5 additions & 1 deletion db_test.go
@@ -689,6 +689,7 @@ func TestIterLeak(t *testing.T) {
 				t.Fatal(err)
 			}
 		} else {
+			defer iter.Close()
 			if err := d.Close(); err == nil {
 				t.Fatalf("expected failure, but found success")
 			} else if !strings.HasPrefix(err.Error(), "leaked iterators:") {
@@ -714,7 +715,10 @@ func TestMemTableReservation(t *testing.T) {
 	// Add a block to the cache. Note that the memtable size is larger than the
 	// cache size, so opening the DB should cause this block to be evicted.
 	tmpID := opts.Cache.NewID()
-	opts.Cache.Set(tmpID, 0, 0, []byte("hello world"))
+	helloWorld := []byte("hello world")
+	value := opts.Cache.AllocManual(len(helloWorld))
+	copy(value.Buf(), helloWorld)
+	opts.Cache.Set(tmpID, 0, 0, value).Release()
 
 	d, err := Open("", opts)
 	if err != nil {
86 changes: 86 additions & 0 deletions docs/memory.md
@@ -0,0 +1,86 @@
# Memory Management

## Background

Pebble has two significant sources of memory usage: MemTables and the
Block Cache. MemTables buffer data that has been written to the WAL
but not yet flushed to an SSTable. The Block Cache provides a cache of
uncompressed SSTable data blocks.

Originally, Pebble used regular Go memory allocation for the memory
backing both MemTables and the Block Cache. This was problematic as it
put significant pressure on the Go GC. The higher the bandwidth of
memory allocations, the more work the GC has to do to reclaim the
memory. In order to lessen the pressure on the Go GC, an "allocation
cache" was introduced to the Block Cache which allowed reusing the
memory backing cached blocks in most circumstances. This produced a
dramatic reduction in GC pressure and a measurable performance
improvement in CockroachDB workloads.

Unfortunately, the use of Go-allocated memory still caused a
problem: CockroachDB running on top of Pebble often had an RSS
(resident set size) 2x what it was when using RocksDB. This effect is
due to the Go runtime's heuristic for triggering GC:

> A collection is triggered when the ratio of freshly allocated data
> to live data remaining after the previous collection reaches this
> percentage.

This percentage can be configured by the
[`GOGC`](https://golang.org/pkg/runtime/) environment variable or by
calling
[`debug.SetGCPercent`](https://golang.org/pkg/runtime/debug/#SetGCPercent). The
default value is `100`, which means that GC is triggered when the
freshly allocated data is equal to the amount of live data at the end
of the last collection period. This generally works well in practice,
but the Pebble Block Cache is often configured to be 10s of gigabytes
in size. Waiting for 10s of gigabytes of data to be allocated before
triggering a GC results in very large Go heap sizes.
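
To make the arithmetic concrete: with a 10 GiB live heap (mostly Block
Cache) and the default `GOGC=100`, the next collection is not triggered
until roughly another 10 GiB has been allocated, for a peak heap near
20 GiB. A minimal sketch of the knob itself (the `10` is illustrative,
not a recommendation):

```go
package main

import (
	"fmt"
	"runtime/debug"
)

func main() {
	// Default: GC triggers when fresh allocations equal live data (100%).
	// With a 10 GiB live heap: GOGC=100 => next GC at ~20 GiB heap;
	// GOGC=10 => next GC at ~11 GiB heap, but the GC runs 10x as often.
	old := debug.SetGCPercent(10)
	fmt.Printf("previous GOGC: %d\n", old)
}
```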

## Manual Memory Management

Attempting to adjust `GOGC` to account for the significant amount of
memory used by the Block Cache is fraught. What value should be used?
`10%`? `20%`? Should the setting be tuned dynamically? Rather than
introducing a heuristic which may have cascading effects on the
application using Pebble, we decided to move the Block Cache and
MemTable memory out of the Go heap. This is done by using the C memory
allocator, though it could also be done by providing a simple memory
allocator in Go which uses `mmap` to allocate memory.
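
A minimal sketch of what such a C-allocator-backed package might look
like, using cgo. This is illustrative only; the real `internal/manual`
package is the authority on the details:

```go
// Package manual provides byte slices whose backing memory lives
// outside the Go heap, so it neither adds GC pressure nor counts
// toward the GC trigger heuristic.
package manual

// #include <stdlib.h>
import "C"

import "unsafe"

// maxArrayLen bounds the array type used to view C memory as a slice.
const maxArrayLen = 1 << 30

// New returns an n-byte slice backed by C memory. The caller is
// responsible for eventually passing the slice to Free.
func New(n int) []byte {
	if n == 0 {
		return nil
	}
	ptr := C.calloc(C.size_t(n), 1)
	if ptr == nil {
		panic("manual: allocation failed")
	}
	// View the C allocation as a Go slice, pinning the capacity to n.
	return (*[maxArrayLen]byte)(ptr)[:n:n]
}

// Free releases memory obtained from New. Freeing a slice twice, or
// freeing a Go-allocated slice, is undefined behavior.
func Free(b []byte) {
	if cap(b) != 0 {
		b = b[:cap(b)]
		C.free(unsafe.Pointer(&b[0]))
	}
}
```

Because the slice's capacity is pinned, an accidental `append` past it
allocates a fresh Go-heap slice rather than writing past the end of the
C allocation.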

In order to support manual memory management for the Block Cache and
MemTables, Pebble needs to precisely track their lifetime. This was
already being done for the MemTable in order to account for its memory
usage in metrics. It was mostly being done for the Block Cache. Values
stored in the Block Cache are reference counted and are returned to
the "alloc cache" when their reference count falls
to 0. Unfortunately, this tracking wasn't precise and there were
numerous cases where the cache values were being leaked. This was
acceptable in a world where the Go GC would clean up after us. It is
unacceptable if the leak becomes permanent.
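
The tracking that makes manual management safe is plain reference
counting. The following is a simplified sketch of the idea, not
Pebble's actual `cache.Value` (which carries more state):

```go
package cache

import (
	"sync/atomic"

	"github.com/cockroachdb/pebble/internal/manual"
)

// Value is a sketch of a reference-counted cache entry whose buffer
// lives outside the Go heap.
type Value struct {
	buf  []byte
	refs int32
}

// newValue allocates a value with a single reference held by the caller.
func newValue(n int) *Value {
	return &Value{buf: manual.New(n), refs: 1}
}

func (v *Value) acquire() {
	atomic.AddInt32(&v.refs, 1)
}

// release returns the buffer to the manual allocator when the last
// reference is dropped. A missing release is now a permanent leak; an
// extra one is a use-after-free. Both must be caught in testing.
func (v *Value) release() {
	switch refs := atomic.AddInt32(&v.refs, -1); {
	case refs < 0:
		panic("cache: value released too many times")
	case refs == 0:
		manual.Free(v.buf)
		v.buf = nil
	}
}
```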

## Leak Detection

In order to find all of the cache value leaks, Pebble has a leak
detection facility built on top of
[`runtime.SetFinalizer`](https://golang.org/pkg/runtime/#SetFinalizer). A
finalizer is a function associated with an object which is run when
the object is no longer reachable. On the surface, this sounds perfect
as a facility for performing all memory reclamation. Unfortunately,
finalizers are generally frowned upon by the Go implementors, and come
with very loose guarantees:

> The finalizer is scheduled to run at some arbitrary time after the
> program can no longer reach the object to which obj points. There is
> no guarantee that finalizers will run before a program exits, so
> typically they are useful only for releasing non-memory resources
> associated with an object during a long-running program.

This language is somewhat frightening, but in practice finalizers are
run at the end of every GC period. Pebble does not use finalizers for
correctness, but instead uses them for its leak detection facility. In
the Block Cache, a finalizer is associated with the Go-allocated
`cache.Value` object. When the finalizer is run, it checks that the
buffer backing the `cache.Value` has been freed. This leak detection
facility is enabled by the `"invariants"` build tag, which is enabled
by the Pebble unit tests.
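
A sketch of how such a check can be wired up, continuing the `Value`
sketch above (hypothetical shape; add `runtime`, `fmt`, and `os` to the
imports):

```go
// newInvariantsValue is newValue plus a leak check. Finalizers run at
// the end of a GC cycle, after the *Value has become unreachable; if
// the buffer is still live at that point, no release() ever freed it.
func newInvariantsValue(n int) *Value {
	v := newValue(n)
	runtime.SetFinalizer(v, func(v *Value) {
		if v.buf != nil {
			fmt.Fprintf(os.Stderr, "%p: cache value leaked\n", v)
			os.Exit(1)
		}
	})
	return v
}
```

Because the finalizer is only a check, the loose scheduling guarantees
quoted above merely make the detector less prompt; they never affect
correctness.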
29 changes: 25 additions & 4 deletions internal/cache/alloc.go
@@ -5,9 +5,11 @@
 package cache
 
 import (
+	"runtime"
 	"sync"
 	"time"
 
+	"github.com/cockroachdb/pebble/internal/manual"
 	"golang.org/x/exp/rand"
 )
@@ -30,8 +32,17 @@ var allocPool = sync.Pool{
 	},
 }
 
-// allocNew allocates a slice of size n. The use of sync.Pool provides a
-// per-cpu cache of allocCache structures to allocate from.
+// allocNew allocates a non-garbage-collected slice of size n. Every call to
+// allocNew() MUST be paired with a call to allocFree(). Failure to do so will
+// result in a memory leak. The use of sync.Pool provides a per-cpu cache of
+// allocCache structures to allocate from.
+//
+// TODO(peter): Is the allocCache still necessary for performance? Before the
+// introduction of manual memory management, the allocCache dramatically
+// reduced GC pressure by reducing allocation bandwidth. It no longer serves
+// this purpose because manual.{New,Free} don't produce any GC pressure. Will
+// need to run benchmark workloads to see if this can be removed, which would
+// allow the removal of the one required use of runtime.SetFinalizer.
 func allocNew(n int) []byte {
 	a := allocPool.Get().(*allocCache)
 	b := a.alloc(n)
@@ -73,12 +84,20 @@ func newAllocCache() *allocCache {
 		bufs: make([][]byte, 0, allocCacheCountLimit),
 	}
 	c.rnd.Seed(uint64(time.Now().UnixNano()))
+	runtime.SetFinalizer(c, freeAllocCache)
 	return c
 }
 
+func freeAllocCache(obj interface{}) {
+	c := obj.(*allocCache)
+	for i := range c.bufs {
+		manual.Free(c.bufs[i])
+	}
+}
+
 func (c *allocCache) alloc(n int) []byte {
 	if n < allocCacheMinSize || n >= allocCacheMaxSize {
-		return make([]byte, n)
+		return manual.New(n)
 	}
 
 	class := sizeToClass(n)
@@ -92,12 +111,13 @@ func (c *allocCache) alloc(n int) []byte {
 		}
 	}
 
-	return make([]byte, n, classToSize(class))
+	return manual.New(classToSize(class))[:n]
 }
 
 func (c *allocCache) free(b []byte) {
 	n := cap(b)
 	if n < allocCacheMinSize || n >= allocCacheMaxSize {
+		manual.Free(b)
 		return
 	}
 	b = b[:n:n]
@@ -117,6 +137,7 @@ func (c *allocCache) free(b []byte) {
 	// are biased, but that is fine for the usage here.
 	j := (uint32(len(c.bufs)) * (uint32(c.rnd.Uint64()) & ((1 << 16) - 1))) >> 16
 	c.size -= cap(c.bufs[j])
+	manual.Free(c.bufs[j])
 	c.bufs[i], c.bufs[j] = nil, c.bufs[i]
 	c.bufs = c.bufs[:i]
 }
8 changes: 5 additions & 3 deletions internal/cache/alloc_test.go
@@ -7,12 +7,14 @@ package cache
 import (
 	"testing"
 	"unsafe"
+
+	"github.com/cockroachdb/pebble/internal/manual"
 )
 
 func TestAllocCache(t *testing.T) {
 	c := newAllocCache()
 	for i := 0; i < 64; i++ {
-		c.free(make([]byte, 1025))
+		c.free(manual.New(1025))
 		if c.size == 0 {
 			t.Fatalf("expected cache size to be non-zero")
 		}
@@ -34,7 +36,7 @@ func TestAllocCache(t *testing.T) {
 func TestAllocCacheEvict(t *testing.T) {
 	c := newAllocCache()
 	for i := 0; i < allocCacheCountLimit; i++ {
-		c.free(make([]byte, 1024))
+		c.free(manual.New(1024))
 	}
 
 	bufs := make([][]byte, allocCacheCountLimit)
@@ -61,7 +63,7 @@ func BenchmarkAllocCache(b *testing.B) {
 	// Populate the cache with buffers of one size class.
 	c := newAllocCache()
 	for i := 0; i < allocCacheCountLimit; i++ {
-		c.free(make([]byte, 1024))
+		c.free(manual.New(1024))
 	}
 
 	// Benchmark allocating buffers of a different size class.