[DNM] storage: introduce SST snapshot strategy #25134

Conversation

nvanbenschoten
Member

Fixes #16954.
Related to #25047.

This depends on the following two upstream changes to RocksDB, both described below:

The change introduces a new snapshot strategy called "SST". This strategy
streams SST files consisting of all keys in a range from the sender to the
receiver. These SST files are then atomically ingested directly into RocksDB.
An important property of the strategy is that the amount of memory required
for a receiver using the strategy is constant with respect to the size of
a range, instead of linear as it is with the KV_BATCH strategy. This will
be critical for increasing the default range size and potentially for
increasing the number of concurrent snapshots allowed per node. The
strategy also seems to significantly speed up snapshots once ranges are
above a certain size (somewhere in the single-digit MBs).

This is a WIP change. Before it can be merged it needs:

  • to be cleaned up a bit
  • more testing (unit test, testing knobs, maybe some chaos)
  • proper version handling
  • heuristic tuning
  • decisions on questions like compactions after ingestion

facebook/rocksdb#3778 adds a DeleteRange method
to SstFileWriter and adds support for ingesting SSTs with range deletion
tombstones.

facebook/rocksdb#3779 adds a virtual Truncate method
to Env, which truncates the named file to the specified size.

Release note: None
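
For illustration, here is a minimal Go sketch of the receiver side of the strategy described above. The `engine` interface, its `IngestExternalFiles` method, and the one-file-per-chunk layout are stand-ins invented for the example; the real code reaches RocksDB's ingestion path through cgo.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
)

// engine is a stand-in for the storage engine; the real code reaches
// RocksDB's IngestExternalFile through cgo.
type engine interface {
	// IngestExternalFiles atomically links the given SST files into the LSM.
	IngestExternalFiles(paths []string) error
}

// receiveSSTSnapshot streams SST data to disk as it arrives and ingests the
// resulting files in one atomic step, so receiver memory stays constant with
// respect to range size.
func receiveSSTSnapshot(eng engine, dir string, chunks <-chan []byte) error {
	var paths []string
	i := 0
	for chunk := range chunks {
		path := filepath.Join(dir, fmt.Sprintf("snap-%d.sst", i))
		if err := os.WriteFile(path, chunk, 0644); err != nil {
			return err
		}
		paths = append(paths, path)
		i++
	}
	return eng.IngestExternalFiles(paths)
}

type fakeEngine struct{}

func (fakeEngine) IngestExternalFiles(paths []string) error {
	fmt.Println("ingesting", paths)
	return nil
}

func main() {
	dir, err := os.MkdirTemp("", "sst-snapshot")
	if err != nil {
		panic(err)
	}
	defer os.RemoveAll(dir)

	ch := make(chan []byte, 2)
	ch <- []byte("sst bytes 1")
	ch <- []byte("sst bytes 2")
	close(ch)
	if err := receiveSSTSnapshot(fakeEngine{}, dir, ch); err != nil {
		panic(err)
	}
}
```
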
This change adds the `Merge`, `Clear`, and `ClearRange` methods to
`RocksDBSstFileWriter`. In doing so, it makes the type implement the
`engine.Writer` interface.

Release note: None
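
As a rough sketch of what that interface implementation looks like, consider the following. The `Writer` interface and `MVCCKey` type here are simplified, hypothetical stand-ins for the real `engine` types, which have more methods and richer keys.

```go
package main

import "fmt"

// MVCCKey and Writer are simplified stand-ins for the engine types; the real
// interface has additional methods (e.g. Put) and richer key types.
type MVCCKey struct{ Key []byte }

type Writer interface {
	Merge(key MVCCKey, value []byte) error
	Clear(key MVCCKey) error
	ClearRange(start, end MVCCKey) error
}

// sstFileWriter stands in for RocksDBSstFileWriter; in the real code each
// method forwards through cgo to the C++ SstFileWriter.
type sstFileWriter struct {
	ops []string // placeholder for the SST being built
}

func (fw *sstFileWriter) Merge(key MVCCKey, value []byte) error {
	fw.ops = append(fw.ops, fmt.Sprintf("merge %s", key.Key))
	return nil
}

func (fw *sstFileWriter) Clear(key MVCCKey) error {
	fw.ops = append(fw.ops, fmt.Sprintf("delete %s", key.Key))
	return nil
}

func (fw *sstFileWriter) ClearRange(start, end MVCCKey) error {
	// Needs range-deletion tombstone support in SstFileWriter
	// (facebook/rocksdb#3778).
	fw.ops = append(fw.ops, fmt.Sprintf("delete-range [%s, %s)", start.Key, end.Key))
	return nil
}

// Compile-time check mirroring the point of the change: the SST writer can be
// used anywhere a Writer is expected.
var _ Writer = (*sstFileWriter)(nil)

func main() {
	var w Writer = &sstFileWriter{}
	_ = w.ClearRange(MVCCKey{Key: []byte("a")}, MVCCKey{Key: []byte("z")})
}
```
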
@nvanbenschoten nvanbenschoten requested review from a team April 27, 2018 22:12
@cockroach-teamcity
Member

This change is Reviewable

This change adds a `Truncate` method to `RocksDBSstFileWriter`. This
method truncates the current SST file and returns the data that was
deleted. This can be used to chunk the SST into pieces. Because SSTs
are built in an append-only manner, this is safe and the sum of the
resulting chunks is equivalent to an SST built without ever calling
`Truncate`.

Release note: None
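
A toy sketch of how `Truncate` enables chunking: the writer below only mimics the append-and-truncate semantics described above, it is not the real `RocksDBSstFileWriter`.

```go
package main

import "fmt"

// chunkingWriter mimics the described semantics: data is appended, and
// Truncate returns everything appended since the last call and resets the
// file, so concatenating all returned chunks reproduces the full SST.
type chunkingWriter struct {
	buf []byte
}

func (w *chunkingWriter) Add(kv []byte) {
	w.buf = append(w.buf, kv...)
}

func (w *chunkingWriter) Truncate() []byte {
	out := w.buf
	w.buf = nil
	return out
}

// streamInChunks flushes a chunk to the caller whenever the pending data
// crosses chunkSize, and a final chunk at the end.
func streamInChunks(kvs [][]byte, chunkSize int, send func([]byte)) {
	w := &chunkingWriter{}
	for _, kv := range kvs {
		w.Add(kv)
		if len(w.buf) >= chunkSize {
			send(w.Truncate())
		}
	}
	send(w.Truncate())
}

func main() {
	var rebuilt []byte
	kvs := [][]byte{[]byte("a=1"), []byte("b=2"), []byte("c=3")}
	streamInChunks(kvs, 4, func(chunk []byte) {
		rebuilt = append(rebuilt, chunk...)
	})
	fmt.Println(string(rebuilt)) // "a=1b=2c=3": the chunks add up to the whole SST
}
```
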
Instead of decomposing the results of a received snapshot and inserting
each separate piece of state into an `IncomingSnapshot`, we now store
this state directly on each `snapshotStrategy`. The `snapshotStrategy`
is then attached to an `IncomingSnapshot`. This makes more sense because
the state is specific to the snapshotStrategy, and different strategies
will receive different types of data in different representations.

Release note: None
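
A condensed sketch of the shape this refactor gives the receive path. The names and method set are abbreviated and hypothetical; the real strategies hold considerably more state.

```go
package main

// snapshotStrategy owns whatever representation of the received data it
// produced, instead of decomposing it into IncomingSnapshot fields.
type snapshotStrategy interface {
	Receive(chunk []byte) error
}

// kvBatchSnapshotStrategy keeps decoded KV batches in memory.
type kvBatchSnapshotStrategy struct {
	batches [][]byte
}

func (s *kvBatchSnapshotStrategy) Receive(chunk []byte) error {
	s.batches = append(s.batches, chunk)
	return nil
}

// sstSnapshotStrategy tracks the on-disk SSTs waiting to be ingested.
type sstSnapshotStrategy struct {
	sstPaths []string
}

func (s *sstSnapshotStrategy) Receive(chunk []byte) error {
	// The real code appends the chunk to an SST file on disk and records it.
	s.sstPaths = append(s.sstPaths, "snap.sst")
	return nil
}

// IncomingSnapshot carries the strategy, and with it the strategy-specific
// state, rather than a union of every strategy's fields.
type IncomingSnapshot struct {
	strategy snapshotStrategy
}

func main() {
	snap := IncomingSnapshot{strategy: &sstSnapshotStrategy{}}
	_ = snap.strategy.Receive([]byte("chunk"))
}
```
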
@nvanbenschoten nvanbenschoten force-pushed the nvanbenschoten/snapshotSST branch 2 times, most recently from 37ba952 to 2881a8f Compare April 27, 2018 22:58
Fixes cockroachdb#16954.
Related to cockroachdb#25047.

This depends on the following two upstream changes to RocksDB:
- facebook/rocksdb#3778
- facebook/rocksdb#3779

The change introduces a new snapshot strategy called "SST". This strategy
streams SST files consisting of all keys in a range from the sender to the
receiver. These SST files are then atomically ingested directly into RocksDB.
An important property of the strategy is that the amount of memory required
for a receiver using the strategy is constant with respect to the size of
a range, instead of linear as it is with the KV_BATCH strategy. This will
be critical for increasing the default range size and potentially for
increasing the number of concurrent snapshots allowed per node. The
strategy also seems to significantly speed up snapshots once ranges are
above a certain size (somewhere in the single-digit MBs).

This is a WIP change. Before it can be merged it needs:
- to be cleaned up a bit
- more testing (unit test, testing knobs, maybe some chaos)
- proper version handling
- heuristic tuning
- decisions on questions like compactions after ingestion

Release note: None
@bdarnell
Contributor

This generally looks good so far (although some of these methods are getting huge and ought to be split up). It's unfortunate that this is a large amount of new code instead of replacing the old code paths. I assume the reason we can't just use the SST strategy all the time is that the extra I/O from writing multiple SSTs adds up. Would it make sense to postprocess the received SSTs into either a single large SST or a KV batch (if the total size is small enough) so we could phase out the old KV batch paths?


Reviewed 1 of 1 files at r1, 9 of 13 files at r2, 4 of 4 files at r3, 10 of 10 files at r5.
Review status: all files reviewed at latest revision, all discussions resolved.


c-deps/libroach/include/libroach.h, line 367 at r5 (raw file):

// if the underlying RocksDB blocks have not been flushed. Close cannot have
// been called.
DBStatus DBSstFileWriterTruncate(DBSstFileWriter* fw, DBString* data);

s/Truncate/Flush/, if I'm understanding this correctly.


pkg/storage/replica_command.go, line 1878 at r5 (raw file):

	}

	// TODO we should introduce a heuristic here for which snapshot strategy

Add your name to the TODO.


pkg/storage/replica_command.go, line 1879 at r5 (raw file):

	// TODO we should introduce a heuristic here for which snapshot strategy
	// to use. An easy heuristic would be to use SST snapshots only when a

A heuristic like this risks being under-tested - I'm surprised that we have enough large snapshots in the test that this could make a difference. We may need to have some way to run all the tests with each strategy.
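For reference, the quoted TODO above contemplates a simple size cutoff. A toy sketch of such a heuristic follows; the threshold value and names are invented for illustration and are not what the PR ultimately proposes.

```go
package main

import "fmt"

// sstSnapshotMinSizeBytes is a made-up cutoff: use the SST strategy only once
// a range is large enough that streaming and ingesting SSTs pays off.
const sstSnapshotMinSizeBytes = 8 << 20 // 8 MiB, illustrative only

type snapshotStrategyKind int

const (
	kvBatchStrategy snapshotStrategyKind = iota
	sstStrategy
)

// chooseSnapshotStrategy picks a strategy based on the range's logical size.
func chooseSnapshotStrategy(rangeSizeBytes int64) snapshotStrategyKind {
	if rangeSizeBytes >= sstSnapshotMinSizeBytes {
		return sstStrategy
	}
	return kvBatchStrategy
}

func main() {
	fmt.Println(chooseSnapshotStrategy(1 << 20))  // small range: KV batch
	fmt.Println(chooseSnapshotStrategy(64 << 20)) // large range: SST
}
```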


pkg/storage/replica_raftstorage.go, line 872 at r5 (raw file):

		stats.commit = timeutil.Now()
	case *sstSnapshotStrategy:
		// TODO DURING REVIEW: do we need to worry about RaftTombstoneIncorrectLegacyKey here?

I hope not since everything should be migrated before this version is introduced, but @tschottdorf should confirm.


pkg/storage/replica_raftstorage.go, line 972 at r5 (raw file):

		}

		// TODO should we force a compaction? Only in the global keyspace? The

We should probably add a suggestion to the compaction queue. I think forcing a compaction is going to negate a lot of the benefit of the sst ingestion strategy.


Comments from Reviewable

@nvanbenschoten
Member Author

I assume the reason we can't just use the SST strategy all the time is that the extra I/O from writing multiple SSTs adds up.

That and because it would be slower and result in a lot of small SSTs, which doesn't sound like a good idea.

Would it make sense to postprocess the received SSTs into either a single large SST or a KV batch (if the total size is small enough) so we could phase out the old KV batch paths?

We could use a RocksDBSstFileReader to postprocess the received SSTs into a KV batch if the total size is small enough. I don't think that would be a great idea though, because it would still require us to write the SSTs to disk instead of buffering all KVs in memory if the snapshot is below a certain size.

I don't see the two snapshot strategies as a huge issue for a few reasons. The first is that we're still going to need to support the KV batch approach in the short-term because of mixed version clusters. We also know that the KV batch approach is stable, so keeping it doesn't seem like a very big liability. My biggest concern is testing, which as you mention will need to be comprehensive for both approaches. We could have a testing hook that forces senders to use both strategies and asserts that they both result in the same exact resulting range.


Review status: all files reviewed at latest revision, 5 unresolved discussions, some commit checks failed.


c-deps/libroach/include/libroach.h, line 367 at r5 (raw file):

Previously, bdarnell (Ben Darnell) wrote…

s/Truncate/Flush/, if I'm understanding this correctly.

It actually doesn't flush (since there aren't good hooks into the underlying BlockBasedTableBuilder), it only truncates what's already been flushed to the inMem env. This is discussed further in TestRocksDBSstFileWriterTruncate.


pkg/storage/replica_command.go, line 1878 at r5 (raw file):

Previously, bdarnell (Ben Darnell) wrote…

Add your name to the TODO.

This was a TODO for this PR. I'm going to revisit it when all prerequisite RocksDB changes have been merged.


pkg/storage/replica_raftstorage.go, line 872 at r5 (raw file):

Previously, bdarnell (Ben Darnell) wrote…

I hope not since everything should be migrated before this version is introduced, but @tschottdorf should confirm.

That was my understanding as well, but I was hoping for confirmation.


pkg/storage/replica_raftstorage.go, line 972 at r5 (raw file):

Previously, bdarnell (Ben Darnell) wrote…

We should probably add a suggestion to the compaction queue. I think forcing a compaction is going to negate a lot of the benefit of the sst ingestion strategy.

Adding a suggestion to the compaction queue sounds like the right approach to me.
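
For concreteness, here is a hedged sketch of what suggesting (rather than forcing) a compaction after ingestion could look like. The `compactionQueue` and `suggestion` types are illustrative stand-ins, not CockroachDB's actual compactor API.

```go
package main

import (
	"fmt"
	"time"
)

// suggestion records a span that may benefit from compaction after SST
// ingestion; a background worker decides later whether to act on it.
type suggestion struct {
	startKey, endKey []byte
	bytes            int64
	suggestedAt      time.Time
}

type compactionQueue struct {
	pending []suggestion
}

func (q *compactionQueue) suggest(s suggestion) {
	q.pending = append(q.pending, s)
}

// afterIngest is called once the snapshot SSTs have been ingested. It does
// not force a compaction (which would negate much of the benefit of
// ingestion); it only leaves a hint for the compaction queue.
func afterIngest(q *compactionQueue, start, end []byte, ingestedBytes int64) {
	q.suggest(suggestion{
		startKey:    start,
		endKey:      end,
		bytes:       ingestedBytes,
		suggestedAt: time.Now(),
	})
}

func main() {
	q := &compactionQueue{}
	afterIngest(q, []byte("/range/start"), []byte("/range/end"), 64<<20)
	fmt.Println(len(q.pending), "compaction suggestion(s) queued")
}
```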


Comments from Reviewable

@bdarnell
Contributor

We could use a RocksDBSstFileReader to postprocess the received SSTs into a KV batch if the total size is small enough. I don't think that would be a great idea though, because it would still require us to write the SSTs to disk instead of buffering all KVs in memory if the snapshot is below a certain size.

This is fixable if we want to go in this direction. We could either make a custom rocksdb::Env that supports ingesting from an in-memory buffer or (perhaps simpler) just decode the SST directly instead of going through all the RocksDB ingestion machinery.

@nvanbenschoten
Member Author

This is fixable if we want to go in this direction.

I'm not convinced that we do, because I think that direction would add a lot of new logic in an effort to remove stable old logic. Sure, doing this would allow us to remove one of the two strategies on the wire (once we no longer need to worry about compatibility), but it would still require two different code paths to apply the snapshot on the receiver's side, which is the part that changes more often and is harder to test. All things considered, and once the migration period finishes in 2.2, I think we would just be jumping through extra hoops to send SSTs over the wire instead of KVs in the cases where we already know we want to write KVs in WriteBatches on the receiver anyway.


Review status: all files reviewed at latest revision, 5 unresolved discussions, some commit checks failed.


Comments from Reviewable

@benesch
Contributor

benesch commented Jun 3, 2018

Are we planning to open source the AddSSTable command or will this be CCL binaries/enterprise licenses only?

@benesch
Contributor

benesch commented Jun 3, 2018

Oh, never mind. I guess this doesn't actually use AddSSTable.

@nvanbenschoten
Member Author

#32931 just merged, which includes the RocksDB changes necessary for this change. @tbg what do you think we should do with this PR? Do you have any desire to see it adopted in the 2.2 release cycle?

tbg added a commit to tbg/cockroach that referenced this pull request Apr 5, 2019
Preemptive snapshots are sent to a Store (by another Store) as part of
the process of adding a new Replica to a Range. The sequence of events
is:

- send a preemptive snapshot (replicaID=0) to the target
- target creates a Replica from the preemptive snapshot (replicaID=0)
- allocate new replicaID and add the target officially under that
replicaID
- success (replicaID=nonzero)

They are problematic for a variety of reasons:

1. they introduce a Replica state, namely that of Replicas that have
   data but don't have a replicaID. Such replicas can't serve traffic
   and can't even have an initialized Raft group, so they're barely
   Replicas at all. Every bit of code in Replica needs to know about
   that.
2. the above state is implemented in an ad-hoc fashion and adds
   significantly to the complexity of the Store/Replica codebase.
3. Preemptive snapshots are subject to accidental garbage collection.
   There's currently no mechanism to decide whether a preemptive
   snapshot is simply waiting to be upgraded or whether it's abandoned.
   Accidental deletion causes another snapshot (this time Raft) to be
   sent.
4. Adding to 1., there are transitions between regular Replicas and
   preemptive snapshots that add additional complexity. For example,
   a regular snapshot can apply on top of a preemptive snapshot and
   vice versa. We try to prevent some of them but there are technical
   problems.
5. Preemptive snapshots have a range descriptor that doesn't include
   the Replica created from them. This is another gotcha that code
   needs to be aware of. (we cannot fix this in the first iteration,
   but it will be fixed when [learner replicas] are standard)

Our answer to all but the last of these problems is that we want to
remove the concept of preemptive snapshots altogether and instead rely
on [learner replicas]. This is a Raft concept denoting essentially a
member of a replication group without a vote. By replacing the
preemptive snapshot with the addition of a learner replica (before
upgrading to a full voting member), preemptive snapshots are replaced by
full replicas with a flag set.

However, as is often the case, the interesting question becomes that
of the migration, or, the possibility of running a mixed version
cluster in which one node knows about these changes and another
doesn't. The basic requirement that falls out of this is that we
have to be able to send preemptive snapshots to followers even
using the new code, and we have to be able to receive preemptive
snapshots using the new code (though that code will go cold once
the cluster setting upgrade has happened).

Fortunately, sending and receiving preemptive snapshots is not what
makes them terrible. In fact, the code that creates and receives
preemptive snapshots is 100% shared with that for Raft snapshots. The
complexity surrounding preemptive snapshots comes from what happens when
they are used to create a Replica object too early, but this is an
implementation detail not visible across RPC boundaries.

This suggests investigating how we can receive preemptive snapshots
without actually using any of the internal code that handles them, so
that this code can be removed in 19.2. The basic idea is that we will
write the preemptive snapshot to a temporary location (instead of
creating a Replica from it), and apply it as a Raft snapshot the moment
we observe a local Replica for the matching RangeID created as a full
member of the Raft group (i.e. with nonzero replicaID).

This is carried out in this PR. Preemptive snapshots are put into a
temporary in-memory map the size of which we aggressively keep under
control (and which is cleared out periodically). Replica objects with
replicaID zero are no longer instantiated.

See the companion POC [learner replicas] which doesn't bother about the
migration but explores actually using learner replicas. When learner
replicas are standard, 5. above is also mostly addressed: the replica
will always be contained in its range descriptor, even though it may be
as a learner.

TODO(tbg): preemptive snapshots stored on disk before this PR need to
be deleted before we instantiate a Replica from them (because after
this PR that will fail).

[learner replicas]: cockroachdb#35787
[SST snapshots]: cockroachdb#25134

Release note: None
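
A minimal sketch of the temporary in-memory map described in the commit message above, with invented names and a crude size cap; the real change also handles periodic clearing and the actual Raft application.

```go
package main

import "sync"

// RangeID stands in for roachpb.RangeID.
type RangeID int64

// preemptiveSnapshotCache holds received preemptive snapshots until a Replica
// with a nonzero replicaID exists for the RangeID, at which point the data is
// applied as a Raft snapshot. Its size is kept aggressively bounded.
type preemptiveSnapshotCache struct {
	mu       sync.Mutex
	maxBytes int64
	curBytes int64
	snaps    map[RangeID][]byte
}

func newPreemptiveSnapshotCache(maxBytes int64) *preemptiveSnapshotCache {
	return &preemptiveSnapshotCache{maxBytes: maxBytes, snaps: map[RangeID][]byte{}}
}

// Store keeps the snapshot if it fits under the cap; otherwise it is dropped
// and a Raft snapshot will have to be sent later.
func (c *preemptiveSnapshotCache) Store(id RangeID, data []byte) bool {
	c.mu.Lock()
	defer c.mu.Unlock()
	if c.curBytes+int64(len(data)) > c.maxBytes {
		return false
	}
	c.snaps[id] = data
	c.curBytes += int64(len(data))
	return true
}

// Take removes and returns the snapshot once the full member Replica appears.
func (c *preemptiveSnapshotCache) Take(id RangeID) ([]byte, bool) {
	c.mu.Lock()
	defer c.mu.Unlock()
	data, ok := c.snaps[id]
	if ok {
		delete(c.snaps, id)
		c.curBytes -= int64(len(data))
	}
	return data, ok
}

func main() {
	cache := newPreemptiveSnapshotCache(1 << 20)
	cache.Store(RangeID(7), []byte("preemptive snapshot bytes"))
	if data, ok := cache.Take(RangeID(7)); ok {
		_ = data // apply as a Raft snapshot now that replicaID != 0 is known
	}
}
```
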
@tbg tbg added the X-noremind Bots won't notify about PRs with X-noremind label Jun 19, 2019
craig bot pushed a commit that referenced this pull request Aug 9, 2019
38932: storage: build SSTs from KV_BATCH snapshot r=jeffrey-xiao a=jeffrey-xiao

Implements the SST snapshot strategy discussed in #16954 and partially implemented in #25134 and #38873, but with the logic only on the receiver side for ease of testing and compatibility. This PR also handles the complications of subsumed replicas that are not fully contained by the current replica.

The maximum number of SSTs created using this strategy is 4 + SR + 2 where SR is the number of subsumed replicas.

- Three SSTs get streamed from the sender (range local keys, replicated range-id local keys, and data keys)
- One SST is constructed for the unreplicated range-id local keys.
- One SST is constructed for every subsumed replica to clear the range-id local keys. These SSTs consist of one range deletion tombstone and one `RaftTombstone` key.
- A maximum of two SSTs for all subsumed replicas to account for the case of subsumed replicas that are not fully contained. Note that currently, subsumed replicas can have keys to the right of the current replica, but not to the left, so there will be a maximum of one SST created for the range-local keys and one for the data keys. These SSTs consist of one range deletion tombstone.

This number can be further reduced to 3 + SR if we pass the file handles and SST writers from the receiving step to the application step. We could combine the SSTs for the unreplicated and replicated range-id local keys, and the range-local and data SSTs of the subsumed replicas. We probably don't want to do this optimization, since we'd have to undo it if we start constructing the SSTs on the sender or start chunking large SSTs into smaller SSTs.

Blocked by facebook/rocksdb#5649.

# Test Plan

- [x] Testing knob to inspect SSTs before ingestion. Ensure that expected SSTs for subsumed replicas are ingested.
- [x] Unit tests for `SSTSnapshotStorage`.
 
# Metrics and Evaluation

One way to evaluate this change is the following steps:

1. Setup 3 node cluster
2. Set default Raft log truncation threshold to some low constant:
```go
defaultRaftLogTruncationThreshold = envutil.EnvOrDefaultInt64(
    "COCKROACH_RAFT_LOG_TRUNCATION_THRESHOLD", 128<<10 /* 128 KB */)
```
3. Set `range_min_bytes` to 0 and `range_max_bytes` to some large number.
4. Increase `kv.snapshot_recovery.max_rate` and `kv.snapshot_rebalance.max_rate` to some large number.
5. Disable load-based splitting.
6. Stop node 2.
7. Run an insert heavy workload (kv0) on the cluster.
8. Start node 2.
9. Time how long it takes for node 2 to have all the ranges.

Roachtest: https://gist.github.com/jeffrey-xiao/e69fcad04968822d603f6807ca77ef3b

We can have two independent variables

1. Fixed total data size (4000000 ops; ~3.81 GiB), variable number of splits
- 32 splits (~121 MiB ranges)
- 64 splits (~61.0 MiB ranges)
- 128 splits (~31.2 MiB ranges)
- 256 splits (~15.7 MiB ranges)
- 512 splits (~7.9 MiB ranges)
- 1024 splits (~3.9 MiB ranges)
2. Fixed number of splits (32), variable total data size
- 125000 (~ 3.7 MiB ranges)
- 250000 (~7.5 MiB ranges)
- 500000 (~15 MiB ranges)
- 1000000 (~30 MiB ranges)
- 2000000 (60 MiB ranges)
- 4000000 (121 MiB ranges)

# Fsync Chunk Size

The size of the SST chunk that we write before fsync-ing impacts how quickly node 2 ends up with all the ranges. I experimented with 32 splits and a median range size of 121 MB: no fsync-ing (~27s recovery), fsync-ing in 8 MB chunks (~30s recovery), fsync-ing in 2 MB chunks (~40s recovery), and fsync-ing in 256 KB chunks (~42s recovery). The default bulk SST sync rate is 2 MB and #20352 sets `bytes_per_sync` to 512 KB, so something between those options is probably good. The reason we would want to fsync is to prevent the OS from accumulating such a large buffer that it blocks unrelated small/fast writes for a long time when it flushes.
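
A hedged sketch of the fsync-in-chunks idea in Go; the 2 MB constant and helper names are illustrative, not the setting this PR ends up using.

```go
package main

import (
	"log"
	"os"
)

// syncChunkSize bounds how much dirty data accumulates before an fsync; 2 MB
// sits in the middle of the range measured above.
const syncChunkSize = 2 << 20

// writeSSTChunks writes incoming SST data and fsyncs every syncChunkSize
// bytes so the OS never has to flush a huge buffer at once, which would stall
// unrelated small writes.
func writeSSTChunks(path string, chunks <-chan []byte) error {
	f, err := os.Create(path)
	if err != nil {
		return err
	}
	defer f.Close()

	unsynced := 0
	for chunk := range chunks {
		if _, err := f.Write(chunk); err != nil {
			return err
		}
		unsynced += len(chunk)
		if unsynced >= syncChunkSize {
			if err := f.Sync(); err != nil {
				return err
			}
			unsynced = 0
		}
	}
	return f.Sync() // final sync before the SST is ingested
}

func main() {
	tmp, err := os.CreateTemp("", "snap-*.sst")
	if err != nil {
		log.Fatal(err)
	}
	tmp.Close()
	defer os.Remove(tmp.Name())

	ch := make(chan []byte, 1)
	ch <- []byte("sst bytes")
	close(ch)
	if err := writeSSTChunks(tmp.Name(), ch); err != nil {
		log.Fatal(err)
	}
}
```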

# Impact on Foreground Traffic

For testing the impact on foreground traffic, I ran kv0 on a four node cluster with the merge queue and split queue disabled and starting with a constant number of splits. After 5 minutes, I decommissioned node 1 so its replicas would drain to other nodes using snapshots.

Roachtest: https://gist.github.com/jeffrey-xiao/5d9443a37b0929884aca927f9c320b6c

**Average Range Size of 3 MiB**
- [Before](https://user-images.githubusercontent.com/8853434/62398633-41a2bb00-b547-11e9-9e3d-747ee724943b.png)
- [After](https://user-images.githubusercontent.com/8853434/62398634-41a2bb00-b547-11e9-85e7-445b7989d173.png)

**Average Range Size of 32 MiB**
- [Before](https://user-images.githubusercontent.com/8853434/62398631-410a2480-b547-11e9-9019-86d3bd2e6f73.png)
- [After](https://user-images.githubusercontent.com/8853434/62398632-410a2480-b547-11e9-9513-8763e132e76b.png)

**Average Range Size 128 MiB**
- [Before](https://user-images.githubusercontent.com/8853434/62398558-15873a00-b547-11e9-8ab6-2e8e9bae658c.png)
- [After](https://user-images.githubusercontent.com/8853434/62398559-15873a00-b547-11e9-9c72-b3e90fce1acc.png)

We see p99 latency wins for larger range sizes and comparable performance for smaller range sizes.

Release note (performance improvement): Snapshots sent between replicas are now applied more performantly and use less memory.

Co-authored-by: Jeffrey Xiao <jeffrey.xiao1998@gmail.com>
@nvanbenschoten nvanbenschoten deleted the nvanbenschoten/snapshotSST branch October 19, 2020 20:10
nvanbenschoten added a commit to nvanbenschoten/cockroach that referenced this pull request Apr 2, 2021
This was the original intention with the `SnapshotRequest_Strategy`
structure and was prototyped in cockroachdb#25134, but we never pushed it over
the finish line because we did not find cases where SST ingestion
was disruptive enough to warrant the extra complexity. cockroachdb#62700 tells
a different story.
Labels
X-noremind Bots won't notify about PRs with X-noremind
Development

Successfully merging this pull request may close these issues.

storage: use IngestExternalFile instead of WriteBatches for applying snapshots