
Support range deletion tombstones in IngestExternalFile SSTs #3778

Closed
wants to merge 3 commits

Conversation

nvanbenschoten
Contributor

Fixes #3391.

This change adds a DeleteRange method to SstFileWriter and adds
support for ingesting SSTs with range deletion tombstones. This is
important for applications that need to atomically ingest SSTs while
clearing out any existing keys in a given key range.
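For reference, usage of the new API looks roughly like this (a sketch against the RocksDB headers; the path, keys, and elided error handling are illustrative, not taken from this PR's tests):

```cpp
#include <rocksdb/db.h>
#include <rocksdb/sst_file_writer.h>

// Sketch only: statuses should be checked in real code.
void BuildAndIngest(rocksdb::DB* db, const rocksdb::Options& options) {
  rocksdb::SstFileWriter writer(rocksdb::EnvOptions(), options);
  writer.Open("/tmp/ingest.sst");

  // New in this PR: a range tombstone inside the SST clears any
  // existing keys in ["k1", "k9") atomically with the ingestion itself.
  writer.DeleteRange("k1", "k9");
  writer.Put("k1", "v1");
  writer.Put("k2", "v2");
  writer.Finish();

  rocksdb::IngestExternalFileOptions ifo;
  db->IngestExternalFile({"/tmp/ingest.sst"}, ifo);
}
```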

nvanbenschoten added a commit to nvanbenschoten/cockroach that referenced this pull request Apr 27, 2018
facebook/rocksdb#3778 adds a DeleteRange method
to SstFileWriter and adds support for ingesting SSTs with range deletion
tombstones.

facebook/rocksdb#3779 adds a virtual Truncate method
to Env, which truncates the named file to the specified size.

Release note: None
nvanbenschoten added a commit to nvanbenschoten/cockroach that referenced this pull request Apr 27, 2018
Fixes cockroachdb#16954.
Related to cockroachdb#25047.

This depends on the following two upstream changes to RocksDB:
- facebook/rocksdb#3778
- facebook/rocksdb#3779

The change introduces a new snapshot strategy called "SST". This strategy
streams SST files consisting of all keys in a range from the sender to the
receiver. These SST files are then atomically ingested directly into RocksDB.
An important property of the strategy is that the amount of memory required
for a receiver using the strategy is constant with respect to the size of
a range, instead of linear as it is with the KV_BATCH strategy. This will
be critical for increasing the default range size and potentially for
increasing the number of concurrent snapshots allowed per node. The
strategy also seems to significantly speed up snapshots once ranges are
above a certain size (somewhere in the single digit MBs).

This is a WIP change. Before it can be merged it needs:
- to be cleaned up a bit
- more testing (unit test, testing knobs, maybe some chaos)
- proper version handling
- heuristic tuning
- decisions on questions like compactions after ingestion

Release note: None
@anand1976 anand1976 self-requested a review April 30, 2018 18:35
@anand1976 anand1976 self-assigned this Apr 30, 2018
@nvanbenschoten
Contributor Author

@anand1976 let me know if you need me to explain anything.

As discussed in the commit message, the second commit is needed to get ExternalSSTFileBasicTest.IngestFileWithMixedValueType working. It implies to me that there's a bug in RangeDelAggregator's pinning of memory when an SST has multiple range deletion tombstones. Looking at the loop in RangeDelAggregator::AddTombstones, this seems possible because of the argument I laid out in #3391 (comment). However, it's possible that my understanding is incorrect.

@@ -166,6 +164,7 @@ class RangeDelAggregator {
struct Rep {
StripeMap stripe_map_;
PinnedIteratorsManager pinned_iters_mgr_;
std::list<std::string> pinned_slices_;
Contributor


After debugging I believe the corruption can only happen if UpdateInternalKey is called or if block_restart_intervals != 1, as either of those prevent the map from holding pointers to the full keys inside the range deletion meta-block (which the PinnedIteratorsManager should be capable of pinning). UpdateInternalKey is only called for range deletion keys as of this PR, and we always set block_restart_intervals = 1. I wonder if there's a way to prevent the file ingestion path from using UpdateInternalKey, rather than maintaining this list for all range deletion keys (even though it's only needed for ones added through the file ingestion path).

Contributor Author


Thanks for looking into this and thanks for adding an assertion to guard against this in 7b65521.

As you brought up, the file ingestion path uses UpdateInternalKey because it creates blocks with global sequence numbers. This is used to avoid needing to update every single key in an SST when a sequence number is computed during ingestion time. I see two general approaches to addressing this issue, but I'm not confident about which one would be the better approach:

The first is that we look into addressing the following TODO:

rocksdb/table/block.cc

Lines 256 to 257 in aa53579

// TODO(tec): Investigate updating the seqno in the loaded block
// directly instead of doing a copy and update.

If we updated the sequence number in the loaded block directly in BlockIter::ParseNextKey when we see a global sequence number then we wouldn't ever need to un-pin the keys. This would allow us to avoid needing to worry about making any changes to RangeDelAggregator. However, there are a few downsides here, including that this could alter performance characteristics outside of this change. I also don't feel comfortable making a change like that without some guidance because I expect that it could have far-reaching consequences.
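For context on why an in-place update is plausible at all: RocksDB internal keys append an 8-byte trailer packing (seqno << 8) | type after the user key, so rewriting the sequence number is a fixed-width overwrite that leaves the user-key bytes untouched. A minimal stdlib-only sketch (illustrative names, not RocksDB's actual code):

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// Internal-key layout sketch: user_key | 8-byte little-endian trailer,
// where trailer = (seqno << 8) | value_type. Names are hypothetical.
static void PutFixed64LE(std::string* dst, uint64_t v) {
  for (int i = 0; i < 8; i++) {
    dst->push_back(static_cast<char>((v >> (8 * i)) & 0xff));
  }
}

std::string MakeInternalKey(const std::string& user_key,
                            uint64_t seqno, uint8_t type) {
  std::string ikey = user_key;
  PutFixed64LE(&ikey, (seqno << 8) | type);
  return ikey;
}

uint64_t GetSeqno(const std::string& ikey) {
  const char* tail = ikey.data() + ikey.size() - 8;
  uint64_t packed = 0;
  for (int i = 0; i < 8; i++) {
    packed |= static_cast<uint64_t>(static_cast<unsigned char>(tail[i]))
              << (8 * i);
  }
  return packed >> 8;
}

// Updating the seqno in place is a fixed-width overwrite of the last
// 8 bytes -- the user-key prefix is untouched, so a Slice pointing at
// the start of the key would remain valid.
void UpdateSeqnoInPlace(std::string* ikey, uint64_t new_seqno) {
  char* tail = &(*ikey)[ikey->size() - 8];
  uint8_t type = static_cast<uint8_t>(tail[0]);  // low byte holds the type
  uint64_t packed = (new_seqno << 8) | type;
  for (int i = 0; i < 8; i++) {
    tail[i] = static_cast<char>((packed >> (8 * i)) & 0xff);
  }
}
```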

The other approach, mentioned by @anand1976 below, is that we allow unpinned keys in RangeDelAggregator::AddTombstones but we only perform the copy if a key is unpinned. This has a few nice properties, including that it is a well contained change, it shouldn't have any unexpected consequences or performance implications, and it adds a bit more flexibility to RangeDelAggregator in case another future user of the class provides an iterator with unpinned keys. This is the direction I'm leaning towards. Please let me know if you agree. Also, if so, could you give me a pointer about how this kind of memory pinning usually takes place in this code base? The std::list was just for prototyping, but it certainly isn't the best way to do this kind of thing. Is there some kind of arena object that I should be copying the keys into?

Contributor Author


Friendly ping. Hoping to make progress on this sometime this week.

Contributor


In approach #1, it looks like there is no guarantee that the sequence number can be updated in the loaded block directly, since the key may be sharing bytes with the previous key and would have to be copied anyway. So I suggest going with approach #2. @ajkr Do you agree?

As far as pinning is concerned, maybe rep_->pinned_slices_.emplace_back(input->key().data(), input->key().size()) would be more efficient. Also, pinned_slices_ could be std::vector or std::deque instead of std::list to avoid repeated allocations.
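The copy-only-when-unpinned idea, together with the emplace_back suggestion, can be sketched as follows (stdlib-only; Slice and the surrounding names are stand-ins, not the real RocksDB types):

```cpp
#include <cassert>
#include <cstddef>
#include <deque>
#include <string>

// Minimal stand-in for rocksdb::Slice (illustrative only).
struct Slice {
  const char* data_;
  size_t size_;
};

// Owns copies of keys whose backing memory the iterator does not pin.
// std::deque never relocates existing elements on growth, so Slices
// handed out below stay valid as more keys are appended.
struct PinnedSlices {
  std::deque<std::string> owned_;

  // Copy only when the source is not pinned; otherwise reference the
  // iterator-backed memory directly.
  Slice Materialize(const Slice& key, bool is_pinned) {
    if (is_pinned) {
      return key;  // backing block outlives us; no copy needed
    }
    // emplace_back(data, size) constructs the string in place, avoiding
    // the temporary that push_back(key.ToString()) would create.
    owned_.emplace_back(key.data_, key.size_);
    return Slice{owned_.back().data(), owned_.back().size()};
  }
};
```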

Contributor Author


I made this change as suggested. I also had to do the same thing for input->value() when !input->IsValuePinned(). This is a new case which seems to have been introduced/revealed by fea2b1d.

Contributor


Sorry for the delay. Yeah, approach 2 sounds safer to me too. And if we decide to do approach 1 later, you'll get the benefit of avoiding the copy without having to change anything.

I don't have a concrete idea of a problem for approach 1. Regarding shared bytes, we don't delta encode keys in range deletion meta-blocks so it shouldn't be a problem.
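To illustrate why delta encoding is what makes pinning unsafe: a delta-encoded key only exists as a (shared, suffix) pair and must be rebuilt by copying, while with no sharing each key sits whole in the block and can be referenced in place. A stdlib-only sketch (hypothetical names, not RocksDB's block format code):

```cpp
#include <algorithm>
#include <cassert>
#include <string>
#include <vector>

// Block-style prefix compression sketch: each entry stores the number of
// bytes shared with the previous key plus the non-shared suffix. With
// restart_interval == 1, or in range-deletion meta-blocks (which skip
// delta encoding), shared == 0 for every entry.
struct Entry {
  size_t shared;       // bytes shared with the previous key
  std::string suffix;  // the non-shared tail actually stored
};

std::vector<Entry> Encode(const std::vector<std::string>& keys,
                          bool delta_encode) {
  std::vector<Entry> out;
  std::string prev;
  for (const auto& k : keys) {
    size_t shared = 0;
    if (delta_encode) {
      size_t n = std::min(prev.size(), k.size());
      while (shared < n && prev[shared] == k[shared]) shared++;
    }
    out.push_back({shared, k.substr(shared)});
    prev = k;
  }
  return out;
}

// Decoding a delta-encoded entry must rebuild the key from the previous
// one -- the full key never exists contiguously in the block, so a
// reader cannot simply pin a pointer to it.
std::vector<std::string> Decode(const std::vector<Entry>& entries) {
  std::vector<std::string> keys;
  std::string prev;
  for (const auto& e : entries) {
    std::string k = prev.substr(0, e.shared) + e.suffix;
    keys.push_back(k);
    prev = k;
  }
  return keys;
}
```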

Contributor


Good catch on the recent IsValuePinned() change.

@@ -204,7 +204,8 @@ Status RangeDelAggregator::AddTombstones(
first_iter = false;
}
ParsedInternalKey parsed_key;
if (!ParseInternalKey(input->key(), &parsed_key)) {
rep_->pinned_slices_.push_back(input->key().ToString());
Contributor


Following up on comment by @ajkr below, maybe we should do this only if input->IsKeyPinned() is false.

Contributor Author


I think this is the best option too, but I provided an alternative in my comment above.

// TODO DURING REVIEW: without this, ingested sstables without any keys
// other than range deletions hit an assertion in BlockPrefixIndex::Builder.
// I need to figure out why this is.
if (num_blocks == 0) {
Contributor


Is a prefix meta block created even if there are no keys, only range deletions? That doesn't seem right.

Contributor Author


It looks like it is. I was hoping that this was a result of #3906, but the block is definitely created, even when there are no keys. I don't actually see any logic that would prevent that. The meta block is written in HashIndexBuilder::Finish, which is called from BlockBasedTableBuilder::Finish regardless of whether empty_data_block is true or not. Is it possible that this code path has just never been hit before, even though SSTs with only range deletions were already supported?

Contributor Author


Is it possible that this code path has just never been hit before, even though SSTs with only range deletions were already supported?

Actually, I think this might be exactly what's going on. See #3774, which is the same assertion I hit in my test without this patch.

Contributor Author


A recent change to testing code created a new case that this change hadn't been tested against before - building a table with only range deletion tombstones and no normal keys while using a two level index (PartitionedIndexBuilder). This is hitting this assertion:

assert(!entries_.empty());

This is making me question whether sstables with only range deletion tombstones and no normal keys are actually supported, and if so, how. I'd appreciate some help getting more context on this situation and how we expect it to behave with BlockBasedTableBuilder::Finish. That method currently avoids writing its filter block if the data block is empty, but it writes its index block unconditionally.

@nvanbenschoten
Contributor Author

I've rebased this on top of #4016 and #4021. Once those go in, I think this is ready to merge as well. Thanks for all the help @ajkr and @anand1976!

benesch added a commit to benesch/rocksdb that referenced this pull request Jun 20, 2018
Add a new table property, rocksdb.num.range-deletions, which tracks the
number of range deletions in a block-based table. Range deletions are no
longer counted in rocksdb.num.entries; as discovered in PR facebook#3778, there
are various code paths that implicitly assume that rocksdb.num.entries
counts only true keys, not range deletions.
facebook-github-bot pushed a commit that referenced this pull request Jun 27, 2018
Summary:
Add a new table property, rocksdb.num.range-deletions, which tracks the
number of range deletions in a block-based table. Range deletions are no
longer counted in rocksdb.num.entries; as discovered in PR #3778, there
are various code paths that implicitly assume that rocksdb.num.entries
counts only true keys, not range deletions.

/cc ajkr nvanbenschoten
Closes #4016

Differential Revision: D8527575

Pulled By: ajkr

fbshipit-source-id: 92e7edbe78fda53756a558013c9fb496e7764fd7
nvanbenschoten added a commit to nvanbenschoten/rocksdb that referenced this pull request Jul 10, 2018
Upstream PR: facebook#3778.

nvanbenschoten pushed a commit to nvanbenschoten/rocksdb that referenced this pull request Jul 11, 2018
@facebook-github-bot
Contributor

@nvanbenschoten has updated the pull request. Re-import the pull request

Contributor

@facebook-github-bot facebook-github-bot left a comment


@anand1976 has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

@anand1976
Contributor

Nice. The memory leak is gone now.
Travis CI is failing in the test ExternalSSTFileBasicTest.IngestFileWithMultipleValueType though.

petermattis pushed a commit to petermattis/rocksdb that referenced this pull request Jul 13, 2018
@facebook-github-bot
Contributor

@nvanbenschoten has updated the pull request. Re-import the pull request

@nvanbenschoten
Contributor Author

Yeah I made what looked like a harmless change and it ended up causing IngestFileWithMultipleValueType to fail, but only on Mac. With that fixed, all tests passed in CI except for DBTest.GroupCommitTest, but the failure seems completely unrelated to my change and it passes for me on my Mac laptop and on a Linux machine. Perhaps it was a flake? Could we re-run the CI?

@nvanbenschoten
Contributor Author

Yeah this flake seems to be hitting other PRs as well, like #4132.

@nvanbenschoten
Contributor Author

Looks like all tests are passing now!

@anand1976
Contributor

Yeah, it finally passed after a few re-runs and different failures. I think we're ready to land this now!

Contributor

@facebook-github-bot facebook-github-bot left a comment


@anand1976 is landing this pull request. If you are a Facebook employee, you can view this diff on Phabricator.


@anand1976
Contributor

@nvanbenschoten The land is still failing, though the failure is rather cryptic. I think it might be due to conflicts with #4014 which landed yesterday. Can you rebase one more time?

Contributor

@facebook-github-bot facebook-github-bot left a comment


@anand1976 is landing this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

@anand1976
Contributor

Trying to land again using a suggestion from @ajkr to use a slightly different workflow.

@anand1976
Contributor

@nvanbenschoten My initial guess was right. There is a merge conflict in db/range_del_aggregator.cc. Can you rebase and resolve?

Fixes facebook#3391.

This change adds a `DeleteRange` method to `SstFileWriter` and adds
support for ingesting SSTs with range deletion tombstones. This is
important for applications that need to atomically ingest SSTs while
clearing out any existing keys in a given key range.
@nvanbenschoten
Contributor Author

@anand1976 thanks for figuring that out. Rebased and resolved the conflict.

@facebook-github-bot
Contributor

@nvanbenschoten has updated the pull request. Re-import the pull request

Contributor

@facebook-github-bot facebook-github-bot left a comment


@anand1976 is landing this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

rcane pushed a commit to rcane/rocksdb that referenced this pull request Sep 13, 2018
rcane pushed a commit to rcane/rocksdb that referenced this pull request Sep 13, 2018

Summary:
Fixes facebook#3391.

This change adds a `DeleteRange` method to `SstFileWriter` and adds
support for ingesting SSTs with range deletion tombstones. This is
important for applications that need to atomically ingest SSTs while
clearing out any existing keys in a given key range.
Pull Request resolved: facebook#3778

Differential Revision: D8821836

Pulled By: anand1976

fbshipit-source-id: ca7786c1947ff129afa703dab011d524c7883844