Support pro-actively erasing obsolete block cache entries
Summary: Currently, when files become obsolete, the block cache entries
associated with them just age out naturally. With pure LRU, this is not
too bad, as once you "use" enough cache entries to (re-)fill the cache,
you are guaranteed to have purged the obsolete entries. However,
HyperClockCache is a counting clock cache with a somewhat longer memory,
so could be more negatively impacted by previously-hot cache entries
becoming obsolete, and taking longer to age out than newer single-hit
entries.

Part of the reason we still have this natural aging-out is that there's
almost no connection between block cache entries and the file they are
associated with. Everything is hashed into the same pool(s) of entries
with nothing like a secondary index based on file. Keeping track of such
an index could be expensive.

This change adds a new, mutable CF option `uncache_aggressiveness` for
erasing obsolete block cache entries. The process can be speculative,
lossy, or unproductive because not all potential block cache entries
associated with files will be resident in memory, and attempting to
remove them all could be wasted CPU time. Rather than a simple on/off
switch, `uncache_aggressiveness` basically tells RocksDB how much CPU
you're willing to burn trying to purge obsolete block cache entries.
When such efforts are not sufficiently productive for a file, we stop
and move on.

The option is in ColumnFamilyOptions so that it is dynamically
changeable for already-open files, and customizable by CF.
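Because the option is mutable, it can be adjusted on a live DB through the SetOptions() API; a minimal sketch (assuming `db` and `cfh` are an existing open DB and column family handle, and using 300 as a plausible value per the notes below):

```cpp
#include <cassert>
#include "rocksdb/db.h"

// Sketch: turn on uncaching for one CF of an already-open DB.
rocksdb::Status s =
    db->SetOptions(cfh, {{"uncache_aggressiveness", "300"}});
assert(s.ok());
```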

Note that this block cache removal happens as part of the process of
purging obsolete files, which happens in a background thread rather than
along any CPU critical paths.

Notable auxiliary code details:
* Possibly fixing some issues with trivial moves with
  `only_delete_metadata`: unnecessary TableCache::Evict in that case and
  missing from the ObsoleteFileInfo move operator.
* Remove suspicious TableCache::Erase() from
  VersionSet::AddObsoleteBlobFile() (TODO follow-up item)

Marked EXPERIMENTAL until more thorough validation is complete.

Test Plan:
Added to crash test
TODO: unit test
TODO: Should I add stats? Which ones (how detailed)?

Performance, sample command:
```
for I in `seq 1 10`; do for UA in 300; do for CT in lru_cache fixed_hyper_clock_cache auto_hyper_clock_cache; do rm -rf /dev/shm/test3; TEST_TMPDIR=/dev/shm/test3 /usr/bin/time ./db_bench -benchmarks=readwhilewriting -num=13000000 -read_random_exp_range=6 -write_buffer_size=10000000 -bloom_bits=10 -cache_type=$CT -cache_size=390000000 -cache_index_and_filter_blocks=1 -disable_wal=1 -duration=60 -statistics -uncache_aggressiveness=$UA 2>&1 | grep -E 'micros/op|rocksdb.block.cache.data.(hit|miss)|rocksdb.number.keys.(read|written)|maxresident' | awk '/rocksdb.block.cache.data.miss/ { miss = $4 } /rocksdb.block.cache.data.hit/ { hit = $4 } { print } END { print "hit rate = " ((hit * 1.0) / (miss + hit)) }' | tee -a results-$CT-$UA; done; done; done
```

Averaging over 10 runs per case, block cache data block hit rates:

```
lru_cache
UA=0   -> hit rate = 0.327, ops/s = 87668, user CPU sec = 139.0
UA=300 -> hit rate = 0.336, ops/s = 87960, user CPU sec = 139.0

fixed_hyper_clock_cache
UA=0   -> hit rate = 0.336, ops/s = 100069, user CPU sec = 139.9
UA=300 -> hit rate = 0.343, ops/s = 100104, user CPU sec = 140.2

auto_hyper_clock_cache
UA=0   -> hit rate = 0.336, ops/s = 97580, user CPU sec = 140.5
UA=300 -> hit rate = 0.345, ops/s = 97972, user CPU sec = 139.8
```

Conclusion: up to roughly 1 percentage point of improved block cache hit
rate, likely leading to overall improved efficiency (because the
foreground CPU cost of cache misses likely outweighs the background CPU
cost of erasure, let alone I/O savings).
pdillinger committed May 23, 2024
1 parent c72ee45 commit 1c47fd5
Showing 29 changed files with 371 additions and 32 deletions.
26 changes: 18 additions & 8 deletions db/db_impl/db_impl_files.cc
@@ -410,12 +410,24 @@ void DBImpl::PurgeObsoleteFiles(JobContext& state, bool schedule_only) {
state.manifest_delete_files.size());
// We may ignore the dbname when generating the file names.
for (auto& file : state.sst_delete_files) {
if (!file.only_delete_metadata) {
candidate_files.emplace_back(
MakeTableFileName(file.metadata->fd.GetNumber()), file.path);
}
if (file.metadata->table_reader_handle) {
table_cache_->Release(file.metadata->table_reader_handle);
auto* handle = file.metadata->table_reader_handle;
if (file.only_delete_metadata) {
if (handle) {
// Simply release handle of file that is not being deleted
table_cache_->Release(handle);
}
} else {
// File is being deleted (actually obsolete)
auto number = file.metadata->fd.GetNumber();
candidate_files.emplace_back(MakeTableFileName(number), file.path);
if (handle == nullptr) {
// For files not "pinned" in table cache
handle = TableCache::Lookup(table_cache_.get(), number);
}
if (handle) {
TableCache::ReleaseObsolete(table_cache_.get(), handle,
file.uncache_aggressiveness);
}
}
file.DeleteMetadata();
}
@@ -577,8 +589,6 @@ void DBImpl::PurgeObsoleteFiles(JobContext& state, bool schedule_only) {
std::string fname;
std::string dir_to_sync;
if (type == kTableFile) {
// evict from cache
TableCache::Evict(table_cache_.get(), number);
fname = MakeTableFileName(candidate_file.file_path, number);
dir_to_sync = candidate_file.file_path;
} else if (type == kBlobFile) {
15 changes: 15 additions & 0 deletions db/table_cache.cc
@@ -163,6 +163,11 @@ Status TableCache::GetTableReader(
return s;
}

Cache::Handle* TableCache::Lookup(Cache* cache, uint64_t file_number) {
Slice key = GetSliceForFileNumber(&file_number);
return cache->Lookup(key);
}

Status TableCache::FindTable(
const ReadOptions& ro, const FileOptions& file_options,
const InternalKeyComparator& internal_comparator,
@@ -727,4 +732,14 @@ uint64_t TableCache::ApproximateSize(

return result;
}

void TableCache::ReleaseObsolete(Cache* cache, Cache::Handle* h,
uint32_t uncache_aggressiveness) {
CacheInterface typed_cache(cache);
TypedHandle* table_handle = reinterpret_cast<TypedHandle*>(h);
TableReader* table_reader = typed_cache.Value(table_handle);
table_reader->MarkObsolete(uncache_aggressiveness);
typed_cache.ReleaseAndEraseIfLastRef(table_handle);
}

} // namespace ROCKSDB_NAMESPACE
8 changes: 8 additions & 0 deletions db/table_cache.h
@@ -165,6 +165,14 @@ class TableCache {
// Evict any entry for the specified file number
static void Evict(Cache* cache, uint64_t file_number);

// Handles releasing, erasing, etc. of what should be the last reference
// to an obsolete file.
static void ReleaseObsolete(Cache* cache, Cache::Handle* handle,
uint32_t uncache_aggressiveness);

// Return handle to an existing cache entry if there is one
static Cache::Handle* Lookup(Cache* cache, uint64_t file_number);

// Find table reader
// @param skip_filters Disables loading/accessing the filter block
// @param level == -1 means not specified
8 changes: 8 additions & 0 deletions db/version_set.cc
@@ -857,10 +857,14 @@ Version::~Version() {
f->refs--;
if (f->refs <= 0) {
assert(cfd_ != nullptr);
// When not in the process of closing the DB, we'll have a superversion
// to get current mutable options from
auto* sv = cfd_->GetSuperVersion();
uint32_t path_id = f->fd.GetPathId();
assert(path_id < cfd_->ioptions()->cf_paths.size());
vset_->obsolete_files_.emplace_back(
f, cfd_->ioptions()->cf_paths[path_id].path,
sv ? sv->mutable_cf_options.uncache_aggressiveness : 0,
cfd_->GetFileMetadataCacheReservationManager());
}
}
@@ -5193,6 +5197,10 @@ VersionSet::~VersionSet() {
column_family_set_.reset();
for (auto& file : obsolete_files_) {
if (file.metadata->table_reader_handle) {
// NOTE: DB is shutting down, so file is probably not obsolete, just
// no longer referenced by Versions in memory.
// For more context, see comment on "table_cache_->EraseUnRefEntries()"
// in DBImpl::CloseHelper().
table_cache_->Release(file.metadata->table_reader_handle);
TableCache::Evict(table_cache_, file.metadata->fd.GetNumber());
}
21 changes: 14 additions & 7 deletions db/version_set.h
@@ -797,16 +797,20 @@ struct ObsoleteFileInfo {
// the file, usually because the file was trivially moved, so two FileMetadata
// objects are managing the file.
bool only_delete_metadata = false;
// To apply to this file
uint32_t uncache_aggressiveness = 0;

ObsoleteFileInfo() noexcept
: metadata(nullptr), only_delete_metadata(false) {}
ObsoleteFileInfo(FileMetaData* f, const std::string& file_path,
uint32_t _uncache_aggressiveness,
std::shared_ptr<CacheReservationManager>
file_metadata_cache_res_mgr_arg = nullptr)
: metadata(f),
path(file_path),
only_delete_metadata(false),
file_metadata_cache_res_mgr(file_metadata_cache_res_mgr_arg) {}
uncache_aggressiveness(_uncache_aggressiveness),
file_metadata_cache_res_mgr(
std::move(file_metadata_cache_res_mgr_arg)) {}

ObsoleteFileInfo(const ObsoleteFileInfo&) = delete;
ObsoleteFileInfo& operator=(const ObsoleteFileInfo&) = delete;
@@ -816,9 +820,13 @@ struct ObsoleteFileInfo {
}

ObsoleteFileInfo& operator=(ObsoleteFileInfo&& rhs) noexcept {
path = std::move(rhs.path);
metadata = rhs.metadata;
rhs.metadata = nullptr;
path = std::move(rhs.path);
only_delete_metadata = rhs.only_delete_metadata;
rhs.only_delete_metadata = false;
uncache_aggressiveness = rhs.uncache_aggressiveness;
rhs.uncache_aggressiveness = 0;
file_metadata_cache_res_mgr = rhs.file_metadata_cache_res_mgr;
rhs.file_metadata_cache_res_mgr = nullptr;

@@ -1495,10 +1503,7 @@ class VersionSet {
void GetLiveFilesMetaData(std::vector<LiveFileMetaData>* metadata);

void AddObsoleteBlobFile(uint64_t blob_file_number, std::string path) {
assert(table_cache_);

table_cache_->Erase(GetSliceForKey(&blob_file_number));

// TODO: Erase file from BlobFileCache?
obsolete_blob_files_.emplace_back(blob_file_number, std::move(path));
}

@@ -1676,6 +1681,8 @@ class VersionSet {
// Current size of manifest file
uint64_t manifest_file_size_;

// Obsolete files, or during DB shutdown any files not referenced by what's
// left of the in-memory LSM state.
std::vector<ObsoleteFileInfo> obsolete_files_;
std::vector<ObsoleteBlobFileInfo> obsolete_blob_files_;
std::vector<std::string> obsolete_manifests_;
1 change: 1 addition & 0 deletions db_stress_tool/db_stress_common.h
@@ -416,6 +416,7 @@ DECLARE_bool(enable_memtable_insert_with_hint_prefix_extractor);
DECLARE_bool(check_multiget_consistency);
DECLARE_bool(check_multiget_entity_consistency);
DECLARE_bool(inplace_update_support);
DECLARE_uint32(uncache_aggressiveness);

constexpr long KB = 1024;
constexpr int kRandomValueMaxFactor = 3;
7 changes: 7 additions & 0 deletions db_stress_tool/db_stress_gflags.cc
@@ -1404,4 +1404,11 @@ DEFINE_bool(check_multiget_entity_consistency, true,
DEFINE_bool(inplace_update_support,
ROCKSDB_NAMESPACE::Options().inplace_update_support,
"Options.inplace_update_support");

DEFINE_uint32(uncache_aggressiveness,
ROCKSDB_NAMESPACE::ColumnFamilyOptions().uncache_aggressiveness,
"Aggressiveness of erasing cache entries that are likely "
"obsolete. 0 = disabled, 1 = minimum, 100 = moderate, 10000 = "
"normal max");

#endif // GFLAGS
1 change: 1 addition & 0 deletions db_stress_tool/db_stress_test_base.cc
@@ -3865,6 +3865,7 @@ void InitializeOptionsFromFlags(
options.lowest_used_cache_tier =
static_cast<CacheTier>(FLAGS_lowest_used_cache_tier);
options.inplace_update_support = FLAGS_inplace_update_support;
options.uncache_aggressiveness = FLAGS_uncache_aggressiveness;
}

void InitializeOptionsGeneral(
42 changes: 42 additions & 0 deletions include/rocksdb/options.h
@@ -344,6 +344,48 @@ struct ColumnFamilyOptions : public AdvancedColumnFamilyOptions {
// Dynamically changeable through SetOptions() API
uint32_t memtable_max_range_deletions = 0;

// EXPERIMENTAL
// When > 0, RocksDB attempts to erase some block cache entries for files
// that have become obsolete, which means they are about to be deleted.
// To avoid excessive tracking, this "uncaching" process is iterative and
// speculative, meaning it could incur extra background CPU effort if the
// file's blocks are generally not cached. A larger number indicates more
// willingness to spend CPU time to maximize block cache hit rates by
// erasing known-obsolete entries.
//
// When uncache_aggressiveness=1, block cache entries for an obsolete file
// are only erased until any attempted erase operation fails because the
// block is not cached. Then no further attempts are made to erase cached
// blocks for that file.
//
// For larger values, erasure is attempted until evidence indicates that the
// chance of success is < 0.99^(a-1), where a = uncache_aggressiveness. For
// example:
// 2 -> Attempt only while expecting >= 99% successful/useful erasure
// 11 -> 90%
// 69 -> 50%
// 110 -> 33%
// 230 -> 10%
// 460 -> 1%
// 690 -> 0.1%
// 1000 -> 1 in 23000
// 10000 -> Always (for all practical purposes)
// NOTE: UINT32_MAX and nearby values could take additional special meanings
// in the future.
//
// Pinned cache entries (guaranteed present) are always erased if
// uncache_aggressiveness > 0, but are not used in predicting the chances of
// successful erasure of non-pinned entries.
//
// NOTE: In the case of copied DBs (such as Checkpoints) sharing a block
// cache, it is possible that a file becoming obsolete doesn't mean its
// block cache entries (shared among copies) are obsolete. Such a scenario
// is the best case for uncache_aggressiveness = 0.
//
// Once validated in production, the default will likely change to something
// around 300.
uint32_t uncache_aggressiveness = 0;

// Create ColumnFamilyOptions with default values for all fields
ColumnFamilyOptions();
// Create ColumnFamilyOptions from Options
9 changes: 7 additions & 2 deletions options/cf_options.cc
@@ -519,6 +519,10 @@ static std::unordered_map<std::string, OptionTypeInfo>
{offsetof(struct MutableCFOptions, bottommost_file_compaction_delay),
OptionType::kUInt32T, OptionVerificationType::kNormal,
OptionTypeFlags::kMutable}},
{"uncache_aggressiveness",
{offsetof(struct MutableCFOptions, uncache_aggressiveness),
OptionType::kUInt32T, OptionVerificationType::kNormal,
OptionTypeFlags::kMutable}},
{"block_protection_bytes_per_key",
{offsetof(struct MutableCFOptions, block_protection_bytes_per_key),
OptionType::kUInt8T, OptionVerificationType::kNormal,
@@ -1118,11 +1122,12 @@ void MutableCFOptions::Dump(Logger* log) const {
report_bg_io_stats);
ROCKS_LOG_INFO(log, " compression: %d",
static_cast<int>(compression));
ROCKS_LOG_INFO(log,
" experimental_mempurge_threshold: %f",
ROCKS_LOG_INFO(log, " experimental_mempurge_threshold: %f",
experimental_mempurge_threshold);
ROCKS_LOG_INFO(log, " bottommost_file_compaction_delay: %" PRIu32,
bottommost_file_compaction_delay);
ROCKS_LOG_INFO(log, " uncache_aggressiveness: %" PRIu32,
uncache_aggressiveness);

// Universal Compaction Options
ROCKS_LOG_INFO(log, "compaction_options_universal.size_ratio : %d",
8 changes: 6 additions & 2 deletions options/cf_options.h
@@ -173,7 +173,8 @@ struct MutableCFOptions {
compression_per_level(options.compression_per_level),
memtable_max_range_deletions(options.memtable_max_range_deletions),
bottommost_file_compaction_delay(
options.bottommost_file_compaction_delay) {
options.bottommost_file_compaction_delay),
uncache_aggressiveness(options.uncache_aggressiveness) {
RefreshDerivedOptions(options.num_levels, options.compaction_style);
}

@@ -223,7 +224,9 @@ struct MutableCFOptions {
memtable_protection_bytes_per_key(0),
block_protection_bytes_per_key(0),
sample_for_compression(0),
memtable_max_range_deletions(0) {}
memtable_max_range_deletions(0),
bottommost_file_compaction_delay(0),
uncache_aggressiveness(0) {}

explicit MutableCFOptions(const Options& options);

@@ -319,6 +322,7 @@ struct MutableCFOptions {
std::vector<CompressionType> compression_per_level;
uint32_t memtable_max_range_deletions;
uint32_t bottommost_file_compaction_delay;
uint32_t uncache_aggressiveness;

// Derived options
// Per-level target file size.
1 change: 1 addition & 0 deletions options/options_helper.cc
@@ -274,6 +274,7 @@ void UpdateColumnFamilyOptions(const MutableCFOptions& moptions,
cf_opts->last_level_temperature = moptions.last_level_temperature;
cf_opts->default_write_temperature = moptions.default_write_temperature;
cf_opts->memtable_max_range_deletions = moptions.memtable_max_range_deletions;
cf_opts->uncache_aggressiveness = moptions.uncache_aggressiveness;
}

void UpdateColumnFamilyOptions(const ImmutableCFOptions& ioptions,
3 changes: 2 additions & 1 deletion options/options_settable_test.cc
@@ -566,7 +566,8 @@ TEST_F(OptionsSettableTest, ColumnFamilyOptionsAllFieldsSettable) {
"persist_user_defined_timestamps=true;"
"block_protection_bytes_per_key=1;"
"memtable_max_range_deletions=999999;"
"bottommost_file_compaction_delay=7200;",
"bottommost_file_compaction_delay=7200;"
"uncache_aggressiveness=1234;",
new_options));

ASSERT_NE(new_options->blob_cache.get(), nullptr);
64 changes: 63 additions & 1 deletion table/block_based/block_based_table_reader.cc
@@ -135,7 +135,47 @@ extern const uint64_t kBlockBasedTableMagicNumber;
extern const std::string kHashIndexPrefixesBlock;
extern const std::string kHashIndexPrefixesMetadataBlock;

BlockBasedTable::~BlockBasedTable() { delete rep_; }
BlockBasedTable::~BlockBasedTable() {
if (rep_->uncache_aggressiveness > 0 && rep_->table_options.block_cache) {
if (rep_->filter) {
rep_->filter->EraseFromCacheBeforeDestruction(
rep_->uncache_aggressiveness);
}
if (rep_->index_reader) {
{
// TODO: Also uncache data blocks known after any gaps in partitioned
// index. Right now the iterator errors out as soon as there's an
// index partition not in cache.
IndexBlockIter iiter_on_stack;
ReadOptions ropts;
ropts.read_tier = kBlockCacheTier; // No I/O
auto iiter = NewIndexIterator(
ropts, /*disable_prefix_seek=*/false, &iiter_on_stack,
/*get_context=*/nullptr, /*lookup_context=*/nullptr);
std::unique_ptr<InternalIteratorBase<IndexValue>> iiter_unique_ptr;
if (iiter != &iiter_on_stack) {
iiter_unique_ptr.reset(iiter);
}
// Un-cache the data blocks the index iterator will tell us about
// without I/O. (NOTE: It's extremely unlikely that a data block
// will be in block cache without the index block pointing to it
// also in block cache.)
UncacheAggressivenessAdvisor advisor(rep_->uncache_aggressiveness);
for (iiter->SeekToFirst(); iiter->Valid() && advisor.ShouldContinue();
iiter->Next()) {
bool erased = EraseFromCache(iiter->value().handle);
advisor.Report(erased);
}
iiter->status().PermitUncheckedError();
}

// Un-cache the index block(s)
rep_->index_reader->EraseFromCacheBeforeDestruction(
rep_->uncache_aggressiveness);
}
}
delete rep_;
}

namespace {
// Read the block identified by "handle" from "file".
@@ -2663,6 +2703,24 @@ Status BlockBasedTable::VerifyChecksumInMetaBlocks(
return s;
}

bool BlockBasedTable::EraseFromCache(const BlockHandle& handle) const {
assert(rep_ != nullptr);

Cache* const cache = rep_->table_options.block_cache.get();
if (cache == nullptr) {
return false;
}

CacheKey key = GetCacheKey(rep_->base_cache_key, handle);

Cache::Handle* const cache_handle = cache->Lookup(key.AsSlice());
if (cache_handle == nullptr) {
return false;
}

return cache->Release(cache_handle, /*erase_if_last_ref=*/true);
}

bool BlockBasedTable::TEST_BlockInCache(const BlockHandle& handle) const {
assert(rep_ != nullptr);

@@ -3232,4 +3290,8 @@ void BlockBasedTable::DumpKeyValue(const Slice& key, const Slice& value,
out_stream << " ------\n";
}

void BlockBasedTable::MarkObsolete(uint32_t uncache_aggressiveness) {
rep_->uncache_aggressiveness = uncache_aggressiveness;
}

} // namespace ROCKSDB_NAMESPACE