Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

cloud_storage: use stable iterator for absl::btree #5704

Merged
merged 5 commits into from
Aug 2, 2022
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
29 changes: 16 additions & 13 deletions src/v/cloud_storage/remote_partition.cc
Original file line number Diff line number Diff line change
Expand Up @@ -16,6 +16,7 @@
#include "cloud_storage/types.h"
#include "storage/parser_errc.h"
#include "storage/types.h"
#include "utils/gate_guard.h"
#include "utils/retry_chain_node.h"
#include "utils/stream_utils.h"

Expand Down Expand Up @@ -501,27 +502,29 @@ remote_partition::get_term_last_offset(model::term_id term) const {

ss::future<std::vector<cluster::rm_stm::tx_range>>
remote_partition::aborted_transactions(offset_range offsets) {
gate_guard guard(_gate);
// Here we have to use kafka offsets to locate the segments and
// redpanda offsets to extract aborted transactions metadata because
// tx-manifests contains redpanda offsets.
std::vector<cluster::rm_stm::tx_range> result;
auto first_it = _segments.upper_bound(offsets.begin);
if (first_it != _segments.begin()) {

// that's a stable btree iterator that makes key lookup on increment
auto first_it = upper_bound(offsets.begin);
if (first_it != begin()) {
first_it = std::prev(first_it);
}
for (auto it = first_it; it != _segments.end(); it++) {
for (auto it = first_it; it != end(); it++) {
Comment on lines -512 to +516
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Based on the description of the problem seen, this still looks problematic. IIUC how the remote partition iterator works it's safe to increment it because it caches a key and performs a fresh lookup. However, imagine here that it is valid. Then a few lines down a reference to the value is taken:

        auto& st = it->second;

and then scheduling points occur that ostensibly are affecting the underlying container:

        auto tx = co_await ss::visit(
          st,
          [this, &st, offsets, offset_key = it->first](

Directly from the abseil documentation about invalidations:

Importantly, insertions and deletions may invalidate outstanding iterators
pointers, and references to elements. Such invalidations are typically only
an issue if insertion and deletion operations are interleaved with the use of
more than one iterator, pointer, or reference simultaneously.

However, even if all that is ok here, in the special iterator in remote partition it still looks unsafe. Deference looks up the key (which will be present), and then unconditionally dereferences it. And I don't see any reason why it is guarnateed that the key actually exists int he container after scheduling point.

auto it = _map.get().find(*_key);
return *it;

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think right now it exists because we never delete from _segments, we only insert_or_assign() to it. But I agree that generally we want better guarantees.

cc @Lazin

Copy link
Contributor

@andrwng andrwng Jul 28, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree looking at the definition of the stable iterator it doesn't look safe for deletes. Another instances is in incrementing, we expect the key to exist:

auto it = _map.get().find(*_key);
++it;

I think the undocumented expectation is that we don't support removal from _segments (a quick look around I don't see any calls to erase() at least). If that's the case, we should add that to the iterator comments. If not, I agree, the stable iterator doesn't look safe as a drop-in replacement.

Copy link
Member

@dotnwat dotnwat Jul 28, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the undocumented expectation is that we don't support removal from _segments (a quick look around I don't see any calls to erase() at least).

I don't see any erase method (which is curious, how does it shrink? :)), but I do see an insert_or_assign where the value contains the type of things that look problematic if destructed (e.g. unique_ptrs etc...) while there is an open reference. wdyt @andrwng ?

But I agree that generally we want better guarantees.

I think Andrew's point is good. If interleaved modifications to the container are restricted to a set of operations that make this safe, then can that be documented? otherwise this isn't about better guarantees but rather about a latent bug.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looking through the dereferences, it's not that straightforward, but nothing really jumped out as flagrantly incorrect, e.g. in the places we use the value state, I don't think we ever destruct the value. I think it's reasonably safe that callers that are able to dereference from the map expect the values to stay valid, and that dereferencers never destruct them, but it's definitely another thing to document since it seems fairly easy to misuse.

It's also worth thinking about what improvements we can bake into the APIs to make it harder to misuse, though nothing concrete comes to mind at the moment.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

e.g. in the places we use the value state, I don't think we ever destruct the value

map::insert_or_assign will destruct the previous value, no?

Copy link
Member

@dotnwat dotnwat Jul 28, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

oh i see, the dereference looks up the value again

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I take that back, we do hold a refernece to the value here auto& st = it->second; right?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah but the value references are safe to hold, provided that the underlying btree never erases anything and that other callers with references to the underlying values don't reset the value. Holding a reference to a unique_ptr should be fine on its own since it won't destruct if it leaves scope (though again, this is error prone and worth documenting)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

regarding this code

auto& st = it->second;

This approach is used in many places in the SI code.
Inside the ss::visist we're actually writing to the referenced value, I think that we should use some extra synchronization that will prevent container modifications. It's safe for now but we will introduce removals from the archival metadata snapshot and as result we will have to remove data from _segments. But probably, outside of this PR.

if (it->first > offsets.end) {
break;
}
auto& st = it->second;
auto tx = co_await ss::visit(
st,
[this, &st, offsets, offset_key = it->first](
it->second,
[this, offsets, offset_key = it->first](
offloaded_segment_state& off_state) {
auto tmp = off_state->materialize(*this, offset_key);
auto res = tmp->segment->aborted_transactions(
offsets.begin_rp, offsets.end_rp);
st = std::move(tmp);
_segments.insert_or_assign(offset_key, std::move(tmp));
return res;
},
[offsets](materialized_segment_ptr& m_state) {
Expand All @@ -540,10 +543,10 @@ remote_partition::aborted_transactions(offset_range offsets) {
"found {} aborted transactions for {}-{} offset range ({}-{} before "
"offset translaction)",
result.size(),
offsets.begin_rp,
offsets.begin,
offsets.end_rp,
offsets.end);
offsets.end,
offsets.begin_rp,
offsets.end_rp);
co_return result;
}

Expand All @@ -560,9 +563,9 @@ ss::future<> remote_partition::stop() {
co_await std::visit([](auto&& rs) { return rs->stop(); }, rs);
}

for (auto& [offset, seg] : _segments) {
vlog(_ctxlog.debug, "remote partition stop {}", offset);
co_await std::visit([](auto&& st) { return st->stop(); }, seg);
for (auto it = begin(); it != end(); it++) {
vlog(_ctxlog.debug, "remote partition stop {}", it->first);
co_await std::visit([](auto&& st) { return st->stop(); }, it->second);
}
}

Expand Down
11 changes: 9 additions & 2 deletions src/v/cloud_storage/remote_partition.h
Original file line number Diff line number Diff line change
Expand Up @@ -48,8 +48,7 @@ namespace details {
/// iterator stability guarantee by caching the key and
/// doing a lookup on every increment.
/// This turns iterator increment into O(logN) operation
/// but this is OK since the underlying btree_map has high
/// fan-out ratio so the logarithm base is relatively large.
/// Deleting from underlying btree_map is not supported.
template<class TKey, class TVal>
class btree_map_stable_iterator
: public boost::iterator_facade<
Expand Down Expand Up @@ -93,6 +92,10 @@ class btree_map_stable_iterator
vassert(
_key.has_value(), "btree_map_stable_iterator can't be incremented");
auto it = _map.get().find(*_key);
// _key should be present since deletions are not supported
vassert(
it != _map.get().end(),
"btree_map_stable_iterator can't be incremented");
++it;
if (it == _map.get().end()) {
set_end();
Expand Down Expand Up @@ -326,6 +329,10 @@ class remote_partition
cache& _cache;
const partition_manifest& _manifest;
s3::bucket_name _bucket;

// Deleting from _segments is not supported.
// absl::btree_map doesn't provide a pointer stabilty. We are
// using remote_partition::btree_map_stable_iterator to work around this.
segment_map_t _segments;
eviction_list_t _eviction_list;
intrusive_list<
Expand Down