-
Notifications
You must be signed in to change notification settings - Fork 563
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
cloud_storage: use stable iterator for absl::btree #5704
Conversation
absl::btree_map doesn't provide a pointer stability, so incerementing iterator in a loop that has a scheduling point can lead to segfault. remote_partition uses stable iterator that make a lookup on every iterator increment.
LGTM, I defer to @Lazin for approval as he's more familiar with these iterator helpers. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this needs some more detail about the specific scenario that is being fixed. From what I can see there could still be some issue.
for (auto it = first_it; it != _segments.end(); it++) { | ||
for (auto it = first_it; it != end(); it++) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Based on the description of the problem seen, this still looks problematic. IIUC how the remote partition iterator works it's safe to increment it because it caches a key and performs a fresh lookup. However, imagine here that it
is valid. Then a few lines down a reference to the value is taken:
auto& st = it->second;
and then scheduling points occur that ostensibly are affecting the underlying container:
auto tx = co_await ss::visit(
st,
[this, &st, offsets, offset_key = it->first](
Directly from the abseil documentation about invalidations:
Importantly, insertions and deletions may invalidate outstanding iterators
pointers, and references to elements. Such invalidations are typically only
an issue if insertion and deletion operations are interleaved with the use of
more than one iterator, pointer, or reference simultaneously.
However, even if all that is ok here, in the special iterator in remote partition it still looks unsafe. Deference looks up the key (which will be present), and then unconditionally dereferences it. And I don't see any reason why it is guarnateed that the key actually exists int he container after scheduling point.
auto it = _map.get().find(*_key);
return *it;
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think right now it exists because we never delete from _segments
, we only insert_or_assign()
to it. But I agree that generally we want better guarantees.
cc @Lazin
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree looking at the definition of the stable iterator it doesn't look safe for deletes. Another instances is in incrementing, we expect the key to exist:
auto it = _map.get().find(*_key);
++it;
I think the undocumented expectation is that we don't support removal from _segments
(a quick look around I don't see any calls to erase()
at least). If that's the case, we should add that to the iterator comments. If not, I agree, the stable iterator doesn't look safe as a drop-in replacement.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the undocumented expectation is that we don't support removal from _segments (a quick look around I don't see any calls to erase() at least).
I don't see any erase
method (which is curious, how does it shrink? :)), but I do see an insert_or_assign
where the value contains the type of things that look problematic if destructed (e.g. unique_ptrs etc...) while there is an open reference. wdyt @andrwng ?
But I agree that generally we want better guarantees.
I think Andrew's point is good. If interleaved modifications to the container are restricted to a set of operations that make this safe, then can that be documented? otherwise this isn't about better guarantees but rather about a latent bug.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looking through the dereferences, it's not that straightforward, but nothing really jumped out as flagrantly incorrect, e.g. in the places we use the value state, I don't think we ever destruct the value. I think it's reasonably safe that callers that are able to dereference from the map expect the values to stay valid, and that dereferencers never destruct them, but it's definitely another thing to document since it seems fairly easy to misuse.
It's also worth thinking about what improvements we can bake into the APIs to make it harder to misuse, though nothing concrete comes to mind at the moment.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
e.g. in the places we use the value state, I don't think we ever destruct the value
map::insert_or_assign will destruct the previous value, no?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
oh i see, the dereference looks up the value again
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I take that back, we do hold a refernece to the value here auto& st = it->second;
right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah but the value references are safe to hold, provided that the underlying btree never erases anything and that other callers with references to the underlying values don't reset the value. Holding a reference to a unique_ptr
should be fine on its own since it won't destruct if it leaves scope (though again, this is error prone and worth documenting)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
regarding this code
auto& st = it->second;
This approach is used in many places in the SI code.
Inside the ss::visist
we're actually writing to the referenced value, I think that we should use some extra synchronization that will prevent container modifications. It's safe for now but we will introduce removals from the archival metadata snapshot and as result we will have to remove data from _segments
. But probably, outside of this PR.
We can use
|
I'm against any node based containers because too much effort was put into making this memory efficient. The whole point of using an
The I think that we should add a good comment here and revisit this in a different PR to prepare the code for cloud data eviction. |
Yeah, I'd be hesitant to do an overhaul of the underlying type so close to the release, especially if we think the current implementation is safe with the current usage pattern, and is more memory efficient. I agree we'll probably need to revisit this soon to support erasing, but that can come later. @LenaAn did you confirm that the new usage of stable iterator avoids the crash when you add the sleeps to reproduce? |
no, actually I couldn't repro. Ok I will add a comment about deletion, but I don't understand why using |
one possible solution is to capture a key instead of doing |
The change is pretty minimal, but it has to be applied in several other places. It looks like it's not safe even without the deletions because you can capture the reference like that, then after a scheduling point few more elements will be added to the |
Ah indeed this isn't safe. For some reason I thought the iterators would be invalidated but the references would remain valid, but a small test proves otherwise:
I think @Lazin is suggesting is to do another |
remote_partition uses stable iterator wrapper for absl::btree. However stable iterator doesn't support deletions from the underlying btree_map. Add a comment about that.
Reference to the absl::btree_map is not stable, so don't use a reference that was captured before the scheduling point.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
One of the CI failures looks new. |
ci failure is #5725 which is already closed a couple of hours ago, I didn't update the branch that recently so that should be fine |
Cover letter
Use stable iterator for
absl::btree
.absl::btree_map
doesn't provide a pointer stability, so incirementingiterator in a loop that has a scheduling point can lead to segfault.
remote_partition
uses stable iterator that make a lookup on everyiterator increment.
Fixes #5613
Release notes