Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

cloud_storage: use stable iterator for absl::btree #5704

Merged
merged 5 commits into from
Aug 2, 2022

Conversation

LenaAn
Copy link
Contributor

@LenaAn LenaAn commented Jul 28, 2022

Cover letter

Use stable iterator for absl::btree. absl::btree_map doesn't provide a pointer stability, so incirementing
iterator in a loop that has a scheduling point can lead to segfault.
remote_partition uses stable iterator that make a lookup on every
iterator increment.

Fixes #5613

Release notes

  • none

absl::btree_map doesn't provide a pointer stability, so incerementing
iterator in a loop that has a scheduling point can lead to segfault.
remote_partition uses stable iterator that make a lookup on every
iterator increment.
@jcsp
Copy link
Contributor

jcsp commented Jul 28, 2022

LGTM, I defer to @Lazin for approval as he's more familiar with these iterator helpers.

Copy link
Member

@dotnwat dotnwat left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this needs some more detail about the specific scenario that is being fixed. From what I can see there could still be some issue.

Comment on lines -512 to +516
for (auto it = first_it; it != _segments.end(); it++) {
for (auto it = first_it; it != end(); it++) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Based on the description of the problem seen, this still looks problematic. IIUC how the remote partition iterator works it's safe to increment it because it caches a key and performs a fresh lookup. However, imagine here that it is valid. Then a few lines down a reference to the value is taken:

        auto& st = it->second;

and then scheduling points occur that ostensibly are affecting the underlying container:

        auto tx = co_await ss::visit(
          st,
          [this, &st, offsets, offset_key = it->first](

Directly from the abseil documentation about invalidations:

Importantly, insertions and deletions may invalidate outstanding iterators
pointers, and references to elements. Such invalidations are typically only
an issue if insertion and deletion operations are interleaved with the use of
more than one iterator, pointer, or reference simultaneously.

However, even if all that is ok here, in the special iterator in remote partition it still looks unsafe. Deference looks up the key (which will be present), and then unconditionally dereferences it. And I don't see any reason why it is guarnateed that the key actually exists int he container after scheduling point.

auto it = _map.get().find(*_key);
return *it;

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think right now it exists because we never delete from _segments, we only insert_or_assign() to it. But I agree that generally we want better guarantees.

cc @Lazin

Copy link
Contributor

@andrwng andrwng Jul 28, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree looking at the definition of the stable iterator it doesn't look safe for deletes. Another instances is in incrementing, we expect the key to exist:

auto it = _map.get().find(*_key);
++it;

I think the undocumented expectation is that we don't support removal from _segments (a quick look around I don't see any calls to erase() at least). If that's the case, we should add that to the iterator comments. If not, I agree, the stable iterator doesn't look safe as a drop-in replacement.

Copy link
Member

@dotnwat dotnwat Jul 28, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the undocumented expectation is that we don't support removal from _segments (a quick look around I don't see any calls to erase() at least).

I don't see any erase method (which is curious, how does it shrink? :)), but I do see an insert_or_assign where the value contains the type of things that look problematic if destructed (e.g. unique_ptrs etc...) while there is an open reference. wdyt @andrwng ?

But I agree that generally we want better guarantees.

I think Andrew's point is good. If interleaved modifications to the container are restricted to a set of operations that make this safe, then can that be documented? otherwise this isn't about better guarantees but rather about a latent bug.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looking through the dereferences, it's not that straightforward, but nothing really jumped out as flagrantly incorrect, e.g. in the places we use the value state, I don't think we ever destruct the value. I think it's reasonably safe that callers that are able to dereference from the map expect the values to stay valid, and that dereferencers never destruct them, but it's definitely another thing to document since it seems fairly easy to misuse.

It's also worth thinking about what improvements we can bake into the APIs to make it harder to misuse, though nothing concrete comes to mind at the moment.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

e.g. in the places we use the value state, I don't think we ever destruct the value

map::insert_or_assign will destruct the previous value, no?

Copy link
Member

@dotnwat dotnwat Jul 28, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

oh i see, the dereference looks up the value again

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I take that back, we do hold a refernece to the value here auto& st = it->second; right?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah but the value references are safe to hold, provided that the underlying btree never erases anything and that other callers with references to the underlying values don't reset the value. Holding a reference to a unique_ptr should be fine on its own since it won't destruct if it leaves scope (though again, this is error prone and worth documenting)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

regarding this code

auto& st = it->second;

This approach is used in many places in the SI code.
Inside the ss::visist we're actually writing to the referenced value, I think that we should use some extra synchronization that will prevent container modifications. It's safe for now but we will introduce removals from the archival metadata snapshot and as result we will have to remove data from _segments. But probably, outside of this PR.

@LenaAn
Copy link
Contributor Author

LenaAn commented Jul 29, 2022

We can use absl::node_hash_map with a wrapper to make it ordered instead of using absl::btree_map with a wrapper to make it pointer stable. We will store std::set of keys alongside with absl::node_hash_map and on increment we will take the next key from std::set and do a lookup in absl::node_hash_map.

  • Insert: insert key into std::set of keys, insert key and value to absl::node_hash_map, time is logN for std::set + const for absl::node_hash_map
  • Delete: delete key value from absl::btree_map, delete key from std::set, time is logN for std::set + const for absl::node_hash_map
  • Iterator increase: increase the iterator for set of keys, find the new value in absl::node_hash_map, const for std::set + const for absl::node_hash_map

@Lazin @dotnwat @andrwng WDYT?

@Lazin
Copy link
Contributor

Lazin commented Jul 29, 2022

cc @LenaAn @dotnwat @andrwng

I'm against any node based containers because too much effort was put into making this memory efficient. The whole point of using an absl::btree_map is to save some memory and to use fewer memory allocations. We have seen a lot of bad_alloc exceptions previously and end up using btree map and this iterator wrapper.

std::set<model::offset> uses at least 32 bytes per value - 3 pointers + value, the absl::node_hash_map is difficult to account for but anyway, every element is at least 16 bytes of payload + pointer + there is some memory to store the hash table itself which is number of elements in container * 8 bytes * some factor.

The absl::btree_map<model::offset, std::unique_ptr<...>> on the other hand should use around 20 bytes per element (with branching factor 16).

I think that we should add a good comment here and revisit this in a different PR to prepare the code for cloud data eviction.

@andrwng
Copy link
Contributor

andrwng commented Jul 29, 2022

Yeah, I'd be hesitant to do an overhaul of the underlying type so close to the release, especially if we think the current implementation is safe with the current usage pattern, and is more memory efficient. I agree we'll probably need to revisit this soon to support erasing, but that can come later.

@LenaAn did you confirm that the new usage of stable iterator avoids the crash when you add the sleeps to reproduce?

@LenaAn
Copy link
Contributor Author

LenaAn commented Jul 29, 2022

no, actually I couldn't repro.

Ok I will add a comment about deletion, but I don't understand why using auto& st = it->second; is safe.

@Lazin
Copy link
Contributor

Lazin commented Jul 29, 2022

one possible solution is to capture a key instead of doing auto& st = it->second; and capture st
and inside the lambda you can look up the key and update it

@Lazin
Copy link
Contributor

Lazin commented Jul 29, 2022

The change is pretty minimal, but it has to be applied in several other places. It looks like it's not safe even without the deletions because you can capture the reference like that, then after a scheduling point few more elements will be added to the _segments and the node that owns the object that st references will be split. The st will become a dangling reference since the value will be moved to the new btree-node.

@andrwng
Copy link
Contributor

andrwng commented Jul 29, 2022

Ah indeed this isn't safe. For some reason I thought the iterators would be invalidated but the references would remain valid, but a small test proves otherwise:

  SEASTAR_THREAD_TEST_CASE(test_valid_iterator_dereference) {
      using map_t = absl::btree_map<int, std::unique_ptr<int>>;
      using iter_t = cloud_storage::details::btree_map_stable_iterator<int, std::unique_ptr<int>>;
      // Insert a bunch.
      static constexpr int N = 10000;
      map_t m;
      for (int i = 0; i < N; i++) {
          m[i] = std::make_unique<int>(i * 2);
      }
      // Take references to the unique pointers.
      std::vector<std::pair<int, std::unique_ptr<int>*>> refs;
      refs.reserve(N);
      for (auto it = iter_t(m, m.begin()->first); it != iter_t(m); it++) {
        refs.emplace_back(std::make_pair(it->first, &(it->second)));
      }
      // Insert more to encourage node splits
      for (int i = N; i < N * 2; i++) {
          m[i] = std::make_unique<int>(i * 2);
      }

      // Oh no! Some of our values no longer point at the right value!
      for (const auto& [i, r] : refs) {
          BOOST_REQUIRE_EQUAL(**r, 2 * i);
      }
  }

I think @Lazin is suggesting is to do another find() in the visit body for the offloaded_segment_state& case. Am I understanding that correctly? I guess that would work as long as the scheduling of the visit() has already evaluated the underlying type of the variant.

remote_partition uses stable iterator wrapper for absl::btree. However
stable iterator doesn't support deletions from the underlying btree_map.
Add a comment about that.
Reference to the absl::btree_map is not stable, so don't use a reference
that was captured before the scheduling point.
Copy link
Contributor

@Lazin Lazin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@Lazin
Copy link
Contributor

Lazin commented Aug 2, 2022

One of the CI failures looks new.

@LenaAn
Copy link
Contributor Author

LenaAn commented Aug 2, 2022

@LenaAn
Copy link
Contributor Author

LenaAn commented Aug 2, 2022

ci failure is #5725 which is already closed a couple of hours ago, I didn't update the branch that recently so that should be fine

@LenaAn LenaAn merged commit 4b9875b into redpanda-data:dev Aug 2, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

SI Crash under load in FranzGoVerifiableWithSiTest.test_si_with_timeboxed,test_si_without_timeboxed
6 participants