SI Crash under load in FranzGoVerifiableWithSiTest.test_si_with_timeboxed,test_si_without_timeboxed #5613

Closed
jcsp opened this issue Jul 25, 2022 · 9 comments · Fixed by #5704
Assignees
Labels
area/cloud-storage Shadow indexing subsystem ci-failure kind/bug Something isn't working

Comments

@jcsp
Contributor

jcsp commented Jul 25, 2022

This was a clustered ducktape run from 20th July https://buildkite.com/redpanda/vtools/builds/2906

Tail of crashed node (ip-172-31-54-196) log is:

DEBUG 2022-07-20 07:33:30,731 [shard 2] cloud_storage - [fiber18~3 kafka/topic-tefqfbpbjg/88] - remote_segment.cc:235 - Using index to locate 121, the result is rp-offset: 126, kafka-offset: 120, file-pos: 15001626
DEBUG 2022-07-20 07:33:30,731 [shard 2] cloud_storage - [fiber18~3 kafka/topic-tefqfbpbjg/88] - remote_segment.cc:644 - found 0 aborted transactions for 127-136 offset range in this segment
DEBUG 2022-07-20 07:33:30,731 [shard 2] cloud_storage - [fiber18~3 kafka/topic-tefqfbpbjg/88] - remote_segment.cc:235 - Using index to locate 122, the result is rp-offset: 127, kafka-offset: 121, file-pos: 16001723
DEBUG 2022-07-20 07:33:30,731 [shard 1] cloud_storage - cache_service.cc:308 - Trying to get 2c09875d/kafka/topic-tefqfbpbjg/91_22/69-1-v1.log.1 from archival cache.
DEBUG 2022-07-20 07:33:30,731 [shard 1] cloud_storage - [fiber8~5~2~0|1|56796ms] - remote.cc:203 - Download manifest "5cb4cb50/kafka/topic-tefqfbpbjg/23_22/145-2-v1.log.2.tx"
Segmentation fault on shard 2.
Backtrace:
  0x46f7d06
  0x475aaf6
  0x29e6de1559ff
  0x3f58b40
  0x24a0f8a
  0x471510f
  0x4718de7
  0x475c245
  0x46b5c7f
  /opt/redpanda/lib/libpthread.so.0+0x9298
  /opt/redpanda/lib/libc.so.6+0x1006a2

These clustered ducktape runs use nightly packages, so it should be possible to download the packages from that night (including debug symbols package) and decode that backtrace.

@jcsp jcsp added kind/bug Something isn't working area/cloud-storage Shadow indexing subsystem labels Jul 25, 2022
@jcsp jcsp changed the title Crash under load in FranzGoVerifiableWithSiTest.test_si_with_timeboxed SI Crash under load in FranzGoVerifiableWithSiTest.test_si_with_timeboxed Jul 25, 2022
@dotnwat
Member

dotnwat commented Jul 25, 2022

Weee I think I did it.

IMPORTANT: this was decoded based on matching sha1 not based on time (ie cdt chooses the latest build not a specific build). So it is possible that this is wrong. I will double check.

Backtrace:[Backtrace #0]
void seastar::backtrace<seastar::backtrace_buffer::append_backtrace()::{lambda(seastar::frame)#1}>(seastar::backtrace_buffer::append_backtrace()::{lambda(seastar::frame)#1}&&) at /v/build/v_deps_build/seastar-prefix/src/seastar/include/seastar/util/backtrace.hh:59
 (inlined by) seastar::backtrace_buffer::append_backtrace() at /v/build/v_deps_build/seastar-prefix/src/seastar/src/core/reactor.cc:758
 (inlined by) seastar::print_with_backtrace(seastar::backtrace_buffer&, bool) at /v/build/v_deps_build/seastar-prefix/src/seastar/src/core/reactor.cc:788
seastar::print_with_backtrace(char const*, bool) at /v/build/v_deps_build/seastar-prefix/src/seastar/src/core/reactor.cc:800
 (inlined by) seastar::sigsegv_action() at /v/build/v_deps_build/seastar-prefix/src/seastar/src/core/reactor.cc:3685
 (inlined by) operator() at /v/build/v_deps_build/seastar-prefix/src/seastar/src/core/reactor.cc:3671
 (inlined by) __invoke at /v/build/v_deps_build/seastar-prefix/src/seastar/src/core/reactor.cc:3667
?? ??:0
cloud_storage::remote_partition::aborted_transactions(cloud_storage::offset_range) [clone .resume] at remote_partition.cc:?
 (inlined by) absl::lts_20210324::container_internal::btree_iterator<absl::lts_20210324::container_internal::btree_node<absl::lts_20210324::container_internal::map_params<detail::base_named_type<long, model::model_offset_type, std::__1::integral_constant<bool, true> >, std::__1::variant<cloud_storage::remote_partition::offloaded_segment_state, std::__1::unique_ptr<cloud_storage::remote_partition::materialized_segment_state, std::__1::default_delete<cloud_storage::remote_partition::materialized_segment_state> > >, std::__1::less<detail::base_named_type<long, model::model_offset_type, std::__1::integral_constant<bool, true> > >, std::__1::allocator<std::__1::pair<detail::base_named_type<long, model::model_offset_type, std::__1::integral_constant<bool, true> > const, std::__1::variant<cloud_storage::remote_partition::offloaded_segment_state, std::__1::unique_ptr<cloud_storage::remote_partition::materialized_segment_state, std::__1::default_delete<cloud_storage::remote_partition::materialized_segment_state> > > > >, 256, false> >, std::__1::pair<detail::base_named_type<long, model::model_offset_type, std::__1::integral_constant<bool, true> > const, std::__1::variant<cloud_storage::remote_partition::offloaded_segment_state, std::__1::unique_ptr<cloud_storage::remote_partition::materialized_segment_state, std::__1::default_delete<cloud_storage::remote_partition::materialized_segment_state> > > >&, std::__1::pair<detail::base_named_type<long, model::model_offset_type, std::__1::integral_constant<bool, true> > const, std::__1::variant<cloud_storage::remote_partition::offloaded_segment_state, std::__1::unique_ptr<cloud_storage::remote_partition::materialized_segment_state, std::__1::default_delete<cloud_storage::remote_partition::materialized_segment_state> > > >*>::increment() at /vectorized/include/absl/container/internal/btree.h:985
 (inlined by) absl::lts_20210324::container_internal::btree_iterator<absl::lts_20210324::container_internal::btree_node<absl::lts_20210324::container_internal::map_params<detail::base_named_type<long, model::model_offset_type, std::__1::integral_constant<bool, true> >, std::__1::variant<cloud_storage::remote_partition::offloaded_segment_state, std::__1::unique_ptr<cloud_storage::remote_partition::materialized_segment_state, std::__1::default_delete<cloud_storage::remote_partition::materialized_segment_state> > >, std::__1::less<detail::base_named_type<long, model::model_offset_type, std::__1::integral_constant<bool, true> > >, std::__1::allocator<std::__1::pair<detail::base_named_type<long, model::model_offset_type, std::__1::integral_constant<bool, true> > const, std::__1::variant<cloud_storage::remote_partition::offloaded_segment_state, std::__1::unique_ptr<cloud_storage::remote_partition::materialized_segment_state, std::__1::default_delete<cloud_storage::remote_partition::materialized_segment_state> > > > >, 256, false> >, std::__1::pair<detail::base_named_type<long, model::model_offset_type, std::__1::integral_constant<bool, true> > const, std::__1::variant<cloud_storage::remote_partition::offloaded_segment_state, std::__1::unique_ptr<cloud_storage::remote_partition::materialized_segment_state, std::__1::default_delete<cloud_storage::remote_partition::materialized_segment_state> > > >&, std::__1::pair<detail::base_named_type<long, model::model_offset_type, std::__1::integral_constant<bool, true> > const, std::__1::variant<cloud_storage::remote_partition::offloaded_segment_state, std::__1::unique_ptr<cloud_storage::remote_partition::materialized_segment_state, std::__1::default_delete<cloud_storage::remote_partition::materialized_segment_state> > > >*>::operator++() at /vectorized/include/absl/container/internal/btree.h:1021
 (inlined by) absl::lts_20210324::container_internal::btree_iterator<absl::lts_20210324::container_internal::btree_node<absl::lts_20210324::container_internal::map_params<detail::base_named_type<long, model::model_offset_type, std::__1::integral_constant<bool, true> >, std::__1::variant<cloud_storage::remote_partition::offloaded_segment_state, std::__1::unique_ptr<cloud_storage::remote_partition::materialized_segment_state, std::__1::default_delete<cloud_storage::remote_partition::materialized_segment_state> > >, std::__1::less<detail::base_named_type<long, model::model_offset_type, std::__1::integral_constant<bool, true> > >, std::__1::allocator<std::__1::pair<detail::base_named_type<long, model::model_offset_type, std::__1::integral_constant<bool, true> > const, std::__1::variant<cloud_storage::remote_partition::offloaded_segment_state, std::__1::unique_ptr<cloud_storage::remote_partition::materialized_segment_state, std::__1::default_delete<cloud_storage::remote_partition::materialized_segment_state> > > > >, 256, false> >, std::__1::pair<detail::base_named_type<long, model::model_offset_type, std::__1::integral_constant<bool, true> > const, std::__1::variant<cloud_storage::remote_partition::offloaded_segment_state, std::__1::unique_ptr<cloud_storage::remote_partition::materialized_segment_state, std::__1::default_delete<cloud_storage::remote_partition::materialized_segment_state> > > >&, std::__1::pair<detail::base_named_type<long, model::model_offset_type, std::__1::integral_constant<bool, true> > const, std::__1::variant<cloud_storage::remote_partition::offloaded_segment_state, std::__1::unique_ptr<cloud_storage::remote_partition::materialized_segment_state, std::__1::default_delete<cloud_storage::remote_partition::materialized_segment_state> > > >*>::operator++(int) at /vectorized/include/absl/container/internal/btree.h:1030
 (inlined by) cloud_storage::remote_partition::aborted_transactions(cloud_storage::offset_range) at /var/lib/buildkite-agent/builds/buildkite-amd64-xfs-builders-i-0babfcc69a72e5b51-1/redpanda/redpanda/vbuild/release/clang/../../../src/v/cloud_storage/remote_partition.cc:512
std::__1::coroutine_handle<seastar::internal::coroutine_traits_base<std::__1::vector<cluster::rm_stm::tx_range, std::__1::allocator<cluster::rm_stm::tx_range> > >::promise_type>::resume() const at /vectorized/llvm/bin/../include/c++/v1/__coroutine/coroutine_handle.h:168
 (inlined by) seastar::internal::coroutine_traits_base<std::__1::vector<cluster::rm_stm::tx_range, std::__1::allocator<cluster::rm_stm::tx_range> > >::promise_type::run_and_dispose() at /vectorized/include/seastar/core/coroutine.hh:78
seastar::reactor::run_tasks(seastar::reactor::task_queue&) at /v/build/v_deps_build/seastar-prefix/src/seastar/src/core/reactor.cc:2356
 (inlined by) seastar::reactor::run_some_tasks() at /v/build/v_deps_build/seastar-prefix/src/seastar/src/core/reactor.cc:2769
seastar::reactor::do_run() at /v/build/v_deps_build/seastar-prefix/src/seastar/src/core/reactor.cc:2938
operator() at /v/build/v_deps_build/seastar-prefix/src/seastar/src/core/reactor.cc:4163
 (inlined by) decltype ((static_cast<seastar::smp::configure(seastar::smp_options const&, seastar::reactor_options const&)::$_94&>({parm#1}))()) std::__1::__invoke<seastar::smp::configure(seastar::smp_options const&, seastar::reactor_options const&)::$_94&>(seastar::smp::configure(seastar::smp_options const&, seastar::reactor_options const&)::$_94&) at /vectorized/llvm/bin/../include/c++/v1/type_traits:3640
 (inlined by) void std::__1::__invoke_void_return_wrapper<void, true>::__call<seastar::smp::configure(seastar::smp_options const&, seastar::reactor_options const&)::$_94&>(seastar::smp::configure(seastar::smp_options const&, seastar::reactor_options const&)::$_94&) at /vectorized/llvm/bin/../include/c++/v1/__functional/invoke.h:61
 (inlined by) std::__1::__function::__alloc_func<seastar::smp::configure(seastar::smp_options const&, seastar::reactor_options const&)::$_94, std::__1::allocator<seastar::smp::configure(seastar::smp_options const&, seastar::reactor_options const&)::$_94>, void ()>::operator()() at /vectorized/llvm/bin/../include/c++/v1/__functional/function.h:180
 (inlined by) std::__1::__function::__func<seastar::smp::configure(seastar::smp_options const&, seastar::reactor_options const&)::$_94, std::__1::allocator<seastar::smp::configure(seastar::smp_options const&, seastar::reactor_options const&)::$_94>, void ()>::operator()() at /vectorized/llvm/bin/../include/c++/v1/__functional/function.h:354
std::__1::__function::__value_func<void ()>::operator()() const at /vectorized/llvm/bin/../include/c++/v1/__functional/function.h:507
 (inlined by) std::__1::function<void ()>::operator()() const at /vectorized/llvm/bin/../include/c++/v1/__functional/function.h:1184
 (inlined by) seastar::posix_thread::start_routine(void*) at /v/build/v_deps_build/seastar-prefix/src/seastar/src/core/posix.cc:60
addr2line: '/opt/redpanda/lib/libpthread.so.0': No such file
/opt/redpanda/lib/libpthread.so.0 0x9298 
addr2line: '/opt/redpanda/lib/libc.so.6': No such file
/opt/redpanda/lib/libc.so.6 0x1006a2 

@andrwng
Contributor

andrwng commented Jul 26, 2022

I have a CI failure with a different test that shows basically the same issue with aborted_transactions().

Module: rptest.tests.test_si_cache_space_leak
Class:  ShadowIndexingCacheSpaceLeakTest
Method: test_si_cache
Arguments:
{
  "concurrency": 2,
  "message_size": 10000,
  "num_messages": 100000,
  "num_read": 1000
}

https://ci-artifacts.dev.vectorized.cloud/redpanda/018237e0-b4cc-4849-a17f-432fe6110f8c/vbuild/ducktape/results/2022-07-26--001/report.html

Stacktrace: https://ci-artifacts.dev.vectorized.cloud/redpanda/018237e0-b4cc-4849-a17f-432fe6110f8c/vbuild/ducktape/results/2022-07-26--001/ShadowIndexingCacheSpaceLeakTest/test_si_cache/message_size=10000.num_messages=100000.num_read=1000.concurrency=2/168/RedpandaService-0-281473366681264/docker-rp-12/redpanda_backtrace.log

Backtrace:
[Backtrace #35]
void seastar::backtrace<seastar::backtrace_buffer::append_backtrace()::{lambda(seastar::frame)#1}>(seastar::backtrace_buffer::append_backtrace()::{lambda(seastar::frame)#1}&&) at /v/build/v_deps_build/seastar-prefix/src/seastar/include/seastar/util/backtrace.hh:59
 (inlined by) seastar::backtrace_buffer::append_backtrace() at /v/build/v_deps_build/seastar-prefix/src/seastar/src/core/reactor.cc:758
 (inlined by) seastar::print_with_backtrace(seastar::backtrace_buffer&, bool) at /v/build/v_deps_build/seastar-prefix/src/seastar/src/core/reactor.cc:788
seastar::print_with_backtrace(char const*, bool) at /v/build/v_deps_build/seastar-prefix/src/seastar/src/core/reactor.cc:800
 (inlined by) seastar::sigsegv_action() at /v/build/v_deps_build/seastar-prefix/src/seastar/src/core/reactor.cc:3685
 (inlined by) operator() at /v/build/v_deps_build/seastar-prefix/src/seastar/src/core/reactor.cc:3671
 (inlined by) __invoke at /v/build/v_deps_build/seastar-prefix/src/seastar/src/core/reactor.cc:3667
linux-vdso.so.1 0x667 
cloud_storage::remote_partition::aborted_transactions(cloud_storage::offset_range) [clone .resume] at remote_partition.cc:?
 (inlined by) absl::lts_20210324::container_internal::btree_iterator<absl::lts_20210324::container_internal::btree_node<absl::lts_20210324::container_internal::map_params<detail::base_named_type<long, model::model_offset_type, std::__1::integral_constant<bool, true> >, std::__1::variant<cloud_storage::remote_partition::offloaded_segment_state, std::__1::unique_ptr<cloud_storage::remote_partition::materialized_segment_state, std::__1::default_delete<cloud_storage::remote_partition::materialized_segment_state> > >, std::__1::less<detail::base_named_type<long, model::model_offset_type, std::__1::integral_constant<bool, true> > >, std::__1::allocator<std::__1::pair<detail::base_named_type<long, model::model_offset_type, std::__1::integral_constant<bool, true> > const, std::__1::variant<cloud_storage::remote_partition::offloaded_segment_state, std::__1::unique_ptr<cloud_storage::remote_partition::materialized_segment_state, std::__1::default_delete<cloud_storage::remote_partition::materialized_segment_state> > > > >, 256, false> >, std::__1::pair<detail::base_named_type<long, model::model_offset_type, std::__1::integral_constant<bool, true> > const, std::__1::variant<cloud_storage::remote_partition::offloaded_segment_state, std::__1::unique_ptr<cloud_storage::remote_partition::materialized_segment_state, std::__1::default_delete<cloud_storage::remote_partition::materialized_segment_state> > > >&, std::__1::pair<detail::base_named_type<long, model::model_offset_type, std::__1::integral_constant<bool, true> > const, std::__1::variant<cloud_storage::remote_partition::offloaded_segment_state, std::__1::unique_ptr<cloud_storage::remote_partition::materialized_segment_state, std::__1::default_delete<cloud_storage::remote_partition::materialized_segment_state> > > >*>::increment() at /vectorized/include/absl/container/internal/btree.h:985
 (inlined by) absl::lts_20210324::container_internal::btree_iterator<absl::lts_20210324::container_internal::btree_node<absl::lts_20210324::container_internal::map_params<detail::base_named_type<long, model::model_offset_type, std::__1::integral_constant<bool, true> >, std::__1::variant<cloud_storage::remote_partition::offloaded_segment_state, std::__1::unique_ptr<cloud_storage::remote_partition::materialized_segment_state, std::__1::default_delete<cloud_storage::remote_partition::materialized_segment_state> > >, std::__1::less<detail::base_named_type<long, model::model_offset_type, std::__1::integral_constant<bool, true> > >, std::__1::allocator<std::__1::pair<detail::base_named_type<long, model::model_offset_type, std::__1::integral_constant<bool, true> > const, std::__1::variant<cloud_storage::remote_partition::offloaded_segment_state, std::__1::unique_ptr<cloud_storage::remote_partition::materialized_segment_state, std::__1::default_delete<cloud_storage::remote_partition::materialized_segment_state> > > > >, 256, false> >, std::__1::pair<detail::base_named_type<long, model::model_offset_type, std::__1::integral_constant<bool, true> > const, std::__1::variant<cloud_storage::remote_partition::offloaded_segment_state, std::__1::unique_ptr<cloud_storage::remote_partition::materialized_segment_state, std::__1::default_delete<cloud_storage::remote_partition::materialized_segment_state> > > >&, std::__1::pair<detail::base_named_type<long, model::model_offset_type, std::__1::integral_constant<bool, true> > const, std::__1::variant<cloud_storage::remote_partition::offloaded_segment_state, std::__1::unique_ptr<cloud_storage::remote_partition::materialized_segment_state, std::__1::default_delete<cloud_storage::remote_partition::materialized_segment_state> > > >*>::operator++() at /vectorized/include/absl/container/internal/btree.h:1021
 (inlined by) absl::lts_20210324::container_internal::btree_iterator<absl::lts_20210324::container_internal::btree_node<absl::lts_20210324::container_internal::map_params<detail::base_named_type<long, model::model_offset_type, std::__1::integral_constant<bool, true> >, std::__1::variant<cloud_storage::remote_partition::offloaded_segment_state, std::__1::unique_ptr<cloud_storage::remote_partition::materialized_segment_state, std::__1::default_delete<cloud_storage::remote_partition::materialized_segment_state> > >, std::__1::less<detail::base_named_type<long, model::model_offset_type, std::__1::integral_constant<bool, true> > >, std::__1::allocator<std::__1::pair<detail::base_named_type<long, model::model_offset_type, std::__1::integral_constant<bool, true> > const, std::__1::variant<cloud_storage::remote_partition::offloaded_segment_state, std::__1::unique_ptr<cloud_storage::remote_partition::materialized_segment_state, std::__1::default_delete<cloud_storage::remote_partition::materialized_segment_state> > > > >, 256, false> >, std::__1::pair<detail::base_named_type<long, model::model_offset_type, std::__1::integral_constant<bool, true> > const, std::__1::variant<cloud_storage::remote_partition::offloaded_segment_state, std::__1::unique_ptr<cloud_storage::remote_partition::materialized_segment_state, std::__1::default_delete<cloud_storage::remote_partition::materialized_segment_state> > > >&, std::__1::pair<detail::base_named_type<long, model::model_offset_type, std::__1::integral_constant<bool, true> > const, std::__1::variant<cloud_storage::remote_partition::offloaded_segment_state, std::__1::unique_ptr<cloud_storage::remote_partition::materialized_segment_state, std::__1::default_delete<cloud_storage::remote_partition::materialized_segment_state> > > >*>::operator++(int) at /vectorized/include/absl/container/internal/btree.h:1030
 (inlined by) cloud_storage::remote_partition::aborted_transactions(cloud_storage::offset_range) at /var/lib/buildkite-agent/builds/arm64-xfs-builders-i-0f3d6ab2dd27de46d-1/redpanda/redpanda/vbuild/release/clang/../../../src/v/cloud_storage/remote_partition.cc:512
std::__1::coroutine_handle<seastar::internal::coroutine_traits_base<std::__1::vector<cluster::rm_stm::tx_range, std::__1::allocator<cluster::rm_stm::tx_range> > >::promise_type>::resume() const at /vectorized/llvm/bin/../include/c++/v1/__coroutine/coroutine_handle.h:168
 (inlined by) seastar::internal::coroutine_traits_base<std::__1::vector<cluster::rm_stm::tx_range, std::__1::allocator<cluster::rm_stm::tx_range> > >::promise_type::run_and_dispose() at /vectorized/include/seastar/core/coroutine.hh:78
seastar::reactor::run_tasks(seastar::reactor::task_queue&) at /v/build/v_deps_build/seastar-prefix/src/seastar/src/core/reactor.cc:2356
 (inlined by) seastar::reactor::run_some_tasks() at /v/build/v_deps_build/seastar-prefix/src/seastar/src/core/reactor.cc:2769
seastar::reactor::do_run() at /v/build/v_deps_build/seastar-prefix/src/seastar/src/core/reactor.cc:2938
seastar::reactor::run() at /v/build/v_deps_build/seastar-prefix/src/seastar/src/core/reactor.cc:2821
seastar::app_template::run_deprecated(int, char**, std::__1::function<void ()>&&) at /v/build/v_deps_build/seastar-prefix/src/seastar/src/core/app-template.cc:265
seastar::app_template::run(int, char**, std::__1::function<seastar::future<int> ()>&&) at /v/build/v_deps_build/seastar-prefix/src/seastar/src/core/app-template.cc:156
application::run(int, char**) at /var/lib/buildkite-agent/builds/arm64-xfs-builders-i-0f3d6ab2dd27de46d-1/redpanda/redpanda/vbuild/release/clang/../../../src/v/redpanda/application.cc:224
main at /var/lib/buildkite-agent/builds/arm64-xfs-builders-i-0f3d6ab2dd27de46d-1/redpanda/redpanda/vbuild/release/clang/../../../src/v/redpanda/main.cc:22
/var/lib/buildkite-agent/builds/arm64-xfs-builders-i-0f3d6ab2dd27de46d-1/redpanda/redpanda/vbuild/release/clang/dist/local/redpanda/lib/libc.so.6: ELF 64-bit LSB shared object, ARM aarch64, version 1 (GNU/Linux), dynamically linked, interpreter /lib/ld-linux-aarch64.so.1, BuildID[sha1]=5db67e386dfcfa6a2a3a0ee9f496d7336e243d64, for GNU/Linux 3.7.0, not stripped

@LenaAn
Contributor

LenaAn commented Jul 26, 2022

If I read that correctly, the crash happens when an absl::btree_map::iterator is incremented at src/v/cloud_storage/remote_partition.cc:512: for (auto it = first_it; it != _segments.end(); it++) {

Abseil doesn't specify what absl::btree_map::upper_bound returns when there are no elements greater than the target. Looking at https://github.com/abseil/abseil-cpp/blob/master/absl/container/internal/btree.h
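
For comparison, the C++ standard does specify this case for std::map: upper_bound returns end() when no key is greater than the target, so a loop of the form quoted above simply runs zero times. Here is a minimal sketch using std::map only. Whether absl::btree_map behaves identically is an assumption to verify against the Abseil sources, and keys_after is a hypothetical helper, not a Redpanda function.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <vector>

// Hypothetical helper: collect all keys strictly greater than `target`.
// With std::map, upper_bound() returns end() when no such key exists,
// so the loop body is never entered in that case.
inline std::vector<int64_t> keys_after(
  const std::map<int64_t, int>& segments, int64_t target) {
    std::vector<int64_t> keys;
    for (auto it = segments.upper_bound(target); it != segments.end(); ++it) {
        assert(it->first > target); // upper_bound never yields keys <= target
        keys.push_back(it->first);
    }
    return keys;
}
```

If the same holds for the btree map, the increment on its own should not be able to walk past the end, which would point to some other cause for the crash.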

@andrwng
Contributor

andrwng commented Jul 26, 2022

I haven't looked too deeply so maybe the suggestion doesn't make sense, but it also looks like we're susceptible to iterator invalidation when inserting and deleting from the map (see here), and it looks like aborted_transactions() will co_await as it iterates through _segments. Are we sure aborted_transactions() can't overlap with anything else?
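
To make the concern concrete, here is a minimal sketch that needs neither Seastar nor Abseil. A sorted std::vector stands in for the btree map (like a B-tree, it may relocate elements on insert, so iterators held across a mutation are invalid), and a plain callback stands in for the co_await suspension point. All names here (segment_map, insert_sorted, visit_with_reseek, on_suspend) are hypothetical and not taken from Redpanda. The loop remembers the last visited key instead of an iterator and re-seeks after each suspension.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <functional>
#include <utility>
#include <vector>

// Hypothetical stand-in for an ordered map whose iterators are invalidated
// by insertion: a vector of (key, value) pairs kept sorted by key.
using segment_map = std::vector<std::pair<int64_t, int>>;

// Insert while keeping the vector sorted. This may reallocate or shift
// elements, which invalidates previously obtained iterators.
inline void insert_sorted(segment_map& m, int64_t key, int value) {
    auto pos = std::lower_bound(m.begin(), m.end(), key,
      [](const auto& e, int64_t k) { return e.first < k; });
    m.emplace(pos, key, value);
}

// Visit every key >= start in order. `on_suspend` models a suspension point
// during which other code may mutate `m`. No iterator survives that call:
// the loop keeps only the key and looks up its successor afterwards.
inline std::vector<int64_t> visit_with_reseek(
  segment_map& m,
  int64_t start,
  const std::function<void(segment_map&)>& on_suspend) {
    std::vector<int64_t> visited;
    auto it = std::lower_bound(m.begin(), m.end(), start,
      [](const auto& e, int64_t k) { return e.first < k; });
    while (it != m.end()) {
        const int64_t key = it->first; // copy the key, not the iterator
        visited.push_back(key);
        on_suspend(m); // `it` must be treated as invalid from here on
        it = std::upper_bound(m.begin(), m.end(), key,
          [](int64_t k, const auto& e) { return k < e.first; });
        assert(it == m.end() || it->first > key);
    }
    return visited;
}
```

Re-seeking costs one extra lookup per step, but the traversal stays well defined even if the container is mutated while the loop is suspended. Keeping `it` and incrementing it after on_suspend would be undefined behaviour in this model.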

@LenaAn
Contributor

LenaAn commented Jul 27, 2022

@andrwng good catch! I see that _segments is updated in remote_partition::update_segments_incrementally, which is called from remote_partition::make_reader and remote_partition::start(). remote_partition::aborted_transactions is called from ntp_archiver::upload_tx, which runs in a loop in ntp_archiver::upload_loop.

So to reproduce this issue we need to run remote_partition::update_segments_incrementally() during remote_partition::aborted_transactions(). Will try to insert sleeps to reproduce the error.

@LenaAn
Contributor

LenaAn commented Jul 27, 2022

I was able to reproduce the crash by inserting sleep(10s) into the loop in remote_partition::aborted_transactions. Locally, EndToEndShadowIndexingTest is failing with a similar stack trace, see #5678

Ideas how to fix it:

  1. Use another data structure for _segments.
  • Just use std::map (it's sorted and has pointer stability).
  • We need an ordered map, so we can't use absl::flat_hash_map or
    absl::node_hash_map.
  2. Use a read/write mutex for _segments (I think that would be slower than just using std::map, but I can't prove it).
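
The pointer-stability argument for the std::map option above can be checked with a small sketch. The C++ standard guarantees that inserting into a std::map does not invalidate iterators or references to existing elements. The function name iterate_while_inserting is hypothetical, and the int payload stands in for the segment-state variant.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <vector>

// Walk a std::map while inserting new keys in the middle of the iteration.
// Each inserted key is smaller than the current one, so the walk stays finite
// and the new entries land behind the iterator.
inline std::vector<int64_t> iterate_while_inserting(
  std::map<int64_t, int>& segments) {
    std::vector<int64_t> visited;
    for (auto it = segments.begin(); it != segments.end(); ++it) {
        const int* value_addr = &it->second;
        visited.push_back(it->first);
        // Models another fiber adding a segment while this loop is suspended.
        segments.emplace(it->first - 1, 0);
        // The iterator and the element's address are unchanged by the insert.
        assert(value_addr == &it->second);
    }
    return visited;
}
```

Erasing the element an iterator currently points at still invalidates that iterator, so std::map covers concurrent inserts but not every concurrent removal.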

@Lazin WDYT?

@Lazin
Contributor

Lazin commented Jul 27, 2022

remote_partition uses a btree_stable_iterator wrapper that prevents this particular problem.

@Lazin
Contributor

Lazin commented Jul 27, 2022

Instead of using begin, end, and upper_bound of the _segments collection directly, you need to use the corresponding methods from remote_partition.

@jcsp jcsp changed the title SI Crash under load in FranzGoVerifiableWithSiTest.test_si_with_timeboxed SI Crash under load in FranzGoVerifiableWithSiTest.test_si_with_timeboxed,test_si_without_timeboxed Aug 1, 2022
@jcsp
Contributor Author

jcsp commented Aug 1, 2022

6 participants