storage: assertion on heavily overloaded slow drive #5433

Closed
jcsp opened this issue Jul 12, 2022 · 8 comments
Assignees
Labels
area/storage kind/bug Something isn't working known-issue sev/high loss of availability, pathological performance degradation, recoverable corruption

Comments

@jcsp
Contributor

jcsp commented Jul 12, 2022

This assertion in segment_appender::do_next_adaptive_fallocation

                 vassert(
                   _prev_head_write->available_units() == 1,
                   "Unexpected pending head write {}",
                   *this);

It triggered while I was running my development branch of ManyPartitionsTest on a cluster where I had accidentally broken the symlink from /mnt/vectorized/redpanda to /var/lib/redpanda, so I was inadvertently running on the slow root drive of an i3en.6xlarge node.

This suggests we have a bug here that only shows up when heavily hammering a very slow drive.

This ticket is light on details but exists to make sure we go back and re-test before releasing 22.2. Basically just take dev and the latest/greatest ManyPartitionsTest, blow away the symlink, and see if it crashes. It crashed twice in two runs, so it was pretty reproducible.

If it doesn't reproduce, close this ticket: I was running lots of whacky development branches of dubious provenance.

@mmedenjak mmedenjak added kind/bug Something isn't working area/storage labels Jul 15, 2022
@dotnwat
Member

dotnwat commented Jul 18, 2022

Thanks for reporting. Most previous assertions thrown in the segment appender have been the result of callers (raft and storage) violating the rules of the interface, so this is probably a good one to try to reproduce. There may still be a logic error internal to the segment appender, but I looked a little this morning and didn't see anything obvious.

I wonder if reducing the falloc size would help reproduce this, so that the fallocation code path is triggered more frequently (in addition to using the slow device, which would put pressure on the number of queued-up I/O requests).
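
A minimal sketch of that idea, assuming the same test helpers that appear in the reproducer further down this thread (make_segment_appender, open_file, make_random_data, config::mock_binding, temporary_dir); the exact signatures may differ, and the fallocation step value is chosen purely for illustration:

    // shrink the fallocation step so do_next_adaptive_fallocation runs on
    // nearly every sizable append (4_KiB chosen only for illustration)
    const size_t small_falloc_size = 4_KiB;

    storage::storage_resources resources(
      config::mock_binding<size_t>(small_falloc_size));

    temporary_dir dir("small_falloc_repro");
    auto fpath = dir.get_path() / "segment.dat";
    auto f = open_file(fpath.string());
    auto appender = make_segment_appender(f, resources);

    // each append is larger than the step, forcing frequent fallocations
    // while earlier writes may still be queued on the slow device
    for (int i = 0; i < 100; ++i) {
        appender.append(make_random_data(small_falloc_size * 2)).get();
    }
    appender.close().get();

The full reproducer below goes further and interleaves a backgrounded flush with the fallocating append, which is what actually exposes the race.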

@jcsp
Contributor Author

jcsp commented Oct 18, 2022

This happened on a CI run here (first time I've seen it in that context):
https://buildkite.com/redpanda/redpanda/builds/16833#0183e7b7-cf4b-4858-9f79-55a60b68c43c

It's a debug build on docker, so that lines up with the original observation of this happening when storage is very slow.

dotnwat added a commit to dotnwat/redpanda that referenced this issue Oct 19, 2022
Reported failure: redpanda-data#5433

Signed-off-by: Noah Watkins <noahwatkins@gmail.com>
@dotnwat
Member

dotnwat commented Oct 19, 2022

I've put together a reproducer that is live in PR #6818.

Right at the moment I'm not entirely sure what the best fix is.

Details

/*
 * Assumption about the segment appender:
 *
 *    usage of the appender api is exclusive: you cannot call append or
 *    truncate or any other interface until the future from any previous
 *    method completes.
 *
 * The exception to this rule is `flush`, which should be allowed to be invoked
 * asynchronously to append calls, even if an append has not completed:
 *
 *    commit 5668d7da3309d7bd2d03339f4067b64320cb3802
 *    Author: Noah Watkins <noah@vectorized.io>
 *    Date:   Thu Jun 3 15:18:33 2021 -0700
 *
 *        storage: support concurrent append and flushing
 *
 *        This patch adds support to the segment appender for allowing
 *        append() to be called before the future from a previous flush() is
 *        completed. This new functionality allows raft to pipeline appends
 *        while waiting on flushes to complete asynchronously.
 *
 * It would appear that there is a race condition in which an assertion fails,
 * and this race condition is rare. It requires that truncate or fallocation
 * run in parallel to a slow dma_write.
 *
 * The following test will reproduce the bug on DEBUG builds. Some strategic
 * sleeps may be required to make it reproduce on a RELEASE build.
 */
SEASTAR_THREAD_TEST_CASE(exclusive_concurrent_flushing_drains_all_io) {
    // create an empty file
    temporary_dir dir("exclusive_concurrent_flushing_drains_all_io");
    auto fpath = dir.get_path() / "segment.dat";
    auto f = open_file(fpath.string());

    // appender with 32_KiB fallocation size
    const int falloc_size = 32_KiB;
    storage::storage_resources resources(
      config::mock_binding<size_t>(falloc_size));
    auto appender = make_segment_appender(f, resources);

    const iobuf one_byte = make_random_data(1);
    const iobuf many_bytes = make_random_data(falloc_size * 2);
    std::vector<ss::future<>> bg_append;
    std::vector<ss::future<>> bg_flush;

    /*
     * we want to be able to control when fallocation occurs, and it always
     * occurs on the first append. so let's get that out of the way.
     *
     * after this call the head write-ahead buffer contains 1 byte.
     */
    appender.append(one_byte).get();

    /*
     * FLUSH-A
     *
     * next we want to trigger a write by calling flush. since the head buffer
     * contains 1 byte of dirty data a background write will be started via
     * dispatch_background_head_write. here it is, with just the important parts:
     *
     *    ss::future<> segment_appender::flush() {
     *
     *       void segment_appender::dispatch_background_head_write() {
     *
     *           // NOT FULL
     *           const auto full = _head->is_full();
     *
     *           // PREV UNITS
     *           auto prev = _prev_head_write;
     *           auto units = ss::get_units(*prev, 1);
     *
     *           // BACKGROUND WRITE
     *           (void)ss::with_semaphore(
     *             _concurrent_flushes,
     *             1,
     *             [units = std::move(units), full]() mutable {
     *                 return units
     *                   .then([this, h, w, start_offset, expected, src, full](
     *                           ssx::semaphore_units u) mutable {
     *                       return _out
     *                         .dma_write(
     *                           start_offset, src, expected, _opts.priority)
     *
     *
     *           // PREV HEAD _NOT_ RESET
     *           if (full) {
     *               _prev_head_write = ss::make_lw_shared<ssx::semaphore>(
     *                 1, head_sem_name);
     *           }
     *
     * so in words, at this point there is a background write that is holding 1
     * unit of _concurrent_flushes, and 1 unit of _prev_head_write. finally,
     * since the head buffer was not full, _prev_head_write is not reset to a
     * fresh semaphore before exiting.
     *
     * Note that the flush is started asynchronously in the background, which
     * is allowed according to the assumptions stated above.
     */
    bg_flush.push_back(appender.flush());

    /*
     * we want to be able to trigger another background write.  if we called
     * flush again now it would be a noop since the previous dirty data is now
     * in flight to the disk. so let's add another byte to make the head buffer
     * dirty again.
     */
    appender.append(one_byte).get();

    /*
     * here is the first part of where the race happens. we are now going to
     * start a new append that is large enough to trigger the fallocation path,
     * and we are going to background the append.
     *
     * this usage is still correct according to the assumptions stated above:
     * there is 1 inflight flush and all previous appends have completed.
     *
     * when append is called the need for fallocation is checked. if a
     * fallocation is needed it is done prior to any remaining data in the
     * append call being written to the current head buffer.
     *
     *    ss::future<> segment_appender::append(const iobuf& io) {
     *
     *       ss::future<> segment_appender::do_next_adaptive_fallocation() {
     *
     *          return ss::with_semaphore(
     *                   _concurrent_flushes,
     *                   ss::semaphore::max_counter(),
     *                   [this, step]() mutable {
     *                       vassert(
     *                         _prev_head_write->available_units() == 1,
     *                         "Unexpected pending head write {}",
     *                         *this);
     *
     * at this point the append we just started should be blocked on
     * _concurrent_flushes because it is trying to grab exclusive control
     * (max_counter) AND the background write started from FLUSH-A above is
     * holding 1 unit of this semaphore.
     */
    bg_append.push_back(appender.append(many_bytes));

    /*
     * currently there is 1 flush and 1 append in flight.
     */
    BOOST_REQUIRE(bg_append.size() == 1);
    BOOST_REQUIRE(bg_flush.size() == 1);

    /*
     * now for the final piece of the race. we are going to call flush once
     * more. recall that we have 1 byte of dirty data in the head buffer and
     * many bytes that are in an inflight append, but waiting on fallocate.
     *
     * recall from above what happens when flush starts the background write:
     *
     *    ss::future<> segment_appender::flush() {
     *       void segment_appender::dispatch_background_head_write() {
     *       ...
     *
     *
     * the background write grabs 1 unit of _concurrent_flushes and 1 unit of
     * _prev_head_write. since the head buffer is NOT full, _prev_head_write is
     * not reset to a new semaphore before returning.
     */
    bg_flush.push_back(appender.flush());
    BOOST_REQUIRE(bg_append.size() == 1);
    BOOST_REQUIRE(bg_flush.size() == 2);

    /*
     * now let the race complete.
     *
     * when the background write from the original FLUSH-A completes and
     * releases its unit of _concurrent_flushes and _prev_head_write two things
     * will happen.
     *
     * first, the next waiter on _concurrent_flushes will be woken up. this is
     * the fallocation call from the last append that is in the background.
     *
     * second, _prev_head_write units will be released and control of the mutex
     * will be given to the next waiter, which is the background write scheduled
     * from the immediately preceding flush.
     *
     * when fallocate is scheduled holding exclusive access to concurrent
     * flushes it will find that this assertion doesn't hold because there is a
     * new owner of the current _prev_head_write semaphore.
     *
     *    vassert(
     *      _prev_head_write->available_units() == 1,
     *      "Unexpected pending head write {}",
     *      *this);
     */
    ss::when_all_succeed(
      ss::when_all_succeed(bg_append.begin(), bg_append.end()),
      ss::when_all_succeed(bg_flush.begin(), bg_flush.end()))
      .get();

    appender.close().get();
}

@dotnwat
Member

dotnwat commented Nov 28, 2022

we've only seen this failure in CI, but it looks like it could also be possible in production.

@piyushredpanda
Contributor

@dotnwat or @jcsp: does this still need some love? If so, which team does it go to so we can bucket/plan?

@dotnwat
Member

dotnwat commented Dec 13, 2022

It would probably make a good bug for someone from the storage team to work on, given there is a reproducer. Happy to help out. Rather surprised we never hit it outside CI.

@jcsp jcsp added the sev/high loss of availability, pathological performance degradation, recoverable corruption label Dec 13, 2022
@piyushredpanda piyushredpanda added sev/high loss of availability, pathological performance degradation, recoverable corruption and removed sev/high loss of availability, pathological performance degradation, recoverable corruption labels Apr 16, 2023
@andrwng andrwng self-assigned this Jun 13, 2023
@travisdowns
Member

I'm preparing a fix for this as part of #12473.

travisdowns added a commit to travisdowns/redpanda that referenced this issue Aug 16, 2023
In the segment appender we assert in a few places that a particular
semaphore has 1 unit available, which is a proxy for "no in flight
writes on the last chunk". For example, when we grab max units from
the _concurrent_flushes (_cf) semaphore, we expect there to be no
concurrent writers, since other writers also obtain at least 1 unit
from this semaphore before dispatching a write.

In its current state this assert can fire. Issue redpanda-data#5433 describes one
way this can happen in detail, but in summary examining the units can
give false positives here: the assert may fail (0 units available) but
there are no concurrent writers because writers also try to obtain the
_cf semaphore after the _phw semaphore and may be safely blocked there.

This change tracks dispatched writes explicitly: incrementing a counter
when we dispatch a write and decrementing it after. The counter can
be incremented in the "right" place, immediately before actually
calling dma_write and after obtaining the _cf semaphore, avoiding
the problem above.

Additionally, the check is broader in that it applies to all prior
chunks, not just the most recent one. The existing check would naturally
want to check this broader condition as well, but there was no
convenient way with the way state was tracked by the appender.

In a sense, the _inflight_dispatched counter is redundant: it could be
reconstructed simply by walking the _inflight array and counting the
number of writes in DISPATCHED status. However, I am concerned that
this array could grow very large and so take a long time to traverse,
so I opted to maintain an explicit counter.

Fixes redpanda-data#5433.
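
For readers skimming the thread, here is a simplified, self-contained sketch of the counter idea described in the commit message above. The names mirror the ones mentioned there (_inflight_dispatched, _concurrent_flushes, dma_write), but the real change also tracks per-chunk state in the appender's _inflight array, so treat this only as an illustration of the technique, not the actual patch:

    #include <cassert>
    #include <cstdint>

    // Toy model of the fix: "no in-flight writes" is decided by an explicit
    // counter bumped around the actual write call, rather than by inspecting
    // the available units of a semaphore that waiters may already be queued on.
    struct appender_model {
        int64_t inflight_dispatched = 0; // writes actually issued to the disk

        // dispatch_background_head_write would call this immediately before
        // dma_write, after its _concurrent_flushes unit has been obtained
        void write_dispatched() { ++inflight_dispatched; }

        // ...and this when the dma_write future resolves
        void write_completed() { --inflight_dispatched; }

        // do_next_adaptive_fallocation holds every _concurrent_flushes unit,
        // so no new write can be dispatched while it runs; the counter being
        // zero is therefore a reliable check, unlike _prev_head_write's unit
        // count, which can read 0 even though the "writer" holding it is
        // still safely parked waiting on _concurrent_flushes.
        void assert_no_dispatched_writes() const {
            assert(inflight_dispatched == 0);
        }
    };

    int main() {
        appender_model a;
        a.write_dispatched();
        a.write_completed();
        a.assert_no_dispatched_writes(); // passes: nothing is in flight
    }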
travisdowns added a commit to travisdowns/redpanda that referenced this issue Aug 17, 2023
(Same commit message as the Aug 16 commit above.)
StephanDollberg pushed a commit that referenced this issue Aug 30, 2023
(Same commit message as above.)

(cherry picked from commit ea423ea)