storage: assertion on heavily overloaded slow drive #5433

Closed
jcsp opened this issue Jul 12, 2022 · 8 comments
Assignees
Labels
area/storage kind/bug Something isn't working known-issue sev/high loss of availability, pathological performance degradation, recoverable corruption

Comments

@jcsp
Contributor

jcsp commented Jul 12, 2022

This assertion in segment_appender::do_next_adaptive_fallocation

                 vassert(
                   _prev_head_write->available_units() == 1,
                   "Unexpected pending head write {}",
                   *this);

It triggered while I was running my development branch of ManyPartitionsTest on a cluster where I had accidentally broken the symlink from /mnt/vectorized/redpanda to /var/lib/redpanda, so I was inadvertently running on the slow root drive of an i3en.6xlarge node.

This suggests we have a bug here that only shows up when heavily hammering a very slow drive.

This ticket is light on details but exists to make sure we go back and re-test before releasing 22.2. Basically just take dev and the latest/greatest ManyPartitionsTest, blow away the symlink, and see if it crashes. It crashed twice in two runs, so it was pretty reproducible.

If it doesn't reproduce, close this ticket: I was running lots of whacky development branches of dubious provenance.

@mmedenjak mmedenjak added kind/bug Something isn't working area/storage labels Jul 15, 2022
@dotnwat
Member

dotnwat commented Jul 18, 2022

Thanks for reporting. Most previous assertions thrown in the segment appender have been the result of callers (raft and storage) violating the rules of the interface, so this is probably a good one to try to reproduce. There may still be a logic error internal to the segment appender, but I looked a little this morning and didn't see anything obvious.

I wonder if reducing the falloc size would help reproduce this, so that the fallocation code path is triggered more frequently (in addition to using the slow device, which would put pressure on the number of queued-up I/O requests).
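
A minimal sketch of that idea, assuming the same test helpers that appear in the reproducer further down this thread (make_segment_appender, open_file, make_random_data, config::mock_binding, temporary_dir); the exact signatures may differ, and the fallocation step value is chosen purely for illustration:

    // shrink the fallocation step so do_next_adaptive_fallocation runs on
    // nearly every sizable append (4_KiB chosen only for illustration)
    const size_t small_falloc_size = 4_KiB;

    storage::storage_resources resources(
      config::mock_binding<size_t>(small_falloc_size));

    temporary_dir dir("small_falloc_repro");
    auto fpath = dir.get_path() / "segment.dat";
    auto f = open_file(fpath.string());
    auto appender = make_segment_appender(f, resources);

    // each append is larger than the step, forcing frequent fallocations
    // while earlier writes may still be queued on the slow device
    for (int i = 0; i < 100; ++i) {
        appender.append(make_random_data(small_falloc_size * 2)).get();
    }
    appender.close().get();

The full reproducer below goes further and interleaves a backgrounded flush with the fallocating append, which is what actually exposes the race.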

@jcsp
Contributor Author

jcsp commented Oct 18, 2022

This happened on a CI run here (first time I've seen it in that context):
https://buildkite.com/redpanda/redpanda/builds/16833#0183e7b7-cf4b-4858-9f79-55a60b68c43c

It's a debug build on docker, so that lines up with the original observation of this happening when storage is very slow.

dotnwat added a commit to dotnwat/redpanda that referenced this issue Oct 19, 2022
Reported failure: redpanda-data#5433

Signed-off-by: Noah Watkins <noahwatkins@gmail.com>
@dotnwat
Member

dotnwat commented Oct 19, 2022

I've put together a reproducer that is live in PR #6818.

Right at the moment I'm not entirely sure what the best fix is.

Details

/*
 * Assumption about the segment appender:
 *
 *    usage of the appender api is exclusive: you cannot call append or
 *    truncate or any other interface until the future from any previous
 *    method completes.
 *
 * The exception to this rule is `flush`, which should be allowed to be invoked
 * asynchronously to append calls, even if an append has not completed:
 *
 *    commit 5668d7da3309d7bd2d03339f4067b64320cb3802
 *    Author: Noah Watkins <noah@vectorized.io>
 *    Date:   Thu Jun 3 15:18:33 2021 -0700
 *
 *        storage: support concurrent append and flushing
 *
 *        This patch adds support to the segment appender for allowing
 *        append() to be called before the future from a previous flush() is
 *        completed. This new functionality allows raft to pipeline appends
 *        while waiting on flushes to complete asynchronously.
 *
 * It would appear that there is a race condition in which an assertion fails,
 * and this race condition is rare. It requires that truncate or fallocation
 * run in parallel to a slow dma_write.
 *
 * The following test will reproduce the bug on DEBUG builds. Some strategic
 * sleeps may be required to make it reproduce on a RELEASE build.
 */
SEASTAR_THREAD_TEST_CASE(exclusive_concurrent_flushing_drains_all_io) {
    // create an empty file
    temporary_dir dir("exclusive_concurrent_flushing_drains_all_io");
    auto fpath = dir.get_path() / "segment.dat";
    auto f = open_file(fpath.string());

    // appender with 32_KiB fallocation size
    const int falloc_size = 32_KiB;
    storage::storage_resources resources(
      config::mock_binding<size_t>(falloc_size));
    auto appender = make_segment_appender(f, resources);

    const iobuf one_byte = make_random_data(1);
    const iobuf many_bytes = make_random_data(falloc_size * 2);
    std::vector<ss::future<>> bg_append;
    std::vector<ss::future<>> bg_flush;

    /*
     * we want to be able to control when fallocation occurs, and it always
     * occurs on the first append. so let's get that out of the way.
     *
     * after this call the head write-ahead buffer contains 1 byte.
     */
    appender.append(one_byte).get();

    /*
     * FLUSH-A
     *
     * next we want to trigger a write by calling flush. since the head buffer
     * contains 1 byte of dirty data a background write will be started via
     * dispatch_background_head_write. here it is, with just the important parts:
     *
     *    ss::future<> segment_appender::flush() {
     *
     *       void segment_appender::dispatch_background_head_write() {
     *
     *           // NOT FULL
     *           const auto full = _head->is_full();
     *
     *           // PREV UNITS
     *           auto prev = _prev_head_write;
     *           auto units = ss::get_units(*prev, 1);
     *
     *           // BACKGROUND WRITE
     *           (void)ss::with_semaphore(
     *             _concurrent_flushes,
     *             1,
     *             [units = std::move(units), full]() mutable {
     *                 return units
     *                   .then([this, h, w, start_offset, expected, src, full](
     *                           ssx::semaphore_units u) mutable {
     *                       return _out
     *                         .dma_write(
     *                           start_offset, src, expected, _opts.priority)
     *
     *
     *           // PREV HEAD _NOT_ RESET
     *           if (full) {
     *               _prev_head_write = ss::make_lw_shared<ssx::semaphore>(
     *                 1, head_sem_name);
     *           }
     *
     * so in words, at this point there is a background write that is holding 1
     * unit of _concurrent_flushes, and 1 unit of _prev_head_write. finally,
     * since the head buffer was not full, _prev_head_write is not reset to a
     * fresh semaphore before exiting.
     *
     * Note that the flush is started asynchronously in the background, which
     * is allowed according to the assumptions stated above.
     */
    bg_flush.push_back(appender.flush());

    /*
     * we want to be able to trigger another background write.  if we called
     * flush again now it would be a noop since the previous dirty data is now
     * in flight to the disk. so let's add another byte to make the head buffer
     * dirty again.
     */
    appender.append(one_byte).get();

    /*
     * here is the first part of where the race happens. we are now going to
     * start a new append that is large enough to trigger the fallocation path,
     * and we are going to background the append.
     *
     * this usage is still correct according to the assumptions stated above:
     * there is 1 inflight flush and all previous appends have completed.
     *
     * when append is called the need for fallocation is checked. if a
     * fallocation is needed it is done prior to any remaining data in the
     * append call being written to the current head buffer.
     *
     *    ss::future<> segment_appender::append(const iobuf& io) {
     *
     *       ss::future<> segment_appender::do_next_adaptive_fallocation() {
     *
     *          return ss::with_semaphore(
     *                   _concurrent_flushes,
     *                   ss::semaphore::max_counter(),
     *                   [this, step]() mutable {
     *                       vassert(
     *                         _prev_head_write->available_units() == 1,
     *                         "Unexpected pending head write {}",
     *                         *this);
     *
     * at this point the append we just started should be blocked on
     * _concurrent_flushes because it is trying to grab exclusive control
     * (max_counter) AND the background write started from FLUSH-A above is
     * holding 1 unit of this semaphore.
     */
    bg_append.push_back(appender.append(many_bytes));

    /*
     * currently there is 1 flush and 1 append in flight.
     */
    BOOST_REQUIRE(bg_append.size() == 1);
    BOOST_REQUIRE(bg_flush.size() == 1);

    /*
     * now for the final piece of the race. we are going to call flush once
     * more. recall that we have 1 byte of dirty data in the head buffer and
     * many bytes that are in an inflight append, but waiting on fallocate.
     *
     * recall from above what happens when flush starts the background write:
     *
     *    ss::future<> segment_appender::flush() {
     *       void segment_appender::dispatch_background_head_write() {
     *       ...
     *
     *
     * the background write grabs 1 unit of _concurrent_flushes and 1 unit of
     * _prev_head_write. since the head buffer is NOT full, _prev_head_write is
     * not reset to a new semaphore before returning.
     */
    bg_flush.push_back(appender.flush());
    BOOST_REQUIRE(bg_append.size() == 1);
    BOOST_REQUIRE(bg_flush.size() == 2);

    /*
     * now let the race complete.
     *
     * when the background write from the original FLUSH-A completes and
     * releases its unit of _concurrent_flushes and _prev_head_write two things
     * will happen.
     *
     * first, the next waiter on _concurrent_flushes will be woken up. this is
     * the fallocation call from the last append that is in the background.
     *
     * second, _prev_head_write units will be released and control of the mutex
     * will be given to the next waiter, which is the background write scheduled
     * from the immediately preceding flush.
     *
     * when fallocate is scheduled holding exclusive access to concurrent
     * flushes it will find that this assertion doesn't hold because there is a
     * new owner of the current _prev_head_write semaphore.
     *
     *    vassert(
     *      _prev_head_write->available_units() == 1,
     *      "Unexpected pending head write {}",
     *      *this);
     */
    ss::when_all_succeed(
      ss::when_all_succeed(bg_append.begin(), bg_append.end()),
      ss::when_all_succeed(bg_flush.begin(), bg_flush.end()))
      .get();

    appender.close().get();
}

@dotnwat
Member

dotnwat commented Nov 28, 2022

we've only seen this failure in CI, but it looks like it could also be possible in production.

@piyushredpanda
Contributor

@dotnwat or @jcsp: does this still need some love? If so, which team does it go to so we can bucket/plan?

@dotnwat
Member

dotnwat commented Dec 13, 2022

It would probably make a good bug for someone from the storage team to work on, given there is a reproducer. Happy to help out. Rather surprised we never hit it outside CI.

@jcsp jcsp added the sev/high loss of availability, pathological performance degradation, recoverable corruption label Dec 13, 2022
@piyushredpanda piyushredpanda added sev/high loss of availability, pathological performance degradation, recoverable corruption and removed sev/high loss of availability, pathological performance degradation, recoverable corruption labels Apr 16, 2023
@andrwng andrwng self-assigned this Jun 13, 2023
@travisdowns
Member

I'm preparing a fix for this as part of #12473.

travisdowns added a commit to travisdowns/redpanda that referenced this issue Aug 16, 2023
In the segment appender we assert in a few places that a particular
semaphore has 1 unit available, which is a proxy for "no in flight
writes on the last chunk". For example, when we grab max units from
the _concurrent_flushes (_cf) semaphore, we expect there to be no
concurrent writers, since other writers also obtain at least 1 unit
from this semaphore before dispatching a write.

In its current state this assert can fire. Issue redpanda-data#5433 describes one
way this can happen in detail, but in summary examining the units can
give false positives here: the assert may fail (0 units available) but
there are no concurrent writers because writers also try to obtain the
_cf semaphore after the _phw semaphore and may be safely blocked there.

This change tracks dispatched writes explicitly: incrementing a counter
when we dispatch a write and decrementing it after. The counter can
be incremented in the "right" place, immediately before actually
calling dma_write and after obtaining the _cf semaphore, avoiding
the problem above.

Additionally, the check is broader in that it applies to all prior
chunks, not just the most recent one. The existing check would naturally
want to check this broader condition as well, but there was no
convenient way with the way state was tracked by the appender.

In a sense, the _inflight_dispatched counter is redundant: it could be
reconstructed simply by walking the _inflight array and counting the
number of writes in DISPATCHED status. However, I am concerned that
this array could grow very large and so take a long time to traverse,
so I opted to maintain an explicit counter.

Fixes redpanda-data#5433.
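
For readers skimming the thread, here is a simplified, self-contained sketch of the counter idea described in the commit message above. The names mirror the ones mentioned there (_inflight_dispatched, _concurrent_flushes, dma_write), but the real change also tracks per-chunk state in the appender's _inflight array, so treat this only as an illustration of the technique, not the actual patch:

    #include <cassert>
    #include <cstdint>

    // Toy model of the fix: "no in-flight writes" is decided by an explicit
    // counter bumped around the actual write call, rather than by inspecting
    // the available units of a semaphore that waiters may already be queued on.
    struct appender_model {
        int64_t inflight_dispatched = 0; // writes actually issued to the disk

        // dispatch_background_head_write would call this immediately before
        // dma_write, after its _concurrent_flushes unit has been obtained
        void write_dispatched() { ++inflight_dispatched; }

        // ...and this when the dma_write future resolves
        void write_completed() { --inflight_dispatched; }

        // do_next_adaptive_fallocation holds every _concurrent_flushes unit,
        // so no new write can be dispatched while it runs; the counter being
        // zero is therefore a reliable check, unlike _prev_head_write's unit
        // count, which can read 0 even though the "writer" holding it is
        // still safely parked waiting on _concurrent_flushes.
        void assert_no_dispatched_writes() const {
            assert(inflight_dispatched == 0);
        }
    };

    int main() {
        appender_model a;
        a.write_dispatched();
        a.write_completed();
        a.assert_no_dispatched_writes(); // passes: nothing is in flight
    }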
travisdowns added a commit to travisdowns/redpanda that referenced this issue Aug 17, 2023
(Same commit message as the Aug 16 commit above.)
StephanDollberg pushed a commit that referenced this issue Aug 30, 2023
(Same commit message as above.)

(cherry picked from commit ea423ea)