-
Notifications
You must be signed in to change notification settings - Fork 577
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
storage: assertion on heavily overloaded slow drive #5433
Comments
Thanks for reporting. Most previous assertions thrown in segment appender have been a result of callers (raft and storage) violating the rules of the interface. So this probably a good one to try to reproduce. There may still be a logic error internal to the segment appender but i looked a little bit this morning and didn't see anything obvious. I wonder if reducing the falloc size would help reproduce so that that code path is triggered more frequently (in addition to using the slow device which would put pressure on the number of queued up i/o requests). |
This happened on a CI run here (first time I've seen it in that context): It's a debug build on docker, so that lines up with the original observation of this happening when storage is very slow. |
Reported failure: redpanda-data#5433 Signed-off-by: Noah Watkins <noahwatkins@gmail.com>
I've put together a reproducer that is live in PR #6818. Right at the moment I'm not entirely sure what the best fix is. Details/*
* Assumption about the segment appender:
*
* usage of the appender api is exclusive: you cannot call append or truncate
* or any other interface until the future from any previous method
* completes.
*
* The exception to this rule is `flush`, which should be allowed to be invoked
* asynchronously to append calls, even if an append has not completed:
*
* commit 5668d7da3309d7bd2d03339f4067b64320cb3802
* Author: Noah Watkins <noah@vectorized.io>
* Date: Thu Jun 3 15:18:33 2021 -0700
*
* storage: support concurrent append and flushing
*
* This patch adds support to the segment appender for allowing append()
* to be called before the future from a previous flush() is completed. This new
* functionality allows raft to pipeline appends while while waiting on flushes
* to complete asynchronously.
*
* It would appear that there is a race condition in which an assertion fails,
* and this race condition is rare. It requires that truncate or fallocation run
* in parallel to a slow dma_write.
*
* The following test will reproduce the bug on DEBUG builds. Some strategic
* sleeps may be required to make it reproduce on a RELEASE build.
*/
SEASTAR_THREAD_TEST_CASE(exclusive_concurrent_flushing_drains_all_io) {
// create an empty file
temporary_dir dir("exclusive_concurrent_flushing_drains_all_io");
auto fpath = dir.get_path() / "segment.dat";
auto f = open_file(fpath.string());
// appender with 32_KiB fallocation size
const int falloc_size = 32_KiB;
storage::storage_resources resources(
config::mock_binding<size_t>(falloc_size));
auto appender = make_segment_appender(f, resources);
const iobuf one_byte = make_random_data(1);
const iobuf many_bytes = make_random_data(falloc_size * 2);
std::vector<ss::future<>> bg_append;
std::vector<ss::future<>> bg_flush;
/*
* we want to be able to control when fallocation occurs, and it always
* occurs on the first append. so let's get that out of the way.
*
* after this call the head write-ahead buffer contains 1 byte.
*/
appender.append(one_byte).get();
/*
* FLUSH-A
*
* next we want to trigger a write by calling flush. since the head buffer
* contains 1 byte of dirty data a background write will be started via
* dispatch_background_head_write. here that is with the important parts:
*
* ss::future<> segment_appender::flush() {
*
* void segment_appender::dispatch_background_head_write() {
*
* // NOT FULL
* const auto full = _head->is_full();
*
* // PREV UNITS
* auto prev = _prev_head_write;
* auto units = ss::get_units(*prev, 1);
*
* // BACKGROUND WRITE
* (void)ss::with_semaphore(
* _concurrent_flushes,
* 1,
* [units = std::move(units), full]() mutable {
* return units
* .then([this, h, w, start_offset, expected, src, full](
* ssx::semaphore_units u) mutable {
* return _out
* .dma_write(start_offset, src, expected,
* _opts.priority)
*
*
* // PREV HEAD _NOT_ RESET
* if (full) {
* _prev_head_write = ss::make_lw_shared<ssx::semaphore>(1,
* head_sem_name);
* }
*
* so in words, at this point there is a background write that is holding 1
* unit of _concurrent_flushes, and 1 unit of _prev_head_write. finally,
* since the head buffer was not full, it is also not reset to a fresh mutex
* before exiting.
*
* Note that the flush is start asynchronously in the background which is
* allowed according to the assumptions stated above.
*/
bg_flush.push_back(appender.flush());
/*
* we want to be able to trigger another background write. if we called
* flush again now it would be a noop since the previous dirty data is now
* in flight ot the disk. so let's add another byte to make the head buffer
* dirty again.
*/
appender.append(one_byte).get();
/*
* here is the first part of where the race happens. we are going to now to
* start a new append that is large enough to trigger the fallocation path,
* and we are going to background the append.
*
* this usage is still is still correct according to the assumptions stated
* above: there is 1 inflight flush and all previous appends have completed.
*
* when append is a called the need for fallocation is checked. if an
* fallocation is needed it is done prior to any remaining data in the
* append call being written to the current head buffer.
*
* ss::future<> segment_appender::append(const iobuf& io) {
*
* ss::future<> segment_appender::do_next_adaptive_fallocation() {
*
* return ss::with_semaphore(
* _concurrent_flushes,
* ss::semaphore::max_counter(),
* [this, step]() mutable {
* vassert(
* _prev_head_write->available_units() == 1,
* "Unexpected pending head write {}",
* *this);
*
* at this point the append we just started should be blocked on
* _concurrent_flushes because it is trying grab exclusive control
* (max_counter) AND the background write started from FLUSH-A above is
* holding 1 unit of this semaphore.
*/
bg_append.push_back(appender.append(many_bytes));
/*
* currently there is 1 flush and 1 append in flight.
*/
BOOST_REQUIRE(bg_append.size() == 1);
BOOST_REQUIRE(bg_flush.size() == 1);
/*
* now for the final piece of the race. we are going to call flush once
* more. recall that we have 1 byte of dirty data in the head buffer and
* many bytes that are in an inflight append, but waiting on fallocate.
*
* recall from above what happens when flush starts the background write:
*
* ss::future<> segment_appender::flush() {
* void segment_appender::dispatch_background_head_write() {
* ...
*
*
* the background write grabs 1 unit of _concurrent_flushes and 1 unit of
* _prev_head_write. since the head buffer is NOT full, it is not reset to a
* new semaphore before returning.
*/
bg_flush.push_back(appender.flush());
BOOST_REQUIRE(bg_append.size() == 1);
BOOST_REQUIRE(bg_flush.size() == 2);
/*
* now let the race complete.
*
* when the background write from the original FLUSH-A completes and
* releases its unit of _concurrent_flushes and _prev_head_write two things
* will happen.
*
* first, the next waiter on _concurrent_flushes will be woken up. this is
* the fallocation call from the last append that is in the background.
*
* second, _prev_head_write units will be released and control of the mutex
* will be given to the next waiter which is the background write scheduled
* from the immediately preceeding flush.
*
* when fallocate is scheduled holding exclusive access to concurrent
* flushes it will find that this assertion doesn't hold because there is a
* new owner of the current _prev_head_write semaphore.
*
* vassert(
* _prev_head_write->available_units() == 1,
* "Unexpected pending head write {}",
* *this);
*/
ss::when_all_succeed(
ss::when_all_succeed(bg_append.begin(), bg_append.end()),
ss::when_all_succeed(bg_flush.begin(), bg_flush.end()))
.get();
appender.close().get();
} |
we've only seen this failure in CI, but it looks like it could also be possible in production. |
It would probably make a good bug for someone from storage team to work on, given there is a reproducer. Happy to help out. Rather surprised we never hit it outside CI. |
I'm preparing a fix for this as part of #12473. |
In the segment appender we assert in a few places that a particular semaphore has 1 unit available, which is a proxy for "no in flight writes on the last chunk". For example, when we grab max units from the _concurrent_flushes (_cf) semaphore, we expect there to be no concurrent writers, since other writers also obtain at least 1 unit from this semaphore before dispatching a write. In its current state this assert can fire. Issue redpanda-data#5433 describes one way this can happen in detail, but in summary examining the units can give false positives here: the assert may fail (0 units available) but there are no concurrent writers because writers also try to obtain the _cf semaphore after the _phw semaphore and may be safely blocked there. This change tracks dispatched writes explicitly: incrementing a counter when we dispatch a write and decrementing it after. The counter can be incremented in the "right" place, immediately before actually calling dma_write and after obtaining the _cf semaphore, avoiding the problem above. Additionally, the check is broader in that it applies to all prior chunks, not just the most recent one. The existing check would naturally want to check this broader condition as well, but there was no convenient way with the way state was tracked by the appender. In a sense, the _inflight_dispatched counter is redundant: it could be reconstructed simply by walking the _inflight array and counting the number of writes in DISPATCHED status. However, I am concerned that this array could grow very large and so take a long time to traverse, so I opted to maintain an explicit counter. Fixes redpanda-data#5433.
In the segment appender we assert in a few places that a particular semaphore has 1 unit available, which is a proxy for "no in flight writes on the last chunk". For example, when we grab max units from the _concurrent_flushes (_cf) semaphore, we expect there to be no concurrent writers, since other writers also obtain at least 1 unit from this semaphore before dispatching a write. In its current state this assert can fire. Issue redpanda-data#5433 describes one way this can happen in detail, but in summary examining the units can give false positives here: the assert may fail (0 units available) but there are no concurrent writers because writers also try to obtain the _cf semaphore after the _phw semaphore and may be safely blocked there. This change tracks dispatched writes explicitly: incrementing a counter when we dispatch a write and decrementing it after. The counter can be incremented in the "right" place, immediately before actually calling dma_write and after obtaining the _cf semaphore, avoiding the problem above. Additionally, the check is broader in that it applies to all prior chunks, not just the most recent one. The existing check would naturally want to check this broader condition as well, but there was no convenient way with the way state was tracked by the appender. In a sense, the _inflight_dispatched counter is redundant: it could be reconstructed simply by walking the _inflight array and counting the number of writes in DISPATCHED status. However, I am concerned that this array could grow very large and so take a long time to traverse, so I opted to maintain an explicit counter. Fixes redpanda-data#5433.
In the segment appender we assert in a few places that a particular semaphore has 1 unit available, which is a proxy for "no in flight writes on the last chunk". For example, when we grab max units from the _concurrent_flushes (_cf) semaphore, we expect there to be no concurrent writers, since other writers also obtain at least 1 unit from this semaphore before dispatching a write. In its current state this assert can fire. Issue #5433 describes one way this can happen in detail, but in summary examining the units can give false positives here: the assert may fail (0 units available) but there are no concurrent writers because writers also try to obtain the _cf semaphore after the _phw semaphore and may be safely blocked there. This change tracks dispatched writes explicitly: incrementing a counter when we dispatch a write and decrementing it after. The counter can be incremented in the "right" place, immediately before actually calling dma_write and after obtaining the _cf semaphore, avoiding the problem above. Additionally, the check is broader in that it applies to all prior chunks, not just the most recent one. The existing check would naturally want to check this broader condition as well, but there was no convenient way with the way state was tracked by the appender. In a sense, the _inflight_dispatched counter is redundant: it could be reconstructed simply by walking the _inflight array and counting the number of writes in DISPATCHED status. However, I am concerned that this array could grow very large and so take a long time to traverse, so I opted to maintain an explicit counter. Fixes #5433. (cherry picked from commit ea423ea)
This assertion in segment_appender::do_next_adaptive_fallocation
Triggered when running my development branch of ManyPartitionsTest on a cluster where I had accidentally broken the symlink from /mnt/vectorized/redpanda to /var/lib/redpanda, so was inadvertently running on the slow root drive on an i3en.6xlarge node.
Suggests we have a bug here that only shows up when really hammering a really slow drive.
This ticket is light on details but exists to make sure we go back and re-test before releasing 22.2. Basically just take
dev
and the latest/greatest ManyPartitionsTest, blow away the symlink + see if it crashes. It did it twice in two runs, so was pretty reproducible.If it doesn't reproduce, close this ticket: I was running lots of whacky development branches of dubious provenance.
The text was updated successfully, but these errors were encountered: