storage: assert out on EIO in read #7633

jcsp · 2022-12-05T18:57:47Z

We already assert out on EIO in writes. On reads, however, we only surfaced the exceptional future, and some code paths (such as state_machine.cc) have general exception handlers that retry forever. As a result, nodes with bad disks stayed up but made no progress.
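A minimal sketch of the failure mode described above, using only the C++ standard library rather than Seastar. The names `outcome`, `retry_all`, and `retry_unless_eio` are hypothetical and do not appear in the Redpanda source, and the attempt limit exists only so the example terminates; the handler described in the PR retries without a bound.

```cpp
#include <cerrno>
#include <functional>
#include <system_error>

// Hypothetical result type, for illustration only.
enum class outcome { succeeded, retries_exhausted, fatal_io_error };

// Generic handler: every exception is treated as transient and retried.
// A persistent EIO is indistinguishable from a temporary failure here.
outcome retry_all(const std::function<void()>& read, int max_attempts) {
    for (int attempt = 0; attempt < max_attempts; ++attempt) {
        try {
            read();
            return outcome::succeeded;
        } catch (...) {
            // Swallow the error and try again.
        }
    }
    return outcome::retries_exhausted;
}

// Same loop, but EIO is singled out before the generic handler sees it.
// The PR asserts at this point; the sketch returns a marker value instead
// so that it can be exercised without killing the process.
outcome retry_unless_eio(const std::function<void()>& read, int max_attempts) {
    for (int attempt = 0; attempt < max_attempts; ++attempt) {
        try {
            read();
            return outcome::succeeded;
        } catch (const std::system_error& e) {
            if (e.code().value() == EIO) {
                return outcome::fatal_io_error;
            }
            // Other system errors fall through to the next attempt.
        } catch (...) {
            // Non-system errors are retried as before.
        }
    }
    return outcome::retries_exhausted;
}
```

With a read that always throws `std::system_error` carrying `EIO`, `retry_all` uses up every attempt, while `retry_unless_eio` stops on the first one.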

Fixes #7637

Backports Required

UX Changes

See release notes.

Release Notes

Bug Fixes

In some cases, Redpanda could hang on EIO errors from the underlying storage device. Redpanda now terminates with an assertion on EIO, in anticipation of the node or drive requiring replacement.

andrwng · 2022-12-06T16:25:09Z

src/v/storage/log_reader.cc

+ if (ec.code().value() == EIO) {
+ vassert(false, "I/O error during read! Disk failure?");


+1

I'm curious whether we'll hit a crash loop in automated deployments, but this still seems much better than the alternative. I wonder if, longer term, we'd want to peel back this limitation: assuming the entire disk isn't fully borked, we could stop or move just the replicas that hit bad reads while gracefully decommissioning the node.

Longer term, I suspect that for things like decode errors we'll want to treat them as non-crashing errors and report a "damaged" partition, but for EIOs we might always keep the termination behavior, as it's such a critical failure indicator.
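The split suggested here could be sketched roughly as follows. This is purely illustrative: `read_failure_action`, `decode_error`, and `classify` are hypothetical names, and nothing like this is part of the PR.

```cpp
#include <cerrno>
#include <exception>
#include <stdexcept>
#include <system_error>

// Hypothetical policy outcomes, for illustration only.
enum class read_failure_action {
    terminate_process,      // EIO: treat as a critical disk failure
    mark_partition_damaged, // decode error: report and keep the node up
    propagate               // anything else: leave to the caller
};

// Hypothetical stand-in for whatever exception a failed decode would raise.
struct decode_error : std::runtime_error {
    using std::runtime_error::runtime_error;
};

read_failure_action classify(const std::exception& e) {
    if (auto* se = dynamic_cast<const std::system_error*>(&e)) {
        if (se->code().value() == EIO) {
            return read_failure_action::terminate_process;
        }
    }
    if (dynamic_cast<const decode_error*>(&e) != nullptr) {
        return read_failure_action::mark_partition_damaged;
    }
    return read_failure_action::propagate;
}
```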

jcsp · 2022-12-12T09:37:26Z

/backport v22.3.x

jcsp · 2022-12-12T09:37:32Z

/backport v22.2.x

jcsp · 2022-12-12T09:37:38Z

/backport v22.1.x

vbotbuildovich · 2022-12-12T09:38:34Z

Failed to run cherry-pick command. I executed the below command:

git cherry-pick -x 875faa41a0019ce2bf8cec82bc96d656a007494b

Workflow run logs.

jcsp · 2022-12-12T11:22:53Z

Manual backport to v22.1.x at #7703

dotnwat · 2022-12-13T00:07:43Z

src/v/storage/log_reader.cc

+ return ss::make_exception_future<result<records_t>>(
+ std::current_exception());


I think that we should pass ec here instead of std::current_exception. While future::handle_exception_type does invoke its continuation inside a catch block, that doesn't seem to be a guarantee that the API needs to maintain.
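To illustrate the concern with standard C++ only (no Seastar): `std::current_exception()` returns a null `std::exception_ptr` unless an exception is currently being handled, so code that relies on it depends on where it is called from. `forward_ambient` and `forward_explicit` are hypothetical helpers for this sketch, not functions from the PR.

```cpp
#include <cerrno>
#include <exception>
#include <system_error>

// Depends on ambient state: only yields the exception if the caller
// happens to be inside a catch block when this runs.
std::exception_ptr forward_ambient() {
    return std::current_exception();
}

// Depends only on its argument, so it behaves the same wherever it is called.
// Note: std::make_exception_ptr copies by static type, so an exception of a
// type derived from std::system_error would be sliced here.
std::exception_ptr forward_explicit(const std::system_error& ec) {
    return std::make_exception_ptr(ec);
}
```

Called outside any handler, `forward_ambient()` yields a null pointer, whereas `forward_explicit(ec)` always carries the error it was given.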

storage: assert out on EIO in read

875faa4

github-actions bot added the area/redpanda label Dec 5, 2022

jcsp requested review from dotnwat and mmaslankaprv December 6, 2022 14:02

jcsp marked this pull request as ready for review December 6, 2022 14:03

andrwng approved these changes Dec 6, 2022

mmaslankaprv approved these changes Dec 12, 2022

jcsp merged commit 8694d84 into redpanda-data:dev Dec 12, 2022

jcsp deleted the storage-assert-eio branch December 12, 2022 09:37

This was referenced Dec 12, 2022

[v22.3.x] storage: state machine readers spin on EIO #7699

Closed

[v22.3.x] storage: assert out on EIO in read #7700

Merged

[v22.2.x] storage: state machine readers spin on EIO #7701

Closed

[v22.2.x] storage: assert out on EIO in read #7702

Merged

dotnwat reviewed Dec 13, 2022
