Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Set promise on exceptions in dispatch_method_once #8519

Merged
merged 1 commit into from
Feb 7, 2023

Conversation

ballard26
Copy link
Contributor

@ballard26 ballard26 commented Jan 31, 2023

Changes the logic of dispatch_method_once to avoid a broken promise when the connection is closed before the input is skipped for unknown methods and unknown versions. This ensures that we are always setting the body_parse promise after reading in all the input and before we send a reply.

Fixes #8518, Fixes #8074

Backports Required

  • none - not a bug fix
  • none - issue does not exist in previous branches
  • none - papercut/not impactful enough to backport
  • v22.3.x
  • v22.2.x
  • v22.1.x

UX Changes

Release Notes

  • none

@@ -104,7 +104,13 @@ rpc_server::send_reply(ss::lw_shared_ptr<server_context_impl> ctx, netbuf buf) {

ss::future<> rpc_server::send_reply_skip_payload(
ss::lw_shared_ptr<server_context_impl> ctx, netbuf buf) {
co_await ctx->conn->input().skip(ctx->get_header().payload_size);
try {
co_await ctx->conn->input().skip(ctx->get_header().payload_size);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: could you add a comment that we might expect an exception here if the connection is closed? Otherwise it's not immediately obvious why this is necessary

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch, will add

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i think it might make sense to handle this back in rpc_server::dispatch_method_once ?

ctx->body_parse_exception(std::current_exception());
co_return;
}
ctx->signal_body_parse();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Previously this was signaled after the send_reply() finished. I imagine this reordering means that we allow waiters to proceed before responding to clients. Is that important to this fix?

Copy link
Contributor Author

@ballard26 ballard26 Jan 31, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not particularly important to the fix. And yeah, it would allow for waiters to continue before the reply is sent. In this case it would allow for more requests to be read from the connection before a reply is sent.

We currently have three code paths in dispatch_method_once

  1. For unsupported version
  2. For unknown methods
  3. For known methods

For 1 & 2 we currently only signal waiters after a reply is sent.
For 3 though we signal the waiters right after consuming & parsing all input for the request. Long before calling the service method and sending the reply.

The goal of this change was to have 1 & 2 match the behavior of 3. Though there may be a case for them to be different. If you think it's out of scope for this PR though I can start a separate PR for it.

Copy link
Contributor

@andrwng andrwng Jan 31, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see, thanks for the explanation! It makes sense that we'd be able to / want to signal early. I'm also not sure if this was originally implemented intentionally, so it might be worth poking others familiar in the area, but the rationale for standardizing on 3 makes sense to me

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

CC: @dotnwat

Copy link
Member

@dotnwat dotnwat Jan 31, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see, thanks for the explanation! It makes sense that we'd be able to / want to signal early.

i agree with @andrwng that it makes sense, and i think you are doing the right thing here @ballard26.

would allow for waiters to continue before the reply is sent.

is this optimization the only reason for changing the order of operations?

I'm also not sure if this was originally implemented intentionally,

however, i think we need to be very conservative in the rpc server because we've had countless issues with race conditions and haven't had time to revamp/rearch rpc. so anything that potentially alters the scheduling is a red flag. in this case it seems completely benign, but its also a rare error case so the optmiization doesn't seem worth the cost of working through managing the concerns?

Copy link
Contributor Author

@ballard26 ballard26 Feb 1, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fair point about the issues in the past with race conditions. I'll go ahead and revert things to handle the exception with a then_wrapped in the rpc_server::dispatch_method_once in order to preserve the scheduling order. The optimization I made does seem outside the scope of a PR to fix an issue.

andrwng
andrwng previously approved these changes Jan 31, 2023
// If the connection is closed then this can throw an exception.
// In that case we want to catch and forward it to avoid a broken
// promise.
co_await ctx->conn->input().skip(ctx->get_header().payload_size);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i think it might make sense to handle this back in rpc_server::dispatch_method_once ?

Copy link
Contributor Author

@ballard26 ballard26 Jan 31, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I made a comment here https://github.com/redpanda-data/redpanda/pull/8519/files#r1091520039 about why I'm handling it here. Basically if I handle it in rpc_server::dispatch_method_once with a finally or similar I can only set the promise after the reply is sent or an exception is thrown. Which is why I moved the handling here. Happy to move the handling to rpc_server::dispatch_method_once if desired though.

@ballard26 ballard26 force-pushed the broken-promise-rpc branch 3 times, most recently from 56edc56 to dae07f4 Compare February 1, 2023 00:27
// `permanent_memory_reservation` will return an exception.
// We intercept it here to avoid a broken promise.
ctx.body_parse_exception(e);
std::rethrow_exception(e);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: it is more idiomatic here outside a co-routine context to return ss::make_exception_future<>(e). the reason is that seastar will end up having to catch this again and repackage it into an exceptional future.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point, switching to it now.

Changes the logic of `dispatch_method_once` to avoid a broken promise
when the connection is closed before the input is skipped for unknown
methods and unknown versions. This ensures that we are always setting
the body_parse promise after reading in all the input and before we send
a reply.
Copy link
Member

@dotnwat dotnwat left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nice!

@dotnwat
Copy link
Member

dotnwat commented Feb 1, 2023

restarted ci. seems like some transient issue in ci.

@dotnwat
Copy link
Member

dotnwat commented Feb 1, 2023

ok restarted again. seems the ci fix has landed now

@ballard26
Copy link
Contributor Author

ballard26 commented Feb 3, 2023

In ducktape-build-release-clang-amd64-1-0 the failure appears to be #8589
In ducktape-build-debug-clang-amd64-1-0 the failure is #8621

@dotnwat dotnwat merged commit db61371 into redpanda-data:dev Feb 7, 2023
@vshtokman
Copy link
Contributor

/backport v22.3.x

@vbotbuildovich
Copy link
Collaborator

Failed to run cherry-pick command. I executed the below command:

git cherry-pick -x e47ab0dccb51a1a07e351219c94eb30575d42d82

Workflow run logs.

@vshtokman
Copy link
Contributor

@ballard26 , could you look into backporting this when you have a chance?

@vshtokman
Copy link
Contributor

/backport v22.3.x

This pull request was closed.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
6 participants