Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

rpc: reset transport version via reconnect_transport #5520

Merged
merged 5 commits into from
Jul 21, 2022

Conversation

dotnwat
Copy link
Member

@dotnwat dotnwat commented Jul 19, 2022

Cover letter

reconnect_transport may be used on a peer that is downgraded, in which case, we need to also reset the transport version to restart version negotiation.

Problem found via test #5488

Fixes: #5506

Release notes

  • None

@dotnwat dotnwat changed the title Fix 5506 rpc: reset transport version via reconnect_transport Jul 19, 2022
andrwng
andrwng previously approved these changes Jul 19, 2022
Copy link
Contributor

@andrwng andrwng left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM! Thanks for the quick turnaround here!

@@ -112,6 +112,8 @@ class transport final : public net::base_transport {
* version level used when dispatching requests. this value may change
* during the lifetime of the transport. for example the version may be
* upgraded if it is discovered that a server supports a newer version.
*
* reset to v1 in reset_state() to support reconnect_transport.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice

andrwng
andrwng previously approved these changes Jul 19, 2022
@dotnwat
Copy link
Member Author

dotnwat commented Jul 20, 2022

@andrwng looks like there are several new ducktape bugs that popped up after this change. i'll have to address those before merging.

@mmedenjak mmedenjak added the kind/bug Something isn't working label Jul 20, 2022
When an error occurs while processing a request (e.g. a handler throws
or the server receives an unsupported api request) a default initialized
netbuf is created to hold the error and is sent back to the client.

Prior to this commit the header included v0 as the version for such
error replies. Since the error may be returned to a client whose
transport was upgraded to v2 a protocol violation error would be logged
by the client.

This commit matches the reply version to the request version for error
replies under the reasonable assumption that (1) the rpc header is
compatibile across transport versions and (2) the errors contain no
decodable payloads.

Signed-off-by: Noah Watkins <[email protected]>
Other exceptions log messages when returning timeout errors to the
client. However, an actual timeout exception didn't log so an observer
would have to deduce exactly what happened based on an omission in the
logs.

Signed-off-by: Noah Watkins <[email protected]>
Signed-off-by: Noah Watkins <[email protected]>
This is needed for the case in which a peer on the other end of a
reconnect_transport is downgraded. unless we reset the version on
reconnect then we'll get a policy violation if the peer responds
with a lower version than the transport had been upgraded to.

Signed-off-by: Noah Watkins <[email protected]>
Copy link
Contributor

@andrwng andrwng left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall makes sense. I've checked out the new patch and am running node op fuzz test with mixed versions (previously also tripped on this), initial results seem positive

@@ -153,6 +153,7 @@ simple_protocol::dispatch_method_once(header h, net::server::resources rs) {
ctx->res.conn->addr);
rs.probe().method_not_found();
netbuf reply_buf;
reply_buf.set_version(ctx->get_header().version);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This makes me think it'd be nice to be able to add a vassert that we never send a v0 netbuf over the wire, e.g. in send_reply(). I don't think we can though since we'll still send v0 over the wire if talking to an old server (though it gets ignored regardless)

Another thought is that we could use some simple_reply_buf around the simple protocol instead of netbuf that is constructed with an initial version (even if it's just _version).

I don't feel strongly that either of these would be much better than what's there, but just brainstorming ways we can be sure our sent bytes are reasonable

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think we can though since we'll still send v0 over the wire if talking to an old server (though it gets ignored regardless)

correct

Another thought is that we could use some simple_reply_buf around the simple protocol instead of netbuf that is constructed with an initial version (even if it's just _version).

i think the right move would be to add a constructor to netbuf so it can't be constructed without providing a version

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yea that seems reasonable too. I was initially worried we'd end up with some lifecycle issues, e.g. not knowing what the right version is at time of construction, but I agree baking this into netbuf would feel better

@dotnwat
Copy link
Member Author

dotnwat commented Jul 21, 2022

Failure is #4848

@dotnwat dotnwat merged commit b2403ff into redpanda-data:dev Jul 21, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area/redpanda kind/bug Something isn't working
Projects
None yet
3 participants