Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

net: avoid this capture in lambda coroutine #6769

Merged
merged 6 commits into from
Nov 1, 2022

Conversation

dotnwat
Copy link
Member

@dotnwat dotnwat commented Oct 14, 2022

Cover letter

A shot in the dark for #5575

This needs to bake in the tree for a while to be certain because of how rare we see this crash. If we can conclude at some point that the issue is gone, then we might consider backporting / closing the issue.

Also, if we don't find that the issue is resolved, then we might consider looking more closely at the connection_rate_limit feature that is applied in the net connection accept handler.

Backport Required

  • not a bug fix
  • issue does not exist in previous branches
  • papercut/not impactful enough to backport
  • v22.2.x
  • v22.1.x
  • v21.11.x

Release notes

  • None

- default the ctor
- explicitly delete copy ctor
- don't move trivial type (address)
- mark move assignment as noexcept

Signed-off-by: Noah Watkins <noahwatkins@gmail.com>
Signed-off-by: Noah Watkins <noahwatkins@gmail.com>
There is some non-trivial things happening via the units destructor in
conn_quota::put. the dtor might already be implicitly noexcept, but lets
make that explicit just in case.

Signed-off-by: Noah Watkins <noahwatkins@gmail.com>
It still is not clear that capturing `this` in a lambda coroutine is
safe. The accept handler in the net server captures `this`.
Additionally, we've been dealing for some time with this crash:

```
TRACE 2022-10-13 08:55:06,465 [shard 0] rpc - conn_quota.cc:118 - put(172.16.16.27)
TRACE 2022-10-13 08:55:06,465 [shard 0] rpc - conn_quota.cc:130 - leaving put(172.16.16.27)
AddressSanitizer:DEADLYSIGNAL
=================================================================
==16388==ERROR: AddressSanitizer: SEGV on unknown address (pc 0x7fa76bfb218f bp 0x7ffddbd1a140 sp 0x7ffddbd19ea0 T0)
==16388==The signal is caused by a READ memory access.
==16388==Hint: this fault was caused by a dereference of a high value address (see register values below).  Disassemble the provided pc to learn which register was used.
AddressSanitizer:DEADLYSIGNAL
AddressSanitizer: nested bug in the same thread, aborting.
```

The only place where:

```
TRACE 2022-10-13 08:55:06,465 [shard 0] rpc - conn_quota.cc:130 - leaving put(172.16.16.27)
```

This is happening is in and around this lambda. However, nothing else seems
obviously problematic which might lead to dereferencing a junk pointer. So this change
is not definitely going to fix it as we don't really have proof that the problem is `this`
capture being clobbered.

Signed-off-by: Noah Watkins <noahwatkins@gmail.com>
Signed-off-by: Noah Watkins <noahwatkins@gmail.com>
Signed-off-by: Noah Watkins <noahwatkins@gmail.com>
Copy link
Member

@BenPope BenPope left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems reasonable to me.

Copy link
Contributor

@jcsp jcsp left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, although struggling to see how the this capture would be problematic, because server is a long lived object & istr the failures happening not during shutdown. Still coroutines lambdas are subtle and quick to anger.

@BenPope
Copy link
Member

BenPope commented Oct 14, 2022

LGTM, although struggling to see how the this capture would be problematic, because server is a long lived object & istr the failures happening not during shutdown. Still coroutines lambdas are subtle and quick to anger.

Because the pointer to the long lived object is not a long lived object, technically it goes out of scope at the first co_await.

I thought that a read somewhere that capturing this in a coroutine lambda on clang was ok, but we should not depend on chance, it may have changed, it may change in the future.

@dotnwat
Copy link
Member Author

dotnwat commented Oct 14, 2022

LGTM, although struggling to see how the this capture would be problematic, because server is a long lived object & istr the failures happening not during shutdown. Still coroutines lambdas are subtle and quick to anger.

Ben is right. The lambda transformation in cppinsights shows that this is captured like any other capture, subject to clobbering. I've seen anecdotes that clang treats this special, and i'm pretty sure Alexey saw the same thing in some IR investigations. But I've yet to see any indication that coroutine transformations do this as a guarantee.

So, this might not fix the issue, unfortunately.

@dotnwat
Copy link
Member Author

dotnwat commented Nov 1, 2022

failure was #6997

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants