posix-stack: fix memory leak with shutdown/accept race #1265

Open: dotnwat wants to merge 1 commit into master
Conversation

dotnwat commented Oct 24, 2022

When posix_server_socket_impl::accept() runs, it may start a cross-core background fiber that inserts a pending connection into the thread-local container posix_ap_server_socket_impl::conn_q.

However, the continuation that enqueues the pending connection may not actually run until after the target core calls abort_accept() (e.g. parallel shutdown via seastar::sharded<server>::stop).

This can leave an entry in the conn_q container that is destroyed when the reactor thread exits. Unfortunately, the conn_q container holds a conntrack::handle, whose destructor schedules additional work:

```
class handle {
    foreign_ptr<lw_shared_ptr<load_balancer>> _lb;
    ~handle() {
        (void)smp::submit_to(_host_cpu, [cpu = _target_cpu, lb = std::move(_lb)] {
            lb->closed_cpu(cpu);
        });
    }
    ...
```

When this race occurs and the destructor runs, the reactor is no longer available, and the continuation scheduled onto it is leaked:

```
Direct leak of 88 byte(s) in 1 object(s) allocated from:
    #0 0x557c91ca5b7d in operator new(unsigned long) /v/llvm/llvm/src/compiler-rt/lib/asan/asan_new_delete.cpp:95:3
    #1 0x557ca3e3cc08 in void seastar::future<void>::schedule<seastar::internal::promise_ba...
    ...
    // the unordered map here is conn_q
    #19 0x557ca47034d8 in std::__1::unordered_multimap<std::__1::tuple<int, seastar::socket...
    #20 0x7f98dcaf238e in __call_tls_dtors (/lib64/libc.so.6+0x4038e) (BuildId: 6e3c087aca9...
```

fixes: #738

Signed-off-by: Noah Watkins <noahwatkins@gmail.com>

dotnwat changed the title from "posix-stack: fix memory with shutdown/accept race" to "posix-stack: fix memory leak with shutdown/accept race" on Oct 24, 2022

xemul commented on the at_exit handler added in this change:

```
future<> posix_network_stack::initialize() {
    engine().at_exit([] {
        return exiting_gate.close().then([] {
            posix_ap_server_socket_impl::conn_q.clear();
```

Why is this clear() needed? If we're on the main thread, this queue is empty; if on an auxiliary one, the check in move_connected_socket makes sure no entries are there any longer.

dotnwat replied Nov 2, 2022

@xemul sorry about the delay getting back to you!

The primary purpose of the gate on the main thread is not to fix the race definitively, but to introduce a barrier, so that when a socket is closed during shutdown we can be sure that, if the race occurred, the entry is on the other core early enough that we can...

Why is this clear() needed?

...clean up with conn_q.clear() before the reactor thread exits and schedules more continuations on a shutdown reactor.

The race still exists because there is no internal cross-core synchronization of abort_accept calls. With the gate, if the application ensures abort_accept is called on the main thread before all other threads, then the race doesn't exist, but that isn't obvious at the API level, where folks will usually do something like sharded<>::stop(), which ends up with a non-deterministic shutdown ordering.
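
To make the barrier idea concrete, here is a rough sketch of how the accept path on shard 0 could enter the gate around the cross-core move. exiting_gate and conn_q come from this patch; the function name, signature, and the elided target-shard logic are illustrative, not the actual diff:

```
#include <seastar/core/gate.hh>
#include <seastar/core/smp.hh>

// Thread-local gate added by this change; it only serves a functional
// purpose on shard 0, which is where the cross-core moves originate.
static thread_local seastar::gate exiting_gate;

// Illustrative sketch: the accepting shard (shard 0) holds the gate open for
// the duration of the cross-core hop. Closing the gate in shard 0's at_exit
// handler therefore waits until every in-flight connection has either woken a
// pending accept() or been parked in the target shard's conn_q, which the
// other shards' at_exit handlers then clear before their reactors go away.
seastar::future<> move_connection_to(unsigned target_cpu /*, connection state */) {
    return seastar::with_gate(exiting_gate, [target_cpu] {
        return seastar::smp::submit_to(target_cpu, [] {
            // Target shard: wake a waiting accept() or park the entry in
            // posix_ap_server_socket_impl::conn_q (elided here).
        });
    });
}
```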

How could the race be completely fixed?

I worked on this a bit in the realm of more elaborate internal accounting tricks, but they all started to spiral out of control. Ultimately it seems like there might need to be a change at the socket API level such that abort_accept synchronizes across cores in some fashion.

@dotnwat dotnwat requested a review from xemul November 4, 2022 16:03

xemul commented Nov 8, 2022

I still don't get the fix :( If the gate on shard 0 gets closed, it means that the accepted socket has landed in shard-N's lists. To avoid leaks there should be something that cleans the lists on shard N after shard 0's gate is closed, but I don't see it.

dotnwat commented Nov 8, 2022

I still don't get the fix :( If the gate on shard 0 gets closed, it means that the accepted socket has landed in shard-N's lists.

Thanks for being patient. I think my explanations haven't been great. But I think I understand where the confusion might be given your next question:

To avoid leaks there should be something that cleans the lists on shard N after shard 0's gate is closed, but I don't see it.

I thought I mentioned in the commit--but now realize I didn't--that there is an implicit assumption of at_exit handler ordering. The reactor explicitly runs at_exit handlers on shard 0 before any other shards.

```
void reactor::stop() {
    assert(_id == 0);
    _smp->cleanup_cpu();
    if (!_stopping) {
        // Run exit tasks locally and then stop all other engines
        // in the background and wait on semaphore for all to complete.
        // Finally, set _stopped on cpu 0.
        (void)run_exit_tasks().then([this] {
            return do_with(semaphore(0), [this] (semaphore& sem) {
                // Stop other cpus asynchronously, signal when done.
                (void)smp::invoke_on_others(0, [] {
                    engine()._smp->cleanup_cpu();
                    return engine().run_exit_tasks().then([] {
```

Since the following at_exit handler is registered on all shards:

```
future<> posix_network_stack::initialize() {
    engine().at_exit([] {
        return exiting_gate.close().then([] {
            posix_ap_server_socket_impl::conn_q.clear();
```

Shard 0 will first close the gate, ensuring the socket lands on shard N. Then all other shards will close the gate and clear the list. It's slightly confusing that the gate is thread_local since it only serves a functional purpose on shard 0, but I didn't see an obviously better place to put it.

dotnwat commented Nov 15, 2022

@xemul does that latest update help clarify?

xemul commented Nov 23, 2022

@dotnwat , sorry for getting back late. I'm still trying to wrap my head around it. Please, check my understanding of the race and correct me once you see I'm wrong :)

So posix_server_socket_impl::accept spawns a submit_to fiber that is supposed to do one of two things. If the target shard has already called accept(), there's an entry in the sockets map, so the moving fiber "wakes up" that entry. If accept() has not yet been called, the moved entry lands in the conn_q map so that a later accept() can pick it up.

If the target shard does abort_accept(), it should clean both the sockets and conn_q maps of any pending entries. However, if the moving fiber is delayed and abort_accept() runs first, an entry appears in conn_q that nobody picks up, and it lives until seastar shuts down.

If that's correct, then shouldn't the fix be to avoid this race, i.e. the in-flight cross-shard entry should somehow be waited on by abort_accept, so that it doesn't hang around in conn_q until shutdown?

dotnwat commented Nov 23, 2022

@dotnwat , sorry for getting back late. I'm still trying to wrap my head around it. Please, check my understanding of the race and correct me once you see I'm wrong :)

No problem at all, @xemul.

So posix_server_socket_impl::accept spawns a submit_to fiber that is supposed to do one of two things. If the target shard has already called accept(), there's an entry in the sockets map, so the moving fiber "wakes up" that entry. If accept() has not yet been called, the moved entry lands in the conn_q map so that a later accept() can pick it up.

Yep. This is the rendezvous that occurs between the cores.
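
As a reference for readers, here is a simplified model of that rendezvous. This is not the real posix-stack.cc code: the key and value types are stand-ins, though the container shapes match the leak trace quoted above.

```
#include <seastar/core/future.hh>
#include <seastar/core/smp.hh>
#include <unordered_map>

struct pending_conn {};   // stand-in for the moved connection state
using key_t = int;        // really a (protocol, socket_address) tuple

// Per-shard state; the instances that matter live on the target shard.
static thread_local std::unordered_multimap<key_t, pending_conn> conn_q;               // parked connections
static thread_local std::unordered_map<key_t, seastar::promise<pending_conn>> waiters; // pending accept()s

// Shard 0, after accept() has chosen target_cpu for the new connection.
seastar::future<> move_to(unsigned target_cpu, key_t key, pending_conn conn) {
    return seastar::smp::submit_to(target_cpu, [key, conn = std::move(conn)] () mutable {
        if (auto it = waiters.find(key); it != waiters.end()) {
            it->second.set_value(std::move(conn));  // an accept() is already waiting: wake it
            waiters.erase(it);
        } else {
            conn_q.emplace(key, std::move(conn));   // park it for a later accept()
        }
    });
}

// Target shard. If this runs *before* the submit_to continuation above, the
// parked entry stays in conn_q until thread exit, and its destructor then
// tries to schedule work on a reactor that no longer exists.
void abort_accept(key_t key) {
    waiters.erase(key);
    conn_q.erase(key);
}
```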

If the target shard does abort_accept(), it should clean both the sockets and conn_q maps of any pending entries. However, if the moving fiber is delayed and abort_accept() runs first, an entry appears in conn_q that nobody picks up, and it lives until seastar shuts down.

That's right.

If that's correct, then shouldn't the fix be to avoid this race,

Yes, avoiding the race would be ideal. And I think there are probably many strategies for this, including replacing this rendezvous mechanism, which is difficult to work with.

i.e. the in-flight cross-shard entry should somehow be waited on by abort_accept, so that it doesn't hang around in conn_q until shutdown?

The problem with having the target shard wait like you suggest is that the cross-shard entry is in flight: the target shard doesn't have anything to wait on. I suppose the target shard could reach back over to core 0 and check for in-flight entries.

I played a bit with the idea of having the target shard's abort_accept record something like a tombstone so that the in-flight entry would be rejected, making the race safe. However, the conn_q data structure is indexed by protocol/socket address, and there didn't seem to be an obviously correct policy for clearing the tombstones, which would be necessary in practice (e.g. a socket server shut down and then restarted on the same address).
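
Purely for illustration, a tombstone variant along those lines might have looked like the sketch below. It was never implemented; it reuses the names from the simplified model above and glosses over the real (protocol, address) key type.

```
#include <unordered_set>

// Hypothetical: record aborted listeners so a late cross-core move can be
// rejected instead of parked.
static thread_local std::unordered_set<key_t> aborted_keys;

void abort_accept_with_tombstone(key_t key) {
    waiters.erase(key);
    conn_q.erase(key);
    aborted_keys.insert(key);      // tombstone for any move still in flight
}

void on_moved_connection(key_t key, pending_conn conn) {
    if (aborted_keys.count(key)) {
        return;                    // listener already aborted: drop, don't park
    }
    conn_q.emplace(key, std::move(conn));
}

// The unresolved question: when is it safe to erase entries from aborted_keys?
// A server shut down and later re-bound to the same address would otherwise
// have its first incoming connections silently dropped.
```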

From what we can tell the race is very rare, and only happens on system shutdown when a new connection happens to arrive in the vulnerable window after the target shard's abort_accept but before shard 0's abort_accept. So this PR is really a band-aid for a nagging issue that shows up as an LSAN leak causing an occasional CI failure, not a fix for the race.

I'm happy to go back to the drawing board and try to fix the race if you'd like to avoid the hack.

dotnwat commented Dec 8, 2022

bump

xemul commented Dec 8, 2022

Sorry. I think that the way to go is to try to avoid the root-cause race. The hack is awesome, but the raced connection still leaks until seastar shutdown, which is not nice.

dotnwat commented Dec 8, 2022

Sorry. I think that the way to go is to try to avoid the root-cause race. The hack is awesome, but the raced connection still leaks until seastar shutdown, which is not nice.

sgtm. I'll look into a more fundamental fix, although with the current socket implementation it's not entirely clear what a better solution would look like that doesn't involve public API changes. There is a reproducer in the linked issue, and several folks have reported this issue, in case opinions change about the temporary LSAN fix in the short term.

dotnwat added a commit to dotnwat/redpanda that referenced this pull request Dec 21, 2022
A memory leak pops up in our CI from time to time
(redpanda-data#3338). It is caused by
a very narrow race within Seastar when incoming connections arrive while
a socket server is being shut down.

I've posted a "fix" for this in upstream Seastar (link below) which was
effectively rejected because it doesn't really fix the leak, but merely
cleans up on exit to make LSAN happy. AFAICT this leak can only happen
when a socket server is being shut down, and in Redpanda, this only
happens when Redpanda is shutting down. In this case it is harmless,
but annoying because it will occasionally fail our CI.

Adding an LSAN suppressions file allows us to instruct LSAN to ignore
this particular leak when it occasionally does trigger.

Fixes: redpanda-data#3338

See scylladb/seastar#1265 for more discussion
and a link to an issue with a reproducer.

Signed-off-by: Noah Watkins <noahwatkins@gmail.com>
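
For context, an LSAN suppressions file is a plain-text list of leak:<pattern> rules passed to the run via LSAN_OPTIONS=suppressions=<file>. A plausible entry for this particular leak might look like the following; the pattern is illustrative and not necessarily what Redpanda ships:

```
# lsan_suppressions.txt (illustrative): ignore the shutdown-time leak of the
# continuation scheduled by conntrack::handle's destructor
leak:seastar::net::conntrack
```
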
vbotbuildovich pushed a commit to vbotbuildovich/redpanda that referenced this pull request Jan 27, 2023
(cherry picked from commit 778704b)
vbotbuildovich pushed a commit to vbotbuildovich/redpanda that referenced this pull request Feb 28, 2023
(cherry picked from commit 778704b)
Successfully merging this pull request may close these issues.

LeakSanitizer: detected memory leaks from seastar::net::conntrack::load_balancer