posix-stack: fix memory leak with shutdown/accept race #1265

Open: dotnwat wants to merge 1 commit into master
Conversation

dotnwat commented Oct 24, 2022

When posix_server_socket_impl::accept() runs, it may start a cross-core background fiber that inserts a pending connection into the thread-local container posix_ap_server_socket_impl::conn_q.

However, the continuation that enqueues the pending connection may not actually run until after the target core calls abort_accept() (e.g. parallel shutdown via seastar::sharded<server>::stop).

This can leave an entry in the conn_q container that is destroyed when the reactor thread exits. Unfortunately, the conn_q container holds a conntrack::handle, whose destructor schedules additional work:

```
class handle {
    foreign_ptr<lw_shared_ptr<load_balancer>> _lb;
    ~handle() {
        (void)smp::submit_to(_host_cpu, [cpu = _target_cpu, lb = std::move(_lb)] {
            lb->closed_cpu(cpu);
        });
    }
    ...
```

When this race occurs and the destructor runs, the reactor is no longer available, and the continuation scheduled onto it is leaked:

```
Direct leak of 88 byte(s) in 1 object(s) allocated from:
    #0 0x557c91ca5b7d in operator new(unsigned long) /v/llvm/llvm/src/compiler-rt/lib/asan/asan_new_delete.cpp:95:3
    #1 0x557ca3e3cc08 in void seastar::future<void>::schedule<seastar::internal::promise_ba...
    ...
    // the unordered map here is conn_q
    #19 0x557ca47034d8 in std::__1::unordered_multimap<std::__1::tuple<int, seastar::socket...
    #20 0x7f98dcaf238e in __call_tls_dtors (/lib64/libc.so.6+0x4038e) (BuildId: 6e3c087aca9...
```

fixes: #738

Signed-off-by: Noah Watkins <noahwatkins@gmail.com>

dotnwat changed the title from "posix-stack: fix memory with shutdown/accept race" to "posix-stack: fix memory leak with shutdown/accept race" on Oct 24, 2022

xemul commented on the at_exit handler added in this change:

```
future<> posix_network_stack::initialize() {
    engine().at_exit([] {
        return exiting_gate.close().then([] {
            posix_ap_server_socket_impl::conn_q.clear();
```

Why is this clear() needed? If we're on the main thread, this queue is empty; if on an auxiliary one, the check in move_connected_socket makes sure no entries are there any longer.

dotnwat replied Nov 2, 2022

@xemul sorry about the delay getting back to you!

The primary purpose of the gate on the main thread is not to fix the race definitively, but to introduce a barrier, so that when a socket is closed during shutdown we can be sure that, if the race occurred, the entry is on the other core early enough that we can...

Why is this clear() needed?

...clean up with conn_q.clear() before the reactor thread exits and schedules more continuations on a shutdown reactor.

The race still exists because there is no internal cross-core synchronization of abort_accept calls. With the gate, if the application ensures abort_accept is called on the main thread before all other threads, then the race doesn't exist, but that isn't obvious at the API level, where folks will usually do something like sharded<>::stop(), which ends up with a non-deterministic shutdown ordering.
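
To make the barrier idea concrete, here is a rough sketch of how the accept path on shard 0 could enter the gate around the cross-core move. exiting_gate and conn_q come from this patch; the function name, signature, and the elided target-shard logic are illustrative, not the actual diff:

```
#include <seastar/core/gate.hh>
#include <seastar/core/smp.hh>

// Thread-local gate added by this change; it only serves a functional
// purpose on shard 0, which is where the cross-core moves originate.
static thread_local seastar::gate exiting_gate;

// Illustrative sketch: the accepting shard (shard 0) holds the gate open for
// the duration of the cross-core hop. Closing the gate in shard 0's at_exit
// handler therefore waits until every in-flight connection has either woken a
// pending accept() or been parked in the target shard's conn_q, which the
// other shards' at_exit handlers then clear before their reactors go away.
seastar::future<> move_connection_to(unsigned target_cpu /*, connection state */) {
    return seastar::with_gate(exiting_gate, [target_cpu] {
        return seastar::smp::submit_to(target_cpu, [] {
            // Target shard: wake a waiting accept() or park the entry in
            // posix_ap_server_socket_impl::conn_q (elided here).
        });
    });
}
```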

How could the race be completely fixed?

I worked on this a bit in the realm of more elaborate internal accounting tricks, but they all started to spiral out of control. Ultimately it seems like there might need to be a change at the socket API level such that abort_accept synchronizes across cores in some fashion.

@dotnwat dotnwat requested a review from xemul November 4, 2022 16:03

xemul commented Nov 8, 2022

I still don't get the fix :( If the gate on shard 0 gets closed, it means that the accepted socket has landed in shard-N's lists. To avoid leaks there should be something that cleans the lists on shard N after shard 0's gate is closed, but I don't see it.

dotnwat commented Nov 8, 2022

I still don't get the fix :( If the gate on shard 0 gets closed, it means that the accepted socket has landed in shard-N's lists.

Thanks for being patient. I think my explanations haven't been great. But I think I understand where the confusion might be given your next question:

To avoid leaks there should be something that cleans the lists on shard N after shard 0's gate is closed, but I don't see it.

I thought I mentioned in the commit--but now realize I didn't--that there is an implicit assumption of at_exit handler ordering. The reactor explicitly runs at_exit handlers on shard 0 before any other shards.

```
void reactor::stop() {
    assert(_id == 0);
    _smp->cleanup_cpu();
    if (!_stopping) {
        // Run exit tasks locally and then stop all other engines
        // in the background and wait on semaphore for all to complete.
        // Finally, set _stopped on cpu 0.
        (void)run_exit_tasks().then([this] {
            return do_with(semaphore(0), [this] (semaphore& sem) {
                // Stop other cpus asynchronously, signal when done.
                (void)smp::invoke_on_others(0, [] {
                    engine()._smp->cleanup_cpu();
                    return engine().run_exit_tasks().then([] {
```

Since the following at_exit handler is registered on all shards:

```
future<> posix_network_stack::initialize() {
    engine().at_exit([] {
        return exiting_gate.close().then([] {
            posix_ap_server_socket_impl::conn_q.clear();
```

Shard 0 will first close the gate, ensuring the socket lands on shard N. Then all other shards will close the gate and clear the list. It's slightly confusing that the gate is thread_local since it only serves a functional purpose on shard 0, but I didn't see an obviously better place to put it.

dotnwat commented Nov 15, 2022

@xemul does that latest update help clarify?

xemul commented Nov 23, 2022

@dotnwat , sorry for getting back late. I'm still trying to wrap my head around it. Please, check my understanding of the race and correct me once you see I'm wrong :)

So posix_server_socket_impl::accept spawns a submit_to fiber that is supposed to do one of two things. If the target shard has already called accept(), there's an entry in the sockets map, so the moving fiber "wakes up" that entry. If accept() has not yet been called, the moved entry lands in the conn_q map so that a later accept() can pick it up.

If the target shard does abort_accept(), it should clean both the sockets and conn_q maps of any pending entries. However, if the moving fiber is delayed and abort_accept() runs first, an entry appears in conn_q that nobody picks up, and it lives until seastar shuts down.

If that's correct, then shouldn't the fix be to avoid this race, i.e. the in-flight cross-shard entry should somehow be waited on by abort_accept, so that it doesn't hang around in conn_q until shutdown?

dotnwat commented Nov 23, 2022

@dotnwat , sorry for getting back late. I'm still trying to wrap my head around it. Please, check my understanding of the race and correct me once you see I'm wrong :)

No problem at all, @xemul.

So posix_server_socket_impl::accept spawns a submit_to fiber that is supposed to do one of two things. If the target shard has already called accept(), there's an entry in the sockets map, so the moving fiber "wakes up" that entry. If accept() has not yet been called, the moved entry lands in the conn_q map so that a later accept() can pick it up.

Yep. This is the rendezvous that occurs between the cores.
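
As a reference for readers, here is a simplified model of that rendezvous. This is not the real posix-stack.cc code: the key and value types are stand-ins, though the container shapes match the leak trace quoted above.

```
#include <seastar/core/future.hh>
#include <seastar/core/smp.hh>
#include <unordered_map>

struct pending_conn {};   // stand-in for the moved connection state
using key_t = int;        // really a (protocol, socket_address) tuple

// Per-shard state; the instances that matter live on the target shard.
static thread_local std::unordered_multimap<key_t, pending_conn> conn_q;               // parked connections
static thread_local std::unordered_map<key_t, seastar::promise<pending_conn>> waiters; // pending accept()s

// Shard 0, after accept() has chosen target_cpu for the new connection.
seastar::future<> move_to(unsigned target_cpu, key_t key, pending_conn conn) {
    return seastar::smp::submit_to(target_cpu, [key, conn = std::move(conn)] () mutable {
        if (auto it = waiters.find(key); it != waiters.end()) {
            it->second.set_value(std::move(conn));  // an accept() is already waiting: wake it
            waiters.erase(it);
        } else {
            conn_q.emplace(key, std::move(conn));   // park it for a later accept()
        }
    });
}

// Target shard. If this runs *before* the submit_to continuation above, the
// parked entry stays in conn_q until thread exit, and its destructor then
// tries to schedule work on a reactor that no longer exists.
void abort_accept(key_t key) {
    waiters.erase(key);
    conn_q.erase(key);
}
```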

If the target shard does abort_accept(), it should clean both the sockets and conn_q maps of any pending entries. However, if the moving fiber is delayed and abort_accept() runs first, an entry appears in conn_q that nobody picks up, and it lives until seastar shuts down.

That's right.

If that's correct, then shouldn't the fix be to avoid this race,

Yes, avoiding the race would be ideal. And I think there are probably many strategies for this, including replacing this rendezvous mechanism, which is difficult to work with.

i.e. the in-flight cross-shard entry should somehow be waited on by abort_accept, so that it doesn't hang around in conn_q until shutdown?

The problem with having the target shard wait like you suggest is that the cross-shard entry is in flight: the target shard doesn't have anything to wait on. I suppose the target shard could reach back over to core 0 and check for in-flight entries.

I played a bit with the idea of having the target shard's abort_accept record something like a tombstone so that the in-flight entry would be rejected, making the race safe. However, the conn_q data structure is indexed by protocol/socket address, and there didn't seem to be an obviously correct policy for clearing the tombstones, which would be necessary in practice (e.g. a socket server shut down and then restarted on the same address).
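
Purely for illustration, a tombstone variant along those lines might have looked like the sketch below. It was never implemented; it reuses the names from the simplified model above and glosses over the real (protocol, address) key type.

```
#include <unordered_set>

// Hypothetical: record aborted listeners so a late cross-core move can be
// rejected instead of parked.
static thread_local std::unordered_set<key_t> aborted_keys;

void abort_accept_with_tombstone(key_t key) {
    waiters.erase(key);
    conn_q.erase(key);
    aborted_keys.insert(key);      // tombstone for any move still in flight
}

void on_moved_connection(key_t key, pending_conn conn) {
    if (aborted_keys.count(key)) {
        return;                    // listener already aborted: drop, don't park
    }
    conn_q.emplace(key, std::move(conn));
}

// The unresolved question: when is it safe to erase entries from aborted_keys?
// A server shut down and later re-bound to the same address would otherwise
// have its first incoming connections silently dropped.
```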

From what we can tell the race is very rare, and only happens on system shutdown when a new connection happens to arrive in the vulnerable window after the target shard's abort_accept but before shard 0's abort_accept. So this PR is really a band-aid for a nagging issue that shows up as an LSAN leak causing an occasional CI failure, not a fix for the race.

I'm happy to go back to the drawing board and try to fix the race if you'd like to avoid the hack.

dotnwat commented Dec 8, 2022

bump

xemul commented Dec 8, 2022

Sorry. I think that the way to go is to try to avoid the root-cause race. The hack is awesome, but the raced connection still leaks until seastar shutdown, which is not nice.

dotnwat commented Dec 8, 2022

Sorry. I think that the way to go is to try to avoid the root-cause race. The hack is awesome, but the raced connection still leaks until seastar shutdown, which is not nice.

sgtm. I'll look into a more fundamental fix, although with the current socket implementation it's not entirely clear what a better solution would look like that doesn't involve public API changes. There is a reproducer in the linked issue, and several folks have reported this issue, in case opinions change about the temporary LSAN fix in the short term.

dotnwat added a commit to dotnwat/redpanda that referenced this pull request Dec 21, 2022
A memory leak pops up in our CI from time to time
(redpanda-data#3338). It is caused by
a very narrow race within Seastar when incoming connections arrive while
a socket server is being shut down.

I've posted a "fix" for this in upstream Seastar (link below) which was
effectively rejected because it doesn't really fix the leak, but merely
cleans up on exit to make LSAN happy. AFAICT this leak can only happen
when a socket server is being shut down, and in Redpanda, this only
happens when Redpanda is shutting down. In this case it is harmless,
but annoying because it will occasionally fail our CI.

Adding an LSAN suppressions file allows us to instruct LSAN to ignore
this particular leak when it occasionally does trigger.

Fixes: redpanda-data#3338

See scylladb/seastar#1265 for more discussion
and a link to an issue with a reproducer.

Signed-off-by: Noah Watkins <noahwatkins@gmail.com>
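
For context, an LSAN suppressions file is a plain-text list of leak:<pattern> rules passed to the run via LSAN_OPTIONS=suppressions=<file>. A plausible entry for this particular leak might look like the following; the pattern is illustrative and not necessarily what Redpanda ships:

```
# lsan_suppressions.txt (illustrative): ignore the shutdown-time leak of the
# continuation scheduled by conntrack::handle's destructor
leak:seastar::net::conntrack
```
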
vbotbuildovich pushed a commit to vbotbuildovich/redpanda that referenced this pull request Jan 27, 2023
(cherry picked from commit 778704b)
vbotbuildovich pushed a commit to vbotbuildovich/redpanda that referenced this pull request Feb 28, 2023
(cherry picked from commit 778704b)
Successfully merging this pull request may close these issues.

LeakSanitizer: detected memory leaks from seastar::net::conntrack::load_balancer