posix-stack: fix memory leak with shutdown/accept race #1265
When `posix_server_socket_impl::accept()` runs, it may start a cross-core background fiber that inserts a pending connection into the thread-local container `posix_ap_server_socket_impl::conn_q`. However, the continuation that enqueues the pending connection may not actually run until after the target core calls `abort_accept()` (e.g. a parallel shutdown via `seastar::sharded<server>::stop()`). This can leave an entry in the `conn_q` container that is destroyed when the reactor thread exits. Unfortunately, `conn_q` holds a `conntrack::handle` type that schedules additional work in its destructor:

```
class handle {
    foreign_ptr<lw_shared_ptr<load_balancer>> _lb;
    ~handle() {
        (void)smp::submit_to(_host_cpu, [cpu = _target_cpu, lb = std::move(_lb)] {
            lb->closed_cpu(cpu);
        });
    }
    ...
```

When this race occurs and the destructor runs, the reactor is no longer available, leading to the following memory leak in which the continuation that is scheduled onto the reactor is leaked:

```
Direct leak of 88 byte(s) in 1 object(s) allocated from:
    #0 0x557c91ca5b7d in operator new(unsigned long) /v/llvm/llvm/src/compiler-rt/lib/asan/asan_new_delete.cpp:95:3
    #1 0x557ca3e3cc08 in void seastar::future<void>::schedule<seastar::internal::promise_ba...
    ...
    // the unordered map here is conn_q
    #19 0x557ca47034d8 in std::__1::unordered_multimap<std::__1::tuple<int, seastar::socket...
    #20 0x7f98dcaf238e in __call_tls_dtors (/lib64/libc.so.6+0x4038e) (BuildId: 6e3c087aca9...
```

fixes: scylladb#738

Signed-off-by: Noah Watkins <noahwatkins@gmail.com>
```
future<> posix_network_stack::initialize() {
    engine().at_exit([] {
        return exiting_gate.close().then([] {
            posix_ap_server_socket_impl::conn_q.clear();
```
Why is this `clear()` needed? If we're on the main thread, this queue is empty; if on an auxiliary thread, then the check in `move_connected_socket` makes sure no entries are there any longer.
@xemul sorry about the delay getting back to you!

> Why is this `clear()` needed?

The primary purpose of the gate on the main thread is not to fix the race definitively, but to introduce a barrier so that when a socket is closed during shutdown, if the race occurred, the entry lands on the other core early enough that we can clean up with `conn_q.clear()` before the reactor thread exits and schedules more continuations on a shut-down reactor.

The race still exists because there is no internal cross-core synchronization happening with `abort_accept` calls. With the gate, if the application ensures `abort_accept` is called on the main thread before all other threads, then the race doesn't exist, but that isn't obvious at the API level, where folks will usually do something like `sharded<>::stop()`, which ends up with a non-deterministic shutdown ordering.

> How could the race be completely fixed?

I worked on this a bit in the realm of more elaborate internal accounting tricks, but they all started to spiral out of control. Ultimately it seems like there might need to be a change at the socket API level such that `abort_accept` synchronizes across cores in some fashion.
I still don't get the fix :( If the gate gets closed on shard-0, it means that the accepted socket has landed on shard-N's lists. To avoid leaks there should be something that cleans the lists on shard-N after shard-0's gate is closed, but I don't see it.
Thanks for being patient. I think my explanations haven't been great. But I think I understand where the confusion might be, given your next question:

I thought I mentioned in the commit (but now realize I didn't) that there is an implicit assumption about at_exit handler ordering. The reactor explicitly runs at_exit handlers on shard 0 before any other shards:

```
void reactor::stop() {
    assert(_id == 0);
    _smp->cleanup_cpu();
    if (!_stopping) {
        // Run exit tasks locally and then stop all other engines
        // in the background and wait on semaphore for all to complete.
        // Finally, set _stopped on cpu 0.
        (void)run_exit_tasks().then([this] {
            return do_with(semaphore(0), [this] (semaphore& sem) {
                // Stop other cpus asynchronously, signal when done.
                (void)smp::invoke_on_others(0, [] {
                    engine()._smp->cleanup_cpu();
                    return engine().run_exit_tasks().then([] {
```

Since the following at_exit handler is registered on all shards:

```
future<> posix_network_stack::initialize() {
    engine().at_exit([] {
        return exiting_gate.close().then([] {
            posix_ap_server_socket_impl::conn_q.clear();
```

shard 0 will first close the gate, ensuring the socket lands on shard N. Then all other shards will close the gate and clear the list. It's slightly confusing that the gate is thread_local since it only serves a functional purpose on shard 0, but I didn't see an obviously better place to put it.
@xemul does that latest update help clarify?
@dotnwat, sorry for getting back late. I'm still trying to wrap my head around it. Please check my understanding of the race and correct me where you see I'm wrong :)

So if the target shard does

If that's correct, then shouldn't the fix be to avoid this race, i.e. the in-cross-shard-flight entry should somehow be waited on by abort_accept, so that it doesn't hang out in the conn_q until shutdown?
No problem at all, @xemul.

Yep. This is the rendezvous that occurs between the cores.

That's right.

Yes, avoiding the race would be ideal. And I think there are probably many strategies for this, including replacing this rendezvous mechanism, which is difficult to work with.

The problem with having the target shard wait like you suggest is that the in-cross-shard-flight entry is in flight: the target shard doesn't have anything to wait on. I suppose the target shard could reach back over to core-0 and check for in-flight entries. I had played a bit with the idea of having the target shard's abort_accept record something like a tombstone so that the in-flight entry would be rejected, making the race safe.

From what we can tell the race is very rare, and only happens on system shutdown when a new connection happens to arrive in the vulnerable period right before shard-0's abort_accept but after the target shard's abort_accept. So this PR is really a band-aid for a nagging issue that shows up as an LSAN leak causing an occasional CI failure, not a fix for the race. I'm happy to go back to the drawing board and try to fix the race if you'd like to avoid the hack.
bump |
Sorry. I think the way to go is to try to avoid the root-cause race. The hack is awesome, but the raced connection still leaks until Seastar shutdown, which isn't nice.
sgtm. I'll look into a more fundamental fix, although with the current socket implementation it's not entirely clear that there is a better solution that doesn't involve public API changes. There is a reproducer in the linked issue, and several folks are reporting this issue, in case opinions change about the temporary fix for LSAN in the short term.
A memory leak pops up in our CI from time to time (redpanda-data#3338). It is caused by a very narrow race within Seastar when incoming connections arrive while a socket server is being shut down.

I've posted a "fix" for this in upstream Seastar (link below) which was effectively rejected because it doesn't really fix the leak, but merely cleans up on exit to make LSAN happy.

AFAICT this leak can only happen when a socket server is being shut down, and in Redpanda this only happens when Redpanda itself is shutting down. In this case it is harmless, but annoying because it will occasionally fail our CI. Adding an LSAN suppressions file allows us to instruct LSAN to ignore this particular leak when it occasionally does trigger.

Fixes: redpanda-data#3338

See scylladb/seastar#1265 for more discussion and a link to an issue with a reproducer.

Signed-off-by: Noah Watkins <noahwatkins@gmail.com>
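For reference, an LSAN suppressions file is a plain-text list of `leak:<pattern>` lines matched against function, module, or source-file names in the symbolized leak stack, loaded via `LSAN_OPTIONS=suppressions=<path>`. A sketch of what such a file might contain for this leak (the pattern is an assumption based on the stack trace in the Seastar commit message; the real file should match the actual symbolized frames):

```
# lsan_suppressions.txt -- loaded via LSAN_OPTIONS=suppressions=<path>
# Illustrative pattern only: matches frames mentioning conntrack, whose
# handle destructor schedules the continuation that leaks at shutdown.
leak:conntrack
```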