Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

input: disable coroutines for custom events and input callbacks that do not support coroutines. #9338

Merged
merged 4 commits into from
Sep 16, 2024

Conversation

pwhelan
Copy link
Contributor

@pwhelan pwhelan commented Sep 3, 2024

Summary

Code that does not specifically support coroutines can result in a segmentation fault when coroutines are used during I/O and network calls. Under most circumstances this should not occur but if an input plugin shares the same thread as any output plugin that can inadvertently enable coroutines for all input plugins as well as causing custom events to use coroutines as well.

This PR mitigates this by juggling out the thread local storage used by coroutines during collect callbacks for non-coroutine inputs as well as custom events which do not have a way to express their compatibility with coroutines.

This PR has been refactored so that both the flb_io_net_connect and flb_tls_session_create functions respect the FLB_IO_ASYNC flag instead of assuming it when coroutines have been initialized.

This also mitigates the issue for FLB_ENGINE_EV_CUSTOM:

[2024/09/11 19:44:27] [debug] [upstream] KA connection #44 to cloud-api.calyptia.com:443 is now available
Process 69510 stopped
* thread #2, name = 'flb-pipeline', stop reason = breakpoint 1.1
    frame #0: 0x00000001001288a8 calyptia-fluent-bit`flb_engine_start(config=0x0000000101900200) at flb_engine.c:1040:22
   1037	                /* Init coroutine */
   1038	                flb_coro_resume(output_flush->coro);
   1039	            }
-> 1040	            else if (event->type == FLB_ENGINE_EV_CUSTOM) {
   1041	                event->handler(event);
   1042	            }
   1043	            else if (event->type == FLB_ENGINE_EV_THREAD) {
Target 0: (calyptia-fluent-bit) stopped.
(lldb) c
Process 69510 resuming
[0] logs: [[1726094665.418981000, {"id"=>"uuid", "email"=>"test@account.com", "external_email"=>"foo@bar.com", "first_name"=>"Test", "last_name"=>"User", "created_by"=>"no-idea", "state"=>2, "type"=>1, "user_account.provider"=>4, "user_account.provider_id"=>"THIS_IS_A_PROVIDER", "service_account.id"=>"692c02c283cca2aa9722499313c805c064639a855a85b791607b89ef77c27639", "service_account.email"=>"service@account.com", "service_account.name"=>"This Is a Service Account", "service_account.created_by"=>"Some guy in IT called Roy", "service_account.version"=>65541, "service_account.permission"=>3, "service_account.source_client"=>"The Source of everything: 42", "service_account.slug"=>"Captain Slug to you!", "First"=>"Psot!", "tenant_active.id"=>"78e8b9ab-6735-4a5e-9c4c-cb38d26de21d"}], {"service"=>"startup-test", "severity"=>"INFO"}]
[2024/09/11 19:44:28] [debug] [out flush] cb_destroy coro_id=0
[2024/09/11 19:44:28] [debug] [task] destroy task=0x105504b40 (task_id=0)
[2024/09/11 19:44:28] [debug] [out flush] cb_destroy coro_id=0

Enter [N/A] in the box, if an item is not applicable to your change.

Testing
Before we can approve your change; please submit the following in a comment:

  • Example configuration file for the change
  • Debug log output from testing the change
  • Attached Valgrind output that shows no leaks or memory corruption was found

If this is a change to packaging of containers or native binaries then please confirm it works for all targets.

  • Run local packaging test showing all targets (including any new ones) build.
  • Set ok-package-test label to test for all targets (requires maintainer to do).

Documentation

  • Documentation required for this feature

Backporting

  • [Y] Backport to latest stable release.

I have backported this PR to the 3.0 and 3.1 series:


Fluent Bit is licensed under Apache 2.0, by submitting this pull request I understand that this code will be released under the terms of that license.

@leonardo-albertovich
Copy link
Collaborator

Food for thought: what would be the complexity of modifying the input event dispatcher to actually use coroutines? I don't know if I'm missing a big deal or if the simplistic approach would be too awful performance wise but for some reason I think it shouldn't be hard to pull off.

In a way I think it'd be preferable to leave this PR on hold for a few days and give that one a try because it could be quite beneficial performance wise and I think it would be better to address the issue now that we have a clear picture of what the problem is rather than working around it and leaving it for someone to uncover in the future but I understand that this request can be a bit frustrating.

@edsiper
Copy link
Member

edsiper commented Sep 5, 2024

Thanks for opening this PR.

Code that does not specifically support coroutines can result in a segmentation fault when coroutines are used during I/O and network calls.

I am wondering if this is fixing the root cause of the problem or is just a workaround. what about if we schedule a call and we walk through this ?

@pwhelan
Copy link
Contributor Author

pwhelan commented Sep 6, 2024

Food for thought: what would be the complexity of modifying the input event dispatcher to actually use coroutines? I don't know if I'm missing a big deal or if the simplistic approach would be too awful performance wise but for some reason I think it shouldn't be hard to pull off.

I am all for it but hadn't taken the time yet to attempt that approach. One problem might be how to handle custom events since we cannot be sure if that is registered to an input or an output plugin, but its possible that distinction might make no difference.

I am wondering if this is fixing the root cause of the problem or is just a workaround.

I agree. I do not necessarily think this needs to be the final fix for this issue but I did feel the need to present it.

what about if we schedule a call and we walk through this ?

👍

Signed-off-by: Phillip Adair Stewart Whelan <phillip.whelan@chronosphere.io>
Signed-off-by: Phillip Adair Stewart Whelan <phillip.whelan@chronosphere.io>
Signed-off-by: Phillip Whelan <phillip.whelan@chronosphere.io>
@pwhelan
Copy link
Contributor Author

pwhelan commented Sep 12, 2024

I ran the flb-rt-in_tcp and flb-rt-out_tcp tests which cover both changes:

$ valgrind ./bin/flb-rt-out_tcp
==232901== Memcheck, a memory error detector
==232901== Copyright (C) 2002-2024, and GNU GPL'd, by Julian Seward et al.
==232901== Using Valgrind-3.23.0 and LibVEX; rerun with -h for copyright info
==232901== Command: ./bin/flb-rt-out_tcp
==232901==
Test tcp_with_tls...                            ==232901== Conditional jump or move depends on uninitialised value(s)
==232901==    at 0x1DF6F9: flb_net_server (flb_network.c:1582)
==232901==    by 0x2104A5: flb_downstream_setup (flb_downstream.c:125)
==232901==    by 0x21069E: flb_downstream_create (flb_downstream.c:177)
==232901==    by 0x18E03A: flb_test_tcp_with_tls (out_tcp.c:346)
==232901==    by 0x18B5BF: acutest_do_run_ (acutest.h:1034)
==232901==    by 0x18B9E5: acutest_run_ (acutest.h:1205)
==232901==    by 0x18CE08: main (acutest.h:1769)
==232901==
[ OK ]
Test format_msgpack...                          [ OK ]
Test format_json...                             [ OK ]
Test format_json_stream...                      [ OK ]
Test format_json_lines...                       [ OK ]
Test set_json_date_key...                       [ OK ]
Test disable_json_date_key...                   [ OK ]
Test json_date_format_epoch...                  [ OK ]
Test json_date_format_iso8601...                [ OK ]
Test json_date_format_java_sql_timestamp...     [ OK ]
Test tcp_exit_workers...                        [ OK ]
Test tcp_exit_no_workers...                     [ OK ]
SUCCESS: All unit tests have passed.
==232901==
==232901== HEAP SUMMARY:
==232901==     in use at exit: 0 bytes in 0 blocks
==232901==   total heap usage: 110,728 allocs, 110,728 frees, 17,101,837 bytes allocated
==232901==
==232901== All heap blocks were freed -- no leaks are possible
==232901==
==232901== Use --track-origins=yes to see where uninitialised values come from
==232901== For lists of detected and suppressed errors, rerun with: -s
==232901== ERROR SUMMARY: 1 errors from 1 contexts (suppressed: 0 from 0)
$ valgrind ./bin/flb-rt-in_tcp
==234643== Memcheck, a memory error detector
==234643== Copyright (C) 2002-2024, and GNU GPL'd, by Julian Seward et al.
==234643== Using Valgrind-3.23.0 and LibVEX; rerun with -h for copyright info
==234643== Command: ./bin/flb-rt-in_tcp
==234643==
Test tcp...                                     [ OK ]
Test tcp_with_source_address...                 [ OK ]
Test tcp_with_tls...                            [ OK ]
Test format_none...                             [ OK ]
Test format_none_separator...                   [ OK ]
Test 65535_records_issue_5336...                [ OK ]
SUCCESS: All unit tests have passed.
==234643==
==234643== HEAP SUMMARY:
==234643==     in use at exit: 0 bytes in 0 blocks
==234643==   total heap usage: 865,240 allocs, 865,240 frees, 1,840,708,122 bytes allocated
==234643==
==234643== All heap blocks were freed -- no leaks are possible
==234643==
==234643== For lists of detected and suppressed errors, rerun with: -s
==234643== ERROR SUMMARY: 0 errors from 0 contexts (suppressed: 0 from 0)

The tests pass and do not leak any memory.

The following code sets all connection to FLB_IO_ASYNC by default, which I actually disable explicitly in the code that I am using:

flb_stream_enable_flags(&u->base, FLB_IO_ASYNC);

It is called in flb_upstream_create. I do not know what the full implications of turning it off by default would be so I have left it as is for now.

There is also no flb_input_plugin struct in the scope of flb_upstream_create to be able to check for the FLB_INPUT_CORO flags from it. Maybe we could wrap flb_upstream_create with flb_upstream_input_create and pass that a flb_input_plugin pointer to check the flag there?

… flb_tls_session_create.

Signed-off-by: Phillip Whelan <phillip.whelan@chronosphere.io>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants