-
Notifications
You must be signed in to change notification settings - Fork 133
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[engine] caught signal (SIGSEGV) #351
Comments
@LucasHantz Can you please attach your task def and any custom Fluent Bit config? Have you seen this crash consistently? Does it only happen after Fluent Bit receives a SIGTERM? Also:
I don't see this in the log output you shared, can you share full log output with this? |
Hello @PettitWesley The log is from our internal application which is forwarding the logs to tcp://127.0.0.1:5170 |
@LucasHantz Interesting. Recently I worked on testing similar reports of failures with the TCP input- I forgot to post the results of my investigation publicly, I will do that ASAP. What is your approximate throughput emitted to the TCP input? My tests were with the log4j tcp appender: https://github.com/aws/aws-for-fluent-bit/tree/mainline/troubleshooting/tools/log4j-tcp-app My only finding was that adding workers to ALL outputs significantly improves the throughput that the TCP input can accept. This is likely because without workers, the entirety of Fluent Bit is one thread, with both inputs and outputs fighting for time: https://github.com/fluent/fluent-bit/blob/master/DEVELOPER_GUIDE.md#concurrency However, once you add workers, the outputs get their own separate threads, freeing up the main thread to focus exclusively on inputs. I still think Fluent Bit needs workers for inputs to truly scale, but I didn't see workers for your outputs in your config, so please try adding that. |
Is there some way to get that stat of throughput from FluentBit? Quite difficult to get from the application level as the PHP-FPM forks multiples process which can handle traffic and each one has a socket open with the same TCP input. All right I'll try it with a worker on each output, and see how it goes but does that explain the SIGSEGV as well? |
The SIGSEGV must be a different issue. Which I will try my best to repro.
To be clear here, are you talking about the SIGSEGV issue or the TCP input issue? I think there are two separate issues here. |
@LucasHantz Also wait... I realized from your config you are using the older lower performance go plugin, you need to migrate in addition to enabling workers: https://github.com/aws/amazon-kinesis-firehose-for-fluent-bit#new-higher-performance-core-fluent-bit-plugin Also, actually now that I look at your logs again I am confused:
I see here you are also using the low performance CW plugin, that needs to be migrated as well: https://github.com/aws/amazon-cloudwatch-logs-for-fluent-bit#new-higher-performance-core-fluent-bit-plugin Also I am confused here because the logs don't match up with the config you shared- I don't see the CW plugin used in your config, and I don't see inputs with aliases logs, metrics or io. Are the logs and config from the same test run? |
You can try the monitoring interface and divide by uptime: https://docs.fluentbit.io/manual/administration/monitoring#health-check-for-fluent-bit |
@LucasHantz Please see the results of my TCP input testing here: #294 |
FTR, today I upgraded to 2.26.0 on 2 clusters:
"Luckily" it crashed only on the staging system 😅 Going back to 2.25.1 worked for me. I realize we're talking about different versions here, as in my case I suspect it might be related to fluent-bit 1.9.4 (=2.26.0) vs. fluent-bit 1.9.3 (=2.25.1) |
SIGSEGV is also generated after receiving SIGTERM in v2.24.0 as well. |
We believe this crash report is likely the same as is described here: fluent/fluent-bit#5753 (comment) The fix will be released in 2.28.1 #418 |
Hi all, we are facing the same error message
which corresponds to FluentBit pods' restarts and it's kind of annoying. We are using the the aws-for-fluent-bit:2.32.0 image that packages the version 1.9.10 of FluentBit. Is there any way to get this issue fixed without any downgrade of the lib? |
When trying to upgrade to the latest version of Firelens (v2.25.0 packaged with FluentBit v1.9.3), we regularly have our socket cut "End-of-file reached, probably we got disconnected (sent 0 of 294)"
[1mFluent Bit v1.9.3�[0m
[2022/05/12 09:28:48] [ info] [fluent bit] version=1.9.3, commit=a313296229, pid=1
[2022/05/12 09:28:48] [ info] [storage] version=1.2.0, type=memory-only, sync=normal, checksum=disabled, max_chunks_up=128
[2022/05/12 09:28:48] [ info] [cmetrics] version=0.3.1
[2022/05/12 09:28:48] [ info] [input:forward:forward.0] listening on unix:///var/run/fluent.sock
[2022/05/12 09:28:48] [ info] [input:forward:forward.1] listening on 127.0.0.1:24224
[2022/05/12 09:28:48] [ info] [input:tcp:tcp.2] listening on 127.0.0.1:8877
[2022/05/12 09:28:48] [ info] [input:tcp:tcp.3] listening on 127.0.0.1:5170
AWS for Fluent Bit Container Image Version 2.25.0time="2022-05-12T09:28:48Z" level=info msg="A new higher performance Firehose plugin has been released; you are using the old plugin. Check out the new plugin's documentation and consider migrating.\nhttps://docs.fluentbit.io/manual/pipeline/outputs/firehose"
time="2022-05-12T09:28:48Z" level=info msg="[firehose 0] plugin parameter delivery_stream = 'xxx'"
time="2022-05-12T09:28:48Z" level=info msg="[firehose 0] plugin parameter region = 'eu-west-1'"
time="2022-05-12T09:28:48Z" level=info msg="[firehose 0] plugin parameter data_keys = ''"
time="2022-05-12T09:28:48Z" level=info msg="[firehose 0] plugin parameter role_arn = ''"
time="2022-05-12T09:28:48Z" level=info msg="[firehose 0] plugin parameter endpoint = ''"
time="2022-05-12T09:28:48Z" level=info msg="[firehose 0] plugin parameter sts_endpoint = ''"
time="2022-05-12T09:28:48Z" level=info msg="[firehose 0] plugin parameter time_key = ''"
time="2022-05-12T09:28:48Z" level=info msg="[firehose 0] plugin parameter time_key_format = ''"
time="2022-05-12T09:28:48Z" level=info msg="[firehose 0] plugin parameter log_key = ''"
time="2022-05-12T09:28:48Z" level=info msg="[firehose 0] plugin parameter replace_dots = ''"
time="2022-05-12T09:28:48Z" level=info msg="[firehose 0] plugin parameter simple_aggregation = 'false'"
time="2022-05-12T09:28:48Z" level=info msg="A new higher performance Firehose plugin has been released; you are using the old plugin. Check out the new plugin's documentation and consider migrating.\nhttps://docs.fluentbit.io/manual/pipeline/outputs/firehose"
time="2022-05-12T09:28:48Z" level=info msg="[firehose 1] plugin parameter delivery_stream = 'xxx'"
time="2022-05-12T09:28:48Z" level=info msg="[firehose 1] plugin parameter region = 'eu-west-1'"
time="2022-05-12T09:28:48Z" level=info msg="[firehose 1] plugin parameter data_keys = ''"
time="2022-05-12T09:28:48Z" level=info msg="[firehose 1] plugin parameter role_arn = ''"
time="2022-05-12T09:28:48Z" level=info msg="[firehose 1] plugin parameter endpoint = ''"
time="2022-05-12T09:28:48Z" level=info msg="[firehose 1] plugin parameter sts_endpoint = ''"
time="2022-05-12T09:28:48Z" level=info msg="[firehose 1] plugin parameter time_key = ''"
time="2022-05-12T09:28:48Z" level=info msg="[firehose 1] plugin parameter time_key_format = ''"
time="2022-05-12T09:28:48Z" level=info msg="[firehose 1] plugin parameter log_key = ''"
time="2022-05-12T09:28:48Z" level=info msg="[firehose 1] plugin parameter replace_dots = ''"
time="2022-05-12T09:28:48Z" level=info msg="[firehose 1] plugin parameter simple_aggregation = 'false'"
time="2022-05-12T09:28:48Z" level=info msg="A new higher performance Firehose plugin has been released; you are using the old plugin. Check out the new plugin's documentation and consider migrating.\nhttps://docs.fluentbit.io/manual/pipeline/outputs/firehose"
time="2022-05-12T09:28:48Z" level=info msg="[firehose 2] plugin parameter delivery_stream = 'xxx'"
time="2022-05-12T09:28:48Z" level=info msg="[firehose 2] plugin parameter region = 'eu-west-1'"
time="2022-05-12T09:28:48Z" level=info msg="[firehose 2] plugin parameter data_keys = ''"
time="2022-05-12T09:28:48Z" level=info msg="[firehose 2] plugin parameter role_arn = ''"
time="2022-05-12T09:28:48Z" level=info msg="[firehose 2] plugin parameter endpoint = ''"
time="2022-05-12T09:28:48Z" level=info msg="[firehose 2] plugin parameter sts_endpoint = ''"
time="2022-05-12T09:28:48Z" level=info msg="[firehose 2] plugin parameter time_key = ''"
time="2022-05-12T09:28:48Z" level=info msg="[firehose 2] plugin parameter time_key_format = ''"
time="2022-05-12T09:28:48Z" level=info msg="[firehose 2] plugin parameter log_key = ''"
time="2022-05-12T09:28:48Z" level=info msg="[firehose 2] plugin parameter replace_dots = ''"
time="2022-05-12T09:28:48Z" level=info msg="[firehose 2] plugin parameter simple_aggregation = 'false'"
time="2022-05-12T09:28:48Z" level=info msg="[cloudwatch 0] plugin parameter log_group_name = 'xxx'"
time="2022-05-12T09:28:48Z" level=info msg="[cloudwatch 0] plugin parameter default_log_group_name = 'fluentbit-default'"
time="2022-05-12T09:28:48Z" level=info msg="[cloudwatch 0] plugin parameter log_stream_prefix = 'web/'"
time="2022-05-12T09:28:48Z" level=info msg="[cloudwatch 0] plugin parameter log_stream_name = ''"
time="2022-05-12T09:28:48Z" level=info msg="[cloudwatch 0] plugin parameter default_log_stream_name = '/fluentbit-default'"
time="2022-05-12T09:28:48Z" level=info msg="[cloudwatch 0] plugin parameter region = 'eu-west-1'"
time="2022-05-12T09:28:48Z" level=info msg="[cloudwatch 0] plugin parameter log_key = 'log'"
time="2022-05-12T09:28:48Z" level=info msg="[cloudwatch 0] plugin parameter role_arn = ''"
time="2022-05-12T09:28:48Z" level=info msg="[cloudwatch 0] plugin parameter auto_create_group = 'true'"
time="2022-05-12T09:28:48Z" level=info msg="[cloudwatch 0] plugin parameter new_log_group_tags = ''"
time="2022-05-12T09:28:48Z" level=info msg="[cloudwatch 0] plugin parameter log_retention_days = '0'"
time="2022-05-12T09:28:48Z" level=info msg="[cloudwatch 0] plugin parameter endpoint = ''"
time="2022-05-12T09:28:48Z" level=info msg="[cloudwatch 0] plugin parameter sts_endpoint = ''"
time="2022-05-12T09:28:48Z" level=info msg="[cloudwatch 0] plugin parameter external_id = ''"
time="2022-05-12T09:28:48Z" level=info msg="[cloudwatch 0] plugin parameter credentials_endpoint = "
time="2022-05-12T09:28:48Z" level=info msg="[cloudwatch 0] plugin parameter log_format = ''"
[2022/05/12 09:28:48] [ info] [sp] stream processor started
[2022/05/12 09:28:48] [ info] [sp] registered task: logs
[2022/05/12 09:28:48] [ info] [sp] registered task: metrics
[2022/05/12 09:28:48] [ info] [sp] registered task: io
[2022/05/12 09:28:48] [ info] [output:null:null.3] worker #0 started
time="2022-05-12T09:28:54Z" level=info msg="[cloudwatch 0] Log group xxx already exists\n"
time="2022-05-12T09:28:54Z" level=info msg="[cloudwatch 0] Created log stream xxx in group xxx"
[2022/05/12 09:41:19] [engine] caught signal (SIGTERM)
[2022/05/12 09:41:19] [ info] [input] pausing forward.0
[2022/05/12 09:41:19] [ info] [input] pausing forward.1
[2022/05/12 09:41:19] [ info] [input] pausing logs
[2022/05/12 09:41:19] [ info] [input] pausing metrics
[2022/05/12 09:41:19] [ info] [input] pausing io
[2022/05/12 09:41:19] [ warn] [engine] service will shutdown in max 30 seconds
[2022/05/12 09:41:20] [ info] [engine] service has stopped (0 pending tasks)
[2022/05/12 09:41:20] [engine] caught signal (SIGSEGV)
Going back to v2.23.3 packaged with Fluent Bit v1.8.15
The text was updated successfully, but these errors were encountered: