Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[engine] caught signal (SIGSEGV) #351

Open
LucasHantz opened this issue May 12, 2022 · 12 comments
Open

[engine] caught signal (SIGSEGV) #351

LucasHantz opened this issue May 12, 2022 · 12 comments

Comments

@LucasHantz
Copy link

When trying to upgrade to the latest version of Firelens (v2.25.0 packaged with FluentBit v1.9.3), we regularly have our socket cut "End-of-file reached, probably we got disconnected (sent 0 of 294)"

[1mFluent Bit v1.9.3�[0m

  • �[1m�[93mCopyright (C) 2015-2022 The Fluent Bit Authors�[0m
  • Fluent Bit is a CNCF sub-project under the umbrella of Fluentd
  • https://fluentbit.io
    [2022/05/12 09:28:48] [ info] [fluent bit] version=1.9.3, commit=a313296229, pid=1
    [2022/05/12 09:28:48] [ info] [storage] version=1.2.0, type=memory-only, sync=normal, checksum=disabled, max_chunks_up=128
    [2022/05/12 09:28:48] [ info] [cmetrics] version=0.3.1
    [2022/05/12 09:28:48] [ info] [input:forward:forward.0] listening on unix:///var/run/fluent.sock
    [2022/05/12 09:28:48] [ info] [input:forward:forward.1] listening on 127.0.0.1:24224
    [2022/05/12 09:28:48] [ info] [input:tcp:tcp.2] listening on 127.0.0.1:8877
    [2022/05/12 09:28:48] [ info] [input:tcp:tcp.3] listening on 127.0.0.1:5170
    AWS for Fluent Bit Container Image Version 2.25.0time="2022-05-12T09:28:48Z" level=info msg="A new higher performance Firehose plugin has been released; you are using the old plugin. Check out the new plugin's documentation and consider migrating.\nhttps://docs.fluentbit.io/manual/pipeline/outputs/firehose"
    time="2022-05-12T09:28:48Z" level=info msg="[firehose 0] plugin parameter delivery_stream = 'xxx'"
    time="2022-05-12T09:28:48Z" level=info msg="[firehose 0] plugin parameter region = 'eu-west-1'"
    time="2022-05-12T09:28:48Z" level=info msg="[firehose 0] plugin parameter data_keys = ''"
    time="2022-05-12T09:28:48Z" level=info msg="[firehose 0] plugin parameter role_arn = ''"
    time="2022-05-12T09:28:48Z" level=info msg="[firehose 0] plugin parameter endpoint = ''"
    time="2022-05-12T09:28:48Z" level=info msg="[firehose 0] plugin parameter sts_endpoint = ''"
    time="2022-05-12T09:28:48Z" level=info msg="[firehose 0] plugin parameter time_key = ''"
    time="2022-05-12T09:28:48Z" level=info msg="[firehose 0] plugin parameter time_key_format = ''"
    time="2022-05-12T09:28:48Z" level=info msg="[firehose 0] plugin parameter log_key = ''"
    time="2022-05-12T09:28:48Z" level=info msg="[firehose 0] plugin parameter replace_dots = ''"
    time="2022-05-12T09:28:48Z" level=info msg="[firehose 0] plugin parameter simple_aggregation = 'false'"
    time="2022-05-12T09:28:48Z" level=info msg="A new higher performance Firehose plugin has been released; you are using the old plugin. Check out the new plugin's documentation and consider migrating.\nhttps://docs.fluentbit.io/manual/pipeline/outputs/firehose"
    time="2022-05-12T09:28:48Z" level=info msg="[firehose 1] plugin parameter delivery_stream = 'xxx'"
    time="2022-05-12T09:28:48Z" level=info msg="[firehose 1] plugin parameter region = 'eu-west-1'"
    time="2022-05-12T09:28:48Z" level=info msg="[firehose 1] plugin parameter data_keys = ''"
    time="2022-05-12T09:28:48Z" level=info msg="[firehose 1] plugin parameter role_arn = ''"
    time="2022-05-12T09:28:48Z" level=info msg="[firehose 1] plugin parameter endpoint = ''"
    time="2022-05-12T09:28:48Z" level=info msg="[firehose 1] plugin parameter sts_endpoint = ''"
    time="2022-05-12T09:28:48Z" level=info msg="[firehose 1] plugin parameter time_key = ''"
    time="2022-05-12T09:28:48Z" level=info msg="[firehose 1] plugin parameter time_key_format = ''"
    time="2022-05-12T09:28:48Z" level=info msg="[firehose 1] plugin parameter log_key = ''"
    time="2022-05-12T09:28:48Z" level=info msg="[firehose 1] plugin parameter replace_dots = ''"
    time="2022-05-12T09:28:48Z" level=info msg="[firehose 1] plugin parameter simple_aggregation = 'false'"
    time="2022-05-12T09:28:48Z" level=info msg="A new higher performance Firehose plugin has been released; you are using the old plugin. Check out the new plugin's documentation and consider migrating.\nhttps://docs.fluentbit.io/manual/pipeline/outputs/firehose"
    time="2022-05-12T09:28:48Z" level=info msg="[firehose 2] plugin parameter delivery_stream = 'xxx'"
    time="2022-05-12T09:28:48Z" level=info msg="[firehose 2] plugin parameter region = 'eu-west-1'"
    time="2022-05-12T09:28:48Z" level=info msg="[firehose 2] plugin parameter data_keys = ''"
    time="2022-05-12T09:28:48Z" level=info msg="[firehose 2] plugin parameter role_arn = ''"
    time="2022-05-12T09:28:48Z" level=info msg="[firehose 2] plugin parameter endpoint = ''"
    time="2022-05-12T09:28:48Z" level=info msg="[firehose 2] plugin parameter sts_endpoint = ''"
    time="2022-05-12T09:28:48Z" level=info msg="[firehose 2] plugin parameter time_key = ''"
    time="2022-05-12T09:28:48Z" level=info msg="[firehose 2] plugin parameter time_key_format = ''"
    time="2022-05-12T09:28:48Z" level=info msg="[firehose 2] plugin parameter log_key = ''"
    time="2022-05-12T09:28:48Z" level=info msg="[firehose 2] plugin parameter replace_dots = ''"
    time="2022-05-12T09:28:48Z" level=info msg="[firehose 2] plugin parameter simple_aggregation = 'false'"
    time="2022-05-12T09:28:48Z" level=info msg="[cloudwatch 0] plugin parameter log_group_name = 'xxx'"
    time="2022-05-12T09:28:48Z" level=info msg="[cloudwatch 0] plugin parameter default_log_group_name = 'fluentbit-default'"
    time="2022-05-12T09:28:48Z" level=info msg="[cloudwatch 0] plugin parameter log_stream_prefix = 'web/'"
    time="2022-05-12T09:28:48Z" level=info msg="[cloudwatch 0] plugin parameter log_stream_name = ''"
    time="2022-05-12T09:28:48Z" level=info msg="[cloudwatch 0] plugin parameter default_log_stream_name = '/fluentbit-default'"
    time="2022-05-12T09:28:48Z" level=info msg="[cloudwatch 0] plugin parameter region = 'eu-west-1'"
    time="2022-05-12T09:28:48Z" level=info msg="[cloudwatch 0] plugin parameter log_key = 'log'"
    time="2022-05-12T09:28:48Z" level=info msg="[cloudwatch 0] plugin parameter role_arn = ''"
    time="2022-05-12T09:28:48Z" level=info msg="[cloudwatch 0] plugin parameter auto_create_group = 'true'"
    time="2022-05-12T09:28:48Z" level=info msg="[cloudwatch 0] plugin parameter new_log_group_tags = ''"
    time="2022-05-12T09:28:48Z" level=info msg="[cloudwatch 0] plugin parameter log_retention_days = '0'"
    time="2022-05-12T09:28:48Z" level=info msg="[cloudwatch 0] plugin parameter endpoint = ''"
    time="2022-05-12T09:28:48Z" level=info msg="[cloudwatch 0] plugin parameter sts_endpoint = ''"
    time="2022-05-12T09:28:48Z" level=info msg="[cloudwatch 0] plugin parameter external_id = ''"
    time="2022-05-12T09:28:48Z" level=info msg="[cloudwatch 0] plugin parameter credentials_endpoint = "
    time="2022-05-12T09:28:48Z" level=info msg="[cloudwatch 0] plugin parameter log_format = ''"
    [2022/05/12 09:28:48] [ info] [sp] stream processor started
    [2022/05/12 09:28:48] [ info] [sp] registered task: logs
    [2022/05/12 09:28:48] [ info] [sp] registered task: metrics
    [2022/05/12 09:28:48] [ info] [sp] registered task: io
    [2022/05/12 09:28:48] [ info] [output:null:null.3] worker #0 started
    time="2022-05-12T09:28:54Z" level=info msg="[cloudwatch 0] Log group xxx already exists\n"
    time="2022-05-12T09:28:54Z" level=info msg="[cloudwatch 0] Created log stream xxx in group xxx"
    [2022/05/12 09:41:19] [engine] caught signal (SIGTERM)
    [2022/05/12 09:41:19] [ info] [input] pausing forward.0
    [2022/05/12 09:41:19] [ info] [input] pausing forward.1
    [2022/05/12 09:41:19] [ info] [input] pausing logs
    [2022/05/12 09:41:19] [ info] [input] pausing metrics
    [2022/05/12 09:41:19] [ info] [input] pausing io
    [2022/05/12 09:41:19] [ warn] [engine] service will shutdown in max 30 seconds
    [2022/05/12 09:41:20] [ info] [engine] service has stopped (0 pending tasks)
    [2022/05/12 09:41:20] [engine] caught signal (SIGSEGV)

Going back to v2.23.3 packaged with Fluent Bit v1.8.15

@PettitWesley
Copy link
Contributor

@LucasHantz Can you please attach your task def and any custom Fluent Bit config?

Have you seen this crash consistently? Does it only happen after Fluent Bit receives a SIGTERM?

Also:

we regularly have our socket cut "End-of-file reached, probably we got disconnected (sent 0 of 294)"

I don't see this in the log output you shared, can you share full log output with this?

@LucasHantz
Copy link
Author

Hello @PettitWesley
The issue with the Socket being cut happens even more frequently than the SIGSEGV which seems to occur.
Find attached the configuration: firelens.zip

The log is from our internal application which is forwarding the logs to tcp://127.0.0.1:5170
It's a PHP application using the logging library: https://github.com/Seldaek/monolog/blob/main/src/Monolog/Handler/SocketHandler.php

@PettitWesley
Copy link
Contributor

@LucasHantz Interesting. Recently I worked on testing similar reports of failures with the TCP input- I forgot to post the results of my investigation publicly, I will do that ASAP. What is your approximate throughput emitted to the TCP input?

My tests were with the log4j tcp appender: https://github.com/aws/aws-for-fluent-bit/tree/mainline/troubleshooting/tools/log4j-tcp-app

My only finding was that adding workers to ALL outputs significantly improves the throughput that the TCP input can accept. This is likely because without workers, the entirety of Fluent Bit is one thread, with both inputs and outputs fighting for time: https://github.com/fluent/fluent-bit/blob/master/DEVELOPER_GUIDE.md#concurrency

However, once you add workers, the outputs get their own separate threads, freeing up the main thread to focus exclusively on inputs. I still think Fluent Bit needs workers for inputs to truly scale, but I didn't see workers for your outputs in your config, so please try adding that.

@LucasHantz
Copy link
Author

Is there some way to get that stat of throughput from FluentBit? Quite difficult to get from the application level as the PHP-FPM forks multiples process which can handle traffic and each one has a socket open with the same TCP input.

All right I'll try it with a worker on each output, and see how it goes but does that explain the SIGSEGV as well?
Also, you see any particular reason this would happen only on the new version? (as this is not occurring currently with the v2.23.3 with Fluent Bit v1.8.15) with the same traffic.

@PettitWesley
Copy link
Contributor

does that explain the SIGSEGV as well?

The SIGSEGV must be a different issue. Which I will try my best to repro.

Also, you see any particular reason this would happen only on the new version? (as this is not occurring currently with the v2.23.3 with Fluent Bit v1.8.15) with the same traffic.

To be clear here, are you talking about the SIGSEGV issue or the TCP input issue? I think there are two separate issues here.

@PettitWesley
Copy link
Contributor

@LucasHantz Also wait... I realized from your config you are using the older lower performance go plugin, you need to migrate in addition to enabling workers: https://github.com/aws/amazon-kinesis-firehose-for-fluent-bit#new-higher-performance-core-fluent-bit-plugin

Also, actually now that I look at your logs again I am confused:

time="2022-05-12T09:28:54Z" level=info msg="[cloudwatch 0] Created log stream xxx in group xxx"
[2022/05/12 09:41:19] [engine] caught signal (SIGTERM)
[2022/05/12 09:41:19] [ info] [input] pausing forward.0
[2022/05/12 09:41:19] [ info] [input] pausing forward.1
[2022/05/12 09:41:19] [ info] [input] pausing logs
[2022/05/12 09:41:19] [ info] [input] pausing metrics
[2022/05/12 09:41:19] [ info] [input] pausing io

I see here you are also using the low performance CW plugin, that needs to be migrated as well: https://github.com/aws/amazon-cloudwatch-logs-for-fluent-bit#new-higher-performance-core-fluent-bit-plugin

Also I am confused here because the logs don't match up with the config you shared- I don't see the CW plugin used in your config, and I don't see inputs with aliases logs, metrics or io. Are the logs and config from the same test run?

@PettitWesley
Copy link
Contributor

Is there some way to get that stat of throughput from FluentBit?

You can try the monitoring interface and divide by uptime: https://docs.fluentbit.io/manual/administration/monitoring#health-check-for-fluent-bit

Or see: https://github.com/aws-samples/amazon-ecs-firelens-examples/tree/mainline/examples/fluent-bit/send-fb-internal-metrics-to-cw

@PettitWesley
Copy link
Contributor

@LucasHantz Please see the results of my TCP input testing here: #294

@mfn
Copy link

mfn commented Jun 15, 2022

FTR, today I upgraded to 2.26.0 on 2 clusters:

  • on one it started to consistently crash with [engine] caught signal (SIGSEGV)
  • on the other: no issues

"Luckily" it crashed only on the staging system 😅

Going back to 2.25.1 worked for me. I realize we're talking about different versions here, as in my case I suspect it might be related to fluent-bit 1.9.4 (=2.26.0) vs. fluent-bit 1.9.3 (=2.25.1)

@tatsuo48
Copy link

tatsuo48 commented Sep 9, 2022

SIGSEGV is also generated after receiving SIGTERM in v2.24.0 as well.
Fixed by downgrading to v2.23.4.

@PettitWesley
Copy link
Contributor

We believe this crash report is likely the same as is described here: fluent/fluent-bit#5753 (comment)

The fix will be released in 2.28.1 #418

@maxflowers89
Copy link

Hi all, we are facing the same error message

[engine] caught signal (SIGSEGV)

which corresponds to FluentBit pods' restarts and it's kind of annoying. We are using the the aws-for-fluent-bit:2.32.0 image that packages the version 1.9.10 of FluentBit. Is there any way to get this issue fixed without any downgrade of the lib?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

5 participants