Consider using fluent-bit instead of fluentd #233

robertodauria · 2019-07-22T07:28:50Z

Fluentd is a big, memory-hungry daemon that does much more than we currently need.

Fluent-bit, by the same company, is a much smaller version of it, written in C, that does just log ingestion/parsing/forwarding to Stackdriver and has native Prometheus metrics. However, while testing it, I discovered that none of the latest versions provides everything we need out of the box. In particular:

The HTTP server seems to be broken in the 1.2.* branch - Prometheus counters for input plugins are always zero and the uptime endpoint returns an empty response
In previous versions, the kubernetes input filter floods the logs with an error about being unable to merge JSON messages. It does not seem to actually read and forward anything to Stackdriver other than its own error logs

I didn't test even older versions, but I don't feel too comfortable using a product that's not very mature yet. However, development is very active and it's likely fluent-bit will become a more viable solution in the coming months. This issue is to remind myself to revisit it in, say, three months from now.


nkinkade · 2019-10-31T20:39:47Z

@robertodauria: It appears that fluent-bit v1.3.2 was released just a few weeks ago. Is it possible that this release (or anything in v1.3.x) resolves the issues you encountered with v1.2.x?

robertodauria · 2019-12-02T16:02:50Z

I've checked the latest release (v1.3.3) and the previous one, v1.3.2. Both of them segfault with a minimal configuration that didn't cause any issue on the older v1.2.x. Also, there are several open issues related to features we will certainly use:

in_tail plugin randomly fails with "too many open files" and errno=24 - unless switched from inotify to stat tail fluent/fluent-bit#1777
[backpressure setup issue] Fluent Bit is OOM-Killed (8 GB Mem usage) during load test fluent/fluent-bit#1768
Polling of prometheus metrics randomly causes SIGSEGV fluent/fluent-bit#1755

I think we should wait another 5-6 months.

binarylogic · 2019-12-03T00:41:31Z

Hey guys, I came across this issue. I thought I'd humbly recommend Vector as an option as well. I think you'll find it to be superior to both fluentd and fluentbit. We have extensive experience with the fluent* projects at Timber and found them to be rather unreliable in many ways. You can see our test harness here which runs a variety of performance and correctness tests across these tools.

Happy to answer any questions as well!

robertodauria · 2019-12-03T16:21:52Z

@binarylogic thank you for mentioning Vector as an option!

Sadly, our main reason to stick to the Fluent family is that we need support for exporting logs to Google Stackdriver. Is there any plan to add a Stackdriver sink to Vector in the future?

binarylogic · 2019-12-03T16:36:20Z

Hi @robertodauria, no problem. And yes, we're actually working through a GCP milestone now; vectordotdev/vector#572 is the specific issue for Stackdriver. We should have a pull request up next week if that works for your timeline.

robertodauria · 2019-12-04T14:33:15Z

@binarylogic That sounds great, and I'd be happy to give Vector a try. It's not too urgent for us right now as the setup we have (with fluentd) works - although it's a bit too memory-hungry - so it might take a while before we get to test it.

edsiper · 2019-12-19T16:49:16Z

Hi @robertodauria

This is Eduardo, core maintainer of Fluent Bit. While looking around for topics about Fluent Bit, I came across this ticket.

Since you are planning to migrate to Fluent Bit and you are finding some issues, I would like to offer some comments:

1 fluent/fluent-bit#1777 : this is mostly a setup issue. Our primary interface for monitoring file modifications is inotify(7). The good thing is that it is highly performant, but the downside is that it requires an extra file descriptor per monitored file. The workaround at the moment is to increase the kernel limits for watched files here:

/proc/sys/fs/inotify/max_user_watches
/proc/sys/fs/inotify/max_user_instances

For environments where this modification is not an option, we will offer the backend based on stat(2). The good thing is that it doesn't require an extra file descriptor per file as inotify(7) does; the downside is that it is more expensive, since it involves a costlier system call that is made multiple times.

Both mechanisms already exist in Fluent Bit. The thing is that at build time you have to choose one or the other; the improvement will be to let the user decide which one to use.

If the workaround above is not suitable, let me know, since we are prioritizing this anyway.
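
Before raising the two limits mentioned above, it can help to see what they are currently set to. The following is a minimal Python sketch, not part of Fluent Bit: the function name `read_inotify_limits` and its `base` parameter are made up for illustration, and it assumes a Linux host where the /proc files listed above are readable.

```python
from pathlib import Path

# Directory holding the inotify limits on Linux (same paths as quoted above).
INOTIFY_DIR = Path("/proc/sys/fs/inotify")


def read_inotify_limits(base=INOTIFY_DIR):
    """Return the inotify limits as a dict of name -> int (None if unreadable).

    `base` is a hypothetical parameter so the function can be pointed at
    another directory, e.g. in a test.
    """
    limits = {}
    for name in ("max_user_watches", "max_user_instances"):
        path = Path(base) / name
        try:
            limits[name] = int(path.read_text().strip())
        except (OSError, ValueError):
            # File missing, unreadable, or not a number (e.g. non-Linux host).
            limits[name] = None
    return limits


if __name__ == "__main__":
    for name, value in read_inotify_limits().items():
        print(f"{name}: {value}")
```

Raising a limit is typically done as root with something like `sysctl -w fs.inotify.max_user_watches=<value>`; a suitable value depends on how many files are being tailed.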

2 fluent/fluent-bit#1768 : setup issue. In high-load environments, the ability to deliver records or events to the destination databases or cloud providers is likely slower than the rate of data ingestion, hence your system faces backpressure. This is a common issue that is easily solved by configuring the input plugins with a memory limit. You can read more about this here:

https://docs.fluentbit.io/manual/configuration/backpressure

Note: when your input is based on a network service like tcp, syslog, mqtt or other, this memory limit plus backpressure might lead to discarding incoming records to survive. The workaround is to enable the filesystem buffering mechanism so you don't lose data and can continue processing and delivering records:

https://docs.fluentbit.io/manual/configuration/buffering
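
To make the memory-limit idea concrete, here is a toy Python model of an input that stops accepting records once its buffered bytes reach a limit and resumes after the output side flushes. It is only a sketch of the concept under simplified assumptions, not Fluent Bit's implementation; the class `BoundedInput` and all of its members are hypothetical names.

```python
from collections import deque


class BoundedInput:
    """Toy model of an input plugin with a memory limit (hypothetical names)."""

    def __init__(self, mem_limit_bytes):
        self.mem_limit_bytes = mem_limit_bytes
        self.buffer = deque()      # records waiting to be delivered
        self.buffered_bytes = 0    # total size of the waiting records
        self.refused = 0           # records turned away while paused

    @property
    def paused(self):
        # The input is paused once the buffered data reaches the limit.
        return self.buffered_bytes >= self.mem_limit_bytes

    def ingest(self, record):
        """Accept a record (bytes) unless paused; return True if accepted."""
        if self.paused:
            self.refused += 1
            return False
        self.buffer.append(record)
        self.buffered_bytes += len(record)
        return True

    def flush(self, max_records):
        """Deliver up to max_records, freeing memory so ingestion can resume."""
        delivered = []
        while self.buffer and len(delivered) < max_records:
            record = self.buffer.popleft()
            self.buffered_bytes -= len(record)
            delivered.append(record)
        return delivered
```

A refused record is harmless for a file-based input, since the data is still on disk and can be read later. For a network input it is gone unless something like filesystem buffering holds on to it.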

3 fluent/fluent-bit#1755 : bug already fixed in the previous version Fluent Bit v1.3.4.

We are now at Fluent Bit v1.3.5, and I would encourage you to give it a try. If you face any issue, let me know; we can follow up on our GitHub repo or through a call, as we do with most users.

Fluent Bit is deployed a few million times every month and several companies contribute to it; if you have any questions about its adoption and enterprise-grade usage, I am happy to discuss them :)

best,

nkinkade · 2020-07-09T22:19:16Z

@robertodauria: I think it may be time to revisit this issue. Fluent Bit is now at v1.4.5, and Vector from timber.io now seems to support exporting to Stackdriver. Either one could, at this point, be a viable option for us. What do you think?

Hoverbear · 2020-07-09T23:31:55Z

Shiny new feature! Yup, check out our stackdriver docs: https://vector.dev/docs/reference/sinks/gcp_stackdriver_logs/

Here's a sample of all the knobs:

[sinks.my_sink_id]
  # General
  type = "gcp_stackdriver_logs" # required
  inputs = ["my-source-id"] # required
  billing_account_id = "012345-6789AB-CDEF01" # optional, no default
  credentials_path = "/path/to/credentials.json" # optional, no default
  folder_id = "My Folder" # optional, no default
  healthcheck = true # optional, default
  log_id = "vector-logs" # required
  organization_id = "622418129737" # optional, no default
  project_id = "vector-123456" # required

  # Batch
  batch.max_size = 5242880 # optional, default, bytes
  batch.timeout_secs = 1 # optional, default, seconds

  # Buffer
  buffer.type = "memory" # optional, default
  buffer.max_events = 500 # optional, default, events, relevant when type = "memory"
  buffer.max_size = 104900000 # bytes, required when type = "disk"
  buffer.when_full = "block" # optional, default

  # Encoding
  encoding.except_fields = ["timestamp", "message", "host"] # optional, no default
  encoding.only_fields = ["timestamp", "message", "host"] # optional, no default
  encoding.timestamp_format = "rfc3339" # optional, default

  # Request
  request.in_flight_limit = 5 # optional, default, requests
  request.rate_limit_duration_secs = 1 # optional, default, seconds
  request.rate_limit_num = 1000 # optional, default
  request.retry_attempts = -1 # optional, default
  request.retry_initial_backoff_secs = 1 # optional, default, seconds
  request.retry_max_duration_secs = 10 # optional, default, seconds
  request.timeout_secs = 60 # optional, default, seconds

  # Resource
  resource.type = "global" # required
  resource.projectId = "vector-123456" # example
  resource.zone = "Twilight" # example

  # TLS
  tls.ca_path = "/path/to/certificate_authority.crt" # optional, no default
  tls.crt_path = "/path/to/host_certificate.crt" # optional, no default
  tls.key_pass = "${KEY_PASS_ENV_VAR}" # optional, no default
  tls.key_path = "/path/to/host_certificate.key" # optional, no default
  tls.verify_certificate = true # optional, default
  tls.verify_hostname = true # optional, default

You may also be interested in our upcoming Kubernetes Integration!

Let me know if I can help anyone with a test setup. :) We'd be happy to set up a chat etc.

nkinkade · 2020-08-28T17:39:27Z

We replaced Fluentd with Vector. Closing.

robertodauria self-assigned this Jul 22, 2019

autolabel bot added the review/triage label Jul 22, 2019

pboothe added backlog Q4 P2 and removed review/triage labels Jul 22, 2019

robertodauria added the current label Nov 12, 2019

autolabel bot added 2019 Week 46 and removed backlog labels Nov 12, 2019

robertodauria added backlog Story and removed 2019 Week 46 labels Nov 12, 2019

autolabel bot removed the current label Nov 12, 2019

robertodauria added the current label Nov 12, 2019

autolabel bot added 2019 Week 46 and removed backlog labels Nov 12, 2019

robertodauria added the DevOps label Nov 13, 2019

robertodauria added backlog and removed current 2019 Week 46 labels Dec 2, 2019

nkinkade added review/triage and removed P2 backlog labels Jul 9, 2020

nkinkade mentioned this issue Jul 9, 2020

Update fluentd container and enable systemd logging #177

Closed

robertodauria added the current label Jul 20, 2020

autolabel bot added 2020 Week 30 and removed review/triage labels Jul 20, 2020

robertodauria removed 2020 Week 30 labels Jul 20, 2020

nkinkade closed this as completed Aug 28, 2020
