Consider using fluent-bit instead of fluentd #233

Closed
robertodauria opened this issue Jul 22, 2019 · 10 comments
Comments

@robertodauria
Contributor

Fluentd is a big and memory-hungry daemon that does much more than we currently need.

Fluent-bit, by the same company, is a much smaller version of it, written in C, that does just log ingestion/parsing/forwarding to Stackdriver and has native Prometheus metrics. However, while testing it, I discovered that none of the latest versions does everything we need out of the box. In particular:

  • The HTTP server seems to be broken in the 1.2.* branch: Prometheus counters for input plugins are always zero and the uptime endpoint returns an empty response
  • In previous versions, the kubernetes input filter floods the logs with an error about being unable to merge JSON messages. It does not seem to actually read and forward anything to Stackdriver other than its own error logs
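
One way to check the first symptom without eyeballing the output is to parse the Prometheus text exposition and test whether every input counter is stuck at zero. This is only a minimal sketch: the sample text below is hypothetical (it is not captured from a real run), the `fluentbit_input_` prefix is an assumption about how the input metrics are named, and in practice the text would come from the daemon's HTTP metrics endpoint.

```python
# Hypothetical sample of Prometheus text output; not taken from a real run.
SAMPLE = """\
# TYPE fluentbit_input_records_total counter
fluentbit_input_records_total{name="tail.0"} 0 1563800000000
fluentbit_input_bytes_total{name="tail.0"} 0 1563800000000
fluentbit_output_proc_records_total{name="stackdriver.0"} 12 1563800000000
"""


def input_counters(metrics_text, prefix="fluentbit_input_"):
    """Return {series: value} for every sample whose name starts with prefix."""
    counters = {}
    for line in metrics_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        # Text format: <name>{<labels>} <value> [<timestamp>]
        if "}" in line:
            series, _, rest = line.rpartition("}")
            series += "}"
        else:
            series, _, rest = line.partition(" ")
        fields = rest.split()
        if fields and series.startswith(prefix):
            counters[series] = float(fields[0])
    return counters


def inputs_stuck_at_zero(metrics_text):
    """True when at least one input counter exists and all of them are zero."""
    counters = input_counters(metrics_text)
    return bool(counters) and all(value == 0 for value in counters.values())
```

Calling `inputs_stuck_at_zero(SAMPLE)` flags the sample above, because both input series are zero while the output series is not.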

I didn't test even older versions, but I don't feel too comfortable using a product that's not very mature yet. However, development is very active and it's likely fluent-bit will become a more viable solution in the coming months. This issue is to remind myself to revisit it in, say, three months from now.

@nkinkade
Contributor

@robertodauria: It appears that fluent-bit v1.3.2 was released just a few weeks ago. Is it possible that this release (or anything in v1.3.x) resolves the issues you encountered with v1.2.x?

@robertodauria
Contributor Author

I've checked the latest release (v1.3.3) and the previous one, v1.3.2. Both of them segfault with a minimal configuration that didn't cause any issue on the older v1.2.x. Also, there are several open issues related to features we will certainly use:

I think we should wait another 5-6 months.

@binarylogic

Hey guys, I came across this issue and thought I'd humbly recommend Vector as an option as well. I think you'll find it to be superior to both fluentd and fluent-bit. We have extensive experience with the fluent* projects at Timber and found them to be rather unreliable in many ways. You can see our test harness here which runs a variety of performance and correctness tests across these tools.

Happy to answer any questions as well!

@robertodauria
Contributor Author

@binarylogic thank you for mentioning Vector as an option!

Sadly, our main reason to stick to the Fluent family is that we need support for exporting logs to Google Stackdriver. Is there any plan to add a Stackdriver sink to Vector in the future?

@binarylogic

Hi @robertodauria, no problem. And yes, we're actually working through a GCP milestone now; vectordotdev/vector#572 is the specific issue for Stackdriver. We should have a pull request up next week if that works for your timeline.

@robertodauria
Contributor Author

@binarylogic That sounds great, and I'd be happy to give Vector a try. It's not too urgent for us right now as the setup we have (with fluentd) works - although it's a bit too memory-hungry - so it might take a while before we get to test it.

@edsiper

edsiper commented Dec 19, 2019

Hi @robertodauria

This is Eduardo, core maintainer of Fluent Bit. While looking around for topics about Fluent Bit, I came across this ticket.

Since you are planning to migrate to Fluent Bit and you are running into some issues, I would like to offer some comments:

1 fluent/fluent-bit#1777 : this is mostly a setup issue. Our primary interface for monitoring file modifications is inotify(7). The good thing is that it is highly performant; the downside is that it requires an extra file descriptor per monitored file. The workaround at the moment is to increase the kernel limits for watched files here:

/proc/sys/fs/inotify/max_user_watches
/proc/sys/fs/inotify/max_user_instances

For environments where this modification is not an option, we will offer a backend based on stat(2). The good thing is that it doesn't require an extra file descriptor as inotify(7) does; the downside is that it is more expensive, since it involves a costlier system call that is called multiple times.

Both mechanisms already exist in Fluent Bit; the catch is that at build time you have to choose one or the other. The improvement will be to let the user decide which one to use.

If the workaround above is not suitable, let me know, since we are prioritizing this anyway.
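
To make the trade-off above concrete, here is a minimal Python sketch (not Fluent Bit's actual code) that reads the two inotify limits from the `/proc` paths mentioned above and implements a naive stat(2)-based change detector. The function names `read_inotify_limits`, `snapshot` and `changed` are hypothetical, chosen only for this illustration.

```python
import os

# The two kernel limits named in the comment above (Linux only).
INOTIFY_LIMITS = (
    "/proc/sys/fs/inotify/max_user_watches",
    "/proc/sys/fs/inotify/max_user_instances",
)


def read_inotify_limits():
    """Return {path: value} for each limit that is readable on this host."""
    limits = {}
    for path in INOTIFY_LIMITS:
        try:
            with open(path) as fh:
                limits[path] = int(fh.read().strip())
        except (OSError, ValueError):
            pass  # not Linux, or the file is unreadable
    return limits


def snapshot(paths):
    """stat(2) every path once: one system call per file, per poll."""
    state = {}
    for path in paths:
        try:
            st = os.stat(path)
        except FileNotFoundError:
            continue  # file vanished between listing and stat
        state[path] = (st.st_ino, st.st_size, st.st_mtime_ns)
    return state


def changed(before, after):
    """Paths that appeared, disappeared, or whose inode/size/mtime differ."""
    return sorted(
        path
        for path in before.keys() | after.keys()
        if before.get(path) != after.get(path)
    )
```

A poller would call `snapshot` on a timer and compare consecutive results with `changed`. That needs no per-file watch, but it costs one `stat` call per file on every tick, which is the extra expense described above.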

2 fluent/fluent-bit#1768 : setup issue. In high-load environments, the ability to deliver records or events to the destination databases or cloud providers is likely slower than the rate of data ingestion, hence your system faces backpressure. This is a common issue that is easily solved by configuring the input plugins with a memory limit. You can read more about this here:

Note: when your input is based on a network service like tcp, syslog, mqtt or other, this memory limit plus backpressure might lead to discarding incoming records in order to survive. The workaround is to enable the file system buffering mechanism so you don't lose data and can continue processing and delivering records:
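
The behaviour described in this item can be modelled with a toy buffer. The sketch below is an illustration only and does not reflect how Fluent Bit implements it: the class `BoundedBuffer` and its parameters are hypothetical. With no spill directory, records that exceed the memory limit are dropped; with one, they are written to disk and delivered later in arrival order.

```python
import os
from collections import deque


class BoundedBuffer:
    """Toy input buffer with a memory limit and optional disk spill."""

    def __init__(self, mem_limit_bytes, spill_dir=None):
        self.mem_limit_bytes = mem_limit_bytes
        self.spill_dir = spill_dir  # None means memory-only buffering
        self.queue = deque()        # records currently held in memory
        self.mem_used = 0
        self.spilled = deque()      # paths of records written to disk
        self.dropped = 0
        self._seq = 0

    def offer(self, record):
        """Ingest one bytes record; return 'memory', 'disk' or 'dropped'."""
        fits = self.mem_used + len(record) <= self.mem_limit_bytes
        # Once anything is on disk, keep spilling so arrival order is kept.
        if fits and not self.spilled:
            self.queue.append(record)
            self.mem_used += len(record)
            return "memory"
        if self.spill_dir is not None:
            path = os.path.join(self.spill_dir, f"chunk-{self._seq:08d}")
            self._seq += 1
            with open(path, "wb") as fh:
                fh.write(record)
            self.spilled.append(path)
            return "disk"
        self.dropped += 1
        return "dropped"

    def drain(self, max_records):
        """Deliver up to max_records, oldest first (memory, then disk)."""
        out = []
        while len(out) < max_records and (self.queue or self.spilled):
            if self.queue:
                record = self.queue.popleft()
                self.mem_used -= len(record)
            else:
                path = self.spilled.popleft()
                with open(path, "rb") as fh:
                    record = fh.read()
                os.remove(path)
            out.append(record)
        return out
```

A file-based input could instead stop reading when `offer` would overflow and resume later, since the data stays in the file. A network input has no such option, which is why the memory-only variant ends up discarding records.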

3 fluent/fluent-bit#1755 : bug already fixed in the previous version Fluent Bit v1.3.4.

We are now at Fluent Bit v1.3.5, and I would encourage you to give it a try. If you face any issue, let me know; we can follow up on our GitHub repo or through a call, as we do with most users.

Fluent Bit is deployed a few million times every month and several companies contribute to it; if you have any questions about its adoption and enterprise-grade usage, I am happy to discuss them :)

best,

@nkinkade
Contributor

nkinkade commented Jul 9, 2020

@robertodauria: I think it may be time to revisit this issue. Fluent Bit is now at v1.4.5, and Vector from timber.io now seems to support exporting to Stackdriver. Either one could, at this point, be a viable option for us. What do you think?

@Hoverbear

Hoverbear commented Jul 9, 2020

Shiny new feature! Yup, check out our stackdriver docs: https://vector.dev/docs/reference/sinks/gcp_stackdriver_logs/

Here's a sample of all the knobs:

[sinks.my_sink_id]
  # General
  type = "gcp_stackdriver_logs" # required
  inputs = ["my-source-id"] # required
  billing_account_id = "012345-6789AB-CDEF01" # optional, no default
  credentials_path = "/path/to/credentials.json" # optional, no default
  folder_id = "My Folder" # optional, no default
  healthcheck = true # optional, default
  log_id = "vector-logs" # required
  organization_id = "622418129737" # optional, no default
  project_id = "vector-123456" # required

  # Batch
  batch.max_size = 5242880 # optional, default, bytes
  batch.timeout_secs = 1 # optional, default, seconds

  # Buffer
  buffer.type = "memory" # optional, default
  buffer.max_events = 500 # optional, default, events, relevant when type = "memory"
  buffer.max_size = 104900000 # required, bytes, required when type = "disk"
  buffer.when_full = "block" # optional, default

  # Encoding
  encoding.except_fields = ["timestamp", "message", "host"] # optional, no default
  encoding.only_fields = ["timestamp", "message", "host"] # optional, no default
  encoding.timestamp_format = "rfc3339" # optional, default

  # Request
  request.in_flight_limit = 5 # optional, default, requests
  request.rate_limit_duration_secs = 1 # optional, default, seconds
  request.rate_limit_num = 1000 # optional, default
  request.retry_attempts = -1 # optional, default
  request.retry_initial_backoff_secs = 1 # optional, default, seconds
  request.retry_max_duration_secs = 10 # optional, default, seconds
  request.timeout_secs = 60 # optional, default, seconds

  # Resource
  resource.type = "global" # required
  resource.projectId = "vector-123456" # example
  resource.zone = "Twilight" # example

  # TLS
  tls.ca_path = "/path/to/certificate_authority.crt" # optional, no default
  tls.crt_path = "/path/to/host_certificate.crt" # optional, no default
  tls.key_pass = "${KEY_PASS_ENV_VAR}" # optional, no default
  tls.key_path = "/path/to/host_certificate.key" # optional, no default
  tls.verify_certificate = true # optional, default
  tls.verify_hostname = true # optional, default

You may also be interested in our upcoming Kubernetes Integration!

Let me know if I can help anyone with a test setup. :) We'd be happy to set up a chat etc.

@nkinkade
Contributor

We replaced Fluentd with Vector. Closing.


6 participants