DNS query cancelled error on v1.8.3 when sending to http-intake.logs.datadoghq.eu (works in v1.8.2) #3944

Closed
ThomasHenckel opened this issue Aug 12, 2021 · 16 comments

Comments

@ThomasHenckel

Bug Report

Describe the bug
After upgrading to v1.8.3 td-agent-bit does not work when sending logs to http-intake.logs.datadoghq.eu

To Reproduce

  • Example log message if applicable:
[2021/08/12 12:08:16] [ info] [engine] started (pid=662)
[2021/08/12 12:08:16] [debug] [engine] coroutine stack size: 24576 bytes (24.0K)
[2021/08/12 12:08:16] [debug] [storage] [cio stream] new stream registered: tail.0
[2021/08/12 12:08:16] [ info] [storage] version=1.1.1, initializing...
[2021/08/12 12:08:16] [ info] [storage] in-memory
[2021/08/12 12:08:16] [ info] [storage] normal synchronization mode, checksum disabled, max_chunks_up=128
[2021/08/12 12:08:16] [ info] [cmetrics] version=0.1.6
[2021/08/12 12:08:16] [debug] [input:tail:tail.0] flb_tail_fs_inotify_init() initializing inotify tail input
[2021/08/12 12:08:16] [debug] [input:tail:tail.0] inotify watch fd=24
[2021/08/12 12:08:16] [debug] [input:tail:tail.0] scanning path /f/src/json_logs/*.log
[2021/08/12 12:08:16] [debug] [input:tail:tail.0] inode=3280056 with offset=3050 appended as /f/src/json_logs/ml-training.log
[2021/08/12 12:08:16] [debug] [input:tail:tail.0] scan_glob add(): /fx.log, inode 3280056
[2021/08/12 12:08:16] [debug] [input:tail:tail.0] 1 new files found on path '/f/*.log'
[2021/08/12 12:08:16] [debug] [datadog:datadog.0] created event channels: read=26 write=27
[2021/08/12 12:08:18] [debug] [output:datadog:datadog.0] scheme: https://
[2021/08/12 12:08:18] [debug] [output:datadog:datadog.0] api_key: xxxxxx
[2021/08/12 12:08:18] [debug] [output:datadog:datadog.0] uri: /v1/input/xxxxx
[2021/08/12 12:08:18] [debug] [output:datadog:datadog.0] host: http-intake.logs.datadoghq.eu
[2021/08/12 12:08:18] [debug] [output:datadog:datadog.0] port: 443
[2021/08/12 12:08:18] [debug] [output:datadog:datadog.0] json_date_key: timestamp
[2021/08/12 12:08:18] [debug] [output:datadog:datadog.0] compress_gzip: 1
[2021/08/12 12:08:18] [debug] [router] match rule tail.0:datadog.0
[2021/08/12 12:08:18] [ info] [sp] stream processor started
[2021/08/12 12:08:18] [debug] [input:tail:tail.0] inode=3280056 file=/f/x.log promote to TAIL_EVENT
[2021/08/12 12:08:18] [ info] [input:tail:tail.0] inotify_fs_add(): inode=3280056 watch_fd=1 name=/f/x.log
[2021/08/12 12:08:53] [debug] [input:tail:tail.0] inode=3280056 events: IN_MODIFY 
[2021/08/12 12:08:57] [debug] [task] created task=0x7f9de9637a00 id=0 OK
[2021/08/12 12:08:57] [ warn] [net] getaddrinfo(host='http-intake.logs.datadoghq.eu', err=24): DNS query cancelled
[2021/08/12 12:08:57] [debug] [upstream] connection #-1 failed to http-intake.logs.datadoghq.eu:443
[2021/08/12 12:08:57] [debug] [out coro] cb_destroy coro_id=0
[2021/08/12 12:08:57] [debug] [retry] new retry created for task_id=0 attempts=1
[2021/08/12 12:08:57] [ warn] [engine] failed to flush chunk '662xxxx.flb', retry in 10 seconds: task_id=0, input=tail.0 > output=datadog.0 (out_id=0)
[2021/08/12 12:09:07] [ warn] [net] getaddrinfo(host='http-intake.logs.datadoghq.eu', err=24): DNS query cancelled
[2021/08/12 12:09:07] [debug] [upstream] connection #-1 failed to http-intake.logs.datadoghq.eu:443
[2021/08/12 12:09:07] [debug] [out coro] cb_destroy coro_id=1
[2021/08/12 12:09:07] [debug] [task] task_id=0 reached retry-attempts limit 1/1
[2021/08/12 12:09:07] [ warn] [engine] chunk 'xxx.flb' cannot be retried: task_id=0, input=tail.0 > output=datadog.0
[2021/08/12 12:09:07] [debug] [task] destroy task=0x7f9de9637a00 (task_id=0)
  • Steps to reproduce the problem:
    RUN apt-get -y install td-agent-bit
    /opt/td-agent-bit/bin/td-agent-bit -c /etc/td-agent-bit/td-agent-bit.conf --log_file=/tmp/fluentbit.log
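
To gauge how often the resolver warning in the log excerpt above occurs, the log file can be scanned for it. The following is a minimal sketch in Python, assuming the warning keeps exactly the format shown in the excerpt; the function name `count_dns_failures` is hypothetical and not part of Fluent Bit.

```python
import re
from collections import Counter

# Matches the warning format seen in the log excerpt, e.g.
# [ warn] [net] getaddrinfo(host='example.test', err=24): DNS query cancelled
DNS_WARN = re.compile(
    r"\[\s*warn\] \[net\] getaddrinfo\(host='(?P<host>[^']+)', "
    r"err=(?P<err>\d+)\): (?P<msg>.*)"
)


def count_dns_failures(lines):
    """Count getaddrinfo warnings, keyed by (host, error code, message)."""
    counts = Counter()
    for line in lines:
        match = DNS_WARN.search(line)
        if match:
            key = (match["host"], int(match["err"]), match["msg"].strip())
            counts[key] += 1
    return counts
```

It accepts any iterable of lines, so an open file handle for the log written by `--log_file=/tmp/fluentbit.log` can be passed in directly.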

Expected behavior
The logs in the log file should be sent to Datadog.

Your Environment

  • Version used: 1.8.3
  • Environment name and version : Docker version 19.03.13-ce
  • Server type and version: Docker on AWS Linux image
  • Filters and plugins: none

Additional context
This error appeared after I updated to version 1.8.3; if I go back to 1.8.2, it works again.
Might only be related to datadog eu

@agup006
Member

agup006 commented Aug 16, 2021

Hey @ThomasHenckel thanks for filing

Might only be related to datadog eu

What would you suggest as the best way to rule that out? Could you try a static IP instead of the hostname, to see whether a DNS difference between 1.8.2 and 1.8.3 is causing the issue?

@ThomasHenckel
Author

Hi @agup006

I just tried using the IP address as the host name, and it looks like logs are getting through.

I have talked with other devs in my company and it looks like they don't have the problem; I'm the only one using Ubuntu for my Docker images.

I install using:
RUN wget -qO - https://packages.fluentbit.io/fluentbit.key | apt-key add -
RUN echo "deb https://packages.fluentbit.io/ubuntu/bionic bionic main" >> /etc/apt/sources.list
RUN apt-get update
RUN apt-get -y install td-agent-bit

Hope it helps
Best Regards

@carsten-langer

We have the same problem, which started the minute we upgraded from 1.8.2 to 1.8.3: the version change and the beginning of this issue are only a few minutes apart.
We use Ubuntu 18.04 LTS.

@maiconbaumx

Same problem here. =]

@bjtucker

Same problem here: CentOS 7, connecting to Graylog.

@bjtucker

In our case, resolving the failing hostname from the command line worked just fine. As a diagnostic and stopgap fix, we added an entry to the /etc/hosts file mapping that hostname to its proper IP, and that worked.

With the host file entry in place, td-agent-bit starts up fine and does its job.
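
The /etc/hosts stopgap described above can be sketched in Python. This is only an illustration: it asks the operating system resolver through `socket.getaddrinfo`, which, per the report above, keeps working while td-agent-bit fails. The helper names `resolve_ipv4` and `hosts_line` are hypothetical.

```python
import socket


def resolve_ipv4(hostname, port=443):
    """Return the distinct IPv4 addresses the system resolver gives for hostname."""
    infos = socket.getaddrinfo(
        hostname, port, family=socket.AF_INET, type=socket.SOCK_STREAM
    )
    addresses = []
    # Each entry is (family, type, proto, canonname, (address, port)).
    for info in infos:
        address = info[4][0]
        if address not in addresses:
            addresses.append(address)
    return addresses


def hosts_line(ip, hostname):
    """Format one /etc/hosts entry: address, whitespace, host name."""
    return f"{ip}\t{hostname}"
```

For example, `hosts_line(resolve_ipv4("http-intake.logs.datadoghq.eu")[0], "http-intake.logs.datadoghq.eu")` produces a line that could be appended to /etc/hosts. A pinned address goes stale if the service changes its IPs, so this is a stopgap rather than a fix.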

So... what is going wrong with getaddrinfo's DNS lookups? Is something getting negatively cached somewhere it shouldn't? What's changed since v1.8.2?

@bjtucker

I'm not too familiar with this codebase, so forgive me if this is the wrong direction, but I did a quick search for getaddrinfo
https://github.com/fluent/fluent-bit/search?q=getaddrinfo

and found this file, which appears to have been touched recently.
https://github.com/fluent/fluent-bit/blob/master/src/flb_network.c

I do see several recent changes in the file that look like they're related to name resolution, but I'm not sure which ones fall between release 1.8.2 and 1.8.3

Could these be related?

@blc2

blc2 commented Aug 19, 2021

Also seeing this on CentOS 7. We're using the official yum repo, which only seems to have the latest package, so we are unable to roll back to 1.8.2 easily.

@PettitWesley
Contributor

1.8.5 added a new DNS network setting: https://github.com/fluent/fluent-bit/blob/master/src/flb_upstream.c#L38

So in your output you can set:

net.dns.mode UDP

The other valid value is TCP. Does trying that affect this issue?
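
To show where the setting goes, here is a minimal Python sketch that renders a classic-format `[OUTPUT]` section with `net.dns.mode` set. The helper `render_datadog_output` is hypothetical, and the surrounding keys (`Name`, `Match`, `Host`, `TLS`, `apikey`) are assumptions about a typical Datadog output section, not the reporter's actual configuration, which is not shown in this thread.

```python
VALID_DNS_MODES = ("UDP", "TCP")  # the two values named in this comment


def render_datadog_output(host, dns_mode, match="*"):
    """Render an [OUTPUT] section with net.dns.mode; reject unknown modes."""
    mode = dns_mode.upper()
    if mode not in VALID_DNS_MODES:
        raise ValueError(
            f"net.dns.mode must be one of {VALID_DNS_MODES}, got {dns_mode!r}"
        )
    settings = [
        ("Name", "datadog"),
        ("Match", match),
        ("Host", host),
        ("TLS", "on"),
        ("apikey", "<your-api-key>"),  # placeholder, never a real key
        ("net.dns.mode", mode),
    ]
    # Pad keys to a common width so the values line up in one column.
    width = max(len(key) for key, _ in settings)
    body = "\n".join(f"    {key.ljust(width)}  {value}" for key, value in settings)
    return "[OUTPUT]\n" + body + "\n"
```

Calling `print(render_datadog_output("http-intake.logs.datadoghq.eu", "TCP"))` gives a section to paste into the configuration file when comparing the two modes.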

@carsten-langer

I tried with version 1.8.6 and

net.dns.mode UDP

but got the same errors as before.

@PettitWesley
Contributor

@edsiper there are multiple reports from AWS and non-AWS users that DNS resolution is still broken in some cases. IMO, this is a critical issue that deserves to be a top priority.

@farvour

farvour commented Sep 18, 2021

I'm not sure why this stuff is played with to be honest. Not even unique to fluent-bit, everyone seems to like to toy with a working DNS solution and then it completely wrecks the product. I've been having these issues with docker images all the way from 1.8.3 -> 1.8.6. I was previously using v1.7.3 without any problems. I'm running in a fresh EKS cluster and coredns is fine. It's fluent-bit and toying with the settings around DNS resolution.

I agree @PettitWesley, critical stuff like this turns me off of a product's usage. There are plenty of reports all around that 1.8.3+ has basically screwed up DNS in fluent-bit. My favorite part: how the app segfaults after it can't find a valid host. Like really... what?

@edsiper
Member

edsiper commented Sep 18, 2021

FYI: it's not a DNS issue. DNS was affected by the upstream handler closing the TCP connection prematurely, so it looked like a DNS issue, but it is not. A fix is being shipped in today's release.

@farvour

farvour commented Sep 18, 2021

@edsiper thanks for the input. Any idea how long this particular issue has been around? It must have been introduced in a version >1.7.3...
I'd like to know what version of fluent-bit to roll back to, basically, since this affects a critical production workload. If I can safely go back to 1.7.3 I will just do that for now.

@PettitWesley
Contributor

@farvour my testing suggested it works in 1.8.4 or lower: #4050 (comment)

@PettitWesley
Contributor

@farvour/everyone this has been fixed in 1.8.7: https://fluentbit.io/announcements/v1.8.7/

I will close this issue. Please re-open if needed.
