
DNS resolution issues still occur in 2.21.0 #253

Closed
ssplatt opened this issue Oct 6, 2021 · 22 comments

ssplatt commented Oct 6, 2021

Copy of issue from upstream fluent/fluent-bit#4157

Fluent Bit 1.8.7, AWS for Fluent Bit 2.20.0. Running in AWS Fargate using FireLens.

Datadog output does deliver some logs but drops many.

Many messages per minute like:

[2021/10/04 13:56:07] [ warn] [engine] failed to flush chunk '1-1633355763.278469126.flb', retry in 10 seconds: task_id=0, input=forward.1 > output=datadog.1 (out_id=1)

followed by:

[2021/10/04 13:56:17] [ warn] [engine] chunk '1-1633355763.278469126.flb' cannot be retried: task_id=0, input=forward.1 > output=datadog.1

All of our services running Fluent Bit 1.8.7 / aws-for-fluent-bit 2.20.0 are throwing this same error, in multiple environments.

CloudFormation example:

...
          LogConfiguration:
            LogDriver: awsfirelens
            Options:
              Name: datadog
              TLS: 'on'
              compress: gzip
              apikey: '${DD_API_KEY}'
              dd_service: !Ref ServiceName
              dd_source: nginx
              Log_Level: warning
...
        - Name: log_router
          Image: public.ecr.aws/aws-observability/aws-for-fluent-bit:2.20.0
          Essential: true
          MemoryReservation: 50
          Secrets:
            - Name: DD_API_KEY
              ValueFrom: DdApiKey
          FirelensConfiguration:
            Type: fluentbit
            Options:
              enable-ecs-log-metadata: 'true'
          LogConfiguration:
            LogDriver: awslogs
            Options:
              awslogs-region: !Ref AWS::Region
              awslogs-group: !Ref LogGroup
              awslogs-stream-prefix: ecs

I also see a lot of occurrences of:

[2021/10/04 18:31:01] [ warn] [net] getaddrinfo(host='http-intake.logs.datadoghq.com', err=12): Timeout while contacting DNS servers

ssplatt commented Oct 6, 2021

CFN has been updated to use the LogConfiguration SecretOptions but the issue persists.

PettitWesley changed the title from "Datadog output failing to flush chunks" to "DNS resolution issues still occur in 2.20.0" on Oct 6, 2021
@PettitWesley (Contributor)

Saw a report of DNS issues with 2.20.0 and ES and CW outputs: #235 (comment)

ssplatt commented Oct 6, 2021

Memory doesn't appear to be the issue, based on the ecs.fargate.mem.usage graphs in Datadog. I'm using the default suggested 50 as the MemoryReservation and those containers stay around 24 MB used. I'm going to test increasing that value on a few tasks.

ssplatt commented Oct 7, 2021

I pushed a change to all my services to increase their memory, but I haven't seen any change in memory use or in the number of errors produced. So I don't think a memory limit or the OOM killer is the issue.

@bcwilsondotcom

Yeah, the OOM issues are a symptom of not being able to flush logs, not the cause.

ssplatt commented Oct 14, 2021

I deployed 2.21.0 to one of my services and this issue is still present. The S3 memory leak fix is a plus, though, since I have many other services that use the S3 output and have had issues with S3 timeouts.

[2021/10/14 18:30:23] [ warn] [engine] failed to flush chunk '1-1634236220.660823627.flb', retry in 6 seconds: task_id=0, input=forward.1 > output=datadog.2 (out_id=2)
[2021/10/14 18:30:12] [ warn] [engine] chunk '1-1634236202.657640279.flb' cannot be retried: task_id=1, input=forward.1 > output=datadog.2
[2021/10/14 18:30:03] [ warn] [engine] failed to flush chunk '1-1634236202.657640279.flb', retry in 9 seconds: task_id=1, input=forward.1 > output=datadog.2 (out_id=2)
[2021/10/14 18:29:58] [ warn] [engine] failed to flush chunk '1-1634236194.490145304.flb', retry in 7 seconds: task_id=0, input=forward.1 > output=datadog.1 (out_id=1)
[2021/10/14 18:29:54] [ info] [sp] stream processor started
[2021/10/14 18:29:54] [ info] [engine] started (pid=1)
[2021/10/14 18:29:54] [ info] [storage] version=1.1.4, initializing...
[2021/10/14 18:29:54] [ info] [storage] in-memory
[2021/10/14 18:29:54] [ info] [storage] normal synchronization mode, checksum disabled, max_chunks_up=128
[2021/10/14 18:29:54] [ info] [cmetrics] version=0.2.2
[2021/10/14 18:29:54] [ info] [input:tcp:tcp.0] listening on 127.0.0.1:8877
[2021/10/14 18:29:54] [ info] [input:forward:forward.1] listening on unix:///var/run/fluent.sock
[2021/10/14 18:29:54] [ info] [input:forward:forward.2] listening on 127.0.0.1:24224
Fluent Bit v1.8.8
* Copyright (C) 2019-2021 The Fluent Bit Authors
* Copyright (C) 2015-2018 Treasure Data
* Fluent Bit is a CNCF sub-project under the umbrella of Fluentd
* https://fluentbit.io

ssplatt changed the title from "DNS resolution issues still occur in 2.20.0" to "DNS resolution issues still occur in 2.21.0" on Oct 18, 2021
@zhonghui12 (Contributor)

Hi @ssplatt, I tried to reproduce the DNS issue on my side but could not; my task runs fine. I ran it several times, and I saw this error log once, but the chunk was then retried and there were no more warnings:
[2021/10/20 23:16:39] [ warn] [engine] failed to flush chunk '1-1634771796.18542375.flb', retry in 6 seconds: task_id=0, input=forward.1 > output=datadog.1 (out_id=1)

To be specific, here is my task definition:

{
    "family": "my-test-dns-issue",
    "taskRoleArn": "my-task-role-arn",
    "executionRoleArn": "my-execution-role-arn",
    "networkMode": "awsvpc",
    "requiresCompatibilities": [
        "FARGATE"
    ],
    "memory": "512",
    "cpu": "256",
    "containerDefinitions": [
        {
            "essential": true,
            "image": "public.ecr.aws/aws-observability/aws-for-fluent-bit:2.21.0",
            "name": "log_router",
            "firelensConfiguration": {
                "type": "fluentbit",
                "options":{
                    "enable-ecs-log-metadata":"true"
                }
            },
            "logConfiguration": {
                "logDriver": "awslogs",
                "options": {
                    "awslogs-group": "dns-issue-test",
                    "awslogs-region": "us-east-1",
                    "awslogs-create-group": "true",
                    "awslogs-stream-prefix": "firelens"
                }
            },
            "memoryReservation": 50
         },
         {
            "essential": true,
            "image": "nginx",
            "name": "app",
            "logConfiguration": {
                "logDriver":"awsfirelens",
                "options": {
                   "Name": "datadog",
                   "Host": "http-intake.logs.datadoghq.com",
                   "TLS": "on",
                   "dd_service": "my-test-service-dns-issue",
                   "dd_source": "nginx",
                   "dd_tags": "project:example",
                   "provider": "ecs",
                   "apikey":"my-api-key",
                   "Log_Level":"warning",
                   "compress":"gzip"
               }
           }
        }
    ]
}

I am not sure if the issue is related to the env/config setup, so could you please try running in a different/clean env to see if the DNS issue still exists? I am not sure if there is a real bug in the upstream code here. There is also one more DNS option you could try to resolve the issue: "net.dns.mode": "UDP" or "net.dns.mode": "TCP".
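
For example, here is a rough (untested) sketch of how that could look in the CloudFormation Options block from the top of this issue, with net.dns.mode as the only addition. I'm assuming the key is passed straight through to the generated output configuration, the same way the other options are:

...
          LogConfiguration:
            LogDriver: awsfirelens
            Options:
              Name: datadog
              TLS: 'on'
              compress: gzip
              apikey: '${DD_API_KEY}'
              dd_service: !Ref ServiceName
              dd_source: nginx
              Log_Level: warning
              net.dns.mode: TCP    # or UDP; test one at a time
...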

ssplatt commented Oct 21, 2021

Would running in us-east-1 be part of the problem? I have this error in 8 different applications, deployed across 3 different VPCs (environments), so 24 deployments in total. I guess if it's not a Datadog issue or a Fluent Bit issue, then it could be a VPC DNS issue? But I also have Fluentd deployed (which we were hoping to phase out), and it does not have this problem.

zhonghui12 commented Oct 21, 2021

> Would running in us-east-1 be part of the problem?

Do you mean the region for my CW group? I also tried us-west-2, which also works. And yeah, it might be a VPC DNS issue.

@PettitWesley (Contributor)

@ssplatt My guess is that it's an issue with your VPC DNS, yes.

To confirm: you tried both net.dns.mode TCP and UDP, right?

> But I also have Fluentd deployed (which we were hoping to phase out), and it does not have this problem.

Interesting. This does sort of indicate it's a problem specific to some interaction between your DNS settings and Fluent Bit. Can you also curl/dig the Datadog endpoint in your VPC?

Does your company have a support relationship with AWS? If you do, we could possibly try to use that to debug this more closely with you. On GitHub we are more limited to trying to repro with your config in our envs or giving you suggestions of things to try.

@PettitWesley (Contributor)

Actually, the verbose output of dig on the Datadog endpoint in your VPC might be interesting. We could compare it with ours.

Unfortunately I am not an expert on these DNS things 😐

ssplatt commented Oct 22, 2021

I haven't tried the TCP or UDP setting yet; I'll try that tomorrow. We do have a support relationship with AWS, so I'll open a ticket to see if they can help track down a DNS issue in our VPCs. We are also running many Datadog agent containers. Actually, each one of these applications has both a Datadog agent and a Fluent Bit sidecar; the Datadog agent handles metrics and APM. I don't believe we are having any DNS issues with the Datadog agents.

ssplatt commented Oct 22, 2021

> Would running in us-east-1 be part of the problem?
>
> Do you mean the region for my CW group? I also tried us-west-2, which also works. And yeah, it might be a VPC DNS issue.

I missed that your test was in us-east-1. I meant that all of my deployments are in us-east-1.

ssplatt commented Oct 22, 2021

I have only been testing on one task for a few minutes, but TCP seems to be working; UDP produced the error very quickly. Is the default net.dns.mode UDP? I don't see it documented anywhere.

I have a ticket open with AWS to figure out if there is an issue with DNS in my VPC setup.

ssplatt commented Oct 22, 2021

And here is the nslookup using the VPC DNS:

➜ nslookup http-intake.logs.datadoghq.com
Server:		10.1.0.2
Address:	10.1.0.2#53

Non-authoritative answer:
http-intake.logs.datadoghq.com	canonical name = alb-logs-http-shard0-194173476.us-east-1.elb.amazonaws.com.
Name:	alb-logs-http-shard0-194173476.us-east-1.elb.amazonaws.com
Address: 3.233.146.117
Name:	alb-logs-http-shard0-194173476.us-east-1.elb.amazonaws.com
Address: 3.233.146.122
Name:	alb-logs-http-shard0-194173476.us-east-1.elb.amazonaws.com
Address: 3.233.146.7
Name:	alb-logs-http-shard0-194173476.us-east-1.elb.amazonaws.com
Address: 3.233.146.23
Name:	alb-logs-http-shard0-194173476.us-east-1.elb.amazonaws.com
Address: 3.233.146.24
Name:	alb-logs-http-shard0-194173476.us-east-1.elb.amazonaws.com
Address: 3.233.146.35
Name:	alb-logs-http-shard0-194173476.us-east-1.elb.amazonaws.com
Address: 3.233.146.43
Name:	alb-logs-http-shard0-194173476.us-east-1.elb.amazonaws.com
Address: 3.233.146.87

@zhonghui12 (Contributor)

> Is the default net.dns.mode UDP? I don't see it documented anywhere.

@ssplatt The default value for this option is NULL, so you need to set it manually. Upstream has released the change for this option, but it seems they haven't updated the official docs for it yet. It selects the primary DNS connection type (TCP or UDP).

ssplatt commented Oct 22, 2021

And nslookup in the VPC works, which uses UDP. Why would Fluent Bit using UDP fail?

@PettitWesley (Contributor)

@zhonghui12 @ssplatt Looking at the code in flb_net_getaddrinfo() in flb_network.c, it looks like NULL eventually gets translated to UDP. So if TCP is working for you but UDP is not, that would explain why you have to explicitly set it to TCP.

With it set to TCP, it seems to work completely fine?

ssplatt commented Oct 25, 2021

@PettitWesley I spun up a test EC2 instance, installed Docker on it, and created a test container based off aws-for-fluent-bit with a custom configuration that is essentially just a forward input and a datadog output. I haven't seen any DNS errors, even with UDP, but I have seen "failed to flush chunk". Interestingly, I'm sending logs in a loop with an order number in the message, and I'm seeing them all come through in order in Datadog with no loss. Once I added the "Retry_Limit False" setting, I saw no more messages about retrying logs at all. Setting Retry_Limit to another integer, like 10, also seemed to reduce the number of "failed to flush chunk" messages for some reason.

I also see a few logs showing DNS timeouts for s3.amazonaws and our internal Fluentd server, but only 1 or 2 over the past week.

So I think these may just be transient errors exacerbated by the sheer number of logs we are sending, the fragility of UDP, and the default retry limit of 2. I'll go ahead and close this and reopen if I run into more frequent errors or complete loss of logging.
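
For reference, here is a rough (untested) sketch of how the raised retry limit could be carried back into our FireLens Options in CloudFormation, assuming Retry_Limit is passed through to the generated output section the same way the other options are:

...
          LogConfiguration:
            LogDriver: awsfirelens
            Options:
              Name: datadog
              TLS: 'on'
              compress: gzip
              apikey: '${DD_API_KEY}'
              dd_service: !Ref ServiceName
              dd_source: nginx
              Log_Level: warning
              net.dns.mode: TCP       # TCP avoided the UDP DNS timeouts in my testing
              Retry_Limit: 'False'    # no retry limit; an integer like '10' also cut the warnings
...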

ssplatt closed this as completed on Oct 25, 2021
ssplatt commented Oct 29, 2021

FWIW, we're still seeing DNS resolution errors a few times per hour with net.dns.mode TCP set, but the number of "cannot deliver chunk" messages has dropped to 0. I believe this is due to Retry_Limit being raised above the default of 2, so I don't think these warnings are resulting in dropped logs.

@matthewfala (Contributor)

It might be a good idea to have an automatic DNS retry in Fluent Bit core if more people experience this problem. This could be implemented with a for loop around the ares_getaddrinfo() call and timer block here: https://github.com/fluent/fluent-bit/blob/231ef4b38f312eb981583d8f1c2441f7313f128d/src/flb_network.c#L941. The retry condition could be result_code == ARES_ETIMEOUT, which corresponds to the error message you are seeing.

@JohnPreston

Giving a 2022 update: still seeing that error, with the destination being Firehose. However, data does make it to Firehose, so I'm unsure what's happening there.
