
DNS latency of 5s when using iptables forward in pod network traffic #62628

Closed · xiaoxubeii opened this issue Apr 16, 2018 · 19 comments
Labels: kind/bug, sig/network

xiaoxubeii (Member) commented Apr 16, 2018

Is this a BUG REPORT or FEATURE REQUEST?:

/kind bug

What happened:
DNS requests get a 5s latency on the AAAA lookup when iptables forward rules are used for network traffic between pods.

What you expected to happen:
No latency.

How to reproduce it (as minimally and precisely as possible):

  • CNI configuration

{
        "name": "mynet",
        "type": "macvlan",
        "master": "eth0",
        "ipam": {
                "type": "host-local",
                "subnet": "172.20.0.0/17",
                "rangeStart": "172.20.64.129",
                "rangeEnd": "172.20.64.254",
                "gateway": "172.20.127.254",
                "routes": [
                        {"dst":"0.0.0.0/0"},
                        {"dst":"172.20.80.0/24", "gw":"172.20.0.62"}
                ]
        }
}
  • Network Architecture
    The cluster CIDR is 172.20.80.0/24, and the gateway is the current node. The cluster, pods and nodes are on an L2 network using VXLAN.

Anything else we need to know?:
If the CNI gateway for the cluster CIDR is the current node, the network traffic between pods and services goes through iptables forward:

-P FORWARD ACCEPT
-A FORWARD -m comment --comment "kubernetes forward rules" -j KUBE-FORWARD

-N KUBE-FORWARD
-A KUBE-FORWARD -m comment --comment "kubernetes forwarding rules" -m mark --mark 0x4000/0x4000 -j ACCEPT
-A KUBE-FORWARD -s 172.20.0.0/17 -m comment --comment "kubernetes forwarding conntrack pod source rule" -m conntrack --ctstate RELATED,ESTABLISHED -j ACCEPT
-A KUBE-FORWARD -d 172.20.0.0/17 -m comment --comment "kubernetes forwarding conntrack pod destination rule" -m conntrack --ctstate RELATED,ESTABLISHED -j ACCEPT

With forwarding conntrack enabled, netfilter drops the first AAAA record packet when DNS is requested, which causes a DNS latency of 5s.
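
A rough way to confirm this on a node (assuming conntrack-tools and tcpdump are installed and the pod image has nslookup; the lookup name is only an example) is to watch the conntrack statistics while a pod resolves a name:

# On the node: the insert_failed/drop counters grow when the conntrack race is hit
conntrack -S | grep -E 'insert_failed|drop'

# On the node: the retried AAAA query shows up roughly 5s after the first attempt
tcpdump -ni any udp port 53

# Inside the pod: a plain lookup that goes through the cluster DNS service
time nslookup kubernetes.default.svc.cluster.local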

Environment:

  • Kubernetes version (use kubectl version): v1.9.2
  • Cloud provider or hardware configuration: None
  • OS (e.g. from /etc/os-release): CentOS Linux release 7.2.1511 (Core)
  • Kernel (e.g. uname -a): 3.10.0-327.18.2.el7.x86_64
  • Install tools: kubeadm
  • Others:
@k8s-ci-robot k8s-ci-robot added needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. kind/bug Categorizes issue or PR as related to a bug. labels Apr 16, 2018
xiaoxubeii (Member, Author) commented Apr 16, 2018

/sig network
/assign

@k8s-ci-robot k8s-ci-robot added sig/network Categorizes an issue or PR as relevant to SIG Network. and removed needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. labels Apr 16, 2018
MrHohn (Member) commented Apr 30, 2018

> If enable forwarding conntrack, netfilter will drop last AAAA record packet when requests dns

@xiaoxubeii Could you elaborate a bit more on this behavior? Is this a bug in netfilter? Or is this a bug in kube-proxy, in that it doesn't follow some standard while using netfilter? Thanks.

cc @bowei

Quentin-M (Contributor) commented Apr 30, 2018

I am experiencing the same thing, by the way. With Kubernetes 1.10 + CoreOS + Weave + CoreDNS/kube-dns + kube-proxy in ipvs mode, I see a constant 5s latency on DNS resolution. tcpdump shows that the first AAAA requests get lost somehow: https://hastebin.com/banulayire.swift. With single-request or single-request-reopen, the issue is gone.
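
A quick way to see this from inside a pod, assuming the image ships getent and dig and taking kubernetes.default.svc.cluster.local merely as an example name: getaddrinfo(), which getent ahosts and most applications use, fires the A and AAAA queries in parallel and can hit the stall, while dig sends one query at a time and usually does not:

# getaddrinfo() path (parallel A + AAAA): shows the ~5s stall when the race is hit
time getent ahosts kubernetes.default.svc.cluster.local

# single query per run: normally unaffected
time dig +short kubernetes.default.svc.cluster.local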

Quentin-M (Contributor) commented Apr 30, 2018

bboreham (Contributor) commented May 1, 2018

Most of the comments relate to things which will cause intermittent packet loss, but OP seems to be talking about a consistent symptom - every time you do the request it will drop the same packet.

Am I understanding the OP correctly?

I can’t imagine what would cause it to drop the last packet. How would it know it’s the last one?

Quentin-M (Contributor) commented:

@bboreham The blog post I linked above explains the issue very well. It's a race condition with conntrack/SNAT. glibc/musl are very good at triggering it when sending A/AAAA lookups in parallel. Using single-request-reopen works around the issue by serializing the queries. A better fix (as documented) is to add --random-fully to every MASQUERADE rule (kubelet, kube-proxy, overlay).
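
For illustration, a MASQUERADE rule carrying that flag would look roughly like the sketch below; the 172.20.0.0/17 pod CIDR is taken from the CNI config above, and this is not the exact rule kubelet or kube-proxy installs (--random-fully also needs a reasonably recent kernel and iptables >= 1.6.2):

# Sketch only: fully randomize SNAT source ports for pod egress traffic
iptables -t nat -A POSTROUTING -s 172.20.0.0/17 ! -d 172.20.0.0/17 \
  -m comment --comment "masquerade pod egress (illustrative)" \
  -j MASQUERADE --random-fully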

xiaoxubeii (Member, Author) commented:

@MrHohn @bboreham @Quentin-M I think the problem I hit is a race condition on conntrack insertions. I use the node as the gateway, so it redirects packets from pods to services. When iptables sets:

-A KUBE-FORWARD -s 172.20.0.0/17 -m comment --comment "kubernetes forwarding conntrack pod source rule" -m conntrack --ctstate RELATED,ESTABLISHED -j ACCEPT
-A KUBE-FORWARD -d 172.20.0.0/17 -m comment --comment "kubernetes forwarding conntrack pod destination rule" -m conntrack --ctstate RELATED,ESTABLISHED -j ACCEPT

and we send DNS requests from pods, the second DNS request is received while the first one is still not confirmed, so both requests have unconfirmed conntrack entries. The second one is then dropped in nf_conntrack_confirm, which results in a DNS timeout and retransmission.

So a simple solution is to use single-request-reopen:

single-request-reopen (since glibc 2.9)
                     Sets RES_SNGLKUPREOP in _res.options.  The resolver
                     uses the same socket for the A and AAAA requests.  Some
                     hardware mistakenly sends back only one reply.  When
                     that happens the client system will sit and wait for
                     the second reply.  Turning this option on changes this
                     behavior so that if two requests from the same port are
                     not handled correctly it will close the socket and open
                     a new one before sending the second request.

It is more of a netfilter bug or flaw, but I think kube-dns also needs to do something to avoid it.
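
As a quick sanity check of that workaround in a running glibc-based pod (assuming the container's resolv.conf is writable and the image has getent), one can append the option and re-run a lookup; running the same echo from a postStart hook is one way to apply it to new pods:

# Inside an affected pod (glibc resolver only; musl ignores this option, see the later comments)
echo "options single-request-reopen" >> /etc/resolv.conf

# Re-test: the getaddrinfo() path should no longer stall for ~5s
time getent ahosts kubernetes.default.svc.cluster.local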

Quentin-M (Contributor) commented May 2, 2018

Yes, this is what's described there. AFAIK, the real solution would be to patch kubelet, kube-proxy and the overlay networks (flannel, weave, calico, etc.), adding --random-fully to the MASQUERADE rules.

But I agree that your patch, while slowing down DNS lookups a little (well, it's not as bad as 5s+!), is simple and effective. People in other threads have also mentioned deployment initializers, using dnsPolicy=None in Kubernetes 1.10, or manually mounting /etc/resolv.conf - but I'd rather not force cluster users to apply such workarounds, since it's really an infrastructure issue.

Or is it a totally different issue?
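
For reference, a minimal sketch of the dnsPolicy=None variant mentioned above (Kubernetes 1.10+); the nameserver IP, search domains, pod name and image are placeholders that would have to match the actual cluster:

kubectl apply -f - <<'EOF'
apiVersion: v1
kind: Pod
metadata:
  name: dns-options-example
spec:
  dnsPolicy: "None"
  dnsConfig:
    nameservers:
      - 10.96.0.10               # placeholder: the cluster DNS service IP
    searches:
      - default.svc.cluster.local
      - svc.cluster.local
      - cluster.local
    options:
      - name: ndots
        value: "5"
      - name: single-request-reopen
  containers:
    - name: app
      image: debian:stretch      # glibc-based image, so the option is honoured
      command: ["sleep", "3600"]
EOF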

bboreham (Contributor) commented May 2, 2018

It’s great that you all agree on the cause, but still: a race condition would only cause the problem some of the time.

So, was it consistent (happened every time) or occasional (sometimes happened)?

Quentin-M (Contributor) commented:

Fair point. I wonder if something like this may be happening instead? I am not familiar enough with networking to be able to tell.

xiaoxubeii (Member, Author) commented:

@Quentin-M I am not sure which one causes this problem, the conntrack rules in KUBE-FORWARD or the MASQUERADE rules, because in my case both exist. When I remove conntrack from KUBE-FORWARD, the problem is gone (or at least alleviated).

@bboreham In my case, it is consistent. It always drops the AAAA packet.

bboreham (Contributor) commented May 8, 2018

The inestimable @brb has found another race condition, which will tend to cause the last packet to be dropped. weaveworks/weave#3287 (comment)

Quentin-M (Contributor) commented May 15, 2018

I would just like to add here that the single-request(-reopen) workaround does not work with Alpine-based containers, as musl does not support the option (see below). Unfortunately, Alpine Linux is the base image of 90% of our infrastructure.

src/network/resolvconf.c

                if (!strncmp(line, "options", 7) && isspace(line[7])) {
                        p = strstr(line, "ndots:");
                        if (p && isdigit(p[6])) {
                                p += 6;
                                unsigned long x = strtoul(p, &z, 10);
                                if (z != p) conf->ndots = x > 15 ? 15 : x;
                        }
                        p = strstr(line, "attempts:");
                        if (p && isdigit(p[9])) {
                                p += 9;
                                unsigned long x = strtoul(p, &z, 10);
                                if (z != p) conf->attempts = x > 10 ? 10 : x;
                        }
                        p = strstr(line, "timeout:");
                        if (p && (isdigit(p[8]) || p[8]=='.')) {
                                p += 8;
                                unsigned long x = strtoul(p, &z, 10);
                                if (z != p) conf->timeout = x > 60 ? 60 : x;
                        }
                        continue;
                }

src/network/lookup.h

struct resolvconf {
        struct address ns[MAXNS];
        unsigned nns, attempts, ndots;
        unsigned timeout;
};

I reached out on freenode's #musl channel, but unfortunately there does not seem to be much desire to add support for the option:

[16:19] <dalias> why not fix the bug causing it?
[16:20] <dalias> sprry
[16:20] <dalias> the option is not something that can be added, its contrary to the lookup architecture
[17:39] <dalias> quentinm, thanks for the report. i just don't know any good way to work around it on our side without nasty hacks
[17:40] <dalias> the architecture is not designed to support sequential queries

xiaoxubeii (Member, Author) commented:

/close

Quentin-M (Contributor) commented:

I just posted a little write-up about our journey troubleshooting the issue, and how we worked around it in production: https://blog.quentin-machu.fr/2018/06/24/5-15s-dns-lookups-on-kubernetes/.

szuecs (Member) commented Jul 9, 2018

@xiaoxubeii @Quentin-M what is the current favorite workaround for this?

We run flannel with VXLAN and see occasional blips in our monitoring where DNS requests spike up to 5s. One time we probably (not 100% sure) had a production incident because of that.

Should I port the script referenced by @Quentin-M to flannel, or is there already something else?

szuecs (Member) commented Jul 10, 2018

bboreham (Contributor) commented Aug 2, 2018

@xiaoxubeii why is this issue closed? @brb has fixed one of the kernel races but other causes remain.

inter169 commented Aug 3, 2018

I coded a fix for musl on Alpine Linux 3.7 that removes the AAAA query by default (AF_UNSPEC).

see #56903 (comment)

thanks,
harper
