
DNS lookup timeouts due to races in conntrack #3287

Open
dcowden opened this issue Apr 26, 2018 · 137 comments

@dcowden

dcowden commented Apr 26, 2018

What happened?

We are experiencing random 5 second DNS timeouts in our kubernetes cluster.

How to reproduce it?

It is reproducible by requesting just about any in-cluster service and observing that periodically (in our case, 1 out of every 50 or 100 requests) we get a 5 second delay. It always happens in the DNS lookup.

Anything else we need to know?

We believe this is a result of a kernel level SNAT race condition that is described quite well here:

https://tech.xing.com/a-reason-for-unexplained-connection-timeouts-on-kubernetes-docker-abd041cf7e02

The problem also happens with non-weave CNI implementations, and is (ironically) not really a weave issue at all. However, it becomes a weave issue, because the solution is to set a flag on the masquerading rules that are created, and those rules are under nobody's control except weave's.

What we need is the ability to apply the NF_NAT_RANGE_PROTO_RANDOM_FULLY flag to the masquerading rules that weave sets up. In the above post, Flannel was in use, and the fix was made there instead.

We searched for this issue and didn't see that anyone had asked for it. We're also unaware of any setting that allows this flag to be applied today; if that's possible, please let us know.
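
For illustration, here is roughly what a masquerade rule with full port randomization looks like once iptables >= 1.6.2 and a kernel >= 3.13 are available. This is a sketch only; the chain and subnet below are examples, not weave's actual rules:

# example only: a masquerade rule with full source-port randomization
iptables -t nat -A POSTROUTING -s 10.32.0.0/12 ! -d 10.32.0.0/12 -j MASQUERADE --random-fully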

@bboreham
Contributor

Whoa! Good job for finding that.

However:

The iptables tool doesn't support setting this flag

this might be an issue.

@dcowden
Author

dcowden commented Apr 27, 2018

@bboreham my kernel networking Fu is weak, so I'm not even able to suggest any workarounds. I'm hoping others here have stronger Fu... Challenge proposed!

Naysayers frequently make scary, handwavey stability arguments against container stacks. Usually I laugh in the face of danger, but this appears to be the first case I've ever seen in which a little-known kernel-level gotcha actually does create issues for containers that would otherwise be unlikely to surface.

@btalbot

btalbot commented Apr 27, 2018

I just spent several hours troubleshooting this problem, ran into the same XING blog post, and then found this issue report, which was opened while I was troubleshooting!

Anyway, I'm seeing the same issues reported in the XING blog. DNS 5 second delays and a lot of insert_failed counts from conntrack using weave 2.3.0.

cpu=0 found=8089 invalid=353025 ignore=1249480 insert=0 insert_failed=8042 drop=8042 early_drop=0 error=0 search_restart=591166

More details can be provided if needed.
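
For anyone else chasing this: those counters come from the kernel's conntrack statistics, and can be read on the node roughly like this (conntrack -S needs conntrack-tools installed; the /proc file holds the same per-CPU counters in hex):

# per-CPU conntrack statistics, including insert_failed and drop
conntrack -S
# raw equivalent, values in hex
cat /proc/net/stat/nf_conntrack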

@dcowden
Author

dcowden commented Apr 27, 2018

@btalbot one workaround you might try is to set this option in resolv.conf:

options single-request-reopen

It is a workaround that will basically make glibc retry the lookup, which will work most of the time.

Another band-aid that helps is to change ndots from 5 (the default) to 3, which will generate far fewer requests to your DNS servers and lessen the frequency.

The problem is that it's kind of a pain to force changes into resolv.conf. It's done with the kubelet --resolv-conf option, but then you have to create the whole file yourself, which stinks.
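
For reference, a minimal sketch of what the resulting /etc/resolv.conf inside a pod might look like; the nameserver IP and search domains below are placeholders for whatever your cluster actually uses:

# placeholder values: substitute your cluster DNS service IP and search domains
nameserver 100.64.0.10
search default.svc.cluster.local svc.cluster.local cluster.local
options ndots:3 single-request-reopen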

@dcowden
Author

dcowden commented Apr 27, 2018

@bboreham it does appear that the patched iptables is available. Can weave use a patched iptables?

@bboreham
Contributor

bboreham commented Apr 27, 2018

The easiest thing is to use an iptables from a released Alpine package. From there it gets progressively harder.

(Sorry for closing/reopening - finger slipped)

@bboreham bboreham reopened this Apr 27, 2018
@bboreham
Contributor

BTW my top tip to reduce DNS requests is to put a dot at the end when you know the full address, e.g. instead of example.com put example.com. (note the trailing dot). This means it will not go through the search path, reducing lookups by 5x in a typical Kubernetes install.

For an in-cluster address if you know the namespace you can construct the fqdn, e.g. servicename.namespacename.svc.cluster.local.
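
A quick way to see the difference (a sketch; the domain names are examples, and the exact behaviour depends on the resolver in the image):

# without the trailing dot the resolver walks the search path first
nslookup example.com
# with the trailing dot the name is treated as fully qualified: one lookup
nslookup example.com.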

@dcowden
Author

dcowden commented Apr 27, 2018

@bboreham great tip, I didn't know that one! Thanks

@dcowden
Author

dcowden commented Apr 27, 2018

I did a little investigation on netfilter.org.
It appears that the iptables patch that adds --random-fully is in iptables v1.6.2, released on 2/22/2018.

alpine:latest packages v1.6.1; however, alpine:edge packages v1.6.2.
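
One quick way to confirm what each branch ships (a sketch; the versions reported will of course move over time):

docker run --rm alpine:latest sh -c 'apk add --no-cache iptables >/dev/null && iptables --version'
docker run --rm alpine:edge sh -c 'apk add --no-cache iptables >/dev/null && iptables --version'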

@btalbot

btalbot commented Apr 27, 2018

For an in-cluster address if you know the namespace you can construct the fqdn, e.g. servicename.namespacename.svc.cluster.local.

This only works for some apps or resolvers. The bind tools honor it of course, since that is a decades-old syntax from bind's zone files. But for apps that try to fix up the address or use a different resolver, that trick doesn't work. Curl is a good example of it not working.

From inside an alpine container, curl https://kubernetes/ will hit the API server of course, but so does curl https://kubernetes./

@dcowden
Author

dcowden commented Apr 27, 2018

In our testing, we have found that only the options single-request-reopen change actually addresses this issue. It's a band-aid, but DNS lookups are fast, so we get aberrations of around 100ms, not 5 seconds, which is acceptable for us.

Now we're trying to figure out how to inject that into resolv.conf on all the pods. Anyone know how to do that?

@btalbot

btalbot commented Apr 27, 2018

I found this hack in some other related github issues and it's working for me

apiVersion: v1
data:
  resolv.conf: |
    nameserver 1.2.3.4
    search default.svc.cluster.local svc.cluster.local cluster.local ec2.internal
    options ndots:3 single-request-reopen
kind: ConfigMap
metadata:
  name: resolvconf

Then in your affected pods and containers

        volumeMounts:
        - name: resolv-conf
          mountPath: /etc/resolv.conf
          subPath: resolv.conf
...

      volumes:
      - name: resolv-conf
        configMap:
          name: resolvconf
          items:
          - key: resolv.conf
            path: resolv.conf

@dcowden
Author

dcowden commented Apr 27, 2018

@btalbot thanks for posting that. That would definitely work in a pinch!

We use kops for our cluster, and this seems promising, but I'm still learning how it works.

@Quentin-M

Quentin-M commented May 1, 2018

Experiencing the same issue here. 5s delays on every, single, DNS lookup, 100% of the time. Similarly, insert_failed does increase for each DNS query. The AAAA query, that happens a few cycles after the A query, gets dropped systematically (tcpdump: https://hastebin.com/banulayire.swift).

Mounting a resolv.conf by hand in every single pod of our infrastructure is untenable.
kubernetes/kubernetes#62764 attempts to add the workaround as a default in Kubernetes, but the PR is unlikely to land, and even if it does, it won't be released for a good while.

Here is the flannel patch: https://gist.github.com/maxlaverse/1fb3bfdd2509e317194280f530158c98

@dcowden
Author

dcowden commented May 1, 2018

@Quentin-M what k8s version are you using? I'm curious why it's 100% repeatable for some but intermittent for others.

Another method to inject resolv.conf changes would be a deployment initializer. I've been trying to avoid creating one, but it's beginning to seem inevitable that in an enterprise environment you need a way to enforce various things on every launched workload in a central way.

I'm still investigating the use of kubelet --resolv-conf, but what I'm really worried about is that all this is just a band-aid.

The only actual fix is the iptables flag

@brb
Contributor

brb commented May 1, 2018

Has anyone tried installing and running iptables-1.6.2 from the alpine packages for edge on Alpine 3.7?

@dcowden
Author

dcowden commented May 1, 2018

@brb I was wondering the same thing. It would be nice to make progress and get a PR ready in anticipation of the availability of 1.6.2. My Go fu is too weak to take a shot at making the fix, but I'm guessing the fix goes somewhere around expose.go?

If it were possible to create a frankenversion that has this fix, we could test it out.

@brb
Contributor

brb commented May 1, 2018

Has anyone tried installing and running iptables-1.6.2 from the alpine packages for edge on Alpine 3.7?

Just installed it with apk add iptables --update-cache --repository http://dl-3.alpinelinux.org/alpine/edge/main/. However, I cannot guarantee that we aren't missing anything by using iptables from edge on 3.7.

the fix goes somewhere around expose.go

Yes, you are right.

If it were possible to create a frankenversion that has this fix, we could test it out.

I've just created the weave-kube image with the fix, for the amd64 arch only and kernel >= 3.13 (https://github.com/weaveworks/weave/tree/issues/3287-iptables-random-fully). To use it, please change the image name of weave-kube to "brb0/weave-kube:iptables-random-fully" in the Weave DaemonSet.
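
If it helps, one way to swap the image in place (assuming the container in the weave-net DaemonSet is named weave, as in a standard install):

kubectl -n kube-system set image daemonset/weave-net weave=brb0/weave-kube:iptables-random-fully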

@dcowden
Author

dcowden commented May 1, 2018

@brb Score! That's awesome! We'll try this out ASAP!
We're currently using image weaveworks/weave-kube:2.2.0, via a kops cluster. Would this image interoperate OK with those?

@brb
Contributor

brb commented May 1, 2018

I can't think of anything which would prevent it from working.

Please let us know whether it works, thanks!

@Quentin-M

@dcowden Kubernetes 1.10.1, Container Linux 1688.5.3-1758.0.0, AWS VPCs, Weave 2.3.0, kube-proxy IPVS. My guess is that it depends how fast/stable your network is?

@Quentin-M

@dcowden

I'm still investigating the use of kubelet --resolv-conf, but what I'm really worried about is that all this is just a band-aid.

I tried it the other day; while it changed the resolv.conf of my static pods, all the other pods (with the default dnsPolicy) were still based on what dns.go constructs. Note that the DNS options are written as a constant there. There is no possibility of getting single-request-reopen without running your own compiled version of kubelet.

@Quentin-M

Quentin-M commented May 1, 2018

@brb Thanks! I hadn't realized yesterday that the patched iptables was already in an Alpine release. My issue is definitely still present, and both insert_failed and drop are still increasing. I note however that there are two other MASQUERADE rules in place that do not have --random-fully, so that might be why? I am no network expert by any means, unfortunately.

# Setup by WEAVE too.
-A POSTROUTING -s 172.17.0.0/16 ! -o docker0 -j MASQUERADE

# Setup by both kubelet and kube-proxy, used to SNAT ports when querying services.
-A KUBE-POSTROUTING -m comment --comment "kubernetes service traffic requiring SNAT" -m mark --mark 0x4000/0x4000 -j MASQUERADE

-A WEAVE ! -s 172.16.0.0/16 -d 172.16.0.0/16 -j MASQUERADE --random-fully
-A WEAVE -s 172.16.0.0/16 ! -d 172.16.0.0/16 -j MASQUERADE --random-fully
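
To check which masquerade rules carry the flag, the nat table can be listed on the node, e.g. as below; note this needs to be run with the 1.6.2 binary, otherwise the flag may not be displayed:

# list nat rules and pick out the masquerade ones
iptables -t nat -S | grep MASQUERADE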

@dcowden
Author

dcowden commented May 1, 2018

@brb, I tried this out. I was able to upgrade successfully, but it didn't help my problems.

I think maybe I don't have it installed correctly, because my iptables rules do not show the --random-fully flag anywhere.

Here's my daemonset (annotations and stuff after the image omitted):

dcowden@ubuntu:~/gitwork/kubernetes$ kc get ds weave-net -n kube-system -o yaml
apiVersion: extensions/v1beta1
kind: DaemonSet
metadata:
  ...omitted annotations...
  creationTimestamp: 2017-12-21T16:37:59Z
  generation: 4
  labels:
    name: weave-net
    role.kubernetes.io/networking: "1"
  name: weave-net
  namespace: kube-system
  resourceVersion: "21973562"
  selfLink: /apis/extensions/v1beta1/namespaces/kube-system/daemonsets/weave-net
  uid: 4dd96bf2-e66d-11e7-8b61-069a0a6ccd8c
spec:
  revisionHistoryLimit: 10
  selector:
    matchLabels:
      name: weave-net
      role.kubernetes.io/networking: "1"
  template:
    metadata:
      annotations:
        scheduler.alpha.kubernetes.io/critical-pod: ""
      creationTimestamp: null
      labels:
        name: weave-net
        role.kubernetes.io/networking: "1"
    spec:
      containers:
      - command:
        - /home/weave/launch.sh
        env:
        - name: WEAVE_PASSWORD
          valueFrom:
            secretKeyRef:
              key: weave-passwd
              name: weave-passwd
        - name: HOSTNAME
          valueFrom:
            fieldRef:
              apiVersion: v1
              fieldPath: spec.nodeName
        - name: IPALLOC_RANGE
          value: 100.96.0.0/11
        - name: WEAVE_MTU
          value: "8912"
        image: brb0/weave-kube:iptables-random-fully
        ...more stuff...

The daemonset was updated OK. Here are the iptables rules I see on a host. I don't see --random-fully anywhere:

[root@ip-172-25-19-92 ~]# iptables --list-rules
-P INPUT ACCEPT
-P FORWARD ACCEPT
-P OUTPUT ACCEPT
-N KUBE-FIREWALL
-N KUBE-FORWARD
-N KUBE-SERVICES
-N WEAVE-IPSEC-IN
-N WEAVE-NPC
-N WEAVE-NPC-DEFAULT
-N WEAVE-NPC-INGRESS
-A INPUT -m comment --comment "kubernetes service portals" -j KUBE-SERVICES
-A INPUT -j KUBE-FIREWALL
-A INPUT -j WEAVE-IPSEC-IN
-A FORWARD -o weave -m comment --comment "NOTE: this must go before \'-j KUBE-FORWARD\'" -j WEAVE-NPC
-A FORWARD -o weave -m state --state NEW -j NFLOG --nflog-group 86
-A FORWARD -o weave -j DROP
-A FORWARD -i weave ! -o weave -j ACCEPT
-A FORWARD -o weave -m conntrack --ctstate RELATED,ESTABLISHED -j ACCEPT
-A FORWARD -m comment --comment "kubernetes forward rules" -j KUBE-FORWARD
-A OUTPUT -m comment --comment "kubernetes service portals" -j KUBE-SERVICES
-A OUTPUT -j KUBE-FIREWALL
-A OUTPUT ! -p esp -m policy --dir out --pol none -m mark --mark 0x20000/0x20000 -j DROP
-A KUBE-FIREWALL -m comment --comment "kubernetes firewall for dropping marked packets" -m mark --mark 0x8000/0x8000 -j DROP
-A KUBE-FORWARD -m comment --comment "kubernetes forwarding rules" -m mark --mark 0x4000/0x4000 -j ACCEPT
-A KUBE-FORWARD -s 100.96.0.0/11 -m comment --comment "kubernetes forwarding conntrack pod source rule" -m conntrack --ctstate RELATED,ESTABLISHED -j ACCEPT
-A KUBE-FORWARD -d 100.96.0.0/11 -m comment --comment "kubernetes forwarding conntrack pod destination rule" -m conntrack --ctstate RELATED,ESTABLISHED -j ACCEPT
-A KUBE-SERVICES -d 100.65.65.105/32 -p tcp -m comment --comment "default/schaeffler-logstash:http has no endpoints" -m tcp --dport 9600 -j REJECT --reject-with icmp-port-unreachable
-A KUBE-SERVICES -p tcp -m comment --comment "ops/echoheaders:http has no endpoints" -m addrtype --dst-type LOCAL -m tcp --dport 31436 -j REJECT --reject-with icmp-port-unreachable
-A KUBE-SERVICES -d 100.69.172.111/32 -p tcp -m comment --comment "ops/echoheaders:http has no endpoints" -m tcp --dport 80 -j REJECT --reject-with icmp-port-unreachable
-A WEAVE-IPSEC-IN -s 172.25.83.126/32 -d 172.25.19.92/32 -p udp -m udp --dport 6784 -m mark ! --mark 0x20000/0x20000 -j DROP
-A WEAVE-IPSEC-IN -s 172.25.83.234/32 -d 172.25.19.92/32 -p udp -m udp --dport 6784 -m mark ! --mark 0x20000/0x20000 -j DROP
-A WEAVE-IPSEC-IN -s 172.25.83.40/32 -d 172.25.19.92/32 -p udp -m udp --dport 6784 -m mark ! --mark 0x20000/0x20000 -j DROP
-A WEAVE-IPSEC-IN -s 172.25.51.21/32 -d 172.25.19.92/32 -p udp -m udp --dport 6784 -m mark ! --mark 0x20000/0x20000 -j DROP
-A WEAVE-IPSEC-IN -s 172.25.51.170/32 -d 172.25.19.92/32 -p udp -m udp --dport 6784 -m mark ! --mark 0x20000/0x20000 -j DROP
-A WEAVE-IPSEC-IN -s 172.25.51.29/32 -d 172.25.19.92/32 -p udp -m udp --dport 6784 -m mark ! --mark 0x20000/0x20000 -j DROP
-A WEAVE-IPSEC-IN -s 172.25.19.130/32 -d 172.25.19.92/32 -p udp -m udp --dport 6784 -m mark ! --mark 0x20000/0x20000 -j DROP
-A WEAVE-NPC -m state --state RELATED,ESTABLISHED -j ACCEPT
-A WEAVE-NPC -d 224.0.0.0/4 -j ACCEPT
-A WEAVE-NPC -m state --state NEW -j WEAVE-NPC-DEFAULT
-A WEAVE-NPC -m state --state NEW -j WEAVE-NPC-INGRESS
-A WEAVE-NPC -m set ! --match-set weave-local-pods dst -j ACCEPT
-A WEAVE-NPC-DEFAULT -m set --match-set weave-f(09:Q6gzJb~LE_pU4n:@416L dst -m comment --comment "DefaultAllow isolation for namespace: ops" -j ACCEPT
-A WEAVE-NPC-DEFAULT -m set --match-set weave-jXXXW48#WnolRYPFUalO(fLpK dst -m comment --comment "DefaultAllow isolation for namespace: troubleshooting" -j ACCEPT
-A WEAVE-NPC-DEFAULT -m set --match-set weave-E.1.0W^NGSp]0_t5WwH/]gX@L dst -m comment --comment "DefaultAllow isolation for namespace: default" -j ACCEPT
-A WEAVE-NPC-DEFAULT -m set --match-set weave-0EHD/vdN#O4]V?o4Tx7kS;APH dst -m comment --comment "DefaultAllow isolation for namespace: kube-public" -j ACCEPT
-A WEAVE-NPC-DEFAULT -m set --match-set weave-?b%zl9GIe0AET1(QI^7NWe*fO dst -m comment --comment "DefaultAllow isolation for namespace: kube-system" -j ACCEPT

I don't know what to try next.

@Quentin-M

@dcowden You need to make sure you are calling iptables 1.6.2, otherwise you will not see the flag. One solution is to run iptables from within the weave container. As in your case, it did not help my issue; the first AAAA query still appears to be dropped. I am compiling kube-proxy/kubelet to add the --random-fully flag there as well, but this is going to take a while.
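
For example, something like this should work (a sketch; the pod name is a placeholder, and the container name weave matches a standard weave-net DaemonSet):

kubectl -n kube-system exec <weave-net-pod-name> -c weave -- iptables -t nat -S | grep -- --random-fully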

@bboreham
Contributor

bboreham commented Mar 7, 2019

there is still a problem when 2+ containers connect to google.com at the same time?

[EDIT: I was confused so scoring out this part. See later comment too.]
Those (TCP) connections are never a problem, because they will come from unique source ports.

The problem [EDIT: in this specific GitHub issue] comes when certain DNS clients make two simultaneous UDP requests with identical source ports (and the destination port is always 53), so we get a race.

The best mitigation is a DNS service which does not go via NAT. This is being worked on in Kubernetes: basically one DNS instance per node, with NAT disabled for on-node connections.
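
To observe the racy pattern directly, one option is to capture DNS traffic on the node and look for parallel A/AAAA queries sharing a source port (a sketch; the interface name below is an example, use whatever your CNI creates):

tcpdump -ni weave udp port 53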

@krzysztof-bronk

But isn't there a race condition in that source-port uniqueness algorithm during SNAT, regardless of protocol, affecting different pods on the same host in the same way the DNS UDP client issue affects a single pod? Basically as in https://tech.xing.com/a-reason-for-unexplained-connection-timeouts-on-kubernetes-docker-abd041cf7e02

@bboreham
Contributor

bboreham commented Mar 7, 2019

Sorry, yes, there is a different race condition to do with picking unique outgoing ports for SNAT.

If you are actually encountering this please open a new issue giving the details.

@krzysztof-bronk

Thank you for the response. Indeed I'm seeing insert_failed despite implementing several workarounds, and I'm not sure whether it's TCP, UDP, SNAT or DNAT. We can't bump the kernel yet.

If I understood correctly, the SNAT case should be mitigated by the "random fully" flag, but Weave never went on with it? I think kubelet and kube-proxy would need it as well anyway; I don't know where things stand there.

There is one more head-scratching case for me, which is how all those cases fare when one uses NodePort. Isn't there a similar conntrack problem if a NodePort forwards to a cluster IP?

@bboreham
Contributor

bboreham commented Mar 7, 2019

the "random fully" flag, but Weave never went on with it?

We investigated the problem reported here, and developed fixes to that problem. If someone reports symptoms that are improved by "random fully" then we might add that. We have finite resources and have to concentrate on what is actually reported (and within that set, on paying customers).

Or, since it's Open Source, anyone else can do the investigation and contribute a PR.

@krzysztof-bronk

I understand :) I was merely trying to comprehend where things stand with regards to the different races and available mitigations, since there exist several blog posts and several github issues with a massive amount of comments to parse.

From my understanding of all of it, even with the 2 kernel fixes, the DNS workarounds, and the iptables flags, there is still an issue at least with multipod -> Cluster IP multipod connections, and without kernel 5.0 or "random fully" also an issue with simple multipod -> External IP connections.

But yeah, I'll raise a new issue if that proves true and impactful enough for us in production. Thank you

@Krishna1408

Krishna1408 commented Jul 16, 2019

@Quentin-M @brb We are using weave as well for our CNI, and I tried the workaround mentioned by @Quentin-M. But I am getting this error:

No distribution data for pareto (/lib/tc//pareto.dist: No such file or directory)

I am using debian: 4.9.0-7-amd64 #1 SMP Debian 4.9.110-3+deb9u2 (2018-08-13) x86_64 GNU/Linux

And I have it mounted on /usr/lib/tc.

Can you please point out where I am going wrong?

    spec:
      containers:
      - name: weave-tc
        image: 'qmachu/weave-tc:0.0.1'
        securityContext:
          privileged: true
        volumeMounts:
          - name: xtables-lock
            mountPath: /run/xtables.lock
          - name: usr-lib-tc
            mountPath: /usr/lib/tc

      volumes:
      - hostPath:
          path: /usr/lib/tc
          type: ""
        name: usr-lib-tc

Edit:
In the container spec, the usr-lib-tc volumeMount needs an update: the mountPath should be /lib/tc instead of /usr/lib/tc.

@hairyhenderson
Contributor

@Krishna1408 If you change mountPath: /usr/lib/tc to mountPath: /lib/tc it should work. It needs to be mounted in /lib/tc inside the container, but it's (usually) /usr/lib/tc on the host.

@Krishna1408

Hi @hairyhenderson thanks a lot, it works for me :)

@phlegx

phlegx commented Oct 6, 2019

@brb May I ask if the problem (5 sec DNS delay) is solved with the 5.x kernel? Do you have some more details and feedback from people already?

@brb
Contributor

brb commented Oct 7, 2019

@phlegx It depends on which race condition you hit. The first two out of the three got fixed in the kernel, and someone reported success (kubernetes/kubernetes#56903 (comment)).

However, not much can be done from the kernel side about the third race condition. See my comments in the linked issue.

@bboreham
Contributor

bboreham commented Oct 7, 2019

I will repeat what a few others have said in this thread: the best way forward, if you have this problem, is “node-local dns”. Then there is no NAT on the DNS requests from pods and so no race condition.

Support for this configuration is slowly improving in Kubernetes and installers.

@phlegx

phlegx commented Oct 7, 2019

We have upgraded to Linux 5.x now, and for now the "5 second" problem seems to be "solved". We need to check on the third race condition. Thanks for your replies!

@insoz

insoz commented Oct 15, 2019

We have upgraded to Linux 5.x now, and for now the "5 second" problem seems to be "solved". We need to check on the third race condition. Thanks for your replies!

You mean Linux 5.x as in kernel 5.x?

@thockin

thockin commented Apr 10, 2020

I just wanted to pop in and say thanks for this excellent and detailed explanation. 2 years since it was filed and 1 year since it was fixed, some people still hit this issue, and frankly the DNAT part of it had me baffled.

It took a bit of reasoning, but as I understand it the client sends multiple UDP requests on the same {source IP, source port, dest IP, dest port, protocol} tuple and one just gets lost. Since clients are INTENTIONALLY sending them in parallel, the race is exacerbated.

@DerGenaue

DerGenaue commented Apr 12, 2020

I was able to solve the issue by using the sessionAffinity feature of Kubernetes:
changing service.spec.sessionAffinity on the kube-dns service in the kube-system namespace from None to ClientIP
resolved it basically immediately on our cluster.
I can't tell how long it will last, though; I expect the next Kubernetes upgrade to revert that setting.
I'm pretty sure this shouldn't have any problematic side-effects, but I cannot tell for sure.

This solution makes all DNS request packets from one pod be delivered to the same kube-dns pod, thus eliminating the problem that the conntrack DNAT race condition causes
(the race condition still exists, it just doesn't have any effect anymore).
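
For reference, the change described above as a one-liner (assuming the cluster DNS service is named kube-dns, the default in most installs):

kubectl -n kube-system patch service kube-dns -p '{"spec":{"sessionAffinity":"ClientIP"}}'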

@bboreham
Contributor

bboreham commented Apr 12, 2020

@DerGenaue as far as I can tell sessionAffinity only works with proxy-mode userspace, which will slow down service traffic to an extent that some people will not tolerate.

@thockin

thockin commented Apr 12, 2020 via email

@DerGenaue

DerGenaue commented Apr 12, 2020

I checked the kube-proxy code, and the iptables mode generates the sessionAffinity rules just fine.
I don't think any single pod will ever make so many DNS requests as to cause any problems in this regard.
Also, the way I understood it, the current plan for the future is to route all DNS requests to the DNS pod running on the same node (i.e. only node-local DNS traffic), which would be very similar to this solution.

@thockin

thockin commented Apr 12, 2020 via email

@elmiedo

elmiedo commented May 22, 2020

Hi. Why not implement dnsmasq instead of working with the usual DNS clients?
Dnsmasq is able to send a DNS query to every DNS server in its config file simultaneously; you just receive the fastest reply.

@bboreham
Contributor

@elmiedo it is uncommon to have the opportunity to change the DNS client - it's bound into each container image, in code from glibc or musl or similar. And the problem we are discussing hits between that client and the Linux kernel, so the server (such as dnsmasq) does not have a chance to affect things.

@thockin

thockin commented May 22, 2020 via email

@chengzhycn

chengzhycn commented Oct 26, 2021

We wrote a blog post describing the technical details of the problem and presenting the kernel fixes: https://www.weave.works/blog/racy-conntrack-and-dns-lookup-timeouts.

@brb Thanks for your excellent explanation. But there is one small point that confuses me. I looked at the glibc source code: it uses send_dg to send the A and AAAA queries via UDP in parallel, but that just calls sendmmsg, which seems to send the two UDP packets from one thread (so it doesn't match the "different threads" condition). Am I misunderstanding something? Looking forward to your reply. :)

@axot

axot commented Oct 26, 2021

https://elixir.bootlin.com/linux/v5.14.14/source/net/socket.c#L2548
Same question: is it possible that this runs on a different CPU because of cond_resched()?
