
Upscaling of coredns pods leads to DNS timeout errors #113080

Closed
sli720 opened this issue Oct 15, 2022 · 16 comments
Labels: kind/bug, needs-triage, sig/network

Comments

@sli720 commented Oct 15, 2022

What happened?

Since upgrading to CentOS 9 and Kubernetes 1.24.6 (on the same hardware), sporadic DNS resolution errors occur when too many coredns pods are running at the same time. If I reduce them to <= 3, no more errors occur. Once the error occurs, you see lines in the logs of nodelocaldns pods like:

[ERROR] plugin/errors: 2 git-cache.ci.svc.cluster.local. A: select tcp 10.233.0.3:53: i/o timeout

It looks like the nodelocaldns pod sometimes can't contact the coredns pods for some reason. There are no errors in the logs of the coredns pods, nor in the logs of the calico pods. The problem also occurs under low load (CPU, network, disk) on the cluster. Could this be a bug in nodelocaldns or coredns, or a misconfiguration of /etc/resolv.conf? It is strange that the problem disappears when I reduce the number of coredns pods.

What did you expect to happen?

nodelocaldns pods should always be able to reach the coredns pods

How can we reproduce it (as minimally and precisely as possible)?

Run the nslookup command many times; sometimes it fails, sometimes it doesn't. (A loop version is sketched after the two example runs below.)

❯ kubectl exec -i -t dnsutils -- nslookup nexus-service.ci.svc.cluster.local
Server:		169.254.25.10
Address:	169.254.25.10#53

** server can't find nexus-service.ci.svc.cluster.local.default.svc.cluster.local: SERVFAIL

command terminated with exit code 1

❯ kubectl exec -i -t dnsutils -- nslookup nexus-service.ci.svc.cluster.local
Server:		169.254.25.10
Address:	169.254.25.10#53

Name:	nexus-service.ci.svc.cluster.local
Address: 10.233.13.178
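
A minimal loop for the reproduction above (just a sketch; it assumes the dnsutils debug pod from the examples is running in the default namespace, and the iteration count is arbitrary):

for i in $(seq 1 100); do
  kubectl exec dnsutils -- nslookup nexus-service.ci.svc.cluster.local >/dev/null 2>&1 || echo "lookup $i failed"
done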

Anything else we need to know?

Here is the resolv.conf of the pod used to test:

nameserver 169.254.25.10
search default.svc.cluster.local svc.cluster.local cluster.local
options ndots:5
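
Side note on the SERVFAIL output above: with ndots:5, a name with fewer than five dots such as nexus-service.ci.svc.cluster.local is first tried with the search domains appended, which is why nexus-service.ci.svc.cluster.local.default.svc.cluster.local shows up in the failing lookup. Appending a trailing dot makes the name fully qualified and skips the search-list expansion; this doesn't address the timeouts, it just reduces the number of upstream queries per lookup:

❯ kubectl exec -i -t dnsutils -- nslookup nexus-service.ci.svc.cluster.local.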

Kubernetes version

1.24.6

Cloud provider

on-premise (kubespray see kubernetes-sigs/kubespray#9328)

OS version

CentOS Stream 9 (Kernel Linux 5.14.0-160.el9.x86_64 x86_64)

Install tools

ansible through kubespray

Container runtime (CRI) and version (if applicable)

containerd 1.6.8 (also tested with docker 20.10)

Related plugins (CNI, CSI, ...) and versions (if applicable)

calico

@sli720 added the kind/bug label Oct 15, 2022
@k8s-ci-robot added the needs-sig and needs-triage labels Oct 15, 2022
@k8s-ci-robot (Contributor)

@sli720: This issue is currently awaiting triage.

If a SIG or subproject determines this is a relevant issue, they will accept it by applying the triage/accepted label and provide further guidance.

The triage/accepted label can be added by org members by writing /triage accepted in a comment.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@sli720 (Author) commented Oct 15, 2022

/sig network

@k8s-ci-robot added the sig/network label and removed the needs-sig label Oct 15, 2022
@sli720 changed the title from "Sometimes namespace.svc.cluster.local is appended twice in DNS requests" to "Upscaling of coredns pods leads to DNS timeout errors" Oct 16, 2022
@chrisohaver (Contributor)

Perhaps the issue is with networking to a particular node, and increasing the number of instances > 3 results in a coredns pod running on a problematic node?
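
One quick way to check that correlation would be to record which nodes the coredns pods land on each time the deployment is scaled, and compare against the nodes whose nodelocaldns logs show the i/o timeouts (a sketch; k8s-app=kube-dns is the usual CoreDNS label but may differ per install):

kubectl -n kube-system get pods -l k8s-app=kube-dns -o wide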

@sli720 (Author) commented Oct 19, 2022

I've tried it on specific nodes and I don't see a relation to the hardware. Every time I increase the number of pods I get this issue. I also don't see any issues in the calico or system/kernel logs. Can I somehow debug if that issue is related to a specific node?

@chrisohaver (Contributor) commented Oct 20, 2022

Can I somehow debug if that issue is related to a specific node?

Perhaps try issuing TCP queries to each individual CoreDNS pod IP directly. Do they all exhibit the same degree of sporadic timeout? Or some more so than others.

Note: The forward timeout is 2 seconds in nodelocal/coredns.
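
A sketch of such a per-pod TCP probe (assuming the CoreDNS pods carry the k8s-app=kube-dns label and the dnsutils image ships dig):

for ip in $(kubectl -n kube-system get pods -l k8s-app=kube-dns -o jsonpath='{.items[*].status.podIP}'); do
  echo "== $ip =="
  kubectl exec dnsutils -- dig +tcp +time=2 @"$ip" kubernetes.default.svc.cluster.local | grep -E 'status:|Query time:'
done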

@khenidak (Contributor)

How healthy are your kube-proxies (specifically on the nodes that host the pods that can't resolve DNS)?
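
A couple of checks that would answer this (a sketch; it assumes kube-proxy runs as a DaemonSet labeled k8s-app=kube-proxy and in iptables mode, which may not match a kubespray/IPVS setup):

# recent errors from kube-proxy across nodes
kubectl -n kube-system logs -l k8s-app=kube-proxy --tail=200 | grep -iE 'error|fail'

# on a suspect node: is the kube-dns ClusterIP from the timeout message programmed?
iptables -t nat -L KUBE-SERVICES -n | grep 10.233.0.3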

@sli720 (Author) commented Oct 22, 2022

Perhaps try issuing TCP queries to each individual CoreDNS pod IP directly. Do they all exhibit the same degree of sporadic timeout? Or some more so than others.

I ran a fast nslookup loop against all coredns pods from different hosts and it always resolved successfully. Only when nodelocaldns sits in between does it sometimes fail, and each time on different hosts.

@sli720 (Author) commented Oct 22, 2022

How healthy are your kube-proxies (specifically on the nodes that host the pods that can't resolve DNS)?

They have never crashed so far, if that's what you mean, and there are no errors in their logs.

@thockin (Member) commented Dec 8, 2022

Do we have any updates here?

@thockin closed this as completed Dec 21, 2022
@jaswanthikolla commented Apr 23, 2023

My hypothesis on why it's happening:

LocalCoreDNS (nodelocaldns) uses the CoreDNS kube-dns ClusterIP for upstream cluster.local DNS resolution. A ClusterIP is implemented with iptables DNAT, which is subject to conntrack race conditions, and more CoreDNS pods mean more iptables rules/endpoints.

So, if simultaneous connections are made (within 2 ns) and there are multiple rules with multiple endpoints, packets can be sent to the wrong pod/node (see race #3). The probability of sending the packet to the correct pod/node therefore decreases as the number of coredns pods increases.

Also, there are others who have faced this issue.
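
If this hypothesis holds, the per-node conntrack statistics should show it while CoreDNS is scaled up (a sketch; conntrack -S comes from conntrack-tools and has to be run on the node itself, and kube-dns is the usual name of the CoreDNS Service):

# endpoints behind the kube-dns ClusterIP (more CoreDNS pods, more DNAT targets)
kubectl -n kube-system get endpoints kube-dns -o jsonpath='{.subsets[*].addresses[*].ip}'; echo

# on each node: insert_failed incrementing is the signature of the DNAT race
conntrack -S | grep -o 'insert_failed=[0-9]*'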

@aojea (Member) commented Apr 23, 2023

on-premise (kubespray)

there are no errors in the logs of the coredns pods, nor in the logs of the calico pods

many moving parts here ;)

@sli720 (Author) commented Apr 23, 2023

I've disabled nodelocaldns completely and scaled up the coredns pods again. No problems anymore.
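
For what it's worth, a quick way to confirm that test pods are no longer going through nodelocaldns after such a change (169.254.25.10 and 10.233.0.3 are the nodelocaldns and kube-dns addresses from this report):

kubectl exec dnsutils -- grep nameserver /etc/resolv.conf
# expected: nameserver 10.233.0.3 instead of nameserver 169.254.25.10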

@jaswanthikolla

I've disabled nodelocaldns completely and scaled up the coredns pods again. No problems anymore.

It's possible that you simply don't have visibility into DNS errors anymore. Did you validate that? Earlier, LocalCoreDNS was the central place logging the errors. It's interesting how that would fix the error.

@chrisohaver (Contributor)

My hypothesis on why it's happening:

LocalCoreDNS (nodelocaldns) uses the CoreDNS kube-dns ClusterIP for upstream cluster.local DNS resolution. A ClusterIP is implemented with iptables DNAT, which is subject to conntrack race conditions, and more CoreDNS pods mean more iptables rules/endpoints.

Nodelocaldns instances use TCP to forward DNS requests to the ClusterIP DNS service, which would mitigate the conntrack issue: requests get resent when the sender does not receive ACKs.
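
That can be checked against the live config: the cluster.local block of the node-local-dns Corefile should contain force_tcp in its forward stanza (a sketch; the ConfigMap is named nodelocaldns in kubespray, node-local-dns in the upstream manifest):

kubectl -n kube-system get configmap nodelocaldns -o jsonpath='{.data.Corefile}' | grep -B1 -A2 force_tcp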

@jaswanthikolla commented Apr 24, 2023

which would mitigate the conntrack issue -

Yes. One case where it will still fail is if the SYN-ACK is lost: as per this doc, recovery would then take at least 3 seconds, which is much more than the CoreDNS timeout. I wonder what the impact of that is on other requests, which I asked as a separate question here.
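
If lost SYNs/SYN-ACKs are the suspicion, the kernel retransmission counters on the affected nodes should be growing while the failures happen (a sketch; nstat ships with iproute2 and reads the kernel SNMP counters, so it has to run on the node):

nstat -az TcpRetransSegs TcpExtTCPSynRetrans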

@tanvp112 commented Apr 26, 2023

which would mitigate the conntrack issue -

Yes. One case where it will still fail is if the SYN-ACK is lost: as per this doc, recovery would then take at least 3 seconds, which is much more than the CoreDNS timeout. I wonder what the impact of that is on other requests, which I asked as a separate question here.

hi @jaswanthikolla, I was trying to reproduce the SYN-ACK issue, but ran into something else: nodelocaldns always seems to go back to coredns for name resolution. I tested by running a 1s loop of nslookup kubernetes.default.svc.cluster.local on a node with nodelocaldns running. All goes well, but as soon as I scale CoreDNS to zero the nslookup fails immediately and the nodelocaldns log reports connection refused to CoreDNS. I thought there was a 5s TTL set by CoreDNS, so nodelocaldns should not fail immediately and should respond with the cached record? Once I scale CoreDNS back up, resolution returns to normal. This looks like nodelocaldns didn't really cache any results to reduce calls to CoreDNS... I was using the stock nodelocaldns.yaml; besides the 3 standard environment variables that have to be changed, nothing else was changed.
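
A sketch of that test for anyone who wants to repeat it (it assumes the CoreDNS Deployment is named coredns, which is the common default but worth verifying):

# 1s lookup loop, run while watching the nodelocaldns logs
while true; do
  kubectl exec dnsutils -- nslookup kubernetes.default.svc.cluster.local >/dev/null 2>&1 || echo "$(date +%T) lookup failed"
  sleep 1
done

# in another shell: take CoreDNS away, then bring it back
kubectl -n kube-system scale deployment coredns --replicas=0
kubectl -n kube-system scale deployment coredns --replicas=2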
