
nodelocaldns resolution issues #9328

Closed
sli720 opened this issue Sep 26, 2022 · 7 comments
Labels
kind/bug Categorizes issue or PR as related to a bug.

Comments

@sli720

sli720 commented Sep 26, 2022

Environment:

  • Cloud provider or hardware configuration:
    Self-hosted (10 Hosts)

  • OS (printf "$(uname -srm)\n$(cat /etc/os-release)\n"):
    CentOS Stream 9 (Kernel Linux 5.14.0-160.el9.x86_64 x86_64)

  • Version of Ansible (ansible --version):
    2.12.5

  • Version of Python (python --version):
    3.9.13

Kubespray version (commit) (git rev-parse --short HEAD):
6dff393

Network plugin used:
calico

Version of Kubernetes:
1.24.6

The DNS cache does not always resolve internal DNS names, e.g.:

kubectl -n ci exec -i -t dnsutils -- nslookup jenkins-operator-http-testing.ci.svc.cluster.local
Server:		169.254.25.10
Address:	169.254.25.10#53

** server can't find jenkins-operator-http-testing.ci.svc.cluster.local.ci.svc.cluster.local: SERVFAIL

And with just another try it works again:

kubectl -n ci exec -i -t dnsutils -- nslookup jenkins-operator-http-testing.ci.svc.cluster.local
Server:		169.254.25.10
Address:	169.254.25.10#53

Name:	jenkins-operator-http-testing.ci.svc.cluster.local
Address: 10.233.49.75

I've also tried it out directly on all hosts via:
nslookup jenkins-operator-http-testing.ci.svc.cluster.local 169.254.25.10
and it sometimes fails on every host.

It is not related to the name I'm trying to resolve; it also happens with other names. The only thing I see in the nodelocaldns logs is that sometimes the following error appears, showing it cannot connect to coredns:
[ERROR] plugin/errors: 2 jenkins-operator-http-testing.ci.svc.cluster.local. A: dial tcp 10.233.0.3:53: i/o timeout

But if I run the nslookup against the coredns IP directly there is no issue. There are no issues in the coredns or calico pod logs. It is also strange that the nodelocaldns cache forgets entries within a few seconds and asks coredns again.
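
(For reference, one way to probe the forwarding path over TCP specifically, which is what the dial tcp error above points at, is to query the coredns service IP from the dnsutils pod with dig's +tcp flag; the 10.233.0.3 address is taken from the error message, and it is assumed here that the dnsutils image ships dig:)

kubectl -n ci exec -i -t dnsutils -- dig +tcp @10.233.0.3 jenkins-operator-http-testing.ci.svc.cluster.local   # over TCP, as nodelocaldns does
kubectl -n ci exec -i -t dnsutils -- dig @10.233.0.3 jenkins-operator-http-testing.ci.svc.cluster.local        # same query over UDP for comparison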

I'm not sure where to ask, so I've created the ticket here first.

sli720 added the kind/bug label Sep 26, 2022
@sli720

sli720 commented Sep 26, 2022

Could one of these issues be related?

kubernetes/dns#387
aws/amazon-vpc-cni-k8s#595
coredns/coredns#3927
kubernetes/kubernetes#56903

Is it possible to change nodelocaldns to only use UDP instead of TCP through kubespray? Or do the coredns pods have to be scaled down for some reason? Or do I have to set single-request-reopen in the resolv.conf?
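
(For context, in a kubespray cluster that switch lives in the nodelocaldns Corefile, kept in a ConfigMap in kube-system; a minimal sketch of the cluster.local block, with the TCP/UDP behaviour controlled by the standard CoreDNS forward options force_tcp and prefer_udp, is below. The ConfigMap name and exact plugin list are assumed here and may differ from the real template; the two IPs are the ones from this cluster.)

cluster.local:53 {
    errors
    cache 30
    reload
    loop
    bind 169.254.25.10
    forward . 10.233.0.3 {
        force_tcp        # replacing this with prefer_udp makes nodelocaldns query coredns over UDP first
    }
    prometheus :9253
}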

@sli720

sli720 commented Sep 27, 2022

Is it possible to change nodelocaldns to only use UDP instead of TCP through kubespray

I've tried that out but it did not help.

Or do I have to set single-request-reopen in the resolv.conf?

That seems to be related only to old Alpine containers, which is not the case here.
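
(Should it still be worth a quick test at some point, the resolver option can be set per pod via dnsConfig instead of editing the node's /etc/resolv.conf; a minimal sketch, with a hypothetical throwaway pod name:)

apiVersion: v1
kind: Pod
metadata:
  name: dns-test                      # hypothetical test pod
spec:
  containers:
  - name: test
    image: busybox:1.36
    command: ["sleep", "3600"]
  dnsConfig:
    options:
    - name: single-request-reopen     # appended to the pod's resolv.conf options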

Or have the coredns pods to be scaled down for some reason?

If I scale the coredns pods down to only one there are no issues. I'm already running an older k8s cluster with multiple coredns pods under CentOS 7 without any issues. Could this problem be related to CentOS 9 (e.g. the kernel)?
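
(For reference, the scale-down test was essentially the following; the deployment names coredns and dns-autoscaler are assumed from a default kubespray install, and the autoscaler has to be paused as well or it restores the replica count:)

kubectl -n kube-system scale deployment dns-autoscaler --replicas=0
kubectl -n kube-system scale deployment coredns --replicas=1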

@HoKim98

HoKim98 commented Sep 28, 2022

Could you execute the command kubectl get pods -n kube-system and check whether coredns and nodelocaldns are working well?

If they are not, this procedure may help you: #9160
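
For a quicker look, the output can be narrowed to the two DNS components with label selectors (the label values here are the usual kubespray defaults and may differ in your cluster):

kubectl get pods -n kube-system -l k8s-app=kube-dns -o wide        # coredns
kubectl get pods -n kube-system -l k8s-app=nodelocaldns -o wide    # nodelocaldns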

@sli720

sli720 commented Sep 28, 2022

Yes, they are all running. The coredns pods show no errors, but the nodelocaldns pods sometimes show the error:

[ERROR] plugin/errors: 2 jenkins-operator-http-testing.ci.svc.cluster.local. A: dial tcp 10.233.0.3:53: i/o timeout

When I scale down to 1-3 pods there are no issues, but when using the DNS autoscaler (which scales up to 7 pods) the issues occur.
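
(For reference, the autoscaler's replica count is driven by a ConfigMap, named dns-autoscaler in kube-system in a default kubespray install; a sketch of the linear parameters, with illustrative values rather than this cluster's actual ones, where max could be lowered to cap the replica count while testing:)

kubectl -n kube-system edit configmap dns-autoscaler

# data section, linear mode:
linear: '{"coresPerReplica":256,"nodesPerReplica":16,"min":2,"max":3,"preventSinglePointFailure":true}'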

@sli720

sli720 commented Oct 15, 2022

Interesting. In my latest tests I found that sometimes .default.svc.cluster.local is appended to the request and sometimes not. Could this be a bug in nodelocaldns or coredns, or a wrong configuration of /etc/resolv.conf?

❯ kubectl exec -i -t dnsutils -- nslookup nexus-service.ci.svc.cluster.local
Server:		169.254.25.10
Address:	169.254.25.10#53

** server can't find nexus-service.ci.svc.cluster.local.default.svc.cluster.local: SERVFAIL

command terminated with exit code 1

❯ kubectl exec -i -t dnsutils -- nslookup nexus-service.ci.svc.cluster.local
Server:		169.254.25.10
Address:	169.254.25.10#53

Name:	nexus-service.ci.svc.cluster.local
Address: 10.233.13.178

Here is the resolv.conf of the pod used to test:

nameserver 169.254.25.10
search default.svc.cluster.local svc.cluster.local cluster.local
options ndots:5
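
The intermittently appended suffix comes from the search list and ndots:5 above: nexus-service.ci.svc.cluster.local has only four dots, so the resolver tries the search suffixes first and only then the absolute name, and the expanded query should normally come back NXDOMAIN rather than SERVFAIL. One way to take the search list out of the picture while testing is to query the name as a fully qualified one with a trailing dot:

kubectl exec -i -t dnsutils -- nslookup nexus-service.ci.svc.cluster.local.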

It is strange that the problem disappears when I reduce the number of coredns pods. Any idea why?

@sli720

sli720 commented Oct 16, 2022

I've opened an issue in the kubernetes project, so I think this one can be closed.

sli720 closed this as completed Oct 16, 2022
@freeyoung

For people who encounter this and found this issue page via Google:
https://blog.brujordet.no/post/devops/i_cant_believe_its_not_dns/
