
nodelocaldns resolution issues #9328

Closed
sli720 opened this issue Sep 26, 2022 · 7 comments
Labels
kind/bug Categorizes issue or PR as related to a bug.

Comments

@sli720

sli720 commented Sep 26, 2022

Environment:

  • Cloud provider or hardware configuration:
    Self-hosted (10 Hosts)

  • OS (printf "$(uname -srm)\n$(cat /etc/os-release)\n"):
    CentOS Stream 9 (Kernel Linux 5.14.0-160.el9.x86_64 x86_64)

  • Version of Ansible (ansible --version):
    2.12.5

  • Version of Python (python --version):
    3.9.13

Kubespray version (commit) (git rev-parse --short HEAD):
6dff393

Network plugin used:
calico

Version of Kubernetes:
1.24.6

The DNS cache does not always resolve internal DNS names, e.g.:

kubectl -n ci exec -i -t dnsutils -- nslookup jenkins-operator-http-testing.ci.svc.cluster.local
Server:		169.254.25.10
Address:	169.254.25.10#53

** server can't find jenkins-operator-http-testing.ci.svc.cluster.local.ci.svc.cluster.local: SERVFAIL

And with just another try it works again:

kubectl -n ci exec -i -t dnsutils -- nslookup jenkins-operator-http-testing.ci.svc.cluster.local
Server:		169.254.25.10
Address:	169.254.25.10#53

Name:	jenkins-operator-http-testing.ci.svc.cluster.local
Address: 10.233.49.75

I've also tried it out directly on all hosts via:
nslookup jenkins-operator-http-testing.ci.svc.cluster.local 169.254.25.10
and it sometimes fails on every host.

It is not related to the name I'm trying to resolve; it also happens with other names. The only thing I see in the nodelocaldns logs is that sometimes the following error appears, showing it cannot connect to coredns:
[ERROR] plugin/errors: 2 jenkins-operator-http-testing.ci.svc.cluster.local. A: dial tcp 10.233.0.3:53: i/o timeout

But if I run the nslookup against the coredns IP directly there is no issue. There are no issues in the coredns or calico pod logs. It is also strange that the nodelocaldns cache forgets entries within a few seconds and asks coredns again.
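
(For reference, one way to probe the forwarding path over TCP specifically, which is what the dial tcp error above points at, is to query the coredns service IP from the dnsutils pod with dig's +tcp flag; the 10.233.0.3 address is taken from the error message, and it is assumed here that the dnsutils image ships dig:)

kubectl -n ci exec -i -t dnsutils -- dig +tcp @10.233.0.3 jenkins-operator-http-testing.ci.svc.cluster.local   # over TCP, as nodelocaldns does
kubectl -n ci exec -i -t dnsutils -- dig @10.233.0.3 jenkins-operator-http-testing.ci.svc.cluster.local        # same query over UDP for comparison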

I'm not sure where to ask, so I've created the ticket here first.

sli720 added the kind/bug label Sep 26, 2022
@sli720

sli720 commented Sep 26, 2022

Could one of these issues be related?

kubernetes/dns#387
aws/amazon-vpc-cni-k8s#595
coredns/coredns#3927
kubernetes/kubernetes#56903

Is it possible to change nodelocaldns to only use UDP instead of TCP through kubespray? Or do the coredns pods have to be scaled down for some reason? Or do I have to set single-request-reopen in the resolv.conf?
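
(For context, in a kubespray cluster that switch lives in the nodelocaldns Corefile, kept in a ConfigMap in kube-system; a minimal sketch of the cluster.local block, with the TCP/UDP behaviour controlled by the standard CoreDNS forward options force_tcp and prefer_udp, is below. The ConfigMap name and exact plugin list are assumed here and may differ from the real template; the two IPs are the ones from this cluster.)

cluster.local:53 {
    errors
    cache 30
    reload
    loop
    bind 169.254.25.10
    forward . 10.233.0.3 {
        force_tcp        # replacing this with prefer_udp makes nodelocaldns query coredns over UDP first
    }
    prometheus :9253
}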

@sli720

sli720 commented Sep 27, 2022

Is it possible to change nodelocaldns to only use UDP instead of TCP through kubespray

I've tried that out but it did not help.

Or do I have to set single-request-reopen in the resolv.conf?

That seems to be related only to old Alpine containers, which is not the case here.
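
(Should it still be worth a quick test at some point, the resolver option can be set per pod via dnsConfig instead of editing the node's /etc/resolv.conf; a minimal sketch, with a hypothetical throwaway pod name:)

apiVersion: v1
kind: Pod
metadata:
  name: dns-test                      # hypothetical test pod
spec:
  containers:
  - name: test
    image: busybox:1.36
    command: ["sleep", "3600"]
  dnsConfig:
    options:
    - name: single-request-reopen     # appended to the pod's resolv.conf options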

Or have the coredns pods to be scaled down for some reason?

If I scale the coredns pods down to only one there are no issues. I'm already running an older k8s cluster with multiple coredns pods under CentOS 7 without any issues. Could this problem be related to CentOS 9 (e.g. the kernel)?
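
(For reference, the scale-down test was essentially the following; the deployment names coredns and dns-autoscaler are assumed from a default kubespray install, and the autoscaler has to be paused as well or it restores the replica count:)

kubectl -n kube-system scale deployment dns-autoscaler --replicas=0
kubectl -n kube-system scale deployment coredns --replicas=1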

@HoKim98

HoKim98 commented Sep 28, 2022

Could you execute the command kubectl get pods -n kube-system and check whether coredns and nodelocaldns are working well?

If they are not, this procedure may help you: #9160
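
For a quicker look, the output can be narrowed to the two DNS components with label selectors (the label values here are the usual kubespray defaults and may differ in your cluster):

kubectl get pods -n kube-system -l k8s-app=kube-dns -o wide        # coredns
kubectl get pods -n kube-system -l k8s-app=nodelocaldns -o wide    # nodelocaldns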

@sli720

sli720 commented Sep 28, 2022

Yes, they are all running. The coredns pods show no errors, but the nodelocaldns pods sometimes show the error:

[ERROR] plugin/errors: 2 jenkins-operator-http-testing.ci.svc.cluster.local. A: dial tcp 10.233.0.3:53: i/o timeout

When I scale down to 1-3 pods there are no issues, but when using the DNS autoscaler (which scales up to 7 pods) the issues occur.
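
(For reference, the autoscaler's replica count is driven by a ConfigMap, named dns-autoscaler in kube-system in a default kubespray install; a sketch of the linear parameters, with illustrative values rather than this cluster's actual ones, where max could be lowered to cap the replica count while testing:)

kubectl -n kube-system edit configmap dns-autoscaler

# data section, linear mode:
linear: '{"coresPerReplica":256,"nodesPerReplica":16,"min":2,"max":3,"preventSinglePointFailure":true}'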

@sli720

sli720 commented Oct 15, 2022

Interesting. In my latest tests I found that sometimes .default.svc.cluster.local is appended to the request and sometimes not. Could this be a bug in nodelocaldns or coredns, or a wrong configuration of /etc/resolv.conf?

❯ kubectl exec -i -t dnsutils -- nslookup nexus-service.ci.svc.cluster.local
Server:		169.254.25.10
Address:	169.254.25.10#53

** server can't find nexus-service.ci.svc.cluster.local.default.svc.cluster.local: SERVFAIL

command terminated with exit code 1

❯ kubectl exec -i -t dnsutils -- nslookup nexus-service.ci.svc.cluster.local
Server:		169.254.25.10
Address:	169.254.25.10#53

Name:	nexus-service.ci.svc.cluster.local
Address: 10.233.13.178

Here is the resolv.conf of the pod used to test:

nameserver 169.254.25.10
search default.svc.cluster.local svc.cluster.local cluster.local
options ndots:5
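
The intermittently appended suffix comes from the search list and ndots:5 above: nexus-service.ci.svc.cluster.local has only four dots, so the resolver tries the search suffixes first and only then the absolute name, and the expanded query should normally come back NXDOMAIN rather than SERVFAIL. One way to take the search list out of the picture while testing is to query the name as a fully qualified one with a trailing dot:

kubectl exec -i -t dnsutils -- nslookup nexus-service.ci.svc.cluster.local.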

It is strange that the problem disappears when I reduce the number of coredns pods. Any idea why?

@sli720

sli720 commented Oct 16, 2022

I've opened an issue in the kubernetes project, so I think this one can be closed.

sli720 closed this as completed Oct 16, 2022
@freeyoung

For people who encounter this and found this issue page via Google:
https://blog.brujordet.no/post/devops/i_cant_believe_its_not_dns/
