
kops Debian images need to use a newer kernel to fix intermittent network timeouts caused by connection tracking bugs. #8224

Closed
jim-barber-he opened this issue Dec 30, 2019 · 19 comments

Comments

@jim-barber-he
Contributor

1. What kops version are you running? The command kops version will display
this information.

$ kops version
Version 1.15.0 (git-9992b4055)

2. What Kubernetes version are you running? kubectl version will print the
version if a cluster is running or provide the Kubernetes version specified as
a kops flag.

$ kubectl version
Client Version: version.Info{Major:"1", Minor:"17", GitVersion:"v1.17.0", GitCommit:"70132b0f130acc0bed193d9ba59dd186f0e634cf", GitTreeState:"clean", BuildDate:"2019-12-07T21:20:10Z", GoVersion:"go1.13.4", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"15", GitVersion:"v1.15.6", GitCommit:"7015f71e75f670eb9e7ebd4b5749639d42e20079", GitTreeState:"clean", BuildDate:"2019-11-13T11:11:50Z", GoVersion:"go1.12.12", Compiler:"gc", Platform:"linux/amd64"}

3. What cloud provider are you using?

Amazon AWS

4. What commands did you run? What is the simplest way to reproduce this issue?

We have a production cluster that is running many jobs.
We see network timeouts many times per day.

Running conntrack -S to display the in-kernel connection tracking statistics on each of our nodes shows a large number of insert_failed entries.
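
For reference, here is roughly how we check the counters on a node (a sketch; it assumes the conntrack package is installed and that your grep supports -o):

# Print only the insert_failed counter from each per-CPU statistics line.
# Counters that keep growing indicate the conntrack clash problem.
$ sudo conntrack -S | grep -o 'insert_failed=[0-9]*'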

5. What happened after the commands executed?

Jobs that don't handle network connection timeouts fail.

6. What did you expect to happen?

We should not be seeing random network timeouts.

7. Please provide your cluster manifest. Execute
kops get --name my.example.com -o yaml to display your cluster manifest.
You may want to remove your cluster name and other sensitive information.

8. Please run the commands with most verbose logging by adding the -v 10 flag.
Paste the logs into this report, or in a gist and provide the gist link here.

9. Anything else we need to know?

Apologies for the sparse answers to many of the above questions; I'm not sure if they are applicable to the problem.

The following two Linux kernel patches, which address connection-tracking problems that result in network connection issues, have been merged upstream:
http://patchwork.ozlabs.org/patch/937963/
http://patchwork.ozlabs.org/patch/1032812/
I believe they are both in the Linux 5.1 kernel.

There are a number of discussions on the Internet about this issue causing DNS timeouts.
Here are a couple of bug reports covering it:
kubernetes/kubernetes#56903
weaveworks/weave#3287

Although these discussions focus on DNS timeouts, the problem applies to all other traffic as well; it's just that DNS lookups happen so frequently that they trip the bug most often.
We've already deployed the node-local-dns solution presented in those bug reports, which vastly improved the DNS side of things.
However, random jobs still trip over the problem, and conntrack -S shows we are still hitting it because the insert_failed counters keep growing.
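
(For anyone reproducing this: assuming the upstream manifests were applied as-is, so the DaemonSet is named node-local-dns in the kube-system namespace, you can confirm it is running with a command like the one below.)

# Check that the node-local-dns DaemonSet is scheduled on every node.
$ kubectl --namespace kube-system get daemonset node-local-dns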

Our instance groups are using the kope.io/k8s-1.15-debian-stretch-amd64-hvm-ebs-2019-09-26 image.
Looking at https://github.com/kubernetes/kops/blob/master/channels/stable these seem to be the latest versions of the images currently available.
They report the following kernel when running uname -rvm on the nodes:

$ uname -rvm
4.9.0-11-amd64 #1 SMP Debian 4.9.189-3+deb9u1 (2019-09-20) x86_64

Installing the linux-source-4.9 package on Debian that matches this kernel version, it appears that only one of the two kernel patches has been back-ported into Debian Stretch.
I also looked at the kernel sources for Debian Buster (the current stable release of Debian) and saw that both patches have been back-ported into the 4.19 kernel it currently uses.
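
A rough way to check for the back-ports yourself (a sketch; assumes a Debian host that can fetch package changelogs with apt-get changelog):

# Search the running kernel's package changelog for conntrack-related fixes.
$ apt-get changelog linux-image-$(uname -r) | grep -i conntrack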

We are using the default Weave CNI that Kops installs.

If the kops images were updated to Debian Buster (the current stable release), or if the missing fix were incorporated into Debian Stretch, then I am hoping our connection problems would go away.

@gfdusc

gfdusc commented Jan 2, 2020

Hello,
You can use an AMI ID from https://wiki.debian.org/Cloud/AmazonEC2Image/Buster, or generate your own images using https://salsa.debian.org/cloud-team/debian-cloud-images, upload them to your account, and then reference the image name in your instance group config.
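
For example, something like this (a sketch; the AMI ID is a placeholder, look up the real one for your region on the wiki page above):

# Edit the instance group and point spec.image at the Buster AMI.
$ kops edit ig nodes --name my.example.com
# then under spec set, e.g.:
#   image: ami-0123456789abcdef0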

@jim-barber-he
Contributor Author

Oh, that's awesome. Thanks a lot.

I don't suppose you know what is different between the kops images like kope.io/k8s-1.15-debian-stretch-amd64-hvm-ebs-2019-09-26 and Debian's official cloud images like 379101102735/debian-stretch-hvm-x86_64-gp2-2019-09-08-17994?
That is, does kops take the official Debian images and make specific changes to them that I need to be aware of when using the official Debian images?

@rifelpet
Member

rifelpet commented Jan 3, 2020

This is the repo responsible for building the kope.io AMIs:

https://github.com/kubernetes-sigs/image-builder/blob/master/images/kube-deploy/imagebuilder/templates/1.15-stretch.yml

That file gives you a sense of what changes are made on top of the official Debian images. I added an item to tomorrow's Kops office hours to get newer AMIs built; I'll be sure to update this issue with the results of that discussion.

@MMeent
Contributor

MMeent commented Jan 7, 2020

@rifelpet Do you have a result for this discussion?

@rifelpet
Member

rifelpet commented Jan 7, 2020

@MMeent yes, we decided that @justinsb is going to build new AMIs soon, hopefully within the next two weeks.

@gfdusc

gfdusc commented Jan 10, 2020

Thank you @rifelpet

Please, if possible, consider generating an image for sid as well. It has kernel 5.4+ and iptables 1.8.4.

It would be really great if the ipvsadm package could be included by default too.

Many thanks!

@mariusv
Contributor

mariusv commented Jan 16, 2020

@rifelpet will you update this issue when the Buster image is built, or will it be announced somewhere else?

@justinsb
Member

I updated our existing stretch AMIs and investigated this a bit: #8361 (comment)

Our AMIs do run the stock kernels, and with that it looks like:

  • Stretch still has only one of the two patches
  • Buster has both

Given that, I proposed that we stick with the stock kernels, expedite getting Buster as an option, and also make it the default in a newer version of kops (1.18?). There was an iptables blocker, but that should now be fixed.

@mariusv
Contributor

mariusv commented Jan 17, 2020

@justinsb I agree; having one of the patches is OK, but I would rather have both of them covered. The iptables blocker still applies to Kubernetes versions below 1.17, AFAIK.

@mars64

mars64 commented Jan 17, 2020

> There was an iptables blocker, but that should now be fixed.

As I recall, that's tracked in #7379 -- which is still open. Is this the same issue? Or perhaps you're referring to this 1.17 patch kubernetes/kubernetes#82966 ?

[edit] doh @mariusv beat me to the punch!

@justinsb
Member

Yes - exactly, I'm referring to the iptables nft switch. It was fixed in kubernetes/kubernetes#82966 and that should be in k8s >= 1.17. I also just brought up a k8s 1.17.0 cluster with a stock Buster image. So I figure I can build a Buster image for 1.17, we can test with it, and then confidently close #7379 ... and work to make Buster the default.

@tkoeck

tkoeck commented Feb 13, 2020

It's possible to use Debian Buster with an earlier k8s version (I tried 1.13) if you change the iptables mode to legacy. There is a way to do that with cloud-init: #7381
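
In case it saves someone a search, the switch itself is just update-alternatives (a sketch for Debian Buster; run it on each node, for example from cloud-init or the instance group's additionalUserData):

# Switch iptables and ip6tables from the nft backend to the legacy backend.
$ sudo update-alternatives --set iptables /usr/sbin/iptables-legacy
$ sudo update-alternatives --set ip6tables /usr/sbin/ip6tables-legacy
# Verify: the version string should now report (legacy), not (nf_tables).
$ sudo iptables --version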

@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label May 13, 2020
@jecnua

jecnua commented May 13, 2020

Still valid, please remove stale label.

@itskingori
Member

/remove-lifecycle stale

@k8s-ci-robot k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label May 13, 2020
@jim-barber-he
Contributor Author

Kops now has Ubuntu Focal as the default image, and it also supports Debian Buster if you set spec.image to the AMI ID of one of Debian's cloud images listed here: https://wiki.debian.org/Cloud/AmazonEC2Image/Buster

Therefore I suspect that this can now be closed.

@mariusv
Contributor

mariusv commented Aug 5, 2020

Yeah, it should definitely be closed now.

@johngmyers
Member

/close

@k8s-ci-robot
Contributor

@johngmyers: Closing this issue.

In response to this:

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
