Node NotReady because of PLEG is not healthy #195

Closed
radityasurya opened this issue Feb 20, 2019 · 29 comments
@radityasurya

What happened:
One of our nodes running the latest AMI version started becoming NotReady.

What you expected to happen:
The node stays Ready.

How to reproduce it (as minimally and precisely as possible):
Not sure; the node never shows high CPU or memory usage.

Anything else we need to know?:
We SSHed to the node and found that PLEG is not healthy:

Feb 20 14:23:09 ip-10-0-13-15.eu-west-1.compute.internal kubelet[3694]: I0220 14:23:09.120100    3694 kubelet.go:1775] skipping pod synchronization - [PLEG is not healthy: pleg was last seen active 4h19m47.369998188s ago; threshold is 3m0s]

and when we try to check the containers with docker ps, we get this error:

Feb 20 14:28:54 ip-10-0-13-15.eu-west-1.compute.internal dockerd[3188]: http: multiple response.WriteHeader calls
Feb 20 14:28:54 ip-10-0-13-15.eu-west-1.compute.internal dockerd[3188]: time="2019-02-20T14:28:54.455014979Z" level=error msg="Handler for GET /v1.25/containers/json returned error: write unix /var/run/docker.sock->@: write: broken pipe"
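
For reference, the checks above boil down to roughly the following (a sketch; the commands are illustrative, not the exact session we ran):

# Look for PLEG health messages from the kubelet
journalctl -u kubelet --since "1 hour ago" | grep "PLEG is not healthy"

# Check whether the Docker API is still responsive; on the affected node this
# either hangs or fails with the broken-pipe error shown above
timeout 30 docker ps
echo "docker ps exit code: $?"   # 124 means the 30 second timeout was hit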

This is the Docker version we use:

[root@ip-10-0-13-15 ~]# docker version
Client:
 Version:      17.06.2-ce
 API version:  1.30
 Go version:   go1.9.6
 Git commit:   3dfb8343b139d6342acfd9975d7f1068b5b1c3d3
 Built:        Mon Jan 28 22:06:48 2019
 OS/Arch:      linux/amd64

Server:
 Version:      17.06.2-ce
 API version:  1.30 (minimum version 1.12)
 Go version:   go1.9.6
 Git commit:   402dd4a/17.06.2-ce
 Built:        Mon Jan 28 22:07:35 2019
 OS/Arch:      linux/amd64
 Experimental: false
[root@ip-10-0-13-15 ~]# docker info
Containers: 13
 Running: 12
 Paused: 0
 Stopped: 1
Images: 37
Server Version: 17.06.2-ce
Storage Driver: overlay2
 Backing Filesystem: xfs
 Supports d_type: true
 Native Overlay Diff: true
Logging Driver: json-file
Cgroup Driver: cgroupfs
Plugins:
 Volume: local
 Network: bridge host macvlan null overlay
 Log: awslogs fluentd gcplogs gelf journald json-file logentries splunk syslog
Swarm: inactive
Runtimes: runc
Default Runtime: runc
Init Binary: docker-init
containerd version: 6e23458c129b551d5c9871e5174f6b1b7f6d1170
runc version: 810190ceaa507aa2727d7ae6f4790c76ec150bd2
init version: 949e6fa
Security Options:
 seccomp
  Profile: default
Kernel Version: 4.14.94-89.73.amzn2.x86_64
Operating System: Amazon Linux 2
OSType: linux
Architecture: x86_64
CPUs: 2
Total Memory: 7.503GiB
Name: ip-10-0-13-15.eu-west-1.compute.internal
ID: PVWT:EV6L:L543:5IU4:WIAB:IZPK:FIAE:3LLA:WV7F:GG5V:XRKW:JA4S
Docker Root Dir: /var/lib/docker
Debug Mode (client): false
Debug Mode (server): false
Registry: https://index.docker.io/v1/
Experimental: false
Insecure Registries:
 127.0.0.0/8
Live Restore Enabled: true

We forked the repo and made some changes (https://github.com/tiqets/amazon-eks-ami): we updated the Docker version and it seems to be working. Do you think this is related to Docker?

Environment:

  • AWS Region: eu-west-1
  • Instance Type(s): m5.large
  • EKS Platform version (use aws eks describe-cluster --name <name> --query cluster.platformVersion): eks.1
  • Kubernetes version (use aws eks describe-cluster --name <name> --query cluster.version): 1.11
  • AMI Version: amazon-eks-node-1.11-v20190211 (ami-0b469c0fef0445d29)
  • Kernel (e.g. uname -a):
    Linux ip-10-0-13-15.eu-west-1.compute.internal 4.14.94-89.73.amzn2.x86_64 #1 SMP Fri Jan 18 22:36:02 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux
  • Release information (run cat /tmp/release on a node):
empty
@toricls

toricls commented Mar 2, 2019

It seems to be related to kubernetes/kubernetes#45419, if your nodes are flapping between Ready/NotReady.

@mysunshine92

/sig node

@krish7919

Also, I have read on Slack that GKE automatically kills/restarts containerd when it detects this issue. Does AWS EKS do that too?

@whereisaaron

There are 200+ issues in the k8s project about PLEG problems, mostly attributed to PLEG resource exhaustion (running out of memory, or too many events too quickly) and some deadlock situations. I have seen it once and ended up restarting the node since there was no other option I could find.

@krish7919

Agreed, but this is present in most k8s versions now, and even though there is no fix yet, I want to know what workarounds, if any, EKS has in place for this problem. It definitely affects k8s deployments, and EKS as a product is directly affected.

Note: In our cluster, I debugged this a lot before finding these issues online, and can say definitively that it is not related to CPU, memory, network or disk issues. It might be related to too many events arriving too quickly.

@emanuelecasadio

emanuelecasadio commented Apr 4, 2019

Last night all of our nodes became unavailable because of this problem :|

@a7i

a7i commented May 19, 2019

Last night all of our nodes became unavailable because of this problem :|

Have you been able to resolve this issue yet, @emanuelecasadio?

@emanuelecasadio

@a7i no, but it seems that manually rebooting the node temporarily prevents this from happening again for some time (approx. 1-2 weeks).

@MohammedFadin

@emanuelecasadio Any update on this issue? Do we have to manually reboot the nodes all the time?

@jesuslinares

It is happening in our environments too. We created a bash script that checks whether docker ps completes successfully in less than 60 seconds. If this check fails 3 times, the Docker service is restarted.
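
A minimal sketch of that kind of watchdog (the interval, threshold and restart command are illustrative, not our exact script):

#!/usr/bin/env bash
# Restart Docker if docker ps fails to answer within 60s three times in a row.
failures=0
while true; do
  if timeout 60 docker ps > /dev/null 2>&1; then
    failures=0
  else
    failures=$((failures + 1))
    echo "$(date -Is) docker ps check failed (${failures}/3)"
  fi
  if [ "$failures" -ge 3 ]; then
    echo "$(date -Is) restarting docker"
    systemctl restart docker
    failures=0
  fi
  sleep 60
done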

It happens from time to time, but the script solves the issue.

What is the current status of this issue?

@PDRMAN

PDRMAN commented Oct 10, 2019

It is happening in our clusters as well. We're running:
Docker 18.06.1-ce
kubectl version
Client Version: version.Info{Major:"1", Minor:"13", GitVersion:"v1.13.5+icp-ee", GitCommit:"eb4df6c6fb47f5b4fd1ed8bfbfe2d0ed5ea636e1", GitTreeState:"clean", BuildDate:"2019-05-08T02:18:32Z", GoVersion:"go1.11.5", Compiler:"gc", Platform:"linux/amd64"}

@PDRMAN

PDRMAN commented Oct 10, 2019

We're going to test a possible solution: increasing the reserved resources for the system and the kubelet, raising the reserved CPU to 500m. This should improve the stability of the node; will report back if things improve.

systemReserved:
  cpu: "500m"
  memory: "512Mi"
  ephemeral-storage: "1Gi"
kubeReserved:
  cpu: "500m"
  memory: "512Mi"
  ephemeral-storage: "1Gi"
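
One way to confirm the reservations were picked up is to compare Capacity and Allocatable on the node; Allocatable should drop by roughly the reserved amounts plus the eviction thresholds (a sketch; the node name is a placeholder):

kubectl describe node <node-name> | grep -A 6 -E "^(Capacity|Allocatable)"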

@a7i

a7i commented Oct 11, 2019

The following configuration [passed to worker node userdata bootstrap] worked for me. I used to face this often but haven't had this issue in months now:

--kube-reserved cpu=250m,memory=0.5Gi,ephemeral-storage=1Gi \
--system-reserved cpu=250m,memory=0.2Gi,ephemeral-storage=1Gi \
--eviction-hard memory.available<300Mi,nodefs.available<10%

@PDRMAN Hopefully this will do the same for you and others.
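
For anyone wondering where those flags go: a sketch of the node userdata, assuming the stock /etc/eks/bootstrap.sh from this AMI (the cluster name and any other bootstrap arguments are placeholders):

#!/bin/bash
# Append the reservations to the kubelet arguments via bootstrap.sh
/etc/eks/bootstrap.sh my-cluster \
  --kubelet-extra-args '--kube-reserved=cpu=250m,memory=0.5Gi,ephemeral-storage=1Gi --system-reserved=cpu=250m,memory=0.2Gi,ephemeral-storage=1Gi --eviction-hard=memory.available<300Mi,nodefs.available<10%'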

@whereisaaron

These settings should have default values. Not protecting the kubelet is just asking for outages.

@mogren

mogren commented Oct 13, 2019

#350

@Bhuvan26

Bhuvan26 commented Nov 21, 2019

At least I'm not seeing any errors during "docker ps", but nodes are continuously flapping along with a CPU spike.

[root@we2 ~]# kubectl get nodes
NAME STATUS ROLES AGE VERSION
cwe1 Ready 30d v1.14.3
cwe2 Ready 30d v1.14.3
cwe3 NotReady 30d v1.14.3
we1 NotReady 30d v1.14.3
we2 NotReady 30d v1.14.3
[root@we2 ~]#
[root@we2 ~]#
[root@we2 ~]# journalctl -f
Nov 21 18:04:07 we2 kubelet[15634]: I1121 18:04:07.537029 15634 kubelet.go:1823] skipping pod synchronization - PLEG is not healthy: pleg was last seen active 5m6.695070902s ago; threshold is 3m0s.
Nov 21 18:04:11 we2 kubelet[15634]: W1121 18:04:11.993022 15634 reflector.go:289] object-"cscf-2019"/"configmapenvoycfx": watch of *v1.ConfigMap ended with: too old resource version: 43741716 (43744582)
Nov 21 18:04:12 we2 kubelet[15634]: I1121 18:04:12.537171 15634 kubelet.go:1823] skipping pod synchronization - PLEG is not healthy: pleg was last seen active 5m11.695207745s ago; threshold is 3m0s.

root@we2 ~]# kubectl top no
NAME CPU(cores) CPU% MEMORY(bytes) MEMORY%
cwe1 11787m 24% 53295Mi 27%
cwe2 8783m 18% 39302Mi 20%
cwe3 6308m 13% 23564Mi 12%
we1 1988m 4% 33217Mi 17%
we2 2989m 6% 33663Mi 17%
[root@we2 ~]#

I'm trying to recover my nodes with a workaround.

@paulopontesm

paulopontesm commented Nov 26, 2019

Could this be related to kubernetes/kubernetes#76531?

Also: kubernetes/kubernetes#77654

@kr3cj

kr3cj commented Dec 4, 2019

#350

We are seeing this issue as well. Some info about a node (instance type c5.2xlarge) running amazon-eks-node-1.14-v20190927 (ami-0392bafc801b7520f) that experienced it:

$ kubectl get no/ip-10-X-X-X.ec2.internal -owide
NAME                            STATUS   ROLES    AGE   VERSION              INTERNAL-IP     EXTERNAL-IP   OS-IMAGE         KERNEL-VERSION                  CONTAINER-RUNTIME
ip-10-X-X-X.ec2.internal   Ready    <none>   19d   v1.14.7-eks-1861c5   10.X.X.X   <none>        Amazon Linux 2   4.14.146-119.123.amzn2.x86_64   docker://18.6.1

Of note, we implemented some custom kubelet reservations on our worker nodes a while back to protect against SystemOOMs (these were implemented before we became aware of #350; since this EKS AMI already runs the kubelet in system.slice, we simply combined the reserved memory for the kubelet and the system into systemReserved).

cat /etc/eksctl/kubelet.yaml
...
evictionHard:
  imagefs.available: 15%
  memory.available: 200Mi
  nodefs.available: 10%
  nodefs.inodesFree: 5%
featureGates:
  DynamicKubeletConfig: true
  RotateKubeletServerCertificate: true
systemReserved:
  memory: 700Mi
systemReservedCgroup: /system.slice
...

The node experienced the NodeNotReady event 65 times in a ~4 hour period. A quick look at its metrics showed no correlation with CPU, memory, or disk anomalies.

@kr3cj

kr3cj commented Jan 23, 2020

UPDATE 1/27: This is not as predictable as I thought, so please disregard.


I have some more info on this. I think this has to do with the kubelet communicating with an impaired Docker daemon. Additionally, I may have found an earlier indicator of the problem: at least with kubelet v1.14.7-eks-1861c5 and docker-18.06.1ce-10.amzn2.x86_64, the kubelet will log 60 or more errors containing ExecSync. I only have one instance of this so far, but it happened about 4 hours before the node slipped into a permanently unhealthy state of NodeNotReady flapping and "PLEG is not healthy" errors.

Other kubelets will occasionally log the ExecSync errors but don't necessarily go into a bad state, so it could be the rate of that error rather than its mere existence.
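
If anyone wants to watch for the same signal, this is roughly how I'm counting those errors (a sketch; the time window and the 60-error threshold are just what I observed):

# Count kubelet log lines mentioning ExecSync over the last 4 hours
journalctl -u kubelet --since "4 hours ago" | grep -c "ExecSync"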

@cshivashankar

What is the workaround for this issue? I am facing the same issue in my cluster.
Any help would be appreciated.

@baifuwa

baifuwa commented Aug 28, 2020

I hit the same issue. The kubelet version is 14.0.0.

@yuzujoe

yuzujoe commented Apr 8, 2021

This also occurred on EKS 1.19.

We used an m5.2xlarge instance for the node.
This node can be assigned 58 pods, but if more than 40 pods are scheduled, the kubelet fails to check the status of Docker and the node status becomes NotReady.

I don't know why I'm getting this symptom; I don't have that many pod resources, and I have plenty of node resources.
https://github.com/awslabs/amazon-eks-ami/blob/master/files/eni-max-pods.txt#L194

@whereisaaron

@yuzujoe your issue could be related to #648, in which case reverting to a 1.18 node or an older 1.19 AMI may help.

@bbroniewski

bbroniewski commented Apr 14, 2021

Hi all, be aware that the runc component (1.0.0-rc93) of containerd.io, which is used by Docker, will give you PLEG issues and nodes flapping between Ready and NotReady. I hope no one else loses a ton of hours finding out the problem 🙂 Use another version of it, for example 1.0.0-rc92.
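
A quick way to check which runc build a node is actually running (a sketch; either command should show it):

runc --version
docker info 2>/dev/null | grep -i runc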

@jangrewe

@yuzujoe your issue could be related to #648, in which case reverting to a 1.18 node or an older 1.19 AMI may help.

"to an older 1.18" - the current AMIs for both 1.19 and 1.18 are broken, so you'll need the previous version for both.
Here are the exact versions that work for us: #648 (comment)

@yuzujoe

yuzujoe commented Apr 16, 2021

@jangrewe
Thanks!!

We're already using an old AMI.
It seems that a new AMI has been released now, so I will try that one.
https://github.com/awslabs/amazon-eks-ami/releases/tag/v20210414

@chrissav

I was experiencing the same: many nodes flapping NotReady in my cluster after upgrading to EKS 1.19, running m5.4xlarge with AMI v20210329 and an average of 50 pods per node. Confirmed this AMI had runc 1.0.0-rc93.

The problem seems resolved since updating the worker nodes to AMI release v20210414!

@saurav-agarwalla
Contributor

We recently released a fix for this as part of the v20210414 release, so any AMIs from that release onward shouldn't see this issue.

@mmerkes
Member

mmerkes commented May 6, 2021

Since this issue was opened, we've seen multiple problems come and go that impact PLEG health. PLEG is a good indication that something is going on, generally with the container runtime, but it's not very diagnostic. If you're using the latest AMIs and still seeing PLEG issues, feel free to open a new GH issue with the latest details!
