Node NotReady because of PLEG is not healthy #195

Closed
radityasurya opened this issue Feb 20, 2019 · 29 comments
@radityasurya

What happened:
One of our nodes running the latest AMI version started becoming NotReady.

What you expected to happen:
The node stays Ready.

How to reproduce it (as minimally and precisely as possible):
Not sure; the node never shows high CPU or memory usage.

Anything else we need to know?:
We SSHed to the node and found that PLEG is not healthy:

Feb 20 14:23:09 ip-10-0-13-15.eu-west-1.compute.internal kubelet[3694]: I0220 14:23:09.120100    3694 kubelet.go:1775] skipping pod synchronization - [PLEG is not healthy: pleg was last seen active 4h19m47.369998188s ago; threshold is 3m0s]

and when we try to check the containers with docker ps, we get this error:

Feb 20 14:28:54 ip-10-0-13-15.eu-west-1.compute.internal dockerd[3188]: http: multiple response.WriteHeader calls
Feb 20 14:28:54 ip-10-0-13-15.eu-west-1.compute.internal dockerd[3188]: time="2019-02-20T14:28:54.455014979Z" level=error msg="Handler for GET /v1.25/containers/json returned error: write unix /var/run/docker.sock->@: write: broken pipe"
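
For reference, the checks above boil down to roughly the following (a sketch; the commands are illustrative, not the exact session we ran):

# Look for PLEG health messages from the kubelet
journalctl -u kubelet --since "1 hour ago" | grep "PLEG is not healthy"

# Check whether the Docker API is still responsive; on the affected node this
# either hangs or fails with the broken-pipe error shown above
timeout 30 docker ps
echo "docker ps exit code: $?"   # 124 means the 30 second timeout was hit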

This is the Docker version we use:

[root@ip-10-0-13-15 ~]# docker version
Client:
 Version:      17.06.2-ce
 API version:  1.30
 Go version:   go1.9.6
 Git commit:   3dfb8343b139d6342acfd9975d7f1068b5b1c3d3
 Built:        Mon Jan 28 22:06:48 2019
 OS/Arch:      linux/amd64

Server:
 Version:      17.06.2-ce
 API version:  1.30 (minimum version 1.12)
 Go version:   go1.9.6
 Git commit:   402dd4a/17.06.2-ce
 Built:        Mon Jan 28 22:07:35 2019
 OS/Arch:      linux/amd64
 Experimental: false
[root@ip-10-0-13-15 ~]# docker info
Containers: 13
 Running: 12
 Paused: 0
 Stopped: 1
Images: 37
Server Version: 17.06.2-ce
Storage Driver: overlay2
 Backing Filesystem: xfs
 Supports d_type: true
 Native Overlay Diff: true
Logging Driver: json-file
Cgroup Driver: cgroupfs
Plugins:
 Volume: local
 Network: bridge host macvlan null overlay
 Log: awslogs fluentd gcplogs gelf journald json-file logentries splunk syslog
Swarm: inactive
Runtimes: runc
Default Runtime: runc
Init Binary: docker-init
containerd version: 6e23458c129b551d5c9871e5174f6b1b7f6d1170
runc version: 810190ceaa507aa2727d7ae6f4790c76ec150bd2
init version: 949e6fa
Security Options:
 seccomp
  Profile: default
Kernel Version: 4.14.94-89.73.amzn2.x86_64
Operating System: Amazon Linux 2
OSType: linux
Architecture: x86_64
CPUs: 2
Total Memory: 7.503GiB
Name: ip-10-0-13-15.eu-west-1.compute.internal
ID: PVWT:EV6L:L543:5IU4:WIAB:IZPK:FIAE:3LLA:WV7F:GG5V:XRKW:JA4S
Docker Root Dir: /var/lib/docker
Debug Mode (client): false
Debug Mode (server): false
Registry: https://index.docker.io/v1/
Experimental: false
Insecure Registries:
 127.0.0.0/8
Live Restore Enabled: true

We forked the repo and made some changes (https://github.com/tiqets/amazon-eks-ami): we updated the Docker version and it seems to be working. Do you think this is related to Docker?

Environment:

  • AWS Region: eu-west-1
  • Instance Type(s): m5.large
  • EKS Platform version (use aws eks describe-cluster --name <name> --query cluster.platformVersion): eks.1
  • Kubernetes version (use aws eks describe-cluster --name <name> --query cluster.version): 1.11
  • AMI Version: amazon-eks-node-1.11-v20190211 (ami-0b469c0fef0445d29)
  • Kernel (e.g. uname -a):
    Linux ip-10-0-13-15.eu-west-1.compute.internal 4.14.94-89.73.amzn2.x86_64 #1 SMP Fri Jan 18 22:36:02 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux
  • Release information (run cat /tmp/release on a node):
empty
@toricls

toricls commented Mar 2, 2019

It seems to be related to kubernetes/kubernetes#45419, if your nodes are flapping between Ready/NotReady.

@mysunshine92

/sig node

@krish7919

Also, I have read on Slack that GKE automatically kills/restarts containerd when it detects this issue. Does AWS EKS do that too?

@whereisaaron

There are 200+ issues in the k8s project about PLEG problems, mostly attributed to PLEG resource exhaustion (running out of memory, or too many events too quickly) and some deadlock situations. I have seen it once and ended up restarting the node since there was no other option I could find.

@krish7919

Agreed, but this is present in most k8s versions now, and even though there is no fix yet, I want to know what workarounds, if any, EKS has in place for this problem. It definitely affects k8s deployments, and EKS as a product is directly affected.

Note: In our cluster, I debugged this a lot before finding these issues online, and can say definitively that it is not related to CPU, memory, network or disk issues. It might be related to too many events arriving too quickly.

@emanuelecasadio

emanuelecasadio commented Apr 4, 2019

Last night all of our nodes became unavailable because of this problem :|

@a7i

a7i commented May 19, 2019

Last night all of our nodes became unavailable because of this problem :|

Have you been able to resolve this issue yet, @emanuelecasadio?

@emanuelecasadio

@a7i no, but it seems that manually rebooting the node temporarily prevents this from happening again for some time (approx. 1-2 weeks).

@MohammedFadin

@emanuelecasadio Any update on this issue? Do we have to manually reboot the nodes all the time?

@jesuslinares

It is happening in our environments too. We created a bash script that checks whether docker ps completes successfully in less than 60 seconds. If this check fails 3 times, the Docker service is restarted.
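
A minimal sketch of that kind of watchdog (the interval, threshold and restart command are illustrative, not our exact script):

#!/usr/bin/env bash
# Restart Docker if docker ps fails to answer within 60s three times in a row.
failures=0
while true; do
  if timeout 60 docker ps > /dev/null 2>&1; then
    failures=0
  else
    failures=$((failures + 1))
    echo "$(date -Is) docker ps check failed (${failures}/3)"
  fi
  if [ "$failures" -ge 3 ]; then
    echo "$(date -Is) restarting docker"
    systemctl restart docker
    failures=0
  fi
  sleep 60
done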

It happens from time to time, but the script solves the issue.

What is the current status of this issue?

@PDRMAN

PDRMAN commented Oct 10, 2019

It is happening in our clusters as well. We're running:
Docker 18.06.1-ce
kubectl version
Client Version: version.Info{Major:"1", Minor:"13", GitVersion:"v1.13.5+icp-ee", GitCommit:"eb4df6c6fb47f5b4fd1ed8bfbfe2d0ed5ea636e1", GitTreeState:"clean", BuildDate:"2019-05-08T02:18:32Z", GoVersion:"go1.11.5", Compiler:"gc", Platform:"linux/amd64"}

@PDRMAN

PDRMAN commented Oct 10, 2019

We're going to test a possible solution: increasing the reserved resources for the system and the kubelet, raising the reserved CPU to 500m. This should improve the stability of the node; will report back if things improve.

systemReserved:
  cpu: "500m"
  memory: "512Mi"
  ephemeral-storage: "1Gi"
kubeReserved:
  cpu: "500m"
  memory: "512Mi"
  ephemeral-storage: "1Gi"
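
One way to confirm the reservations were picked up is to compare Capacity and Allocatable on the node; Allocatable should drop by roughly the reserved amounts plus the eviction thresholds (a sketch; the node name is a placeholder):

kubectl describe node <node-name> | grep -A 6 -E "^(Capacity|Allocatable)"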

@a7i

a7i commented Oct 11, 2019

The following configuration [passed to worker node userdata bootstrap] worked for me. I used to face this often but haven't had this issue in months now:

--kube-reserved cpu=250m,memory=0.5Gi,ephemeral-storage=1Gi \
--system-reserved cpu=250m,memory=0.2Gi,ephemeral-storage=1Gi \
--eviction-hard memory.available<300Mi,nodefs.available<10%

@PDRMAN Hopefully this will do the same for you and others.
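
For anyone wondering where those flags go: a sketch of the node userdata, assuming the stock /etc/eks/bootstrap.sh from this AMI (the cluster name and any other bootstrap arguments are placeholders):

#!/bin/bash
# Append the reservations to the kubelet arguments via bootstrap.sh
/etc/eks/bootstrap.sh my-cluster \
  --kubelet-extra-args '--kube-reserved=cpu=250m,memory=0.5Gi,ephemeral-storage=1Gi --system-reserved=cpu=250m,memory=0.2Gi,ephemeral-storage=1Gi --eviction-hard=memory.available<300Mi,nodefs.available<10%'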

@whereisaaron

These settings should have default values. Not protecting the kubelet is just asking for outages.

@mogren

mogren commented Oct 13, 2019

#350

@Bhuvan26

Bhuvan26 commented Nov 21, 2019

At least I'm not seeing any errors during "docker ps", but nodes are continuously flapping along with a CPU spike.

[root@we2 ~]# kubectl get nodes
NAME STATUS ROLES AGE VERSION
cwe1 Ready 30d v1.14.3
cwe2 Ready 30d v1.14.3
cwe3 NotReady 30d v1.14.3
we1 NotReady 30d v1.14.3
we2 NotReady 30d v1.14.3
[root@we2 ~]#
[root@we2 ~]#
[root@we2 ~]# journalctl -f
Nov 21 18:04:07 we2 kubelet[15634]: I1121 18:04:07.537029 15634 kubelet.go:1823] skipping pod synchronization - PLEG is not healthy: pleg was last seen active 5m6.695070902s ago; threshold is 3m0s.
Nov 21 18:04:11 we2 kubelet[15634]: W1121 18:04:11.993022 15634 reflector.go:289] object-"cscf-2019"/"configmapenvoycfx": watch of *v1.ConfigMap ended with: too old resource version: 43741716 (43744582)
Nov 21 18:04:12 we2 kubelet[15634]: I1121 18:04:12.537171 15634 kubelet.go:1823] skipping pod synchronization - PLEG is not healthy: pleg was last seen active 5m11.695207745s ago; threshold is 3m0s.

root@we2 ~]# kubectl top no
NAME CPU(cores) CPU% MEMORY(bytes) MEMORY%
cwe1 11787m 24% 53295Mi 27%
cwe2 8783m 18% 39302Mi 20%
cwe3 6308m 13% 23564Mi 12%
we1 1988m 4% 33217Mi 17%
we2 2989m 6% 33663Mi 17%
[root@we2 ~]#

I'm trying to recover my nodes with a workaround.

@paulopontesm

paulopontesm commented Nov 26, 2019

Could this be related to kubernetes/kubernetes#76531?

Also: kubernetes/kubernetes#77654

@kr3cj

kr3cj commented Dec 4, 2019

#350

We are seeing this issue as well. Some info about a node (instance type c5.2xlarge) running amazon-eks-node-1.14-v20190927 (ami-0392bafc801b7520f) that experienced it:

$ kubectl get no/ip-10-X-X-X.ec2.internal -owide
NAME                            STATUS   ROLES    AGE   VERSION              INTERNAL-IP     EXTERNAL-IP   OS-IMAGE         KERNEL-VERSION                  CONTAINER-RUNTIME
ip-10-X-X-X.ec2.internal   Ready    <none>   19d   v1.14.7-eks-1861c5   10.X.X.X   <none>        Amazon Linux 2   4.14.146-119.123.amzn2.x86_64   docker://18.6.1

Of note, we implemented some custom kubelet reservations on our worker nodes a while back to protect against SystemOOMs (these were implemented before we became aware of #350; since this EKS AMI already runs the kubelet in system.slice, we simply combined the reserved memory for the kubelet and the system into systemReserved).

cat /etc/eksctl/kubelet.yaml
...
evictionHard:
  imagefs.available: 15%
  memory.available: 200Mi
  nodefs.available: 10%
  nodefs.inodesFree: 5%
featureGates:
  DynamicKubeletConfig: true
  RotateKubeletServerCertificate: true
systemReserved:
  memory: 700Mi
systemReservedCgroup: /system.slice
...

The node experienced the NodeNotReady event 65 times in a ~4 hour period. A quick look at its metrics showed no correlation with CPU, memory, or disk anomalies.

@kr3cj

kr3cj commented Jan 23, 2020

UPDATE 1/27: This is not as predictable as I thought, so please disregard.


I have some more info on this. I think this has to do with the kubelet communicating with an impaired Docker daemon. Additionally, I may have found an earlier indicator of the problem: at least with kubelet v1.14.7-eks-1861c5 and docker-18.06.1ce-10.amzn2.x86_64, the kubelet will log 60 or more errors containing ExecSync. I only have one instance of this so far, but it happened about 4 hours before the node slipped into a permanently unhealthy state of NodeNotReady flapping and "PLEG is not healthy" errors.

Other kubelets will occasionally log the ExecSync errors but don't necessarily go into a bad state, so it could be the rate of that error rather than its mere existence.
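
If anyone wants to watch for the same signal, this is roughly how I'm counting those errors (a sketch; the time window and the 60-error threshold are just what I observed):

# Count kubelet log lines mentioning ExecSync over the last 4 hours
journalctl -u kubelet --since "4 hours ago" | grep -c "ExecSync"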

@cshivashankar

What is the workaround for this issue? I am facing the same issue in my cluster.
Any help would be appreciated.

@baifuwa

baifuwa commented Aug 28, 2020

I hit the same issue. The kubelet version is 14.0.0.

@yuzujoe

yuzujoe commented Apr 8, 2021

This also occurred on EKS 1.19.

We used an m5.2xlarge instance for the node.
This node can be assigned 58 pods, but if more than 40 pods are scheduled, the kubelet fails to check the status of Docker and the node status becomes NotReady.

I don't know why I'm getting this symptom; I don't have that many pod resources, and I have plenty of node resources.
https://github.com/awslabs/amazon-eks-ami/blob/master/files/eni-max-pods.txt#L194

@whereisaaron

@yuzujoe your issue could be related to #648, in which case reverting to a 1.18 node or an older 1.19 AMI may help.

@bbroniewski

bbroniewski commented Apr 14, 2021

Hi all, be aware that the runc component (1.0.0-rc93) of containerd.io, which is used by Docker, will give you PLEG issues and nodes flapping between Ready and NotReady. I hope no one else loses a ton of hours finding out the problem 🙂 Use another version of it, for example 1.0.0-rc92.
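
A quick way to check which runc build a node is actually running (a sketch; either command should show it):

runc --version
docker info 2>/dev/null | grep -i runc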

@jangrewe

@yuzujoe your issue could be related to #648, in which case reverting to a 1.18 node or an older 1.19 AMI may help.

"to an older 1.18" - the current AMIs for both 1.19 and 1.18 are broken, so you'll need the previous version for both.
Here are the exact versions that work for us: #648 (comment)

@yuzujoe

yuzujoe commented Apr 16, 2021

@jangrewe
Thanks!!

We're already using an old AMI.
It seems that a new AMI has been released now, so I will try that one.
https://github.com/awslabs/amazon-eks-ami/releases/tag/v20210414

@chrissav

I was experiencing the same: many nodes flapping NotReady in my cluster after upgrading to EKS 1.19, running m5.4xlarge with AMI v20210329 and an average of 50 pods per node. Confirmed this AMI had runc 1.0.0-rc93.

The problem seems resolved since updating the worker nodes to AMI release v20210414!

@saurav-agarwalla
Contributor

We recently released a fix for this as part of the v20210414 release, so any AMIs from that release onward shouldn't see this issue.

@mmerkes
Member

mmerkes commented May 6, 2021

Since this issue was opened, we've seen multiple problems come and go that impact PLEG health. PLEG is a good indication that something is going on, generally with the container runtime, but it's not very diagnostic. If you're using the latest AMIs and still seeing PLEG issues, feel free to open a new GH issue with the latest details!
