Node NotReady because of PLEG is not healthy #195
it seems to be related to kubernetes/kubernetes#45419 if your nodes are flapping between Ready/NotReady? |
/sig node |
Also, I have read on slack that GKE kills/restarts |
There are 200+ issues in the k8s project for PLEG issues. Mostly attributed to PLEG resource exhaustion, either out of memory, or too many events too quickly, and some deadlock situations. I have seen it once and ended up restarting since there was no other option I could find. |
Agreed, but this is present in most of the k8s versions now and even though there is no fix yet, I want to know what work-arounds, if any, does EKS have in place to solve this problem. This definitely affects k8s deployments and as a product EKS is directly affected. Note: In our cluster, I have debugged this a lot before finding the issues online, and can say definitively that it is not related to CPU, memory, network or disk issues. It might be related to too many events too quickly. |
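One way to see whether PLEG relisting is actually falling behind before a node flips is to look at the kubelet's relist metrics. A minimal sketch, assuming kubectl access and a placeholder node name (metric names vary slightly across kubelet versions):

# Fetch the kubelet's Prometheus metrics through the API server's node proxy
# and pull out the PLEG relist duration/interval histograms.
$ kubectl get --raw "/api/v1/nodes/ip-10-X-X-X.ec2.internal/proxy/metrics" \
    | grep -E 'kubelet_pleg_relist_(duration|interval)_seconds'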
Last night all of our nodes became unavailable because of this problem :| |
Have you been able to resolve this issue yet @emanuelecasadio ? |
@a7i no, but it seems that manually rebooting the node temporarily prevents this from happening again for some time (1-2 weeks approx.) |
@emanuelecasadio Any update on this issue? Do we have to manually reboot the nodes all the time? |
It is happening in our environments too. We created a bash script that checks for the problem and remediates it; it still happens from time to time, but the script works around it. What is the current status of this issue? |
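The script itself isn't included in the thread; a minimal sketch of that kind of node-local watchdog, assuming SSH access and that restarting Docker and the kubelet clears the condition, could look like this:

#!/usr/bin/env bash
# Hypothetical watchdog (not the poster's script): if the kubelet has recently
# logged the PLEG error, bounce the container runtime and the kubelet.
# Intended to run from a cron job or systemd timer on the worker node.
set -euo pipefail

if journalctl -u kubelet --since "5 minutes ago" | grep -q "PLEG is not healthy"; then
  echo "$(date -Is) PLEG unhealthy, restarting docker and kubelet" >> /var/log/pleg-watchdog.log
  systemctl restart docker
  systemctl restart kubelet
fi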
It is happening in our clusters as well. We're running: |
We're going to be testing a possible solution: increasing the resources reserved for the system and the kubelet, raising the CPU reservation to 500m. This should improve the stability of the node; will report back if things improve.
systemReserved:
  cpu: 500m |
The following configuration (passed to the worker node userdata bootstrap) worked for me. I used to face this often but haven't had this issue in months now:
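(The snippet itself didn't survive the copy above. For illustration only: reservations like this are usually passed to the EKS AMI's /etc/eks/bootstrap.sh via --kubelet-extra-args in the instance userdata; the cluster name and values below are placeholders, not the poster's.)

#!/bin/bash
# Illustrative userdata: reserve CPU/memory for system daemons and the kubelet,
# and set hard eviction thresholds, so workloads cannot starve the node agents.
/etc/eks/bootstrap.sh my-cluster \
  --kubelet-extra-args '--system-reserved=cpu=250m,memory=700Mi --kube-reserved=cpu=250m,memory=500Mi --eviction-hard=memory.available<200Mi,nodefs.available<10%'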
@PDRMAN Hopefully it will do the same for you and others. |
These settings should have default values. Not protecting the kubelet is just asking for outages. |
At least I'm not seeing any error from "docker ps", but the nodes are continuously flapping along with the CPU spike.
[root@we2 ~]# kubectl get nodes
[root@we2 ~]# kubectl top no
I'm trying to recover my nodes with the workaround. |
Could this be related to this? |
We are seeing this issue as well. Some info about a node:

$ kubectl get no/ip-10-X-X-X.ec2.internal -owide
NAME                       STATUS   ROLES    AGE   VERSION              INTERNAL-IP   EXTERNAL-IP   OS-IMAGE         KERNEL-VERSION                  CONTAINER-RUNTIME
ip-10-X-X-X.ec2.internal   Ready    <none>   19d   v1.14.7-eks-1861c5   10.X.X.X      <none>        Amazon Linux 2   4.14.146-119.123.amzn2.x86_64   docker://18.6.1

Of note, we implemented some custom kubelet reservations on our worker nodes a while back to protect against SystemOOMs (these were implemented before we became aware of #350). Since this EKS AMI runs the kubelet in the system slice:

$ cat /etc/eksctl/kubelet.yaml
...
evictionHard:
  imagefs.available: 15%
  memory.available: 200Mi
  nodefs.available: 10%
  nodefs.inodesFree: 5%
featureGates:
  DynamicKubeletConfig: true
  RotateKubeletServerCertificate: true
systemReserved:
  memory: 700Mi
systemReservedCgroup: /system.slice
...

The node experienced the NodeNotReady event 65 times in a ~4 hour period. A quick look at its metrics showed no correlation with CPU, memory, or disk anomalies. |
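For anyone trying to quantify the same thing, those transitions can be counted from node events (placeholder node name; events are pruned after roughly an hour by default, so longer windows need an external event sink):

$ kubectl get events \
    --field-selector involvedObject.kind=Node,involvedObject.name=ip-10-X-X-X.ec2.internal,reason=NodeNotReady \
    --sort-by=.lastTimestamp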
UPDATE 1/27: This is not as predictable as I thought, so please disregard. I have some more info on this. I think this has to do with the kubelet communicating with an impaired docker daemon. Additionally, I may have found an earlier indicator of the problem, at least with the kubelet version we run. Other kubelets will occasionally log the ExecSync errors but don't necessarily go into a bad state, so it could be the rate of that error, not simply its existence. |
what is the workaround for this issue? I am facing the same issue in my cluster. |
I hit the same issue. The kubelet version is 14.0.0 |
This also occurred on EKS 1.19, using an m5.2xlarge instance for the node. I don't know why I'm getting this symptom: I don't have that many pods, and the node has plenty of spare resources. |
Hi all, be aware that the "runC" component (1.0.0-rc93) of "containerd.io", which is used by Docker, will give you PLEG issues and nodes flapping between Ready and NotReady. I hope no one else will lose a ton of hours finding out the problem 🙂 Use another version of it, for example 1.0.0-rc92. |
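A quick way to confirm which runc a node actually has (assumes SSH access; package names vary by distribution):

# Version reported by the runc binary itself and by the Docker daemon.
$ runc --version
$ docker info 2>/dev/null | grep -i 'runc version'
# On RPM-based hosts the installed packages can also be queried, e.g.:
$ rpm -q runc containerd.io 2>/dev/null || true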
"to an older 1.18" - current AMIs for both 1.19 and 1.18 are broken, so you'll need the previous version for both. |
@jangrewe Already using the old AMI. |
I was experiencing the same: many nodes flapping NotReady in my cluster after upgrading to EKS 1.19. Running m5.4xlarge with AMI v20210329 and an average of 50 pods per node. Confirmed this AMI had runC 1.0.0-rc93. The problem seems resolved since updating the worker nodes to AMI release v20210414! |
We recently released a fix for this as part of the v20210414 release so any AMIs after that release shouldn't be seeing this issue. |
Since this issue was opened, we've seen multiple issues come and go that impact PLEG health. PLEG is a good indication that something is going on, generally with the container runtime, but it is not very diagnostic. If you're using the latest AMIs and are still seeing PLEG issues, feel free to open a new GH issue with the latest details! |
What happened:
One of the nodes using the latest AMI version started to become NotReady
What you expected to happen:
The node is always ready
How to reproduce it (as minimally and precisely as possible):
Not sure; the node never shows high CPU or memory usage
Anything else we need to know?:
We SSH'd into the nodes and found that the PLEG is not healthy:
Feb 20 14:23:09 ip-10-0-13-15.eu-west-1.compute.internal kubelet[3694]: I0220 14:23:09.120100 3694 kubelet.go:1775] skipping pod synchronization - [PLEG is not healthy: pleg was last seen active 4h19m47.369998188s ago; threshold is 3m0s]
and when we try to check the containers with
docker ps
we get an error. This is the docker version we use:
We tried forking and making some changes (https://github.com/tiqets/amazon-eks-ami); we updated the docker version and it seems to be working. Do you think this is related to docker?
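For anyone hitting the same docker ps failure, some generic daemon-side checks (standard Docker/systemd commands, not specific to this AMI) that can help tell a wedged daemon from a slow one:

# Is the daemon running, and does it answer API calls within a reasonable time?
$ systemctl status docker --no-pager
$ timeout 10 docker info || echo "docker daemon did not respond within 10s"
# Daemon-side errors around the time the node went NotReady:
$ journalctl -u docker --since "1 hour ago" --no-pager | tail -n 50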
Environment:
- EKS Platform version (use aws eks describe-cluster --name <name> --query cluster.platformVersion): eks.1
- Kubernetes version (use aws eks describe-cluster --name <name> --query cluster.version): 1.11
- Kernel (e.g. uname -a): Linux ip-10-0-13-15.eu-west-1.compute.internal 4.14.94-89.73.amzn2.x86_64 #1 SMP Fri Jan 18 22:36:02 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux
- Release information (run cat /tmp/release on a node):