oom kills #1218
Comments
Just a quick note here since I was running into something similar, maybe it will help. OOM kills might be preventing you from seeing the true logs on what is actually causing the issue. On my cluster I gave my pod way more memory than it needed, it stabilized, and I was able to see that the efs-plugin was stuck trying to delete a bunch of PVs, which was what was causing the out-of-memory errors before I temporarily gave it more memory to troubleshoot.
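In case it helps others, here is one way to confirm the OOM kill and read the previous container's logs once the pod has enough headroom. This is a generic sketch; the namespace and pod name are assumptions based on a default install, so adjust them to your setup:

```sh
# Look for "OOMKilled" under "Last State" for the efs-plugin container
# (pod name is a placeholder; namespace assumes the default install location)
kubectl -n kube-system describe pod efs-csi-node-xxxxx

# Once the pod stays up, read the logs of the previously killed container instance
kubectl -n kube-system logs efs-csi-node-xxxxx -c efs-plugin --previous
```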
@bonickle Once you got past the OOM issue, can you share what you found to be the root cause, as well as what your resolution was? I'm having the same issue, but still haven't determined why it's happening even after manually removing the PVs.
@df-lcb still looking into the root cause. If/when I figure it out, I'll drop an update.
From my point of view the root cause is that if you have dozens of EFS volumes, the OOM crash occurs because the driver mounts them all at the same time. This is also why you see so many mount and python processes in the process list. I would recommend sequential mounts or a limit on the number of parallel mounts.
Thanks @runningman84, I'll look into this.
Hi @df-lcb, any updates on this? We bumped the memory to 512Mi and the OOMKilled happens again.
Hey, @jiangfwa. Due to resource spiking, we had to temporarily bump ours up to 1.5Gi to get past the OOM.
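For anyone who needs to apply the same workaround, below is a minimal sketch of raising the efs-plugin container's memory on the node daemonset. The daemonset name, namespace, and the exact values are assumptions based on a default install; adjust them for your cluster, or set the equivalent resources in your Helm values instead:

```sh
# Raise the memory request/limit on the efs-plugin container of the node daemonset
# (names, namespace, and sizes are assumptions; pick limits that fit your workload)
kubectl -n kube-system patch daemonset efs-csi-node --type strategic -p '
spec:
  template:
    spec:
      containers:
      - name: efs-plugin
        resources:
          requests:
            memory: "512Mi"
          limits:
            memory: "1536Mi"'
```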
The Kubernetes project currently lacks enough contributors to adequately respond to all issues. This bot triages un-triaged issues according to its standard staleness rules.
Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale
/remove-lifecycle stale
The Kubernetes project currently lacks enough contributors to adequately respond to all issues. This bot triages un-triaged issues according to its standard staleness rules.
Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale
The Kubernetes project currently lacks enough active contributors to adequately respond to all issues. This bot triages un-triaged issues according to its standard staleness rules.
Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle rotten
Just in case anyone encounters this issue, I think we found one potential cause. We have recently experienced frequent OOM kills within the efs-plugin container of the efs-csi-node pods. The container was eventually given a 2.5Gi memory limit to counter the OOM. We eventually realised that the underlying EFS file system was configured in "Bursting" throughput mode; the OOM restarts occurred during heavy EFS use, when the file system's throughput would eventually be throttled. We speculate that this caused a large number of open file transfers to pile up, leading to the OOM. Switching the EFS file system to "Provisioned" throughput mode appears to have fixed the memory issue so far.
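For reference, here is a hedged sketch of how the throughput mode can be checked and switched with the AWS CLI; the file system ID is a placeholder and the provisioned MiB/s figure is only an example, not a recommendation:

```sh
# Check the current throughput mode (fs-0123456789abcdef0 is a placeholder ID)
aws efs describe-file-systems --file-system-id fs-0123456789abcdef0 \
  --query 'FileSystems[0].ThroughputMode'

# Switch from bursting to provisioned throughput
# (128 MiB/s is an example value; size it for your workload and note it affects cost)
aws efs update-file-system --file-system-id fs-0123456789abcdef0 \
  --throughput-mode provisioned --provisioned-throughput-in-mibps 128
```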
/kind bug
What happened?
Lately we discovered that some of our EFS CSI pods crash due to OOM kills.
What you expected to happen?
na
How to reproduce it (as minimally and precisely as possible)?
Unclear, because it does not affect all clusters or nodes.
Anything else we need to know?:
The plugin container was tested with 64/256, 128/128, and 256/256 MiB requests/limits. Only increasing them to 512Mi/512Mi solved the issue.
According to the Prometheus stats, the container does not consume more than 100 MiB of memory.
Environment
Kubernetes version (use kubectl version): 1.28.x
Please also attach debug logs to help us better diagnose.
This is the dmesg output: