What happened:
We have an application that is failing readiness and liveness probes because the probe traffic is being denied by the NetPol agent. We've seen this across multiple versions, including v1.1.0-eksbuild.1 and v1.1.2-eksbuild.1.
I was able to see that the network traffic was being denied in /var/log/aws-routed-eni/network-policy-agent.log. After some period of time, the traffic is accepted again and the application recovers.
What stuck out to me is that multiple PolicyEndpoints are created. Our NetworkPolicy looks something like the sketch below. Think of use cases where all pods need to reach a core service; this results in multiple PEs.
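A minimal sketch of the shape of our policy, assuming placeholder names, namespace, and port (this is not our exact manifest):

```yaml
# Hypothetical reconstruction; the name, namespace, labels, and port
# are placeholders for illustration.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: core-service-ingress
  namespace: core
spec:
  podSelector:
    matchLabels:
      app: core-service
  policyTypes:
    - Ingress
  ingress:
    - from:
        # Matches every namespace, so every pod in the cluster is a
        # permitted peer and must be enumerated by the controller.
        - namespaceSelector: {}
      ports:
        - protocol: TCP
          port: 8080
```

Because the selector matches every pod in the cluster, the enumerated pod IPs end up spread across multiple PolicyEndpoint objects (kubectl get policyendpoints should list them, assuming the policyendpoints.networking.k8s.aws CRD).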
I tested this by changing namespaceSelector: {} to a rule like ipBlock.cidr: 0.0.0.0/0; the multiple PEs are replaced by a single PE, since every Pod in the cluster no longer needs to be enumerated in .spec.ingress. We haven't seen a single probe failure in the week since changing the configuration to remove the multiple PEs, compared to literally hundreds of failures over the preceding couple of weeks.
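For comparison, a sketch of the replacement ingress rule (same placeholder port as above):

```yaml
  ingress:
    - from:
        # A single CIDR entry instead of a per-pod enumeration,
        # so one PolicyEndpoint covers the rule.
        - ipBlock:
            cidr: 0.0.0.0/0
      ports:
        - protocol: TCP
          port: 8080
```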
It's also worth noting that this is an intermittent issue. The pattern we see is that the probes fail, the container is restarted, and then the service recovers. We'll see this anywhere from 1-5 times a day. Interestingly, we see this issue on a few of our clusters that have ~2000 pods but a relatively low pod churn rate. We never see container restarts on our cluster that has ~3000 pods but a higher churn rate due to heavy usage of CronJobs. I can see the "Received a new reconcile request" log line appearing far more frequently in /var/log/aws-routed-eni/network-policy-agent.log on the cluster that's not experiencing this issue. The potential bug may still be occurring on that cluster, but the next reconciliation happens faster than the time it takes for the liveness probes to fail (~30s).
Attach logs
Logs were sent.
What you expected to happen:
Liveness / readiness probe traffic is not denied.
How to reproduce it (as minimally and precisely as possible):
1. Create a cluster with >1000-2000 pods so that multiple PolicyEndpoint entries are produced.
2. Configure an application with liveness probes, something like http-get http://some-endpoint delay=0s timeout=3s period=5s #success=1 #failure=6 (a sketch of an equivalent probe spec follows these steps).
3. Configure a NetworkPolicy using namespaceSelector: {} for its ingress rules.
4. Allow the application to run for some number of hours (again, we see this 0-3 times per day).
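A hedged sketch of a pod spec matching the probe settings above; the image, path, and port are placeholders, and only the probe timings come from the report:

```yaml
# Hypothetical pod fragment; only the probe timings are taken from the
# description above (delay=0s timeout=3s period=5s success=1 failure=6).
apiVersion: v1
kind: Pod
metadata:
  name: probe-repro
spec:
  containers:
    - name: app
      image: registry.example.com/app:latest
      ports:
        - containerPort: 8080
      livenessProbe:
        httpGet:
          path: /healthz
          port: 8080
        initialDelaySeconds: 0
        timeoutSeconds: 3
        periodSeconds: 5
        successThreshold: 1
        failureThreshold: 6
```

With periodSeconds: 5 and failureThreshold: 6, roughly 30 seconds of denied probe traffic is enough to trigger a restart, which lines up with the ~30s window mentioned above.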
Anything else we need to know?:
Environment:
Kubernetes version (use kubectl version):
% kubectl version
Client Version: v1.29.3
Kustomize Version: v5.0.4-0.20230601165947-6ce0bf390ce3
Server Version: v1.28.11-eks-db838b0
CNI Version: v1.18.2-eksbuild.1
Network Policy Agent Version: v1.1.2-eksbuild.1
OS (e.g: cat /etc/os-release):
Kernel (e.g. uname -a):
@jayanthvn I'll have to find some time to reproduce the issue again. It may not be for a few days. I didn't get a chance to capture those logs originally, but I'll make sure to run the capture script on the next one.