
Long session connections get dropped #144

Closed
Rez0k opened this issue Nov 23, 2023 · 16 comments

Labels
bug Something isn't working

Comments

@Rez0k

Rez0k commented Nov 23, 2023

What happened:
After migrating from Calico to AWS VPC CNI network policies (we are working with Istio, if that matters), we experience disconnections on long-lived connections such as Redis pub/sub or MongoDB connections.
The connection gets closed and then reconnects, and this happens every few minutes.

I configured the vpc cni addon to be:

{
  "enableNetworkPolicy": "true",
  "nodeAgent": {
      "enableCloudWatchLogs": "true"
  }
}
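
For reference, a minimal sketch of how configuration values like these can be pushed to the managed add-on via the AWS CLI (the cluster name is a placeholder):

# Sketch: apply the add-on configuration shown above to the managed vpc-cni add-on
aws eks update-addon \
  --cluster-name <my-cluster> \
  --addon-name vpc-cni \
  --configuration-values '{"enableNetworkPolicy":"true","nodeAgent":{"enableCloudWatchLogs":"true"}}'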

I can't see any logs in the nodeagent container of the aws-node pod; all I get is:

{"level":"info","ts":"2023-11-23T10:48:03.421Z","caller":"runtime/asm_amd64.s:1650","msg":"version","GitVersion":"","GitCommit":"","BuildDate":""}
2023-11-23T10:48:03.497038855Z 2023-11-23 10:48:03.4968622 +0000 UTC Logger.check error: failed to get caller

So, I can't attach logs.

What you expected to happen:
I expect the connections not to be dropped in the first place.

How to reproduce it (as minimally and precisely as possible):
Install the VPC CNI (network policy enabled) on EKS 1.28, apply a network policy, and try to connect to a MongoDB instance (or Redis pub/sub), or probably any other long-lived connection.
Run this sample code in a Node.js pod (preferably public.ecr.aws/docker/library/node:18.16.0-bullseye-slim, which is the image I am using):

const mongoose = require('mongoose');

async function init() {
    const db = await mongoose.connect('mongodb://<mongodb-host>:27017/<db>?retryWrites=true&w=majority&directConnection=true');
    
    mongoose.connection.on('error', error => {
        console.log(`Got error: ${error}`);
    });
    
    mongoose.connection.on('connected', () => {
        console.log(`Mongo Connected`);
    });
    
    mongoose.connection.on('disconnected', () => {
        console.log("Mongo Disconnected");
    });
    
    mongoose.connection.on('reconnected', () => {
        console.log("Mongo Reconnected");
    });

    console.log("connected!")
}

init()

Wait a few minutes and you should see logs like:

user@container-7b9f7xzs2-ysl25:/app# node mongo-sample-code.js 
connected!
Mongo Disconnected
Mongo Connected
Mongo Reconnected

Anything else we need to know?:
I use Istio in my cluster and used Calico up until yesterday; I terminated all instances to flush any leftovers from Calico.
With Calico everything worked as expected.

Environment:

  • Kubernetes version (use kubectl version): v1.28.3-eks-4f4795d
  • CNI Version: v1.15.4-eksbuild.1
  • Network Policy Agent Version
  • OS (e.g: cat /etc/os-release): Debian GNU/Linux 11 (bullseye)
  • Kernel (e.g. uname -a): Linux #### ####.amzn2.x86_64
Rez0k added the bug label Nov 23, 2023
@jayanthvn
Contributor

Can you please set this flag -> --enable-policy-event-logs=true and check whether you see a DENY verdict for the flow? That is what I suspect might be happening, and this might be similar to #139.
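
One way to set it, as a minimal sketch (this assumes editing the aws-node DaemonSet directly; with the managed add-on the equivalent nodeAgent configuration value can be used instead):

# Sketch: add the flag to the aws-eks-nodeagent container args on the aws-node DaemonSet
kubectl -n kube-system edit daemonset aws-node
# then, under the aws-eks-nodeagent container, append to args:
#   - --enable-policy-event-logs=true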

@Rez0k
Author

Rez0k commented Nov 26, 2023

  • I elaborated a bit in the HOW TO REPRODUCE section (maybe it will help)

I enabled this flag and now I am getting logs in the nodeagent container, but all the logs look like:

2023-11-26 09:39:08.533284802 +0000 UTC Logger.check error: failed to get caller
2023-11-26 09:39:08.533309791 +0000 UTC Logger.check error: failed to get caller
2023-11-26 09:39:08.533326096 +0000 UTC Logger.check error: failed to get caller
2023-11-26 09:39:08.533344727 +0000 UTC Logger.check error: failed to get caller
2023-11-26 09:39:08.533365531 +0000 UTC Logger.check error: failed to get caller
2023-11-26 09:39:08.533388542 +0000 UTC Logger.check error: failed to get caller
...

There is no DENY verdict in those logs, and I don't know why the nodeagent prints the logs like this.
I am still getting connection drops on long-lived connections, as mentioned above.

Any ideas why?
Is this a bug on your side, or is it something on my side?

@Rez0k
Author

Rez0k commented Nov 28, 2023

Maybe #73 and #83 are related.

@jayanthvn
Contributor

Yes it looks similar.

v1.0.7-rc1 tag is available. You can replace the aws-eks-nodeagent container image on aws-node DS with the v1.0.7-rc1 tag

For example -

  - name: aws-eks-nodeagent
    image: 602401143452.dkr.ecr.us-west-2.amazonaws.com/amazon/aws-network-policy-agent:v1.0.7-rc1

Please try and let us know if it is holding up.
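
As a shorthand, a sketch of swapping just that container image with kubectl (image URI copied from the example above; adjust the account/region for your cluster):

# Sketch: point the aws-eks-nodeagent container at the v1.0.7-rc1 tag
kubectl -n kube-system set image daemonset/aws-node \
  aws-eks-nodeagent=602401143452.dkr.ecr.us-west-2.amazonaws.com/amazon/aws-network-policy-agent:v1.0.7-rc1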

@Rez0k
Author

Rez0k commented Nov 29, 2023

> Yes it looks similar.
>
> v1.0.7-rc1 tag is available. You can replace the aws-eks-nodeagent container image on aws-node DS with the v1.0.7-rc1 tag
>
> For example -
>
>   - name: aws-eks-nodeagent
>     image: 602401143452.dkr.ecr.us-west-2.amazonaws.com/amazon/aws-network-policy-agent:v1.0.7-rc1
>
> Please try and let us know if it is holding up.

Still not working for me; MongoDB is disconnecting every 5 minutes.

@jayanthvn
Contributor

jayanthvn commented Nov 29, 2023

I missed that the above logs are just the pod logs. Can you share the node logs? You can run the script sudo bash /opt/cni/bin/aws-cni-support.sh on the node that is trying to connect to MongoDB and seeing disconnects, and mail the output to k8s-awscni-triage@amazon.com. Please also share the describe output of the corresponding policyEndpoint resources. Can you also share the source and destination IPs of the long-lived session in your test so we can review the logs?
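
A rough sketch of those collection steps, assuming kubectl access plus SSH/SSM access to the node (resource names are placeholders):

# On the affected node: generate the support bundle and mail the resulting archive
sudo bash /opt/cni/bin/aws-cni-support.sh

# With kubectl: list and describe the PolicyEndpoint resources for the affected namespace
kubectl get policyendpoints -A
kubectl describe policyendpoints -n <namespace>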

@Rez0k
Author

Rez0k commented Dec 3, 2023

> I missed that the above logs are just the pod logs. Can you share the node logs? You can run the script sudo bash /opt/cni/bin/aws-cni-support.sh on the node that is trying to connect to MongoDB and seeing disconnects, and mail the output to k8s-awscni-triage@amazon.com. Please also share the describe output of the corresponding policyEndpoint resources. Can you also share the source and destination IPs of the long-lived session in your test so we can review the logs?

After further investigation, it seems that the long-lived connections get terminated because of the Istio Envoy sidecar.
In short, Istio injects a sidecar container into each pod; this sidecar is an Envoy proxy responsible for forwarding the pod's traffic to and from the main container.
It worked with Calico before, but now it doesn't.

Does this new information help?
I submitted the machine logs, along with the relevant policyEndpoint, to the email address you mentioned.

How to reproduce:
Create an EKS cluster with AWS VPC CNI network policy enabled and Istio (https://istio.io/latest/docs/setup/getting-started/).
Create a Node.js pod with an Istio sidecar and run the code I mentioned above (MongoDB connection).
After 2-3 minutes the connection should get dropped.
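
A sketch of the Istio side of that setup, following the linked getting-started guide (the demo profile and the default namespace are assumptions, not requirements):

# Install Istio and enable sidecar injection for the test namespace
istioctl install --set profile=demo -y
kubectl label namespace default istio-injection=enabled
# Deploy the Node.js pod into that namespace, then run the mongo sample code inside it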

@jayanthvn
Contributor

Will you be able to try this image -

<account-number>.dkr.ecr.<region>.amazonaws.com/amazon/aws-network-policy-agent:v1.0.7-rc3

Please make sure you replace the account number and region.

@jdn5126
Contributor

jdn5126 commented Dec 27, 2023

Discussed with @Rez0k offline and the official release, i.e. VPC CNI v1.15.5 containing Network Policy agent tag v1.0.7, should fix this issue. Waiting for confirmation before closing issue

@Rez0k
Author

Rez0k commented Dec 31, 2023

> Discussed with @Rez0k offline and the official release, i.e. VPC CNI v1.15.5 containing Network Policy agent tag v1.0.7, should fix this issue. Waiting for confirmation before closing issue

I will try it this week and will update here.

@Rez0k
Author

Rez0k commented Jan 8, 2024

Update: I prefer to wait for your next release candidate, per #175 (comment)

@jayanthvn
Contributor

jayanthvn commented Jan 9, 2024

@Rez0k - We have v1.0.8-rc1 tag available if you would like to test.

@jayanthvn
Contributor

@Rez0k - Did you get a chance to verify the image?

@Rez0k
Author

Rez0k commented Jan 23, 2024

I prefer to wait for the official release; I will try the v1.0.8 release.
I don't want to apply the network policy when I am not sure it will work, as it would cause problems for my environment and developers.

@jayanthvn
Contributor

The v1.0.8 release is available - https://github.com/aws/amazon-vpc-cni-k8s/releases/tag/v1.16.3. Please try it out and let us know if you see any issues.
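
A sketch of moving the managed add-on to that release via the AWS CLI (the exact eksbuild suffix varies, so list the available versions first; the cluster name is a placeholder):

# List available vpc-cni add-on versions for Kubernetes 1.28
aws eks describe-addon-versions --addon-name vpc-cni --kubernetes-version 1.28 \
  --query 'addons[].addonVersions[].addonVersion'
# Update to the v1.16.3 build reported by the command above
aws eks update-addon --cluster-name <my-cluster> --addon-name vpc-cni --addon-version <v1.16.3-eksbuild.N>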

@Rez0k
Author

Rez0k commented Mar 3, 2024

Seems to work!
