Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Tag ENI with creation timestamp and avoid cleanup if created in last … #1109

Merged
merged 1 commit into from
Aug 22, 2020

Conversation

jayanthvn
Copy link
Contributor

@jayanthvn jayanthvn commented Jul 30, 2020

…5 mins

Issue #, if available:

Description of changes:

Tag EKS created ENI with creation timestamp. While checking for leaked ENIs we prevent deletes of ENIs created in the last 5 minutes. Also for scenarios where Cx has one cluster with 1.5. x and another with 1.6.x - The cleanup function in 1.6.x looks for ENIs that are created by the CNI, but not attached to any instance. In 1.6.0 we tag the ENI with the instance name after it has been attached. In 1.5.x, we tag the nodes first, then attach them. If we find any such ENIs then we update with new time preventing those to be deleted.

Issue logs debugged by internal support team->

Due to a change introduced since 1.6.0 (clean up ENIs), with both version running at the same time, it will cause the ENI created with pre 1.6.0 version randomly got deleted by CNI running 1.6.0.

  1. New ENI clean up introduced since 1.6.0, the function getFilteredListOfNetworkInterfaces will check tag key node.k8s.amazonaws.com/instance_id and the ENI status. The ENI status needs to be available, to be included to clean up:
    // The tag key has to be "node.k8s.amazonaws.com/instance_id"
    tagFilter := &ec2.Filter{
        Name: aws.String("tag-key"),
        Values: []*string{
            aws.String(eniNodeTagKey),
        },
    }
    // Only fetch "available" ENIs.
    statusFilter := &ec2.Filter{
        Name: aws.String("status"),
        Values: []*string{
            aws.String("available"),
        },
    }
  1. Along with this change, the Create ENI process has also been changed in 1.6.0, and here is the order:
    Create ENI
    Attach ENI
    Tag ENI
Code here:
    // Once the ENI is attached, tag it.
    cache.tagENI(eniID, maxENIBackoffDelay)

Since ENI will only be tagged after attached, this ensures the newly created ENI will not meet the filter above, so this ensures newly created ENI won't get cleaned up.

  1. Now, what is process of create ENI in pre-1.6.0 version? Here is the code:
    The order is different:
    Create ENI
    Tag ENI
    Attach ENI
func (cache *EC2InstanceMetadataCache) AllocENI(useCustomCfg bool, sg []*string, subnet string) (string, error) {
    eniID, err := cache.createENI(useCustomCfg, sg, subnet)
    if err != nil {
        return "", errors.Wrap(err, "AllocENI: failed to create ENI")
    }

    cache.tagENI(eniID)

    attachmentID, err := cache.attachENI(eniID)
    if err != nil {
        _ = cache.deleteENI(eniID)
        return "", errors.Wrap(err, "AllocENI: error attaching ENI")
    }

So in this case, CNI 1.5.5 from instance i-0dd659cd2b0acd35e created ENI eni-08cec64239ece5335 and tagged it immediately not yet attaching it:

2020-07-20T00:33:05.040Z [INFO] Created a new ENI: eni-08cec64239ece5335
2020-07-20T00:33:05.040Z [DEBUG]    Trying to tag newly created ENI: key=node.k8s.amazonaws.com/instance_id, value=i-0dd659cd2b0acd35e
2020-07-20T00:33:05.113Z [INFO] Add/Update for Pod check-reaper-1595205180-vdszs on my node, namespace = kuberhealthy, IP = 
2020-07-20T00:33:05.129Z [INFO] Add/Update for Pod check-reaper-1595205180-vdszs on my node, namespace = kuberhealthy, IP = 
2020-07-20T00:33:05.158Z [DEBUG]    Successfully tagged ENI: eni-08cec64239ece5335

This newly created ENI, has the tag key, and it is available status, perfectly matches the filter for CNI 1.6.0 to delete, so CNI 1.6.0 from instance i-0c91507c3fa5a862a deleted this ENI.

CNI 1.5.5/7 from instance i-0dd659cd2b0acd35e then tried to attach the ENI, and failed, because ENI has been deleted:

2020-07-20T00:33:09.106Z [ERROR]    Failed to attach ENI eni-08cec64239ece5335: InvalidNetworkInterfaceID.NotFound: The networkInterface ID 'eni-08cec64239ece5335' does not exist
    status code: 400, request id: a79c920b-a878-4378-83b9-f705195ee5cc
2020-07-20T00:33:09.106Z [DEBUG]    Trying to delete ENI: eni-08cec64239ece5335
2020-07-20T00:33:09.559Z [DEBUG]    Not able to delete ENI yet (attempt 1/20): InvalidNetworkInterfaceID.NotFound: The networkInterface ID 'eni-08cec64239ece5335' does not exist
2020-07-20T00:33:14.715Z [DEBUG]    Not able to delete ENI yet (attempt 2/20): InvalidNetworkInterfaceID.NotFound: The networkInterface ID 'eni-08cec64239ece5335' does not exist
    status code: 400, request id: 09309204-cc82-420e-9f8f-272d789d18b9 

Logs after fix ->

{"level":"debug","ts":"2020-07-29T18:56:14.571Z","caller":"wait/wait.go:133","msg":"Checking for leaked AWS CNI ENIs."}
{"level":"info","ts":"2020-07-29T18:56:14.661Z","caller":"awsutils/awsutils.go:1311","msg":"Found ENI created less than 5 mins"}
{"level":"debug","ts":"2020-07-29T18:56:14.661Z","caller":"awsutils/awsutils.go:1311","msg":"No AWS CNI leaked ENIs found."}


{"level":"debug","ts":"2020-07-29T18:49:34.549Z","caller":"wait/wait.go:133","msg":"Checking for leaked AWS CNI ENIs."}
{"level":"info","ts":"2020-07-29T18:49:34.633Z","caller":"awsutils/awsutils.go:1311","msg":"2020-07-29T18:00:00Z"}
{"level":"debug","ts":"2020-07-29T18:49:34.633Z","caller":"awsutils/awsutils.go:1311","msg":"Found 1 available instances with the AWS CNI tag."}
{"level":"debug","ts":"2020-07-29T18:49:34.633Z","caller":"awsutils/awsutils.go:1318","msg":"Trying to delete ENI: eni-0ed34872fcac8ce7a"}
{"level":"info","ts":"2020-07-29T18:49:35.095Z","caller":"retry/retry.go:69","msg":"Successfully deleted ENI: eni-0ed34872fcac8ce7a"}

****
{"level":"debug","ts":"2020-07-30T00:49:51.473Z","caller":"wait/wait.go:133","msg":"Checking for leaked AWS CNI ENIs."}
{"level":"debug","ts":"2020-07-30T00:49:51.550Z","caller":"awsutils/awsutils.go:1406","msg":"Tag untagged ENI eni-09859c16d9cc9be17: key=node.k8s.amazonaws.com/createdAt, value=2020-07-30T00:49:51Z"}

Thanks to @mogren

Updated the Unit test to simulate ENI create TS to be 10 mins before the the current time.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

Copy link
Contributor

@mogren mogren left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some nits, and needs a rebase.

pkg/awsutils/awsutils.go Outdated Show resolved Hide resolved
pkg/awsutils/awsutils.go Outdated Show resolved Hide resolved
pkg/awsutils/awsutils.go Outdated Show resolved Hide resolved
pkg/awsutils/awsutils.go Outdated Show resolved Hide resolved
pkg/awsutils/awsutils.go Outdated Show resolved Hide resolved
Copy link
Contributor

@mogren mogren left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice, thanks @jayanthvn!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants