‼️ NOTICE custom-resources: various custom resources may fail to deploy / destroy #26325

Closed
kishiel opened this issue Jul 11, 2023 · 12 comments · Fixed by #26283
Labels
@aws-cdk/aws-eks Related to Amazon Elastic Kubernetes Service bug This issue is a bug. effort/medium Medium work item – several days of effort p1

Comments

@kishiel
Contributor

kishiel commented Jul 11, 2023

Status: In-Progress

What is the issue?

In #26212, we upgraded our Node.js runtime to Node.js 18, which meant all our custom resources now needed to use AWS SDK for JavaScript v3. There were a few places that we missed.

Who is affected?

Users of aws-cdk-lib version 2.87.0

How do I resolve this?

Upgrade to a version higher than 2.87.0

Workaround

No workaround


Original posting

Describe the bug

Running the integration tests for aws-eks or aws-stepfunctions-tasks that invoke the cluster-resource-handler results in a failure when onDelete is called. This is because the `code` key that the exception handler checks no longer exists on the error. Fargate's handler is similarly affected.

Expected Behavior

When calling the integration tests I expected the clusters to successfully create, update, and delete themselves.

Current Behavior

The final step of deleting the cluster fails with:

2023-07-10T18:59:56.441Z    fdc661e6-24f7-4d12-ac99-51342071842d    ERROR   Invoke Error    {
    "errorType": "ResourceNotFoundException",
    "errorMessage": "No cluster found for name: integrationtesteksclusterE5C0ED98-41454fc08d0746558ff42bc9a701230c.",
    "name": "ResourceNotFoundException",
    "$fault": "client",
    "$metadata": {
        "httpStatusCode": 404,
        "requestId": "75681069-c732-4896-98f7-ce3fb2f8e777",
        "attempts": 1,
        "totalRetryDelay": 0
    },
    "clusterName": "integrationtesteksclusterE5C0ED98-41454fc08d0746558ff42bc9a701230c",
    "nodegroupName": null,
    "fargateProfileName": null,
    "addonName": null,
    "stack": [
        "ResourceNotFoundException: No cluster found for name: integrationtesteksclusterE5C0ED98-41454fc08d0746558ff42bc9a701230c.",
        "    at deserializeAws_restJson1ResourceNotFoundExceptionResponse (/var/runtime/node_modules/@aws-sdk/client-eks/dist-cjs/protocols/Aws_restJson1.js:2586:23)",
        "    at deserializeAws_restJson1DescribeClusterCommandError (/var/runtime/node_modules/@aws-sdk/client-eks/dist-cjs/protocols/Aws_restJson1.js:1492:25)",
        "    at process.processTicksAndRejections (node:internal/process/task_queues:95:5)",
        "    at async /var/runtime/node_modules/@aws-sdk/middleware-serde/dist-cjs/deserializerMiddleware.js:7:24",
        "    at async /var/runtime/node_modules/@aws-sdk/middleware-signing/dist-cjs/middleware.js:13:20",
        "    at async StandardRetryStrategy.retry (/var/runtime/node_modules/@aws-sdk/middleware-retry/dist-cjs/StandardRetryStrategy.js:51:46)",
        "    at async /var/runtime/node_modules/@aws-sdk/middleware-logger/dist-cjs/loggerMiddleware.js:6:22",
        "    at async ClusterResourceHandler.isDeleteComplete (/var/task/cluster.js:69:26)"
    ]
}

The cluster itself is being deleted, but our evaluation of the result fails, so the deletion is treated as a failure.

Reproduction Steps

Check out the most recent build of aws-cdk and run any eks test which includes a fargate profile (e.g. integ.eks-cluster-ipv6)

Possible Solution

We can change the current evaluation to check e.$metadata.httpStatusCode === 404 (the field shown in the error output above) instead of a string comparison against the error code.
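A minimal sketch of that check, assuming the error has the shape shown in the log above (`name`, `$fault`, `$metadata.httpStatusCode`). The `SdkV3ErrorLike` interface and the `isNotFound` helper are hypothetical names used for illustration, not part of the CDK or the SDK:

```typescript
// Minimal shape of the error seen in the log above (illustrative, not an SDK type).
interface SdkV3ErrorLike {
  name?: string;
  $metadata?: { httpStatusCode?: number };
}

// Hypothetical helper: true when the service answered with HTTP 404.
function isNotFound(e: unknown): boolean {
  if (typeof e !== "object" || e === null) {
    return false;
  }
  const err = e as SdkV3ErrorLike;
  return err.$metadata?.httpStatusCode === 404;
}

// Sample object mirroring the fields from the logged error.
const sample = {
  name: "ResourceNotFoundException",
  $fault: "client",
  $metadata: { httpStatusCode: 404, attempts: 1, totalRetryDelay: 0 },
};

console.log(isNotFound(sample)); // true
console.log(isNotFound(new Error("boom"))); // false
```

Checking the status code avoids depending on which property carries the exception identifier, at the cost of treating every 404 from the call as "resource is gone".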

Additional Information/Context

There's a bunch of other stuff that's broken in the eks tests, especially with the helm chart for the kubernetes-dashboard. I've been working on a fix for the better part of 3 days and still haven't hit the bottom of the breakage.

This is affecting three tests in aws-stepfunctions-tasks.

I believe these failures are related to #26212 but I haven't had the time to identify the exact changes. The upgrade from aws-sdk-js v2 to v3 would have ideally triggered a re-run of all of the integration tests which use the SDK, but I don't believe the resource trees can see that difference.

CDK CLI Version

0.0.0 (build c38e784) (npx cdk)
2.42.0 (build 7d8ef0b) (local install)

Framework Version

No response

Node.js Version

v18.16.0

OS

macOS 13.4

Language

TypeScript

Language Version

No response

Other information

No response

@kishiel kishiel added bug This issue is a bug. needs-triage This issue or PR still needs to be triaged. labels Jul 11, 2023
@github-actions github-actions bot added the @aws-cdk/aws-eks Related to Amazon Elastic Kubernetes Service label Jul 11, 2023
@kishiel
Contributor Author

kishiel commented Jul 11, 2023

I've made some changes to get the tests working again, but getting the framework-integ directory to consistently apply these changes is difficult for some reason.

main...kishiel:eks-ipv6:main

There's an additional problem manifesting in the kubernetes-dashboard component of the cluster test that I couldn't iron out. I've removed that assertion as it's duplicative of the namespace test if we're just trying to demonstrate that chart definitions can build in a helm chart.

@pahud
Contributor

pahud commented Jul 12, 2023

Thank you for the report. Can you clarify which integ tests are failing?

@pahud pahud added p2 effort/medium Medium work item – several days of effort and removed needs-triage This issue or PR still needs to be triaged. labels Jul 12, 2023
@kishiel
Contributor Author

kishiel commented Jul 12, 2023

Here are a few:

aws-eks/test/integ.eks-cluster-ipv6.js
aws-eks/test/integ.eks-cluster.js
aws-stepfunctions-tasks/test/eks/integ.call.js
aws-stepfunctions-tasks/test/emrcontainers/integ.job-submission-workflow.js
aws-stepfunctions-tasks/test/emrcontainers/integ.start-job-run.js

The error.code key comparison against ResourceNotFoundException is littered throughout the code-base, so my guess is that the scope of this problem is bigger than just EKS.
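To illustrate why an `error.code` comparison stops matching, here is a small sketch. It assumes that older (v2-style) errors carry the identifier in `code`, while the v3-style error logged in this issue carries it in `name`. The `serviceErrorName` helper and the two sample objects are hypothetical and exist only for this example:

```typescript
// Hypothetical helper: read the exception identifier from either error style.
// Assumption: v2-style errors use `code`, v3-style errors use `name`.
function serviceErrorName(e: unknown): string | undefined {
  if (typeof e !== "object" || e === null) {
    return undefined;
  }
  const err = e as { code?: unknown; name?: unknown };
  if (typeof err.code === "string") {
    return err.code;
  }
  if (typeof err.name === "string") {
    return err.name;
  }
  return undefined;
}

// Illustrative objects; real SDK errors have more fields.
const v2Style: Record<string, unknown> = {
  code: "ResourceNotFoundException",
  message: "No cluster found",
};
const v3Style: Record<string, unknown> = {
  name: "ResourceNotFoundException",
  $metadata: { httpStatusCode: 404 },
};

// The old comparison silently fails on the v3-style object: `code` is undefined.
console.log(v3Style["code"] === "ResourceNotFoundException"); // false

// The helper resolves the identifier for both styles.
console.log(serviceErrorName(v2Style)); // ResourceNotFoundException
console.log(serviceErrorName(v3Style)); // ResourceNotFoundException
```

A helper like this would only be a stopgap while call sites are migrated; it does not replace auditing each comparison.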

@kishiel
Contributor Author

kishiel commented Jul 12, 2023

onDelete for the cluster handler returns a response without the ResourceNotFoundException value in any of the keys, so moving over to $metadata.httpStatusCode seems like a reasonable change:

2023-07-11T16:40:50.634Z	45e636a4-f247-4aeb-8789-4e625c5d463c	INFO	describeCluster error: ResourceNotFoundException: No cluster found for name: Cluster9EE0221C-2dda1421b4b94d189f7a5a65f46f3902.
    at deserializeAws_restJson1ResourceNotFoundExceptionResponse (/var/runtime/node_modules/@aws-sdk/client-eks/dist-cjs/protocols/Aws_restJson1.js:2586:23)
    at deserializeAws_restJson1DescribeClusterCommandError (/var/runtime/node_modules/@aws-sdk/client-eks/dist-cjs/protocols/Aws_restJson1.js:1492:25)
    at process.processTicksAndRejections (node:internal/process/task_queues:95:5)
    at async /var/runtime/node_modules/@aws-sdk/middleware-serde/dist-cjs/deserializerMiddleware.js:7:24
    at async /var/runtime/node_modules/@aws-sdk/middleware-signing/dist-cjs/middleware.js:13:20
    at async StandardRetryStrategy.retry (/var/runtime/node_modules/@aws-sdk/middleware-retry/dist-cjs/StandardRetryStrategy.js:51:46)
    at async /var/runtime/node_modules/@aws-sdk/middleware-logger/dist-cjs/loggerMiddleware.js:6:22
    at async ClusterResourceHandler.isDeleteComplete (/var/task/cluster.js:69:26) {
  '$fault': 'client',
  '$metadata': {
    httpStatusCode: 404,
    requestId: 'cd0658dc-5f8d-4659-86ee-f92bd05e0a0c',
    extendedRequestId: undefined,
    cfId: undefined,
    attempts: 1,
    totalRetryDelay: 0
  },
  clusterName: 'Cluster9EE0221C-2dda1421b4b94d189f7a5a65f46f3902',
  nodegroupName: null,
  fargateProfileName: null,
  addonName: null
}
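As a rough sketch of how a delete-completion check could treat this error, the function below polls a describe call and interprets a 404 as "deletion finished". The `DescribeCluster` type, the injected `describeCluster` function, and the stubs are hypothetical stand-ins so the example runs without the SDK; this is not the actual handler code:

```typescript
// Hypothetical stand-in for the SDK call; rejects when the cluster is gone.
type DescribeCluster = (name: string) => Promise<unknown>;

async function isDeleteComplete(
  clusterName: string,
  describeCluster: DescribeCluster,
): Promise<{ IsComplete: boolean }> {
  try {
    await describeCluster(clusterName);
    // The cluster still exists, so deletion is still in progress.
    return { IsComplete: false };
  } catch (e: unknown) {
    const status = (e as { $metadata?: { httpStatusCode?: number } } | null)
      ?.$metadata?.httpStatusCode;
    if (status === 404) {
      // Not found: the cluster has been deleted.
      return { IsComplete: true };
    }
    // Any other error is unexpected and should surface.
    throw e;
  }
}

// Stub that mimics the logged error shape.
const gone: DescribeCluster = async () => {
  throw Object.assign(new Error("No cluster found"), {
    name: "ResourceNotFoundException",
    $metadata: { httpStatusCode: 404 },
  });
};

isDeleteComplete("demo-cluster", gone).then((r) => console.log(r.IsComplete));
```

Rethrowing non-404 errors keeps genuine failures (throttling, permissions) visible instead of reporting a false completion.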

@kishiel
Contributor Author

kishiel commented Jul 12, 2023

I think I'm close to being able to open a PR to fix this, but I'm having to completely strip out the nginx helm charts from the tests due to some breaking changes that were recently released.

kubernetes-dashboard needs a complete rewire in the tests, but I really wonder how sustainable this model is. We could anchor to a specific version, I guess, but at the end of the day we've got a snapshot test with mutable dependencies.

nginx-controller has a new failure that leaves the nginx-elb-controller security group behind after cluster deletion, which prevents the VPC from being deleted.

@mrgrain
Contributor

mrgrain commented Jul 15, 2023

@MrArnoldPalmer fyi

@mrgrain
Contributor

mrgrain commented Jul 17, 2023

Upgrading severity since this affects not only integration tests but also real deployments.

@mrgrain mrgrain added p0 and removed p1 labels Jul 17, 2023
@iliapolo iliapolo pinned this issue Jul 17, 2023
@iliapolo iliapolo changed the title (eks): cluster-resource-handler onDelete fails for fargate and cluster events ‼️ NOTICE (eks): cluster-resource-handler onDelete fails for fargate and cluster events Jul 17, 2023
@mergify mergify bot closed this as completed in #26283 Jul 19, 2023
mergify bot pushed a commit that referenced this issue Jul 19, 2023
Ran npm-check-updates and yarn upgrade to keep the `yarn.lock` file up-to-date.

Fixes #26325
@github-actions

⚠️COMMENT VISIBILITY WARNING⚠️

Comments on closed issues are hard for our team to see.
If you need more assistance, please either tag a team member or open a new issue that references this one.
If you wish to keep having a conversation with other community members under this issue feel free to do so.

@mrgrain
Contributor

mrgrain commented Jul 19, 2023

Keep open until release is out.

@mrgrain mrgrain reopened this Jul 19, 2023
@iliapolo iliapolo changed the title ‼️ NOTICE (eks): cluster-resource-handler onDelete fails for fargate and cluster events ‼️ NOTICE (custom-resources): various custom resources may fail to deploy / destroy Jul 20, 2023
mergify bot pushed a commit to cdklabs/aws-cdk-notices that referenced this issue Jul 20, 2023
bmoffatt pushed a commit to bmoffatt/aws-cdk that referenced this issue Jul 29, 2023
Ran npm-check-updates and yarn upgrade to keep the `yarn.lock` file up-to-date.

Fixes aws#26325
@pahud pahud changed the title ‼️ NOTICE (custom-resources): various custom resources may fail to deploy / destroy custom-resources: various custom resources may fail to deploy / destroy Aug 3, 2023
@pahud pahud added p1 and removed p0 labels Aug 3, 2023
@mrgrain mrgrain changed the title custom-resources: various custom resources may fail to deploy / destroy ‼️ NOTICE custom-resources: various custom resources may fail to deploy / destroy Aug 3, 2023
@mrgrain mrgrain closed this as completed Aug 3, 2023
@jedkass

jedkass commented Aug 27, 2023

Is this related?

ReferenceError: TextDecoder is not defined

      2 | Object.defineProperty(exports, "__esModule", { value: true });
      3 | exports.CodePipelineUtils = void 0;
    > 4 | const aws_codepipeline_actions_1 = require("aws-cdk-lib/aws-codepipeline-actions");
        |                                    ^
      5 | const codePipeline_1 = require("../constants/codePipeline");
      6 | const aws_lambda_1 = require("aws-cdk-lib/aws-lambda");
      7 | const aws_iam_1 = require("aws-cdk-lib/aws-iam");

      at Object.<anonymous> (../node_modules/aws-cdk-lib/custom-resources/lib/aws-custom-resource/runtime/shared.js:1:340)
      at Object.<anonymous> (../node_modules/aws-cdk-lib/custom-resources/lib/aws-custom-resource/runtime/index.js:1:171)
      at Object.<anonymous> (../node_modules/aws-cdk-lib/custom-resources/lib/aws-custom-resource/aws-custom-resource.js:1:434)
      at Object.<anonymous> (../node_modules/aws-cdk-lib/custom-resources/lib/aws-custom-resource/index.js:1:649)
      at Object.<anonymous> (../node_modules/aws-cdk-lib/custom-resources/lib/index.js:1:649)
      at Object.<anonymous> (../node_modules/aws-cdk-lib/custom-resources/index.js:1:649)
      at Object.<anonymous> (../node_modules/aws-cdk-lib/aws-events-targets/lib/log-group-resource-policy.js:1:198)
      at Object.<anonymous> (../node_modules/aws-cdk-lib/aws-events-targets/lib/log-group.js:1:304)
      at Object.<anonymous> (../node_modules/aws-cdk-lib/aws-events-targets/lib/index.js:1:1189)
      at Object.<anonymous> (../node_modules/aws-cdk-lib/aws-events-targets/index.js:1:649)
      at Object.<anonymous> (../node_modules/aws-cdk-lib/aws-codepipeline-actions/lib/codecommit/source-action.js:1:336)
      at Object.<anonymous> (../node_modules/aws-cdk-lib/aws-codepipeline-actions/lib/codebuild/build-action.js:1:492)
      at Object.<anonymous> (../node_modules/aws-cdk-lib/aws-codepipeline-actions/lib/index.js:1:887)
      at Object.<anonymous> (../node_modules/aws-cdk-lib/aws-codepipeline-actions/index.js:1:649)
.
.
.

I can open a separate issue for this if not, but I'm seeing this all of a sudden and can't figure out why.

@mrgrain
Contributor

mrgrain commented Aug 27, 2023

Could be! Please open a separate issue and we will start investigating.

@MrArnoldPalmer MrArnoldPalmer unpinned this issue Sep 11, 2023
4 participants