
CDK deploy: Lambda LogRetention resources fail with rate exceeded errors #26837

Closed
jaapvanblaaderen opened this issue Aug 22, 2023 · 11 comments · Fixed by #26858
Assignees: mrgrain
Labels: @aws-cdk/aws-lambda, bug, effort/medium, p0

Comments

jaapvanblaaderen (Contributor) commented Aug 22, 2023

Describe the bug

Deploying multiple stacks with a lot of Lambda functions results in a lot of CloudWatch Log groups being created by CDK. CDK provides retry options through the `logRetentionRetryOptions` property of a Lambda Function construct to prevent rate exceeded errors.

Up to CDK 2.89.0 this worked fine. In CDK 2.90.0, the retry implementation changed in this PR, and the rate exceeded errors occur again.
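
For reference, this is roughly how the retry options are configured (a minimal sketch, assuming `aws-cdk-lib` v2; the stack name, runtime, and retry count are illustrative, not taken from my project):

```ts
// Minimal sketch, assuming aws-cdk-lib v2; names and values are illustrative.
import * as cdk from 'aws-cdk-lib';
import * as lambda from 'aws-cdk-lib/aws-lambda';
import * as logs from 'aws-cdk-lib/aws-logs';

const app = new cdk.App();
const stack = new cdk.Stack(app, 'ExampleStack');

new lambda.Function(stack, 'Hello', {
  runtime: lambda.Runtime.NODEJS_18_X,
  handler: 'index.handler',
  code: lambda.Code.fromInline('exports.handler = async () => "hello";'),
  // Creates a Custom::LogRetention resource for the function's log group.
  logRetention: logs.RetentionDays.ONE_WEEK,
  // Retry options for that custom resource; this kept "Rate exceeded"
  // errors away up to 2.89.0.
  logRetentionRetryOptions: {
    maxRetries: 10,
  },
});
```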

Expected Behavior

The retry mechanism should function properly, the same way it did in CDK 2.89.0 and earlier. Stacks should deploy without issues.

Current Behavior

Stacks cannot be deployed with CDK 2.90.0 and up, as they fail with a Rate exceeded error:

LogRetentionRateLimitStack3 |  57/183 | 11:34:35 AM | CREATE_FAILED        | Custom::LogRetention        | hello_11/LogRetention (hello11LogRetention0EC20DD0) Received response status [FAILED] from custom resource. Message returned: Rate exceeded (RequestId: b84c54c4-3053-487a-82b9-48671904d05a)

Reproduction Steps

I created a small test project to reproduce the issue: https://github.com/jaapvanblaaderen/log-retention-rate-limit. With this simple setup, the issue can easily be reproduced by deploying the stacks in parallel as described in the readme of the repo. Even after increasing the number of retries to 50, it still fails.

Testing it against 2.89.0 (which works fine) is possible by checking out the cdk-289 tag.
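
A rough sketch of what the reproduction setup looks like (the actual code in the repo may differ; stack names, counts, and the runtime are illustrative):

```ts
// Rough sketch of the reproduction: several stacks, each with many functions,
// every function creating its own Custom::LogRetention resource.
import * as cdk from 'aws-cdk-lib';
import * as lambda from 'aws-cdk-lib/aws-lambda';
import * as logs from 'aws-cdk-lib/aws-logs';

const app = new cdk.App();

for (let s = 1; s <= 3; s++) {
  const stack = new cdk.Stack(app, `LogRetentionRateLimitStack${s}`);
  for (let f = 1; f <= 25; f++) {
    new lambda.Function(stack, `hello_${f}`, {
      runtime: lambda.Runtime.NODEJS_18_X,
      handler: 'index.handler',
      code: lambda.Code.fromInline('exports.handler = async () => "hello";'),
      logRetention: logs.RetentionDays.ONE_WEEK,
      // Even a generous retry budget does not help on 2.90.0+.
      logRetentionRetryOptions: { maxRetries: 50 },
    });
  }
}
```

Deploying all stacks in parallel (for example with `cdk deploy --all --concurrency 3`) triggers the failures.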

Possible Solution

Revert to the retry mechanism as implemented in CDK 2.89.0, or troubleshoot why the new retry mechanism doesn't work properly.

Additional Information/Context

I'm actually a bit surprised and disappointed that this issue popped up again. I fixed it myself in CDK three years ago, and now we have a regression that causes it to fail again.

CDK CLI Version

2.90.0

Framework Version

No response

Node.js Version

18

OS

OSX

Language

Typescript

Language Version

No response

Other information

Related issues:

jaapvanblaaderen added the bug and needs-triage labels Aug 22, 2023
github-actions bot added the @aws-cdk/aws-lambda label Aug 22, 2023
mrgrain self-assigned this Aug 22, 2023
mrgrain (Contributor) commented Aug 22, 2023

Thanks @jaapvanblaaderen for reporting and apologies for this regression. The retry mechanism changed from SDK v2 to SDK v3, and was supposed to be smarter. But clearly something did not work. Looking into it.

jaapvanblaaderen (Contributor, Author) commented

Thanks @mrgrain, please let me know if I can assist in any way.

pahud added the p2 and effort/medium labels and removed the needs-triage label Aug 22, 2023
peterwoodworth added the p0 label and removed the p2 label Aug 22, 2023
mrgrain (Contributor) commented Aug 24, 2023

With the patched version from my PR, your reproduction example now succeeds! 🥳

jaapvanblaaderen (Contributor, Author) commented

That sounds great!

mergify bot closed this as completed in #26858 Aug 24, 2023
mergify bot pushed a commit that referenced this issue Aug 24, 2023:

The LogRetention Custom Resource used to be able to handle server-side throttling when a lot of requests to the CloudWatch Logs service are made at the same time.
Handling of this error case got lost during the migration to SDK v3.

If we have a lot of `LogRetention` Custom Resources in a _single_ Stack, CloudFormation apparently applies some internal brakes to the amount of parallelism. For example, it appears that resources are batched into smaller groups that need to be completed before the next group is provisioned. And within the groups there appears to be an ever so slight delay between individual resources. Together this is enough to avoid rate limiting in most circumstances.

**Therefore, in practice this issue only occurs when multiple stacks are deployed in parallel.**

To test this scenario, I have added support for `integ-runner` to deploy all stacks of a test case concurrently.
Support for arbitrary command args already existed, but the `concurrency` option needed to be explicitly included.

I then created an integration test that deploys 3 stacks with 25 LogRetention resources each.
This triggers the error cases described in #26837.

The fix itself is twofold (a rough sketch follows after this list):
- Pass the `maxRetries` prop value to the SDK client to increase the number of attempts of the SDK-internal throttling handling, but also enforce a minimum for these retries, since they might catch additional retryable failures that our custom outer loop does not account for.
- Explicitly catch `ThrottlingException` errors in the outer retry loop.
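
A hypothetical sketch of this twofold approach (not the actual patch), assuming the handler uses `@aws-sdk/client-cloudwatch-logs`; the helper name, the minimum attempt count, and the backoff are illustrative:

```ts
// Hypothetical sketch only, not the actual patch from the PR.
import {
  CloudWatchLogsClient,
  PutRetentionPolicyCommand,
} from '@aws-sdk/client-cloudwatch-logs';

const maxRetries = 10; // would come from the `maxRetries` prop

// 1. Forward the retry budget to the SDK client so its built-in
//    throttling handling gets enough attempts, with a sensible minimum.
const client = new CloudWatchLogsClient({
  maxAttempts: Math.max(5, maxRetries),
});

// 2. Outer retry loop that explicitly treats ThrottlingException as retryable.
async function setRetentionWithRetries(logGroupName: string, retentionInDays: number): Promise<void> {
  for (let attempt = 0; ; attempt++) {
    try {
      await client.send(new PutRetentionPolicyCommand({ logGroupName, retentionInDays }));
      return;
    } catch (error: any) {
      if (error?.name !== 'ThrottlingException' || attempt >= maxRetries) {
        throw error;
      }
      // Simple exponential backoff before the next attempt.
      await new Promise((resolve) => setTimeout(resolve, 100 * 2 ** attempt));
    }
  }
}
```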

Closes #26837

----

*By submitting this pull request, I confirm that my contribution is made under the terms of the Apache-2.0 license*

nikolay-radkov commented

The issue was resolved on our side by updating to 2.93.0, but it has started to occur again without changing the version. Is it reproducible again on your side @jaapvanblaaderen? We have the same setup as described in the issue.

mrgrain (Contributor) commented Aug 30, 2023

Thanks @nikolay-radkov for the report! Have you tried setting a higher number for maxRetries? The default might well be too conservative.

mrgrain (Contributor) commented Sep 1, 2023

@jaapvanblaaderen Did you ever get around to confirming if this has been resolved for you?

Never mind, this still needs to be released. The release of the fix should be imminent.

wwjhu commented Sep 3, 2023

I can confirm that after upgrading to 2.94, the test project can be deployed without rate limit errors. We will upgrade our services to 2.94 in the coming weeks, but so far so good. Nice job!

mrgrain (Contributor) commented Sep 4, 2023

> I can confirm that after upgrading to 2.94, the test project can be deployed without rate limit errors. We will upgrade our services to 2.94 in the coming weeks, but so far so good. Nice job!

Thanks, but it should have worked in the first place!
Once you have upgraded your services, could you please let me know if it works with a real-world project and whether you have to customize maxRetries? We are very open to adjusting the default number of retries to one that works for most users.

wwjhu commented Sep 4, 2023

> Once you have upgraded your services, could you please let me know if it works with a real-world project and whether you have to customize maxRetries? We are very open to adjusting the default number of retries to one that works for most users.

Sure, will keep you posted.
