
(core): cdk diff fails with Need to perform AWS calls for account XXX, but the current credentials are for YYY #28690

Open
SamuraiPrinciple opened this issue Jan 12, 2024 · 23 comments
Labels
@aws-cdk/core (Related to core CDK functionality) · bug (This issue is a bug.) · effort/medium (Medium work item – several days of effort) · p2

Comments

@SamuraiPrinciple

Describe the bug

When running cdk diff on a project with stacks that belong to multiple AWS accounts (bootstrapped so that an IAM role is assumed by CDK), the following error is reported:

[100%] fail: Need to perform AWS calls for account XXX, but the current credentials are for YYY
Failed to create change set with error: 'Failed to publish one or more assets. See the error messages above for more information.', falling back to no change-set diff

This only happens since version 2.119.0 (2.120.0 is affected too).

Running cdk deploy for the same stack works correctly.

Expected Behavior

cdk diff should work correctly as before

Current Behavior

cdk diff fails

Reproduction Steps

Will try to provide an isolated example later, but this only happens for two stacks (out of ~10 identical ones across different accounts).

Possible Solution

No response

Additional Information/Context

No response

CDK CLI Version

2.120.0 (build 58b90c4)

Framework Version

No response

Node.js Version

v20.10.0

OS

Linux

Language

TypeScript

Language Version

No response

Other information

No response

@SamuraiPrinciple SamuraiPrinciple added bug This issue is a bug. needs-triage This issue or PR still needs to be triaged. labels Jan 12, 2024
@github-actions github-actions bot added the @aws-cdk/core Related to core CDK functionality label Jan 12, 2024
@pahud
Contributor

pahud commented Jan 16, 2024

If you deploy from account A to account B, generally you will need:

  1. Bootstrap account B with --trust or --trust-for-lookup. See the docs for details.
  2. The IAM identity that executes cdk commands from account A might need the relevant privileges to assume the roles in account B.
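For example, a minimal sketch of that bootstrap command (the account IDs, region, and execution policy below are placeholders, not taken from this issue):

# run with credentials for account B; account A is 111111111111 here
# (use --trust-for-lookup instead of --trust if account A should only be allowed to do lookups)
cdk bootstrap aws://222222222222/us-east-1 \
  --trust 111111111111 \
  --cloudformation-execution-policies arn:aws:iam::aws:policy/AdministratorAccess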

Please provide a minimal code snippet/sample that reproduces this issue so we can investigate.

@pahud pahud added p2 response-requested Waiting on additional info and feedback. Will move to "closing-soon" in 7 days. effort/medium Medium work item – several days of effort and removed needs-triage This issue or PR still needs to be triaged. labels Jan 16, 2024
@SamuraiPrinciple
Author

SamuraiPrinciple commented Jan 16, 2024

Hi @pahud - thanks for the reply. I'm still in the process of providing an isolated repro, but just a couple of details:

  • I believe this is not a bootstrap/privileges problem (it happens with the widest of privileges too)

  • this only started happening in v2.119.0, and it seems to be related to commit 10ed194 and the newly introduced --change-set flag (on by default)

  • it seems to happen only on stacks with lots of resources, but I will confirm soon

@github-actions github-actions bot removed the response-requested Waiting on additional info and feedback. Will move to "closing-soon" in 7 days. label Jan 16, 2024
@SamuraiPrinciple
Author

SamuraiPrinciple commented Jan 16, 2024

Below is a self-contained CDK app that demonstrates the issue:

#!/usr/bin/env node
import * as cdk from 'aws-cdk-lib';
import {
  Duration,
  Stack,
  StackProps,
  aws_cloudwatch as cloudwatch,
  aws_route53 as route53,
} from 'aws-cdk-lib';
import { Construct } from 'constructs';

class ReproStack extends Stack {
  constructor(scope: Construct, id: string, props: StackProps) {
    super(scope, id, props);
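    // Note: 55 health-check/alarm pairs keep the synthesized template just under the
    // size at which the CLI has to publish it to S3 for the change-set diff;
    // 56 pushes it over (see the diff outputs below).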
    Array.from({ length: 55 }, (_, i) => `${i}.com`).forEach(
      (domainName, i) =>
        new cloudwatch.Alarm(this, `CloudwatchAlarm-${i}`, {
          actionsEnabled: true,
          alarmDescription: `${domainName} not healthy`,
          evaluationPeriods: 1,
          threshold: 1.0,
          datapointsToAlarm: 1,
          comparisonOperator: cloudwatch.ComparisonOperator.LESS_THAN_THRESHOLD,
          metric: new cloudwatch.Metric({
            namespace: 'AWS/Route53',
            metricName: 'HealthCheckStatus',
            period: Duration.minutes(1),
            statistic: 'Minimum',
            dimensionsMap: {
              HealthCheckId: new route53.CfnHealthCheck(this, `Route53HealthCheck-${i}`, {
                healthCheckConfig: {
                  type: 'HTTPS_STR_MATCH',
                  fullyQualifiedDomainName: domainName,
                  resourcePath: '/api/info/healthz',
                  searchString: 'pass (',
                },
                healthCheckTags: [{ key: 'Name', value: domainName }],
              }).ref,
            },
          }),
        })
    );
  }
}

const app = new cdk.App();

new ReproStack(app, 'Repro', {
  env: { account: 'YYYYY', region: 'us-east-1' },
});

The AWS account XXXXX is used to run the app, and the resources are provisioned in AWS account YYYYY (bootstrapping is done as you suggested).

After running cdk deploy, cdk diff outputs the following:


> aws-infrastructure@0.1.0 diff
> cdk diff Repro

Stack Repro
Creating a change set, this may take a while...
There were no differences

✨  Number of stacks with differences: 0

If I increase the value of length in the Array.from call from 55 to 56, and then run cdk diff, the output I get is:

> aws-infrastructure@0.1.0 diff
> cdk diff Repro

Stack Repro
[100%] fail: Need to perform AWS calls for account YYYYY, but the current credentials are for XXXXX
Failed to create change set with error: 'Failed to publish one or more assets. See the error messages above for more information.', falling back to no change-set diff
Resources
[+] AWS::Route53::HealthCheck Route53HealthCheck-55 Route53HealthCheck55 
[+] AWS::CloudWatch::Alarm CloudwatchAlarm-55 CloudwatchAlarm553184D097 


✨  Number of stacks with differences: 1

Decreasing length to 54 yields:

> aws-infrastructure@0.1.0 diff
> cdk diff Repro

Stack Repro
Creating a change set, this may take a while...
Resources
[-] AWS::Route53::HealthCheck Route53HealthCheck-54 Route53HealthCheck54 destroy
[-] AWS::CloudWatch::Alarm CloudwatchAlarm-54 CloudwatchAlarm5427654D1E destroy


✨  Number of stacks with differences: 1

@SamuraiPrinciple
Author

One additional comment - we have a few different stacks/constructs behaving in this way, so I don't think it's related to the kind of resources being provisioned, just their number.

@mrlikl
Contributor

mrlikl commented Jan 16, 2024

The same behaviour is being observed while executing diff; however, cdk deploy works fine for cross-account deployments.

The setup includes bootstrapping org accounts using a stack set, hence the toolkit stack has a name similar to StackSet-CDKToolkit-xxxxx.

While diff is executing, it tries to look up a stack named CDKToolkit, which fails:

Call failed: describeStacks({"StackName":"CDKToolkit"}) => Stack with id CDKToolkit does not exist (code=ValidationError)

The deployment works fine, however.
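One thing that may be worth checking (untested here): the CLI has a global --toolkit-stack-name option for non-default bootstrap stack names, e.g. (MyStack is a placeholder):

cdk diff --toolkit-stack-name StackSet-CDKToolkit-xxxxx MyStack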

@pahud
Contributor

pahud commented Jan 16, 2024

Unfortunately I was not able to reproduce it.

Can you try cdk diff -v and show me the relevant error or warning messages?

You should see something like this, indicating that account A is assuming the lookup-role from account B, and it should work:

Stack DummyStack
[18:21:35] Retrieved account ID 111111111111 from disk cache
[18:21:35] Assuming role 'arn:aws:iam::222222222222:role/cdk-hnb659fds-lookup-role-222222222222-us-east-1'.
Resources
[+] AWS::Route53::HealthCheck Route53HealthCheck-55 Route53HealthCheck55 
[+] AWS::CloudWatch::Alarm CloudwatchAlarm-55 CloudwatchAlarm553184D097 


✨  Number of stacks with differences: 1
[18:21:37] Reading cached notices from /home/mde-user/.cdk/cache/notices.json

@pahud pahud added the response-requested Waiting on additional info and feedback. Will move to "closing-soon" in 7 days. label Jan 16, 2024
@SamuraiPrinciple
Author

SamuraiPrinciple commented Jan 16, 2024

Will do - any chance you could try with a larger value of length (say 200)? It could be some data race/request latency issue.

@SamuraiPrinciple
Author

Also, not sure if it's clear from the previous comments, but passing --change-set false makes the error go away (it makes diff behave like it did before v2.119.0).
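For example, with the Repro stack from above:

cdk diff Repro --change-set false

(--no-change-set should be equivalent.)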

@SamuraiPrinciple
Author

> Unfortunately I was not able to reproduce it.
>
> Can you try cdk diff -v and show me the relevant error or warning messages?

Also, could you please confirm your CDK version and/or the command line args you use to run cdk diff, as I can't see Creating a change set, this may take a while... in the output you've provided?

@rix0rrr
Contributor

rix0rrr commented Jan 16, 2024

It looks like the diff proceeds; the messaging is just unnecessarily scary. We will tone down the error messaging.

@mrgrain
Contributor

mrgrain commented Jan 16, 2024

@SamuraiPrinciple Your issue title and description imply the command fails. However, the output log you've posted implies that the command is working; it's just falling back to the previous behavior (not using a change set) and printing an understandably scary message.

Can you confirm which is the case?

@SamuraiPrinciple
Author

The diff is indeed computed, but with that (error?) message being shown.

@mrgrain
Contributor

mrgrain commented Jan 16, 2024

> The diff is indeed computed, but with that (error?) message being shown.

Thanks for the confirmation. So the unfortunate message aside, nothing is actually broken on your end?

@SamuraiPrinciple
Author

SamuraiPrinciple commented Jan 16, 2024

Correct. If this is indeed expected behaviour, feel free to close the issue.

@mrgrain
Contributor

mrgrain commented Jan 16, 2024

> Correct. If this is indeed expected behaviour, feel free to close the issue.

Thank you. We are going to address the messaging and look into it further. I wanted to make sure this receives the right priority. For example, if the command had failed completely, we would have reverted or released a hotfix.

Definitely keeping this open for further investigation.

@github-actions github-actions bot removed the response-requested Waiting on additional info and feedback. Will move to "closing-soon" in 7 days. label Jan 16, 2024
@scarytom
Contributor

scarytom commented Mar 7, 2024

I believe I have some insight here.

We are experiencing the same issue, but we only get the error when the stack in question is larger than 50KiB, so I believe the issue is that the code which uploads the template to S3 is not respecting the need to assume a role in the target account.

As the diff could not create a change set, it then bases the diff on template differences, which is not desirable. I would therefore consider this a bug, rather than just an issue with messaging.
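For context: CloudFormation only accepts a template inline in the API request up to 51,200 bytes; anything larger has to be uploaded to S3 and passed by URL, which would explain why only large stacks hit the publishing (and role-assumption) path. A quick way to check which side of that limit a synthesized stack falls on (DemoStack is the stack from the log below; adjust for yours):

wc -c cdk.out/DemoStack.template.json   # over 51200 bytes means the change-set diff has to publish it to S3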

@scarytom
Contributor

scarytom commented Mar 7, 2024

Here is the error we get, with trace debug on, if that helps.

[11:02:53] Storing template in S3 at: https://cdk-demo-infra-assets-XXX-us-east-1.s3.us-east-1.amazonaws.com/cdk/DemoStack/12111f5bf4b71f77e545882f66beabc16874487602d46cf99a272a01fbc58657.yml
[11:02:53] [0%] start: Publishing 12111f5bf4b71f77e545882f66beabc16874487602d46cf99a272a01fbc58657:current
[11:02:53] [trace] SdkProvider#forEnvironment()
[11:02:53] [trace]   SdkProvider#resolveEnvironment()
[11:02:53] [trace]   SdkProvider#obtainBaseCredentials()
[11:02:53] [trace]     SdkProvider#defaultAccount()
[11:02:53] [trace]     SdkProvider#defaultCredentials()
[100%] fail: Need to perform AWS calls for account XXX, but the current credentials are for YYY
[11:02:53] Failed to publish one or more assets. See the error messages above for more information.
Could not create a change set, will base the diff on template differences (run again with -v to see the reason)

@cslogan-red

cslogan-red commented Apr 10, 2024

@pahud @mrgrain - as @scarytom mentioned, this behavior appears to be a bug in the diff's ability to properly assume the trust role required to download/upload the diff to the metadata directory in the target account, so the CDK falls back to doing a diff with the local disk cache or the like. This is breaking behavior in an automated environment where CDKPipelines has a single deployment pipeline account that uses a trust role to deploy resources to each target account in the pipeline.

I've just upgraded our pipelines from 2.88.0 to 2.136.0; prior to the upgrade I was not experiencing the error locally.

If I locally assume the creds of my pipelines account and run a verbose diff AS the pipelines account against a target account on my machine (something I do all the time), the diff fails to assume the trust role in the target account and gives the same error message as above, but the diff still succeeds using the disk cache.

If I then run a deploy of some change to the target account, again from my local machine AS the pipelines account using those same creds, it succeeds (like it did before). But if I let the change roll through my CDKPipeline in CodeBuild, the pipeline build does not detect the changes that I deployed locally (AS the pipelines account), because the metadata in the target account from my previous deploy is not getting updated due to this failure.

Previously, in 2.88.0, the local CDK diff & deploy against the target account AS the pipelines account would result in a no-op once the deploy reached that stage in the pipeline because it would see the templates are already the same and have been updated.

To summarize:

  • Pipeline Account -> trust with many target accounts, still working
  • Pipeline Account runs CDKPipeline and deploys to all target accounts via this trust, still working
  • Assume Pipeline Account creds locally to run a diff AS the Pipeline Account against target CF stack, fails
  • Assume Pipeline Account creds locally to run a deploy AS the Pipeline Account against target CF stack, still working
  • Allow the already deployed change to flow through the pipeline, fails -> it tries to redeploy the same change that I deployed locally because the metadata wasn't correctly updated during the diff & deploy I did locally

Because the last one fails, even if I can run a successful deploy locally, the next time a deploy rolls through the pipeline, the metadata hasn't reflected my local deployment and the pipeline tries to duplicate the changes.

This is blocking for us and I've rolled our infra package back to the previous rev, but this appears to be the behavior I'm seeing.

@TheRealAmazonKendra
Contributor

TheRealAmazonKendra commented Apr 10, 2024

@cslogan-red I'm wondering if something else is in play here, or perhaps I'm just not fully understanding your workflow. The output of diff has no bearing on the deployment. The commands are not at all coupled. Even if the change-set version of diff fails, the original diff still runs.

Do you have additional logic that first runs diff and then only runs deploy if changes are present, and are you seeing different diff outputs depending on the environment? Also, by the metadata in the target account, do you mean CDKMetadata?

There have been quite a lot of changes since 2.88.0, so to help narrow this down, if you have the time, could you try this with version 2.118.0? That's the version just before we started using the change-set diff.

@TheRealAmazonKendra
Contributor

> We are experiencing the same issue, but we only get the error when the stack in question is larger than 50KiB, so I believe the issue is that the code which uploads the template to S3 is not respecting the need to assume a role in the target account.
>
> As the diff could not create a change set, it then bases the diff on template differences, which is not desirable. I would therefore consider this a bug, rather than just an issue with messaging.

Yep. If you are specifically wanting the change-set diff and it's failing, even with the fallback to the classic diff succeeding, it's definitely a bug. And I think it's likely you're correct in your assessment of what's going on here.

@cslogan-red

@TheRealAmazonKendra thank you for responding - I was in these trenches for AWS Amplify, so I know what it's like from your side.

It's more that the diff & deploy locally, when using a trust account, appears to have a failing CDK metadata upload/update, based on the error I am seeing with the diff and the failed change set upload (only on the diff); I'm jumping to the conclusion that the same error may be happening, but silenced, on the deploy.

The short version is that it appears to be a bug when two levels of IAM role hierarchy are assumed... My pipelines accounts have trusts to their target deployment accounts (one level of role assumption via CDK), and I am able to assume the creds of any of my pipelines accounts locally, so, as the pipelines account, I can run a diff & deploy against a target account manually from my machine (a second level of role assumption via CDK); this worked w/o error on the earlier CDK version. Then, in my CDKPipeline pipelines, any stage that was already deployed to manually (like dev, for a manual smoke test or a canary) AS one of the pipelines accounts would result in a no-op once the deploy reached that stage, as the templates were the same.

I am seeing an error once I let my release roll through my pipelines: envs that I manually diffed & deployed to locally (AS the pipelines trust account) on this latest release attempt to redeploy the changes I've already deployed, and so the CF stack fails because the resources already exist (it's trying to apply the change set that I've already deployed locally as the same pipelines account; the previous CDK rev did not have this issue).

My infra repo packages the CDK diff & deploy commands into separate NPM commands (to apply a linter and minification, but the final command is just a cdk diff or cdk deploy), and my CDKPipelines run the same commands.

I'll take another look tomorrow to see if I can see anything else, but I know that if I do the same thing in my dev env with the manual diff & deploy locally as the pipelines account, I can repro this failure message, and the resulting pipeline CloudFormation stack failure when the pipeline tries to roll the current state of the codebase through the already-deployed env.

@cslogan-red

cslogan-red commented Apr 11, 2024

@TheRealAmazonKendra so I can confirm that reverting back to 2.88.0 resolves the [100%] fail: Need to perform AWS calls for account XXX, but the current credentials are for YYY error that I see on a diff in 2.136.0 (and that I think may also be happening on a deploy, but silently eaten). A diff on 2.88.0, run as my pipelines account against my dev env where I manually deployed a change with 2.136.0, does not see the changes deployed to the stack from 2.136.0. So somewhere (I'm assuming when change sets are created at deploy time; I'm not familiar with that internal CDK logic) the CDK is failing to upload the change set diff to the target account, AS the trust account, when the trust account is already assumed locally, and you can get into a state where different machines assuming the creds of a trust account are not properly storing the state of the previous deploy and potentially attempt to redeploy the same change set.

If all I do is revert my primary CDK infra repo back to 2.88.0 (along w/ a couple of alpha cdk-monitoring-constructs deps), everything starts working as expected: I can assume any of my pipelines accounts' creds locally on my machine to run a diff & deploy AS the pipelines account to a given target account, and once my pipeline's deploy reaches that stage, it results in a no-op because the diffs show the templates are the same.

Below is my config for my diff & deploy commands that I run locally:

    "build": "tsc && npm run pretty && npm run lint",
    "clean": "rm -rf cdk.out && rm -rf cdk.out",
...
    "mini-diff-lookup": "npm run clean && npm run build && cdk synth --json --output cdk.out -c SSMvalueFromLookup=true && node minify-cfn-templates.js && cdk diff -a cdk.out",
    "mini-deploy-lookup": "npm run clean && npm run build && cdk synth --json --output cdk.out -c SSMvalueFromLookup=true && node minify-cfn-templates.js && cdk deploy -a cdk.out",

And what runs in the CDKPipelines build container:

        installCommands: ['npm ci'],
        commands: ['npx cdk synth -c SSMvalueFromLookup=true', 'node minify-cfn-templates.js'],

The lookup param is just a flag to grab dynamic env config from SSM and the minify script just minifies the output JSON templates. Lmk if you see something obvious with the commands that would result in this behavior? I've been running this setup in these pipelines for over a year w/o issue and it's common for an engineer to diff & deploy something locally AS the pipelines account for testing and then release the change via a release branch once safe, and any already deployed envs result in no-ops because the diffs show the templates have no changes.

@cslogan-red

@TheRealAmazonKendra @pahud I attempted to update to the latest CDK version and lib today, and the behavior is still fundamentally different due to the changes introduced in v2.119.0 in commit 10ed194; v2.118.0 and previous versions work as expected.

This is really easy to reproduce:

  1. Clean install 2.118.0
  2. Create a pipelines account (the trust account)
  3. Create a target account, bootstrap the target account using the --trust flag to the pipelines account
  4. Paste temp creds for the pipelines account into a new terminal, run a cdk diff AS the pipelines account against the target account, and notice there is no error during the diff command (see the command sketch after these steps):
[screenshot: cdk diff output on 2.118.0 showing no error]
  5. If you now run a cdk deploy to the target account from your local machine as the pipelines account, it will work as expected - however, most importantly, if you let this change roll through a CDKPipeline it will no-op at this stage and see the changes you've already deployed locally as the pipelines account, as expected
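A command-level sketch of steps 3-4 (the account IDs and stack name are placeholders):

# step 3: run with credentials for the target account
cdk bootstrap aws://TARGET_ACCOUNT/us-east-1 --trust PIPELINES_ACCOUNT --cloudformation-execution-policies arn:aws:iam::aws:policy/AdministratorAccess
# step 4: in a terminal where the pipelines account's temporary credentials are exported
cdk diff MyTargetStack   # clean on 2.118.0; later versions print the credentials error before falling back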

If you repeat the above steps but start with a clean install of anything after 2.118.0, the cdk diff command will display this error:

[screenshot: cdk diff output on a version after 2.118.0 showing the credentials error]

If you then deploy this change locally, it will still succeed (w/o the error), but in the same scenario of letting it roll through a CDKPipeline, the pipeline will not see the changes you've deployed locally and CloudFormation will attempt to duplicate the change set you've already deployed locally.

I haven't had time to dig into the commit that created this behavior to submit a PR w/ the fix; for now I'm just leaving my infra at 2.118.0 to avoid the problem, but I wanted to bump it on your radar, as people with more complex implementations that rely on a pipelines trust account are effectively blocked from running local one-off diffs & deploys when those are necessary.
