regression issue: Pcluster 3.9 cli - unofficial ami(s) and ec2-imagebuilder #6273

Closed
zeekus opened this issue Jun 6, 2024 · 5 comments

@zeekus

zeekus commented Jun 6, 2024

Issue Description

When using AWS ParallelCluster to build custom images with a custom parent AMI, the image build process fails with an error related to the installation of the Lustre client modules. Specifically, the error message states that no candidate version is available for the lustre-client-modules-6.5.0-1017-aws package.

This issue does not occur when using the official supported AMIs provided by AWS ParallelCluster.

Reproduction Steps

  1. Use the pcluster build-image command to create a custom image with a custom parent AMI (e.g., Ubuntu 22.04).
  2. During the image build process, the installation of the Lustre client modules fails with the following error:
Stdout: [2024-06-08T08:17:48+00:00] FATAL: Chef::Exceptions::Package: lustre[Install FSx options] (aws-parallelcluster-environment::install line 22) had an error: Chef::Exceptions::Package: apt_package[lustre-client-modules-6.5.0-1017-aws, lustre-client-modules-aws, initramfs-tools] (aws-parallelcluster-environment::install line 27) had an error: Chef::Exceptions::Package: No candidate version available for lustre-client-modules-6.5.0-1017-aws
  3. The image build status is marked as BUILD_FAILED.
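
For triage, the kernel release that the missing package targets can be pulled out of the Chef error line. This is a minimal sketch: the helper name `missing_lustre_kernel` and the regular expression are illustrative assumptions, not part of ParallelCluster or Chef, and the pattern assumes the package is named `lustre-client-modules-<kernel release>` as in the error above.

```python
import re
from typing import Optional

# Assumption: the package name has the form lustre-client-modules-<kernel release>,
# matching the error message quoted in this issue.
_MISSING_PACKAGE = re.compile(
    r"No candidate version available for lustre-client-modules-(\S+)"
)


def missing_lustre_kernel(log_line: str) -> Optional[str]:
    """Return the kernel release the missing Lustre package targets, or None."""
    match = _MISSING_PACKAGE.search(log_line)
    return match.group(1) if match else None


line = (
    "Chef::Exceptions::Package: No candidate version available "
    "for lustre-client-modules-6.5.0-1017-aws"
)
print(missing_lustre_kernel(line))  # 6.5.0-1017-aws
```

The extracted release can then be compared with the output of `uname -r` on the parent AMI.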

Affected Versions

  • AWS ParallelCluster version 3.9.2 (and potentially other versions)

Additional Notes

  • The issue does not occur when using the official supported AMIs provided by AWS ParallelCluster.
  • The AWS support team has been notified and is investigating the root cause of this behavior.
  • Further updates and potential workarounds or resolutions will be provided by the AWS support team.
@zeekus zeekus added the 3.x label Jun 6, 2024
@zeekus
Author

zeekus commented Jun 7, 2024

I think I know what is going on here. It appears that 'pcluster build-image' only works when you start from an 'official AMI'.

ref: pcluster list-official-images | less

I built two successfully from the official ami(s).

[Screenshot: image_builder_from_official_images]

Custom images seem to fail with an error similar to this. Maybe this has something to do with a recent Lustre change.

[Screenshot: build error]

@zeekus
Author

zeekus commented Jun 8, 2024

I opened a support ticket with AWS EC2 Image Builder. A change introduced to Image Builder may be preventing pcluster users from running pcluster build-image with a custom AMI. Here are the tech notes from my ticket:

source: AWS support.

From the case details, I understand that you are preparing images with AWS ParallelCluster (which uses Image Builder as its backend) for building HPC clusters, but you are encountering Chef errors that were not observed previously. You therefore wish to know whether changes to the Chef code are causing these errors. Please feel free to correct me if there is any gap in my understanding of the issue.

To verify the issue on my end, I replicated the scenario in my test environment by creating two ParallelCluster images, one from the official supported image and the other from a custom image.

Since you observed the issue on all Linux distros, I replicated it with an Ubuntu 22.04 AMI, as it is one of the AMIs that comes with the SSM agent preinstalled.

I ran the “pcluster build-image” command with a build configuration similar to yours for the AMI “ami-039bb043f0419a703”, which is the official Ubuntu 22.04 image for ParallelCluster in the us-east-1 region. I then ran the same command again with ParentImage set to a custom Ubuntu 22.04 AMI ID. To simplify the replication and isolate the issue, I did not install any additional packages on the custom AMI.

After both builds finished, I observed the same results as you. The build with the official Ubuntu 22.04 image completed successfully, but the build with the custom Ubuntu 22.04 ParentImage failed with the error below, which is the same one you observed:
~~~~~

Stdout: [2024-06-08T08:17:48+00:00] FATAL: Chef::Exceptions::Package: lustre[Install FSx options] (aws-parallelcluster-environment::install line 22) had an error: Chef::Exceptions::Package: apt_package[lustre-client-modules-6.5.0-1017-aws, lustre-client-modules-aws, initramfs-tools] (aws-parallelcluster-environment::install line 27) had an error: Chef::Exceptions::Package: No candidate version available for lustre-client-modules-6.5.0-1017-aws
~~~~~ 

I also verified the build status, as shown below:
————————
>> pcluster list-images --r us-east-1 --image-status AVAILABLE

{
  "images": [
    {
      "imageId": "ubuntu_22_official",
      "imageBuildStatus": "BUILD_COMPLETE",
      "ec2AmiInfo": {
        "amiId": "ami-04785fxxxxxxxxx"
      },
      "region": "us-east-1",
      "version": "3.9.2"
    }
  ]
}


>> pcluster list-images --r us-east-1 --image-status FAILED
{
  "images": [
    {
      "imageId": "ubuntu22custom",
      "imageBuildStatus": "BUILD_FAILED",
      "cloudformationStackStatus": "CREATE_FAILED",
      "cloudformationStackArn": "arn:aws:cloudformation:us-east-1:xxxxxxxxxxx:stack/ubuntu22custom/4a6065c0-2569-11ef-8010-123a7abb44f1",
      "region": "us-east-1",
      "version": "3.9.2"
    }
  ]
}
————————
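
When several images are being built, the JSON printed by `pcluster list-images` can be summarized with a few lines of Python. This is only a sketch: the `sample` string is trimmed from the failed-build output above, and the helper name `build_statuses` is hypothetical. The field names (`images`, `imageId`, `imageBuildStatus`) are the ones visible in that output.

```python
import json

# Sample trimmed from the failed-build output above. In practice this would be
# the text printed by `pcluster list-images`.
sample = """
{
  "images": [
    {
      "imageId": "ubuntu22custom",
      "imageBuildStatus": "BUILD_FAILED",
      "cloudformationStackStatus": "CREATE_FAILED",
      "region": "us-east-1",
      "version": "3.9.2"
    }
  ]
}
"""


def build_statuses(raw: str) -> dict:
    """Map each imageId to its imageBuildStatus."""
    return {
        image["imageId"]: image["imageBuildStatus"]
        for image in json.loads(raw)["images"]
    }


for image_id, status in build_statuses(sample).items():
    print(f"{image_id}: {status}")
```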


Keeping the above observations in mind, I have reached out to our internal team to gather more information on the root cause of this behaviour. Please note that it can take some time before we have the first response from the team due to which it would be difficult to provide an ETA for the same. However, rest assured that I will be doing my best to make sure this gets the attention it needs.

@zeekus zeekus changed the title possible bug: Pcluster 3.9 cli - rocky8 - image builder code on AWS side of things. possible bug: Pcluster 3.9 cli - unofficial ami(s) and ec2-imagebuilder Jun 8, 2024
@gmarciani
Contributor

Hi @zeekus, thanks for your interest in ParallelCluster and for reporting this issue.

The build with the vanilla AMI fails because it runs on kernel 6.5.0, which is not yet supported by the latest FSx Lustre client.

The build with the official ParallelCluster AMI 3.9.2 succeeds because it runs on kernel 6.2.0, which is supported.

If you want to build a custom AMI, you need to use a ParentImage that runs kernel 6.2.0.
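
A candidate parent AMI can be checked before starting a build by comparing its running kernel with the supported series. The sketch below assumes the check is run on an instance launched from that AMI. The constant `SUPPORTED_KERNEL_SERIES` simply restates the 6.2.0 value from this comment and may be out of date, and the helper name `kernel_is_supported` is hypothetical.

```python
import platform

# Assumption: "6.2.0" restates the kernel series named in the comment above.
# Check the current FSx for Lustre client documentation before relying on it.
SUPPORTED_KERNEL_SERIES = "6.2.0"


def kernel_is_supported(release: str, series: str = SUPPORTED_KERNEL_SERIES) -> bool:
    """True if a kernel release string belongs to the given series."""
    return release == series or release.startswith(series + "-")


if __name__ == "__main__":
    release = platform.release()  # same value that `uname -r` reports on Linux
    verdict = "supported" if kernel_is_supported(release) else "not supported"
    print(f"{release}: {verdict}")
```

With the release from the error in this issue, `kernel_is_supported("6.5.0-1017-aws")` returns `False`.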

@zeekus
Author

zeekus commented Jun 13, 2024

Thanks for the update. This seems like a 'regression issue'. I updated the title from possible bug to regression, which seems to fit.

@zeekus zeekus changed the title possible bug: Pcluster 3.9 cli - unofficial ami(s) and ec2-imagebuilder regression issue: Pcluster 3.9 cli - unofficial ami(s) and ec2-imagebuilder Jun 13, 2024
@dreambeyondorange
Contributor

Unfortunately, this is an issue with FSx Lustre's support pipeline and has nothing to do with ParallelCluster.


3 participants