Unable to bootstrap pcluster-3.10.1 on Rocky Linux 9.4 #6371

Open
rmarable-flaretx opened this issue Jul 29, 2024 · 2 comments

@rmarable-flaretx

We are unable to bootstrap a custom Rocky Linux 9.4 AMI using ParallelCluster 3.10.1.

Here is the cfn-init log stream:

    {
      "message": "2024-07-29 14:07:13,212 [ERROR] Error encountered during build of chefConfig: Command chef failed",
      "timestamp": "2024-07-29T14:07:13.212Z"
    },
    {
      "message": "Traceback (most recent call last):\n  File \"/opt/parallelcluster/pyenv/versions/3.9.19/envs/cfn_bootstrap_virtualenv/lib/python3.9/site-packages/cfnbootstrap/construction.py\", line 579, in run_config\n    CloudFormationCarpenter(config, self._auth_config, self.strict_mode).build(worklog)\n  File \"/opt/parallelcluster/pyenv/versions/3.9.19/envs/cfn_bootstrap_virtualenv/lib/python3.9/site-packages/cfnbootstrap/construction.py\", line 277, in build\n    changes['commands'] = CommandTool().apply(\n  File \"/opt/parallelcluster/pyenv/versions/3.9.19/envs/cfn_bootstrap_virtualenv/lib/python3.9/site-packages/cfnbootstrap/command_tool.py\", line 127, in apply\n    raise ToolError(u\"Command %s failed\" % name)",
      "timestamp": "2024-07-29T14:07:13.212Z"
    },
    {
      "message": "cfnbootstrap.construction_errors.ToolError: Command chef failed",
      "timestamp": "2024-07-29T14:07:13.212Z"
    },
    {
      "message": "2024-07-29 14:07:13,296 [ERROR] -----------------------BUILD FAILED!------------------------",
      "timestamp": "2024-07-29T14:07:13.296Z"
    },
    {
      "message": "2024-07-29 14:07:13,296 [ERROR] Unhandled exception during build: Command chef failed",
      "timestamp": "2024-07-29T14:07:13.296Z"
    },
    {
      "message": "Traceback (most recent call last):\n  File \"/opt/parallelcluster/pyenv/versions/3.9.19/envs/cfn_bootstrap_virtualenv/bin/cfn-init\", line 181, in <module>\n    worklog.build(metadata, configSets, strict_mode)\n  File \"/opt/parallelcluster/pyenv/versions/3.9.19/envs/cfn_bootstrap_virtualenv/lib/python3.9/site-packages/cfnbootstrap/construction.py\", line 137, in build\n    Contractor(metadata, strict_mode).build(configSets, self)\n  File \"/opt/parallelcluster/pyenv/versions/3.9.19/envs/cfn_bootstrap_virtualenv/lib/python3.9/site-packages/cfnbootstrap/construction.py\", line 567, in build\n    self.run_config(config, worklog)\n  File \"/opt/parallelcluster/pyenv/versions/3.9.19/envs/cfn_bootstrap_virtualenv/lib/python3.9/site-packages/cfnbootstrap/construction.py\", line 579, in run_config\n    CloudFormationCarpenter(config, self._auth_config, self.strict_mode).build(worklog)\n  File \"/opt/parallelcluster/pyenv/versions/3.9.19/envs/cfn_bootstrap_virtualenv/lib/python3.9/site-packages/cfnbootstrap/construction.py\", line 277, in build\n    changes['commands'] = CommandTool().apply(\n  File \"/opt/parallelcluster/pyenv/versions/3.9.19/envs/cfn_bootstrap_virtualenv/lib/python3.9/site-packages/cfnbootstrap/command_tool.py\", line 127, in apply\n    raise ToolError(u\"Command %s failed\" % name)",
      "timestamp": "2024-07-29T14:07:13.296Z"
    },
    {
      "message": "cfnbootstrap.construction_errors.ToolError: Command chef failed",
      "timestamp": "2024-07-29T14:07:13.296Z"
    }

From the system-messages log stream:

    {
      "message": "Jul 29 14:07:23 ip-10-2-34-41 cloud-init[1084]: + /opt/parallelcluster/pyenv/versions/3.9.19/envs/cfn_bootstrap_virtualenv/bin/cfn-signal --exit-code=1 '--reason=Failed to run chef recipe aws-parallelcluster-slurm::config_munge_key line 27. Please check /var/log/chef-client.log in the head node, or check the chef-client.log in CloudWatch logs. Please refer to https://docs.aws.amazon.com/parallelcluster/latest/ug/troubleshooting-v3.html for more details.' 'https://cloudformation-waitcondition-us-east-2.s3.us-east-2.amazonaws.com/arn%3Aaws%3Acloudformation%3Aus-east-2%3A227394971585%3Astack/darius/3a0f8320-4db1-11ef-a95c-0a041a247431/3a117ef0-4db1-11ef-a95c-0a041a247431/HeadNodeWaitConditionHandle20240729134822?X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Date=20240729T134828Z&X-Amz-SignedHeaders=host&X-Amz-Expires=86399&X-Amz-Credential=AKIAVRFIPK6PEIG2DZWK%2F20240729%2Fus-east-2%2Fs3%2Faws4_request&X-Amz-Signature=a7a1c96d932fa315e993bee2c2909d6ed8bbe74aa1377a0d97b064a6961a15fc' --region us-east-2 --url https://cloudformation.us-east-2.amazonaws.com",
      "timestamp": "2024-07-29T14:07:23.000Z"
    },

From the chef-client log:

    {
      "message": "    \n    ================================================================================\n    Error executing action `restart` on resource 'service[munge]'\n    ================================================================================\n    \n    Mixlib::ShellOut::ShellCommandFailed\n    ------------------------------------\n    Expected process to exit with [0], but received '1'\n    ---- Begin output of [\"/bin/systemctl\", \"--system\", \"restart\", \"munge\"] ----\n    STDOUT: \n    STDERR: Job for munge.service failed because the control process exited with error code.\n    See \"systemctl status munge.service\" and \"journalctl -xeu munge.service\" for details.\n    ---- End output of [\"/bin/systemctl\", \"--system\", \"restart\", \"munge\"] ----\n    Ran [\"/bin/systemctl\", \"--system\", \"restart\", \"munge\"] returned 1\n    \n    Resource Declaration:\n    ---------------------\n    # In /etc/chef/local-mode-cache/cache/cookbooks/aws-parallelcluster-slurm/resources/munge_key_manager.rb\n    \n     27:   declare_resource(:service, \"munge\") do\n     28:     supports restart: true\n     29:     action :restart\n     30:     retries 5\n     31:     retry_delay 10\n     32:   end unless on_docker?\n     33: end\n     34: \n    \n    Compiled Resource:\n    ------------------\n    # Declared in /etc/chef/local-mode-cache/cache/cookbooks/aws-parallelcluster-slurm/resources/munge_key_manager.rb:27:in `restart_munge_service'\n    \n    service(\"munge\") do\n      action [:restart]\n      default_guard_interpreter :default\n      declared_type :service\n      cookbook_name \"aws-parallelcluster-slurm\"\n      recipe_name \"config_munge_key\"\n      supports {:restart=>true}\n      retries 5\n      retry_delay 10\n    end\n    \n    System Info:\n    ------------\n    chef_version=18.4.12\n    platform=rocky\n    platform_version=9.4\n    ruby=ruby 3.1.2p20 (2022-04-12 revision 4491bb740a) [x86_64-linux]\n    program_name=/bin/cinc-client\n    executable=/opt/cinc/bin/cinc-client\n    ",
      "timestamp": "2024-07-29T14:07:13.246Z"
    },
    {
      "message": "[2024-07-29T14:07:13+00:00] INFO: Running queued delayed notifications before re-raising exception\n",
      "timestamp": "2024-07-29T14:07:13.246Z"
    },
    {
      "message": "Running handlers:",
      "timestamp": "2024-07-29T14:07:13.246Z"
    },
    {
      "message": "[2024-07-29T14:07:13+00:00] ERROR: Running exception handlers\n  - WriteChefError::WriteHeadNodeChefError",
      "timestamp": "2024-07-29T14:07:13.246Z"
    },
    {
      "message": "Running handlers complete",
      "timestamp": "2024-07-29T14:07:13.246Z"
    },
    {
      "message": "[2024-07-29T14:07:13+00:00] ERROR: Exception handlers complete",
      "timestamp": "2024-07-29T14:07:13.246Z"
    },
    {
      "message": "Infra Phase failed. 64 resources updated in 01 minutes 09 seconds",
      "timestamp": "2024-07-29T14:07:13.246Z"
    },
    {
      "message": "[2024-07-29T14:07:13+00:00] FATAL: Stacktrace dumped to /etc/chef/local-mode-cache/cache/cinc-stacktrace.out",
      "timestamp": "2024-07-29T14:07:13.247Z"
    },
    {
      "message": "[2024-07-29T14:07:13+00:00] FATAL: ---------------------------------------------------------------------------------------",
      "timestamp": "2024-07-29T14:07:13.247Z"
    },
    {
      "message": "[2024-07-29T14:07:13+00:00] FATAL: PLEASE PROVIDE THE CONTENTS OF THE stacktrace.out FILE (above) IF YOU FILE A BUG REPORT",
      "timestamp": "2024-07-29T14:07:13.247Z"
    },
    {
      "message": "[2024-07-29T14:07:13+00:00] FATAL: ---------------------------------------------------------------------------------------",
      "timestamp": "2024-07-29T14:07:13.247Z"
    },
    {
      "message": "[2024-07-29T14:07:13+00:00] FATAL: Mixlib::ShellOut::ShellCommandFailed: service[munge] (aws-parallelcluster-slurm::config_munge_key line 27) had an error: Mixlib::ShellOut::ShellCommandFailed: Expected process to exit with [0], but received '1'",
      "timestamp": "2024-07-29T14:07:13.247Z"
    },
    {
      "message": "---- Begin output of [\"/bin/systemctl\", \"--system\", \"restart\", \"munge\"] ----",
      "timestamp": "2024-07-29T14:07:13.247Z"
    },
    {
      "message": "STDOUT: ",
      "timestamp": "2024-07-29T14:07:13.247Z"
    },
    {
      "message": "STDERR: Job for munge.service failed because the control process exited with error code.",
      "timestamp": "2024-07-29T14:07:13.247Z"
    },
    {
      "message": "See \"systemctl status munge.service\" and \"journalctl -xeu munge.service\" for details.",
      "timestamp": "2024-07-29T14:07:13.247Z"
    },
    {
      "message": "---- End output of [\"/bin/systemctl\", \"--system\", \"restart\", \"munge\"] ----",
      "timestamp": "2024-07-29T14:07:13.247Z"
    },
    {
      "message": "Ran [\"/bin/systemctl\", \"--system\", \"restart\", \"munge\"] returned 1",
      "timestamp": "2024-07-29T14:07:17.561Z"
    }

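All of the streams above come straight from the cluster's CloudWatch log group, which remains readable even when the instance itself is unreachable. A rough AWS CLI sketch for pulling them, assuming the usual /aws/parallelcluster/<cluster-name>-<timestamp> log group naming and using the stack name (darius) and head node hostname (ip-10-2-34-41) that appear in the logs above:

    # Locate the cluster's log group (the name pattern is an assumption based on
    # the usual /aws/parallelcluster/<cluster-name>-<timestamp> convention)
    aws logs describe-log-groups \
      --log-group-name-prefix /aws/parallelcluster/darius \
      --region us-east-2

    # Dump the head node's streams (cfn-init, chef-client, system messages);
    # stream names are prefixed with the node's hostname
    aws logs filter-log-events \
      --log-group-name "/aws/parallelcluster/darius-<timestamp>" \
      --log-stream-name-prefix ip-10-2-34-41 \
      --region us-east-2 \
      --query 'events[].message' \
      --output text
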
We can't get into the head node, so unfortunately we are unable to provide the log files referenced above.
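
If recreating the cluster for debugging is an option, one way to get at those files is to disable rollback so the head node stays up after the failure, then run the checks the chef log itself asks for. A rough sketch, with placeholder cluster and config names, using the ParallelCluster 3 CLI:

    # Recreate the failing cluster but keep the head node (and the rest of the
    # stack) around when creation fails, so it can be inspected over SSH/SSM
    pcluster create-cluster \
      --cluster-name rocky9-debug \
      --cluster-configuration cluster-config.yaml \
      --rollback-on-failure false

    # On the head node, the triage the chef output points at would look
    # roughly like this
    systemctl status munge.service
    journalctl -xeu munge.service
    ls -l /etc/munge/munge.key      # munged is strict about key ownership/permissions
    cat /var/log/chef-client.log
    cat /etc/chef/local-mode-cache/cache/cinc-stacktrace.out

pcluster export-cluster-logs can also archive the CloudWatch streams after the fact, but as far as we can tell the cinc-stacktrace.out referenced above only exists on the node itself.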

For now, we are dropping back to Rocky Linux 8.

Any guidance you can provide would be appreciated.

@hanwen-pcluster
Contributor

Sorry for the late reply.

This error seems to be related to #6378

@rmarable-flaretx
Author

The munge key issue referred to in #6378 has been fixed, but Rocky Linux 9 clusters are still failing.

    Recipe: aws-parallelcluster-slurm::config_munge_key
      * munge_key_manager[manage_munge_key] action setup_munge_key[2024-08-27T14:25:47+00:00] INFO: Processing munge_key_manager[manage_munge_key] action setup_munge_key (aws-parallelcluster-slurm::config_munge_key line 73)
     (up to date)
      * execute[fetch_and_decode_munge_key] action run[2024-08-27T14:25:47+00:00] INFO: Processing execute[fetch_and_decode_munge_key] action run (aws-parallelcluster-slurm::config_munge_key line 66)

    [execute] Fetching munge key from AWS Secrets Manager: arn:aws:secretsmanager:us-east-2:[redacted]:secret:munge-key-blah-blah-blah
              Created symlink /etc/systemd/system/multi-user.target.wants/munge.service → /usr/lib/systemd/system/munge.service.
              Restarting munge service
              Job for munge.service failed because the control process exited with error code.
              See "systemctl status munge.service" and "journalctl -xeu munge.service" for details.
    
    ================================================================================
    Error executing action `run` on resource 'execute[fetch_and_decode_munge_key]'
    ================================================================================
    
    Mixlib::ShellOut::ShellCommandFailed
    ------------------------------------
    Expected process to exit with [0], but received '1'
    ---- Begin output of //opt/parallelcluster/scripts/slurm/update_munge_key.sh -d ----
    STDOUT: Fetching munge key from AWS Secrets Manager: arn:aws:secretsmanager:us-east-2:[redacted]:secret:munge-key-blah-blah-blah
    Restarting munge service
    STDERR: Created symlink /etc/systemd/system/multi-user.target.wants/munge.service → /usr/lib/systemd/system/munge.service.
    Job for munge.service failed because the control process exited with error code.
    See "systemctl status munge.service" and "journalctl -xeu munge.service" for details.
    ---- End output of //opt/parallelcluster/scripts/slurm/update_munge_key.sh -d ----
    Ran //opt/parallelcluster/scripts/slurm/update_munge_key.sh -d returned 1
    
    Resource Declaration:
    ---------------------
    # In /etc/chef/local-mode-cache/cache/cookbooks/aws-parallelcluster-slurm/resources/munge_key_manager.rb
    
     66:   declare_resource(:execute, 'fetch_and_decode_munge_key') do
     67:     user 'root'
     68:     group 'root'
     69:     command "/#{node['cluster']['scripts_dir']}/slurm/update_munge_key.sh -d"
     70:   end
     71: end
    
    Compiled Resource:
    ------------------
    # Declared in /etc/chef/local-mode-cache/cache/cookbooks/aws-parallelcluster-slurm/resources/munge_key_manager.rb:66:in `fetch_and_decode_munge_key'
    
    execute("fetch_and_decode_munge_key") do
      action [:run]
      default_guard_interpreter :execute
      command "//opt/parallelcluster/scripts/slurm/update_munge_key.sh -d"
      declared_type :execute
      cookbook_name "aws-parallelcluster-slurm"
      recipe_name "config_munge_key"
      user "root"
      group "root"
    end
    
    System Info:
    ------------
    chef_version=18.4.12
    platform=rocky
    platform_version=9.4
    ruby=ruby 3.1.2p20 (2022-04-12 revision 4491bb740a) [x86_64-linux]
    program_name=/bin/cinc-client
    executable=/opt/cinc/bin/cinc-client

More logs:

    [2024-08-27T14:25:49+00:00] ERROR: Running exception handlers
      - WriteChefError::WriteHeadNodeChefError

And more:

    [2024-08-27T14:25:49+00:00] FATAL: Mixlib::ShellOut::ShellCommandFailed: execute[fetch_and_decode_munge_key] (aws-parallelcluster-slurm::config_munge_key line 66) had an error: Mixlib::ShellOut::ShellCommandFailed: Expected process to exit with [0], but received '1'

So to reiterate, this works with Rocky 8 but NOT with Rocky 9.
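
To help isolate whether it is munged itself or the key handling that breaks on Rocky 9, the fetch-and-decode step can be approximated by hand. The exact contents of update_munge_key.sh are not shown above, so the sketch below is an assumption based only on its log output (fetch the key from Secrets Manager, decode it, restart munge):

    # Hand-run approximation of the fetch_and_decode_munge_key step; the secret
    # ARN below is a placeholder for the redacted one in the output above, and
    # the individual steps are assumptions based on the log messages
    SECRET_ARN="arn:aws:secretsmanager:us-east-2:<account>:secret:<munge-key-secret>"

    aws secretsmanager get-secret-value \
      --secret-id "$SECRET_ARN" \
      --query SecretString \
      --output text | base64 -d > /etc/munge/munge.key

    chown munge:munge /etc/munge/munge.key
    chmod 0600 /etc/munge/munge.key

    systemctl restart munge
    journalctl -xeu munge.service   # should show why munged refuses to start on Rocky 9

If systemd's output is not enough, running munged in the foreground as the munge user (for example, runuser -u munge -- /usr/sbin/munged --foreground) usually prints the exact reason it refuses to start.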
