(3.3.0 3.5.0) Update cluster to remove shared EBS volumes can potentially cause node launching failures

The issue

The cluster may hit issues during scale-up operations with new nodes joining after a cluster update where shared EBS volumes have been removed from the SharedStorages section of the configuration.

In particular, the error occurs if the removed EBS had a MountDir path that's a sub-path of other volumes shared in the cluster. For example, removing a custom shared storage with a mount directory as /shared, /slurm, /parallelcuster/shared or /intel would corrupt the ParallelCluster NFS export paths from the head node, causing the mount operations on the compute nodes to fail.

More in general, the issue happens whenever you remove a shared EBS with mount path /path_1 and another system or custom EBS volume has an export path as /path_2/path_1.

This is caused by an issue in our code to unmount shared EBS volumes, which deletes lines from /etc/exports that match the regular expression of the /<MountDir> without checking if it starts with ^/<MountDir> . For example, if you have multiple SharedEBS volumes with MountDir that are contains the path name of each other: path_1 and path_2/path_1. Both paths ( /path_1 and /path_2/path_1)are written in /etc/exports. If you update the cluster to remove shared EBS with path_1, the regular expression with match lines that have /path_1 and also unintentionally removed /path_2/path_1, further cluster operations like launching nodes or submitting jobs may fail due to required softwares or paths no longer exported over NFS.

Affected versions (OSes, schedulers)

ParallelCluster 3.3.0 -3.5.0

Mitigation

After updating the cluster to remove shared EBS volumes, check /etc/exports to see if there’s missing lines in /etc/exports, make sure your /etc/exports file contains the following lines plus any additional export based on the EBS shared storage set into cluster configuration:

/home 192.168.0.0/17(rw,sync,no_root_squash) 192.168.128.0/17(rw,sync,no_root_squash)
/opt/parallelcluster/shared 192.168.0.0/17(rw,sync,no_root_squash) 192.168.128.0/17(rw,sync,no_root_squash)
/opt/intel 192.168.0.0/17(rw,sync,no_root_squash) 192.168.128.0/17(rw,sync,no_root_squash)
/opt/slurm 192.168.0.0/17(rw,sync,no_root_squash) 192.168.128.0/17(rw,sync,no_root_squash)

If they are missing, stop the cluster compute fleet with:

pcluster update-compute-fleet —-cluster-name<cluster-name> —-status STOP_REQUESTED

If any of these lines are missing or any shared EBS MountDir is missing , add them back to /etc/exports, then run the following command the re-export the NFS shareds.

sudo exportfs -ra

Start the cluster compute fleet with:

pcluster update-compute-fleet —-cluster-name <cluster-name> —-status START_REQUESTED

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

(3.3.0 3.5.0) Update cluster to remove shared EBS volumes can potentially cause node launching failures

The issue

Affected versions (OSes, schedulers)

Mitigation

Clone this wiki locally