Best Practices

Head Node Instance Type

Although the head node doesn't execute any job, its functions and its sizing are crucial to the overall performance of the cluster.

When choosing the instance type for your head node, evaluate the following items:

Cluster size: the head node orchestrates the scaling logic of the cluster and is responsible for attaching new nodes to the scheduler. If you need to scale the cluster up and down by a considerable number of nodes, give the head node some extra compute capacity.
Shared file systems: when using shared file systems to share artefacts between the compute nodes and the head node, keep in mind that the head node is the node exposing the NFS server. For this reason, choose an instance type with enough network bandwidth and enough dedicated EBS bandwidth to handle your workflows.

The following three hints cover the main options for improving network communication.

Placement Group: a cluster placement group is a logical grouping of instances within a single Availability Zone. You can find more information on placement groups here.
- With ParallelCluster 2.x, you can configure the cluster to use your own placement group with placement_group = <your_placement_group_name>, or let ParallelCluster create a placement group with the "compute" strategy with placement_group = DYNAMIC. Details are [here].
- With ParallelCluster 3.x, you can configure the queues to use your own placement group by setting Networking > PlacementGroup > Id to your placement group id, or let ParallelCluster create a placement group by setting Networking > PlacementGroup > Enabled to true. Details are here.
Enhanced Networking: consider choosing an instance type that supports Enhanced Networking. For more information about Enhanced Networking see here.
Instance bandwidth: bandwidth scales with instance size, so consider choosing the instance type that best suits your needs; see here and here.
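When comparing candidate instance types against the hints above, it can help to query their advertised network and EBS figures directly. The following is a minimal sketch, assuming the AWS CLI is installed and configured with credentials and a default region; the instance type c5n.18xlarge is only an example and the variable name is illustrative.

```shell
# Example instance type to inspect (an assumption; replace with your candidate).
instance_type="c5n.18xlarge"

if command -v aws >/dev/null 2>&1; then
    # Show network performance, ENA (Enhanced Networking) support and
    # baseline dedicated EBS bandwidth for the chosen instance type.
    aws ec2 describe-instance-types \
        --instance-types "${instance_type}" \
        --query 'InstanceTypes[].{Type:InstanceType,Network:NetworkInfo.NetworkPerformance,ENA:NetworkInfo.EnaSupport,EbsBaselineMbps:EbsInfo.EbsOptimizedInfo.BaselineBandwidthInMbps}' \
        --output table \
        || echo "describe-instance-types failed; check credentials and region" >&2
else
    echo "aws CLI not found; install it to run this check" >&2
fi
```

Passing several instance types to --instance-types prints them side by side, which makes it easier to weigh head node candidates for NFS-heavy workloads.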

Increase the ulimit to allow a large number of files to be open:

# For large scale MPI runs (>1000 ranks)
echo 'sudo prlimit --pid $$ --nofile=10000:10000' >> $HOME/.bashrc
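To confirm that the new limit is in effect, a check along these lines can be run in a fresh login shell. This is a minimal sketch: the threshold of 10000 simply mirrors the value used in the command above, and the variable names are illustrative.

```shell
# Read the soft and hard open-file limits of the current shell.
soft_limit=$(ulimit -S -n)
hard_limit=$(ulimit -H -n)
echo "open files: soft=${soft_limit} hard=${hard_limit}"

# Threshold taken from the prlimit command above.
required=10000

# Warn when the soft limit is numeric and lower than the threshold.
if [ "${soft_limit}" != "unlimited" ] && [ "${soft_limit}" -lt "${required}" ]; then
    echo "soft limit is below ${required}; large MPI runs may fail with 'Too many open files'" >&2
fi
```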

Run in a placement group:

[cluster yourcluster]
placement_group = DYNAMIC

There are three limits that affect AWS ParallelCluster. You can check them by going to the EC2 Console > Limits.

Running On-Demand EC2 instances: make sure this is at least 1 greater than the biggest cluster you want to launch.
Running On-Demand [instance_type] instances
EC2-VPC Elastic IPs: each cluster launched with use_public_ips = true (which is the default if you don't set anything) uses 1 Elastic IP. So if you want to have more than 5 clusters, you'll need to raise this limit.