Resources from the web on getting started with MPI:
MPI is a standard that dictates the semantics and features of "message passing". There are different implementations of MPI. Those installed on Oscar include MVAPICH2 and OpenMPI.
We recommend using MVAPICH2 as it is integrated with the SLURM scheduler and optimized for the Infiniband network.
The MPI module is called "mpi". The different implementations (mvapich2, openmpi, different base compilers) are in the form of versions of the module "mpi". This is to make sure that no two implementations can be loaded simultaneously, which is a common source of errors and confusion.
$ module avail mpi
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ name: mpi*/* ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
mpi/cave_mvapich2_2.3b_gcc
mpi/cave_mvapich2_2.3b_intel
mpi/cave_mvapich2_2.3rc2_gcc
mpi/hpcx_2.7.0_gcc_10.2_slurm20
mpi/hpcx_2.7.0_intel_2020.2_slurm20
mpi/mvapich2-2.3.5_gcc_10.2_slurm20
mpi/mvapich2-2.3.5_intel_2017.0_slurm20
mpi/mvapich2-2.3.5_intel_2020.2_slurm20
mpi/openmpi_2.0.3_intel_2020.2_slurm20
mpi/openmpi_3.1.6_gcc_10.2_slurm20
mpi/openmpi_4.0.0_gcc
mpi/openmpi_4.0.1_gcc
mpi/openmpi_4.0.5_gcc_10.2_slurm20
mpi/openmpi_4.0.5_intel_2020.2_slurm20
mpi4py/3.0.1_py3.6.8
You can just use "module load mpi" to load the default version, which is mpi/openmpi_4.0.5_gcc_10.2_slurm20. This is the recommended version.
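The module naming format, as the module names listed by "module avail mpi" show, combines the implementation, its version, the base compiler and, for newer builds, the compiler and SLURM versions (for example, mpi/openmpi_4.0.5_gcc_10.2_slurm20).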
Use srun --mpi=pmix to run MPI programs. All MPI implementations listed above except openmpi_1.10.7_gcc are built with SLURM support. Hence, the programs need to be run using SLURM's srun command, unless you are using the legacy version mentioned above. The --mpi=pmix flag is also required to match the configuration with which MPI is installed on Oscar.
To run an MPI program interactively, first create an allocation from the login nodes using the salloc command:
$ salloc -N <# nodes> -n <# MPI tasks> -p <partition> -t <minutes>
For example, to request 4 cores to run 4 tasks (MPI processes):
$ salloc -n 4
Once the allocation is fulfilled, you can run MPI programs with the srun command:
$ srun --mpi=pmix ./my-mpi-program ...
When you are finished running MPI commands, you can release the allocation by exiting the shell:
$ exit
Also, if you only need to run a single MPI program, you can skip the salloc command and specify the resources in a single srun command:
$ srun -N <# nodes> -n <# MPI tasks> -p <partition> -t <minutes> --mpi=pmix ./my-mpi-program
This will create the allocation, run the MPI program, and release the allocation.
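As a rough sketch of the one-step form above, the srun invocation can be wrapped in a small shell function. The function name run_mpi_once and the DRY_RUN switch are hypothetical conveniences, not part of SLURM or Oscar. With DRY_RUN=1 the function only prints the command it would run, which is a cheap way to check the resource flags before submitting.

```shell
# Hypothetical helper (not part of SLURM): one-shot srun for a single MPI program.
# Usage: run_mpi_once <nodes> <tasks> <partition> <minutes> <program> [args...]
run_mpi_once() {
    if [ "$#" -lt 5 ]; then
        echo "usage: run_mpi_once <nodes> <tasks> <partition> <minutes> <program> [args...]" >&2
        return 2
    fi
    nodes=$1; tasks=$2; partition=$3; minutes=$4
    shift 4   # what remains in "$@" is the program and its arguments

    if [ "${DRY_RUN:-0}" = "1" ]; then
        # Print the command instead of running it.
        echo srun -N "$nodes" -n "$tasks" -p "$partition" -t "$minutes" --mpi=pmix "$@"
    else
        # Creates the allocation, runs the program, and releases the allocation.
        srun -N "$nodes" -n "$tasks" -p "$partition" -t "$minutes" --mpi=pmix "$@"
    fi
}
```

For example, `DRY_RUN=1 run_mpi_once 2 16 batch 30 ./my-mpi-program` prints `srun -N 2 -n 16 -p batch -t 30 --mpi=pmix ./my-mpi-program`. The partition name batch is an assumption here; substitute one that exists on your cluster.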
Note: It is not possible to run MPI programs on compute nodes by using the
salloc documentation: https://slurm.schedmd.com/salloc.html
srun documentation: https://slurm.schedmd.com/srun.html
Here is a sample batch script to run an MPI program:
#!/bin/bash

# Request an hour of runtime:
#SBATCH --time=1:00:00

# Use 2 nodes with 8 tasks each, for 16 MPI tasks:
#SBATCH --nodes=2
#SBATCH --tasks-per-node=8

# Specify a job name:
#SBATCH -J MyMPIJob

# Specify an output file
#SBATCH -o MyMPIJob-%j.out
#SBATCH -e MyMPIJob-%j.err

# Load required modules
module load mpi

srun --mpi=pmix MyMPIProgram
If your program has multi-threading capability using OpenMP, you can attach several cores to a single MPI task using the -c option with salloc. The environment variable OMP_NUM_THREADS governs the number of threads that will be used.
#!/bin/bash

# Use 2 nodes with 2 tasks each (4 MPI tasks)
# And allocate 4 CPUs to each task for multi-threading
#SBATCH --nodes=2
#SBATCH --tasks-per-node=2
#SBATCH --cpus-per-task=4

# Load required modules
module load mpi

export OMP_NUM_THREADS=4
srun --mpi=pmix ./MyMPIProgram
The above batch script will launch 4 MPI tasks, 2 on each node, and allocate 4 CPUs to each task (16 cores in total for the job). Setting OMP_NUM_THREADS governs the number of threads to be used, although this can also be set in the program.
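To keep the thread count in step with the CPUs requested per task, one option is to derive OMP_NUM_THREADS from SLURM_CPUS_PER_TASK, which SLURM sets inside a job when --cpus-per-task (-c) is given. The sketch below is one possible approach, not a required Oscar setting; the helper names threads_per_task and total_cores are hypothetical.

```shell
# Hypothetical helper: threads per MPI task, taken from SLURM when available.
# Falls back to 1 when SLURM_CPUS_PER_TASK is unset (e.g. outside a job).
threads_per_task() {
    echo "${SLURM_CPUS_PER_TASK:-1}"
}

# Hypothetical helper: total cores = nodes * tasks per node * CPUs per task.
total_cores() {
    echo $(( $1 * $2 * $3 ))
}

# In a batch script, this would replace the hard-coded "export OMP_NUM_THREADS=4".
export OMP_NUM_THREADS="$(threads_per_task)"
echo "OpenMP threads per MPI task: $OMP_NUM_THREADS"
```

With the request from the script above (2 nodes, 2 tasks per node, 4 CPUs per task), `total_cores 2 2 4` prints 16, matching the core count stated for that job.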
The maximum theoretical speedup that a parallel program can achieve is limited by the proportion of the program that is sequential (Amdahl's law). Moreover, as the number of MPI processes increases, so does the communication overhead, i.e. the time spent sending and receiving messages among the processes. Beyond a certain number of processes, this overhead grows faster than the computational run time shrinks, so the overall program slows down instead of speeding up as more processes are added.
Hence, MPI programs (or any parallel programs) do not run faster when the number of processes is increased beyond a certain point.
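To put numbers on Amdahl's law, the speedup on N processes for a program whose parallelizable fraction is p is 1 / ((1 - p) + p / N). The sketch below evaluates that formula with awk; the function name amdahl_speedup is hypothetical, and the value p = 0.9 in the usage note is an illustrative assumption, not a measurement. The formula ignores communication overhead, so real MPI speedups will be lower.

```shell
# Hypothetical helper: ideal Amdahl speedup, ignoring communication overhead.
# Usage: amdahl_speedup <parallel_fraction between 0 and 1> <number_of_processes>
amdahl_speedup() {
    awk -v p="$1" -v n="$2" 'BEGIN {
        # Sequential part (1 - p) is not sped up; parallel part p is divided by n.
        printf "%.2f\n", 1 / ((1 - p) + p / n)
    }'
}
```

For example, `amdahl_speedup 0.9 4` prints 3.08 and `amdahl_speedup 0.9 16` prints 6.40. However many processes are used, the speedup for p = 0.9 stays below 1 / (1 - 0.9) = 10.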
If you intend to carry out many runs of a program, first find the number of processes that gives the lowest, or a reasonably low, run time. Start with a small number of processes, such as 2 or 4, and verify the correctness of the results by comparing them with those of a sequential run. Then increase the number of processes gradually until the run time flattens out or starts increasing; that point is the optimum.
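The search described above can be sketched in shell. The function best_nprocs is a hypothetical helper that reads lines of the form "<processes> <seconds>" and prints the process count with the lowest time. The commented loop shows one way the timings might be collected inside an allocation; the program name and the process counts are placeholders.

```shell
# Hypothetical helper: pick the process count with the lowest run time.
# Input (stdin): one "<nprocs> <seconds>" pair per line.
best_nprocs() {
    awk 'NF >= 2 && (best == "" || $2 + 0 < min) { min = $2 + 0; best = $1 }
         END { if (best != "") print best }'
}

# Sketch of collecting timings inside an allocation (placeholders throughout):
# for n in 2 4 8 16; do
#     start=$(date +%s)
#     srun -n "$n" --mpi=pmix ./my-mpi-program
#     end=$(date +%s)
#     echo "$n $((end - start))"
# done > timings.txt
# best_nprocs < timings.txt
```

Whole-second timing from date is coarse, so this is only suitable for runs that last at least a few minutes. If the times have already flattened out, a smaller process count with nearly the same run time may be the better choice, since it uses fewer cores.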