Advanced MPI scheduling

From Alliance Doc


This article is a draft

This is not a complete article: This is a draft, a work in progress that is intended to be published into an article, which may or may not be ready for inclusion in the main wiki. It should not necessarily be considered factual or authoritative.




Most users should submit MPI or distributed-memory parallel jobs following the example given at Running jobs. Simply request a number of processes with --ntasks or -n and trust the scheduler to allocate those processes in a way that balances the efficiency of your job with the overall efficiency of the cluster.
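As a minimal sketch of that simple case, the commands below write a job script that requests 16 processes and starts them with srun. The file name mpi_job.sh, the executable name application.exe, and the time and memory values are placeholders chosen for illustration, not recommendations.

```shell
# Write a minimal MPI job script to mpi_job.sh (hypothetical file name).
cat > mpi_job.sh <<'EOF'
#!/bin/bash
# Number of MPI processes; the scheduler decides where they run.
#SBATCH --ntasks=16
# Illustrative wall-time and memory requests.
#SBATCH --time=00:10:00
#SBATCH --mem-per-cpu=1G

# srun starts one copy of the program per allocated task.
srun ./application.exe
EOF

# Show the script; on a cluster it would be submitted with: sbatch mpi_job.sh
cat mpi_job.sh
```

No node count or process placement appears in the script; those decisions are left to the scheduler.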

If you want more control over how your job is allocated, then SchedMD's page on multicore support is a good place to begin. It describes how many of the options to the sbatch command interact to constrain the placement of processes.
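To give a sense of how such options combine, the sketch below multiplies out one possible request that uses the sbatch options --nodes, --ntasks-per-node and --cpus-per-task. The specific values and the script name mpi_job.sh are assumptions made for illustration only.

```shell
# Illustrative values, not a recommendation for any particular cluster.
NODES=2            # --nodes
TASKS_PER_NODE=8   # --ntasks-per-node
CPUS_PER_TASK=4    # --cpus-per-task

# Total processes and total cores implied by the three options together.
TOTAL_TASKS=$((NODES * TASKS_PER_NODE))
TOTAL_CPUS=$((TOTAL_TASKS * CPUS_PER_TASK))

# Print the corresponding submission command (mpi_job.sh is hypothetical).
echo "sbatch --nodes=${NODES} --ntasks-per-node=${TASKS_PER_NODE} --cpus-per-task=${CPUS_PER_TASK} mpi_job.sh"
echo "tasks=${TOTAL_TASKS} cpus=${TOTAL_CPUS}"
```

Fixing the layout this explicitly narrows the set of nodes that can satisfy the request, which is one reason the simpler --ntasks form is usually preferable.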

You may also find the discussion "What exactly is considered a CPU?" in the Slurm documentation useful.

Hybrid jobs: MPI and OpenMP, or MPI and threads

To come

MPI and GPUs

To come

Troubleshooting and performance monitoring

To come

Why srun instead of mpiexec or mpirun?

mpirun is a wrapper that enables communication between processes running on different machines. Modern schedulers already provide much of what mpirun needs. With Torque/Moab, for example, there is no need to pass mpirun the list of nodes on which to run or the number of processes to launch; the scheduler supplies these automatically. With Slurm, task affinity is also resolved by the scheduler, so there is no need to specify things like:

mpirun --map-by node:pe=4 -n 16 application.exe

Running srun application.exe automatically distributes the processes to precisely the resources allocated to the job.
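For comparison, here is a rough Slurm counterpart to the mpirun line above: the resource shape (16 processes, 4 cores each) moves into the job request, and the launch line carries no mapping syntax. This is a sketch under assumptions: the file name hybrid_job.sh and the time limit are invented for illustration, the correspondence with --map-by node:pe=4 is approximate, and repeating --cpus-per-task on the srun line is a precaution that may be needed on recent Slurm releases.

```shell
# Write a job script requesting 16 tasks with 4 cores per task.
cat > hybrid_job.sh <<'EOF'
#!/bin/bash
#SBATCH --ntasks=16
#SBATCH --cpus-per-task=4
#SBATCH --time=00:10:00

# Assumption: passing the per-task core count again is harmless and
# keeps the step consistent with the job request.
srun --cpus-per-task="${SLURM_CPUS_PER_TASK}" ./application.exe
EOF

# Show the script; on a cluster it would be submitted with: sbatch hybrid_job.sh
cat hybrid_job.sh
```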

In programming terminology, srun is at a higher level of abstraction than mpirun. Anything that can be done with mpirun can be done with srun, and more. It is Slurm's tool for distributing any kind of computation; it replaces Torque’s pbsdsh, for example, and does much more. Think of srun as the Slurm "all-around parallel-tasks distributor": once a particular set of resources is allocated, the nature of your application doesn't matter (MPI, OpenMP, hybrid, serial farming, pipelining, multi-program, etc.); you just have to srun it.

Also, as should be expected, srun is fully coupled to Slurm. When you srun an application, a "job step" is started, the environment variables SLURM_STEP_ID and SLURM_PROCID are initialized correctly, and correct accounting information is recorded.
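One way to see these variables is to have each task print them. The helper below is a small sketch; the file name show_step.sh is hypothetical, and the "unset" fallback is only there so the script also runs outside a Slurm job.

```shell
# Helper script: print the step and rank identifiers that srun sets.
cat > show_step.sh <<'EOF'
#!/bin/sh
echo "step=${SLURM_STEP_ID:-unset} rank=${SLURM_PROCID:-unset} host=$(uname -n)"
EOF
chmod +x show_step.sh

# Inside a job, "srun ./show_step.sh" would print one line per task.
# Run directly, outside srun, it shows whatever the current shell has set.
sh ./show_step.sh
```

After the job finishes, the steps started by srun can typically be listed with sacct -j followed by the job ID.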

External links