Advanced MPI scheduling: Difference between revisions

Revision as of 16:30, 27 June 2017

Most users should submit MPI or distributed memory parallel jobs following the example given at Running jobs. Simply request a number of processes with --ntasks or -n and trust the scheduler to allocate those processes in a way that balances the efficiency of your job with the overall efficiency of the cluster.

If you want more control over how your job is allocated, then SchedMD's page on multicore support is a good place to begin. It describes how several of the options to the sbatch command interact to constrain the placement of processes.

You may also find the Slurm FAQ entry "What exactly is considered a CPU?" (https://slurm.schedmd.com/faq.html#cpu_count) useful.

Note (2017-06-21): Use mpiexec in place of srun in the following examples. As of this writing, launching MPI programs with srun does not work.

Examples of common MPI scenarios

#!/bin/bash
#SBATCH --ntasks=15
#SBATCH --mem-per-cpu=3G
srun application.exe

This will run 15 MPI processes. The cores could be allocated anywhere in the cluster. Since we don’t know in advance how many cores will be allocated on each node, any memory request should be made per CPU (--mem-per-cpu) rather than per node.
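To see why a per-CPU request fits this case, here is a small sketch of the arithmetic. The split of tasks over nodes is made up for illustration; the scheduler decides the real placement, and the variable names are ours.

```shell
# Memory implied by --mem-per-cpu=3G for one possible placement of the 15 tasks.
mem_per_cpu_gb=3

# Hypothetical split of the 15 tasks over three nodes (an assumption, not scheduler output).
tasks_on_each_node="8 4 3"

total_gb=0
per_node_gb=""
for tasks in $tasks_on_each_node; do
    # Each node needs memory in proportion to the tasks that land on it.
    node_gb=$(( tasks * mem_per_cpu_gb ))
    per_node_gb="${per_node_gb:+$per_node_gb }${node_gb}"
    total_gb=$(( total_gb + node_gb ))
    echo "node running ${tasks} tasks: ${node_gb}G"
done
echo "whole job: ${total_gb}G"
```

Whatever the split turns out to be, the per-node amounts follow the task counts, which a single per-node figure given with --mem could not do.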

If for some reason we need all cores in a single node (to avoid communication overhead, for example), then

#!/bin/bash
#SBATCH --nodes=1
#SBATCH --tasks-per-node=15
#SBATCH --mem=45G
srun application.exe

will give us what we need. In this case we could also say --mem-per-cpu=3G. The main difference is that with --mem-per-cpu=3G, the job will be canceled if any of the processes exceeds 3GB, while with --mem=45G, the memory consumed by each individual process doesn't matter, as long as all of them together don’t use more than 45GB.
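The difference between the two memory requests can be sketched as two checks. This only models the rule as stated in the paragraph above; it is not how Slurm enforces limits internally, and the function names and usage figures are invented for illustration.

```shell
# fits_per_cpu LIMIT_GB USAGE_GB... : succeeds only if every process stays within LIMIT_GB.
fits_per_cpu() {
    per_cpu_limit=$1
    shift
    for used in "$@"; do
        [ "$used" -le "$per_cpu_limit" ] || return 1
    done
    return 0
}

# fits_total LIMIT_GB USAGE_GB... : succeeds if the sum of all usages stays within LIMIT_GB.
fits_total() {
    total_limit=$1
    shift
    sum=0
    for used in "$@"; do
        sum=$(( sum + used ))
    done
    [ "$sum" -le "$total_limit" ]
}

# Made-up usage pattern: one process needs 5 GB, the other 14 need 2 GB each.
usage="5 2 2 2 2 2 2 2 2 2 2 2 2 2 2"

# $usage is deliberately unquoted so that each figure becomes its own argument.
if fits_per_cpu 3 $usage; then per_cpu_result=ok; else per_cpu_result=cancelled; fi
if fits_total 45 $usage; then total_result=ok; else total_result=cancelled; fi

echo "--mem-per-cpu=3G: ${per_cpu_result}"
echo "--mem=45G:        ${total_result}"
```

An uneven application like this one, where a single process holds most of the data, is the case where --mem on a single node is the more forgiving choice.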

Hybrid jobs: MPI and OpenMP, or MPI and threads

#!/bin/bash
#SBATCH --ntasks=16
#SBATCH --cpus-per-task=4
#SBATCH --mem-per-cpu=3G
srun application.exe

In this example a total of 64 cores will be allocated, but only 16 MPI processes (tasks) can and will be initialized. If the application also uses OpenMP, then each process will spawn 4 threads, one per core. Each process will be allocated 12GB of memory. The tasks, in groups of 4 cores each, could be placed anywhere, on as few as 2 or as many as 16 nodes.
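The numbers in this example can be reproduced with a few lines of shell arithmetic. Note the assumption: a node size of 32 cores is not stated anywhere above; we pick it here only because it yields the 2-node lower bound. The last line shows the common idiom of passing the per-task core count to OpenMP through OMP_NUM_THREADS.

```shell
ntasks=16
cpus_per_task=4
mem_per_cpu_gb=3

total_cores=$(( ntasks * cpus_per_task ))              # 16 tasks * 4 cores each
mem_per_task_gb=$(( cpus_per_task * mem_per_cpu_gb ))  # 4 cores * 3 GB each

# Assumption: 32-core nodes (hypothetical value, chosen for illustration).
cores_per_node=32
tasks_per_full_node=$(( cores_per_node / cpus_per_task ))
# Ceiling division: fewest nodes that can hold all tasks when each node is packed full.
min_nodes=$(( (ntasks + tasks_per_full_node - 1) / tasks_per_full_node ))
# Most spread-out placement: one task per node.
max_nodes=$ntasks

echo "cores: ${total_cores}, memory per task: ${mem_per_task_gb}G"
echo "nodes: between ${min_nodes} and ${max_nodes}"

# Inside a job, Slurm sets SLURM_CPUS_PER_TASK when --cpus-per-task is given.
# Outside a job the variable is unset, so fall back to a single thread.
export OMP_NUM_THREADS="${SLURM_CPUS_PER_TASK:-1}"
```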

#!/bin/bash
#SBATCH --nodes=4
#SBATCH --tasks-per-node=4
#SBATCH --cpus-per-task=4
#SBATCH --mem=48G
srun application.exe

This job is the same size as the last one: 16 tasks (that is, 16 MPI processes), each with 4 threads. The difference here is that we are sure of getting exactly 4 tasks on each of 4 different nodes. Recall that --mem requests memory per node, so we use it instead of --mem-per-cpu for the reason described earlier.
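As a quick check that this request really matches the previous one, the following sketch derives the totals from the per-node figures. The variable names are ours; only the four option values come from the example.

```shell
nodes=4
tasks_per_node=4
cpus_per_task=4
mem_per_node_gb=48

ntasks=$(( nodes * tasks_per_node ))                     # MPI processes in the whole job
cores_per_node=$(( tasks_per_node * cpus_per_task ))     # cores used on each node
total_cores=$(( nodes * cores_per_node ))                # cores in the whole job
mem_per_core_gb=$(( mem_per_node_gb / cores_per_node ))  # per-core share of --mem
total_mem_gb=$(( nodes * mem_per_node_gb ))              # memory in the whole job

echo "${ntasks} tasks on ${total_cores} cores, ${mem_per_core_gb}G per core, ${total_mem_gb}G in total"
```

The per-core share comes out the same as the --mem-per-cpu=3G of the previous example, so the two jobs differ only in how strictly their placement is pinned down.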

Why srun instead of mpiexec or mpirun?

mpirun is a wrapper that enables communication between processes running on different machines. Modern schedulers already provide many things that mpirun needs. With Torque/Moab, for example, there is no need to pass to mpirun the list of nodes on which to run, or the number of processes to launch; this is done automatically by the scheduler. With Slurm, the task affinity is also resolved by the scheduler, so there is no need to specify things like

mpirun --map-by node:pe=4 -n 16 application.exe

As implied in the examples above, srun application.exe will automatically distribute the processes to precisely the resources allocated to the job.

In programming terminology, srun is a higher level of abstraction than mpirun. Anything that can be done with mpirun can be done with srun, and more. It is the Slurm tool for distributing any kind of computation; it replaces Torque’s pbsdsh, for example. Think of srun as the Slurm "all-around parallel-tasks distributor": once a particular set of resources is allocated, the nature of your application doesn't matter (MPI, OpenMP, hybrid, serial farming, pipelining, multi-program, etc.); you just have to srun it.

Also, as you would expect, srun is fully coupled to Slurm. When you srun an application, a "job step" is started, the environment variables SLURM_STEP_ID and SLURM_PROCID are initialized correctly, and correct accounting information is recorded.
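One way to see these variables is to have each task report them. The sketch below is only illustrative: describe_task is a name we made up, and the "none" fallback is our choice so that the snippet also runs outside a job.

```shell
# describe_task: print the job step and rank of the calling process, read from
# the SLURM_STEP_ID and SLURM_PROCID environment variables mentioned above.
describe_task() {
    echo "step ${SLURM_STEP_ID:-none}, rank ${SLURM_PROCID:-none}"
}

describe_task
```

Saved as a script and launched with srun inside an allocation, each task would be expected to print its own rank; run directly on a login node, both fields fall back to "none".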

External links

sbatch documentation
srun documentation
Open MPI and SLURM
