Advanced MPI scheduling
Most users should submit MPI or distributed memory parallel jobs following the example
given at Running jobs. Simply request a number of
processes with --ntasks
or -n
and trust the scheduler
to allocate those processes in a way that balances the efficiency of your job
with the overall efficiency of the cluster.
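For example, an option like --ntasks can be given either on the sbatch command line or as an #SBATCH directive inside the job script; both forms below request 15 processes (the script name job_script.sh is only a placeholder used here for illustration):

sbatch --ntasks=15 job_script.sh

or, equivalently, inside job_script.sh:

#SBATCH --ntasks=15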
If you want more control over how your job is allocated, then SchedMD's
page on multicore support is a good
place to begin. It describes how many of the options to the
sbatch
command interact to constrain the placement of processes.
You may find this discussion of What exactly is considered a CPU? in SLURM (https://slurm.schedmd.com/faq.html#cpu_count) to be useful.
Examples of common MPI scenarios
--ntasks=15
--mem-per-cpu=3G
mpirun application.exe
This will run 15 MPI processes. The cores could be allocated anywhere in the cluster. Since we don't know in advance how many cores will reside on each node, if we want to specify memory, it should be done per core with --mem-per-cpu.
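Written out as a complete job script, this scenario might look roughly as follows; the shebang line and the time limit are illustrative assumptions rather than values taken from this page, and application.exe stands for your own MPI program:

#!/bin/bash
#SBATCH --ntasks=15          # 15 MPI processes, placed wherever the scheduler finds cores
#SBATCH --mem-per-cpu=3G     # memory is requested per core, since node placement is unknown
#SBATCH --time=0-01:00       # placeholder run-time limit (D-HH:MM); adjust to your job
mpirun application.exe       # launch one MPI rank per task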
If for some reason we need all cores in a single node (to avoid communication overhead, for example), then
--nodes=1
--tasks-per-node=15
--mem=45G
mpirun application.exe
will give us what we need. In this case we could also say --mem-per-cpu=3G. The main difference is that with --mem-per-cpu=3G, the job will be canceled if any of the processes exceeds 3GB, while with --mem=45G, the memory consumed by each individual process doesn't matter, as long as all of them together don't use more than 45GB.
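A single-node version of the script could be sketched along the same lines, with the same caveats about the placeholder time limit and program name:

#!/bin/bash
#SBATCH --nodes=1             # keep every core on one node
#SBATCH --tasks-per-node=15   # 15 MPI processes on that node
#SBATCH --mem=45G             # memory for the whole node (or --mem-per-cpu=3G, as discussed above)
#SBATCH --time=0-01:00        # placeholder run-time limit
mpirun application.exe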
Hybrid jobs: MPI and OpenMP, or MPI and threads
--ntasks=16
--cpus-per-task=4
--mem-per-cpu=3G
mpirun application.exe
In this example a total of 64 cores will be allocated, but only 16 MPI processes (tasks) can and will be initialized. If the application is also OpenMP, then each process will spawn 4 threads, one per core. Each process will be allocated 12GB of memory (4 cores at 3GB each). The tasks, in groups of 4 cores each, could be allocated anywhere, from 2 up to 16 nodes.
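When the 4 threads per task come from OpenMP, the thread count is usually taken from the environment. A common pattern, assuming a standard OpenMP runtime that honours OMP_NUM_THREADS, is to add one line before the mpirun command:

export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK   # here SLURM_CPUS_PER_TASK is 4, giving 4 threads per MPI process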
--nodes=4
--tasks-per-node=4
--cpus-per-task=4
--mem=48G
mpirun application.exe
This job is the same size as the last one: 16 tasks (that is, 16 MPI processes), each with 4 threads. The difference here is that we are sure of getting exactly 4 tasks on each of 4 different nodes. Recall that --mem requests memory per node, so we use it instead of --mem-per-cpu for the reason described earlier.
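Putting it together, a whole-node hybrid job script might be sketched as follows; as before, the time limit is a placeholder and the OMP_NUM_THREADS line assumes an OpenMP application:

#!/bin/bash
#SBATCH --nodes=4             # exactly 4 nodes
#SBATCH --tasks-per-node=4    # 4 MPI processes per node
#SBATCH --cpus-per-task=4     # 4 cores for each process
#SBATCH --mem=48G             # memory per node (4 tasks x 4 cores x 3GB)
#SBATCH --time=0-01:00        # placeholder run-time limit
export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK   # one OpenMP thread per allocated core
mpirun application.exe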