Advanced MPI scheduling: Difference between revisions
No edit summary |
No edit summary |
||
(34 intermediate revisions by 7 users not shown) | |||
Line 10: | Line 10: | ||
<!--T:2--> | <!--T:2--> | ||
If you want more control over how your job is allocated, then [https://slurm.schedmd.com/mc_support.html | If you want more control over how your job is allocated, then SchedMD's | ||
page on [https://slurm.schedmd.com/mc_support.html multicore support] is a good place to begin. It describes how many of the options to the | |||
[https://slurm.schedmd.com/sbatch.html <code>sbatch</code>] | [https://slurm.schedmd.com/sbatch.html <code>sbatch</code>] | ||
command interact to constrain the placement of processes. | command interact to constrain the placement of processes. | ||
<!--T:3--> | <!--T:3--> | ||
You may find this discussion | You may find this discussion on [https://slurm.schedmd.com/faq.html#cpu_count What exactly is considered a CPU?] in Slurm to be useful. | ||
=== Examples of common MPI scenarios === <!--T:4--> | === Examples of common MPI scenarios === <!--T:4--> | ||
==== Few cores, any number of nodes ==== <!--T:18--> | ==== Few cores, any number of nodes ==== <!--T:18--> | ||
In addition to the time limit needed for | In addition to the time limit needed for <i>any</i> Slurm job, an MPI job requires that you specify how many MPI processes Slurm should start. The simplest way to do this is with <code>--ntasks</code>. Since the default memory allocation of 256MB per core is often insufficient, you may also wish to specify how much memory is needed. Using <code>--ntasks</code> you cannot know in advance how many cores will reside on each node, so you should request memory with <code>--mem-per-cpu</code>. For example: | ||
<!--T:5--> | <!--T:5--> | ||
{{File | |||
|name=basic_mpi_job.sh | |||
|lang="sh" | |||
|contents= | |||
#!/bin/bash | |||
#SBATCH --ntasks=15 | |||
#SBATCH --mem-per-cpu=3G | |||
srun application.exe | |||
}} | |||
This will run 15 MPI processes. The cores could be allocated on one node, on 15 nodes, or on any number in between. | This will run 15 MPI processes. The cores could be allocated on one node, on 15 nodes, or on any number in between. | ||
==== Whole nodes ==== <!--T:17--> | ==== Whole nodes ==== <!--T:17--> | ||
<!--T:25--> | |||
If you have a large parallel job to run, that is, one that can efficiently use 32 cores or more, you should probably request whole nodes. To do so, it helps to know what node types are available at the cluster you are using. | |||
<!--T:24--> | |||
Typical nodes in [[Béluga/en|Béluga]], [[Cedar]], [[Graham]], [[Narval/en|Narval]] and [[Niagara]] have the following CPU and memory configuration: | |||
<!--T:26--> | |||
{| class="wikitable" | |||
|- | |||
! Cluster !! cores !! usable memory !! Notes | |||
|- | |||
| [[Béluga/en|Béluga]] || 40 || 186 GiB (~4.6 GiB/core) || Some are reserved for whole node jobs. | |||
|- | |||
| [[Graham]] || 32 || 125 GiB (~3.9 GiB/core) || Some are reserved for whole node jobs. | |||
|- | |||
| [[Cedar]] (Skylake) || 48 || 187 GiB (~3.9 GiB/core) || Some are reserved for whole node jobs. | |||
|- | |||
| [[Narval/en|Narval]] || 64 || 249 GiB (~3.9 GiB/core) || AMD EPYC Rome processors | |||
|- | |||
| [[Niagara]] || 40 || 188 GiB || Only whole-node jobs are possible at Niagara. | |||
|} | |||
<!--T:27--> | |||
Whole-node jobs are allowed to run on any node. In the table above, <i>Some are reserved for whole-node jobs</i> indicates that there are nodes on which by-core jobs are forbidden. | |||
<!--T:28--> | |||
A job script requesting whole nodes should look like this: | |||
<!--T:29--> | |||
<tabs> | |||
<tab name="Béluga"> | |||
{{File | |||
|name=whole_nodes_beluga.sh | |||
|lang="sh" | |||
|contents= | |||
#!/bin/bash | |||
#SBATCH --nodes=2 | |||
#SBATCH --ntasks-per-node=40 | |||
#SBATCH --mem=0 | |||
srun application.exe | |||
}}</tab> | |||
<tab name="Cedar"> | |||
{{File | |||
|name=whole_nodes_cedar.sh | |||
|lang="sh" | |||
|contents= | |||
#!/bin/bash | |||
#SBATCH --nodes=2 | |||
#SBATCH --ntasks-per-node=48 | |||
#SBATCH --mem=0 | |||
srun application.exe | |||
}}</tab> | |||
<tab name="Graham"> | |||
{{File | |||
|name=whole_nodes_graham.sh | |||
|lang="sh" | |||
|contents= | |||
#!/bin/bash | |||
#SBATCH --nodes=2 | |||
#SBATCH --ntasks-per-node=32 | |||
#SBATCH --mem=0 | |||
srun application.exe | |||
}}</tab> | |||
<tab name="Narval"> | |||
{{File | |||
|name=whole_nodes_narval.sh | |||
|lang="sh" | |||
|contents= | |||
#!/bin/bash | |||
#SBATCH --nodes=2 | |||
#SBATCH --ntasks-per-node=64 | |||
#SBATCH --mem=0 | |||
srun application.exe | |||
}}</tab> | |||
<tab name="Niagara"> | |||
{{File | |||
|name=whole_nodes_niagara.sh | |||
|lang="sh" | |||
|contents= | |||
#!/bin/bash | |||
#SBATCH --nodes=2 | |||
#SBATCH --ntasks-per-node=40 # or 80: Hyper-Threading is enabled | |||
#SBATCH --mem=0 | |||
srun application.exe | |||
}}</tab> | |||
</tabs> | |||
<!--T:19--> | <!--T:19--> | ||
Requesting <code>--mem=0</code> is interpreted by Slurm to mean <i>reserve all the available memory on each node assigned to the job.</i> | |||
<!--T:30--> | |||
If you need more memory per node than the smallest node provides (e.g. more than 125 GiB at Graham) then you should not use <code>--mem=0</code>, but request the amount explicitly. Furthermore, some memory on each node is reserved for the operating system. To find the largest amount your job can request and still qualify for a given node type, see the <i>Available memory</i> column of the <i>Node characteristics</i> table for each cluster. | |||
* [[Béluga/en#Node_characteristics|Béluga node characteristics]] | |||
* [[Cedar#Node_characteristics|Cedar node characteristics]] | |||
* [[Graham#Node_characteristics|Graham node characteristics]] | |||
* [[Narval/en#Node_characteristics|Narval node characteristics]] | |||
==== Few cores, single node ==== <!--T:6--> | ==== Few cores, single node ==== <!--T:6--> | ||
If | If you need less than a full node but need all the cores to be on the same node, then you can request, for example, | ||
{{File | |||
|name=less_than_whole_node.sh | |||
|lang="sh" | |||
|contents= | |||
In this case you could also say <code>--mem-per-cpu=3G</code>. The advantage of <code>--mem=45G</code> is that the memory consumed by each individual process doesn't matter, as long as all of them together don’t use more than 45GB. With <code>--mem-per-cpu=3G</code>, the job will be | #!/bin/bash | ||
#SBATCH --nodes=1 | |||
#SBATCH --ntasks-per-node=15 | |||
#SBATCH --mem=45G | |||
srun application.exe | |||
}} | |||
In this case, you could also say <code>--mem-per-cpu=3G</code>. The advantage of <code>--mem=45G</code> is that the memory consumed by each individual process doesn't matter, as long as all of them together don’t use more than 45GB. With <code>--mem-per-cpu=3G</code>, the job will be cancelled if any of the processes exceeds 3GB. | |||
==== Large parallel job, not a multiple of | ==== Large parallel job, not a multiple of whole nodes ==== <!--T:20--> | ||
Not every application runs with maximum efficiency on a multiple of 32 cores. Choosing the number of cores to | Not every application runs with maximum efficiency on a multiple of 32 (or 40, or 48) cores. Choosing the number of cores to request—and whether or not to request whole nodes—may be a trade-off between <i>running</i> time (or efficient use of the computer) and <i>waiting</i> time (or efficient use of your time). If you want help evaluating these factors, please contact [[Technical support]]. | ||
=== Hybrid jobs: MPI and OpenMP, or MPI and threads === <!--T:7--> | === Hybrid jobs: MPI and OpenMP, or MPI and threads === <!--T:7--> | ||
<!--T:21--> | <!--T:21--> | ||
It is important to understand that the number of | It is important to understand that the number of <i>tasks</i> requested of Slurm is the number of <i>processes</i> that will be started by <code>srun</code>. So for a hybrid job that will use both MPI processes and OpenMP threads or Posix threads, you should set the MPI process count with <code>--ntasks</code> or <code>-ntasks-per-node</code>, and set the thread count with <code>--cpus-per-task</code>. | ||
<!--T:8--> | <!--T:8--> | ||
Line 60: | Line 160: | ||
--cpus-per-task=4 | --cpus-per-task=4 | ||
--mem-per-cpu=3G | --mem-per-cpu=3G | ||
srun application.exe | srun --cpus-per-task=$SLURM_CPUS_PER_TASK application.exe | ||
In this example a total of 64 cores will be allocated, but only 16 MPI processes (tasks) can and will be initialized. If the application is also OpenMP, then each process will spawn 4 threads, one per core. Each process will be allocated with 12GB of memory. The tasks, with 4 cores each, could be allocated anywhere, from 2 to up to 16 nodes. | In this example, a total of 64 cores will be allocated, but only 16 MPI processes (tasks) can and will be initialized. If the application is also OpenMP, then each process will spawn 4 threads, one per core. Each process will be allocated with 12GB of memory. The tasks, with 4 cores each, could be allocated anywhere, from 2 to up to 16 nodes. Note that you must specify <code>--cpus-per-task=$SLURM_CPUS_PER_TASK</code> for <code>srun</code> as well, as this is a requirement since Slurm 22.05 and does not hurt for older versions. | ||
<!--T:9--> | <!--T:9--> | ||
Line 68: | Line 168: | ||
--cpus-per-task=4 | --cpus-per-task=4 | ||
--mem=96G | --mem=96G | ||
srun application.exe | srun --cpus-per-task=$SLURM_CPUS_PER_TASK application.exe | ||
This job is the same size as the last one: 16 tasks (that is, 16 MPI processes), each with 4 threads. The difference here is that we are sure of getting exactly 2 whole nodes. | This job is the same size as the last one: 16 tasks (that is, 16 MPI processes), each with 4 threads. The difference here is that we are sure of getting exactly 2 whole nodes. Remember that <code>--mem</code> requests memory <i>per node</i>, so we use it instead of <code>--mem-per-cpu</code> for the reason described earlier. | ||
=== Why srun instead of mpiexec or mpirun? === <!--T:10--> | === Why srun instead of mpiexec or mpirun? === <!--T:10--> | ||
Line 81: | Line 181: | ||
<!--T:13--> | <!--T:13--> | ||
In programming | In programming terms, <code>srun</code> is at a higher level of abstraction than <code>mpirun</code>. Anything that can be done with <code>mpirun</code> can be done with <code>srun</code>, and more. It is the tool in Slurm to distribute any kind of computation. It replaces Torque’s <code>pbsdsh</code>, for example, and much more. Think of <code>srun</code> as the SLURM <i>all-around parallel-tasks distributor</i>; once a particular set of resources is allocated, the nature of your application doesn't matter (MPI, OpenMP, hybrid, serial farming, pipelining, multiprogram, etc.), you just have to <code>srun</code> it. | ||
<!--T:14--> | <!--T:14--> | ||
Also, as you would expect, <code>srun</code> is fully coupled to Slurm. When you <code>srun</code> an application, a | Also, as you would expect, <code>srun</code> is fully coupled to Slurm. When you <code>srun</code> an application, a <i>job step</i> is started, the environment variables <code>SLURM_STEP_ID</code> and <code>SLURM_PROCID</code> are initialized correctly, and correct accounting information is recorded. | ||
<!--T:22--> | |||
For an example of some differences between <code>srun</code> and <code>mpiexec</code>, see [https://mail-archive.com/users@lists.open-mpi.org/msg31874.html this discussion] on the Open MPI support forum. Better performance might be achievable with <code>mpiexec</code> than with <code>srun</code> under certain circumstances, but using <code>srun</code> minimizes the risk of a mismatch between the resources allocated by Slurm and those used by Open MPI. | |||
=== External links === <!--T:15--> | === External links === <!--T:15--> |
Latest revision as of 20:42, 27 April 2023
Most users should submit MPI or distributed memory parallel jobs following the example
given at Running jobs. Simply request a number of
processes with --ntasks
or -n
and trust the scheduler
to allocate those processes in a way that balances the efficiency of your job
with the overall efficiency of the cluster.
If you want more control over how your job is allocated, then SchedMD's
page on multicore support is a good place to begin. It describes how many of the options to the
sbatch
command interact to constrain the placement of processes.
You may find this discussion on What exactly is considered a CPU? in Slurm to be useful.
Examples of common MPI scenarios[edit]
Few cores, any number of nodes[edit]
In addition to the time limit needed for any Slurm job, an MPI job requires that you specify how many MPI processes Slurm should start. The simplest way to do this is with --ntasks
. Since the default memory allocation of 256MB per core is often insufficient, you may also wish to specify how much memory is needed. Using --ntasks
you cannot know in advance how many cores will reside on each node, so you should request memory with --mem-per-cpu
. For example:
#!/bin/bash
#SBATCH --ntasks=15
#SBATCH --mem-per-cpu=3G
srun application.exe
This will run 15 MPI processes. The cores could be allocated on one node, on 15 nodes, or on any number in between.
Whole nodes[edit]
If you have a large parallel job to run, that is, one that can efficiently use 32 cores or more, you should probably request whole nodes. To do so, it helps to know what node types are available at the cluster you are using.
Typical nodes in Béluga, Cedar, Graham, Narval and Niagara have the following CPU and memory configuration:
Cluster | cores | usable memory | Notes |
---|---|---|---|
Béluga | 40 | 186 GiB (~4.6 GiB/core) | Some are reserved for whole node jobs. |
Graham | 32 | 125 GiB (~3.9 GiB/core) | Some are reserved for whole node jobs. |
Cedar (Skylake) | 48 | 187 GiB (~3.9 GiB/core) | Some are reserved for whole node jobs. |
Narval | 64 | 249 GiB (~3.9 GiB/core) | AMD EPYC Rome processors |
Niagara | 40 | 188 GiB | Only whole-node jobs are possible at Niagara. |
Whole-node jobs are allowed to run on any node. In the table above, Some are reserved for whole-node jobs indicates that there are nodes on which by-core jobs are forbidden.
A job script requesting whole nodes should look like this:
#!/bin/bash
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=40
#SBATCH --mem=0
srun application.exe
#!/bin/bash
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=48
#SBATCH --mem=0
srun application.exe
#!/bin/bash
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=32
#SBATCH --mem=0
srun application.exe
#!/bin/bash
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=64
#SBATCH --mem=0
srun application.exe
#!/bin/bash
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=40 # or 80: Hyper-Threading is enabled
#SBATCH --mem=0
srun application.exe
Requesting --mem=0
is interpreted by Slurm to mean reserve all the available memory on each node assigned to the job.
If you need more memory per node than the smallest node provides (e.g. more than 125 GiB at Graham) then you should not use --mem=0
, but request the amount explicitly. Furthermore, some memory on each node is reserved for the operating system. To find the largest amount your job can request and still qualify for a given node type, see the Available memory column of the Node characteristics table for each cluster.
- Béluga node characteristics
- Cedar node characteristics
- Graham node characteristics
- Narval node characteristics
Few cores, single node[edit]
If you need less than a full node but need all the cores to be on the same node, then you can request, for example,
#!/bin/bash
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=15
#SBATCH --mem=45G
srun application.exe
In this case, you could also say --mem-per-cpu=3G
. The advantage of --mem=45G
is that the memory consumed by each individual process doesn't matter, as long as all of them together don’t use more than 45GB. With --mem-per-cpu=3G
, the job will be cancelled if any of the processes exceeds 3GB.
Large parallel job, not a multiple of whole nodes[edit]
Not every application runs with maximum efficiency on a multiple of 32 (or 40, or 48) cores. Choosing the number of cores to request—and whether or not to request whole nodes—may be a trade-off between running time (or efficient use of the computer) and waiting time (or efficient use of your time). If you want help evaluating these factors, please contact Technical support.
Hybrid jobs: MPI and OpenMP, or MPI and threads[edit]
It is important to understand that the number of tasks requested of Slurm is the number of processes that will be started by srun
. So for a hybrid job that will use both MPI processes and OpenMP threads or Posix threads, you should set the MPI process count with --ntasks
or -ntasks-per-node
, and set the thread count with --cpus-per-task
.
--ntasks=16 --cpus-per-task=4 --mem-per-cpu=3G srun --cpus-per-task=$SLURM_CPUS_PER_TASK application.exe
In this example, a total of 64 cores will be allocated, but only 16 MPI processes (tasks) can and will be initialized. If the application is also OpenMP, then each process will spawn 4 threads, one per core. Each process will be allocated with 12GB of memory. The tasks, with 4 cores each, could be allocated anywhere, from 2 to up to 16 nodes. Note that you must specify --cpus-per-task=$SLURM_CPUS_PER_TASK
for srun
as well, as this is a requirement since Slurm 22.05 and does not hurt for older versions.
--nodes=2 --ntasks-per-node=8 --cpus-per-task=4 --mem=96G srun --cpus-per-task=$SLURM_CPUS_PER_TASK application.exe
This job is the same size as the last one: 16 tasks (that is, 16 MPI processes), each with 4 threads. The difference here is that we are sure of getting exactly 2 whole nodes. Remember that --mem
requests memory per node, so we use it instead of --mem-per-cpu
for the reason described earlier.
Why srun instead of mpiexec or mpirun?[edit]
mpirun
is a wrapper that enables communication between processes running on different machines. Modern schedulers already provide many things that mpirun
needs. With Torque/Moab, for example, there is no need to pass to mpirun
the list of nodes on which to run, or the number of processes to launch; this is done automatically by the scheduler. With Slurm, the task affinity is also resolved by the scheduler, so there is no need to specify things like
mpirun --map-by node:pe=4 -n 16 application.exe
As implied in the examples above, srun application.exe
will automatically distribute the processes to precisely the resources allocated to the job.
In programming terms, srun
is at a higher level of abstraction than mpirun
. Anything that can be done with mpirun
can be done with srun
, and more. It is the tool in Slurm to distribute any kind of computation. It replaces Torque’s pbsdsh
, for example, and much more. Think of srun
as the SLURM all-around parallel-tasks distributor; once a particular set of resources is allocated, the nature of your application doesn't matter (MPI, OpenMP, hybrid, serial farming, pipelining, multiprogram, etc.), you just have to srun
it.
Also, as you would expect, srun
is fully coupled to Slurm. When you srun
an application, a job step is started, the environment variables SLURM_STEP_ID
and SLURM_PROCID
are initialized correctly, and correct accounting information is recorded.
For an example of some differences between srun
and mpiexec
, see this discussion on the Open MPI support forum. Better performance might be achievable with mpiexec
than with srun
under certain circumstances, but using srun
minimizes the risk of a mismatch between the resources allocated by Slurm and those used by Open MPI.