Advanced MPI scheduling



=== Examples of common MPI scenarios === <!--T:4-->
==== Few cores, any nodes ====
In addition to the time limit needed for ''any'' Slurm job, an MPI job requires that you specify how many MPI processes Slurm should start. The simplest way to do this is with <code>--ntasks</code>. Since the default memory allocation of 256M per core is often insufficient, you may also wish to specify how much memory is needed. With <code>--ntasks</code> alone you cannot know in advance how many cores will reside on each node, so you should request memory with <code>--mem-per-cpu</code>. For example:


 <!--T:5-->
 --ntasks=15
 --mem-per-cpu=3G
 srun application.exe
This will run 15 MPI processes. The cores could be allocated on one node, on 15 nodes, or on any number in between.
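
Put into a complete job script, this request might look like the following minimal sketch; the <code>--time</code> value is only an example, and <code>application.exe</code> stands in for your own program:
 #!/bin/bash
 #SBATCH --ntasks=15          # number of MPI processes
 #SBATCH --mem-per-cpu=3G     # memory per core, since we don't know where the cores will land
 #SBATCH --time=0-01:00       # time limit (D-HH:MM); example value only
 srun application.exe         # srun launches one copy of the program per task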


==== Whole nodes ====
<!--T:17-->
Most nodes in [[Cedar]] and [[Graham]] have 32 cores and 128GB or more of memory. If you have a large parallel job to run, which can use 32 or a multiple of 32 cores efficiently, you should request whole nodes like so:
 --nodes=2
 --ntasks-per-node=32
 --mem=128000M
 srun application.exe
The above job will probably start sooner than an equivalent one that requests <code>--ntasks=64</code>.


You should use <code>--mem=128000M</code> rather than <code>--mem=128G</code> when requesting whole nodes because a small tranche of memory is reserved for the operating system, and requesting precisely 128GB means the job cannot be scheduled on the plentiful 128GB nodes. The job will not be rejected by Slurm; it will just wait much longer to start than it needs to.
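
A complete whole-node script for this request might look like the following sketch; the <code>--time</code> value is only an example:
 #!/bin/bash
 #SBATCH --nodes=2
 #SBATCH --ntasks-per-node=32     # 32 MPI processes on each of the 2 nodes, 64 in total
 #SBATCH --mem=128000M            # just under 128G, leaving room for the operating system
 #SBATCH --time=0-03:00           # time limit (D-HH:MM); example value only
 srun application.exe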
==== Few cores, single node ====
<!--T:6-->
If for some reason you need fewer than 32 cores but they must all be in a single node, then you can request
 --nodes=1
 --ntasks-per-node=15
 --mem=45G
 srun application.exe
In this case you could also say <code>--mem-per-cpu=3G</code>. The advantage of <code>--mem=45G</code> is that the memory consumed by each individual process doesn't matter, as long as all of them together don't use more than 45GB. With <code>--mem-per-cpu=3G</code>, the job will be canceled if any of the processes exceeds 3GB.
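
The corresponding job script might look like this sketch; the <code>--time</code> value is only an example:
 #!/bin/bash
 #SBATCH --nodes=1
 #SBATCH --ntasks-per-node=15
 #SBATCH --mem=45G                # memory for all 15 processes together
 #SBATCH --time=0-01:00           # time limit (D-HH:MM); example value only
 srun application.exe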
 
==== Large parallel job, not a multiple of 32 cores ====
Not every application runs with maximum efficiency on a multiple of 32 cores. Choosing the number of cores to request, and whether or not to request whole nodes, may be a trade-off between ''running'' time (or efficient use of the computer) and ''waiting'' time (or efficient use of your time). If you want help evaluating these factors, please write to  [mailto:support@computecanada.ca support@computecanada.ca].


=== Hybrid jobs: MPI and OpenMP, or MPI and threads === <!--T:7-->