==Parallelism==
* By default your job will get one core on one node, and this is the most sensible policy because '''most software is serial''': it can only ever make use of a single core.
** Asking for more cores and/or nodes will not make a serial program run any faster, because its source code must be modified before it can run in parallel, in some cases in a very profound manner requiring a substantial investment of developer time.
* How can you '''determine if''' the software you're using '''can run in parallel'''?
** The best approach is to '''look in the software's documentation''' for a section on parallel execution: if you can't find anything, this is usually a sign that the program is serial.
** You can also '''contact the development team''' to ask if the software can be run in parallel and, if not, to request that such a feature be added in a future version.
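If the program turns out to be serial, the default allocation is all you need. Here is a minimal sketch of a serial job script; <tt>my_serial_program</tt> is a hypothetical name, and the time and memory values are examples to adjust to your program's needs.
<source>
#!/bin/bash
#SBATCH --time=01:00:00   # example value: adjust to your program's expected run time
#SBATCH --mem=4G          # example value: adjust to your program's memory needs
# No --nodes, --ntasks or --cpus-per-task directives are needed:
# one task on one core of one node is the default.
./my_serial_program       # placeholder for your own serial program
</source>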
* If the program can run in parallel, the next question is '''how many cores to use'''.
** Many of the programming techniques that allow a program to run in parallel assume a ''shared-memory environment'', i.e. multiple cores can be used, but they must all be located on the same node. In this case, the number of cores available on a single node sets a ceiling on the number of cores you can use.
** It may be tempting to simply request "as many cores as possible", but this is often not the wisest approach. Just as having too many cooks working together in a small kitchen can lead to chaos, adding an excessive number of CPU cores can have the perverse effect of slowing a program down.
** To choose the optimal number of CPU cores, you need to '''study the [[Scalability|software's scalability]]'''; a single-node, multi-core job header is sketched below.
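As an illustration, a shared-memory (threaded) job confined to a single node might use a header like this sketch; the core count of 8 and the program name <tt>my_threaded_program</tt> are placeholders, and the right core count is the one your scalability study suggests.
<source>
#!/bin/bash
#SBATCH --nodes=1           # shared memory: all cores must be on the same node
#SBATCH --cpus-per-task=8   # placeholder: use the value from your scalability study
export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK   # tell an OpenMP-based program how many threads to start
./my_threaded_program       # placeholder for your own threaded program
</source>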
* A further complication with parallel execution concerns '''the use of multiple nodes''': the software you are running must support ''distributed-memory parallelism''.
** Most software able to run over more than one node uses '''the [[MPI]] standard''', so if the documentation doesn't mention MPI, or consistently refers to threading and thread-based parallelism, this likely means you will need to restrict yourself to a single node.
** Programs that have been parallelized to run across multiple nodes '''should be started using''' <tt>srun</tt> rather than <tt>mpirun</tt>.
* A goal should also be to '''avoid scattering your parallel processes across more nodes than is necessary''': a more compact distribution will usually help your job's performance.
** Highly fragmented parallel jobs often exhibit poor performance and also complicate the scheduler's task. You should therefore try to submit jobs in which the number of parallel processes is an integral multiple of the number of cores per node, assuming this is compatible with the parallel software your jobs run.
** So on a cluster with 40 cores per node, you would always submit parallel jobs asking for 40, 80, 120, 160, 200, etc. processes. For example, with the following job script header, all 120 MPI processes would be assigned in the most compact fashion, using three whole nodes.
<source>
#SBATCH --nodes=3
#SBATCH --ntasks-per-node=40
</source>
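To complete that example, the body of the job script would then launch the program with <tt>srun</tt>, which picks up the geometry requested in the header; <tt>my_mpi_program</tt> is a placeholder name.
<source>
srun ./my_mpi_program   # starts all 120 MPI processes, packed onto the three nodes
</source>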
* Ultimately, the goal should be to '''ensure that the CPU efficiency of your jobs is very close to 100%''', as measured by the <tt>CPU Efficiency</tt> field in the output of the <tt>seff</tt> command (see the example below).
** Any CPU efficiency value less than 90% is poor and means that your use of whatever software your job executes needs to be improved.
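For example, once a job has completed you can pass its ID to <tt>seff</tt> and read the <tt>CPU Efficiency</tt> line in its output; the job ID shown here is a placeholder.
<source>
seff 1234567   # replace 1234567 with your own job ID
</source>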
==Using GPUs==