Best practices for job submission: Difference between revisions

<languages />
<translate>
When submitting a job to one of the clusters, it's important to choose appropriate values for various parameters in order to ensure that your job doesn't waste resources or create problems for other users and yourself. This will ensure your job starts more quickly and that it is likely to finish correctly, producing the output you need to move your research forward.


* '''Use [[Running_jobs#Completed_jobs|monitoring tools]]''' to see how long completed jobs took.
** For example, the <tt>Job Wall-clock time</tt> field in the output of the <tt>seff</tt> command:
</translate>
{{Command
|seff 1234567
Memory Utilized: 14.95 GB (estimated maximum)
Memory Efficiency: 11.68% of 128.00 GB (8.00 GB/core)
}}
<translate>
* '''Increase the estimated duration by 5% or 10%''', just in case.
** It's natural to leave a certain amount of room for error in the estimate, but otherwise it's in your interest for your estimate of the job's duration to be as accurate as possible.
** It's natural to leave a certain amount of room for error in the estimate, but otherwise it's in your interest for your estimate of the job's duration to be as accurate as possible.
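As a rough sketch of the padding step, the following shell function adds a percentage margin to an observed wall-clock time and prints the result in the <tt>HH:MM:SS</tt> form accepted by <tt>--time</tt>. The function name <tt>pad_walltime</tt> and the sample duration are illustrative assumptions, not values taken from this page, and the function assumes the duration is shorter than one day.

```shell
# Hypothetical helper (not part of Slurm): pad an observed wall-clock time.
# $1 = observed duration as HH:MM:SS, $2 = safety margin in percent
pad_walltime() {
    margin=$2
    # Split HH:MM:SS on the colons.
    old_ifs=$IFS; IFS=:
    set -- $1
    IFS=$old_ifs
    # Drop one leading zero per field so that "08" is not parsed as octal.
    h=${1#0}; m=${2#0}; s=${3#0}
    total=$(( ${h:-0} * 3600 + ${m:-0} * 60 + ${s:-0} ))
    # Integer arithmetic, rounded up to the next whole second.
    padded=$(( (total * (100 + margin) + 99) / 100 ))
    printf '%02d:%02d:%02d\n' $(( padded / 3600 )) $(( padded % 3600 / 60 )) $(( padded % 60 ))
}

# Made-up example: a job that took 02:48:58, padded by 10%.
pad_walltime 02:48:58 10   # prints 03:05:52
```

You could then round the result up to a convenient value, for instance <tt>#SBATCH --time=03:10:00</tt>.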
** Highly fragmented parallel jobs often exhibit poor performance and also make the scheduler's job more complicated. For this reason, try to submit jobs in which the number of parallel processes is an integer multiple of the number of cores per node, provided this is compatible with the parallel software your jobs run.
** So on a cluster with 40 cores/node, you would always submit parallel jobs asking for 40, 80, 120, 160, 200, etc. processes. For example, with the following job script header, all 120 MPI processes would be assigned in the most compact fashion, using three whole nodes.
</translate>
<source>
#SBATCH --nodes=3
#SBATCH --ntasks-per-node=40
</source>
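If the process count you have in mind is not already a multiple of the cores per node, one possible approach is to round it up to whole nodes. The sketch below assumes your software can run with the larger process count; the function name <tt>whole_node_header</tt> is hypothetical, and the value of 40 cores per node is just the example figure used above.

```shell
# Hypothetical helper: print whole-node #SBATCH directives.
# $1 = number of processes you would like, $2 = cores per node on the cluster
whole_node_header() {
    # Round up to the next whole node.
    nodes=$(( ($1 + $2 - 1) / $2 ))
    printf '#SBATCH --nodes=%d\n#SBATCH --ntasks-per-node=%d\n' "$nodes" "$2"
}

# Wanting about 100 processes on 40-core nodes gives 3 nodes, that is 120 processes.
whole_node_header 100 40
```

Note that this changes the total number of processes (120 rather than 100 in the example), so check that your problem decomposition allows it.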
 
<translate>
* Ultimately, the goal should be to '''ensure that the CPU efficiency of your jobs is very close to 100%''', as measured by the field <tt>CPU Efficiency</tt> in the output from the <tt>seff</tt> command.
** Any value of CPU efficiency less than 90% is poor and means that your use of whatever software your job executes needs to be improved.
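To make the 90% threshold concrete: CPU efficiency is the CPU time actually used divided by the CPU time reserved (cores multiplied by wall-clock time). The sketch below computes it as a whole-number percentage; the function name <tt>cpu_efficiency</tt> and the sample figures are made up for illustration, and in practice you would simply read the value from <tt>seff</tt>.

```shell
# Hypothetical helper: CPU efficiency as an integer percentage (rounded down).
# $1 = CPU seconds used, $2 = number of cores, $3 = wall-clock seconds
cpu_efficiency() {
    echo $(( 100 * $1 / ($2 * $3) ))
}

# Made-up example: 4 cores reserved for 10000 s, 36000 CPU-seconds used.
cpu_efficiency 36000 4 10000   # prints 90
```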
* We strongly recommend against the use of [[Anaconda/en|Conda]] and its variants on the clusters, in favour of solutions like a [[Python#Creating_and_using_a_virtual_environment|Python virtual environment]] or [[Singularity]].
* Read and write operations should be optimized by '''[[Using_node-local_storage|using node-local storage]]'''.
</translate>