Best practices for job submission: Difference between revisions

Line 21:


<!--T:6-->
The best approach is to begin by submitting a few relatively small test jobs, asking for a fairly standard amount of memory (<code>#SBATCH --mem-per-cpu=2G</code>) and time, for example one or two hours.
* Ideally, you should already know what the answer will be in these test jobs, allowing you to verify that the software is running correctly on the cluster.
* If the job ends before the computation has finished, double the requested duration and resubmit, repeating until the duration is sufficient.
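As a rough illustration of such a test job, the script below requests the memory mentioned above together with a one-hour limit. The <code>--time</code> value, the <code>run_test_case</code> function and its message are placeholders assumed for this sketch; replace the function body with the command that launches your software.

```shell
#!/bin/bash
#SBATCH --mem-per-cpu=2G     # standard memory request suggested above
#SBATCH --time=01:00:00      # assumed starting point: one hour

# Hypothetical stand-in for the real workload; replace the body with
# the command that runs your software on a case whose answer you know.
run_test_case() {
    echo "test job running on ${HOSTNAME:-unknown-host}"
}

run_test_case
```

If the job is cut off before it finishes, changing the limit to <code>02:00:00</code> and resubmitting would be the next step.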
Line 37:
** If you ask for five days when the computation actually finishes in sixteen hours, your job will wait much longer to start than it would with a more accurate estimate of the duration.
* '''Use [[Running_jobs#Completed_jobs|monitoring tools]]''' to see how long completed jobs took.
** For example, the <code>Job Wall-clock time</code> field in the output of the <code>seff</code> command:
</translate>
</translate>
{{Command
{{Command
Line 64:


<!--T:12-->
* Your <code>Memory Efficiency</code> in the output from the <code>seff</code> command '''should be at least 80% to 85%''' in most cases.
** Much like with the duration of your job, the goal when requesting memory is to ensure that the amount is sufficient, with a certain margin of error.
* If you plan on using a '''whole node''' for your job, it is natural to also '''use all of its available memory''', which you can request with the line <code>#SBATCH --mem=0</code> in your job submission script.
** Note however that most of our clusters offer nodes with varying amounts of memory, so with this approach your job will likely be assigned a node with less memory.
* If your testing has shown that you need a '''large memory node''', then you will want to use a line such as <code>#SBATCH --mem=1500G</code> to request a node with 1500 GB (or 1.46 TB) of memory.
** There are relatively few of these large memory nodes, so your job will wait much longer to run; make sure your job really needs all this extra memory.
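The 80% guideline above can be checked with a small helper. This is only a sketch: it assumes the <code>seff</code> output contains a line of the form <code>Memory Efficiency: 84.20% of 4.00 GB</code>, and the function names and the job ID in the usage comment are made up for illustration.

```shell
#!/bin/bash
# Extract the percentage from the "Memory Efficiency" line of seff output
# read on standard input. Assumes a line of the form
# "Memory Efficiency: 84.20% of 4.00 GB".
memory_efficiency() {
    awk -F': ' '/^Memory Efficiency:/ { sub(/%.*/, "", $2); print $2 }'
}

# Succeed (exit status 0) when the given percentage is at least 80.
memory_request_ok() {
    awk -v pct="$1" 'BEGIN { exit !(pct + 0 >= 80) }'
}

# Usage on a cluster (12345 is a placeholder job ID):
#   pct=$(seff 12345 | memory_efficiency)
#   memory_request_ok "$pct" || echo "consider lowering the memory request"
```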


Line 102:
<translate>
<translate>
<!--T:18-->
* Ultimately, the goal should be to '''ensure that the CPU efficiency of your jobs is very close to 100%''', as measured by the field <code>CPU Efficiency</code> in the output from the <code>seff</code> command.
** A CPU efficiency below 90% is poor and means that the way your job uses its software needs to be improved.
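To see why a low value signals a problem, it can help to compute the figure by hand. The sketch below assumes that CPU efficiency is the CPU time actually used divided by the product of the number of cores and the wall-clock time; the function name is hypothetical.

```shell
#!/bin/bash
# Print CPU efficiency in percent: CPU time used / (cores * wall-clock time).
# Arguments: CPU seconds used, number of cores, wall-clock seconds.
cpu_efficiency() {
    awk -v cpu="$1" -v cores="$2" -v wall="$3" \
        'BEGIN { printf "%.1f\n", 100 * cpu / (cores * wall) }'
}

# A serial program that was given 4 cores for one hour keeps only one busy:
cpu_efficiency 3600 4 3600    # prints 25.0
```

In this example three of the four requested cores sit idle, so requesting a single core would bring the efficiency back close to 100%.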


Line 113:
** Because of the high cost of GPU nodes, a job using '''a single GPU''' should run significantly faster than it would on a full CPU node.
** If your job '''only finishes 5% or 10% more quickly with a GPU, it's probably not worth''' the effort of waiting to get a node with a GPU, as the GPU will be idle during much of your job's execution.
* '''Other tools for monitoring the efficiency''' of your GPU-based jobs include <code>[https://developer.nvidia.com/nvidia-system-management-interface nvidia-smi]</code>, <code>nvtop</code> and, if you're using software based on [[TensorFlow]], the [[TensorFlow#TensorBoard|TensorBoard]] utility.
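One way to put a number on the comparison between a GPU run and a CPU-only run is to compute how much shorter the GPU run is. The helper below is a sketch with a hypothetical name, and the timings in the example are invented for illustration.

```shell
#!/bin/bash
# Percentage by which the GPU run is shorter than the CPU-only run.
# Arguments: CPU-only run time and GPU run time, in the same unit.
gpu_time_saved() {
    awk -v cpu="$1" -v gpu="$2" \
        'BEGIN { printf "%.1f\n", 100 * (cpu - gpu) / cpu }'
}

# Invented timings: 600 minutes on a CPU node, 560 minutes with a GPU.
gpu_time_saved 600 560    # prints 6.7
```

A result in the 5% to 10% range, as here, suggests the job gains little from the GPU.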


==Avoid wasting resources== <!--T:21-->


<!--T:22-->
* In general, your jobs should never contain the command <code>sleep</code>.
* We strongly recommend against the use of [[Anaconda/en|Conda]] and its variants on the clusters, in favour of solutions like a [[Python#Creating_and_using_a_virtual_environment|Python virtual environment]] or [[Apptainer]].
* Read and write operations should be optimized by '''[[Using_node-local_storage|using node-local storage]]'''.
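A common pattern for node-local storage is to copy the input to the local disk, work there, and copy the results back at the end. The sketch below assumes that the environment variable <code>SLURM_TMPDIR</code> points to the job's node-local directory; the function name, the file names and the line-counting workload are placeholders.

```shell
#!/bin/bash
# Stage input on node-local storage, work there, then copy results back.
# $SLURM_TMPDIR is assumed to point at the node-local directory of the job;
# outside a job, a temporary directory is used so the sketch still runs.
run_on_local_storage() {
    input="$1"
    output_dir="$2"
    workdir="${SLURM_TMPDIR:-$(mktemp -d)}"

    # Copy the input file to the fast local disk.
    cp "$input" "$workdir/" || return 1

    # Placeholder workload: count the lines of the staged copy.
    wc -l < "$workdir/$(basename "$input")" > "$workdir/result.txt" || return 1

    # Copy the result back to permanent storage.
    cp "$workdir/result.txt" "$output_dir/"
}

# Usage (hypothetical paths):
#   run_on_local_storage "$HOME/project/input.dat" "$HOME/project/results"
```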
</translate>
</translate>