Best practices for job submission: Difference between revisions

Line 32: Line 32:


<!--T:9-->
* For jobs which are not tests, the duration should be '''at least one hour'''.
* For jobs which are not tests, the duration should be <b>at least one hour</b>.
** If your computation requires less than an hour, consider using tools like [[GLOST]], [[META:_A_package_for_job_farming | META]] or [[GNU Parallel]] to group several of your computations into a single Slurm job lasting at least an hour. Hundreds or thousands of very short jobs place undue stress on the scheduler.
* It is equally important that your estimate of the '''job duration be relatively accurate'''.
* It is equally important that your estimate of the <b>job duration be relatively accurate</b>.
** Requesting five days when the computation actually finishes in sixteen hours means your job will wait much longer to start than it would have with a more accurate estimate of the duration.
* '''Use [[Running_jobs#Completed_jobs|monitoring tools]]''' to see how long completed jobs took.
* <b>Use [[Running_jobs#Completed_jobs|monitoring tools]]</b> to see how long completed jobs took.
** For example, the <code>Job Wall-clock time</code> field in the output of the <code>seff</code> command:
</translate>
</translate>
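As a rough illustration of the duration advice above, the following Python sketch converts a wall-clock time, such as the value shown in the <code>Job Wall-clock time</code> field, into a padded value suitable for Slurm's <code>--time</code> option. The function names, the default margin of 10%, and the rounding up to a whole minute are illustrative assumptions and not part of Slurm or <code>seff</code>. The sketch also assumes the time is written as <code>HH:MM:SS</code> or <code>D-HH:MM:SS</code>.

```python
import re


def parse_slurm_time(text: str) -> int:
    """Convert '[D-]HH:MM:SS' into a number of seconds."""
    match = re.fullmatch(r"(?:(\d+)-)?(\d+):(\d{2}):(\d{2})", text.strip())
    if match is None:
        raise ValueError(f"unrecognized time format: {text!r}")
    # The day group is None when the 'D-' prefix is absent.
    days, hours, minutes, seconds = (int(g) if g else 0 for g in match.groups())
    return ((days * 24 + hours) * 60 + minutes) * 60 + seconds


def format_slurm_time(total_seconds: int) -> str:
    """Format a number of seconds as 'D-HH:MM:SS'."""
    days, rest = divmod(total_seconds, 86400)
    hours, rest = divmod(rest, 3600)
    minutes, seconds = divmod(rest, 60)
    return f"{days}-{hours:02d}:{minutes:02d}:{seconds:02d}"


def padded_request(wall_clock: str, margin_percent: int = 10) -> str:
    """Add a safety margin to a measured duration (hypothetical helper)."""
    seconds = parse_slurm_time(wall_clock)
    # Integer ceiling division avoids floating-point rounding surprises.
    padded = (seconds * (100 + margin_percent) + 99) // 100
    # Round up to a whole minute (an arbitrary choice made for readability).
    padded = (padded + 59) // 60 * 60
    return format_slurm_time(padded)
```

For instance, <code>padded_request("16:00:00")</code> gives <code>0-17:36:00</code>, which could then be passed to <code>--time</code> in a job script.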
Line 56: Line 56:
<translate>  
<translate>  
<!--T:10-->
* '''Increase the estimated duration by 5% or 10%''', just in case.
* <b>Increase the estimated duration by 5% or 10%</b>, just in case.
** It's natural to leave a certain amount of room for error in the estimate, but otherwise it's in your interest for your estimate of the job's duration to be as accurate as possible.
* Longer jobs, such as those with a duration exceeding 48 hours, should '''consider using [[Points_de_contrôle/en|checkpoints]]''' if the software permits this.
* Longer jobs, such as those with a duration exceeding 48 hours, should <b>consider using [[Points_de_contrôle/en|checkpoints]]</b> if the software permits this.
** With a checkpoint, the program writes a snapshot of its state to a file on disk and can later be restarted from that file, at that precise point in the calculation. That way, even if a power outage or some other interruption affects the compute node(s) used by your job, you won't necessarily lose much work if your program writes a checkpoint file every six or eight hours.
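
The checkpoint idea described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not how any particular application implements checkpointing: the state is a small dictionary saved as JSON, checkpoints are taken every fixed number of steps rather than every few hours, and all names (<code>save_checkpoint</code>, <code>load_checkpoint</code>, <code>run</code>, <code>interrupt_at</code>) are hypothetical.

```python
import json
import os


def save_checkpoint(path, state):
    """Write the state to a temporary file, then rename it into place."""
    tmp_path = path + ".tmp"
    with open(tmp_path, "w") as handle:
        json.dump(state, handle)
    # os.replace swaps the file in one step, so an interruption during
    # the write never leaves a half-written checkpoint at `path`.
    os.replace(tmp_path, path)


def load_checkpoint(path):
    """Return the saved state, or a fresh state if no checkpoint exists."""
    if os.path.exists(path):
        with open(path) as handle:
            return json.load(handle)
    return {"step": 0, "total": 0}


def run(path, n_steps, checkpoint_every, interrupt_at=None):
    """Sum the integers below n_steps, resuming from the last checkpoint."""
    state = load_checkpoint(path)
    while state["step"] < n_steps:
        if state["step"] == interrupt_at:
            # Stand-in for a node failure or the job hitting its time limit.
            raise RuntimeError("simulated interruption")
        state["total"] += state["step"]
        state["step"] += 1
        if state["step"] % checkpoint_every == 0:
            save_checkpoint(path, state)
    save_checkpoint(path, state)
    return state
```

If a run is interrupted, calling <code>run</code> again with the same path resumes from the most recent checkpoint, so only the steps since that checkpoint are repeated.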

