(Removed one bullet level) |
(Section in bullet points) |
||
Line 21: | Line 21: | ||
==Job duration== | ==Job duration== | ||
For jobs which are not tests, the duration should be at least one hour. If your computation requires less than an hour, you should consider using tools like [[GLOST]], [[META:_A_package_for_job_farming | META]] or [[GNU Parallel]] to regroup several of your computations into a single Slurm job with a duration of at least an hour. Hundreds or thousands of very short jobs place undue stress on the scheduler. | * For jobs which are not tests, the duration should be '''at least one hour'''. | ||
** If your computation requires less than an hour, you should consider using tools like [[GLOST]], [[META:_A_package_for_job_farming | META]] or [[GNU Parallel]] to regroup several of your computations into a single Slurm job with a duration of at least an hour. Hundreds or thousands of very short jobs place undue stress on the scheduler. | |||
It is equally important that your estimate of the job duration be relatively accurate | * It is equally important that your estimate of the '''job duration be relatively accurate'''. | ||
** Asking for five days when the computation in reality finishes after just sixteen hours leads to your job spending much more time waiting to start than it would had you given a more accurate estimate of the duration. | |||
* '''Use [[Running_jobs#Completed_jobs|monitoring tools]]''' to see how long completed jobs took. | |||
** For example, the <tt>Job Wall-clock time</tt> field in the output of the <tt>seff</tt> command: | |||
{{Command | {{Command | ||
|seff 1234567 | |seff 1234567 | ||
Line 39: | Line 42: | ||
Memory Efficiency: 11.68% of 128.00 GB (8.00 GB/core) | Memory Efficiency: 11.68% of 128.00 GB (8.00 GB/core) | ||
}} | }} | ||
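The wall-clock time reported by <tt>seff</tt> can be turned into a time request for the next similar job. The following is only a sketch: the helper names <tt>parse_wallclock</tt> and <tt>padded_time_limit</tt> are hypothetical, the 10% default margin is an assumption chosen for illustration, and the parser assumes the field is printed as <tt>Job Wall-clock time: [D-]HH:MM:SS</tt>.

```python
import re


def parse_wallclock(seff_output):
    """Return the job's wall-clock time in seconds.

    Assumes a line of the form 'Job Wall-clock time: [D-]HH:MM:SS'.
    """
    match = re.search(
        r"Job Wall-clock time:\s*(?:(\d+)-)?(\d+):(\d{2}):(\d{2})", seff_output
    )
    if match is None:
        raise ValueError("no 'Job Wall-clock time' field found")
    days, hours, minutes, seconds = (int(g) if g else 0 for g in match.groups())
    return ((days * 24 + hours) * 60 + minutes) * 60 + seconds


def padded_time_limit(seconds, margin_percent=10):
    """Add a safety margin and format the result as D-HH:MM.

    Integer arithmetic avoids floating-point surprises; the result is
    rounded up to the next whole minute.
    """
    scaled = seconds * (100 + margin_percent)   # hundredths of a second
    total_minutes = (scaled + 5999) // 6000     # ceiling division by 60 * 100
    days, remainder = divmod(total_minutes, 24 * 60)
    hours, minutes = divmod(remainder, 60)
    return f"{days}-{hours:02d}:{minutes:02d}"


if __name__ == "__main__":
    observed = parse_wallclock("Job Wall-clock time: 16:00:00")
    print(padded_time_limit(observed))  # value to pass to sbatch --time
```

The <tt>days-hours:minutes</tt> string is one of the formats Slurm accepts for <tt>--time</tt>. A job that ran for sixteen hours would be resubmitted with a request of a little under eighteen hours rather than several days.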
in the | * '''Increase the estimated duration by 5% or 10%''', just in case. | ||
** It's natural to leave a certain amount of room for error in the estimate, but otherwise it's in your interest for your estimate of the job's duration to be as accurate as possible. | |||
Longer jobs, such as those with a duration exceeding 48 hours, should consider using [[Points_de_contrôle/en|checkpoints]] if the software permits this. With a checkpoint, the program writes a snapshot of its state to a diskfile and the program can then be restarted from this diskfile, at that precise point in the calculation. In this way, even if there is a power outage or some other interruption of the compute node(s) being used by your job, you won't necessarily lose much work if your program writes a checkpoint file every six or eight hours. | * Longer jobs, such as those with a duration exceeding 48 hours, should '''consider using [[Points_de_contrôle/en|checkpoints]]''' if the software permits this. | ||
** With a checkpoint, the program writes a snapshot of its state to a diskfile and the program can then be restarted from this diskfile, at that precise point in the calculation. In this way, even if there is a power outage or some other interruption of the compute node(s) being used by your job, you won't necessarily lose much work if your program writes a checkpoint file every six or eight hours. | |||
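The checkpoint-and-restart pattern described above can be sketched as follows. This is a minimal illustration and not the method of any particular application: the function names, the JSON state file and the toy computation (summing integers) are all hypothetical, and the checkpoint interval is counted in steps, whereas a real program would more likely checkpoint on elapsed time, for example every few hours.

```python
import json
import os


def load_checkpoint(path):
    """Return the saved state, or a fresh state if no checkpoint exists."""
    if os.path.exists(path):
        with open(path) as f:
            return json.load(f)
    return {"step": 0, "total": 0}


def save_checkpoint(path, state):
    """Write to a temporary file, then rename it over the old checkpoint.

    An interruption during the write then never leaves a half-written file.
    """
    tmp = path + ".tmp"
    with open(tmp, "w") as f:
        json.dump(state, f)
    os.replace(tmp, path)


def run(path, n_steps, checkpoint_every, stop_after=None):
    """Sum the integers 0..n_steps-1, resuming from `path` if it exists.

    `stop_after` simulates an interruption after that many steps of this run.
    """
    state = load_checkpoint(path)
    done_this_run = 0
    while state["step"] < n_steps:
        state["total"] += state["step"]
        state["step"] += 1
        done_this_run += 1
        if state["step"] % checkpoint_every == 0:
            save_checkpoint(path, state)
        if stop_after is not None and done_this_run >= stop_after:
            return None  # interrupted; work since the last checkpoint is lost
    save_checkpoint(path, state)
    return state["total"]
```

If the first run is interrupted, submitting the same job again picks up from the last checkpoint, so only the work done since that checkpoint has to be repeated.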
==Memory consumption== | ==Memory consumption== |