Running jobs: Difference between revisions

no edit summary
Line 249: Line 249:


<!--T:160-->
<!--T:160-->
'''Do not''' run <code>squeue</code> from a script or program at high frequency, e.g., every few seconds. Responding to <code>squeue</code> adds load to Slurm, and may interfere with its performance or correct operation.  
<b>Do not</b> run <code>squeue</code> from a script or program at high frequency (e.g., every few seconds). Responding to <code>squeue</code> adds load to Slurm and may interfere with its performance or correct operation.  
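
As an illustration of what a low-frequency check could look like, the following sketch waits for a job to leave the queue while querying Slurm only once every few minutes. The names <code>wait_for_job</code>, <code>POLL_INTERVAL</code> and <code>SQUEUE_CMD</code> are hypothetical, and the 300-second default is an assumption rather than a site recommendation. Only the <code>squeue</code> options <code>-h</code> (no header) and <code>-j</code> (select a job by ID) come from Slurm itself.

```bash
#!/bin/bash
# Sketch only: wait_for_job, POLL_INTERVAL and SQUEUE_CMD are illustrative
# names, not part of Slurm.

# Seconds between queries. 300 is an assumed, deliberately conservative value.
POLL_INTERVAL="${POLL_INTERVAL:-300}"
# Command used to query the queue; overridable so the helper can be tried offline.
SQUEUE_CMD="${SQUEUE_CMD:-squeue}"

wait_for_job() {
    local jobid="$1"
    # 'squeue -h -j ID' prints one line while the job is pending or running,
    # and nothing once the job has left the queue.
    while [ -n "$("$SQUEUE_CMD" -h -j "$jobid" 2>/dev/null)" ]; do
        sleep "$POLL_INTERVAL"
    done
    echo "job $jobid is no longer in the queue"
}
```

After sourcing this file, <code>wait_for_job 123456</code> would block until job 123456 (a made-up ID) no longer appears in <code>squeue</code> output. When the goal is only to learn that a job has started or ended, email notification avoids polling altogether.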


==== Email notification ==== <!--T:149-->
==== Email notification ==== <!--T:149-->
Line 322: Line 322:


<!--T:136-->
<!--T:136-->
'''Note:''' The <code>srun</code> commands shown above work only to monitor a job submitted with <code>sbatch</code>. To monitor an interactive job, create multiple panes with <code>tmux</code> and start each process in its own pane.
<b>Note:</b> The <code>srun</code> commands shown above work only to monitor a job submitted with <code>sbatch</code>. To monitor an interactive job, create multiple panes with <code>tmux</code> and start each process in its own pane.


==Cancelling jobs== <!--T:37-->
==Cancelling jobs== <!--T:37-->
Line 343: Line 343:
<!--T:75-->
<!--T:75-->
When a computation is going to require a long time to complete, so long that it cannot be done within the time limits on the system,  
When a computation is going to require a long time to complete, so long that it cannot be done within the time limits on the system,  
the application you are running must support [[Points de contrôle/en|checkpointing]]. The application should be able to save its state to a file, called a ''checkpoint file'', and
the application you are running must support [[Points de contrôle/en|checkpointing]]. The application should be able to save its state to a file, called a <i>checkpoint file</i>, and
then it should be able to restart and continue the computation from that saved state.  
then it should be able to restart and continue the computation from that saved state.  


Line 353: Line 353:
<!--T:77-->
<!--T:77-->
Here are two recommended methods of automatic restarting:
Here are two recommended methods of automatic restarting:
* Using Slurm '''job arrays'''.
* Using Slurm <b>job arrays</b>.
* Resubmitting from the end of the job script.
* Resubmitting from the end of the job script.
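
The second method can be sketched as a job script that restarts from a checkpoint file when one exists and submits itself again until the work is finished. This is a minimal illustration under stated assumptions, not a tested recipe: <code>my_app</code>, its <code>--restart</code> option, <code>state.cpt</code>, <code>done.flag</code> and <code>job_script.sh</code> are all hypothetical names, and the script assumes the job runs in the directory it was submitted from.

```bash
#!/bin/bash
#SBATCH --time=0-01:00
# Sketch only: my_app, --restart, state.cpt, done.flag and job_script.sh
# are hypothetical names chosen for illustration.

CHECKPOINT="${CHECKPOINT:-state.cpt}"   # file the application saves its state to
DONE_FLAG="${DONE_FLAG:-done.flag}"     # assumed to be created when all work is finished
SBATCH_CMD="${SBATCH_CMD:-sbatch}"      # overridable so the logic can be tried offline

run_step() {
    # Continue from the saved state if a checkpoint exists, else start fresh.
    if [ -e "$CHECKPOINT" ]; then
        my_app --restart "$CHECKPOINT"
    else
        my_app
    fi
}

resubmit_if_needed() {
    # Submit the given script again unless the work is marked as complete.
    if [ ! -e "$DONE_FLAG" ]; then
        "$SBATCH_CMD" "$1"
    fi
}

# Slurm sets SLURM_JOB_ID inside a job, so nothing runs when the file is
# merely sourced on a login node.
if [ -n "${SLURM_JOB_ID:-}" ]; then
    run_step
    resubmit_if_needed "job_script.sh"
fi
```

Resubmitting only when <code>done.flag</code> is absent keeps the chain from running forever. For the job-array method, a directive such as <code>#SBATCH --array=1-10%1</code> would instead ask Slurm to run ten array tasks one at a time, each of which could restart from the same checkpoint file.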


rsnt_translations