Using node-local storage: Difference between revisions

amplify signal trapping advice
No edit summary
(amplify signal trapping advice)
Line 55: Line 55:
Output data must be copied from <code>$SLURM_TMPDIR</code> back to some permanent storage before the
Output data must be copied from <code>$SLURM_TMPDIR</code> back to some permanent storage before the
job ends.  If a job times out, then the last few lines of the job script might not  
job ends.  If a job times out, then the last few lines of the job script might not  
be executed.  This can be addressed two ways:
be executed.  This can be addressed three ways:
* First, obviously, request enough runtime to let the application finish, although we understand that this isn't always possible.
* First, obviously, request enough runtime to let the application finish, although we understand that this isn't always possible.
* Write [[Points_de_contrôle/en|checkpoints]] to network storage, not to <code>$SLURM_TMPDIR</code>.
* Write [[Points_de_contrôle/en|checkpoints]] to network storage, not to <code>$SLURM_TMPDIR</code>.
* Signal trapping.


== Signal trapping == <!--T:27-->
== Signal trapping == <!--T:27-->
You can arrange that Slurm will send a signal to your job shortly before the runtime expires,
and that when that happens your job will copy your output from <code>$SLURM_TMPDIR</code> back to network storage.
This may be useful if your runtime estimate is uncertain,
or if you are chaining together several Slurm jobs to complete a long calculation.


<!--T:28-->
<!--T:28-->
You can use [https://slurm.schedmd.com/sbatch.html <code>--signal</code>] to get Slurm to send your script a signal shortly before the runtime expires. 
To do so you will need to write a shell function to do the copying,  
To take advantage of this, write a shell function which copies your output from <code>$SLURM_TMPDIR</code> back to network storage,  
and use the <code>trap</code> shell command to associate the function with the signal.
and use the <code>trap</code> shell command to associate the function with the signal.
This may be useful if your runtime estimate is uncertain,
or if you are chaining together several Slurm jobs to complete a long calculation.
However, it will not preserve the contents of <code>$SLURM_TMPDIR</code> in the case of a node failure.
See [https://services.criann.fr/en/services/hpc/cluster-myria/guide/signals-sent-by-slurm/ this page] from
See [https://services.criann.fr/en/services/hpc/cluster-myria/guide/signals-sent-by-slurm/ this page] from
CRIANN for an example script and detailed guidance.
CRIANN for an example script and detailed guidance.
This method will not preserve the contents of <code>$SLURM_TMPDIR</code> in the case of a node failure,
or certain malfunctions of the network file system.
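
As a minimal sketch of the approach described above, a job script might look roughly like the following. The 120-second lead time, the <code>output</code> subdirectory, the <code>results</code> destination and the <code>run_application</code> stand-in are illustrative assumptions, not taken from this page or from the CRIANN guide. The <code>mktemp</code> fallback is only there so that the script can be tried outside Slurm.

```shell
#!/bin/bash
#SBATCH --time=01:00:00
# Ask Slurm to send SIGUSR1 to the batch shell (B:) about 120 s before the time limit.
# The 120 s margin is an assumption; allow enough time for your copy to complete.
#SBATCH --signal=B:USR1@120

# Fallback so the sketch can be tried outside a Slurm job (assumption).
SLURM_TMPDIR="${SLURM_TMPDIR:-$(mktemp -d)}"
# Hypothetical destination on network storage; adjust to your own project or scratch path.
RESULTS_DIR="${SLURM_SUBMIT_DIR:-$PWD}/results"

# Copy the contents of the node-local output directory back to network storage.
save_output() {
    mkdir -p "$RESULTS_DIR"
    cp -r "$SLURM_TMPDIR/output/." "$RESULTS_DIR/"
}

# Associate the function with the signal.
trap 'echo "Caught SIGUSR1, saving output"; save_output' USR1

# Hypothetical stand-in for the real application; it writes to node-local storage.
run_application() {
    sleep 1
    echo "finished" > "$SLURM_TMPDIR/output/result.txt"
}

mkdir -p "$SLURM_TMPDIR/output"

# Run the application in the background and wait for it: the shell only runs
# a trap promptly while it is in "wait", not while a foreground command runs.
run_application &
wait $!

# Normal completion, or continuation after the trap ran: copy the output.
save_output
```

If the signal arrives, <code>wait</code> returns early, the trap copies whatever has been written so far, and the final <code>save_output</code> call copies it once more, which is redundant but harmless. As noted above, none of this helps if the node itself fails.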


= Multinode jobs = <!--T:12-->
= Multinode jobs = <!--T:12-->