Using node-local storage: Difference between revisions

Revision as of 17:26, 17 March 2023

66 bytes added, 1 year ago

amplify signal trapping advice

Rdickson

@@ Line 55: / Line 55: @@
 Output data must be copied from <code>$SLURM_TMPDIR</code> back to some permanent storage before the
 job ends.  If a job times out, then the last few lines of the job script might not
-be executed.  This can be addressed two ways:
+be executed.  This can be addressed three ways:
 * First, obviously, request enough runtime to let the application finish, although we understand that this isn't always possible.
 * Write [[Points_de_contrôle/en|checkpoints]] to network storage, not to <code>$SLURM_TMPDIR</code>.
+* Signal trapping.
 == Signal trapping == <!--T:27-->
+You can arrange for Slurm to send a signal to your job shortly before the runtime expires,
+and for your job to respond by copying your output from <code>$SLURM_TMPDIR</code> back to network storage.
+This may be useful if your runtime estimate is uncertain,
+or if you are chaining together several Slurm jobs to complete a long calculation.
 <!--T:28-->
-You can use [https://slurm.schedmd.com/sbatch.html <code>--signal</code>] to get Slurm to send your script a signal shortly before the runtime expires.
+To do so, write a shell function that does the copying,
-To take advantage of this, write a shell function which copies your output from <code>$SLURM_TMPDIR</code> back to network storage,
 and use the <code>trap</code> shell command to associate the function with the signal.
-This may be useful if your runtime estimate is uncertain,
-or if you are chaining together several Slurm jobs to complete a long calculation.
-However, it will not preserve the contents of <code>$SLURM_TMPDIR</code> in the case of a node failure.
 See [https://services.criann.fr/en/services/hpc/cluster-myria/guide/signals-sent-by-slurm/ this page] from
 CRIANN for an example script and detailed guidance.
+This method will not preserve the contents of <code>$SLURM_TMPDIR</code> in the case of a node failure,
+or certain malfunctions of the network file system.
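The approach described above might be sketched as follows. This is a minimal illustration, not a tested production script: the function name <code>save_output</code>, the variables <code>RESULTS_DIR</code> and <code>WORK_SECONDS</code>, the file name <code>output.dat</code>, and the 120-second warning interval are all hypothetical choices, and the background subshell merely stands in for a real application. The sketch assumes the <code>--signal=B:USR1@120</code> form of the sbatch option, which asks Slurm to signal only the batch shell. The application is started in the background and followed by <code>wait</code>, because bash defers a trap handler until the current foreground command has finished.

```shell
#!/bin/bash
#SBATCH --time=01:00:00
#SBATCH --signal=B:USR1@120   # assumed setting: send SIGUSR1 to the batch shell about 120 s before the time limit

# Hypothetical locations. Outside of Slurm they fall back to temporary
# directories so that the sketch can be tried on any machine.
SLURM_TMPDIR="${SLURM_TMPDIR:-$(mktemp -d)}"
RESULTS_DIR="${RESULTS_DIR:-$(mktemp -d)}"   # in a real job: a directory on network storage

# Shell function that does the copying.
save_output() {
    echo "Copying output from $SLURM_TMPDIR to $RESULTS_DIR"
    cp -r "$SLURM_TMPDIR"/. "$RESULTS_DIR"/
}

# Handler run when the signal arrives: stop the application,
# save whatever it has written so far, and end the job script.
on_signal() {
    if [ -n "${app_pid:-}" ]; then
        kill "$app_pid" 2>/dev/null
    fi
    save_output
    exit 0
}

# Associate the handler with the signal.
trap on_signal USR1

# Stand-in for the real application: writes a file to node-local
# storage and then "computes" for a short while.
(
    cd "$SLURM_TMPDIR" || exit 1
    echo "partial result" > output.dat
    sleep "${WORK_SECONDS:-1}"
) &
app_pid=$!

# Wait in the foreground; a trapped signal interrupts wait immediately.
wait "$app_pid"

# Normal completion: the application finished before the time limit.
save_output
```

The warning interval should be long enough for the copy to finish; how long that takes depends on the volume of output and the state of the network file system, so the value above is only a placeholder.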
 = Multinode jobs = <!--T:12-->