No edit summary |
(amplify signal trapping advice) |
Line 55:
Output data must be copied from <code>$SLURM_TMPDIR</code> back to some permanent storage before the
job ends. If a job times out, then the last few lines of the job script might not
be executed. This can be addressed in three ways:
* First, obviously, request enough runtime to let the application finish, although we understand that this isn't always possible.
* Write [[Points_de_contrôle/en|checkpoints]] to network storage, not to <code>$SLURM_TMPDIR</code>.
* Trap a signal that Slurm sends shortly before the job ends, as described below.
== Signal trapping == <!--T:27-->
You can arrange for Slurm to send a signal to your job shortly before the runtime expires,
and for your job to respond by copying your output from <code>$SLURM_TMPDIR</code> back to network storage.
This may be useful if your runtime estimate is uncertain,
or if you are chaining together several Slurm jobs to complete a long calculation.
<!--T:28-->
To do so, you will need to write a shell function to do the copying,
and use the <code>trap</code> shell command to associate the function with the signal.
See [https://services.criann.fr/en/services/hpc/cluster-myria/guide/signals-sent-by-slurm/ this page] from
CRIANN for an example script and detailed guidance.
This method will not preserve the contents of <code>$SLURM_TMPDIR</code> in the case of a node failure,
or certain malfunctions of the network file system.
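The approach described above might look roughly like the following job script. This is a minimal sketch under stated assumptions, not a tested recipe for any particular cluster: the <code>--signal=B:USR1@120</code> request (send <code>SIGUSR1</code> to the batch shell about 120 seconds before the time limit), the <code>output</code> subdirectory, the <code>RESULTS_DIR</code> variable, and the <code>run_application</code> placeholder are all illustrative choices. It also assumes the submission directory is on network storage. The application is started in the background and followed by <code>wait</code>, because bash only runs a trap handler once the foreground command has returned.

```shell
#!/bin/bash
#SBATCH --time=01:00:00
#SBATCH --signal=B:USR1@120   # assumption: 120 s is enough time to copy the output

# Fallback so the sketch can also be tried outside of a Slurm job.
: "${SLURM_TMPDIR:=$(mktemp -d)}"

# Hypothetical destination; in a real job, point this at network storage.
RESULTS_DIR="${RESULTS_DIR:-${SLURM_SUBMIT_DIR:-$PWD}/results-${SLURM_JOB_ID:-demo}}"

# Copy the output directory from node-local storage to the destination.
save_output() {
    mkdir -p "$RESULTS_DIR"
    cp -r "$SLURM_TMPDIR/output" "$RESULTS_DIR"/
}

# Handler for the warning signal: save what exists so far, then stop the script.
on_signal() {
    echo "Time limit approaching, copying output to $RESULTS_DIR"
    save_output
    exit 0
}
trap on_signal USR1

# Placeholder for the real application (hypothetical).
run_application() {
    mkdir -p "$SLURM_TMPDIR/output"
    echo "partial result" > "$SLURM_TMPDIR/output/result.dat"
    sleep 1
}

# Run in the background and wait, so the trap can fire while the application runs.
run_application &
wait "$!"

# Normal completion: copy the final output.
save_output
```

If the copy itself takes longer than the lead time requested with <code>--signal</code>, the job will still be killed partway through, so the lead time should be chosen with the size of the output in mind.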
= Multinode jobs = <!--T:12-->