<translate>

<!--T:1-->
When [[Running jobs|Slurm]] starts a job, it creates a temporary directory on each node assigned to the job.
It then sets the full path name of that directory in an environment variable called <code>SLURM_TMPDIR</code>.
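
For example, you can confirm where this directory is from inside any job; this trivial check is not specific to any cluster:
<pre>
# Inside a job script (or an interactive job started with salloc):
echo "Node-local temporary directory: $SLURM_TMPDIR"
ls -ld $SLURM_TMPDIR
</pre>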

<!--T:2-->
Because this directory resides on local disk, input and output (I/O) to it
is almost always faster than I/O to [[Storage and file management|network storage]] (/project, /scratch, or /home).
Any job that does a significant amount of I/O can therefore be expected
to run more quickly if it uses <code>$SLURM_TMPDIR</code> instead of network storage.

<!--T:3-->
The temporary character of <code>$SLURM_TMPDIR</code> makes it more trouble to use than
network storage.
Input must be copied into it before it can be read there, and output must be
copied back to network storage before the job ends
to preserve it for later use.
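
Put together, a serial job using <code>$SLURM_TMPDIR</code> often follows the pattern sketched below; the program name and paths are placeholders only, not taken from this page:
<pre>
#!/bin/bash
#SBATCH --time=3:00:00
#SBATCH --mem=4000M

# Stage input from network storage onto the node-local disk (placeholder path).
cp ~/projects/def-someuser/input.dat $SLURM_TMPDIR/

# Do the work where I/O is fast.
cd $SLURM_TMPDIR
my_program input.dat > output.dat      # placeholder application

# Copy the output back to network storage before the job ends.
cp output.dat ~/scratch/
</pre>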

= Input = <!--T:4-->

<!--T:5-->
In order to ''read'' data from <code>$SLURM_TMPDIR</code>, you must first copy the data there.
In the simplest case you can do this with <code>cp</code> or <code>rsync</code>,
for example (the path below is a placeholder):
<pre>
cp ~/projects/def-someuser/input.dat $SLURM_TMPDIR/
</pre>
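
If the input is a whole directory tree rather than a single file, <code>rsync</code> can copy it recursively; the source path below is again a placeholder:
{{Command|rsync -a ~/projects/def-someuser/dataset/ $SLURM_TMPDIR/dataset/}}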

<!--T:6-->
This may not work if the input is too large, or if it must be read by processes on different nodes.
See "Amount of space" and "Multi-node jobs" below for more.

== Executable files and libraries == <!--T:7-->

<!--T:8-->
A special case of input is the application code itself.
In order to run the application, the shell started by Slurm must open
and read the executable file. Few applications consist of a single file;
most also need several other files (such as libraries) in order to work.

<!--T:9-->
In particular, we find that running an application from a [[Python]] virtual environment
generates a large number of small I/O transactions, more than it takes
to create the virtual environment in the first place.
This is why we recommend creating the virtual environment inside your job,
using <code>$SLURM_TMPDIR</code>.
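
As a sketch of that recommendation (the module name, requirements file, and script are placeholders; the <code>--no-download</code> and <code>--no-index</code> options assume locally provided Python wheels):
<pre>
#!/bin/bash
#SBATCH --time=2:00:00

module load python                     # placeholder: load the Python module you need

# Build the virtual environment on fast node-local disk.
virtualenv --no-download $SLURM_TMPDIR/env
source $SLURM_TMPDIR/env/bin/activate
pip install --no-index --upgrade pip
pip install --no-index -r requirements.txt   # placeholder requirements file

python my_script.py                    # placeholder application
</pre>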

= Output = <!--T:10-->

<!--T:11-->
Output data must be copied from <code>$SLURM_TMPDIR</code> back to some permanent storage before the
job ends. If a job times out, then the last few lines of the job script might not
be executed, and any output still in <code>$SLURM_TMPDIR</code> will be lost. To reduce this risk:
* Write [[Points_de_contrôle/en|checkpoints]] to network storage, not to <code>$SLURM_TMPDIR</code>.
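
One defensive pattern, sketched below with placeholder names and paths (it is not taken from this page), is to ask Slurm to signal the job shortly before the time limit and to copy results back both when that signal arrives and at normal completion:
<pre>
#!/bin/bash
#SBATCH --time=10:00:00
#SBATCH --signal=B:USR1@300     # send USR1 to the batch shell 5 minutes before the time limit

save_results() {
    # Copy whatever results exist back to network storage (placeholder paths).
    mkdir -p ~/scratch/my_run
    cp -r $SLURM_TMPDIR/results ~/scratch/my_run/
}
trap save_results USR1

cd $SLURM_TMPDIR
mkdir -p results
my_program --output-dir results &    # placeholder application, run in the background
wait                                  # returns early if USR1 arrives

save_results                          # also copy results after a normal finish
</pre>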

= Multi-node jobs = <!--T:12-->

<!--T:13-->
If a job spans multiple nodes and some data is needed on every node, then a simple <code>cp</code> or <code>tar -x</code> will not suffice.

== Copy files == <!--T:14-->

<!--T:15-->
Copy one or more files to the <code>$SLURM_TMPDIR</code> directory on every node allocated to the job, like this:
{{Command|pdcp -w $(slurm_hl2hl.py --format PDSH) file [files...] $SLURM_TMPDIR}}

<!--T:16-->
Or use GNU Parallel to do the same:
{{Command|parallel -S $(slurm_hl2hl.py --format GNU-Parallel) --env SLURM_TMPDIR --workdir $PWD --onall cp file [files...] ::: $SLURM_TMPDIR}}
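
If neither <code>pdcp</code> nor GNU Parallel is available, a plain <code>srun</code> copy run once per node can serve the same purpose; this is a sketch, not taken from this page, and it assumes the file lives on network storage visible from every node:
<pre>
srun --ntasks=$SLURM_NNODES --ntasks-per-node=1 cp file $SLURM_TMPDIR/
</pre>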

== Compressed archives == <!--T:17-->

=== ZIP === <!--T:18-->

<!--T:19-->
Extract into <code>$SLURM_TMPDIR</code> on every node:
{{Command|pdsh -w $(slurm_hl2hl.py --format PDSH) unzip archive.zip -d $SLURM_TMPDIR}}

=== Tarball === <!--T:20-->
Extract into <code>$SLURM_TMPDIR</code> on every node:
{{Command|pdsh -w $(slurm_hl2hl.py --format PDSH) tar -xvf archive.tar.gz -C $SLURM_TMPDIR}}

= Amount of space = <!--T:21-->

<!--T:22-->
At '''[[Niagara]]''' <code>$SLURM_TMPDIR</code> is implemented as "RAMdisk",
so the amount of space available is limited by the memory on the node,
less the memory your own application uses.
See [[Data_management_at_Niagara#.24SLURM_TMPDIR_.28RAM.29|Data management at Niagara]] for more.

<!--T:23-->
At the general-purpose clusters [[Béluga/en|Béluga]], [[Cedar]], and [[Graham]],
the amount of space available depends on the cluster and the node to which your job is assigned.

<!--T:24-->
{| class="wikitable sortable"
! Cluster !! Space in $SLURM_TMPDIR !! Size of disk
|}

<!--T:25-->
The table above gives the amount of space in <code>$SLURM_TMPDIR</code> on the ''smallest'' node in each cluster.
If your job reserves [[Advanced_MPI_scheduling#Whole_nodes|whole nodes]],
you can expect at least this much space to be available on each node.
If your job requests less than a whole node, other jobs may write to the same local disk
(but not the same directory!), reducing the space available to your job.

<!--T:26-->
Some nodes at each site have more local disk than shown above.
See "Node characteristics" on the appropriate page ([[Béluga/en|Béluga]], [[Cedar]], [[Graham]]) for guidance.
</translate>