Revision as of 18:29, 31 July 2020


When Slurm starts a job, it creates a temporary directory on each node assigned to the job. It then stores the full path of that directory in the environment variable SLURM_TMPDIR.
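A quick way to see this in practice is to print the variable from inside a job. The following is a minimal sketch rather than site documentation: the fallback to a directory created with mktemp is an assumption added only so that the script also runs outside Slurm, where SLURM_TMPDIR is normally unset.

```shell
#!/bin/bash
#SBATCH --time=00:05:00

# Inside a job, Slurm sets SLURM_TMPDIR. The mktemp fallback is an
# assumption for trying this script outside of a job.
workdir="${SLURM_TMPDIR:-$(mktemp -d)}"

echo "Node-local directory: $workdir"
# Show the filesystem that holds the directory and its free space.
df -h "$workdir"
```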

Because this directory resides on local disk, input and output (I/O) to it is almost always faster than I/O to network storage (/project, /scratch, or /home). In particular, local disk handles frequent small I/O transactions better than network storage does. Any job doing a lot of I/O (which is most jobs!) can expect to run more quickly if it uses $SLURM_TMPDIR instead of network storage.

Because $SLURM_TMPDIR is temporary, it takes more effort to use than network storage. Input must be copied from network storage to $SLURM_TMPDIR before it can be read, and output must be copied from $SLURM_TMPDIR back to network storage before the job ends to preserve it for later use.

Input

In order to read data from $SLURM_TMPDIR, you must first copy the data there. In the simplest case you can do this with cp or rsync:

cp /project/def-someone/you/input.files.* $SLURM_TMPDIR/

This may not work if the input is too large, or if it must be read by processes on different nodes. See "Amount of space" and "Multi-node jobs" below for more.
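The copy step can be wrapped in a small helper so that the rest of the job script reads from local disk. This is a sketch under assumptions: the function name stage_in and the inputs directory are hypothetical, and only cp and $SLURM_TMPDIR come from the text above.

```shell
#!/bin/bash

# stage_in SRC_DIR DEST_DIR
# Copy the contents of SRC_DIR (network storage) into DEST_DIR (local disk).
# The name stage_in is hypothetical, chosen for this example.
stage_in() {
    local src_dir="$1"
    local dest_dir="$2"
    mkdir -p "$dest_dir" || return 1
    # -a keeps permissions and timestamps; "/." copies the contents of
    # src_dir rather than the directory itself.
    cp -a "$src_dir"/. "$dest_dir"/
}

# Only stage data inside a Slurm job, where SLURM_TMPDIR is set.
# The source path below is a placeholder.
if [ -n "${SLURM_TMPDIR:-}" ]; then
    stage_in /project/def-someone/you/inputs "$SLURM_TMPDIR/inputs"
fi
```

The application would then be pointed at $SLURM_TMPDIR/inputs instead of the original location.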

Executable files and libraries

A special case of input is the application code itself. In order to run the application, the shell started by Slurm must open at least one application file, which it typically reads from network storage. But few applications these days consist of exactly one file; most also need several other files (such as libraries) in order to work.

In particular, we find that using an application in a Python virtual environment generates a large number of small I/O transactions, more than it takes to create the virtual environment in the first place. This is why we recommend creating virtual environments inside your jobs using $SLURM_TMPDIR.
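As an illustration of that recommendation, a job script might build the environment on local disk before starting the application. This is a sketch under assumptions: it presumes that the virtualenv command is available (for example after loading a Python module), that packages can be installed without contacting a package index, and that a requirements.txt file exists. The helper name make_job_venv is hypothetical.

```shell
#!/bin/bash

# make_job_venv ENV_DIR REQUIREMENTS_FILE
# Create a virtual environment in ENV_DIR and install the listed packages.
# The name make_job_venv is hypothetical, chosen for this example.
make_job_venv() {
    local env_dir="$1"
    local requirements="$2"
    # --no-download: seed the environment from the packages bundled with
    # virtualenv instead of fetching newer ones.
    virtualenv --no-download "$env_dir" || return 1
    # Activate the environment in the current shell.
    . "$env_dir/bin/activate"
    # --no-index: do not contact a package index. This assumes wheels are
    # available locally, for example through the site's pip configuration.
    pip install --no-index -r "$requirements"
}

# Only build the environment inside a Slurm job, where SLURM_TMPDIR is set.
if [ -n "${SLURM_TMPDIR:-}" ]; then
    make_job_venv "$SLURM_TMPDIR/env" requirements.txt
fi
```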

Output

Output data must be copied from $SLURM_TMPDIR back to permanent storage before the job ends. If a job times out, the last few lines of the job script might not be executed, so a copy step placed there may never run. This can be addressed in two ways:

Request enough run time to let the application finish. We understand that this isn't always possible.
Write checkpoints to network storage, not to $SLURM_TMPDIR.
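One way to keep the copy-back step short is to pack the output into a single archive on network storage as the last action of the job script. The sketch below is an illustration, not a method prescribed here: the function name stage_out, the results directory, and the archive path are hypothetical. It does not help if the job is killed before this step runs, which is why the two points above still matter.

```shell
#!/bin/bash

# stage_out SRC_DIR ARCHIVE
# Pack the contents of SRC_DIR (local disk) into ARCHIVE (network storage).
# The name stage_out is hypothetical, chosen for this example.
stage_out() {
    local src_dir="$1"
    local archive="$2"
    mkdir -p "$(dirname "$archive")" || return 1
    # -C changes into src_dir first, so the archive holds relative paths.
    tar -czf "$archive" -C "$src_dir" .
}

# Only run inside a Slurm job that has produced a results directory.
# Both paths below are placeholders.
if [ -n "${SLURM_TMPDIR:-}" ] && [ -d "$SLURM_TMPDIR/results" ]; then
    stage_out "$SLURM_TMPDIR/results" \
        "/project/def-someone/you/results-${SLURM_JOB_ID}.tar.gz"
fi
```

Writing one archive instead of many small files also keeps the number of I/O transactions against network storage low.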

Multi-node jobs

If a job spans multiple nodes and some data is needed on every node, then a simple cp or tar -x will not suffice.

Copy files

To copy one or more files to the $SLURM_TMPDIR directory on every allocated node:

[name@server ~]$ pdcp -w $(slurm_hl2hl.py --format PDSH) file [files...] $SLURM_TMPDIR

Or use GNU Parallel to do the same:

[name@server ~]$ parallel -S $(slurm_hl2hl.py --format GNU-Parallel) --env SLURM_TMPDIR --workdir $PWD --onall cp file [files...] ::: $SLURM_TMPDIR

Compressed archives

ZIP

Extract a ZIP archive to $SLURM_TMPDIR on every node:

[name@server ~]$ pdsh -w $(slurm_hl2hl.py --format PDSH) unzip archive.zip -d $SLURM_TMPDIR

Tarball

Extract a tarball to $SLURM_TMPDIR on every node:

[name@server ~]$ pdsh -w $(slurm_hl2hl.py --format PDSH) tar -xvf archive.tar.gz -C $SLURM_TMPDIR

Amount of space

At Niagara, $SLURM_TMPDIR is implemented as a "RAMdisk", so the amount of space available is limited by the memory on the node, less the amount of RAM used by your application. See Data management at Niagara for more.

At the general-purpose clusters Béluga, Cedar, and Graham, the amount of space available depends on the cluster and the node to which your job is assigned.

cluster	space in $SLURM_TMPDIR	size of disk
Béluga	370G	480G
Cedar	840G	960G
Graham	750G	960G

The table above gives the amount of space in $SLURM_TMPDIR on the smallest node in each cluster. If your job reserves whole nodes then you can reasonably assume that this much space is available to you in $SLURM_TMPDIR on each node. However, if the job requests less than a whole node, then other jobs may also write to the same filesystem (but not the same directory!), reducing the space available to your job.

Some nodes at each site have more local disk than shown above. See "Node characteristics" on the appropriate page (Béluga, Cedar, Graham) for guidance.
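Before staging a large data set, a job can compare the size of the input with the free space that the local filesystem reports. This is a minimal sketch, assuming POSIX du and df are available on the node; the helper name fits_in and the input path are hypothetical. On a shared node the free space can shrink after the check, so treat the result as a hint rather than a guarantee.

```shell
#!/bin/bash

# fits_in SRC DEST_DIR
# Succeed if the data under SRC is no larger than the free space on the
# filesystem that holds DEST_DIR.
# The name fits_in is hypothetical, chosen for this example.
fits_in() {
    local src="$1"
    local dest="$2"
    local need_kb
    local avail_kb
    # du -sk: total size of SRC in 1024-byte units.
    need_kb=$(du -sk "$src" | awk '{print $1}')
    # df -Pk: POSIX output format; on the data line, column 4 is the
    # available space in 1024-byte units.
    avail_kb=$(df -Pk "$dest" | awk 'NR==2 {print $4}')
    [ "$need_kb" -le "$avail_kb" ]
}

# Only check inside a Slurm job, where SLURM_TMPDIR is set.
# The input path below is a placeholder.
if [ -n "${SLURM_TMPDIR:-}" ]; then
    if fits_in /project/def-someone/you/inputs "$SLURM_TMPDIR"; then
        echo "Input should fit in $SLURM_TMPDIR"
    else
        echo "Input is larger than the free space in $SLURM_TMPDIR" >&2
    fi
fi
```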


Using node-local storage: Difference between revisions
