Using node-local storage: Difference between revisions

Revision as of 18:29, 31 July 2020

303 bytes added, 4 years ago

Marked this version for translation

Rdickson

@@ Line 3: / Line 3: @@
 <translate>
+<!--T:1-->
 When [[Running jobs|Slurm]] starts a job, it creates a temporary directory on each node assigned to the job.
 It then sets the full path name of that directory in an environment variable called <code>SLURM_TMPDIR</code>.
+<!--T:2-->
 Because this directory resides on local disk, input and output (I/O) to it
 is almost always faster than I/O to [[Storage and file management|network storage]] (/project, /scratch, or /home).
@@ Line 12: / Line 14: @@
 to run more quickly if it uses <code>$SLURM_TMPDIR</code> instead of network storage.
+<!--T:3-->
 The temporary nature of <code>$SLURM_TMPDIR</code> makes it more troublesome to use than
 network storage.
@@ Line 18: / Line 21: @@
 to preserve it for later use.
-= Input =
+= Input = <!--T:4-->
+<!--T:5-->
 In order to ''read'' data from <code>$SLURM_TMPDIR</code>, you must first copy the data there.
 In the simplest case you can do this with <code>cp</code> or <code>rsync</code>:
@@ Line 26: / Line 30: @@
 </pre>
+<!--T:6-->
 This may not work if the input is too large, or if it must be read by processes on different nodes.
 See "Amount of space" and "Multi-node jobs" below for more.
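
One way to guard against the "input is too large" case is to compare the size of the input with the free space in <code>$SLURM_TMPDIR</code> before copying. The following is a minimal sketch, not an official recipe: the function name <code>fits_in</code> and the variable <code>INPUT_DIR</code> are hypothetical, and the check uses only the standard <code>du</code> and <code>df</code> utilities.

```shell
#!/bin/bash
# Sketch only: fits_in and INPUT_DIR are hypothetical names, not part of Slurm.

# fits_in SRC DEST
# Succeeds (exit status 0) if the data under SRC is smaller than
# the free space on the filesystem that holds DEST.
fits_in() {
    local src="$1" dest="$2" need_kb free_kb
    # Total size of the source, in KiB.
    need_kb=$(du -sk "$src" | awk '{print $1}')
    # Available space at the destination, in KiB (-P keeps it on one line).
    free_kb=$(df -Pk "$dest" | awk 'NR==2 {print $4}')
    [ "$need_kb" -lt "$free_kb" ]
}

# Usage inside a job script: copy only when the input fits.
if [ -n "${SLURM_TMPDIR:-}" ] && [ -n "${INPUT_DIR:-}" ]; then
    if fits_in "$INPUT_DIR" "$SLURM_TMPDIR"; then
        cp -r "$INPUT_DIR" "$SLURM_TMPDIR"/
    else
        echo "Input does not fit in $SLURM_TMPDIR; reading from network storage" >&2
    fi
fi
```

The check is only approximate: other jobs sharing the node may consume local disk after the test is made, so it is prudent to leave some margin.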
-== Executable files and libraries ==
+== Executable files and libraries == <!--T:7-->
+<!--T:8-->
 A special case of input is the application code itself.
 In order to run the application, the shell started by Slurm must open
@@ Line 37: / Line 43: @@
 most also need several other files (such as libraries) in order to work.
+<!--T:9-->
 We particularly find that using an application in a [[Python]] virtual environment
 generates a large number of small I/O transactions, more than it takes
@@ Line 43: / Line 50: @@
 using <code>$SLURM_TMPDIR</code>.
-= Output =
+= Output = <!--T:10-->
+<!--T:11-->
 Output data must be copied from <code>$SLURM_TMPDIR</code> back to some permanent storage before the
 job ends.  If a job times out, then the last few lines of the job script might not
@@ Line 51: / Line 59: @@
 * Write [[Points_de_contrôle/en|checkpoints]] to network storage, not to <code>$SLURM_TMPDIR</code>.
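
The copy-back step can be protected against a time-out by asking Slurm to send a signal shortly before the time limit and copying the output from a signal handler. The sketch below is one possible arrangement under stated assumptions: <code>RESULTS_DIR</code>, the <code>output</code> subdirectory, and the placeholder application are all hypothetical, and the 120-second warning is a guess that should exceed the time your copy actually needs.

```shell
#!/bin/bash
#SBATCH --time=01:00:00
#SBATCH --signal=B:USR1@120   # signal the batch shell about 120 s before the time limit

# Hypothetical locations. In a real job, set RESULTS_DIR to a directory on
# network storage; the mktemp fallbacks exist only so that the sketch also
# runs outside a Slurm job.
RESULTS_DIR="${RESULTS_DIR:-$(mktemp -d)}"
WORK_DIR="${SLURM_TMPDIR:-$(mktemp -d)}"

# Copy everything under $WORK_DIR/output back to permanent storage.
copy_back() {
    mkdir -p "$RESULTS_DIR"
    cp -r "$WORK_DIR/output" "$RESULTS_DIR"/
}

# If the warning signal arrives, save what exists so far and stop.
trap 'copy_back; exit 1' USR1

mkdir -p "$WORK_DIR/output"

# Placeholder for the real application. It runs in the background so that
# the shell can react to the signal while waiting for it.
( echo "result" > "$WORK_DIR/output/result.txt" ) &
wait

# Normal completion: copy the output back.
copy_back
```

Running the application in the background followed by <code>wait</code> matters here: bash defers a trap until the current foreground command finishes, so a foreground application would delay the handler until it is too late.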
-= Multi-node jobs =
+= Multi-node jobs = <!--T:12-->
+<!--T:13-->
 If a job spans multiple nodes and some data is needed on every node, then a simple <code>cp</code> or <code>tar -x</code> will not suffice.
-== Copy files ==
+== Copy files == <!--T:14-->
+<!--T:15-->
 Copy one or more files to the <tt>SLURM_TMPDIR</tt> directory on every allocated node like this:
 {{Command|pdcp -w $(slurm_hl2hl.py --format PDSH) file [files...] $SLURM_TMPDIR}}
+<!--T:16-->
 Or use GNU Parallel to do the same:
 {{Command|parallel -S $(slurm_hl2hl.py --format GNU-Parallel) --env SLURM_TMPDIR --workdir $PWD --onall cp file [files...] ::: $SLURM_TMPDIR}}
-== Compressed Archives ==
+== Compressed Archives == <!--T:17-->
-=== ZIP ===
+=== ZIP === <!--T:18-->
+<!--T:19-->
 Extract to <tt>$SLURM_TMPDIR</tt> on every node:
 {{Command|pdsh -w $(slurm_hl2hl.py --format PDSH) unzip archive.zip -d $SLURM_TMPDIR}}
-=== Tarball ===
+=== Tarball === <!--T:20-->
 Extract to <tt>$SLURM_TMPDIR</tt> on every node:
 {{Command|pdsh -w $(slurm_hl2hl.py --format PDSH) tar -xvf archive.tar.gz -C $SLURM_TMPDIR}}
-= Amount of space =
+= Amount of space = <!--T:21-->
+<!--T:22-->
 At '''[[Niagara]]''', $SLURM_TMPDIR is implemented as a RAM disk,
 so the amount of space available is limited by the memory on the node,
@@ Line 81: / Line 94: @@
 See [[Data_management_at_Niagara#.24SLURM_TMPDIR_.28RAM.29|Data management at Niagara]] for more.
+<!--T:23-->
 At the general-purpose clusters [[Béluga/en|Béluga]], [[Cedar]], and [[Graham]],
 the amount of space available depends on the cluster and the node to which your job is assigned.
+<!--T:24-->
 {| class="wikitable sortable"
 ! cluster !! space in $SLURM_TMPDIR !! size of disk
@@ Line 94: / Line 109: @@
 |}
+<!--T:25-->
 The table above gives the amount of space in $SLURM_TMPDIR on the ''smallest'' node in each cluster.
 If your job reserves [[Advanced_MPI_scheduling#Whole_nodes|whole nodes]]
@@ Line 100: / Line 116: @@
 (but not the same directory!), reducing the space available to your job.
+<!--T:26-->
 Some nodes at each site have more local disk than shown above.
 See "Node characteristics" at the appropriate page ([[Béluga/en|Béluga]], [[Cedar]], [[Graham]]) for guidance.
 </translate>