Using node-local storage

From Alliance Doc

Latest revision as of 19:49, 17 March 2023

When Slurm starts a job, it creates a temporary directory on each node assigned to the job. It then sets the full path name of that directory in an environment variable called SLURM_TMPDIR.

Because this directory resides on local disk, input and output (I/O) to it is almost always faster than I/O to network storage (/project, /scratch, or /home). Local disk is especially better suited to frequent small I/O transactions. Any job doing a lot of input and output (which is most jobs!) can expect to run more quickly if it uses $SLURM_TMPDIR instead of network storage.

The temporary character of $SLURM_TMPDIR makes it more trouble to use than network storage. Input must be copied from network storage to $SLURM_TMPDIR before it can be read, and output must be copied from $SLURM_TMPDIR back to network storage before the job ends to preserve it for later use.
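The overall pattern can be sketched as a minimal job script; the paths, resource requests, and program name below are hypothetical:

```shell
#!/bin/bash
#SBATCH --time=01:00:00

# 1. Copy input from network storage to fast node-local disk (hypothetical path).
cp /project/def-someone/you/input.dat $SLURM_TMPDIR/

# 2. Run the computation with its working files on local disk.
cd $SLURM_TMPDIR
./my_program input.dat output.dat

# 3. Copy results back to network storage before the job ends,
#    since $SLURM_TMPDIR is deleted when the job finishes.
cp output.dat /project/def-someone/you/
```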

Input

In order to read data from $SLURM_TMPDIR, you must first copy the data there. In the simplest case, you can do this with cp or rsync:

cp /project/def-someone/you/input.files.* $SLURM_TMPDIR/
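For a whole directory tree, rsync in archive mode may be more convenient than cp; the source path below is hypothetical:

```shell
# -a recurses and preserves permissions and timestamps; the trailing slash
# on the source copies the directory's contents rather than the directory itself.
rsync -a /project/def-someone/you/dataset/ $SLURM_TMPDIR/dataset/
```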

This may not work if the input is too large, or if it must be read by processes on different nodes. See Multinode jobs and Amount of space below for more.

Executable files and libraries

A special case of input is the application code itself. In order to run the application, the shell started by Slurm must open at least the application file, which it typically reads from network storage. But few applications these days consist of exactly one file; most also need several other files (such as libraries) in order to work.

We find, in particular, that running an application from a Python virtual environment generates a large number of small I/O transactions, more than it takes to create the virtual environment in the first place. This is why we recommend creating virtual environments inside your jobs using $SLURM_TMPDIR.
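A sketch of that pattern, assuming a Python module is available through the module system and that a requirements.txt lists your dependencies (both assumptions here):

```shell
# Create the virtual environment on node-local disk instead of network storage.
module load python
virtualenv --no-download $SLURM_TMPDIR/env
source $SLURM_TMPDIR/env/bin/activate

# --no-index installs from locally available packages rather than from PyPI.
pip install --no-index --upgrade pip
pip install --no-index -r requirements.txt
```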

Output

Output data must be copied from $SLURM_TMPDIR back to some permanent storage before the job ends. If a job times out, then the last few lines of the job script might not be executed. This can be addressed three ways:

  • request enough runtime to let the application finish, although we understand that this isn't always possible;
  • write checkpoints to network storage, not to $SLURM_TMPDIR;
  • write a signal trapping function.

Signal trapping

You can arrange for Slurm to send a signal to your job shortly before its runtime expires, and have your job copy its output from $SLURM_TMPDIR back to network storage when that happens. This may be useful if your runtime estimate is uncertain, or if you are chaining together several Slurm jobs to complete a long calculation.

To do so, you will need to write a shell function to do the copying, and use the trap shell command to associate the function with the signal. See this page from CRIANN for an example script and detailed guidance.

This method will not preserve the contents of $SLURM_TMPDIR in the case of a node failure, or certain malfunctions of the network file system.
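A minimal sketch of the idea; the signal choice, lead time, paths, and application name are all assumptions, and the CRIANN page linked above gives a fuller treatment:

```shell
#!/bin/bash
#SBATCH --time=10:00:00
#SBATCH --signal=B:USR1@300   # send SIGUSR1 to the batch shell 300s before timeout

# Rescue output from node-local disk to network storage (hypothetical path).
save_output() {
    cp -r $SLURM_TMPDIR/results /project/def-someone/you/
}
trap save_output USR1

# Run the application in the background so the shell is free to handle
# the signal; wait returns when the signal arrives or the program exits.
./my_app &
wait

# Normal completion: copy the output as usual.
save_output
```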

Multinode jobs

If a job spans multiple nodes and some data is needed on every node, then a simple cp or tar -x will not suffice, since it would place the data on only one of the nodes.

Copy files

Copy one or more files to the $SLURM_TMPDIR directory on every allocated node like this:

[name@server ~]$ srun --ntasks=$SLURM_NNODES --ntasks-per-node=1 cp file [files...] $SLURM_TMPDIR

Compressed archives

ZIP

Extract the archive into $SLURM_TMPDIR on every node:

[name@server ~]$ srun --ntasks=$SLURM_NNODES --ntasks-per-node=1 unzip archive.zip -d $SLURM_TMPDIR

Tarball

Extract the archive into $SLURM_TMPDIR on every node:

[name@server ~]$ srun --ntasks=$SLURM_NNODES --ntasks-per-node=1 tar -xvf archive.tar.gz -C $SLURM_TMPDIR

Amount of space

At Niagara, $SLURM_TMPDIR is implemented as RAMdisk, so the amount of space available is limited by the memory on the node, less the amount of RAM used by your application. See Data management at Niagara for more.

At the general-purpose clusters, the amount of space available depends on the cluster and the node to which your job is assigned.

cluster   space in $SLURM_TMPDIR   size of disk
Béluga           370G                  480G
Cedar            840G                  960G
Graham           750G                  960G
Narval           800G                  960G

The table above gives the amount of space in $SLURM_TMPDIR on the smallest node in each cluster. If your job reserves whole nodes, then you can reasonably assume that this much space is available to you in $SLURM_TMPDIR on each node. However, if the job requests less than a whole node, then other jobs may also write to the same filesystem (but a different directory!), reducing the space available to your job.
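You can check how much space is actually free on your assigned node from inside the job. A sketch, with a /tmp fallback added purely so the command also runs outside a Slurm job:

```shell
# Fall back to /tmp when not running under Slurm, for illustration only.
: "${SLURM_TMPDIR:=/tmp}"

# Report the filesystem, total size, and free space backing $SLURM_TMPDIR.
df -h "$SLURM_TMPDIR"
```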

Some nodes at each site have more local disk than shown above. See Node characteristics at the appropriate cluster's page (Béluga, Cedar, Graham, Narval) for guidance.