* Both the standard Compute Canada software stack and cluster-specific software tuned for Niagara will be available.
* In contrast with Cedar and Graham, no modules will be loaded by default to prevent accidental conflicts in versions. There will be a simple mechanism to load the software stack that a user would see on Graham and Cedar.
= Migration to Niagara =

== Migration for Existing Users of the GPC ==

* Accounts, $HOME & $PROJECT of active GPC users transferred to Niagara (except dot-files in ~).
* Data stored in $SCRATCH will not be transferred automatically.
* Users are to clean up $SCRATCH on the GPC as much as possible (remember it's temporary data!). Then they can transfer what they need using the datamover nodes (see the sketch after this list). Let us know if you need help.
* To enable this transfer, there will be a short period during which you can have access to Niagara as well as to the GPC storage resources. This period will end no later than May 9, 2018.
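As a rough illustration of such a transfer, the following could be run from a GPC datamover node (the user name, group name, and paths are placeholders; for details on the transfer methods, see the "Moving data" section below):
<source lang="bash"># Illustrative sketch only; replace the user name, group name, and paths with your own.
# Copy a directory from GPC $SCRATCH to your Niagara scratch space.
rsync -avP $SCRATCH/my_results/ \
    MYCCUSERNAME@niagara.scinet.utoronto.ca:/scratch/g/groupname/myccusername/my_results/</source>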
== For Non-GPC Users ==

<ul>
<li><p>Those of you new to SciNet, but with 2018 RAC allocations on Niagara, will have your accounts created and ready for you to log in.</p></li>
<li><p>New, non-RAC users: we are still working out the procedure to get access. If you cannot wait, you can for now follow the old route of requesting a SciNet Consortium Account on the [https://ccsb.computecanada.ca CCDB site].</p></li></ul>
= Using Niagara =

== Logging in ==

As with all SciNet and CC (Compute Canada) compute systems, access to Niagara is via ssh (secure shell) only.

To access SciNet systems, first open a terminal window (e.g. MobaXTerm on Windows).

Then ssh into the Niagara login nodes with your CC credentials:

<source lang="bash">
$ ssh -Y MYCCUSERNAME@niagara.scinet.utoronto.ca</source>

or

<source lang="bash">$ ssh -Y MYCCUSERNAME@niagara.computecanada.ca</source>

* The Niagara login nodes are where you develop, edit, compile, prepare and submit jobs.
* These login nodes are not part of the Niagara compute cluster, but have the same architecture, operating system, and software stack.
* The optional <code>-Y</code> is needed to open windows from the Niagara command-line onto your local X server.
* To run on Niagara's compute nodes, you must submit a batch job.
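If you log in frequently, it may be convenient to add a host entry to the SSH configuration on your own machine. A minimal sketch, assuming an OpenSSH client; the alias <code>niagara</code> is just an example:
<source lang="bash"># On your local machine, append an entry to ~/.ssh/config (sketch; adjust as needed).
cat >> ~/.ssh/config << 'EOF'
Host niagara
    HostName niagara.scinet.utoronto.ca
    User MYCCUSERNAME
    ForwardX11 yes
EOF
# Afterwards you can log in with simply:  ssh niagara</source>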
== Storage Systems and Locations ==

=== Home and scratch ===

You have a home and scratch directory on the system, whose locations will be given by

<code>$HOME=/home/g/groupname/myccusername</code>

<code>$SCRATCH=/scratch/g/groupname/myccusername</code>

<source lang="bash">nia-login07:~$ pwd
/home/s/scinet/rzon
nia-login07:~$ cd $SCRATCH
nia-login07:rzon$ pwd
/scratch/s/scinet/rzon</source>

=== Project location ===

Users from groups with a RAC allocation will also have a project directory.

<code>$PROJECT=/project/g/groupname/myccusername</code>

'''''IMPORTANT: Future-proof your scripts'''''

Use the environment variables (HOME, SCRATCH, PROJECT) instead of the actual paths! The paths may change in the future.
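For example, a script that refers only to these variables keeps working even if the underlying paths change. A minimal sketch (the directory, input file, and program names are hypothetical):
<source lang="bash"># Sketch: rely on the environment variables instead of hard-coded paths.
mkdir -p $SCRATCH/my_run
cd $SCRATCH/my_run
cp $HOME/inputs/config.dat .        # hypothetical input file
$HOME/bin/my_program config.dat     # hypothetical program</source>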
=== Storage Limits on Niagara ===

{| class="wikitable"
! location
! quota
!align="right"| block size
! expiration time
! backed up
! on login
! on compute
|-
| $HOME
| 100 GB
|align="right"| 1 MB
|
| yes
| yes
| read-only
|-
| $SCRATCH
| 25 TB
|align="right"| 16 MB
| 2 months
| no
| yes
| yes
|-
| $PROJECT
| by group allocation
|align="right"| 16 MB
|
| yes
| yes
| yes
|-
| $ARCHIVE
| by group allocation
|align="right"|
|
| dual-copy
| no
| no
|-
| $BBUFFER
| ?
|align="right"| 1 MB
| very short
| no
| ?
| ?
|}
<ul>
<li>Compute nodes do not have local storage.</li>
<li>Archive space is on [https://wiki.scinet.utoronto.ca/wiki/index.php/HPSS HPSS].</li>
<li>Backup means a recent snapshot, not an archive of all data that ever was.</li>
<li><p><code>$BBUFFER</code> stands for the Burst Buffer, a functionality that is still being set up. This will be a faster parallel storage tier for temporary data.</p></li></ul>
=== Moving data ===

'''''Move amounts less than 10GB through the login nodes.'''''

* Only the Niagara login nodes are visible from outside SciNet.
* Use scp or rsync to niagara.scinet.utoronto.ca or niagara.computecanada.ca (there is no difference); see the sketch after this list.
* This will time out for amounts larger than about 10GB.
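For instance, a small data set could be pushed from your own machine as follows (a sketch; the file and directory names are placeholders):
<source lang="bash"># Run these on your own machine, not on Niagara (sketch; adjust names and paths).
scp results.tar.gz MYCCUSERNAME@niagara.scinet.utoronto.ca:/scratch/g/groupname/myccusername/
# or, equivalently, with rsync:
rsync -avP results.tar.gz MYCCUSERNAME@niagara.computecanada.ca:/scratch/g/groupname/myccusername/</source>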
'''''Move amounts larger than 10GB through the datamover node.'''''

* From a Niagara login node, ssh to <code>nia-datamover1</code>.
* Transfers must originate from this datamover (see the sketch after this list).
* The other side (e.g. your machine) must be reachable from the outside.
* If you do this often, consider using Globus, a web-based tool for data transfer.
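A rough sketch of a large transfer initiated from the datamover; the remote host name and path are hypothetical, and your machine must accept incoming ssh connections:
<source lang="bash"># On a Niagara login node, hop to the datamover first.
nia-login07:~$ ssh nia-datamover1
# From the datamover, pull (or push) the data; the remote host below is hypothetical.
nia-datamover1:~$ rsync -avP myuser@myhost.example.org:/data/big_dataset/ $SCRATCH/big_dataset/</source>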
'''''Moving data to HPSS/Archive/Nearline using the scheduler.'''''

* [https://wiki.scinet.utoronto.ca/wiki/index.php/HPSS HPSS] is a tape-based storage solution, and is SciNet's nearline a.k.a. archive facility.
* Storage space on HPSS is allocated through the annual [https://www.computecanada.ca/research-portal/accessing-resources/resource-allocation-competitions Compute Canada RAC allocation]. Transfers to and from HPSS are done through jobs submitted to the scheduler, as sketched below.
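A very rough sketch of what such an archiving job could look like. The partition name and the use of <code>htar</code> are assumptions here, so consult the [https://wiki.scinet.utoronto.ca/wiki/index.php/HPSS HPSS] page for the actual setup:
<source lang="bash">#!/bin/bash
# Sketch only: archive a directory from scratch to HPSS.
#SBATCH -p archivelong            # assumed partition name; check the HPSS documentation
#SBATCH --time=1:00:00
#SBATCH --job-name hpss_archive

cd $SCRATCH
htar -cf $ARCHIVE/my_results.tar my_results   # htar bundles the files directly onto HPSS</source>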
== Software and Libraries ==

=== Modules ===

Once you are on one of the login nodes, what software is already installed?

* Other than essentials, all installed software is made available using module commands.
* These set environment variables (<code>PATH</code>, etc.)
* Allows multiple, conflicting versions of a given package to be available.
* <code>module spider</code> shows the available software.

<source lang="bash">nia-login07:~$ module spider
---------------------------------------------------
The following is a list of the modules currently av
---------------------------------------------------
  CCEnv: CCEnv
  NiaEnv: NiaEnv/2018a
  anaconda2: anaconda2/5.1.0
  anaconda3: anaconda3/5.1.0
  autotools: autotools/2017
    autoconf, automake, and libtool
  boost: boost/1.66.0
  cfitsio: cfitsio/3.430
  cmake: cmake/3.10.2 cmake/3.10.3
  ...</source>
<ul>
<li><p><code>module load <module-name></code></p>
<p>use particular software</p></li>
<li><p><code>module purge</code></p>
<p>remove currently loaded modules</p></li>
<li><p><code>module spider</code></p>
<p>(or <code>module spider <module-name></code>)</p>
<p>list available software packages</p></li>
<li><p><code>module avail</code></p>
<p>list loadable software packages</p></li>
<li><p><code>module list</code></p>
<p>list loaded modules</p></li></ul>
On Niagara, there are really two software stacks:

<ol style="list-style-type: decimal;">
<li><p>A Niagara software stack tuned and compiled for this machine. This stack is available by default, but if necessary it can be reloaded with</p>
<source lang="bash">module load NiaEnv</source></li>
<li><p>The same software stack available on Compute Canada's General Purpose clusters [https://docs.computecanada.ca/wiki/Graham Graham] and [https://docs.computecanada.ca/wiki/Cedar Cedar], compiled (for now) for a previous generation of CPUs:</p>
<source lang="bash">module load CCEnv</source>
<p>If you want the same default modules loaded as on Cedar and Graham, then afterwards also run <code>module load StdEnv</code>.</p></li></ol>

Note: the <code>*Env</code> modules are '''''sticky'''''; they can only be removed with <code>--force</code>.
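For example, to clear all loaded modules including the sticky <code>NiaEnv</code> module (rarely needed):
<source lang="bash">nia-login07:~$ module --force purge</source>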
=== Tips for loading software ===

<ul>
<li><p>We advise '''''against''''' loading modules in your .bashrc.</p>
<p>This could lead to very confusing behaviour under certain circumstances.</p></li>
<li><p>Instead, load modules by hand when needed, or by sourcing a separate script.</p></li>
<li><p>Load run-specific modules inside your job submission script.</p></li>
<li><p>Short names give default versions; e.g. <code>intel</code> → <code>intel/2018.2</code>.</p>
<p>It is usually better to be explicit about the versions, for future reproducibility.</p></li>
<li><p>Handy abbreviations:</p></li></ul>

<pre class="sh"> ml → module list
 ml NAME → module load NAME # if NAME is an existing module
 ml X → module X</pre>

* Modules sometimes require other modules to be loaded first.<br />
Solve these dependencies by using <code>module spider</code>.
=== Module spider ===

Oddly named, the module subcommand <code>spider</code> is the search-and-advice facility for modules.

<source lang="bash">nia-login07:~$ module load openmpi
Lmod has detected the error: These module(s) exist but cannot be loaded as requested: "openmpi"
   Try: "module spider openmpi" to see how to load the module(s).</source>

<source lang="bash">nia-login07:~$ module spider openmpi
------------------------------------------------------------------------------------------------------
  openmpi:
------------------------------------------------------------------------------------------------------
     Versions:
        openmpi/2.1.3
        openmpi/3.0.1
        openmpi/3.1.0rc3
------------------------------------------------------------------------------------------------------
  For detailed information about a specific "openmpi" module (including how to load the modules) use
  the module's full name.
  For example:
     $ module spider openmpi/3.1.0rc3
------------------------------------------------------------------------------------------------------</source>
<source lang="bash">nia-login07:~$ module spider openmpi/3.1.0rc3 | |||
------------------------------------------------------------------------------------------------------ | |||
openmpi: openmpi/3.1.0rc3 | |||
------------------------------------------------------------------------------------------------------ | |||
You will need to load all module(s) on any one of the lines below before the "openmpi/3.1.0rc3" | |||
module is available to load. | |||
NiaEnv/2018a gcc/7.3.0 | |||
NiaEnv/2018a intel/2018.2 | |||
</source> | |||
<source lang="bash">nia-login07:~$ module load NiaEnv/2018a intel/2018.2 # note: NiaEnv is usually already loaded | |||
nia-login07:~$ module load openmpi/3.1.0rc3</source> | |||
<source lang="bash">nia-login07:~$ module list | |||
Currently Loaded Modules: | |||
1) NiaEnv/2018a (S) 2) intel/2018.2 3) openmpi/3.1.0.rc3 | |||
Where: | |||
S: Module is Sticky, requires --force to unload or purge</source> | |||
== Can I Run Commercial Software? ==

* Possibly, but you have to bring your own license for it.
* SciNet and Compute Canada have an extremely large and broad user base of thousands of users, so we cannot provide licenses for everyone's favorite software.
* Thus, the only commercial software installed on Niagara is software that can benefit everyone: compilers, math libraries and debuggers.
* That means no Matlab, Gaussian, or IDL.
* Open source alternatives like Octave, Python, and R are available.
* We are happy to help you install commercial software for which you have a license.
* In some cases, if you have a license, you can use software in the Compute Canada stack.
== Compiling on Niagara: Example ==

<source lang="bash">nia-login07:~$ module list
Currently Loaded Modules:
  1) NiaEnv/2018a (S)
  Where:
   S:  Module is Sticky, requires --force to unload or purge

nia-login07:~$ module load intel/2018.2 gsl/2.4

nia-login07:~$ ls
main.c module.c

nia-login07:~$ icc -c -O3 -xHost -o main.o main.c
nia-login07:~$ icc -c -O3 -xHost -o module.o module.c
nia-login07:~$ icc -o main module.o main.o -lgsl -mkl

nia-login07:~$ ./main</source>
== Testing ==

You really should test your code before submitting it to the cluster, both to check that it is correct and to find out what resources it needs.

<ul>
<li><p>Small test jobs can be run on the login nodes.</p>
<p>Rule of thumb: a couple of minutes, taking at most about 1-2 GB of memory and a couple of cores.</p></li>
<li><p>You can run the ddt debugger on the login nodes after <code>module load ddt</code>.</p></li>
<li><p>For short tests that do not fit on a login node, or for which you need a dedicated node, request an<br />
interactive debug job with the salloc command</p>
<source lang="bash">nia-login07:~$ salloc -pdebug --nodes N --time=1:00:00</source>
<p>where N is the number of nodes. The duration of your interactive debug session can be at most one hour, can use at most 4 nodes, and each user can only have one such session at a time. A sample session is sketched below.</p></li></ul>
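For instance, a one-node interactive test might look like this (a sketch; <code>openmp_example</code> is a hypothetical program):
<source lang="bash">nia-login07:~$ salloc -pdebug --nodes 1 --time=1:00:00
# within the allocation, run the test program on the allocated node
$ export OMP_NUM_THREADS=40
$ srun ./openmp_example</source>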
== Submitting jobs ==

<ul>
<li><p>Niagara uses SLURM as its job scheduler.</p></li>
<li><p>You submit jobs from a login node by passing a script to the sbatch command:</p>
<source lang="bash">nia-login07:~$ sbatch jobscript.sh</source></li>
<li><p>This puts the job in the queue. It will run on the compute nodes in due course.</p></li>
<li><p>Jobs will run under their group's RRG allocation, or, if the group has none, under a RAS allocation (previously called 'default' allocation).</p></li></ul>

Keep in mind:

<ul>
<li><p>Scheduling is by node, so in multiples of 40 cores.</p></li>
<li><p>Maximum walltime is 24 hours.</p></li>
<li><p>Jobs must write to your scratch or project directory (home is read-only on compute nodes).</p></li>
<li><p>Compute nodes have no internet access.</p>
<p>Download data you need beforehand on a login node.</p></li></ul>
=== Example submission script (OpenMP) ===

<source lang="bash">#!/bin/bash
#SBATCH --nodes=1
#SBATCH --cpus-per-task=40
#SBATCH --time=1:00:00
#SBATCH --job-name openmp_job
#SBATCH --output=openmp_output_%j.txt

cd $SLURM_SUBMIT_DIR

module load intel/2018.2

export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK

./openmp_example
# or "srun ./openmp_example".
</source>

<source lang="bash">nia-login07:~$ sbatch openmp_job.sh</source>
* The first line indicates that this is a bash script.
* Lines starting with <code>#SBATCH</code> go to SLURM.
* sbatch reads these lines as a job request (which it gives the name <code>openmp_job</code>).
* In this case, SLURM looks for one node with 40 cores on which to run one task, for 1 hour.
* Once it finds such a node, it runs the script:
** Changes to the submission directory;
** Loads modules;
** Sets an environment variable;
** Runs the <code>openmp_example</code> application.
=== Example submission script (MPI) ===

<source lang="bash">#!/bin/bash
#SBATCH --nodes=8
#SBATCH --ntasks=320
#SBATCH --time=1:00:00
#SBATCH --job-name mpi_job
#SBATCH --output=mpi_output_%j.txt

cd $SLURM_SUBMIT_DIR

module load intel/2018.2
module load openmpi/3.1.0rc3

mpirun ./mpi_example
# or "srun ./mpi_example"
</source>

<source lang="bash">nia-login07:~$ sbatch mpi_job.sh</source>
<ul>
<li><p>The first line indicates that this is a bash script.</p></li>
<li><p>Lines starting with <code>#SBATCH</code> go to SLURM.</p></li>
<li><p>sbatch reads these lines as a job request (which it gives the name <code>mpi_job</code>).</p></li>
<li><p>In this case, SLURM looks for 8 nodes with 40 cores each on which to run 320 tasks, for 1 hour.</p></li>
<li><p>Once it finds such nodes, it runs the script:</p>
<ul>
<li>Changes to the submission directory;</li>
<li>Loads modules;</li>
<li>Runs the <code>mpi_example</code> application.</li></ul>
</li></ul>
== Monitoring queued jobs ==

Once the job is in the queue, there are some commands you can use to monitor its progress.

<ul>
<li><p><code>squeue</code> to show the job queue (<code>squeue -u $USER</code> for just your jobs);</p></li>
<li><p><code>squeue -j JOBID</code> to get information on a specific job</p>
<p>(alternatively, <code>scontrol show job JOBID</code>, which is more verbose).</p></li>
<li><p><code>squeue -j JOBID -o "%.9i %.9P %.8j %.8u %.2t %.10M %.6D %S"</code> to get an estimate for when a job will run (see the sketch after this list).</p></li>
<li><p><code>scancel -i JOBID</code> to cancel the job.</p></li>
<li><p><code>sinfo -pcompute</code> to look at available nodes.</p></li>
<li><p>More utilities like those that were available on the GPC are under development.</p></li></ul>
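If you check on your jobs often, the common options can be wrapped in a small helper script. A minimal sketch (the script name is just an example):
<source lang="bash">#!/bin/bash
# myq.sh (hypothetical helper): show my jobs together with their expected start times.
squeue -u $USER -o "%.9i %.9P %.8j %.8u %.2t %.10M %.6D %S"</source>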
== Data Management and I/O Tips ==

* $HOME, $SCRATCH, and $PROJECT all use the parallel file system called GPFS.
* Your files can be seen on all Niagara login and compute nodes.
* GPFS is a high-performance file system which provides rapid reads and writes to large data sets in parallel from many nodes.
* But accessing data sets which consist of many small files leads to poor performance.
* Avoid reading and writing lots of small amounts of data to disk.
* Many small files on the system would waste space and would be slower to access, read and write; consider bundling them, as sketched below.
* Write data out in binary. It is faster and takes less space.
* The burst buffer (to come) is better for I/O-heavy jobs and to speed up checkpoints.
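For example, a directory containing many small output files can be bundled into a single archive before further processing or transfer (a sketch; the directory name is a placeholder):
<source lang="bash"># Bundle many small files into one archive to reduce the file count on GPFS.
cd $SCRATCH
tar czf small_files.tar.gz many_small_files/
# later, unpack with:  tar xzf small_files.tar.gz</source>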