This page is intended for the user who is already familiar with the concepts of job scheduling and job scripts, and who wants guidance on submitting jobs to Compute Canada clusters. If you have not worked on a large shared computer cluster before, you should probably read What is a scheduler? first.
On Compute Canada clusters, the job scheduler is the Slurm Workload Manager. Comprehensive documentation for Slurm is maintained by SchedMD. If you are coming to Slurm from PBS/Torque, SGE, LSF, or LoadLeveler, you might find this table of corresponding commands useful.
Use sbatch to submit jobs
The command to submit a job is sbatch:
[someuser@host ~]$ sbatch simple_job.sh
Submitted batch job 123456
A minimal Slurm job script looks like this:
#!/bin/bash
#SBATCH --time=00:01:00
#SBATCH --account=def-someuser
echo 'Hello, world!'
sleep 30
Directives (or "options") in the job script are prefixed with #SBATCH
and must precede all executable commands. All available directives are described on the sbatch page. Compute Canada policies require that you supply at least a time limit (--time
) and an account name (--account
) for each job. (See #Accounts and projects below.)
You can also specify directives as command-line arguments to sbatch. For example,
[someuser@host ~]$ sbatch --time=00:30:00 simple_job.sh
will submit the above job script with a time limit of 30 minutes.
Use squeue
to list jobs[edit]
The squeue command lists pending and running jobs. Supply your username as an argument with -u to list only your own jobs:
[someuser@host ~]$ squeue -u $USER
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
123456 cpubase_b simple_j someuser R 0:03 1 cdr234
123457 cpubase_b simple_j someuser PD 1 (Priority)
The ST column of the output shows the status of each job. The two most common states are "PD" for "pending" and "R" for "running". See the squeue page for more on selecting, formatting, and interpreting the squeue output.
Where does the output go?
By default the output is placed in a file named "slurm-", suffixed with the job ID number and ".out" (e.g. slurm-123456.out), in the directory from which the job was submitted. You can use --output to specify a different name or location. Certain replacement symbols can be used in the filename; for example, %j will be replaced by the job ID number. See sbatch for a complete list.
The following sample script sets a job name (which appears in squeue output) and sends the output to a file with a name constructed from the job name (%x) and the job ID number (%j).
#!/bin/bash
#SBATCH --account=def-someuser
#SBATCH --time=00:01:00
#SBATCH --job-name=test
#SBATCH --output=%x-%j.out
echo 'Hello, world!'
Error output will normally appear in the same file as standard output, just as it would if you were typing commands interactively. If you want to send the standard error channel (stderr) to a separate file, use --error.
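For example, the following pair of directives (a sketch; the filename patterns are only illustrative) would send standard output and standard error to separate files named after the job name and job ID:
#SBATCH --output=%x-%j.out
#SBATCH --error=%x-%j.err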
Accounts and projects
Every job must have an associated account name corresponding to a Compute Canada Resource Allocation Project, specified using the --account directive:
#SBATCH --account=def-user-ab
If you try to submit a job with sbatch without supplying an account name, you will be shown a list of valid account names to choose from. If you have access to several Resource Allocation Projects and want to know which account name corresponds to a given Resource Allocation Project, log in to CCDB and visit the page for that project. The second field in the display, the group name, is the string you should use with the --account directive. Note that a Resource Allocation Project may only apply to a specific cluster (or set of clusters) and therefore may not be transferable from one cluster to another.
In the illustration below, jobs which are to be accounted against RAP wnp-003-ac should be submitted with --account=def-rdickson-ac.
If you plan to use one account consistently for all jobs, once you have determined the right account name you may find it convenient to set the SLURM_ACCOUNT and SBATCH_ACCOUNT environment variables in your ~/.bashrc file, like so:
export SLURM_ACCOUNT=def-someuser
export SBATCH_ACCOUNT=$SLURM_ACCOUNT
export SALLOC_ACCOUNT=$SLURM_ACCOUNT
Slurm will use the value of SBATCH_ACCOUNT in place of the --account directive in the job script. Note that even if you supply an account name inside the job script, the environment variable takes priority. In order to override the environment variable you must supply an account name as a command-line argument to sbatch.
SLURM_ACCOUNT plays the same role as SBATCH_ACCOUNT, but for the srun command instead of sbatch. The same idea holds for SALLOC_ACCOUNT.
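For instance, with SBATCH_ACCOUNT set in your environment, an explicit command-line argument would still take precedence (the account name here is only a placeholder):
[someuser@host ~]$ sbatch --account=def-otheruser simple_job.sh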
Examples of job scripts
MPI job
This example script launches four MPI processes, each with 1024 MB of memory. The run time is limited to 5 minutes.
#!/bin/bash
#SBATCH --account=def-someuser
#SBATCH --ntasks=4 # number of MPI processes
#SBATCH --mem-per-cpu=1024M # memory; default unit is megabytes
#SBATCH --time=0-00:05 # time (DD-HH:MM)
srun ./mpi_program # mpirun or mpiexec also work
One can have detailed control over the location of MPI processes by, for example, requesting a specific number of processes per node. Hybrid MPI/threaded jobs are also possible. For more on these and other options relating to distributed parallel jobs, see Advanced MPI scheduling.
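For instance, a minimal sketch of a script that places a fixed number of MPI processes on each node might look like this (the directive values are only illustrative):
#!/bin/bash
#SBATCH --account=def-someuser
#SBATCH --nodes=2                # number of nodes
#SBATCH --ntasks-per-node=4      # MPI processes per node, 8 in total
#SBATCH --mem-per-cpu=1024M      # memory; default unit is megabytes
#SBATCH --time=0-00:05           # time (DD-HH:MM)
srun ./mpi_program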
Threaded or OpenMP job
This example script launches a single process with six CPU cores. Bear in mind that for an application to use OpenMP it must be compiled with the appropriate flag, e.g. gcc -fopenmp ... or icc -openmp ... (see the compile sketch after the job script below).
#!/bin/bash
#SBATCH --account=def-someuser
#SBATCH --time=0-0:5
#SBATCH --cpus-per-task=6
export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK
./ompHello
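As a sketch only, the compile step mentioned above might look like the following (the source file name ompHello.c is an assumption):
gcc -fopenmp -o ompHello ompHello.c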
For more on writing and running parallel programs with OpenMP, see OpenMP.
GPU job
This example is a serial job with one GPU allocated, a memory limit of 4000 MB per node, and a run-time limit of 5 hours. The output filename will include the name of the first node used and the job ID number.
#!/bin/bash
#SBATCH --account=def-someuser
#SBATCH --gres=gpu:1 # request GPU "generic resource"
#SBATCH --mem=4000M # memory per node
#SBATCH --time=0-05:00 # time (DD-HH:MM)
#SBATCH --output=%N-%j.out # %N for node name, %j for jobID
nvidia-smi
Because no node count is specified in the above example, one node will be allocated. If you were to add --nodes=3, the total memory allocated would be 12000M. The same goes for --gres: if you request three nodes, you will get one GPU per node, for a total of three.
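To make that concrete, a sketch of the directives for such a multi-node request (values only illustrative) would be:
#SBATCH --nodes=3       # three nodes ...
#SBATCH --gres=gpu:1    # ... each with one GPU, three GPUs in total
#SBATCH --mem=4000M     # 4000M on each node, 12000M in total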
This example is a parallel job with 4 GPUs allocated on the same node.
#!/bin/bash
#SBATCH --account=def-someuser
#SBATCH --gres=gpu:4 # request GPU "generic resource"
#SBATCH --mem=4000M # memory per node
#SBATCH --time=0-05:00 # time (DD-HH:MM)
#SBATCH --output=%N-%j.out # %N for node name, %j for jobID
nvidia-smi
This example is a whole-node GPU job on one of the large GPU nodes on Cedar, with all 4 GPUs allocated. You must request all 4 GPUs to use these resources.
#!/bin/bash
#SBATCH --ntasks=1 # Number of tasks
#SBATCH --cpus-per-task=24 # Number of CPU cores per task
#SBATCH --nodes=1 # Number of nodes, ensure that all cores are on one machine
#SBATCH --gres=gpu:lgpu:4 # ask for 4 GPUs per node of the large-GPU node variety
#SBATCH --time=0-00:10 # Runtime in D-HH:MM
#SBATCH -o large_gpu-%j.out # File to which STDOUT will be written
#SBATCH --mail-type=ALL # Type of email notification- BEGIN,END,FAIL,ALL
# The large GPUs nodes have 4 Tesla P100 16GB (as opposed to 12GB cards for the rest of the cluster)
# These GPUs are sitting on the same PCI switch so the inter-GPU communication is faster
# These nodes have 256 GB RAM as opposed to 128GB for the rest of the cluster
# You have to specify lgpu and use all 4 on a node to be able to submit jobs to these resources
hostname
sleep 500
For more on running GPU jobs, see Using GPUs with SLURM.
Array job
Also known as a task array, an array job is a way to submit a whole set of jobs with one command. The individual jobs in the array are distinguished by an environment variable, $SLURM_ARRAY_TASK_ID, which is set to a different value for each instance of the job.
sbatch --array=0-7 ...       # $SLURM_ARRAY_TASK_ID will take values from 0 to 7 inclusive
sbatch --array=1,3,5,7 ...   # $SLURM_ARRAY_TASK_ID will take the listed values
sbatch --array=1-7:2 ...     # Another way to do the same thing
sbatch --array=1-100%10 ...  # Allow no more than 10 of the jobs to run simultaneously
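A job script typically uses $SLURM_ARRAY_TASK_ID to select its own piece of the work. The following is only a sketch; the program name and input file naming scheme are assumptions:
#!/bin/bash
#SBATCH --account=def-someuser
#SBATCH --time=0-00:05
#SBATCH --array=0-7
./my_program input_${SLURM_ARRAY_TASK_ID}.dat   # each array task reads its own (assumed) input file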
Interactive jobs
Though batch submission is the most common and most efficient way to take advantage of our clusters, interactive jobs are also supported. These can be useful for things like:
- Data exploration at the command line
- Interactive "console tools" like R and iPython
- Significant software development, debugging, or compiling
You can start an interactive session on a compute node with salloc. In the following example we request two tasks, which corresponds to two CPU cores, for an hour:
[name@login ~]$ salloc --time=1:0:0 --ntasks=2 --account=def-someuser
salloc: Granted job allocation 1234567
[name@node01 ~]$ ...     # do some work
[name@node01 ~]$ exit    # terminate the allocation
salloc: Relinquishing job allocation 1234567
For more details see Interactive jobs.
Monitoring jobs
By default squeue will show all the jobs the scheduler is managing at the moment. It may run much faster if you ask only about your own jobs with
squeue -u <username>
You can show only running jobs, or only pending jobs:
squeue -u <username> -t RUNNING
squeue -u <username> -t PENDING
You can show detailed information for a specific job with scontrol:
scontrol show job -dd <jobid>
Find information about a completed job with sacct, and optionally control what it prints using --format:
sacct -j <jobid>
sacct -j <jobid> --format=JobID,JobName,MaxRSS,Elapsed
If a node fails while running a job, the job may be restarted. sacct will normally show you only the record for the last (presumably successful) run. If you wish to see all records related to a given job, add the --duplicates option.
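For example, to list every record for the job:
sacct -j <jobid> --duplicates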
Use the MaxRSS accounting field to determine how much memory a job needed. The value returned will be the largest resident set size for any of the tasks. If you want to know which task and node this occurred on, print the MaxRSSTask and MaxRSSNode fields also.
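For example, a sketch combining the fields named above:
sacct -j <jobid> --format=JobID,MaxRSS,MaxRSSTask,MaxRSSNode,Elapsed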
The sstat command works on a running job much the same way that sacct works on a completed job.
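A sketch of such a query (note that sstat reports on job steps, so you may need to name one explicitly, e.g. <jobid>.batch):
sstat -j <jobid> --format=JobID,MaxRSS,AveCPU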
You can ask to be notified by email of certain job conditions by supplying options to sbatch:
#SBATCH --mail-user=<email_address>
#SBATCH --mail-type=BEGIN
#SBATCH --mail-type=END
#SBATCH --mail-type=FAIL
#SBATCH --mail-type=REQUEUE
#SBATCH --mail-type=ALL
Cancelling jobs
Use scancel with the job ID to cancel a job:
scancel <jobid>
You can also use it to cancel all your jobs, or all your pending jobs:
scancel -u <username>
scancel -t PENDING -u <username>
Resubmitting jobs for long-running computations
When a computation requires more time than the time limits on the system allow, the software must support checkpointing: it should be able to save its complete state to a file, called a checkpoint, and then restart and continue the computation from that saved state.
If only a few such restarts are required, this can easily be done manually, but sometimes multiple running simulations each require numerous restarts. In that case, some form of automation can simplify the resubmission of multi-step jobs.
Currently, on Compute Canada systems there are two recommended methods of resubmission:
- Using SLURM job arrays;
- Resubmitting from the end of the job script.
Resubmission using job arrays
With this method, one submits several jobs with the same parameters as an array of jobs, with the condition that only one job of the array runs at any given time. The same job script is executed a predefined number of times, so the script must include all the commands needed to ensure that the latest checkpoint is used by the next job in the array.
For example, suppose a molecular dynamics simulation has to run for 1 000 000 steps but does not fit within the time limit on the cluster. We can split the simulation into 10 smaller jobs of 100 000 steps each, run one after another.
An example of a job script with resubmission:
#!/bin/bash
# ---------------------------------------------------------------------
# SLURM script for job resubmission on a Compute Canada cluster.
# ---------------------------------------------------------------------
#SBATCH --job-name=job_array
#SBATCH --account=def-rozmanov
#SBATCH --cpus-per-task=1
#SBATCH --time=0-10:00
#SBATCH --mem=100M
# Run a 10 job array, one job at a time.
#SBATCH --array=1-10%1
# ---------------------------------------------------------------------
echo "Current working directory: `pwd`"
echo "Starting run at: `date`"
# ---------------------------------------------------------------------
echo ""
echo "Job Array ID / Job ID: $SLURM_ARRAY_JOB_ID / $SLURM_JOB_ID"
echo "This is job $SLURM_ARRAY_TASK_ID out of $SLURM_ARRAY_TASK_COUNT jobs."
echo ""
# ---------------------------------------------------------------------
# Run your simulation step here...
if test -e state.cpt; then
    # There is a checkpoint file, restart;
    mdrun --restart state.cpt
else
    # There is no checkpoint file, start a new simulation.
    mdrun
fi
# ---------------------------------------------------------------------
echo "Job finished with exit code $? at: `date`"
# ---------------------------------------------------------------------
Resubmission from the job script
In this case one submits a job that runs the first chunk of the calculation and saves a checkpoint. When the chunk is done, before the allocated time runs out, the end of the job script checks whether the end of the simulation has been reached; if it has not, a new job is submitted to work on the next chunk.
An example of a job script with resubmission:
#!/bin/bash
# ---------------------------------------------------------------------
# SLURM script for job resubmission on a Compute Canada cluster.
# ---------------------------------------------------------------------
#SBATCH --job-name=job_chain
#SBATCH --account=def-rozmanov
#SBATCH --cpus-per-task=1
#SBATCH --time=0-10:00
#SBATCH --mem=100M
# ---------------------------------------------------------------------
echo "Current working directory: `pwd`"
echo "Starting run at: `date`"
# ---------------------------------------------------------------------
# Run your simulation step here...
if test -e state.cpt; then
    # There is a checkpoint file, restart;
    mdrun --restart state.cpt
else
    # There is no checkpoint file, start a new simulation.
    mdrun
fi
# Resubmit if not all work has been done yet.
if end_is_not_reached; then
    sbatch ${BASH_SOURCE[0]}
fi
# ---------------------------------------------------------------------
echo "Job finished with exit code $? at: `date`"
# ---------------------------------------------------------------------
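The test end_is_not_reached above is a placeholder for whatever check your own workflow provides. As a sketch only, assuming the application writes the number of completed steps to a file named steps_done and the target is 1 000 000 steps (both assumptions), it could be replaced with something like:
# Placeholder check: resubmit until the assumed step counter reaches the target.
target_steps=1000000                             # assumed total number of steps
steps_done=$(tail -n 1 steps_done 2>/dev/null)   # assumed progress file written by the application
if [ "${steps_done:-0}" -lt "$target_steps" ]; then
    sbatch ${BASH_SOURCE[0]}
fi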
Troubleshooting
Preparing a job script with a word processor instead of a text editor is a common cause of trouble. Best practice is to prepare your job script on the cluster using an editor such as nano, vim, or emacs. If you prefer to prepare or alter the script off-line, then:
- Windows users:
  - Use a text editor such as Notepad or Notepad++.
  - After uploading the script, use dos2unix to change Windows end-of-line characters to Linux end-of-line characters.
- Mac users:
  - Open a terminal window and use an editor such as nano, vim, or emacs.
Further reading
- Details on Job scheduling policies at Cedar and Graham.
- Comprehensive documentation is maintained by SchedMD, as well as some tutorials.
- sbatch command options
- There is also a "Rosetta stone" mapping commands and directives from PBS/Torque, SGE, LSF, and LoadLeveler, to SLURM. NERSC also offers some tables comparing Torque and SLURM.
- Here is a text tutorial from CÉCI, Belgium
- Here is a rather minimal text tutorial from Bright Computing