Running jobs

This page is intended for the user who is already familiar with the concepts of job scheduling and job scripts, and who wants guidance on submitting jobs to Compute Canada clusters.
If you have not worked on a large shared computer cluster before, you should probably read [[What is a scheduler?]] first.


== Overview ==
On Compute Canada clusters, the job scheduler is the
[https://en.wikipedia.org/wiki/Slurm_Workload_Manager Slurm Workload Manager].
Comprehensive [https://slurm.schedmd.com/documentation.html documentation for Slurm] is maintained by SchedMD. If you are coming to Slurm from PBS/Torque, SGE, LSF, or LoadLeveler, you might find this table of [https://slurm.schedmd.com/rosetta.pdf corresponding commands] useful.
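For example, a few of the most common correspondences for PBS/Torque users are shown below (an illustration only; the linked table is the authoritative reference):

<source lang="bash">
# PBS/Torque command     Slurm equivalent
# qsub job.sh        ->  sbatch job.sh
# qstat -u $USER     ->  squeue -u $USER
# qdel <jobid>       ->  scancel <jobid>
</source>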


==Use <code>sbatch</code> to submit jobs==
The command to submit a job is [https://slurm.schedmd.com/sbatch.html <code>sbatch</code>]:

<source lang="bash">
[someuser@host ~]$ sbatch simple_job.sh
Submitted batch job 123456
</source>


A minimal Slurm job script looks like this:

{{File
   |name=simple_job.sh
   |contents=
#!/bin/bash
#SBATCH --time=00:01:00
#SBATCH --account=def-someuser
echo 'Hello, world!'
sleep 30
}}


Directives (or "options") in the job script are prefixed with <code>#SBATCH</code> and must precede all executable commands. All available directives are described on the <code>sbatch</code> [https://slurm.schedmd.com/sbatch.html manual page]. Compute Canada policies require that you supply at least a time limit (<code>--time</code>) and an account name (<code>--account</code>) for each job. (See [[#Accounts and projects]] below.)


You can also specify directives as command-line arguments to <code>sbatch</code>. So for example,
 [someuser@host ~]$ sbatch --time=00:30:00 simple_job.sh
will submit the above job script with a time limit of 30 minutes.


==Use <code>squeue</code> to list jobs==
The [https://slurm.schedmd.com/squeue.html <code>squeue</code>] command lists pending and running jobs. Supply your username as an argument with <code>-u</code> to list only your own jobs:


<source lang="bash">
[someuser@host ~]$ squeue -u $USER
      JOBID PARTITION      NAME     USER ST   TIME  NODES NODELIST(REASON)
     123456 cpubase_b  simple_j someuser  R   0:03      1 cdr234
     123457 cpubase_b  simple_j someuser PD             1 (Priority)
</source>


The ST column of the output shows the status of each job. The two most common states are "PD" for "pending" and "R" for "running". When a job has finished, it no longer appears in the <code>squeue</code> output. See the [https://slurm.schedmd.com/squeue.html squeue man page] for more on selecting, formatting, and interpreting the output.


==Where does the output go?==
By default the output is placed in a file named "slurm-", suffixed with the job ID number and ".out", ''e.g.'' <code>slurm-123456.out</code>, in the directory from which the job was submitted. You can use <code>--output</code> to specify a different name or location. Certain replacement symbols can be used in the filename, ''e.g.'' <code>%j</code> will be replaced by the job ID number. See [https://slurm.schedmd.com/sbatch.html sbatch] for a complete list.


The following sample script sets a ''job name'' (which appears in <code>squeue</code> output) and sends the output to a file with a name constructed from the job name (<code>%x</code>) and the job ID number (<code>%j</code>).


{{File
   |name=name_output.sh
   |contents=
#!/bin/bash
#SBATCH --account=def-someuser
#SBATCH --time=00:01:00
#SBATCH --job-name=test
#SBATCH --output=%x-%j.out
echo 'Hello, world!'
}}


Error output will normally appear in the same file as standard output, just as it would if you were typing commands interactively. If you want to send the standard error channel (stderr) to a separate file, use <code>--error</code>.
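For example, to keep error messages in their own file you could add directives like the following to the script above (the filename patterns are only an illustration):

<source lang="bash">
#SBATCH --output=%x-%j.out    # standard output (stdout)
#SBATCH --error=%x-%j.err     # standard error (stderr) goes to a separate file
</source>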


==Accounts and projects==
Every job must have an associated ''account name'' corresponding to a Compute Canada [https://ccdb.computecanada.ca/me/faq#what_is_rap Resource Allocation Project], specified using the <code>--account</code> directive:
 #SBATCH --account=def-user-ab

If you try to submit a job with <code>sbatch</code> without supplying an account name, you will be shown a list of valid account names to choose from. If you have access to several Resource Allocation Projects and want to know which account name corresponds to a given Resource Allocation Project, log in to [https://ccdb.computecanada.ca CCDB] and visit the page for that project. The second field in the display, the '''group name''', is the string you should use with the <code>--account</code> directive. Note that a Resource Allocation Project may apply only to a specific cluster (or set of clusters) and therefore may not be transferable from one cluster to another.

In the illustration below, jobs which are to be accounted against RAP wnp-003-ac should be submitted with <code>--account=def-rdickson-ac</code>.

[[File:Find-group-name-annotated.png|750px|left| Finding the group name for a RAP]]
<br clear=all> <!-- This is to prevent the next section from filling to the right of the image. -->
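If you want to check from the command line which account names are associated with your username, one way is Slurm's <code>sacctmgr</code> command (a sketch only; the fields shown and the access permitted vary from cluster to cluster):

<source lang="bash">
sacctmgr show associations user=$USER format=Account,Cluster
</source>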


== Examples of job scripts ==


=== MPI job ===

This example script launches four MPI processes, each with 1024 MB of memory. The run time is limited to 5 minutes.

{{File
   |name=mpi_job.sh
   |contents=
#!/bin/bash
#SBATCH --ntasks=4               # number of MPI processes
#SBATCH --mem-per-cpu=1024M      # memory; default unit is megabytes
#SBATCH --time=0-00:05           # time (DD-HH:MM)
srun ./mpi_program               # mpirun or mpiexec also work
}}

One can have detailed control over the location of MPI processes by, for example, requesting a specific number of processes per node. Hybrid MPI/threaded jobs are also possible. For more on these and other options relating to distributed parallel jobs, see [[Advanced MPI scheduling]].
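As an illustration, placing the same four processes two per node on two nodes might look like this (a sketch only; see [[Advanced MPI scheduling]] for the recommended forms):

<source lang="bash">
#SBATCH --nodes=2                # spread the job over two nodes
#SBATCH --ntasks-per-node=2      # two MPI processes on each node
#SBATCH --mem-per-cpu=1024M
#SBATCH --time=0-00:05
srun ./mpi_program
</source>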

=== Threaded or OpenMP job ===

This example script launches a single process with six CPU cores. Bear in mind that for an application to use OpenMP it must be compiled with the appropriate flag, e.g. <code>gcc -fopenmp ...</code> or <code>icc -openmp ...</code>
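As a concrete illustration, compiling a single-file OpenMP program into the <code>ompHello</code> binary used below might look like this (the source file name is hypothetical):

<source lang="bash">
gcc -fopenmp -O2 -o ompHello ompHello.c    # GCC; the Intel compiler uses icc -openmp
</source>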


{{File
   |name=openmp_job.sh
   |contents=
#!/bin/bash
#SBATCH --time=0-0:5
#SBATCH --cpus-per-task=6
export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK
./ompHello
}}


For more on writing and running parallel programs with OpenMP, see [[OpenMP]].

=== GPU job ===

This example is a serial job with one GPU allocated, a memory limit of 4000 MB per node, and a run-time limit of 5 hours. The output filename will include the name of the first node used and the job ID number.

{{File
   |name=simple_gpu_job.sh
   |contents=
#!/bin/bash
#SBATCH --gres=gpu:1              # request GPU "generic resource"
#SBATCH --mem=4000M               # memory per node
#SBATCH --time=0-05:00            # time (DD-HH:MM)
#SBATCH --output=%N-%j.out        # %N for node name, %j for jobID
nvidia-smi
}}


Because no node count is specified in the above example, one node will be allocated. If you were to add --nodes=3, the total memory allocated would be 12000M. The same goes for --gres: If you request three nodes, you will get one GPU per node, for a total of three.
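To make that concrete, the directives for a three-node variant of the request would look like this (a sketch; as described above, it allocates 4000M of memory and one GPU on each of the three nodes):

<source lang="bash">
#SBATCH --nodes=3
#SBATCH --gres=gpu:1              # one GPU per node, three in total
#SBATCH --mem=4000M               # per node, so 12000M in total
#SBATCH --time=0-05:00
</source>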

For more on running GPU jobs, see [[Using GPUs with SLURM]].

=== Array job ===

Also known as a task array, an array job is a way to submit a whole set of jobs with one command. The individual jobs in the array are distinguished by an environment variable, <code>$SLURM_ARRAY_TASK_ID</code>, which is set to a different value for each instance of the job.

<source lang="bash">
sbatch --array=0-7 ...      # $SLURM_ARRAY_TASK_ID will take values from 0 to 7 inclusive
sbatch --array=1,3,5,7 ...  # $SLURM_ARRAY_TASK_ID will take the listed values
sbatch --array=1-7:2 ...    # Another way to do the same thing
sbatch --array=1-100%10 ... # Allow no more than 10 of the jobs to run simultaneously
</source>
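As a sketch of how the variable is typically used, the job script below runs the same (hypothetical) program on a different input file for each element of the array:

<source lang="bash">
#!/bin/bash
#SBATCH --account=def-someuser
#SBATCH --time=0-00:10
#SBATCH --array=0-7
# $SLURM_ARRAY_TASK_ID takes values 0..7, one per array element
./my_program input_${SLURM_ARRAY_TASK_ID}.dat
</source>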

== Interactive jobs ==

Though batch submission is the most common and most efficient way to take advantage of our clusters, interactive jobs are also supported. These can be useful for things like:

* Data exploration at the command line
* Interactive "console tools" like R and iPython
* Significant software development, debugging, or compiling

You can start an interactive session on a compute node with <code>salloc</code>. In the following example we request two tasks, which correspond to two CPU cores, for an hour:

<source lang="bash">
[name@login ~]$ salloc --time=1:0:0 --ntasks=2
salloc: Granted job allocation 1234567
</source>

Then we start a shell (bash) with timeout disabled (<code>--wait 0</code>):

<source lang="bash">
[name@login ~]$ srun --wait 0 --pty bash
[name@node01 ~]$ ...             # do some work
[name@node01 ~]$ exit            # log out of the compute node (terminate srun)
[name@login ~]$ exit             # terminate the allocation
salloc: Relinquishing job allocation 1234567
</source>

For more details see [[Interactive jobs]].

== Monitoring jobs ==

By default <code>squeue</code> will show all the jobs the scheduler is managing at the moment. It may run much faster if you ask only about your own jobs with

<source lang="bash">
squeue -u <username>
</source>

You can show only running jobs, or only pending jobs:

<source lang="bash">
squeue -u <username> -t RUNNING
squeue -u <username> -t PENDING
</source>

You can show detailed information for a specific job with <code>scontrol</code>:

<source lang="bash">
scontrol show job -dd <jobid>
</source>

Find information about a completed job with <code>sacct</code>, and optionally, control what it prints using <code>--format</code>:

<source lang="bash">
sacct -j <jobid>
sacct -j <jobid> --format=JobID,JobName,MaxRSS,Elapsed
</source>

Use the MaxRSS accounting field to determine how much memory a job needed. The value returned will be the largest resident set size for any of the tasks. If you want to know which task and node this occurred on, print the MaxRSSTask and MaxRSSNode fields also.
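For example, one way to add those fields to the earlier <code>sacct</code> query:

<source lang="bash">
sacct -j <jobid> --format=JobID,MaxRSS,MaxRSSTask,MaxRSSNode,Elapsed
</source>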

The sstat command works on a running job much the same way that sacct works on a completed job.
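For example, a minimal <code>sstat</code> query might look like the following (a sketch; depending on how the job was launched you may need to name a specific step, such as the batch step):

<source lang="bash">
sstat -j <jobid> --format=JobID,MaxRSS,AveCPU
sstat -j <jobid>.batch --format=JobID,MaxRSS,AveCPU   # for the batch step specifically
</source>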

You can ask to be notified by email of certain job conditions by supplying options to sbatch:

<source lang="bash">
#SBATCH --mail-user=<email_address>
#SBATCH --mail-type=BEGIN
#SBATCH --mail-type=END
#SBATCH --mail-type=FAIL
#SBATCH --mail-type=REQUEUE
#SBATCH --mail-type=ALL
</source>

== Cancelling jobs ==

Use <code>scancel</code> with the job ID to cancel a job:

<source lang="bash">
scancel <jobid>
</source>

You can also use it to cancel all your jobs, or all your pending jobs:

<source lang="bash">
scancel -u <username>
scancel -t PENDING -u <username>
</source>

== Troubleshooting ==

=== Avoid hidden characters in job scripts ===

Preparing a job script with a word processor instead of a text editor is a common cause of trouble. Best practice is to prepare your job script on the cluster using an editor such as nano, vim, or emacs. If you prefer to prepare or alter the script off-line, then:

* Windows users:
** Use a text editor such as Notepad or Notepad++.
** After uploading the script, use <code>dos2unix</code> to change Windows end-of-line characters to Linux end-of-line characters, as shown in the example below.
* Mac users:
** Open a terminal window and use an editor such as nano, vim, or emacs.
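For example, assuming the <code>dos2unix</code> utility is available on the cluster:

<source lang="bash">
dos2unix simple_job.sh    # converts CRLF (Windows) line endings to LF in place
</source>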

== Further reading ==