Running jobs

The command to submit a job is [https://slurm.schedmd.com/sbatch.html <code>sbatch</code>]:
<source lang="bash">
$ sbatch simple_job.sh
Submitted batch job 123456
</source>
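For reference, a minimal job script like the <code>simple_job.sh</code> used above could look like this (a sketch: the account name is a placeholder, and the <code>echo</code> and <code>sleep</code> stand in for real work):
<source lang="bash">
#!/bin/bash
#SBATCH --time=00:15:00          # wall-time limit of 15 minutes
#SBATCH --account=def-someuser   # account to charge; placeholder value
echo 'Hello, world!'             # the actual computation goes here
sleep 2                          # stand-in for real work
</source>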
<!--T:59-->
You can also specify directives as command-line arguments to <code>sbatch</code>. For example,
  $ sbatch --time=00:30:00 simple_job.sh
submits the above job script with a time limit of 30 minutes. Acceptable time formats include "minutes", "minutes:seconds", "hours:minutes:seconds", "days-hours", "days-hours:minutes" and "days-hours:minutes:seconds".
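Equivalently, the time limit can be set inside the job script itself with an <code>#SBATCH</code> directive; options given on the <code>sbatch</code> command line override the corresponding directives in the script:
  #SBATCH --time=0-00:30          # 30 minutes, in days-hours:minutes form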


<!--T:62-->
<source lang="bash">
$ squeue -u $USER
       JOBID PARTITION      NAME    USER ST  TIME  NODES NODELIST(REASON)
      123456 cpubase_b  simple_j someuser  R  0:03      1 cdr234
</source>
<!--T:29-->
You can start an interactive session on a compute node with [https://slurm.schedmd.com/salloc.html salloc]. In the following example we request two tasks, which correspond to two CPU cores, for an hour:
  $ salloc --time=1:0:0 --ntasks=2 --account=def-someuser
  salloc: Granted job allocation 1234567
  $ ...             # do some work
  $ exit            # terminate the allocation
  salloc: Relinquishing job allocation 1234567


<!--T:32-->
By default, [https://slurm.schedmd.com/squeue.html squeue] shows all the jobs the scheduler is currently managing. It may run much faster if you ask only about your own jobs:
  $ squeue -u $USER


<!--T:33-->
You can show only running jobs, or only pending jobs:
  $ squeue -u <username> -t RUNNING
  $ squeue -u <username> -t PENDING


<!--T:34-->
<!--T:35-->
Find more detailed information about a completed job with [https://slurm.schedmd.com/sacct.html sacct], and optionally, control what it prints using <code>--format</code>:
  $ sacct -j <jobid>
  $ sacct -j <jobid> --format=JobID,JobName,MaxRSS,Elapsed


<!--T:153-->


<!--T:132-->
  $ srun --jobid 123456 --pty watch -n 30 nvidia-smi


<!--T:133-->


<!--T:134-->
  $ srun --jobid 123456 --pty tmux new-session -d 'htop -u $USER' \; split-window -h 'watch nvidia-smi' \; attach


<!--T:135-->
Use [https://slurm.schedmd.com/scancel.html scancel] with the job ID to cancel a job:


<!--T:39-->
  $ scancel <jobid>


<!--T:40-->
You can also use it to cancel all your jobs, or all your pending jobs:


<!--T:41-->
  $ scancel -u $USER
  $ scancel -t PENDING -u $USER


== Resubmitting jobs for long running computations == <!--T:74-->
<!--T:119-->
<source lang="console">
$ module load gcc
$ module load quantumespresso/6.1
Lmod has detected the following error:  These module(s) exist but cannot be loaded as requested: "quantumespresso/6.1"
   Try: "module spider quantumespresso/6.1" to see how to load the module(s).
$ module spider quantumespresso/6.1
</source>


<!--T:120-->