Job scheduling policies

<languages/>
<translate>


<!--T:1-->
''Parent page: [[Running jobs]]''


<!--T:2-->
You can do much work on our clusters by [[Running jobs|submitting jobs]]
that specify only the number of cores and a runtime limit.
However, if you submit large numbers of jobs, or jobs that require large
amounts of resources, you may be able to improve your productivity
by understanding the policies affecting job scheduling.


=== Priority and fair-share === <!--T:7-->


<!--T:4-->
The order in which jobs are considered for scheduling is determined by ''priority''. Priority on our systems is determined using the [https://slurm.schedmd.com/fair_tree.html Fair Tree] algorithm.<ref>A detailed description of Fair Tree can be found at https://slurm.schedmd.com/SC14/BYU_Fair_Tree.pdf, with references to early rock'n'roll music.</ref>


<!--T:38-->
Each job is charged to a Resource Allocation Project (RAP).
You specify the project with the <code>--account</code> argument to <code>sbatch</code>.
The project might hold a grant of CPU or GPU time from a [https://www.computecanada.ca/research-portal/accessing-resources/resource-allocation-competitions/ Resource Allocation Competition], in which case the account code will probably begin with <code>rrg-</code> or <code>rpp-</code>. Or it could be a non-RAC project, also known as a Rapid Access Service project, in which case the account code will probably begin with <code>def-</code>. See [[Running_jobs#Accounts_and_projects|Accounts and Projects]] for how to determine what account codes you can use.
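For example, a job charged to the hypothetical project <code>def-prof1</code> could be submitted as follows (the account code and script name here are illustrative only):

 [name@server ~]$ sbatch --account=def-prof1 --time=3:00:00 --cpus-per-task=1 --mem=4G job_script.sh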


<!--T:39-->
Every project has a target usage level. Non-RAC projects all have equal target usage, while RAC projects have target usages determined by the number of CPU-years or GPU-years granted with each RAC award.


<!--T:42-->
As an example, let us imagine a research group with the account code <code>def-prof1</code>. Members of this imaginary group have user names <code>prof1</code>, <code>grad2</code> and <code>postdoc3</code>. We can examine the group's usage and share information with the <code>sshare</code> command as shown below. Note that we must append <code>_cpu</code> or <code>_gpu</code> to the end of the account code, as appropriate, since CPU and GPU use are tracked separately.


<!--T:59-->
 [prof1@gra-login4 ~]$ sshare -l -A def-prof1_cpu -u prof1,grad2,postdoc3
        Account       User  RawShares  NormShares  RawUsage  ... EffectvUsage  ...    LevelFS  ...
 -------------- ---------- ---------- -----------  --------  ... ------------  ... ----------  ...
 <span style="color:#ff0000">def-prof1_cpu                 434086    0.001607   1512054  ...     0.000043  ...  37.357207  ...</span>
  def-prof1_cpu      prof1          1    0.100000         0  ...     0.000000  ...        inf  ...
  def-prof1_cpu      grad2          1    0.100000     54618  ...     0.036122  ...   2.768390  ...
  def-prof1_cpu   postdoc3          1    0.100000    855517  ...     0.565798  ...   0.176741  ...


<!--T:43-->
The output shown above has been simplified by removing several fields which are not relevant to this discussion.
Furthermore, the line that is ''most'' important for scheduling is the first one, highlighted in red.
This line describes the status of the project relative to all other projects using the cluster.
In this example, the research group holds 0.1607% of the shares on the cluster but has used only 0.0043% of its resources, so the group's LevelFS is about 37, which is quite high. We would expect jobs submitted by this group to have a fairly high priority.


<!--T:60-->
Successive lines describe the status of each user relative to other users ''in this project''.
Reading the third line, <code>grad2</code> holds 1 share within the group, representing 10% of the group's allocation, but is responsible for only 3.6122% of the group's recent resource use, and therefore has a higher-than-average LevelFS within the group. We would expect jobs submitted by grad2 to have slightly higher priority than jobs submitted by postdoc3, but lower priority than jobs submitted by prof1.
The priority of jobs belonging to the def-prof1 group, as compared to the priority of jobs belonging to other research groups, is determined solely by the group's fair-share, not by a user's fair-share within the group.


<!--T:61-->
The project by itself, or the user within a project, is referred to as an "association" in the Slurm documentation.
* <code>Account</code>, obviously, is the project name with <code>_cpu</code> or <code>_gpu</code> appended.
* <code>User</code>: Notice that the first line of output, the highlighted line, does not include a user name.
* <code>RawShares</code> is proportional to the number of CPU-years that was granted to the project for use on this cluster in the Resource Allocation Competition. All non-RAC accounts have small equal numbers of shares.  For numeric reasons, inactive accounts (which do not have pending or running jobs) are given only one share.  Activity is checked periodically, so if you submit a job with an inactive account, it may take up to 15 minutes before the account shows the expected <code>RawShares</code> and <code>LevelFS</code>.
* <code>NormShares</code> is the number of shares assigned to the user or account divided by the total number of assigned shares within the level. So for the first line, the NormShares of 0.001607 is the fraction of the shares held by the project, relative to all other projects. The NormShares of 0.100000 on the other three lines is the fraction of shares held by each member of the project relative to the other members. (This project has ten members, but we only asked for information about three.)
* <code>RawUsage</code> is calculated from the total number of resource-seconds (that is, CPU time, GPU time, and memory) that have been charged to this account.  Past usage is discounted with a [https://en.wikipedia.org/wiki/Half-life half-life] of one week, so usage more than a few weeks in the past will have only a small effect on priority.
* <code>EffectvUsage</code> is the association's usage normalized with its parent; that is, the project's usage relative to other projects, the user's relative to other users in that project. In this example, <code>postdoc3</code> has 56.6% of the project's usage, and <code>grad2</code> has 3.6%.
* <code>LevelFS</code> is the association's fairshare value compared to its siblings, calculated as NormShares / EffectvUsage. If an association is over-served, the value is between 0 and 1. If an association is under-served, the value is greater than 1. Associations with no usage receive the highest possible value, infinity. For inactive accounts, as described above for <code>RawShares</code>, this value equals a meaningless small number close to 0.0001.
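As a quick check of the arithmetic for the highlighted first line above, LevelFS = NormShares / EffectvUsage. Using the rounded values as displayed (so the result differs slightly from the stored 37.357207):

 [name@server ~]$ echo "scale=6; 0.001607 / 0.000043" | bc
 37.372093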


<!--T:40-->
A project which consistently uses its target amount will have a LevelFS near 1.0. If the project uses more than its target, its LevelFS will drop below 1.0 and the priority of new jobs belonging to that project will also be low. If the project uses less than its target, its LevelFS will be greater than 1.0 and new jobs will enjoy high priority.


<!--T:54-->
'''See also:''' [[Allocations and compute scheduling]].


=== Whole nodes versus cores === <!--T:12-->


<!--T:13-->
Parallel calculations which can efficiently use 32 or more cores may benefit from being scheduled on '''whole nodes'''. Part of a cluster may be reserved for jobs which request one or more entire nodes. See [[Advanced MPI scheduling#Whole_nodes|whole nodes]] on the page [[Advanced MPI scheduling]] for example scripts and further discussion.


<!--T:15-->
Note that requesting an inefficient number of processors for a calculation simply in order to take advantage of whole-node scheduling will be construed as abuse of the system. For example, a program which takes just as long to run on 32 cores as on 16 cores should request <code>--ntasks=16</code>, not <code>--nodes=1 --ntasks-per-node=32</code>. (Although <code>--nodes=1 --ntasks-per-node=16</code> is fine if you need all the tasks to be on the same node.) Similarly, using whole nodes commits the user to a specific amount of memory; submitting whole-node jobs that underutilize memory is as abusive as underutilizing cores.
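A minimal sketch of a whole-node job script follows; the account code and program name are hypothetical, and the 32 cores per node is an assumption, so check the node characteristics of your cluster before copying it.

<pre>
#!/bin/bash
#SBATCH --account=def-prof1      # hypothetical account code
#SBATCH --nodes=2                # request two entire nodes
#SBATCH --ntasks-per-node=32     # assumes 32 cores per node; varies by cluster
#SBATCH --mem=0                  # request all of the memory on each node
#SBATCH --time=24:00:00
srun ./my_mpi_program            # hypothetical MPI executable
</pre>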


<!--T:14-->
If you have huge amounts of serial work and can efficiently use [[GNU Parallel]], [[GLOST]],
or [https://docs.scinet.utoronto.ca/index.php/Running_Serial_Jobs_on_Niagara other techniques] to pack
serial processes onto a single node, you are also welcome to use whole-node scheduling.


=== Time limits === <!--T:16-->

<!--T:17-->
[[Niagara]] accepts jobs of up to 24 hours run-time; [[Béluga/en|Béluga]], [[Graham]] and [[Narval]], up to 7 days; and [[Cedar]], up to 28 days.

<!--T:18-->
On the general-purpose clusters, longer jobs are restricted by ''partitions'' to only a fraction of the cluster. There are partitions for jobs of
* 3 hours or less,
* 12 hours or less,
* 24 hours (1 day) or less,
* 72 hours (3 days) or less,
* 7 days or less, and
* 28 days or less.
Because any job of 3 hours or less also fits under 12 hours, 24 hours, and so on, shorter jobs can always run in partitions with longer time limits. A shorter job will have more scheduling opportunities than an otherwise-identical longer job.


<!--T:55-->
At Béluga, a minimum time limit of 1 hour is also imposed.
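The time limit is requested with <code>--time</code>. A job whose limit fits a shorter tier is eligible for more of the cluster; for example, this job can run in every partition tier (the script name is hypothetical):

 [name@server ~]$ sbatch --time=3:00:00 job_script.sh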
 
=== Backfilling === <!--T:19-->


<!--T:20-->
The scheduler employs [https://slurm.schedmd.com/sched_config.html backfilling] to improve
overall system usage.


<!--T:21-->
<blockquote>
Without backfill scheduling, each partition is scheduled strictly in priority order, which typically results in significantly lower system utilization and responsiveness than otherwise possible. Backfill scheduling will start lower priority jobs if doing so does not delay the expected start time of any higher priority jobs. Since the expected start time of pending jobs depends upon the expected completion time of running jobs, reasonably accurate time limits are important for backfill scheduling to work well.
</blockquote>


<!--T:22-->
Backfilling will primarily benefit jobs with short time limits, e.g. under 3 hours.
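Because backfilling depends on accurate time limits, it can help to shrink the limit of an already-queued job to something realistic. A sketch, with an illustrative job ID and limit (Slurm lets you decrease, but not increase, your own job's limit):

 [name@server ~]$ scontrol update jobid=123456 TimeLimit=2:30:00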


== Percentage of the nodes you have access to == <!--T:26-->
This section aims to give some insight into how the general-purpose clusters (Cedar and Graham) are partitioned.
 
<!--T:27-->
First, the nodes are partitioned into four different categories:
* Base nodes, which have 4 or 8 GB of memory per core
* Large memory nodes, which have 16 to 96 GB of memory per core
* GPU nodes
* Large GPU nodes (on Cedar only)
Upon submission, your job will be routed to one of these categories based on what resources are requested.
 
<!--T:28-->
Second, within each of the above categories, some nodes are reserved for jobs which can make use of complete nodes (i.e. jobs which use all of the resources available on the allocated nodes). If your job only uses a few cores (or a single core) out of each node, it is only allowed to use a subset of the category. These are referred to as "by-node" and "by-core" partitions.
 
<!--T:29-->
Finally, the nodes are partitioned based on the walltime requested by your job. Shorter jobs have access to more resources. For example, a job with less than 3 hours of requested walltime can run on any node that allows 12-hour jobs, but there are nodes which accept 3-hour jobs that do ''not'' accept 12-hour jobs.
 
<!--T:30-->
The utility <code>partition-stats</code> shows
* how many jobs are waiting to run ("queued") in each partition,
* how many jobs are currently running,
* how many nodes are currently idle, and
* how many nodes are assigned to each partition.
Here is some sample output from <code>partition-stats</code>:
 
<!--T:47-->
<pre>
[user@gra-login3 ~]$ partition-stats

Node type |                     Max walltime
          |   3 hr   |  12 hr  |  24 hr  |  72 hr  |  168 hr |  672 hr |
----------|-------------------------------------------------------------
       Number of Queued Jobs by partition Type (by node:by core)
----------|-------------------------------------------------------------
Regular   |   12:170 |  69:7066|  70:7335| 386:961 |  59:509 |   5:165 |
Large Mem |    0:0   |   0:0   |   0:0   |   0:15  |   0:1   |   0:4   |
GPU       |    5:14  |   3:8   |  21:1   | 177:110 |   1:5   |   1:1   |
----------|-------------------------------------------------------------
      Number of Running Jobs by partition Type (by node:by core)
----------|-------------------------------------------------------------
Regular   |    8:32  |  10:854 |  84:10  |  15:65  |   0:674 |   1:26  |
Large Mem |    0:0   |   0:0   |   0:0   |   0:1   |   0:0   |   0:0   |
GPU       |    5:0   |   2:13  |  47:20  |  19:18  |   0:3   |   0:0   |
----------|-------------------------------------------------------------
        Number of Idle nodes by partition Type (by node:by core)
----------|-------------------------------------------------------------
Regular   |   16:9   |  15:8   |  15:8   |   7:0   |   2:0   |   0:0   |
Large Mem |    3:1   |   3:1   |   0:0   |   0:0   |   0:0   |   0:0   |
GPU       |    0:0   |   0:0   |   0:0   |   0:0   |   0:0   |   0:0   |
----------|-------------------------------------------------------------
       Total Number of nodes by partition Type (by node:by core)
----------|-------------------------------------------------------------
Regular   |  871:431 | 851:411 | 821:391 | 636:276 | 281:164 |  90:50  |
Large Mem |   27:12  |  27:12  |  24:11  |  20:3   |   4:3   |   3:2   |
GPU       |  156:78  | 156:78  | 144:72  | 104:52  |  13:12  |  13:12  |
----------|-------------------------------------------------------------
</pre>
 
<!--T:49-->
Looking at the first entry in the table, at the upper left, the numbers <tt>12:170, 0:0</tt>, and <tt>5:14</tt> mean that there were
* 12 jobs waiting to run which requested
** whole nodes,
** less than 8GB of memory per core, and
** 3 hours or less of run time.
* 170 jobs waiting which requested
** less than whole nodes and were therefore waiting to be scheduled on individual cores,
** less than 8GB memory per core, and
** 3 hours or less of run time.
* 5 jobs waiting which requested
** a whole node equipped with GPUs and
** 3 hours or less of run time.
* 14 jobs waiting which requested
** single GPUs and
** 3 hours or less of run time.
There were no jobs running or waiting which requested large-memory nodes and 3 hours of run time.
 
<!--T:50-->
At the bottom of the table we find the division of resources by policy, independent of the immediate number of jobs. Hence there are 871 base nodes, called "regular" here (that is, nodes with 4 to 8 GB memory per core), which may receive whole-node jobs of up to 3 hours. Of those 871,
* 431 of them may also receive by-core jobs of up to three hours,
* 851 of them may receive whole-node jobs of up to 12 hours,
* and so on.


<!--T:51-->
It may help to think of these partitions as being like [https://en.wikipedia.org/wiki/Matryoshka_doll Matryoshka (Russian) dolls]. The 3-hour partition contains the nodes for the 12-hour partition as a subset. The 12-hour partition in turn contains the 24-hour partition, and so on.


<!--T:52-->
The <code>partition-stats</code> utility does not give information about the number of cores represented by running or waiting jobs, nor the number of cores free in partly-assigned nodes in by-core partitions, nor about available memory associated with free cores in by-core partitions.


<!--T:53-->
Running <code>partition-stats</code> is somewhat costly to the scheduler. Please do not write a script which automatically calls <code>partition-stats</code> repeatedly. If you have a workflow which you believe would benefit from automatic parsing of the information from <code>partition-stats</code>, please contact [[Technical support]] and ask for guidance.


== Number of jobs == <!--T:56-->


<!--T:57-->
There may be a limit on the number of jobs you can have in the system at any one time.


<!--T:62-->
On [[Graham]] and [[Béluga/en|Béluga]], normal accounts can have no more than 1000 jobs in a pending or running state at any time. Each task of a [[job arrays|job array]] counts as one job. The limit is applied using Slurm's [https://slurm.schedmd.com/sacctmgr.html MaxSubmit] parameter.
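To see how close you are to such a limit, you can count your own pending and running jobs; this one-liner is a sketch (<code>-h</code> suppresses the header):

 [name@server ~]$ squeue -u $USER -t PENDING,RUNNING -h | wc -l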


<!--T:58-->
[[Category:SLURM]]
</translate>
