Job scheduling policies
You can accomplish much of your work on Cedar or Graham by submitting jobs that specify only the number of cores, the amount of memory, and a run-time limit. However, if you submit large numbers of jobs, or jobs that require large amounts of resources, you may be able to improve your productivity by understanding the policies affecting job scheduling.
Priority and fair-share
The order in which jobs are considered for scheduling is determined by priority. Priority on our systems is determined using the Fair Tree algorithm.[1]
Each job is billed to an account, corresponding to a Resource Allocation Project (RAP). You specify the account with the --account argument to sbatch. This could be an account resulting from a Resource Allocation Competition, in which case the account name will probably begin with rrg- or rpp-, or it could be a non-RAC account, in which case the account name will probably begin with def-. See Accounts and Projects for how to determine what accounts you can use. (These accounts are called "associations" in the Slurm documentation.)
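For example, a minimal job-script header specifying the account might look like the following sketch, where def-someuser and my_program are placeholders to be replaced with your own account and program:

#!/bin/bash
#SBATCH --account=def-someuser   # placeholder: use one of your own accounts
#SBATCH --time=01:00:00
#SBATCH --mem=4000M
#SBATCH --cpus-per-task=1
./my_program                     # placeholder executable

The same option can also be given on the command line, e.g. sbatch --account=def-someuser job_script.sh.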
Every account has a target usage level. Non-RAC projects all have equal target usage, while RAC projects have target usages determined by the number of CPU-years (or GPU-years) granted with each RAC award. About 10% of each cluster is reserved for non-RAC jobs.
An account which has been consistently using its target amount should have a fair-share factor of 0.50. If the account has used more than its target usage in recent weeks, then its fair-share factor will be depressed below 0.50 and the priority of new jobs billed to that account will also be low. If a given account has used less than its target usage in recent weeks, then its fair-share factor will be greater than 0.50 and new jobs will enjoy high priority. Past usage is discounted with a half-life of two weeks.
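As a rough illustration of what a two-week half-life means, each unit of past usage is weighted by a factor of 0.5 for every two weeks of age. The snippet below only illustrates that decay weight, not Slurm's internal bookkeeping:

awk 'BEGIN { for (w = 0; w <= 8; w += 2) printf "usage %d weeks old carries weight %.4f\n", w, 0.5^(w/2) }'

So usage from two weeks ago counts half as much as usage from today, usage from four weeks ago a quarter as much, and so on.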
Non-RAC jobs all share in the 10% pool reserved for them. Consequently, the more non-RAC jobs that are run, the lower will be the priority of all other non-RAC jobs on that cluster in succeeding weeks.
To see share information for a research group:
[rdickson@gra-login4 ~]$ sshare -l -A def-prof1_cpu -u prof1,grad2,postdoc3
Account User RawShares NormShares ... EffectvUsage ... LevelFS ...
-------------- ---------- ---------- ----------- ... ------------ ... ---------- ...
def-prof1_cpu 1 0.000233 ... 0.000002 ... 120.013884 ...
def-prof1_cpu prof1 1 0.111111 ... 0.000000 ... inf ...
def-prof1_cpu grad2 1 0.111111 ... 0.055622 ... 1.997620 ...
def-prof1_cpu postdoc3 1 0.111111 ... 0.944378 ... 0.117655 ...
The actual output includes many fields that are not relevant; the above has been greatly simplified.
- RawShares is the number of CPU-years granted to the group for use on this cluster in the Resource Allocation Competition. Default allocations have a RawShares value of 1 (as in this example).
- NormShares is the shares assigned to the user or account, normalized to the total number of assigned shares within the level. So on the first line, the NormShares of 0.000233 is the fraction of shares held by this group relative to all other groups. The NormShares of 0.111111 on the other three lines are the fraction of shares held by each member of the group relative to the other members. (This group has nine members, some not shown.)
- EffectvUsage is the association's usage normalized with respect to its parent; that is, the group's usage relative to other groups, and each user's usage relative to the other users in the group.
- LevelFS is the association's fair-share value compared to its siblings, calculated as NormShares / EffectvUsage. If an association is over-served, the value is between 0 and 1. If an association is under-served, the value is greater than 1. Associations with no usage receive the highest possible value, infinity.
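A quick arithmetic check of the LevelFS values in the example above (the small differences come from the rounding of the displayed NormShares and EffectvUsage fields):

awk 'BEGIN { printf "grad2:    %.4f\n", 0.111111 / 0.055622
             printf "postdoc3: %.4f\n", 0.111111 / 0.944378 }'

This gives roughly 2.0 for grad2 (under-served) and roughly 0.12 for postdoc3 (over-served), matching the sshare output.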
Whole nodes versus cores
Parallel calculations which can efficiently use 32 or more cores may benefit from being scheduled on whole nodes. Some of the nodes in each cluster are reserved for jobs which request one or more entire nodes. The nodes in Cedar and Graham have 32 cores each (except for Cedar's GPU nodes, which have 24 conventional cores each). Therefore parallel work requiring multiples of 32 cores should request
--nodes=N --ntasks-per-node=32
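A minimal sketch of such a whole-node request, assuming an MPI program (the account name and executable are placeholders):

#!/bin/bash
#SBATCH --account=def-someuser    # placeholder account
#SBATCH --nodes=2                 # two whole nodes
#SBATCH --ntasks-per-node=32      # all 32 cores on each node
#SBATCH --time=12:00:00
srun ./mpi_program                # placeholder MPI executable; srun starts one task per core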
If you have huge amounts of serial work and can efficiently use GNU Parallel or other techniques to pack serial processes onto a single node, you may similarly use --nodes.
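For instance, a sketch of packing serial work onto one whole node with GNU Parallel (the module name, executable, and input files are assumptions; adapt them to your own workflow):

#!/bin/bash
#SBATCH --account=def-someuser     # placeholder account
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=32       # reserve the whole node
#SBATCH --time=03:00:00
module load parallel               # assumed module name for GNU Parallel
# Run 32 serial processes at a time, one per core, over all input files.
parallel -j "$SLURM_NTASKS_PER_NODE" ./serial_task {} ::: input_*.dat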
Note that requesting an inefficient number of processors for a calculation simply in order to take advantage of any whole-node scheduling advantage will be construed as abuse of the system. For example, a program which takes just as long to run on 32 cores as on 16 cores should request --ntasks=16, not --nodes=1 --ntasks-per-node=32. (Although --nodes=1 --ntasks-per-node=16 is fine if you need all the tasks to be on the same node.) Similarly, using whole nodes commits the user to a specific amount of memory; submitting whole-node jobs that underutilize memory is as abusive as underutilizing cores.
Whole-node memory
The most common compute nodes at Cedar and Graham have 128GB of memory, but a small portion of that memory is reserved for the use of the operating system. If you request --mem=128G, your job will not qualify to run on these "base" nodes and therefore may wait longer than necessary to start. A memory request of --mem=128000M will allow your job to run on these nodes and will therefore probably start sooner. Requesting --mem=0 is a special case which grants the job access to all of the memory on each node.
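For example, each of the following directives, used one at a time, corresponds to one of the three cases described above:

#SBATCH --mem=128000M   # fits on a 128GB base node, leaving a little memory for the OS
#SBATCH --mem=128G      # more than a base node can offer, so the job waits for a larger-memory node
#SBATCH --mem=0         # special case: request all of the memory on each allocated node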
Time limits
Cedar and Graham will accept jobs of up to 28 days in run-time. However, jobs of that length will be restricted to use only a small fraction of the cluster. (Approximately 10%, but this fraction is subject to change without notice.)
There are also several partitions for progressively shorter run-times. Currently there are partitions for jobs of
- 3 hours or less,
- 12 hours or less,
- 24 hours (1 day) or less,
- 72 hours (3 days) or less,
- 7 days or less, and
- 28 days or less.
Because a job of 3 hours or less also fits within the 12-hour, 24-hour, and longer limits, shorter jobs can always run in partitions with longer time limits. A shorter job therefore has more scheduling opportunities than an otherwise-identical longer job.
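For example, choosing a run-time just under one of the cutoffs keeps a job eligible for as many partitions as possible; either of the following directives could be used, one at a time (Slurm accepts hours:minutes:seconds or days-hours:minutes:seconds):

#SBATCH --time=2:59:00      # under 3 hours: eligible for every partition, from the shortest up
#SBATCH --time=7-00:00:00   # 7 days: only the 7-day and 28-day partitions can run it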
Backfilling
The scheduler employs backfilling to improve overall system usage.
Without backfill scheduling, each partition is scheduled strictly in priority order, which typically results in significantly lower system utilization and responsiveness than otherwise possible. Backfill scheduling will start lower priority jobs if doing so does not delay the expected start time of any higher priority jobs. Since the expected start time of pending jobs depends upon the expected completion time of running jobs, reasonably accurate time limits are important for backfill scheduling to work well.
Backfilling will primarily benefit jobs with short time limits, e.g. under 3 hours.
Preemption
You can access more resources if your application can be checkpointed, stopped, and restarted efficiently.
TODO: Instructions on submitting a preemptible job
Percentage of the nodes you have access to
This section aims to give some insight into how Cedar and Graham are partitioned.
First, the nodes are partitioned into four different categories:
- base nodes (which have 4 or 8 GB of memory per core)
- large memory nodes (which have 16 to 96 GB of memory per core)
- GPU nodes
- large GPU nodes (on Cedar only)
Upon submission, your job will be routed to one of these categories based on what resources are requested.
Second, within each of the above categories, some nodes are reserved for jobs which can make use of complete nodes (i.e. jobs which use all of the resources available on the allocated nodes). If your job only uses a few cores (or a single core) out of each node, it is only allowed to use a subset of the category.
Finally, the nodes are partitioned based on the walltime requested by your job. These partitions are organized much like Matryoshka (Russian) dolls, with shorter walltime being able to fit in larger walltime categories. For example, a job with less than 3 hours of requested walltime can run on a node that allows 12 hours, but not the other way around.
For each of the four categories, we list below the rough percentage of the nodes in that category that you can use, depending on the walltime requested and on whether your job requests complete nodes or individual cores/GPUs. The percentages are rounded to the nearest 5% and may be adjusted in the future.
CPU base nodes (less than ~7.5 GB/core)
Number of nodes of this type on Cedar: 691
Number of nodes of this type on Graham: 851
Type of request \ Time limit | <= 3h | 3-12h | 12-24h | 1-3d | 3-7d | 7-28d |
---|---|---|---|---|---|---|
By node (Cedar) | 100% | 90% | 80% | 70% | 35% | 20% |
By core (Cedar) | 45% | 40% | 30% | 30% | 15% | 5% |
By node (Graham) | 100% | 95% | 90% | 75% | 20% | 10% |
By core (Graham) | 50% | 45% | 40% | 30% | 10% | 5% |
Large memory nodes (more than ~7.5 GB/core)
Number of nodes of this type on Cedar: 50
Number of nodes of this type on Graham: 27
Type of request \ Time limit | <= 3h | 3-12h | 12-24h | 1-3d | 3-7d | 7-28d |
---|---|---|---|---|---|---|
By node (Cedar) | 100% | 100% | 100% | 90% | 35% | 5% |
By core (Cedar) | 10% | 10% | 10% | 10% | 5% | 5% |
By node (Graham) | 100% | 90% | 90% | 75% | 10% | 5% |
By core (Graham) | 45% | 40% | 40% | 10% | 10% | 5% |
GPU base nodes
Number of nodes of this type on Cedar: 112 (4 GPU per node)
Number of nodes of this type on Graham: 156 (2 GPU per node)
Type of request \ Time limit | <= 3h | 3-12h | 12-24h | 1-3d | 3-7d | 7-28d |
---|---|---|---|---|---|---|
By node (Cedar) | 100% | 85% | 85% | 55% | 30% | 14% |
By GPU (Cedar) | 60% | 60% | 40% | 30% | 5% | 5% |
By node (Graham) | 100% | 90% | 75% | 65% | 10% | 10% |
By GPU (Graham) | 50% | 45% | 40% | 30% | 10% | 10% |
Large GPU nodes
Number of nodes of this type on Cedar: 32 (4 GPU per node, 256GB of memory)
Number of nodes of this type on Graham: 0
Type of request \ Time limit | <= 3h | 3-12h | 12-24h | 1-3d | 3-7d | 7-28d |
---|---|---|---|---|---|---|
By node | 100% | 90% | 75% | 65% | 25% | 15% |
By GPU | 0% | 0% | 0% | 0% | 0% | 0% |
- ↑ A detailed description of Fair Tree can be found at https://slurm.schedmd.com/SC14/BYU_Fair_Tree.pdf, with references to early rock music.