Job scheduling policies
This is not a complete article: This is a draft, a work in progress that is intended to be published into an article, which may or may not be ready for inclusion in the main wiki. It should not necessarily be considered factual or authoritative.
You can do a great deal of work on Cedar or Graham by submitting jobs that specify only the number of cores, the associated memory, and a run-time limit. However, if you submit large numbers of jobs, or jobs that require large amounts of resources, you may be able to improve your productivity by understanding the policies that affect job scheduling.
Priority
The order in which jobs are considered for scheduling is determined by priority. Many factors affect the priority of a job. The SLURM documentation gives job priority as a weighted sum of several factors (see SLURM docs on priority):
Job_priority =
    (PriorityWeightAge) * (age_factor) +
    (PriorityWeightFairshare) * (fair-share_factor) +
    (PriorityWeightJobSize) * (job_size_factor) +
    (PriorityWeightPartition) * (partition_factor) +
    (PriorityWeightQOS) * (QOS_factor) +
    SUM(TRES_weight_cpu * TRES_factor_cpu,
        TRES_weight_<type> * TRES_factor_<type>,
        ...)
On our systems the fair-share_factor has by far the largest weight, so it has a significant impact on job priority.
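As a rough illustration of the weighted sum above, the sketch below adds up weight and factor pairs. The function name job_priority and every number passed to it are hypothetical, chosen only to show how a dominant fair-share weight drives the result; they are not the values configured on Cedar or Graham, and the TRES term is left out for brevity.

```bash
# job_priority W1 F1 [W2 F2 ...]
# Prints the sum of weight * factor over all pairs, rounded to the
# nearest integer. Weights and factors are supplied by the caller;
# the ones used below are HYPOTHETICAL.
job_priority() {
    awk 'BEGIN {
        total = 0
        # ARGV[1] .. ARGV[ARGC-1] hold the weight/factor pairs.
        for (i = 1; i + 1 < ARGC; i += 2)
            total += ARGV[i] * ARGV[i + 1]
        printf "%d\n", int(total + 0.5)
    }' "$@"
}

# Pairs in order: age, fair-share, job size (all values hypothetical).
job_priority 1000 0.5 100000 0.5  1000 0.25   # prints 50750
job_priority 1000 0.5 100000 0.25 1000 0.25   # prints 25750
```

Halving the fair-share factor in the second call roughly halves the total, because the other terms are small by comparison. On a cluster running SLURM, the sprio command reports the configured weights and the per-job factors.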
When submitting jobs you must choose an account to "bill" the job to. This can be an account resulting from a Resource Allocation Competition (RAC), e.g. running sbatch with the --account option and specifying an account beginning with rrg or rpp, or a non-RAC account, e.g. an account starting with def.
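The naming convention described above can be checked with a small helper. This is only a sketch: the function name account_kind is hypothetical, and it looks at nothing but the rrg, rpp, and def prefixes mentioned in the text.

```bash
# account_kind ACCOUNT
# Prints "RAC" for account names starting with rrg or rpp, "non-RAC"
# for names starting with def, and "unknown" for anything else.
# Only the prefixes named in the text are recognized.
account_kind() {
    case "$1" in
        rrg*|rpp*) echo "RAC" ;;
        def*)      echo "non-RAC" ;;
        *)         echo "unknown" ;;
    esac
}

account_kind def-user01   # prints non-RAC
account_kind rrg-user01   # prints RAC
```

The chosen account is then passed to sbatch, for example `sbatch --account=def-user01 job.sh`, where both the account name and job.sh are placeholders.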
Only about 10% of each of the clusters is reserved for non-RAC jobs. The priority of jobs in the non-RAC allocation is adjusted so that, on average, only about 10% of the cluster is used for these jobs. If fewer non-RAC jobs are run, non-RAC job priorities go up; conversely, if more non-RAC jobs are run, their priorities go down.
You can see the fair-share factor, FairShare, for all accounts on a given cluster by running the sshare command:
[name@server]$ sshare
Account              User       RawShares  NormShares  RawUsage           EffectvUsage  FairShare
-------------------- ---------- ---------- ----------- ------------------ ------------- ----------
root                                       1.000000    84241308075674887  1.000000      0.500000
no_rac_cpu                      3103       0.124374    80949587847079770  0.960925      0.004723
ras_basic_cpu                   3103       0.124334    80949587847079770  0.960925      0.004715
cc-debug_cpu                    1          0.000031    179535             0.000239      0.004715
cc-debug_cpu         name       1          0.000000    16                 0.000001      0.004715
def-user01_cpu                  1          0.000031    831862838324       0.000249      0.003778
def-user02_cpu                  1          0.000031    0                  0.000239      0.004715
...
In the above output, notice that in the last line the RawUsage is 0 but the EffectvUsage is not. This is because the effective usage takes into account that other users have been running jobs in other sub-accounts of the parent no_rac_cpu account, which is limited to about 10% of the cluster. When the total usage of this parent account goes up, so does the effective usage of its child accounts. However, if a sibling account, for example def-user01_cpu, uses a relatively significant amount of resources, its effective usage rises relative to the other accounts under the same parent and its FairShare factor falls accordingly. If the jobs in the queue mostly belong to def- accounts, that is, to your peers, it makes more sense to compare your FairShare factor with those of your peers than with 0.5.
To see share information for a single account:
[name@server]$ sshare -A def-user01_cpu
Account              User       RawShares  NormShares  RawUsage           EffectvUsage  FairShare
-------------------- ---------- ---------- ----------- ------------------ ------------- ----------
def-user01_cpu                  1          0.000031    831862838324       0.000249      0.003778
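To compare your FairShare factor with those of your peers, the sshare output can be filtered with standard tools. The following is a minimal sketch that assumes the default column layout shown above, with the account name in the first column and FairShare in the last; the function names fairshare_of and rank_by_fairshare are hypothetical.

```bash
# fairshare_of ACCOUNT
# Reads sshare-style output on stdin and prints the FairShare value
# (last column) of every row whose first column equals ACCOUNT.
fairshare_of() {
    awk -v acct="$1" '$1 == acct { print $NF }'
}

# rank_by_fairshare
# Reads sshare-style output on stdin and prints "FairShare Account"
# for every def- account, highest FairShare first.
rank_by_fairshare() {
    awk '$1 ~ /^def-/ { print $NF, $1 }' | sort -k1,1nr
}
```

Typical use would be `sshare | rank_by_fairshare` or `sshare | fairshare_of def-user01_cpu`. Relying on the last column keeps the filter working whether or not the User column is filled in.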
For more information on SLURM's fair share see the fairshare section of the SLURM documentation.
Whole nodes versus cores
Parallel calculations which can efficiently use 32 or more cores may benefit from being scheduled on whole nodes. Some of the nodes in each cluster are reserved for jobs which request one or more entire nodes. The nodes in Cedar and Graham have 32 cores each (except for Cedar's GPU nodes, which have 24 conventional cores each). Therefore parallel work requiring multiples of 32 cores should request
--nodes=N --ntasks-per-node=32
If you have huge amounts of serial work and can efficiently use GNU Parallel or other techniques to pack serial processes onto a single node, you may similarly use --nodes.
Note that requesting an inefficient number of processors for a calculation simply to take advantage of whole-node scheduling will be construed as abuse of the system. For example, a program which takes just as long to run on 32 cores as on 16 cores should request --ntasks=16, not --nodes=1 --ntasks=32. Similarly, using whole nodes commits the user to a specific amount of memory: submitting whole-node jobs that underutilize memory is as abusive as underutilizing cores.
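The rule of thumb in this section can be sketched as a helper that turns a core count into sbatch options. The function name node_request is hypothetical, and the default of 32 cores per node is taken from the text; it says nothing about whether the program actually scales to that many cores, which remains the user's responsibility.

```bash
# node_request CORES [CORES_PER_NODE]
# Prints whole-node options when CORES is a positive multiple of
# CORES_PER_NODE (default 32), and a plain per-core request otherwise.
node_request() {
    local cores=$1 per_node=${2:-32}
    if (( cores > 0 && cores % per_node == 0 )); then
        echo "--nodes=$(( cores / per_node )) --ntasks-per-node=${per_node}"
    else
        echo "--ntasks=${cores}"
    fi
}

node_request 64   # prints --nodes=2 --ntasks-per-node=32
node_request 16   # prints --ntasks=16
```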
Time limits
Cedar and Graham will accept jobs of up to 28 days in run-time. However, jobs of that length will be restricted to use only a small fraction of the cluster. (Approximately 10%, but this fraction is subject to change without notice.)
There are separate partitions for jobs with progressively shorter run-time limits. Currently there are partitions for jobs of
- 3 hours or less,
- 12 hours or less,
- 24 hours (1 day) or less,
- 72 hours (3 days) or less,
- 7 days or less, and
- 28 days or less.
Because a job of 3 hours or less also fits within 12 hours, 24 hours, and so on, shorter jobs can always run in partitions with longer time limits. A shorter job will therefore have more scheduling opportunities than an otherwise identical longer job.
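To see which of the time windows listed above a job falls into, the limits can be walked from shortest to longest. This is a sketch only: the function name shortest_window is hypothetical, and the limits are the ones listed in this section expressed in hours (7 days = 168 hours, 28 days = 672 hours).

```bash
# shortest_window HOURS
# Prints the smallest partition time limit, in hours, that can hold a
# job with a run-time limit of HOURS (a whole number). Returns a
# non-zero status if HOURS exceeds the 28-day maximum.
shortest_window() {
    local hours=$1 limit
    for limit in 3 12 24 72 168 672; do
        if (( hours <= limit )); then
            echo "$limit"
            return 0
        fi
    done
    echo "run time exceeds the 28-day maximum" >&2
    return 1
}

shortest_window 3     # prints 3
shortest_window 4     # prints 12
shortest_window 100   # prints 168
```

A job that needs 4 hours lands in the 12-hour window, so trimming its limit to 3 hours, if the work allows, makes it eligible for every window. The limit itself is requested with the sbatch --time option.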
Backfilling
The scheduler employs backfilling to improve overall system usage.
Without backfill scheduling, each partition is scheduled strictly in priority order, which typically results in significantly lower system utilization and responsiveness than otherwise possible. Backfill scheduling will start lower priority jobs if doing so does not delay the expected start time of any higher priority jobs. Since the expected start time of pending jobs depends upon the expected completion time of running jobs, reasonably accurate time limits are important for backfill scheduling to work well.
Backfilling will primarily benefit jobs with short time limits, e.g. under 3 hours.
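The backfill condition can be pictured with a deliberately simplified model. The sketch below is not the scheduler's actual algorithm: it assumes a single pool of idle cores and a single higher-priority job waiting for it, and the function name can_backfill and its parameters are hypothetical.

```bash
# can_backfill FREE_CORES JOB_CORES JOB_LIMIT_MIN GAP_MIN
# Succeeds (exit status 0) when a lower-priority job both fits in the
# currently idle cores and has a time limit that ends no later than
# the expected start of the waiting higher-priority job, GAP_MIN
# minutes from now. Toy model only.
can_backfill() {
    local free_cores=$1 job_cores=$2 job_limit_min=$3 gap_min=$4
    (( job_cores <= free_cores && job_limit_min <= gap_min ))
}

# 16 idle cores, next higher-priority job expected to start in 180 min.
can_backfill 16 8 120 180 && echo "2-hour job can start now"
can_backfill 16 8 240 180 || echo "4-hour job must wait"
```

In this model the only difference between the two jobs is the requested time limit, which is why an inflated limit can cost a job its backfill opportunity.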
Preemption
You can access more resources if your application can be checkpointed, stopped, and restarted efficiently.
TODO: Instructions on submitting a preemptible job
Percentage of the nodes you have access to
CPU base nodes
Number of nodes of this type on Cedar: 691. Number of nodes of this type on Graham: 801.
Type of request \ Duration | 3 h or less | 3-12 h | 12-24 h | 1-3 d | 3-7 d | 7-28 d |
---|---|---|---|---|---|---|
By nodes (Cedar) | 100% | 90% | 80% | 70% | 35% | 20% |
By cores (Cedar) | 45% | 40% | 30% | 30% | 15% | 5% |
By nodes (Graham) | 100% | 95% | 90% | 75% | 20% | 10% |
By cores (Graham) | 50% | 45% | 40% | 30% | 10% | 5% |
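The percentages in the table translate into approximate node counts by simple arithmetic. The sketch below uses the node totals and percentages given above; the function name eligible_nodes is hypothetical, and rounding down is an assumption, since the text does not say how partial nodes are handled.

```bash
# eligible_nodes TOTAL_NODES PERCENT
# Prints PERCENT percent of TOTAL_NODES, rounded down to a whole node.
eligible_nodes() {
    echo $(( $1 * $2 / 100 ))
}

eligible_nodes 691 20   # Cedar, by nodes, 7-28 d: prints 138
eligible_nodes 801 10   # Graham, by nodes, 7-28 d: prints 80
```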
Large memory nodes
Type of request \ Duration | 3 h or less | 3-12 h | 12-24 h | 1-3 d | 3-7 d | 7-28 d |
---|---|---|---|---|---|---|
By nodes | 100% | 90% | 80% | 70% | 50% | 10% |
By cores | 50% | 45% | 40% | 35% | 25% | 5% |
GPU base nodes
Type of request \ Duration | 3 h or less | 3-12 h | 12-24 h | 1-3 d | 3-7 d | 7-28 d |
---|---|---|---|---|---|---|
By nodes | 100% | 90% | 80% | 70% | 50% | 10% |
By GPU | 50% | 45% | 40% | 35% | 25% | 5% |
Large GPU nodes
Type of request \ Duration | 3 h or less | 3-12 h | 12-24 h | 1-3 d | 3-7 d | 7-28 d |
---|---|---|---|---|---|---|
By nodes | 100% | 90% | 80% | 70% | 50% | 10% |
By GPU | 50% | 45% | 40% | 35% | 25% | 5% |