Job scheduling policies: Difference between revisions

Revision as of 19:12, 17 July 2017


This article is a draft

This is not a complete article: This is a draft, a work in progress that is intended to be published into an article, which may or may not be ready for inclusion in the main wiki. It should not necessarily be considered factual or authoritative.




Parent page: Running jobs

You can do a great deal of work on Cedar or Graham by submitting jobs that specify only the number of cores, the associated memory, and a run-time limit. However, if you submit large numbers of jobs, or jobs that require large amounts of resources, you may be able to improve your productivity by understanding the policies that affect job scheduling.

Priority

The order in which jobs are considered for scheduling is determined by priority. Priority is in turn determined principally by Resource Allocation Competition (RAC) grants. Usage greater than the project's RAC share will temporarily decrease the priority of jobs belonging to that project; usage less than the project's RAC share will temporarily increase it.

Whole nodes versus cores

Parallel calculations that can efficiently use 32 or more cores may benefit from being scheduled on whole nodes. Some of the nodes in each cluster are reserved for jobs that request one or more entire nodes. The nodes in Cedar and Graham have 32 cores each (except for Cedar's GPU nodes, which have 24 conventional cores each). Therefore, parallel work requiring a multiple of 32 cores should request N whole nodes as follows:

--nodes=N
--ntasks-per-node=32
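
As an illustration, these options might appear in a job script as #SBATCH directives. This is a minimal sketch, not a recommended configuration: the node count (N = 2), the 3-hour time limit, and the program name my_mpi_program are assumptions made for the example.

```shell
#!/bin/bash
#SBATCH --nodes=2                # N = 2 whole nodes (illustrative value)
#SBATCH --ntasks-per-node=32     # one task per core on a 32-core node
#SBATCH --time=03:00:00          # illustrative run-time limit (3 hours)

# Hypothetical MPI executable; replace with your own program.
# srun starts one copy per task, 2 x 32 = 64 tasks in total.
srun ./my_mpi_program
```

A script like this would be submitted with sbatch, for example `sbatch job.sh`, where job.sh is whatever name you gave the file.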

If you have a large amount of serial work and can efficiently use GNU Parallel or other techniques to pack serial processes onto a single node, you may similarly request whole nodes with --nodes.
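
One way to pack serial work onto a whole node is sketched below. It assumes that the parallel command (GNU Parallel) is available in the job environment; the program name my_serial_program, the input_*.dat file names, and the time limit are hypothetical.

```shell
#!/bin/bash
#SBATCH --nodes=1                # one whole node
#SBATCH --ntasks-per-node=32     # all 32 cores of the node
#SBATCH --time=03:00:00          # illustrative run-time limit (3 hours)

# Keep up to 32 serial processes running at once, one per core.
# GNU Parallel replaces {} with each file name listed after ":::".
# my_serial_program and input_*.dat are hypothetical names.
parallel -j 32 ./my_serial_program {} ::: input_*.dat
```

With -j 32, GNU Parallel starts a new process as soon as an earlier one finishes, so the node stays busy even when the individual runs take different amounts of time.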

Note that requesting an inefficient number of processors for a calculation simply to benefit from whole-node scheduling may be construed as abuse of the system. For example, a program that takes just as long to run on 32 cores as on 16 cores should request --ntasks=16, not --nodes=1 --ntasks=32.
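
A quick way to judge whether a larger request is worthwhile is to compare timings from two test runs. The sketch below is only an illustration: relative_efficiency is a hypothetical helper, and the 600-second timings are made-up numbers, not measurements.

```shell
#!/bin/bash
# Relative parallel efficiency of a larger run against a smaller reference run:
#   efficiency = (t_ref * n_ref) / (t_new * n_new)
# Arguments: reference run time, reference cores, new run time, new cores.
# A value near 1.00 means the extra cores are well used; a value near 0.50
# means doubling the cores gained nothing.
relative_efficiency() {
    awk -v t_ref="$1" -v n_ref="$2" -v t_new="$3" -v n_new="$4" \
        'BEGIN { printf "%.2f\n", (t_ref * n_ref) / (t_new * n_new) }'
}

# The case from the text: the same run time on 16 cores and on 32 cores.
relative_efficiency 600 16 600 32    # prints 0.50
```

An efficiency of 0.50 here says that half of the 32 cores would be wasted, which is why such a program should request 16.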

Time limits

Cedar and Graham will accept jobs with run-times of up to 28 days. However, jobs of that length are restricted to a small fraction of the cluster (approximately 10%, though this fraction is subject to change without notice).

There are several partitions for jobs of progressively shorter run-times. Currently, there are partitions for jobs of

  • 3 hours or less,
  • 12 hours or less,
  • 24 hours (1 day) or less,
  • 72 hours (3 days) or less,
  • 7 days or less, and
  • 28 days or less.

Because a job of 3 hours or less also fits within the 12-hour, 24-hour, and longer limits, shorter jobs can always run in partitions with longer time limits. A shorter job will therefore have more scheduling opportunities than an otherwise-identical longer job.
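
The nesting of time limits can be sketched as follows. The limits are those listed above, converted to hours; the function name eligible_limits is hypothetical, and the sketch takes whole hours only. It says nothing about how the partitions are named or configured on the clusters.

```shell
#!/bin/bash
# Run-time limits, in hours, of the partitions listed above:
# 3 h, 12 h, 24 h, 72 h, 7 days (168 h), 28 days (672 h).
PARTITION_LIMITS_HOURS=(3 12 24 72 168 672)

# Print every listed limit that can accommodate a job of the given
# length (a whole number of hours).
eligible_limits() {
    local job_hours=$1
    local limit
    local eligible=()
    for limit in "${PARTITION_LIMITS_HOURS[@]}"; do
        if (( job_hours <= limit )); then
            eligible+=("$limit")
        fi
    done
    echo "${eligible[*]}"
}

eligible_limits 3     # prints: 3 12 24 72 168 672
eligible_limits 48    # prints: 72 168 672
```

A 3-hour job qualifies under all six limits, while a 48-hour job qualifies under only three, which is why the shorter request has more scheduling opportunities.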

Backfilling

The scheduler employs backfilling to improve overall system usage.

Without backfill scheduling, each partition is scheduled strictly in priority order, which typically results in significantly lower system utilization and responsiveness than otherwise possible. Backfill scheduling will start lower priority jobs if doing so does not delay the expected start time of any higher priority jobs. Since the expected start time of pending jobs depends upon the expected completion time of running jobs, reasonably accurate time limits are important for backfill scheduling to work well.

Backfilling will primarily benefit jobs with short time limits, e.g. under 3 hours.
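
The backfill condition can be pictured with a deliberately simplified test. This is an illustration of the idea only, not the scheduler's actual algorithm: it considers a single set of idle resources, and the function name can_backfill and the numbers are assumptions.

```shell
#!/bin/bash
# Simplified backfill test: a lower-priority job may start now on idle
# resources if its time limit ends no later than the expected start of
# the higher-priority job waiting for those resources.
# Arguments, both in minutes:
#   $1  time limit of the candidate job
#   $2  minutes until the higher-priority job is expected to start
can_backfill() {
    local job_limit=$1
    local gap=$2
    (( job_limit <= gap ))
}

# Illustrative numbers: the resources stay idle for the next 200 minutes.
if can_backfill 180 200; then echo "3-hour job: can backfill"; fi
if ! can_backfill 720 200; then echo "12-hour job: must wait"; fi
```

The test uses the requested time limit, not the time the job will actually take, so a job that needs 2 hours but requests 12 would be passed over in this example.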

Preemption

You can access more resources if your application can be checkpointed, stopped, and restarted efficiently.

TODO: Instructions on submitting a preemptible job