Job scheduling policies




This article is a draft

This is not a complete article: it is a draft, a work in progress intended to become a full article, and it may or may not be ready for inclusion in the main wiki. It should not necessarily be considered factual or authoritative.




Parent page: Running jobs

You can do much work on Cedar or Graham by submitting jobs that specify only the number of cores, the associated memory, and a run-time limit. However if you submit large numbers of jobs, or jobs that require large amounts of resources, you may be able to improve your productivity by understanding the policies affecting job scheduling.
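For example, a minimal job script along these lines would be enough for such work (the account name, resource values, and program name below are placeholders, not taken from this page; substitute your own):

#!/bin/bash
#SBATCH --account=def-someuser   # your default allocation account
#SBATCH --ntasks=1               # number of cores
#SBATCH --mem-per-cpu=2G         # memory per core
#SBATCH --time=0-03:00           # run-time limit (D-HH:MM)
./my_program                     # your own executable

The sections below explain how the scheduler decides when such a job actually starts.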

Priority

The order in which jobs are considered for scheduling is determined by priority. Priority is in turn principally determined by Resource Allocation Competition grants. Usage greater than the project's RAC share will temporarily decrease the priority for jobs belonging to that project; usage less than the project's RAC share will temporarily increase the priority.
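If you want to see how these factors combine for a particular pending job, Slurm's sprio command reports the components that make up its priority (the job ID below is a placeholder, and the exact columns shown depend on how the cluster is configured):

[name@server]$ sprio -j 1234567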

Why won't my jobs run?

A number of factors determine your job's priority, and thus how soon your job will run.

Are you submitting your jobs to a RAC allocation (e.g. running sbatch with the --account option and specifying an account beginning with rrg or rpp)? Only about 10% of each of the clusters is reserved for non-RAC jobs. The priority of jobs in the non-RAC allocation is adjusted so that, on average, only about 10% of the cluster is used for these jobs. If fewer non-RAC jobs are run, non-RAC job priorities go up; conversely, if more non-RAC jobs are run, their priorities go down.
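As a sketch, the only difference at submission time is the account name passed to sbatch (the account names and script name below are illustrative placeholders):

[name@server]$ sbatch --account=rrg-someprof job_script.sh   # RAC (rrg/rpp) allocation
[name@server]$ sbatch --account=def-someprof job_script.sh   # default, non-RAC allocation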

To measure how close an account is to using its fair share, SLURM defines a fair-share factor, F=2^(-U/S), where U is the account's usage normalized to the total usage of the cluster, with a half-life decay of 14 days, and S is the account's share normalized to the total number of shares. Taking the non-RAC allocation of 10% as an example, S=0.1. If the jobs submitted to that allocation in the recent past have used nearly 10% of the cluster, then U=0.1 and it follows that F = 2^(-0.1/0.1) = 2^(-1) = 0.5. Thus if your usage matches your share, your fair-share factor will be 0.5; if your usage is above your share, F will be less than 0.5, and if your usage is below your share, F will be greater than 0.5.
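As a quick sanity check of the formula, you can evaluate F for hypothetical values of U and S on the command line (the numbers below are made up for illustration only):

[name@server]$ awk 'BEGIN { U = 0.15; S = 0.10; printf "F = %.3f\n", 2^(-U/S) }'
F = 0.354

A result below 0.5, as here, means the account has recently used more than its share, so its pending jobs will tend to wait longer.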

You can see the fair-share factor, FairShare, for all accounts on a given cluster by running the sshare command:

[name@server]$ sshare
             Account       User  RawShares  NormShares    RawUsage  EffectvUsage  FairShare
-------------------- ---------- ---------- ----------- ----------- ------------- ----------
root                                          1.000000 84241308075674887      1.000000   0.500000
 no_rac_cpu                           3103    0.124374 80949587847079770      0.960925   0.004723
  ras_basic_cpu                       3103    0.124334 80949587847079770      0.960925   0.004715
   cc-debug_cpu                          1    0.000031      179535      0.000239   0.004715
    cc-debug_cpu           name          1    0.000000          16      0.000001   0.004715
   def-a28wong_cpu                       1    0.000031           0      0.000239   0.004715
...


Whole nodes versus cores

Parallel calculations which can efficiently use 32 or more cores may benefit from being scheduled on whole nodes. Some of the nodes in each cluster are reserved for jobs which request one or more entire nodes. The nodes in Cedar and Graham have 32 cores each (except for Cedar's GPU nodes, which have 24 conventional cores each). Therefore parallel work requiring multiples of 32 cores should request

--nodes=N
--ntasks-per-node=32
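A whole-node job script might therefore look something like this sketch (the account, node count, time limit, and program are placeholders; --mem=0 asks Slurm for all of the memory on each node):

#!/bin/bash
#SBATCH --account=def-someuser     # your account
#SBATCH --nodes=2                  # number of whole nodes
#SBATCH --ntasks-per-node=32       # use all 32 cores on each node
#SBATCH --mem=0                    # request all of the memory on each node
#SBATCH --time=0-12:00             # run-time limit (D-HH:MM)
srun ./my_mpi_program              # your own MPI executable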

If you have huge amounts of serial work and can efficiently use GNU Parallel or other techniques to pack serial processes onto a single node, you may similarly use --nodes.
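One possible sketch of that approach uses GNU Parallel to keep all 32 cores of a node busy with serial tasks (the account, time limit, program, and input file are placeholders):

#!/bin/bash
#SBATCH --account=def-someuser     # your account
#SBATCH --nodes=1                  # one whole node
#SBATCH --ntasks-per-node=32
#SBATCH --time=0-03:00
# Run one serial task per core; cases.txt contains one argument per line.
parallel -j 32 ./serial_program {} :::: cases.txt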

Note that requesting an inefficient number of processors for a calculation simply in order to take advantage of any whole-node scheduling advantage will be construed as abuse of the system. For example, a program which takes just as long to run on 32 cores as on 16 cores should request --ntasks=16, not --nodes=1 --ntasks=32. Similarly, using whole nodes commits the user to a specific amount of memory: submitting whole-node jobs that underutilize memory is as abusive as underutilizing cores.

Time limits

Cedar and Graham will accept jobs of up to 28 days in run-time. However, jobs of that length will be restricted to use only a small fraction of the cluster. (Approximately 10%, but this fraction is subject to change without notice.)

There are several partitions for jobs of shorter and shorter run-times. Currently there are partitions for jobs of

  • 3 hours or less,
  • 12 hours or less,
  • 24 hours (1 day) or less,
  • 72 hours (3 days) or less,
  • 7 days or less, and
  • 28 days or less.

Because any job of 3 hours is also less than 12 hours, 24 hours, and so on, shorter jobs can always run in partitions with longer time-limits. A shorter job will have more scheduling opportunities than an otherwise-identical longer job.
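The run-time limit itself is requested with the --time option, so a job submitted like this sketch (the value and script name are only examples) is short enough to be eligible for every partition listed above:

[name@server]$ sbatch --time=0-03:00 job_script.sh   # 3-hour limit (D-HH:MM)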

Backfilling

The scheduler employs backfilling to improve overall system usage.

Without backfill scheduling, each partition is scheduled strictly in priority order, which typically results in significantly lower system utilization and responsiveness than otherwise possible. Backfill scheduling will start lower priority jobs if doing so does not delay the expected start time of any higher priority jobs. Since the expected start time of pending jobs depends upon the expected completion time of running jobs, reasonably accurate time limits are important for backfill scheduling to work well.

Backfilling will primarily benefit jobs with short time limits, e.g. under 3 hours.
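Because backfill decisions hinge on the time limits of both running and pending jobs, it pays to request a realistic --time rather than a heavily padded one. You can also ask Slurm for its current estimate of when a pending job will start (the job ID below is a placeholder; the estimate changes as the workload changes):

[name@server]$ squeue --start -j 1234567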

Preemption

You can access more resources if your application can be checkpointed, stopped, and restarted efficiently.

TODO: Instructions on submitting a preemptible job