Translations:Frequently Asked Questions/20/en: Difference between revisions

From Alliance Doc
Jump to navigation Jump to search
(Importing a new version from external source)
(Importing a new version from external source)
 
(One intermediate revision by the same user not shown)
Line 2: Line 2:
You can see why your jobs are in the <tt>PD</tt> (pending) state by running the <tt>squeue -u <username></tt> command on the cluster.<br><br>
You can see why your jobs are in the <tt>PD</tt> (pending) state by running the <tt>squeue -u <username></tt> command on the cluster.<br><br>
The <tt>(REASON)</tt> column typically has the values <tt>Resources</tt> or <tt>Priority</tt>.
The <tt>(REASON)</tt> column typically has the values <tt>Resources</tt> or <tt>Priority</tt>.
* <tt>Resources</tt>ː The cluster is simply very busy and you will have to be patient or perhaps consider if you can submit a job that asks for fewer resources (e.g. nodes, memory, time).
* <tt>Resources</tt>ː The cluster is simply very busy and you will have to be patient or perhaps consider if you can submit a job that asks for fewer resources (e.g. CPUs/nodes, GPUs, memory, time).
*  <tt>Priority</tt>ː Your job is waiting to start due to its lower priority. This is because you and other members of your research group have been over-consuming your fair share of the cluster resources in the recent past, something you can track using the command <tt>sshare</tt> as explained in [[Job scheduling policies]].
*  <tt>Priority</tt>ː Your job is waiting to start due to its lower priority. This is because you and other members of your research group have been over-consuming your fair share of the cluster resources in the recent past, something you can track using the command <tt>sshare</tt> as explained in [[Job scheduling policies]]. The <tt>LevelFS</tt> column gives you information about your over- or under-consumption of cluster resources: when <tt>LevelFS</tt> is greater than one, you are consuming fewer resources than your fair share, while if it is less than one you are consuming more. The more you overconsume resources, the closer the value gets to zero and the more your pending jobs decrease in priority. There is a memory effect to this calculation so the scheduler gradually "forgets" about any potential over- or under-consumption of resources from months past. Finally, note that the value of <tt>LevelFS</tt> is unique to the specific cluster.

Latest revision as of 19:45, 10 July 2018

Information about message (contribute)
This message has no documentation. If you know where or how this message is used, you can help other translators by adding documentation to this message.
Message definition (Frequently Asked Questions)
== Why are my jobs taking so long to start? ==
You can see why your jobs are in the <tt>PD</tt> (pending) state by running the <tt>squeue -u <username></tt> command on the cluster.<br><br>
The <tt>(REASON)</tt> column typically has the values <tt>Resources</tt> or <tt>Priority</tt>.
* <tt>Resources</tt>ː The cluster is simply very busy and you will have to be patient or perhaps consider if you can submit a job that asks for fewer resources (e.g. CPUs/nodes, GPUs, memory, time).
*  <tt>Priority</tt>ː Your job is waiting to start due to its lower priority. This is because you and other members of your research group have been over-consuming your fair share of the cluster resources in the recent past, something you can track using the command <tt>sshare</tt> as explained in [[Job scheduling policies]]. The <tt>LevelFS</tt> column gives you information about your over- or under-consumption of cluster resources: when <tt>LevelFS</tt> is greater than one, you are consuming fewer resources than your fair share, while if it is less than one you are consuming more. The more you overconsume resources, the closer the value gets to zero and the more your pending jobs decrease in priority. There is a memory effect to this calculation so the scheduler gradually "forgets" about any potential over- or under-consumption of resources from months past. Finally, note that the value of <tt>LevelFS</tt> is unique to the specific cluster.

Why are my jobs taking so long to start?

You can see why your jobs are in the PD (pending) state by running the squeue -u <username> command on the cluster.

The (REASON) column typically has the values Resources or Priority.

  • Resourcesː The cluster is simply very busy and you will have to be patient or perhaps consider if you can submit a job that asks for fewer resources (e.g. CPUs/nodes, GPUs, memory, time).
  • Priorityː Your job is waiting to start due to its lower priority. This is because you and other members of your research group have been over-consuming your fair share of the cluster resources in the recent past, something you can track using the command sshare as explained in Job scheduling policies. The LevelFS column gives you information about your over- or under-consumption of cluster resources: when LevelFS is greater than one, you are consuming fewer resources than your fair share, while if it is less than one you are consuming more. The more you overconsume resources, the closer the value gets to zero and the more your pending jobs decrease in priority. There is a memory effect to this calculation so the scheduler gradually "forgets" about any potential over- or under-consumption of resources from months past. Finally, note that the value of LevelFS is unique to the specific cluster.