Frequently Asked Questions

Forgot my password

To reset your password for any Compute Canada national cluster, visit https://ccdb.computecanada.ca/security/forgot.

Disk quota exceeded error on /project filesystems

Some users have seen this message or some similar quota error on their project folders. Other users have reported obscure failures while transferring files into their /project folder from another cluster. Many of the problems reported are due to bad file ownership.

Use diskusage_report to see if you are at or over your quota:

[ymartin@cedar5 ~]$ diskusage_report
                             Description                Space           # of files
                     Home (user ymartin)             345M/50G            9518/500k
                  Scratch (user ymartin)              93M/20T           6532/1000k
                 Project (group ymartin)          5472k/2048k            158/5000k
            Project (group/def-zrichard)            20k/1000G              4/5000k

The example above illustrates a frequent problem: /project for user ymartin contains too much data in files belonging to group ymartin. The data should instead be in files belonging to def-zrichard.

Note the two lines labelled Project.

  • Project (group ymartin) describes files belonging to group ymartin, which has the same name as the user. This user is the only member of this group, which has a very small quota (2048k).
  • Project (group def-zrichard) describes files belonging to a project group. Your account may be associated with one or more project groups, and they will typically have names like def-zrichard, rrg-someprof-ab, or rpp-someprof.

In this example, files have somehow been created belonging to group ymartin instead of group def-zrichard. This is neither the desired nor the expected behaviour.

By design, new files and directories in /project will normally be created belonging to a project group. The two main reasons why files may be associated with the wrong group are that

  • files were moved from /home to /project with the mv command; to avoid this, use cp instead;
  • files were transferred from another cluster using rsync or scp with an option to preserve the original group ownership. If you have a recurring problem with ownership, check the options you are using with your file transfer program.

For rsync you can use the following command to transfer a directory from a remote location to your project directory:

$ rsync -axvpH --no-g --no-p  remote_user@remote.system:remote/dir/path $HOME/project/$USER/

You can also compress the data to get a better transfer rate.

$ rsync -axvpH --no-g --no-p  --compress-level=5 remote_user@remote.system:remote/dir/path $HOME/project/$USER/

To see the project groups you may use, run the following command:

[name@server ~]$ stat -c %G $HOME/projects/*/

If you are the owner of the files, you can run the chgrp command to change their group ownership to the appropriate project group. To ask us to change the group owner for several users, contact technical support. You can also use the command chmod g+s <directory name> to ensure that files created in that directory will inherit the directory's group membership.
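
For example, here is a minimal sketch of repairing an existing directory; the group def-zrichard and the path ~/projects/def-zrichard/data are placeholders, so substitute your own project group and directory:

[name@server ~]$ chgrp -R def-zrichard ~/projects/def-zrichard/data
[name@server ~]$ find ~/projects/def-zrichard/data -type d -exec chmod g+s {} +

The first command hands every file in the tree to the project group; the second sets the setgid bit on each directory so that files created there later inherit the group automatically.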

Finding files with the wrong group ownership

You may find it difficult to identify the files that are contributing to an over-quota condition in /project. The lfs find command can be used in conjunction with readlink to locate them:

[name@server ~]$ lfs find $(readlink $HOME/projects/*) -group $USER

This will identify files belonging to the user's unique group, e.g. ymartin in the example shown earlier. If the output of diskusage_report indicates that a different group is over quota, use that group name instead of $USER.
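
If you own the offending files, you can feed the same listing to chgrp to repair them in bulk. The following is a sketch only: it assumes def-zrichard is your project group (as in the example above) and that none of the file names contain newlines:

[name@server ~]$ lfs find $(readlink $HOME/projects/*) -group $USER | xargs -d '\n' chgrp def-zrichard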

See Project layout for further explanations.

sbatch: error: Batch job submission failed: Socket timed out on send/recv operation

You may see this message when the load on the Slurm manager or scheduler process is too high. We are working both to improve Slurm's tolerance of that and to identify and eliminate the sources of load spikes, but that is a long-term project. The best advice we have currently is to wait a minute or so. Then run squeue -u $USER and see if the job you were trying to submit appears: in some cases the error message is delivered even though the job was accepted by Slurm. If it doesn't appear, simply submit it again.
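
A rough sketch of that advice in script form is shown below; the script name my_job.sh and the job name myjob are placeholders, not part of the original instructions:

# Resubmit only if the first attempt failed AND the job did not actually get accepted.
sbatch --job-name=myjob my_job.sh || {
    sleep 60                                   # give the scheduler time to recover
    squeue -u $USER --name=myjob | grep -q myjob || sbatch --job-name=myjob my_job.sh
}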

Why are my jobs taking so long to start?

You can see why your jobs are in the PD (pending) state by running the squeue -u <username> command on the cluster.
The (REASON) column typically has the values Resources or Priority.

  • Resources: The cluster is simply very busy and you will have to be patient, or perhaps consider whether you can submit a job that asks for fewer resources (e.g. nodes, memory, time).
  • Priority: Your job is waiting to start due to its lower priority. This is because you and other members of your research group have been over-consuming your fair share of the cluster resources in the recent past, something you can track using the command sshare as explained in Job scheduling policies.
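
If you have many jobs in the queue, you can restrict squeue to the pending ones and print their reason codes explicitly. This is a sketch using standard Slurm format specifiers; the default column layout on your cluster may differ:

[name@server ~]$ squeue -u $USER -t PENDING -o "%.12i %.10j %.4t %.20r"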

The LevelFS column gives you information about your over- or under-consumption of cluster resources: when LevelFS is greater than one you are consuming less than your fair share, while if it is less than one you are consuming more. The more you over-consume resources, the closer the value gets to zero and the lower your jobs' priority becomes. There is a memory effect to this calculation, so the scheduler gradually "forgets" any over- or under-consumption from months past. Finally, note that the value of LevelFS is specific to each cluster.
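
To check your current standing, one option is the long listing of sshare, which includes the LevelFS column on clusters that use the fair-tree scheduling algorithm (the exact set of columns depends on the Slurm version):

[name@server ~]$ sshare -l -u $USER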