Frequently Asked Questions
Disk quota exceeded error on /project filesystems
Some users have seen this message, or a similar quota error, on their project folders. Other users have reported obscure failures while transferring files into their /project folder from another cluster. Many of the problems reported are due to incorrect group ownership of files.
Use diskusage_report to see whether you are at or over your quota:
[ymartin@cedar5 ~]$ diskusage_report
                          Description                Space          # of files
                  Home (user ymartin)             345M/50G           9518/500k
               Scratch (user ymartin)              93M/20T          6532/1000k
              Project (group ymartin)          5472k/2048k           158/5000k
         Project (group def-zrichard)            20k/1000G             4/5000k
The example above illustrates a frequent problem: /project for user ymartin contains too much data in files belonging to group ymartin. The data should instead be in files belonging to group def-zrichard.
Note the two lines labelled Project.
- Project (group ymartin) describes files belonging to group ymartin, which has the same name as the user. This user is the only member of this group, which has a very small quota (2048k).
- Project (group def-zrichard) describes files belonging to a project group. Your account may be associated with one or more project groups, and they will typically have names like def-zrichard, rrg-someprof-ab, or rpp-someprof.
In this example, files have somehow been created belonging to group ymartin instead of group def-zrichard. This is neither the desired nor the expected behaviour.
By design, new files and directories in /project will normally be created belonging to a project group. The two main reasons why files may be associated with the wrong group are that:
- files were moved from /home to /project with the mv command; to avoid this, use cp instead (see the sketch after this list);
- files were transferred from another cluster using rsync or scp with an option that preserves the original group ownership. If you have a recurring problem with ownership, check the options you are using with your file transfer program.
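For example, a minimal sketch of copying (rather than moving) a directory from /home into /project; the directory name my_results and the group def-zrichard are illustrative:
[name@server ~]$ cp -r ~/my_results ~/projects/def-zrichard/$USER/   # new copies take on the project group
[name@server ~]$ rm -r ~/my_results                                  # remove the original only after verifying the copy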
For rsync you can use the following command to transfer a directory from a remote location to your project directory:
$ rsync -axvpH --no-g --no-p remote_user@remote.system:remote/dir/path $HOME/project/$USER/
You can also compress the data to get a better transfer rate.
$ rsync -axvpH --no-g --no-p --compress-level=5 remote_user@remote.system:remote/dir/path $HOME/project/$USER/
To see the project groups you may use, run the following command:
[name@server ~]$ stat -c %G $HOME/projects/*/
If you are the owner of the files, you can run the chgrp command to change their group ownership to the appropriate project group. To ask us to change the group owner for several users, contact technical support.
You can also use the command chmod g+s <directory name> to ensure that files created in that directory will inherit the directory's group membership.
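A minimal sketch of both steps, assuming the project group is def-zrichard and the directory is ~/projects/def-zrichard/$USER/my_data (both illustrative):
[name@server ~]$ chgrp -R def-zrichard ~/projects/def-zrichard/$USER/my_data   # reassign existing files to the project group
[name@server ~]$ chmod g+s ~/projects/def-zrichard/$USER/my_data               # new files will inherit the directory's group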
Finding files with the wrong group ownership
You may find it difficult to identify the files that are contributing to an over-quota condition in /project. The lfs find command can be used in conjunction with readlink to solve this:
[name@server ~]$ lfs find $(readlink $HOME/projects/*) -group $USER
This will identify files belonging to the user's unique group, e.g. ymartin in the example shown earlier.
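If you own the files identified this way, you could reassign them to the project group in a second step; the group name def-zrichard and the list file wrong_group.txt below are illustrative, and you should review the list before acting on it:
[name@server ~]$ lfs find $(readlink $HOME/projects/*) -group $USER > wrong_group.txt   # save the list of mis-owned files
[name@server ~]$ xargs chgrp def-zrichard < wrong_group.txt                             # reassign them to the project group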
See Project layout for further explanations.
sbatch: error: Batch job submission failed: Socket timed out on send/recv operation
You may see this message when the load on the Slurm manager or scheduler process is too high. We are working both to improve Slurm's tolerance of that and to identify and eliminate the sources of load spikes, but that is a long-term project. The best advice we have currently is to wait a minute or so, then run squeue -u $USER and see if the job you were trying to submit appears: in some cases the error message is delivered even though the job was accepted by Slurm. If it doesn't appear, simply submit it again.
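In practice this amounts to something like the following, where my_job.sh is an illustrative job script:
[name@server ~]$ sleep 60            # give the scheduler a moment to recover
[name@server ~]$ squeue -u $USER     # check whether the job was accepted despite the error
[name@server ~]$ sbatch my_job.sh    # resubmit only if the job is not listed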
slurmstepd: error: Exceeded step memory limit at some point
This and the similar message, "slurmstepd: error: Exceeded job memory limit at some point" are potentially misleading. In some, but not all, cases it signifies a harmless condition. If your job otherwise appears to have terminated normally, that is, if all expected output is present, then you should ignore these messages. Do not increase your memory requests simply to suppress these messages!
If your job was actually killed for exceeding the requested memory, the key word "Killed" should appear in the standard error output of the job.
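For example, assuming the job's output and error were written to the default slurm-<jobid>.out file, you could check with:
[name@server ~]$ grep -i killed slurm-<jobid>.out   # a match suggests the job ran out of memory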
However, if you are using job dependencies (dependency=afterok:<jobid>), then either of the messages "Exceeded job memory limit" or "Exceeded step memory limit" probably means that the dependent job was cancelled. We are in discussion with the Slurm development team about fixing this behaviour, as well as suppressing the misleading messages in non-fatal circumstances.