<languages />
= Report an issue =
* Please report issues to [mailto:support@computecanada.ca support@computecanada.ca].
= Shared issues =
* The status page at http://status.computecanada.ca/ is not updated automatically yet, so may lag in showing current status.
* Utilization accounting is not currently being forwarded to the [https://ccdb.computecanada.ca Compute Canada database].
== Scheduler problems ==
* The CC Slurm configuration encourages whole-node jobs. When appropriate, users should request whole-node rather than per-core resources. See [[Job_scheduling_policies#Whole_nodes_versus_cores|Job Scheduling - Whole Node Scheduling]].
* CPU and GPU backfill partitions have been created on the Cedar and Graham clusters. If a job is submitted with a runtime of less than 24 hours, it is automatically entered into the cluster-wide backfill partition. This partition has low priority, but allows increased utilization of the cluster by serial jobs.
* Slurm epilog does not fully clean up processes from ended jobs, especially if the job did not exit normally.
** This has been greatly improved by the addition of the epilog.clean script, but nodes are still occasionally marked down for epilog failure.
* By default, the job receives environment settings from the submitting shell. This can lead to irreproducible results if it's not what you expect. To force the job to run with a fresh, login-like environment, submit with <tt>--export=none</tt> or add <tt>#SBATCH --export=NONE</tt> to your job script, as in the example below.
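A minimal submission script illustrating these points is sketched below. The account name, executable, core count, memory and time values are placeholders only and should be adapted to your allocation and to the node types of the cluster you are using.
<pre>
#!/bin/bash
# Hypothetical example script; all values below are placeholders.
#SBATCH --account=def-someuser      # replace with your accounting group
#SBATCH --nodes=1                   # request a whole node where appropriate
#SBATCH --ntasks-per-node=32        # core count depends on the node type
#SBATCH --mem=0                     # request all memory on the node
#SBATCH --time=0-23:00              # under 24 hours, so eligible for the backfill partition
#SBATCH --export=NONE               # start from a fresh, login-like environment

srun ./my_program                   # placeholder executable
</pre>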
== Quota and filesystem problems ==
=== Quota errors on /project filesystem ===
Users will sometimes see a quota error on their project folders. This may happen when files are owned by a group other than the project group. You can change the group that owns the files with the command
{{Command|chgrp -R <group> <folder>}}
To see what the value of <group> should be, run the following command:
{{Command|stat -c %G $HOME/projects/*/}}
Only the owner of the files can run the <tt>chgrp</tt> command. To ask us to correct the group owner for many users, write to [mailto:support@computecanada.ca support@computecanada.ca].
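To locate files that are still owned by the wrong group, a <tt>find</tt> invocation such as the one below can be used; the project path and group name are placeholders and should be replaced with the group reported by the <tt>stat</tt> command above.
<pre>
# Hypothetical example: list files under a project folder that are not owned by the project group.
# Replace def-someuser with your actual project group and path.
find $HOME/projects/def-someuser/ ! -group def-someuser -ls
</pre>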
=== Nearline ===
* Nearline capabilities are not yet available; see https://docs.computecanada.ca/wiki/National_Data_Cyberinfrastructure for a brief description of the intended functionality.
** July 17 update: still not working. If you need your nearline RAC2017 quota, please ask [mailto:support@computecanada.ca CC support].
=== Missing symbolic links to project folders ===
* Upon login to the new clusters, symbolic links are not always created in the user's account, as described in [[Project layout]]. If this is the case, please verify on [https://ccdb.computecanada.ca/services/resources this page] that your access to the cluster is enabled.
= Cedar only =
* Slurm operations will occasionally time out with a message such as ''Socket timed out on send/recv operation'' or ''Unable to contact slurm controller (connect failure)''. As a temporary workaround, resubmit your jobs/commands; they should go through after a few seconds (see the retry sketch after this list).
** Should be resolved after a VHD migration to a new backend for slurmctld.
* Some users see the message ''error: Job submit/allocate failed: Invalid account or account/partition combination specified''.
** They need to specify --account=<accounting group>; see the example after this list.
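For the timeout issue, a simple retry loop like the following can serve as a stopgap; <tt>myjob.sh</tt> is a placeholder for your own submission script.
<pre>
# Hypothetical retry wrapper: keep resubmitting until sbatch succeeds,
# assuming the failures are transient slurmctld timeouts.
until sbatch myjob.sh; do
    echo "sbatch failed (possibly a Slurm timeout); retrying in 10 seconds..."
    sleep 10
done
</pre>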
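To see which accounting groups your user can submit against, a query such as the one below should work, assuming Slurm accounting is configured in the usual way on these clusters; <tt>def-someuser</tt> is a placeholder for one of the account names returned.
<pre>
# List the Slurm accounts associated with your user.
sacctmgr show associations user=$USER format=Account%30 --noheader

# Then submit against one of them, for example:
sbatch --account=def-someuser myjob.sh
</pre>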
= Graham only =
* /home is on an NFS appliance that does not support ACLs, so setfacl/getfacl doesn't work there.
** Workaround: use the /project or /scratch filesystems instead (see the setfacl sketch at the end of this section).
** Might be resolved by an update or reconfiguration.
* Compute nodes cannot access the Internet.
** Solution: request an exception by writing to [mailto:support@computecanada.ca support@computecanada.ca], describing what you need to access and why.
* Crontab is not offered on Graham. When attempting to add a new item, an error occurs during saving:
<pre>
[rozmanov@gra-login1 ~]$ crontab -e
crontab: edits left in /tmp/crontab.u0ljzU
</pre>
Since crontab works on Cedar, there is presumably some common approach for CC systems.
Clearly, the main issue is how to handle users' crontabs on multiple login nodes; it is not clear, however, whether this should be implemented.
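As a sketch of the ACL workaround mentioned above, ACLs can be set on /project or /scratch in the usual way; the path and username below are placeholders.
<pre>
# Hypothetical example: grant a collaborator read access to a shared directory on /project.
# Replace the path and username with real values.
setfacl -R -m u:someuser:rX /project/def-sponsor/shared_data
getfacl /project/def-sponsor/shared_data    # verify the resulting ACL
</pre>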
= Other issues =
# Modules don't work for shells other than bash (sh) and tcsh.
#* Workaround (this appears to work but has not been tested extensively; see the sketch after this list):
#** source $LMOD_PKG/init/zsh
#** source $LMOD_PKG/init/ksh
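A minimal sketch of how this could be wired into a shell startup file, assuming Lmod is installed system-wide and <tt>LMOD_PKG</tt> is set by the system profile:
<pre>
# Hypothetical addition to ~/.zshrc (use init/ksh in ~/.kshrc for ksh).
# Only source the Lmod init file if it is actually present.
if [ -n "$LMOD_PKG" ] && [ -f "$LMOD_PKG/init/zsh" ]; then
    source "$LMOD_PKG/init/zsh"
fi
</pre>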