Known issues/en: Difference between revisions

Revision as of 16:14, 17 August 2017

296 bytes removed, 7 years ago

Updating to match new version of source page

FuzzyBot

Bots

38,760

edits

@@ Line 1: / Line 1: @@
 <languages />
-= Intro =
+= Report an issue =
-* Please report issues to [mailto:support@computecanada.ca support@computecanada.ca]
+* Please report issues to [mailto:support@computecanada.ca support@computecanada.ca].
 = Shared issues =
 * The status page at http://status.computecanada.ca/ is not updated automatically yet, so may lag in showing current status.
-* Utilization accounting is not currently being forwarded to the [https://ccdb.computecanada.ca Compute Canada database]
+* Utilization accounting is not currently being forwarded to the [https://ccdb.computecanada.ca Compute Canada database].
 == Scheduler problems ==
-* The CC slurm configuration preferentially encourages whole-node jobs. Users should, if appropriate, request whole-node rather than per-core resources. See [[Job_scheduling_policies#Whole_nodes_versus_cores;|Job Scheduling - Whole Node Scheduling]] ([[User:Pjmann|Patrick Mann]] 20:15, 17 July 2017 (UTC))
+* The CC Slurm configuration encourages whole-node jobs. When appropriate, users should request whole-node rather than per-core resources. See [[Job_scheduling_policies#Whole_nodes_versus_cores;|Job Scheduling - Whole Node Scheduling]].
-* Cpu and Gpu backfill partitions have been created on both clusters. If a job is submitted with <24hr runtime, it will be automatically entered into the cluster-wide backfill partition. This partition has a low priority, but will allow increased utilization of the cluster by serial jobs. ([[User:Nathanw|Nathan Wielenga]])
+* CPU and GPU backfill partitions have been created on the Cedar and Graham clusters. If a job is submitted with <24hr runtime, it will be automatically entered into the cluster-wide backfill partition. This partition has low priority, but will allow increased utilization of the cluster by serial jobs.
-* SLURM epilog does not fully clean up processes from ended jobs, especially if the job did not exit normally.    ([[User:Gbnewby|Greg Newby]]) Fri Jul 14 19:32:48 UTC 2017)
+* Slurm epilog does not fully clean up processes from ended jobs, especially if the job did not exit normally.
-** This has been greatly improved after the addition of the epilog.clean script, but there are still nodes occasionally marked down for epilog failure. (NW)
+** This has been greatly improved after the addition of the epilog.clean script, but there are still nodes occasionally marked down for epilog failure.
-* By default, the job receives environment settings from the submitting shell. This can lead to irreproducible results if it's not what you expect. To force the job to run with a fresh-like-login environment, you can submit with <tt>--export=none</tt> or add <tt>#SBATCH --export=NONE</tt> to your job script.
+* By default, the job receives environment settings from the submitting shell. This can lead to irreproducible results if it's not what you expect. To force the job to run with a fresh, login-like environment, you can submit with <tt>--export=NONE</tt> or add <tt>#SBATCH --export=NONE</tt> to your job script.
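A minimal sketch of a job script using the <tt>--export=NONE</tt> directive mentioned above. The time limit, the variable name <tt>MY_TEST_VAR</tt> and the helper <tt>report_var</tt> are hypothetical, chosen only for illustration. If <tt>MY_TEST_VAR</tt> was exported only in the submitting shell, the job should report it as not set, unless your login scripts also define it.

```shell
#!/bin/bash
#SBATCH --time=00:10:00
#SBATCH --export=NONE   # do not inherit the submitting shell's environment

# Hypothetical helper: report whether a variable is present in the
# environment of the running script.
report_var() {
    name="$1"
    # printenv exits with status 0 only if the variable is in the environment
    if printenv "$name" >/dev/null; then
        echo "$name is set"
    else
        echo "$name is not set"
    fi
}

report_var MY_TEST_VAR
```

Run directly with <tt>bash</tt>, the <tt>#SBATCH</tt> lines are plain comments, so the same script can be used to compare the interactive and batch environments.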
@@ Line 18: / Line 18: @@
 == Quota and filesystem problems ==
 === Quota errors on /project filesystem ===
-Sometimes, users will see quota error on their project folders. This may happen when files are owned by a group other than the project group. You can change the group which owns files using the command
+Users will sometimes see a quota error on their project folders. This may happen when files are owned by a group other than the project group. You can change the group which owns files using the command
-{{Command|chgrp -R <group> <folder>}}
+{{Command|chgrp -R <group> <folder>}}.
-To see what <group> should be, run the following command :
+To see what the value of <group> should be, run the following command:
 {{Command|stat -c %G $HOME/projects/*/}}
-Only the owner of the files can run the <tt>chgrp</tt> command. To ask us to correct the group owner for many users, write to support@computecanada.ca
+Only the owner of the files can run the <tt>chgrp</tt> command. To ask us to correct the group owner for many users, write to support@computecanada.ca.
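Where only some files carry the wrong group, a narrower variant of the <tt>chgrp</tt> command above can be sketched as follows. The helper name <tt>fix_group</tt> is hypothetical, and the sketch assumes that you own the files and are a member of the target group.

```shell
#!/bin/bash
# Hypothetical helper: change the group only on entries under a folder
# whose group differs from the expected one.
fix_group() {
    group="$1"
    folder="$2"
    # "! -group" selects entries not owned by $group;
    # "-exec ... {} +" passes them to chgrp in batches.
    find "$folder" ! -group "$group" -exec chgrp "$group" {} +
}

# Usage (placeholders as in the command above):
#   fix_group <group> <folder>
```

Unlike <tt>chgrp -R</tt>, this touches only the entries that need changing, which may be faster on large folders.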
 === Nearline ===
-* "Nearline" capabilities are not yet available (see https://docs.computecanada.ca/wiki/National_Data_Cyberinfrastructure for a brief description of the intended functionality)
+* Nearline capabilities are not yet available; see https://docs.computecanada.ca/wiki/National_Data_Cyberinfrastructure for a brief description of the intended functionality.
-** Update July 17: still not working. If you need your nearline RAC2017 quota then please ask [mailto:support@computecanada.ca CC support]. ([[User:Pjmann|Patrick Mann]] 20:45, 17 July 2017 (UTC))
+** July 17 update: still not working. If you need your nearline RAC2017 quota, please ask [mailto:support@computecanada.ca CC support].
 === Missing symbolic links to project folders ===
-* Upon login to the new clusters, symbolic links are supposed to be created in the user's account, as described in [[Project layout]]. Sometimes, it does not happen. If this is the case, please verify that your access to the cluster is enabled on this page [https://ccdb.computecanada.ca/services/resources https://ccdb.computecanada.ca/services/resources]
+* Upon login to the new clusters, the symbolic links described in [[Project layout]] are not always created in the user's account. If this is the case, please verify that your access to the cluster is enabled on this page [https://ccdb.computecanada.ca/services/resources https://ccdb.computecanada.ca/services/resources].
 = Cedar only =
-* SLURM operations will occasionally time out with a message like "Socket timed out on send/recv operation" or "Unable to contact slurm controller (connect failure)". As a temporary workaround, attempt to resubmit your jobs/commands, they should go through in a few seconds. ([[User:Nathanw|Nathan Wielenga]]) 08:50, 18 July 2017 (MDT))
+* Slurm operations will occasionally time out with a message such as ''Socket timed out on send/recv operation'' or ''Unable to contact slurm controller (connect failure)''. As a temporary workaround, attempt to resubmit your jobs/commands; they should go through in a few seconds.
-** Should be resolved after a VHD migration to a new backend for slurmctl. (NW)
+** Should be resolved after a VHD migration to a new backend for slurmctl.
-*Some people are getting an error "error: Job submit/allocate failed: Invalid account or account/partition combination specified"
+*Some people are getting the message ''error: Job submit/allocate failed: Invalid account or account/partition combination specified''.
-**They need to specify '--account=<accounting group>'
+**They need to specify --account=<accounting group>.
 = Graham only =
 * /home is on an NFS appliance that does not support ACLs, so setfacl/getfacl doesn't work there.
-** Workaround: use the /project or /scratch filesystems instead
+** Workaround: use the /project or /scratch filesystems instead.
-** We're finding out whether this can be fixed through an update or reconfiguration.
+** Might be resolved by an update or reconfiguration.
 * Compute nodes cannot access the Internet.
-** Solution: Request exceptions to be made at support@computecanada.ca   Describe what you need to access and why.
+** Solution: Request an exception by writing to support@computecanada.ca, describing what you need to access and why.
-* Crontab is not offered on Graham. When attempting adding a new item there is an error during saving:
+* Crontab is not offered on Graham. Attempting to add a new item produces an error on saving:
 <pre>
 [rozmanov@gra-login1 ~]$ crontab -e
@@ Line 54: / Line 54: @@
 crontab: edits left in /tmp/crontab.u0ljzU
 </pre>
-Crontab does work on Cedar. So, there must be some kind of a common approach on CC system.
+Crontab does work on Cedar, so there should be a common approach across CC systems.
-Clearly, the main issue is how to handle user's crontabs on multiple login nodes.
+Clearly, the main issue is how to handle users' crontabs on multiple login nodes; it is not clear, however, whether this should be implemented.
-But it's not clear whether we even want to do so.
 = Other issues =
-#modules don't work for shells other than bash(sh) and tcsh
+#Modules don't work for shells other than bash(sh) and tcsh.
-#*Workaround (this appears to work but not tested extensively)
+#*Workaround (this appears to work but has not been tested extensively):
 #**source $LMOD_PKG/init/zsh
 #**source $LMOD_PKG/init/ksh
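The workaround above could be made a little more defensive in a shell startup file. This is only a sketch: the helper name <tt>lmod_init_file</tt> is hypothetical, and it assumes that Lmod is installed and that <tt>LMOD_PKG</tt> points to its installation directory.

```shell
# Hypothetical helper: print the Lmod init file for the named shell,
# or return a non-zero status if it cannot be found.
lmod_init_file() {
    init_file="${LMOD_PKG:-}/init/$1"
    if [ -n "${LMOD_PKG:-}" ] && [ -r "$init_file" ]; then
        printf '%s\n' "$init_file"
    else
        echo "no Lmod init file found for shell '$1'" >&2
        return 1
    fi
}

# Example for ~/.zshrc:
#   init=$(lmod_init_file zsh) && source "$init"
```

Sourcing the file only when it exists avoids an error at every login if the Lmod installation moves or <tt>LMOD_PKG</tt> is not set.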
