Known issues


= Shared issues =
* The status page at http://status.computecanada.ca/ is not updated automatically yet, so may lag in showing current status.


== Scheduler errors ==
* The CC Slurm configuration preferentially encourages whole-node jobs. Users should, if appropriate, request whole-node rather than per-core resources; see the example script after this list. See [[Job_scheduling_policies#Whole_nodes_versus_cores;|Job Scheduling - Whole Node Scheduling]] ([[User:Pjmann|Patrick Mann]] 20:15, 17 July 2017 (UTC))
* CPU and GPU backfill partitions have been created on both clusters. If a job is submitted with a run time of less than 24 hours, it is automatically entered into the cluster-wide backfill partition. This partition has a low priority, but allows increased utilization of the cluster by serial jobs. ([[User:Nathanw|Nathan Wielenga]])
* The SLURM epilog does not fully clean up processes from ended jobs, especially if the job did not exit normally. ([[User:Gbnewby|Greg Newby]] Fri Jul 14 19:32:48 UTC 2017)
** This has been greatly improved by the addition of the epilog.clean script, but nodes are still occasionally marked down for epilog failure. (NW)
* By default, the job receives the environment settings of the shell from which it was submitted. This can lead to irreproducible results if it is not what you expect. To force the job to run with a clean, login-like environment, submit with <tt>--export=NONE</tt> or add <tt>#SBATCH --export=NONE</tt> to your job script, as in the example script after this list.
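The sketch below pulls the scheduler advice above into a single job script: it requests a whole node, keeps the run time under 24 hours so the job is eligible for the backfill partition, and uses <tt>--export=NONE</tt> so the job starts from a clean environment. The 32-core node size, the module name, and the program name are placeholders rather than values taken from this page; adjust them for the node type you are actually using.

<pre>
#!/bin/bash
#SBATCH --nodes=1              # ask for a whole node rather than individual cores
#SBATCH --ntasks-per-node=32   # placeholder core count; match the node type you request
#SBATCH --mem=0                # request all of the memory on the node
#SBATCH --time=0-23:00         # under 24 hours, so the job is eligible for backfill
#SBATCH --export=NONE          # do not inherit the submitting shell's environment

module load my_application     # placeholder module name
srun ./my_program              # placeholder executable
</pre>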


== Quota and filesystem problems ==
* "Nearline" capabilities are not yet available (see https://docs.computecanada.ca/wiki/National_Data_Cyberinfrastructure for a brief description of the intended functionality).
** Update July 17: still not working. If you need your nearline RAC2017 quota then please ask [mailto:support@computecanada.ca CC support]. ([[User:Pjmann|Patrick Mann]] 20:45, 17 July 2017 (UTC))
=== Missing symbolic links to project folders ===
* Upon login to the new clusters, symbolic links are supposed to be created in the user's account, as described in [[Project layout]]. Sometimes this does not happen. If this is the case, please verify that your access to the cluster is enabled on [https://ccdb.computecanada.ca/services/resources https://ccdb.computecanada.ca/services/resources]. A quick way to check for the links is shown below.
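If you are not sure whether the links were created, listing the symbolic links at the top level of your home directory is a quick check; the exact link names (for example <tt>projects</tt> or <tt>scratch</tt>) come from [[Project layout]] and are only illustrative here.

<pre>
# List the symbolic links at the top level of your home directory.
# A correctly provisioned account should show entries such as "projects"
# or "scratch" (see Project layout for the exact names on each cluster).
find "$HOME" -maxdepth 1 -type l -ls
</pre>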


= Cedar only =
* SLURM operations will occasionally time out with a message like "Socket timed out on send/recv operation" or "Unable to contact slurm controller (connect failure)". As a temporary workaround, resubmit your jobs/commands; they should go through after a few seconds (see the sketch below). ([[User:Nathanw|Nathan Wielenga]] 08:50, 18 July 2017 (MDT))
** Should be resolved after a VHD migration to a new backend for slurmctl. (NW)
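Until that migration is done, one way to apply the "resubmit" workaround is a small retry loop around the submission command; <tt>job.sh</tt> and the 10-second delay are arbitrary examples, not part of the documented workaround.

<pre>
# Keep retrying the submission until the Slurm controller responds.
# "job.sh" is a placeholder for your own job script.
until sbatch job.sh; do
    echo "slurmctld did not respond; retrying in 10 seconds..."
    sleep 10
done
</pre>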


= Graham only =


= Other issues =
# Modules don't work for shells other than bash (sh).
#* Workaround (this appears to work but has not been tested extensively); an example for tcsh follows this list:
#** <tt>source $LMOD_PKG/init/tcsh</tt>
#** <tt>source $LMOD_PKG/init/zsh</tt>
#** <tt>source $LMOD_PKG/init/ksh</tt>
#* Bart will look at this in more detail.
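As an illustration of the workaround for tcsh users, the line could be added to the shell startup file so that the <tt>module</tt> command is available in every session. This is a sketch that assumes <tt>$LMOD_PKG</tt> is set for tcsh logins, and it has not been tested beyond the note above; zsh and ksh users would source the corresponding init file from the list instead.

<pre>
# Example addition to ~/.tcshrc (untested sketch of the workaround above):
# initialize Lmod for tcsh when $LMOD_PKG is defined, so that the
# "module" command works in tcsh sessions.
if ( $?LMOD_PKG ) then
    source $LMOD_PKG/init/tcsh
endif
</pre>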