Known issues


Shared issues

Scheduler errors

  • The CC Slurm configuration encourages whole-node jobs. Where appropriate, users should request whole nodes rather than per-core resources; see Job Scheduling - Whole Node Scheduling. A sample whole-node script is sketched after this list. (Patrick Mann 20:15, 17 July 2017 (UTC))
  • CPU and GPU backfill partitions have been created on both clusters. A job submitted with a runtime of less than 24 hours is automatically entered into the cluster-wide backfill partition. This partition has low priority, but allows increased utilization of the cluster by serial jobs. (Nathan Wielenga)
  • The SLURM epilog does not fully clean up processes from ended jobs, especially if the job did not exit normally. (Greg Newby, Fri Jul 14 19:32:48 UTC 2017)
    • This has been greatly improved after the addition of the epilog.clean script, but there are still nodes occasionally marked down for epilog failure. (NW)
  • By default, a job receives the environment settings of the submitting shell. This can lead to irreproducible results if it is not what you expect. To force the job to run with a fresh, login-like environment, submit with --export=none or add #SBATCH --export=NONE to your job script, as in the sketch below.
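
The following minimal job script illustrates both whole-node scheduling and a clean environment. It is a sketch, not from the original page: the account def-someuser, the 32-core node size, and my_program are placeholder assumptions to adjust for your allocation and hardware.

#!/bin/bash
#SBATCH --account=def-someuser    # placeholder account name
#SBATCH --nodes=1                 # request a whole node rather than individual cores
#SBATCH --ntasks-per-node=32      # assumed cores per node; match the node type you target
#SBATCH --time=0:30:00            # under 24 hours, so eligible for the backfill partition
#SBATCH --export=NONE             # start from a fresh, login-like environment
srun ./my_program                 # my_program stands in for your own executable

Submit it with sbatch as usual; since the full node is allocated, the job does not share cores with other jobs.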

Quota and filesystem problems

Quota errors on /project filesystem

Sometimes users will see quota errors on their project folders. This may happen when files are owned by a group other than the project group. You can change the group which owns the files using the command

[name@server ~]$ chgrp -R <group> <folder>

To see what <group> should be, run the following command:

[name@server ~]$ stat -c %G $HOME/projects/*/

Only the owner of the files can run the chgrp command. To ask us to correct the group ownership for files belonging to many users, write to support@computecanada.ca.
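
Before running chgrp recursively, it can be useful to list only the files whose group ownership is wrong. A minimal sketch, assuming the project group is def-someuser (a placeholder):

[name@server ~]$ find $HOME/projects/def-someuser/ ! -group def-someuser

Each path printed belongs to some other group and would be changed by the chgrp command above.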

Nearline

Missing symbolic links to project folders

Cedar only

  • SLURM operations will occasionally time out with a message like "Socket timed out on send/recv operation" or "Unable to contact slurm controller (connect failure)". As a temporary workaround, resubmit your jobs/commands; they should go through after a few seconds. A retry sketch follows this list. (Nathan Wielenga, 08:50, 18 July 2017 (MDT))
    • Should be resolved after a VHD migration to a new backend for slurmctl. (NW)
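
When these timeouts occur inside a script, the workaround above can be automated with a simple retry loop. A minimal sketch, where job.sh is a placeholder for your own job script:

[name@server ~]$ for i in 1 2 3 4 5; do sbatch job.sh && break; sleep 5; done

The loop retries sbatch up to five times, pausing five seconds between attempts, and stops at the first success.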

Graham only

  • /home is on an NFS appliance that does not support ACLs, so setfacl/getfacl doesn't work there.
    • Workaround: use the /project or /scratch filesystems instead (see the sketch after this list).
    • We are investigating whether this can be fixed through an update or reconfiguration.
  • Compute nodes cannot access the Internet.
    • Solution: Request an exception by writing to support@computecanada.ca. Describe what you need to access and why.
  • Intel compiler does not work on compute nodes
    • Solution/workaround: Compile your code on the login node.
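
To apply the ACL workaround, run setfacl on a directory under /project or /scratch rather than under /home. A minimal sketch; otheruser and the path are placeholders:

[name@server ~]$ setfacl -R -m u:otheruser:rX /project/def-someuser/shared
[name@server ~]$ getfacl /project/def-someuser/shared

The first command grants otheruser read access (and traversal of subdirectories) to the shared folder; the second shows the resulting ACL.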

Other issues

  1. Modules don't work for shells other than bash (sh).
    • Workaround (this appears to work but has not been tested extensively):
      • source $LMOD_PKG/init/tcsh
      • source $LMOD_PKG/init/zsh
      • source $LMOD_PKG/init/ksh
    • Bart will look at it in more detail.
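
To make the workaround persistent, the appropriate source line can be added to the shell's startup file. A minimal sketch for tcsh, not from the original page:

[name@server ~]$ echo 'source $LMOD_PKG/init/tcsh' >> ~/.tcshrc

After starting a new tcsh session, module avail and other module commands should work as they do under bash.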