Known issues

Intro

Shared issues

Scheduler errors

  • The CC Slurm configuration encourages whole-node jobs. Users should, where appropriate, request whole-node rather than per-core resources. See Job Scheduling - Whole Node Scheduling. (Patrick Mann 20:15, 17 July 2017 (UTC))
  • CPU and GPU backfill partitions have been created on both clusters. If a job is submitted with a runtime of less than 24 hours, it will automatically be entered into the cluster-wide backfill partition. This partition has a low priority, but allows increased utilization of the cluster by serial jobs. (Nathan Wielenga)
  • SLURM epilog does not fully clean up processes from ended jobs, especially if the job did not exit normally. (Greg Newby, Fri Jul 14 19:32:48 UTC 2017)
    • This has been greatly improved after the addition of the epilog.clean script, but there are still nodes occasionally marked down for epilog failure. (NW)
  • By default, the job receives the environment settings of the submitting shell. This can lead to irreproducible results if it is not what you expect. To force the job to run with a fresh, login-like environment, submit with --export=NONE on the command line or add #SBATCH --export=NONE to your job script (see the example script below).
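
The following minimal job-script sketch illustrates the points above; the accounting group, core count, and program name are placeholders and should be adjusted for your allocation and the node type:

#!/bin/bash
#SBATCH --account=def-someuser      # placeholder accounting group
#SBATCH --nodes=1                   # request a whole node rather than individual cores
#SBATCH --ntasks-per-node=32        # assumed core count; adjust to the node type
#SBATCH --time=23:00:00             # under 24 hours, so eligible for the backfill partition
#SBATCH --export=NONE               # start from a fresh, login-like environment
srun ./my_program                   # placeholder executable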

Quota and filesystem problems

Quota errors on /project filesystem

Sometimes users will see quota errors on their project folders. This may happen when files are owned by a group other than the project group. You can change the group that owns the files with the command

[name@server ~]$ chgrp -R <group> <folder>

To see what <group> should be, run the following command:

[name@server ~]$ stat -c %G $HOME/projects/*/

Only the owner of the files can run the chgrp command. To ask us to correct the group ownership for files belonging to many users, write to support@computecanada.ca.
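
To check which files under a folder are not owned by the expected group before running chgrp, something like the following may help (<group> and <folder> are placeholders, as above):

[name@server ~]$ find <folder> ! -group <group> -ls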

Nearline

Missing symbolic links to project folders

Cedar only

  • SLURM operations will occasionally time out with a message like "Socket timed out on send/recv operation" or "Unable to contact slurm controller (connect failure)". As a temporary workaround, try resubmitting your jobs/commands; they should go through in a few seconds. (Nathan Wielenga, 08:50, 18 July 2017 (MDT))
    • Should be resolved after a VHD migration to a new backend for slurmctl. (NW)
  • Some people are getting the error "error: Job submit/allocate failed: Invalid account or account/partition combination specified".
    • They need to specify '--account=<accounting group>' (see the example below).
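
For example (def-someuser is a placeholder; substitute your own accounting group):

[name@server ~]$ sbatch --account=def-someuser job_script.sh

or, inside the job script itself:

#SBATCH --account=def-someuser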

Graham only

  • /home is on an NFS appliance that does not support ACLs, so setfacl/getfacl doesn't work there.
    • Workaround: use the /project or /scratch filesystems instead (see the sketch below)
    • We're finding out whether this can be fixed through an update or reconfiguration.
  • Compute nodes cannot access the Internet
    • Solution: Request an exception by writing to support@computecanada.ca; describe what you need to access and why.
  • Intel compiler does not work on compute nodes
    • Solution/workaround: Compile your code on the login node.
  • Crontab does not work on Graham. When attempting to add a new item, there is an error during saving:
[rozmanov@gra-login1 ~]$ crontab -e
no crontab for rozmanov - using an empty one
crontab: installing new crontab
/var/spool/cron/#tmp.gra-login1.XXXXKsp8LU: Read-only file system
crontab: edits left in /tmp/crontab.u0ljzU

Crontab does work on Cedar, so there should be some kind of common approach across CC systems. Clearly, the main issue is how to handle users' crontabs across multiple login nodes.
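
As a sketch of the ACL workaround mentioned for /home above (the user name and path are placeholders), ACLs can be applied on /project or /scratch instead:

[name@server ~]$ setfacl -m u:someuser:rX /project/def-someprof/shared_dir
[name@server ~]$ getfacl /project/def-someprof/shared_dir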

Other issues

  1. modules don't work for shells other than bash (sh)
    • Workaround (this appears to work but has not been tested extensively)
      • source $LMOD_PKG/init/tcsh
      • source $LMOD_PKG/init/zsh
      • source $LMOD_PKG/init/ksh
    • Bart will look at it in more detail
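
For example, one way to try this in zsh (a sketch only; it assumes $LMOD_PKG is defined in the new shell, which may not be the case everywhere) is to add the source line to your startup file and then open a new session:

[name@server ~]$ echo 'source $LMOD_PKG/init/zsh' >> ~/.zshrc
[name@server ~]$ zsh
[name@server ~]$ module list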