* SLURM epilog does not fully clean up processes from ended jobs, especially if the job did not exit normally. ([[User:Gbnewby|Greg Newby]], Fri Jul 14 19:32:48 UTC 2017)
** This has been greatly improved after the addition of the epilog.clean script, but there are still nodes occasionally marked down for epilog failure. (NW)
* By default, the job receives environment settings from the submitting shell. This can lead to irreproducible results if it is not what you expect. To force the job to run with a fresh, login-like environment, you can submit with <tt>--export=NONE</tt> or add <tt>#SBATCH --export=NONE</tt> to your job script, as in the example below.
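A minimal example of such a job script (the resource requests, module, and program name here are only placeholders, not recommendations):
<pre>
#!/bin/bash
#SBATCH --export=NONE      # ignore the submitting shell's environment
#SBATCH --time=00:10:00    # placeholder walltime
#SBATCH --mem=1G           # placeholder memory request
module load gcc            # load only the modules the job needs
./my_program               # placeholder executable
</pre>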
== Quota and filesystem problems == <!--T:7-->
=== Quota errors on /project filesystem ===
Sometimes, users will see a quota error on their project folders. This may happen when files are owned by a group other than the project group. You can change the group which owns the files using the command
{{Command|chgrp -R <group> <folder>}}
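For example, with a hypothetical project group <tt>def-someprof</tt> and a hypothetical folder <tt>~/projects/def-someprof/data</tt>:
{{Command|chgrp -R def-someprof ~/projects/def-someprof/data}}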
= Cedar only = <!--T:3-->
* SLURM operations will occasionally time out with a message like "Socket timed out on send/recv operation" or "Unable to contact slurm controller (connect failure)". As a temporary workaround, attempt to resubmit your jobs/commands; they should go through in a few seconds (see the retry example after this list). ([[User:Nathanw|Nathan Wielenga]], 08:50, 18 July 2017 (MDT))
** Should be resolved after a VHD migration to a new backend for slurmctl. (NW)
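If resubmitting by hand becomes tedious, a simple (unofficial) shell loop can retry the submission; the script name and delay are placeholders:
<pre>
# Retry the submission a few times if the SLURM controller times out
for attempt in 1 2 3; do
    sbatch job.sh && break   # stop as soon as the submission succeeds
    sleep 10                 # wait before the next attempt
done
</pre>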
= Graham only = <!--T:4-->
* /home is on an NFS appliance that does not support ACLs, so setfacl/getfacl doesn't work there.
** Workaround: use the /project or /scratch filesystems instead (see the example after this list).
** We are investigating whether this can be fixed through an update or reconfiguration.
* Compute nodes cannot access the Internet.
** Solution: Request exceptions to be made at support@computecanada.ca, describing what you need to access and why.
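For example, since ACLs do work on /project and /scratch, read access could be granted there to a hypothetical user <tt>someuser</tt> on a hypothetical folder with
{{Command|setfacl -R -m u:someuser:rX /project/def-someprof/shared_data}}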