Known issues: Difference between revisions

no edit summary
(Marked this version for translation)
No edit summary
Line 12: Line 12:
* SLURM epilog does not fully clean up processes from ended jobs, especially if the job did not exit normally. ([[User:Gbnewby|Greg Newby]], Fri Jul 14 19:32:48 UTC 2017)
** This has been greatly improved after the addition of the epilog.clean script, but there are still nodes occasionally marked down for epilog failure. (NW)
** Should be resolved after a VHD migration to a new backend for slurmctl. (NW)
* By default, the job receives environment settings from the submitting shell. This can lead to irreproducible results if it's not what you expect. To force the job to run with a fresh, login-like environment, you can submit with <tt>--export=NONE</tt> or add <tt>#SBATCH --export=NONE</tt> to your job script.
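
A minimal job-script sketch of the second option is shown here. Only the <tt>#SBATCH --export=NONE</tt> line comes from the text above; the time limit, the variable name <tt>MY_SETTING</tt> and the helper <tt>report_var</tt> are assumptions made up for illustration.

```shell
#!/bin/bash
#SBATCH --export=NONE        # do not inherit the submitting shell's environment
#SBATCH --time=00:05:00      # assumed time limit, for illustration only

# report_var NAME: say whether NAME is present in this process's environment.
# (Hypothetical helper, used here only to make the effect visible.)
report_var() {
    if printenv "$1" > /dev/null; then
        echo "$1 is in the job environment"
    else
        echo "$1 is not in the job environment"
    fi
}

# MY_SETTING is a made-up variable that you might have exported before submitting.
report_var MY_SETTING
```

If <tt>MY_SETTING</tt> was exported in the submitting shell, this job would be expected to report it as absent, unless your login scripts set it as well. Without the <tt>--export=NONE</tt> line it would normally be inherited.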


== Quota and filesystem problems == <!--T:7-->
=== Quota errors on /project filesystem ===
Users sometimes see a quota error on their project folders. This is because the group that owns the files is not the project group. You can change the group that owns the files using the command
Users sometimes see a quota error on their project folders. This may happen when files are owned by a group other than the project group. You can change the group that owns the files using the command
{{Command|chgrp -R <group> <folder>}}
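
Before changing ownership, it may help to see which files are actually affected. The following is a sketch only: <tt>list_wrong_group</tt> is a hypothetical helper name, and it relies on the standard <tt>find</tt> test <tt>-group</tt>, negated with <tt>!</tt>.

```shell
#!/bin/bash
# list_wrong_group PATH GROUP: print every entry under PATH whose owning
# group is not GROUP (GROUP may be a group name or a numeric GID).
list_wrong_group() {
    find "$1" ! -group "$2" -print
}

# Hypothetical usage, keeping the page's placeholders:
#   list_wrong_group <folder> <group>
```

An empty result suggests that every file under the folder already belongs to the given group, in which case the quota error probably has another cause.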


Line 36: Line 35:
= Cedar only = <!--T:3-->
* SLURM operations will occasionally time out with a message like "Socket timed out on send/recv operation" or "Unable to contact slurm controller (connect failure)". As a temporary workaround, resubmit your jobs or rerun your commands; they should go through within a few seconds. ([[User:Nathanw|Nathan Wielenga]], 08:50, 18 July 2017 (MDT))
** Should be resolved after a VHD migration to a new backend for slurmctl. (NW)
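
The resubmission workaround can be scripted. The sketch below is an assumption, not a site-provided tool: <tt>retry</tt> is a hypothetical helper, and the attempt count, delay and script name in the usage comment are placeholders.

```shell
#!/bin/bash
# retry ATTEMPTS DELAY COMMAND [ARGS...]
# Run COMMAND until it succeeds, at most ATTEMPTS times, sleeping DELAY
# seconds between tries. Returns 0 on success, 1 if every attempt failed.
retry() {
    local attempts="$1" delay="$2"
    shift 2
    local n=1
    until "$@"; do
        if [ "$n" -ge "$attempts" ]; then
            return 1
        fi
        n=$((n + 1))
        sleep "$delay"
    done
    return 0
}

# Hypothetical usage (job.sh is a placeholder script name):
#   retry 5 10 sbatch job.sh
```

A submission that timed out may still have been accepted by the controller, so it is worth checking <tt>squeue -u $USER</tt> before retrying a submission to avoid duplicate jobs.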


= Graham only = <!--T:4-->
* Custom file ACLs do not work on /home
* /home is on an NFS appliance that does not support ACLs, so <tt>setfacl</tt> and <tt>getfacl</tt> do not work there.
** Solution/workaround: use the /project or /scratch filesystems instead
** Workaround: use the /project or /scratch filesystems instead
** We're finding out whether this can be fixed through an update or reconfiguration.
* Compute nodes cannot access the Internet
** Solution: Request an exception by writing to support@computecanada.ca. Describe what you need to access and why.