** CPU and GPU backfill partitions have been created on both clusters. Any job submitted with a runtime of less than 24 hours is automatically entered into the cluster-wide backfill partition. This partition has low priority, but it lets serial jobs increase overall cluster utilization. ([[User:Nathanw|Nathan Wielenga]])
* The SLURM epilog does not fully clean up processes from ended jobs, especially if the job did not exit normally. ([[User:Gbnewby|Greg Newby]], Fri Jul 14 19:32:48 UTC 2017)
** The addition of the epilog.clean script has greatly improved this, but nodes are still occasionally marked down for epilog failure. (NW)
* Operations occasionally time out with a message such as "Socket timed out on send/recv operation" or "Unable to contact slurm controller (connect failure)". As a temporary workaround, resubmit your jobs or commands; they should go through within a few seconds. ([[User:Nathanw|Nathan Wielenga]], 08:50, 18 July 2017 (MDT))
** This should be resolved after a VHD migration to a new backend for slurmctl. (NW)
* The environment of the shell from which a job is submitted is exported to the job. This can lead to irreproducible results, because the job's behavior then depends on whatever happened to be set in that shell.
** Solution/workaround: Add the option <tt>#SBATCH --export=NONE</tt> to your job script.
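The resubmission workaround for the controller timeouts listed above can be scripted. The following is only a minimal sketch: the helper name <tt>retry_cmd</tt>, the attempt count, the delay, and the job script name <tt>myjob.sh</tt> are illustrative assumptions and are not part of SLURM or of this cluster's configuration.

```shell
# Hypothetical helper (not part of SLURM): rerun a command until it succeeds
# or the attempt limit is reached.
# Usage: retry_cmd MAX_ATTEMPTS DELAY_SECONDS COMMAND [ARGS...]
retry_cmd() {
    local max_attempts=$1
    local delay=$2
    shift 2
    local attempt=1
    while true; do
        # Run the command exactly as given; stop on the first success.
        if "$@"; then
            return 0
        fi
        if [ "$attempt" -ge "$max_attempts" ]; then
            echo "retry_cmd: giving up after $attempt attempts: $*" >&2
            return 1
        fi
        echo "retry_cmd: attempt $attempt failed, retrying in ${delay}s" >&2
        attempt=$((attempt + 1))
        sleep "$delay"
    done
}

# Example (assumed values, not run here): try a submission up to 5 times,
# waiting 5 seconds between attempts.
# retry_cmd 5 5 sbatch myjob.sh
```

A timed-out submission may still have reached the controller, so it is worth checking <tt>squeue -u $USER</tt> before relying on automatic resubmission, to avoid queuing the same job twice.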
== Quota and filesystem problems == <!--T:7-->