Known issues: Difference between revisions

* Quotas on <code>/project</code> are all 1 TB. The Storage National team is working on a project/RAC-based schema. Fortunately, Lustre has announced group-based quotas, but these will need to be installed. ([[User:Pjmann|Patrick Mann]] 20:12, 17 July 2017 (UTC))
* The SLURM epilog does not fully clean up processes from ended jobs, especially if the job did not exit normally. ([[User:Gbnewby|Greg Newby]] 19:32, 14 July 2017 (UTC))
* The SLURM <code>sinfo</code> command yields different resource-type detail on graham and cedar. ([[User:Gbnewby|Greg Newby]] 16:05, 23 June 2017 (UTC))
* The status page at http://status.computecanada.ca/ is not yet updated automatically, so it does not necessarily show the correct, current status.
* "Nearline" capabilities are not yet available (see https://docs.computecanada.ca/wiki/National_Data_Cyberinfrastructure for a brief description of the intended functionality).
* Operations will occasionally time out with a message like "Socket timed out on send/recv operation" or "Unable to contact slurm controller (connect failure)". As a temporary workaround, resubmit your jobs or commands; they should go through after a few seconds. ([[User:Nathanw|Nathan Wielenga]] 08:50, 18 July 2017 (MDT))
** Should be resolved after a VHD migration to a new backend for slurmctl. (NW)
* Auto-creation of project directories such as <code>/project/$USER</code> was an interim solution. Soon there will be <code>/project/gid</code>, where <code>gid</code> is the project group identifier. This will be symlinked to <code>/project/projects/pname</code>, where <code>pname</code> is the "friendly" project (RAPI) name. User subdirectories for that project can then live in <code>/project/gid/$USER</code>. Note that quotas in <code>/project</code> are project-based, not user-based. ([[User:Gbnewby|Greg Newby]] 00:45, 20 July 2017 (UTC))
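
The planned <code>/project</code> layout described in the last item above can be illustrated with a small path-building sketch. It only constructs path names and does not touch the filesystem. The function name <code>project_paths</code> and the sample values for the group identifier, project name, and username are hypothetical, chosen for illustration; the real identifiers and the final layout may differ.

```python
from pathlib import PurePosixPath

# Root of the project filesystem, as named in the list above.
PROJECT_ROOT = PurePosixPath("/project")


def project_paths(gid, pname, user):
    """Build the three paths described in the planned layout.

    gid   -- project group identifier (passed as a string here; an assumption)
    pname -- "friendly" project (RAPI) name
    user  -- username whose subdirectory lives under the project
    """
    return {
        # /project/gid
        "group_dir": PROJECT_ROOT / gid,
        # /project/projects/pname, the friendly-name counterpart of group_dir
        "friendly_dir": PROJECT_ROOT / "projects" / pname,
        # /project/gid/$USER, the per-user subdirectory within the project
        "user_dir": PROJECT_ROOT / gid / user,
    }


if __name__ == "__main__":
    # All three values below are made up for illustration.
    for name, path in project_paths("1234567", "example-project", "alice").items():
        print(f"{name}: {path}")
```

Because quotas in <code>/project</code> are project-based, everything under a single <code>group_dir</code> would count against the same quota, regardless of which user subdirectory holds the files.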

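The resubmission workaround for the timeout messages listed above can be sketched as a small retry wrapper. This is a minimal sketch, not an official tool: the helper names (<code>is_transient</code>, <code>run_with_retry</code>), the number of attempts, and the delay are assumptions; only the two error strings come from the list above.

```python
import subprocess
import time

# Error strings quoted in the known-issues list above.
TRANSIENT_ERRORS = (
    "Socket timed out on send/recv operation",
    "Unable to contact slurm controller (connect failure)",
)


def is_transient(stderr_text):
    """Return True if stderr contains one of the known transient messages."""
    return any(msg in stderr_text for msg in TRANSIENT_ERRORS)


def run_with_retry(cmd, attempts=5, delay_seconds=5.0,
                   runner=subprocess.run, sleep=time.sleep):
    """Run cmd, retrying only when it fails with a known transient message.

    attempts and delay_seconds are arbitrary defaults (assumptions).
    runner and sleep are injectable so the logic can be tested without SLURM.
    """
    result = None
    for attempt in range(1, attempts + 1):
        result = runner(cmd, capture_output=True, text=True)
        # Stop on success, or on any failure that is not a known timeout.
        if result.returncode == 0 or not is_transient(result.stderr):
            return result
        if attempt < attempts:
            sleep(delay_seconds)
    return result


if __name__ == "__main__":
    # "job.sh" is a hypothetical job script name.
    outcome = run_with_retry(["sbatch", "job.sh"])
    print(outcome.stdout or outcome.stderr)
```

A timed-out submission may still have been accepted by the controller, so it may be worth checking <code>squeue -u $USER</code> before relying on automatic resubmission, to avoid duplicate jobs.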

== Cedar only == <!--T:3-->