Known issues: Difference between revisions
Revision as of 13:57, 29 July 2017
Intro
- Please report issues to support@computecanada.ca
- The CC Slurm configuration favours whole-node jobs. Where possible, users should request whole nodes rather than per-core resources. See Job Scheduling - Whole Node Scheduling (Patrick Mann 20:15, 17 July 2017 (UTC))
- CPU and GPU backfill partitions have been created on both clusters. A job submitted with a runtime of less than 24 hours is automatically entered into the cluster-wide backfill partition. This partition has low priority, but it allows serial jobs to increase utilization of the cluster. (Nathan Wielenga)
- Quotas on /project are all 1 TB. The Storage National team is working on a project/RAC based schema. Fortunately, Lustre has announced group-based quotas, but these will need to be installed. (Patrick Mann 20:12, 17 July 2017 (UTC))
- The SLURM epilog does not fully clean up processes from ended jobs, especially if the job did not exit normally. (Greg Newby, Fri Jul 14 19:32:48 UTC 2017)
- The status page at http://status.computecanada.ca/ is not yet updated automatically, so it may not show the current status.
- "Nearline" capabilities are not yet available (see https://docs.computecanada.ca/wiki/National_Data_Cyberinfrastructure for a brief description of the intended functionality)
- Update July 17: still not working. If you need your nearline RAC2017 quota then please ask CC support. (Patrick Mann 20:45, 17 July 2017 (UTC))
- Operations will occasionally time out with a message like "Socket timed out on send/recv operation" or "Unable to contact slurm controller (connect failure)". As a temporary workaround, resubmit your jobs or commands; they should go through within a few seconds. (Nathan Wielenga, 08:50, 18 July 2017 (MDT))
- Should be resolved after a VHD migration to a new backend for slurmctl. (NW)
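The resubmission workaround for the timeout messages above can be scripted. What follows is a minimal sketch, not an official CC tool: the function name `run_with_retry`, the retry count and the delay are assumptions chosen for illustration, and only the two error strings come from this page. Because a timed-out submission may still have reached the controller, check `squeue` before relying on automatic resubmission of `sbatch` commands.

```python
import subprocess
import time

# Error fragments quoted in the list above; these are treated as transient.
TRANSIENT_ERRORS = (
    "Socket timed out on send/recv operation",
    "Unable to contact slurm controller",
)


def run_with_retry(cmd, attempts=5, delay=5.0,
                   runner=subprocess.run, sleep=time.sleep):
    """Run cmd, retrying only when stderr shows a known transient Slurm error.

    `attempts` and `delay` are illustrative defaults, not site recommendations.
    `runner` and `sleep` are injectable so the logic can be tested offline.
    """
    result = None
    for attempt in range(1, attempts + 1):
        result = runner(cmd, capture_output=True, text=True)
        if result.returncode == 0:
            return result
        transient = any(msg in result.stderr for msg in TRANSIENT_ERRORS)
        if not transient or attempt == attempts:
            # Either a real error or retries are exhausted: hand it back.
            return result
        sleep(delay)
    return result
```

A call such as `run_with_retry(["squeue", "-u", "someuser"])` (hypothetical user name) returns the final `CompletedProcess`, so the caller can still inspect `returncode` and `stderr`.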
Cedar only
Graham only
- Custom file ACLs do not work on /home
- Solution/workaround: use the /project or /scratch filesystems instead
- Compute nodes cannot access Internet
- Solution: request an exception by writing to support@computecanada.ca. Describe what you need to access and why.
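To illustrate the ACL workaround, here is a small sketch that builds a standard `setfacl` command and refuses paths under /home, where custom ACLs do not work. The helper name `setfacl_command`, the default permission string and the example path and user are hypothetical; the sketch assumes absolute paths and only constructs the command, leaving execution to the caller.

```python
import posixpath


def setfacl_command(path, user, perms="rX"):
    """Build a setfacl command granting `user` the given perms on `path`.

    Assumes `path` is absolute. Paths under /home are rejected because
    custom ACLs are reported not to work there; use /project or /scratch.
    """
    normalized = posixpath.normpath(path)
    if normalized == "/home" or normalized.startswith("/home/"):
        raise ValueError("custom ACLs do not work on /home; "
                         "use /project or /scratch instead")
    # -m modifies the ACL; u:<user>:<perms> is the standard entry syntax.
    return ["setfacl", "-m", f"u:{user}:{perms}", normalized]
```

For example, `setfacl_command("/project/some-group/shared", "someuser")` yields a list that can be passed to `subprocess.run`.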