Known issues: Difference between revisions

From Alliance Doc

Revision as of 17:59, 21 July 2017

Intro

Shared issues

  1. The CC Slurm configuration favours whole-node jobs. Where possible, users should request whole nodes rather than per-core resources. See Job Scheduling - Whole Node Scheduling. (Patrick Mann (talk) 20:15, 17 July 2017 (UTC))
  2. Quotas on /project are all 1 TB. The Storage National team is working on a project/RAC-based scheme. Fortunately, Lustre has announced group-based quotas, but that feature will need to be installed. (Patrick Mann (talk) 20:12, 17 July 2017 (UTC))
  3. The Slurm epilog does not fully clean up processes from ended jobs, especially if the job did not exit normally. (Greg Newby, Fri Jul 14 19:32:48 UTC 2017)
  4. Email from Graham and Cedar is still being configured, so email job notifications from Slurm are failing. (Patrick Mann (talk) 17:17, 26 June 2017 (UTC))
    • Cedar email is working now (Patrick Mann (talk) 16:11, 6 July 2017 (UTC))
    • Graham email is working
  5. The Slurm 'sinfo' command yields different resource-type detail on Graham and Cedar. (Greg Newby, 16:05, 23 June 2017 (UTC))
  6. Local scratch on compute nodes has inconsistent naming. Cedar has /local and Graham has /localscratch.
  7. The status page at http://status.computecanada.ca/ is not yet updated automatically, so it may not show the current status.
  8. "Nearline" capabilities are not yet available. See https://docs.computecanada.ca/wiki/National_Data_Cyberinfrastructure for a brief description of the intended functionality.
    • Update July 17: still not working. If you need your nearline RAC2017 quota then please ask CC support. (Patrick Mann (talk) 20:45, 17 July 2017 (UTC))
  9. Operations will occasionally time out with a message like "Socket timed out on send/recv operation" or "Unable to contact slurm controller (connect failure)". As a temporary workaround, resubmit your jobs or commands; they should go through within a few seconds. (Nathan Wielenga, 08:50, 18 July 2017 (MDT))
  10. Auto-creation of project directories such as /project/$USER was an interim solution. Soon there will be /project/gid, where gid is the project group identifier. This will be symlinked to /project/projects/pname, where pname is the "friendly" project (RAPI) name. User subdirectories for that project will then live in /project/gid/$USER. Note that quotas in /project are project-based, not user-based. (Greg Newby, Thu Jul 20 00:45:00 UTC 2017)
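
Two of the shared issues above (6 and 9) can be worked around in a user-side wrapper script. The following is a minimal sketch only, not an official Compute Canada tool. The helper names (`run_with_retry`, `local_scratch_dir`, `is_transient`), the retry count, and the delay are illustrative assumptions; only the two error messages and the two scratch paths come from the list above.

```python
import os
import subprocess
import time

# Error fragments quoted in issue 9. Any other failure is treated as a real
# error and is not retried.
TRANSIENT_ERRORS = (
    "Socket timed out on send/recv operation",
    "Unable to contact slurm controller",
)


def is_transient(stderr_text):
    """Return True if stderr contains one of the timeout messages from issue 9."""
    return any(fragment in stderr_text for fragment in TRANSIENT_ERRORS)


def run_with_retry(command, attempts=5, delay_seconds=5.0, runner=subprocess.run):
    """Run a Slurm command, retrying only when a transient error is reported.

    attempts and delay_seconds are assumed defaults, not recommended values.
    runner is injectable so the logic can be tested without a cluster.
    """
    for attempt in range(1, attempts + 1):
        result = runner(command, capture_output=True, text=True)
        if result.returncode == 0 or not is_transient(result.stderr):
            return result
        if attempt < attempts:
            time.sleep(delay_seconds)
    return result


def local_scratch_dir(candidates=("/localscratch", "/local"), exists=os.path.isdir):
    """Return the first node-local scratch path that exists (issue 6), or None."""
    for path in candidates:
        if exists(path):
            return path
    return None
```

For example, `run_with_retry(["sbatch", "job.sh"])` (with `job.sh` standing in for your own script) resubmits only when one of the two timeout messages appears, and `local_scratch_dir()` lets the same script run on both Cedar and Graham without hard-coding the scratch path.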

Cedar only

Graham only

  • no network topology information in the scheduler
    • Provisional network topology information has been added to the config (July 20 | Nathan Wielenga)

Other issues