Revision as of 19:34, 25 August 2017

Availability: In production. RAC 2017 allocations were implemented on June 30, 2017.
Login node: graham.computecanada.ca
Globus endpoint: computecanada#graham-dtn

GRAHAM is a heterogeneous cluster, suitable for a variety of workloads, and located at the University of Waterloo. It is named after Wes Graham, the first director of the Computing Centre at Waterloo. It was previously known as "GP3" and is still identified as such in the 2017 RAC documentation.

The parallel filesystem and external persistent storage (NDC-Waterloo) are similar to Cedar's. The interconnect is different and there is a slightly different mix of compute nodes.

The Graham system is sold and supported by Huawei Canada, Inc. It is entirely liquid cooled, using rear-door heat exchangers.

Getting started with Graham

How to run jobs

Attached storage systems

Home space	Standard home directory. Small, standard quota. Not allocated via RAS or RAC. Larger requests go to Project space.
Scratch space (parallel high-performance filesystem)	For active or temporary (`/scratch`) storage. Available to all nodes. Not allocated. Inactive data will be purged. Huawei OceanStor storage system with approximately 3.6 PB usable capacity and aggregate performance of approximately 30 GB/s.
Project space (external persistent storage)	Part of the National Data Cyberinfrastructure. Allocated via RAS or RAC. Available to all nodes. Not designed for parallel I/O workloads; use Scratch space instead.

High-performance interconnect

Mellanox FDR (56 Gb/s) and EDR (100 Gb/s) InfiniBand interconnect. FDR is used for GPU and cloud nodes, EDR for other node types. A central 324-port director switch aggregates connections from islands of 1024 cores each for CPU and GPU nodes. The 56 cloud nodes are a variation on CPU nodes, and are on a single larger island sharing 8 FDR uplinks to the director switch.

A low-latency, high-bandwidth InfiniBand fabric connects all nodes and scratch storage.

Nodes configurable for cloud provisioning also have a 10 Gb/s Ethernet network, with 40 Gb/s uplinks to scratch storage.

Graham is designed to support multiple simultaneous parallel jobs of up to 1024 cores in a fully non-blocking manner.

For larger jobs the interconnect has an 8:1 blocking factor. Even for jobs running on multiple islands, the Graham system still provides a high-performance interconnect.
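The island arithmetic above can be sketched in a few lines of Python. This is only an illustration: the function names are hypothetical, and it assumes that a job is packed into as few islands as possible and that the 8:1 blocking factor can be read as a simple oversubscription ratio on traffic leaving an island. Neither assumption is confirmed by this page.

```python
import math

ISLAND_CORES = 1024    # island size quoted above
BLOCKING_FACTOR = 8    # 8:1 blocking between islands, quoted above
EDR_GBPS = 100.0       # EDR link rate quoted above


def islands_spanned(job_cores: int) -> int:
    """Minimum number of islands a job must touch (assumes ideal packing)."""
    if job_cores <= 0:
        raise ValueError("job_cores must be positive")
    return math.ceil(job_cores / ISLAND_CORES)


def worst_case_off_island_gbps(link_gbps: float = EDR_GBPS) -> float:
    """Per-node bandwidth if every node in an island sends off-island at once.

    Assumption: the blocking factor acts as a plain oversubscription ratio.
    """
    return link_gbps / BLOCKING_FACTOR


if __name__ == "__main__":
    for cores in (512, 1024, 2048, 4000):
        n = islands_spanned(cores)
        kind = "non-blocking" if n == 1 else "crosses islands"
        print(f"{cores:5d} cores -> {n} island(s), {kind}")
    print(f"worst-case off-island share: {worst_case_off_island_gbps()} Gb/s per node")
```

Under these assumptions, a job of 1024 cores or fewer can stay inside one island, while a larger job shares the uplinks to the director switch for any traffic between islands.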

Graham high performance interconnect diagram

Node types and characteristics

A total of 33,472 cores and 320 GPU devices, spread across 1,043 nodes of different types.

Processor type: All nodes except bigmem3000 have Intel E5-2683 v4 CPUs, running at 2.1 GHz.

GPU type: NVIDIA P100 (12 GB)

"Base" compute nodes	800 nodes	128 GB of memory, 16 cores/socket, 2 sockets/node. Intel "Broadwell" CPUs at 2.1 GHz, model E5-2683 v4. 960 GB SATA SSD.
"Large" nodes (cloud configuration)	56 nodes	256 GB of memory, 16 cores/socket, 2 sockets/node. Intel "Broadwell" CPUs at 2.1 GHz, model E5-2683 v4. 960 GB SATA SSD.
"Bigmem500" nodes	24 nodes	0.5 TB (512 GB) of memory, 16 cores/socket, 2 sockets/node. Intel "Broadwell" CPUs at 2.1 GHz, model E5-2683 v4. 960 GB SATA SSD.
"Bigmem3000" nodes	3 nodes	3 TB of memory, 16 cores/socket, 4 sockets/node. Intel "Broadwell" CPUs at 2.1 GHz, model E7-4850 v4. 960 GB SATA SSD.
"GPU" nodes	160 nodes	128 GB of memory, 16 cores/socket, 2 sockets/node, 2 NVIDIA P100 Pascal GPUs/node (12 GB HBM2 memory). Intel "Broadwell" CPUs at 2.1 GHz, model E5-2683 v4. 1.6 TB NVMe SSD.
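As a cross-check, the totals quoted at the top of this section can be recomputed from the per-type figures in the table. The dictionary below simply restates the table; the short keys are labels chosen for this sketch, not scheduler names.

```python
# (node count, sockets per node, cores per socket, GPUs per node), from the table above
NODE_TYPES = {
    "base":       (800, 2, 16, 0),
    "large":      (56,  2, 16, 0),
    "bigmem500":  (24,  2, 16, 0),
    "bigmem3000": (3,   4, 16, 0),
    "gpu":        (160, 2, 16, 2),
}

total_nodes = sum(count for count, _, _, _ in NODE_TYPES.values())
total_cores = sum(count * sockets * cores
                  for count, sockets, cores, _ in NODE_TYPES.values())
total_gpus = sum(count * gpus for count, _, _, gpus in NODE_TYPES.values())

if __name__ == "__main__":
    print(total_nodes, total_cores, total_gpus)  # prints: 1043 33472 320
```

The result agrees with the stated 33,472 cores and 320 GPU devices across 1,043 nodes.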

Local (on-node) storage in the above nodes is in /tmp. Best practice is to use the temporary directory generated by Slurm, $SLURM_TMPDIR.
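A minimal sketch of that practice from inside a job script written in Python: resolve the node-local directory from `$SLURM_TMPDIR`, copy an input file there, and work on the local copy. The helper names `node_local_dir` and `stage_in` are hypothetical, and the fallback to the system temporary directory is an assumption made so that the sketch also runs outside a Slurm job.

```python
import os
import shutil
import tempfile
from pathlib import Path
from typing import Optional


def node_local_dir() -> Path:
    """Return $SLURM_TMPDIR if set, else the system temp dir (fallback is an assumption)."""
    return Path(os.environ.get("SLURM_TMPDIR", tempfile.gettempdir()))


def stage_in(src: str, workdir: Optional[Path] = None) -> Path:
    """Copy an input file to node-local storage and return the path of the copy."""
    workdir = workdir if workdir is not None else node_local_dir()
    dest = workdir / Path(src).name
    shutil.copy2(src, dest)
    return dest


if __name__ == "__main__":
    print("node-local working directory:", node_local_dir())
```

Results written to node-local storage would then need to be copied back to Scratch or Project space before the job ends. This sketch assumes the Slurm-generated directory does not persist after the job, which this page does not state.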

@@ Line 98: / Line 98: @@
 <!--T:7-->
-Local (on-node) storage in the above nodes will be available as /tmp.
+Local (on-node) storage in the above nodes is in /tmp. Best practice is to use the temporary directory generated by [[Running jobs|Slurm]], $SLURM_TMPDIR.
 <!--T:14-->

Graham: Difference between revisions
