<languages /><br />
<translate><br />
<br />
<!--T:30--><br />
Nearline is a tape-based filesystem intended for '''inactive data'''. Datasets which you do not expect to access for months are good candidates to be stored in /nearline. <br />
<br />
= Restrictions and best practices = <!--T:33--><br />
<br />
Note that there is no need to compress the data that you will be copying to nearline; the tape archive system automatically performs the compression using specialized circuitry.<br />
<br />
==== Size of files ==== <!--T:34--><br />
<br />
<!--T:35--><br />
Retrieving small files from tape is inefficient, while extremely large files pose other problems. Please observe these guidelines when storing files in /nearline:<br />
<br />
<!--T:9--><br />
*Files smaller than ~10GB should be combined into archive files (''tarballs'') using [[A tutorial on 'tar'|tar]] or a [[Archiving and compressing files|similar tool]].<br />
*Files larger than 4TB should be split into chunks of 1TB using the [[A_tutorial_on_'tar'#split|split]] command or a similar tool; see the sketch after this list.<br />
*'''DO NOT SEND SMALL FILES TO NEARLINE, except for indexes (see ''Creating an index'' below).'''<br />
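<br />
A minimal sketch of the split approach, assuming illustrative paths and group names; adjust them to your own /project and /nearline directories:<br />
<source lang="bash"><br />
# Archive a directory from /project and split the stream into 1TB chunks on /nearline.<br />
tar cf - /project/def-sponsor/user/big_dataset | \<br />
    split -b 1T - /nearline/def-sponsor/user/big_dataset.tar.<br />
<br />
# To restore later (once all chunks are back on disk), reassemble the stream and extract.<br />
cat /nearline/def-sponsor/user/big_dataset.tar.* | tar xf -<br />
</source><br />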
<br />
==== Using tar or dar ==== <!--T:36--><br />
<br />
<!--T:37--><br />
Use [[A tutorial on 'tar'|tar]] or [[dar]] to create an archive file.<br />
Keep the source files in their original filesystem. Do NOT copy the source files to /nearline before creating the archive.<br />
<br />
<!--T:38--><br />
If you have hundreds of gigabytes of data, the <code>tar</code> options <code>-M (--multi-volume)</code> and <code>-L (--tape-length)</code> can be used to produce archive files of suitable size.<br />
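<br />
For example, a sketch assuming roughly 1TiB volumes (<code>-L</code> counts units of 1024 bytes, so 1073741824 corresponds to 1TiB; the file names are illustrative, and a multi-volume archive must also be read back with <code>-M</code>):<br />
<source lang="bash"><br />
# Write a multi-volume archive; tar moves on to the next -f name once the<br />
# current volume reaches the -L limit.<br />
tar -cM -L 1073741824 -f part1.tar -f part2.tar -f part3.tar \<br />
    /project/def-sponsor/user/big_dataset<br />
</source><br />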
<br />
<!--T:39--><br />
If you are using <code>dar</code>, you can similarly use the <code>-s (--slice)</code> option.<br />
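<br />
A minimal <code>dar</code> sketch with illustrative paths, assuming your version of dar accepts the <code>1T</code> size suffix (slices are written as mycollection.1.dar, mycollection.2.dar, and so on):<br />
<source lang="bash"><br />
# Create a dar archive of /project data in 1TB slices, written to /nearline.<br />
dar -c /nearline/def-sponsor/user/mycollection -s 1T -R /project/def-sponsor/user/something<br />
</source><br />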
<br />
===== Creating an index ===== <!--T:48--><br />
When you bundle files, it becomes inconvenient to find individual files. To avoid having to restore an entire large collection from tape when you only need one or a few of the files from this collection, you should make an index of all archive files you create. Create the index as soon as you create the collection. For instance, you can save the output of tar with the <tt>verbose</tt> option when you create the archive, like this:<br />
<br />
<!--T:49--><br />
{{Command|tar cvvf /nearline/def-sponsor/user/mycollection.tar /project/def-sponsor/user/something > /nearline/def-sponsor/user/mycollection.index}}<br />
<br />
<!--T:50--><br />
If you've just created the archive (again using tar as an example), you can create an index like this:<br />
<br />
<!--T:51--><br />
{{Command|tar tvvf /nearline/def-sponsor/user/mycollection.tar > /nearline/def-sponsor/user/mycollection.index}}<br />
<br />
<!--T:52--><br />
Index files are an exception to the rule about small files on nearline: it's okay to store them in /nearline.<br />
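<br />
With an index in place you can find out which archive holds a file, and what its path inside the archive is, without recalling anything from tape; a sketch with illustrative names:<br />
<source lang="bash"><br />
# Search the small, on-disk index for the file you need.<br />
grep myresult.dat /nearline/def-sponsor/user/mycollection.index<br />
<br />
# Extract just that file (the archive itself must first be recalled to disk).<br />
tar xvf /nearline/def-sponsor/user/mycollection.tar project/def-sponsor/user/something/myresult.dat<br />
</source><br />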
<br />
==== No access from compute nodes ==== <!--T:40--><br />
<br />
<!--T:41--><br />
Because data retrieval from /nearline may take an unpredictable amount of time (see ''How it works'' below), we do not permit reading from /nearline in a job context. /nearline is not mounted on compute nodes.<br />
<br />
==== Use a data-transfer node if available ==== <!--T:42--><br />
<br />
<!--T:32--><br />
Creating a tar or dar file for a large volume of data can be resource-intensive. Please do this on a data-transfer node (DTN) instead of on a login node whenever possible.<br />
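<br />
For example, on Graham you could log into the data-transfer node and run the archiving step there (the paths are illustrative):<br />
<source lang="bash"><br />
ssh username@gra-dtn1.computecanada.ca<br />
tar cvf /nearline/def-sponsor/user/mycollection.tar /project/def-sponsor/user/something<br />
</source><br />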
<br />
= Why /nearline? = <!--T:43--><br />
<br />
<!--T:44--><br />
Tape as a storage medium has these advantages over disk and solid-state ("SSD") media.<br />
# Cost per unit of data stored is lower.<br />
# The volume of data stored can be easily expanded by buying more tapes.<br />
# Energy consumption per unit of data stored is effectively zero.<br />
<br />
<!--T:45--><br />
Consequently, we can offer much greater volumes of storage on /nearline than we can on /project. Also, keeping inactive data ''off'' of /project reduces the load on that filesystem and improves its performance.<br />
<br />
= How it works = <!--T:46--><br />
<br />
<!--T:22--><br />
# When a file is first copied to (or created on) /nearline, the file exists only on disk, not tape.<br />
# After a period (on the order of a day), and if the file meets certain criteria, the system will copy the file to tape. At this stage, the file will be on both disk and tape.<br />
# After a further period the disk copy may be deleted, and the file will only be on tape.<br />
# When such a file is recalled, it is copied from tape back to disk, returning it to the second state.<br />
<br />
<!--T:2--><br />
When a file has been moved entirely to tape (that is, when it is ''virtualized'') it will still appear in the directory listing. If the virtual file is read, it will take some time for the tape to be retrieved from the library and copied back to disk. The process which is trying to read the file will block while this is happening. This may take from less than a minute to over an hour, depending on the size of the file and the demand on the tape system.<br />
<br />
== Transferring data from Nearline == <!--T:53--><br />
<br />
<!--T:54--><br />
Before starting [[Transferring_data|any transfer]] using [[Globus]] or another tool, make sure all the data are already back on disk. Otherwise, transfers will constantly hang and probably overload the tape storage system.<br />
<br />
<!--T:24--><br />
You can determine whether or not a given file has been moved to tape or is still on disk using the <tt>lfs hsm_state</tt> command. "hsm" stands for "hierarchical storage manager".<br />
<br />
<!--T:47--><br />
<source lang="bash"><br />
# Here, <FILE> is only on disk.<br />
$ lfs hsm_state <FILE><br />
<FILE>: (0x00000000)<br />
<br />
# Here, <FILE> is in the process of being copied to tape.<br />
$ lfs hsm_state <FILE><br />
<FILE>: [...]: exists, [...]<br />
<br />
# Here, <FILE> is both on the disk and on tape.<br />
$ lfs hsm_state <FILE><br />
<FILE>: [...]: exists archived, [...]<br />
<br />
# Here, <FILE> is on tape but no longer on disk. There will be a lag when opening it. <br />
$ lfs hsm_state <FILE><br />
<FILE>: [...]: released exists archived, [...]<br />
</source><br />
<br />
<!--T:27--><br />
You can explicitly force a file to be recalled from tape without actually reading it with the command <code>lfs hsm_restore <FILE></code>.<br />
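<br />
If you need many files back, you can issue the recalls in a loop and check on progress afterwards; a sketch with illustrative paths:<br />
<source lang="bash"><br />
# Request recall of every file in a collection directory.<br />
for f in /nearline/def-sponsor/user/mycollection/*; do<br />
    lfs hsm_restore "$f"<br />
done<br />
<br />
# Re-check the states; files that are fully back on disk no longer show "released".<br />
lfs hsm_state /nearline/def-sponsor/user/mycollection/*<br />
</source><br />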
<br />
<!--T:29--><br />
Note that as of October 2020, the output of the command <code>diskusage_report</code>, also known as <code>quota</code>, does not report on /nearline space consumption.<br />
<br />
== Cluster-specific information == <!--T:6--><br />
<br />
<!--T:10--><br />
<tabs><br />
<tab name="Graham"><br />
/nearline is only accessible as a directory on login nodes and on DTNs (''Data Transfer Nodes'').<br />
<br />
<!--T:11--><br />
To use /nearline, just put files into your <tt>~/nearline/PROJECT</tt> directory. After a period of time (24 hours as of February 2019), they will be copied onto tape. If the file remains unchanged for another period (24 hours as of February 2019), the copy on disk will be removed, making the file virtualized on tape. <br />
<br />
<!--T:8--><br />
If you accidentally (or deliberately) delete a file from <tt>~/nearline</tt>, the tape copy will be retained for up to 60 days. To restore such a file contact [[technical support]] with the full path for the file(s) and desired version (by date), just as you would for restoring a [[Storage and file management#Filesystem quotas and policies|backup]]. Note that since you will need the full path for the file, it is important for you to retain a copy of the complete directory structure of your /nearline space. For example, you can run the command <tt>ls -R > ~/nearline_contents.txt</tt> from the <tt>~/nearline/PROJECT</tt> directory so that you have a copy of the location of all the files.<br />
</tab><br />
<br />
<!--T:16--><br />
<tab name="Cedar"><br />
/nearline service similar to that on Graham.<br />
</tab><br />
<br />
<!--T:17--><br />
<tab name="Niagara"><br />
HPSS is the /nearline service on Niagara.<br/><br />
There are three methods to access the service:<br />
<br />
<!--T:12--><br />
1. By submitting HPSS-specific commands <tt>htar</tt> or <tt>hsi</tt> to the Slurm scheduler as a job in one of the archive partitions; see [https://docs.scinet.utoronto.ca/index.php/HPSS the HPSS documentation] for detailed examples. Using job scripts offers the benefit of automating /nearline transfers and is the best method if you use HPSS regularly. Your HPSS files can be found in the $ARCHIVE directory, which is like $PROJECT but with ''/project'' replaced by ''/archive''. <br />
<br />
<!--T:13--><br />
2. To manage a small number of files in HPSS, you can use the VFS (''Virtual File System'') node, which is accessed with the command <tt>salloc --time=1:00:00 -pvfsshort</tt>. Your HPSS files can be found in the $ARCHIVE directory, which is like $PROJECT but with ''/project'' replaced by ''/archive''. <br />
<br />
<!--T:14--><br />
3. By using [[Globus]] for transfers to and from HPSS using the endpoint <b>computecanada#hpss</b>. This is useful for occasional usage and for transfers to and from other sites.<br />
<br />
<!--T:21--><br />
</tab><br />
<br />
<!--T:20--><br />
<tab name="Béluga"><br />
/nearline service similar to that on Graham.<br />
</tab><br />
</tabs><br />
<br />
</translate>
<noinclude><br />
<languages /><br />
<br />
<translate><br />
<!--T:27--><br />
</noinclude><br />
{| class="wikitable"<br />
|-<br />
| Availability: In production since June 2017<br />
|-<br />
| Login node: '''graham.computecanada.ca'''<br />
|-<br />
| Globus endpoint: '''computecanada#graham-globus'''<br />
|-<br />
| Data transfer node (rsync, scp, sftp,...): '''gra-dtn1.computecanada.ca'''<br />
|}<br />
<br />
<!--T:2--><br />
Graham is a heterogeneous cluster, suitable for a variety of workloads, and located at the University of Waterloo. It is named after [https://en.wikipedia.org/wiki/Wes_Graham Wes Graham], the first director of the Computing Centre at Waterloo.<br />
<br />
<!--T:4--><br />
The parallel filesystem and external persistent storage (called "NDC-Waterloo" in some documents) are similar to [[Cedar|Cedar's]]. The interconnect is different and there is a slightly different mix of compute nodes.<br />
<br />
<!--T:28--><br />
It is entirely liquid cooled, using rear-door heat exchangers.<br />
<br />
<!--T:33--><br />
[[Getting started with the new national systems|Getting started with Graham]]<br />
<br />
<!--T:36--><br />
[[Running_jobs|How to run jobs]]<br />
<br />
<!--T:37--><br />
[[Transferring_data|Transferring data]]<br />
<br />
= Site-specific policies = <!--T:39--><br />
<br />
<!--T:40--><br />
* By policy, Graham's compute nodes cannot access the internet. If you need an exception to this rule, contact [[Technical Support|technical support]] with the following information:<br />
<br />
<!--T:42--><br />
<pre><br />
IP: <br />
Port/s: <br />
Protocol: TCP or UDP<br />
Contact: <br />
Removal Date: <br />
</pre><br />
<br />
<!--T:43--><br />
On or after the Removal Date we will follow up with the Contact to confirm if the exception is still required.<br />
<br />
<!--T:41--><br />
* Crontab is not offered on Graham. <br />
* Each job on Graham should have a duration of at least one hour (five minutes for test jobs).<br />
* A user cannot have more than 1000 jobs, running and queued, at any given moment. An array job is counted as the number of tasks in the array.<br />
<br />
=Storage= <!--T:23--><br />
<br />
<!--T:24--><br />
{| class="wikitable sortable"<br />
|-<br />
| '''Home space'''<br />64TB total volume ||<br />
* Location of home directories.<br />
* Each home directory has a small, fixed [[Storage and file management#Filesystem_quotas_and_policies|quota]]. <br />
* Not allocated via [https://www.computecanada.ca/research-portal/accessing-resources/rapid-access-service/ RAS] or [https://www.computecanada.ca/research-portal/accessing-resources/resource-allocation-competitions/ RAC]. Larger requests go to Project space.<br />
* Has daily backup.<br />
|-<br />
| '''Scratch space'''<br />3.6PB total volume<br />Parallel high-performance filesystem ||<br />
* For active or temporary (<code>/scratch</code>) storage.<br />
* Not allocated.<br />
* Large fixed [[Storage and file management#Filesystem_quotas_and_policies|quota]] per user.<br />
* Inactive data will be purged.<br />
|-<br />
|'''Project space'''<br />16PB total volume<br />External persistent storage<br />
||<br />
* Allocated via [https://www.computecanada.ca/research-portal/accessing-resources/rapid-access-service/ RAS] or [https://www.computecanada.ca/research-portal/accessing-resources/resource-allocation-competitions/ RAC].<br />
* Not designed for parallel I/O workloads. Use Scratch space instead.<br />
* Large adjustable [[Storage and file management#Filesystem_quotas_and_policies|quota]] per project.<br />
* Has daily backup.<br />
|}<br />
<br />
=High-performance interconnect= <!--T:19--><br />
<br />
<!--T:21--><br />
Mellanox FDR (56Gb/s) and EDR (100Gb/s) InfiniBand interconnect. FDR is used for GPU and cloud nodes, EDR for other node types. A central 324-port director switch aggregates connections from islands of 1024 cores each for CPU and GPU nodes. The 56 cloud nodes are a variation on CPU nodes, and are on a single larger island sharing 8 FDR uplinks to the director switch.<br />
<br />
<!--T:29--><br />
A low-latency high-bandwidth Infiniband fabric connects all nodes and scratch storage.<br />
<br />
<!--T:30--><br />
Nodes configurable for cloud provisioning also have a 10Gb/s Ethernet network, with 40Gb/s uplinks to scratch storage.<br />
<br />
<!--T:22--><br />
Graham is designed to support multiple simultaneous parallel jobs of up to 1024 cores in a fully non-blocking manner. <br />
<br />
<!--T:31--><br />
For larger jobs the interconnect has an 8:1 blocking factor, i.e., even for jobs running on multiple islands the Graham system provides a high-performance interconnect.<br />
<br />
<!--T:32--><br />
[https://docs.computecanada.ca/mediawiki/images/b/b3/Gp3-network-topo.png Graham high performance interconnect diagram]<br />
<br />
=Visualization on Graham= <!--T:44--><br />
<br />
<!--T:45--><br />
Graham has dedicated visualization nodes available at '''gra-vdi.computecanada.ca''' that allow only VNC connections. For instructions on how to use them, see the [[VNC]] page.<br />
<br />
=Node characteristics= <!--T:5--><br />
A total of 41,548 cores and 520 GPU devices, spread across 1,185 nodes of different types. Note that Turbo Boost is enabled on all Graham nodes.<br />
<br />
<!--T:55--><br />
{| class="wikitable sortable"<br />
! nodes !! cores !! available memory !! CPU !! storage !! GPU<br />
|-<br />
| 903 || 32 || 125G or 128000M || 2 x Intel E5-2683 v4 Broadwell @ 2.1GHz || 960GB SATA SSD || -<br />
|-<br />
| 24 || 32 || 502G or 514500M || 2 x Intel E5-2683 v4 Broadwell @ 2.1GHz || 960GB SATA SSD || -<br />
|-<br />
| 56 || 32 || 250G or 256500M || 2 x Intel E5-2683 v4 Broadwell @ 2.1GHz || 960GB SATA SSD || -<br />
|-<br />
| 3 || 64 || 3022G or 3095000M || 4 x Intel E7-4850 v4 Broadwell @ 2.1GHz || 960GB SATA SSD || -<br />
|-<br />
| 160 || 32 || 124G or 127518M || 2 x Intel E5-2683 v4 Broadwell @ 2.1GHz || 1.6TB NVMe SSD || 2 x NVIDIA P100 Pascal (12GB HBM2 memory)<br />
|-<br />
| 7 || 28 || 178G or 183105M || 2 x Intel Xeon Gold 5120 Skylake @ 2.2GHz || 4.0TB NVMe SSD || 8 x NVIDIA V100 Volta (16GB HBM2 memory). <br />
Note that one node is only populated with 6 GPUs.<br />
|-<br />
| 2 || 40 || 377G or 386048M || 2 x Intel Xeon Gold 6248 Cascade Lake @ 2.5GHz || 5.0TB NVMe SSD || 8 x NVIDIA V100 Volta (32GB HBM2 memory),NVLINK<br />
|-<br />
| 6 || 16 || 187G or 191840M || 2 x Intel Xeon Silver 4110 Skylake @ 2.10GHz || 11.0TB SATA SSD || 4 x NVIDIA T4 Turing (16GB GDDR6 memory)<br />
|-<br />
| 30 || 44 || 187G or 191840M || 2 x Intel Xeon Gold 6238 Cascade Lake @ 2.10GHz || 5.8TB NVMe SSD || 4 x NVIDIA T4 Turing (16GB GDDR6 memory)<br />
|-<br />
| 136 || 44 || 187G or 191840M || 2 x Intel Xeon Gold 6238 Cascade Lake @ 2.10GHz || 879GB SATA SSD || -<br />
|}<br />
<br />
<!--T:64--><br />
Most applications will run on either Broadwell, Skylake, or Cascade Lake nodes, and performance differences are expected to be small compared to job waiting times. Therefore, we recommend that you do not select a specific node type for your jobs. If it is necessary, note that for CPU jobs only two constraints are available: use either <code>--constraint=broadwell</code> or <code>--constraint=cascade</code>. See [[Running_jobs#Cluster_particularities|how to specify the CPU architecture]].<br />
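<br />
A minimal job-script sketch pinning the CPU architecture (the account name and program are placeholders):<br />
{{File<br />
|name=cpu_arch_job.sh<br />
|lang="sh"<br />
|contents=<br />
#!/bin/bash<br />
#SBATCH --account=def-someuser<br />
#SBATCH --time=1:00:00<br />
#SBATCH --constraint=broadwell<br />
./my_program<br />
}}<br />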
<br />
<!--T:7--><br />
Best practice for local on-node storage is to use the temporary directory generated by [[Running jobs|Slurm]], <tt>$SLURM_TMPDIR</tt>. Note that this directory and its contents will disappear upon job completion.<br />
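<br />
For example, a sketch of staging files through the node-local disk (the paths and program name are placeholders):<br />
{{File<br />
|name=local_disk_job.sh<br />
|lang="sh"<br />
|contents=<br />
#!/bin/bash<br />
#SBATCH --account=def-someuser<br />
#SBATCH --time=1:00:00<br />
cp /project/def-someuser/input.dat $SLURM_TMPDIR/<br />
cd $SLURM_TMPDIR<br />
./my_program input.dat > results.dat<br />
cp results.dat /project/def-someuser/<br />
}}<br />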
<br />
<!--T:38--><br />
Note that the amount of available memory is less than the "round number" suggested by the hardware configuration. For instance, "base" nodes do have 128 GiB of RAM, but some of it is permanently occupied by the kernel and OS. To avoid wasting time on swapping/paging, the scheduler will never allocate jobs whose memory requirements exceed the specified amount of "available" memory. Please also note that the memory allocated to the job must be sufficient for I/O buffering performed by the kernel and filesystem; this means that an I/O-intensive job will often benefit from requesting somewhat more memory than the aggregate size of its processes.<br />
<br />
= GPUs on Graham = <!--T:56--><br />
Graham contains Tesla GPUs from three different generations, listed here in order of age, from oldest to newest.<br />
* P100 Pascal GPUs<br />
* V100 Volta GPUs (including 2 nodes with NVLINK interconnect)<br />
* T4 Turing GPUs<br />
<br />
<!--T:57--><br />
P100 is NVIDIA's all-purpose high-performance card. V100 is its successor, with about double the performance for standard computation and about 8x the performance for deep learning computations that can use its tensor core computation units. T4 Turing is the latest card, targeted specifically at deep learning workloads: it does not support efficient double-precision computations, but it has good single-precision performance, and it also has tensor cores plus support for reduced-precision integer calculations.<br />
<br />
== Pascal GPU nodes on Graham == <!--T:58--><br />
<br />
<!--T:59--><br />
These are Graham's default GPU cards. Job submission for these cards is described on the [[Using GPUs with Slurm]] page. When a job simply requests a GPU with <code>--gres=gpu:1</code> or <code>--gres=gpu:2</code>, it may be assigned any available type of GPU. If you require a specific type of GPU, please request it. As all Pascal nodes have only 2 P100 GPUs, configuring jobs using these cards is relatively simple.<br />
<br />
==Volta GPU nodes on Graham== <!--T:46--><br />
Graham has a total of 9 Volta nodes.<br />
In 7 of these, four GPUs are connected to each CPU socket (except for one node, which is only populated with 6 GPUs, three per socket). The other 2 have high bandwidth NVLINK interconnect.<br />
<br />
<!--T:50--><br />
'''The nodes are available to all users with a maximum 7 days job runtime limit.''' <br />
<br />
<!--T:51--><br />
Following is an example job script to submit a job to one of the nodes (with 8 GPUs). The module load command will ensure that modules compiled for Skylake architecture will be used. Replace nvidia-smi with the command you want to run.<br />
<br />
<!--T:52--><br />
'''Important''': You should scale the number of CPUs requested, keeping the ratio of CPUs to GPUs at 3.5 or less on 28 core nodes. For example, if you want to run a job using 4 GPUs, you should request '''at most 14 CPU cores'''. For a job with 1 GPU, you should request '''at most 3 CPU cores'''. Users are allowed to run a few short test jobs (shorter than 1 hour) that break this rule to see how your code performs.<br />
<br />
<!--T:65--><br />
The two newest Volta nodes have 40 cores so the number of cores requested per GPU should be adjusted upwards accordingly, i.e. you can use 5 CPU cores per GPU. They also have NVLINK, which can provide huge benefits for situations where memory bandwidth between GPUs is the bottleneck. If you want to use one of these NVLINK nodes, you should request it directly by adding the <code>--constraint=cascade,v100</code> parameter to the job submission script.<br />
<br />
<!--T:53--><br />
Single-GPU example:<br />
{{File<br />
|name=gpu_single_GPU_job.sh<br />
|lang="sh"<br />
|contents=<br />
#!/bin/bash<br />
#SBATCH --account=def-someuser<br />
#SBATCH --gres=gpu:v100:1<br />
#SBATCH --cpus-per-task=3<br />
#SBATCH --mem=12G<br />
#SBATCH --time=1-00:00<br />
module load arch/avx512 StdEnv/2018.3<br />
nvidia-smi<br />
}}<br />
Full-node example:<br />
{{File<br />
|name=gpu_single_node_job.sh<br />
|lang="sh"<br />
|contents=<br />
#!/bin/bash<br />
#SBATCH --account=def-someuser<br />
#SBATCH --nodes=1<br />
#SBATCH --gres=gpu:v100:8<br />
#SBATCH --cpus-per-task=28<br />
#SBATCH --mem=150G<br />
#SBATCH --time=1-00:00<br />
module load arch/avx512 StdEnv/2018.3<br />
nvidia-smi<br />
}}<br />
<br />
<!--T:54--><br />
The Volta nodes have a fast local disk, which should be used for jobs if the amount of I/O performed by your job is significant. Inside the job, the location of the temporary directory on fast local disk is specified by the environment variable $SLURM_TMPDIR. You can copy your input files there at the start of your job script before you run your program and your output files out at the end of your job script. All the files in $SLURM_TMPDIR will be removed once the job ends, so you do not have to clean up that directory yourself. You can even create Python virtual environments in this temporary space for greater efficiency. Please see the [[Python#Creating_virtual_environments_inside_of_your_jobs|information on how to do this]].<br />
<br />
==Turing GPU nodes on Graham== <!--T:60--><br />
<br />
<!--T:61--><br />
The usage of these nodes is similar to using the Volta nodes, except when requesting them, you should specify: <br />
<br />
<!--T:62--><br />
--gres=gpu:t4:2<br />
<br />
<!--T:63--><br />
In this example 2 T4 cards per node are requested.<br />
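<br />
A minimal job-script sketch for the T4 nodes (the CPU and memory values are illustrative; scale them to your workload, keeping the CPU-to-GPU ratio guideline in mind):<br />
{{File<br />
|name=gpu_t4_job.sh<br />
|lang="sh"<br />
|contents=<br />
#!/bin/bash<br />
#SBATCH --account=def-someuser<br />
#SBATCH --gres=gpu:t4:2<br />
#SBATCH --cpus-per-task=8<br />
#SBATCH --mem=32G<br />
#SBATCH --time=1-00:00<br />
nvidia-smi<br />
}}<br />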
<br />
<br />
<br />
<!--T:14--><br />
<noinclude><br />
</translate><br />
</noinclude>
<languages /><br />
<translate><br />
<!--T:1--><br />
[https://www.globus.org/ Globus] is a service for fast, reliable, secure transfer of files. Designed specifically for researchers, Globus has an easy-to-use interface with background monitoring features that automate the management of file transfers between any two resources, whether they are at Compute Canada, another supercomputing facility, a campus cluster, lab server, desktop or laptop.<br />
<br />
<!--T:2--><br />
Globus leverages GridFTP for its transfer protocol but shields the end user from complex and time consuming tasks related to GridFTP and other aspects of data movement. It improves transfer performance over GridFTP, rsync, scp, and sftp, by automatically tuning transfer settings, restarting interrupted transfers, and checking file integrity.<br />
<br />
<!--T:3--><br />
Globus can be accessed via the main [https://www.globus.org/ Globus website] or via the Compute Canada Globus portal at [https://globus.computecanada.ca https://globus.computecanada.ca].<br />
<br />
== Using Globus == <!--T:4--><br />
Go to [http://globus.computecanada.ca http://globus.computecanada.ca]. Your "existing organizational login" is your CCDB account. Ensure that "Compute Canada" is selected in the drop-down, then click Continue. Supply your CCDB username (not your e-mail address or other identifier) and password on the Compute Canada MyProxy page which appears. This takes you to the web portal for Globus.<br />
<br />
<!--T:46--><br />
[[File:1st-panel.png|400px|thumb|none| CC Globus Authentication page. (Click for larger image.)]]<br />
<br />
=== To Start a Transfer === <!--T:5--><br />
<br />
<!--T:23--><br />
Globus transfers happen between "collections" (formerly known as "endpoints" in previous Globus versions). Most Compute Canada systems have some standard collections set up for you to use. To transfer files to and from your computer, you need to create a collection for it. This requires a bit of setup initially, but once it has been done, transfers via Globus require little more than making sure the Globus Connect Personal software is running on your machine. More on this below under [[#Personal Computers|Personal Computers]].<br />
<br />
<!--T:6--><br />
If the [https://globus.computecanada.ca/file-manager File Manager page in the Globus Portal] is not already showing (see image), select it from the left sidebar.<br />
<br />
<!--T:49--><br />
[[File:Globus-file-manager.png|400px|thumb|none| Globus File Manager. (Click for larger image.)]]<br />
<br />
<!--T:7--><br />
On the top right of the page there are three buttons labelled "Panels". Select the second button (this will allow you to see two collections at the same time).<br />
<br />
<!--T:50--><br />
Find collections by clicking where the page says "Search" and entering a collection name. <br />
<br />
<!--T:51--><br />
[[File:Globus-select-collection.png|400px|thumb|none| Selecting a Globus collection. (Click for larger image.)]]<br />
<br />
<!--T:52--><br />
You can start typing a collection name to select it. For example, if you want to transfer data to or from the Béluga cluster, type "beluga", wait two seconds for a list of matching sites to appear, and select <code>computecanada#beluga-dtn</code>. <br />
<br />
<!--T:57--><br />
All Compute Canada resources have names prefixed with <code>computecanada#</code>. For example, [https://globus.computecanada.ca/file-manager?origin_id=278b9bfe-24da-11e9-9fa2-0a06afd4a22e <code>computecanada#beluga-dtn</code>], [https://globus.computecanada.ca/file-manager?origin_id=c99fd40c-5545-11e7-beb6-22000b9a448b <code>computecanada#cedar-dtn</code>], [https://globus.computecanada.ca/file-manager?origin_id=07baf15f-d7fd-4b6a-bf8a-5b5ef2e229d3 <code>computecanada#graham-globus</code>] or [https://globus.computecanada.ca/file-manager?origin_id=77506016-4a51-11e8-8f88-0a6d4e044368 <code>computecanada#niagara</code>] (note that 'dtn' stands for 'data transfer node').<br />
<br />
<!--T:58--><br />
You may be prompted to "authenticate" the collection, depending on which site is hosting it. For example, if you are activating a collection hosted on Graham, you will be asked for your Compute Canada username and password. The authentication of a collection remains valid for some time: typically one week for CC collections, while personal collections do not expire.<br />
<br />
<!--T:8--><br />
Now select a second collection, searching for it and authenticating if required.<br />
<br />
<!--T:9--><br />
Once a collection has been activated you should see a list of directories and files. You can navigate these by double-clicking on directories and using the "up one folder" button. Highlight a file or directory that you want to transfer by single-clicking on it. Control-click to highlight multiple items. Then click one of the big blue buttons with white arrowheads to initiate the transfer. The transfer job will be given a unique id and will begin right away. You will receive an email when the transfer is complete. You can also monitor in-progress transfers and view details of completed transfers from the [https://globus.computecanada.ca/activity "Activity" button] on the left-hand side.<br />
<br />
<!--T:53--><br />
[[File:Globus-Initiate-Transfer.png|400px|thumb|none| Initiating a transfer. Note the highlighted file in the left-hand pane. (Click for larger image.)]]<br />
<br />
<!--T:10--><br />
See also [https://docs.globus.org/how-to/get-started/ How To Log In and Transfer Files with Globus] at the Globus.org site.<br />
<br />
=== Options === <!--T:11--><br />
<br />
<!--T:12--><br />
Globus provides several other options under "Transfer & Sync Options", between the two "Start" buttons in the middle of the screen. Here you can direct Globus to<br />
* sync - only transfer new or changed files<br />
* delete files on destination that do not exist on source<br />
* preserve source file modification times<br />
* verify file integrity after transfer (on by default)<br />
* encrypt transfer<br />
Note that enabling encryption significantly reduces transfer performance, so it should only be used for sensitive data.<br />
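<br />
The same sync behaviour is available from the command line; a hedged sketch using the [https://docs.globus.org/cli/ Globus CLI] (the endpoint UUIDs are placeholders, and flag names may vary between CLI versions; check <code>globus transfer --help</code> on your system):<br />
<source lang="bash"><br />
# Re-run a directory transfer so that only new or changed files are copied.<br />
globus transfer SRC_UUID:/path/dir/ DST_UUID:/path/dir/ --recursive --sync-level mtime<br />
</source><br />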
<br />
=== Personal Computers === <!--T:13--><br />
<br />
<!--T:14--><br />
Globus provides a desktop client, [https://www.globus.org/globus-connect-personal Globus Connect Personal], to make it easy to transfer files to and from a personal computer running Windows, MacOS X, or Linux.<br />
<br />
<!--T:54--><br />
There are links on the [https://www.globus.org/globus-connect-personal Globus Connect Personal] page which walk you through the setup of Globus Connect Personal on the various operating systems, including setting it up from the command line on Linux. If you are running Globus Connect Personal from the command line on Linux, this [https://docs.globus.org/faq/globus-connect-endpoints/#how_do_i_configure_accessible_directories_on_globus_connect_personal_for_linux FAQ on the Globus site] describes configuring which paths you share and their permissions.<br />
<br />
====To install Globus Connect Personal==== <!--T:15--><br />
<br />
<!--T:55--><br />
[[File:Install-globus-connect-personal.png|400px|thumb|none| Finding the installation button. (Click for larger image.)]]<br />
<br />
<!--T:16--><br />
# Go to the [https://globus.computecanada.ca/endpoints?scope=administered-by-me Compute Canada Globus portal] and log in if you have not already done so.<br />
# From the File Manager screen click on the "Endpoints" icon on the left hand side.<br />
# Click on the "Create a personal endpoint" button in the top right corner of the screen<br />
# Enter an "Endpoint Display Name" of your choice, which you will use to access the computer you will be installing Globus Connect Personal on. Example: MacLaptop or WorkPC.<br />
# Click on the download link for your operating system (you may need to click "Show me other supported operating systems" if downloading for another computer)<br />
# Install Globus Connect Personal<br />
# You should now be able to access the endpoint through Globus. The full endpoint name is [your username]#[the name you gave during setup]. Example: smith#WorkPC<br />
<br />
====To run Globus Connect Personal==== <!--T:24--><br />
<br />
<!--T:25--><br />
The above steps are only needed once, to setup the endpoint. For further file transfer operations, one has to make sure Globus Connect Personal is running, i.e., start the program, and ensure that the endpoint isn't paused.<br />
<br />
<!--T:56--><br />
[[File:gcp-applet.png|400px|thumb|none| Globus Connect Personal application for a personal endpoint.]]<br />
<br />
<!--T:26--><br />
Note that if the Globus Connect Personal program at your end point is closed during a file transfer to or from that endpoint, the transfer will stop. To restart the transfer, simply reopen the program.<br />
<br />
====Transfer between two personal endpoints==== <!--T:27--><br />
<br />
<!--T:28--><br />
Although you can create endpoints for any number of personal computers, transfers between two personal endpoints are not enabled by default. If you need this capability, please contact <br />
[mailto:globus@computecanada.ca globus@computecanada.ca] to setup a "Globus Plus" account.<br />
<br />
<!--T:17--><br />
For more information see the [https://docs.globus.org/how-to/ Globus.org how-to pages], particularly:<br />
* [https://docs.globus.org/how-to/globus-connect-personal-mac Globus Connect Personal for Mac OS X]<br />
* [https://docs.globus.org/how-to/globus-connect-personal-windows Globus Connect Personal for Windows]<br />
* [https://docs.globus.org/how-to/globus-connect-personal-linux Globus Connect Personal for Linux]<br />
<br />
== Globus Sharing == <!--T:18--><br />
<br />
<!--T:19--><br />
Globus sharing makes collaboration with your colleagues easy. Sharing enables people to access files stored on your account on a Compute Canada system even if the other user does not have an account on that system. Files can be shared with any user, anywhere in the world, who has a Globus account. See [https://docs.globus.org/how-to/share-files/ How To Share Data Using Globus].<br />
<br />
=== Creating a Shared Collection === <!--T:29--><br />
<br />
<!--T:30--><br />
Sharing a file or folder on an endpoint first requires that the system hosting the files has sharing enabled.<br />
<br />
<!--T:59--><br />
{{Panel<br />
|panelstyle=callout<br />
|title=Globus sharing is disabled on Niagara.<br />
|content=<br />
Globus sharing is disabled on Niagara.<br />
}}<br />
<br />
<!--T:60--><br />
{{Panel<br />
|panelstyle=callout<br />
|title=Project Requires Permission to Share<br />
|content=<br />
To create a Globus Share on project for the other Compute Canada systems the PI will need to contact [mailto:globus@computecanada.ca globus@computecanada.ca] with:<br />
<br />
<!--T:75--><br />
* Confirmation that they want Globus sharing enabled <br />
* The path to enable <br />
* Whether the share will be read-only, or read and write<br />
<br />
<!--T:76--><br />
Data to be shared will need to be moved or copied into this path. Creating a symbolic link to the data will not allow access to the data.<br />
<br />
<!--T:77--><br />
Otherwise you will receive the error:<br />
<br />
<!--T:61--><br />
"The backend responded with an error: You do not have permission to create a shared endpoint on the selected path. The administrator of this endpoint has disabled creation of shared endpoints on the selected path."<br />
<br />
<!--T:62--><br />
Globus sharing is enabled for the home directory. By default we disable sharing on project to prevent users from accidentally sharing other users' files. If you would like to test a Globus share, you can create one in your home directory.<br />
<br />
<!--T:63--><br />
We suggest using a path that makes it clear to everyone that files in the directory might be shared such as:<br />
<br />
<!--T:64--><br />
/project/my-project-id/Sharing<br />
<br />
<!--T:65--><br />
Once we have enabled sharing on the path you will be able to create a new Globus shared endpoint for any sub directory under that path. So for example you will be able to create the sub directories:<br />
<br />
<!--T:66--><br />
/project/my-project-id/Sharing/Subdir-01<br />
<br />
<!--T:67--><br />
and<br />
<br />
<!--T:68--><br />
/project/my-project-id/Sharing/Subdir-02<br />
<br />
<!--T:69--><br />
Create a different Globus Share for each and share them with different users.<br />
<br />
<!--T:78--><br />
If you would like to have a Globus Share created on /project for one of these systems please email globus@computecanada.ca.<br />
}}<br />
<br />
<!--T:31--><br />
Log into [https://globus.computecanada.ca globus.computecanada.ca] with your Globus credentials. Once you are logged in, you will see a transfer window. In the ‘endpoint’ field, type the endpoint identifier for the endpoint you wish to share from (e.g. computecanada#beluga-dtn, computecanada#cedar-dtn, computecanada#graham-globus, computecanada#niagara etc.) and activate the endpoint, if asked to. <br />
<br />
<!--T:70--><br />
Select a folder that you wish to share, then click the "Share" button to the right of the folder list.<br />
[[File:Globus SharedEndpoint1-1024x607.png|thumb|Open "Share" option (Click for larger image.)]]<br />
<br />
<!--T:71--><br />
Click on the "Add a Guest Collection" button in the top right corner of the screen. <br />
[[File:Globus SharedEndpoint2.png|thumbnail|Click on "Add a Guest Collection" (Click for larger image.)]]<br />
<br />
<!--T:72--><br />
Give the new share a name that is easy for you and the people you intend to share it with to find. You can also adjust from where you want to share using the "Browse" button.<br />
[[File:Globus SharedEndpoint3-1024x430.png|thumbnail|Managing a Shared Endpoint]]<br />
<br />
===Managing Access=== <!--T:32--><br />
Once the endpoint is created, you will be shown the current access list, with only your account on it. Since sharing is of little use without someone to share with, click the ‘Add Permissions -- Share With’ button to add people or groups that you wish to share with.<br />
<br />
<!--T:33--><br />
You will now be prompted to select whether to share with people via email, username, or group.<br />
* E-mail is a good choice if you don’t know a person’s username on Globus. It will also allow you to share with people who do not currently have a Globus account, though they will need to create one to be able to access your share.<br />
* User presents a search box that allows you to search by name or Globus username. This is best if someone already has a Globus account, as it does not require any action on their part to be added to the share. Enter a name or Globus username (if you know it), and select the appropriate match from the list, then click ‘Use Selected’<br />
* Group allows you to share with a number of people simultaneously. You can search by group name or UUID. Group names may be ambiguous, so be sure to verify you are sharing with the correct one. This can be avoided by using the group’s UUID, which is available on the Groups page (See Groups Section)<br />
[[File:Globus ManagingAccess-1024x745.png|thumbnail|Managing Shared Endpoint Permissions]]<br />
To add or remove write permissions from a user, click the checkbox next to their name under the write column. It is not possible to remove read access.<br />
<br />
<!--T:73--><br />
[[File:Globus-Add-Permissions.png|thumb|Send an invitation to a share]]<br />
<br />
<!--T:34--><br />
Deleting users or groups from the list of people you are sharing with is as simple as clicking the ‘x’ at the end of the line containing their information.<br />
<br />
===Removing a Shared Collection=== <!--T:35--><br />
You can remove a Shared Collection once you no longer need it. [https://globus.computecanada.ca/endpoints?scope=shared-by-me To do this, click on endpoints, and click on the "Shareable by You" tab.]<br />
<br />
<!--T:36--><br />
Click on the title of the "Shared Collection" you want to remove. Click on the "Delete Endpoint" on the right hand side of the screen. Confirm deleting it by clicking on the red button.<br />
[[File:Globus RemovingSharedEndpoint-1024x322.png|thumbnail|Removing a Shared Endpoint]]<br />
<br />
<!--T:74--><br />
The endpoint is now deleted. Your files will not be affected by this action, nor will any files that others may have uploaded.<br />
<br />
===Sharing Security=== <!--T:37--><br />
<br />
<!--T:38--><br />
Sharing files entails a certain level of risk. By creating a share, you are opening up files that up to now have been in your exclusive control to others. The following list is some things to think about before sharing, though it is far from comprehensive.<br />
<br />
*Make sure you have permission to share the files, if you are not the data’s owner<br />
*Make sure you are sharing with only those you intend to. Verify the person you add to the access list is the person you think, there are often people with the same or similar names. Remember that Globus usernames are not linked to Compute Canada usernames. The recommended method of sharing is to use the email address of the person you wish to share with, unless you have the exact account name.<br />
*If you are sharing with a group you do not control, make sure you trust the owner of the group. They may add people who are not authorized to access your files.<br />
*If granting write access, make sure that you have backups of important files that are not on the shared endpoint, as users of the shared endpoint may delete or overwrite files, and do anything that you yourself can do to a file.<br />
*It is highly recommended that sharing be restricted to a subdirectory, rather than your top-level home directory.<br />
== Globus Groups == <!--T:20--><br />
Globus groups provide an easy way to manage permissions for sharing with multiple users. When you create a group, you can use it from the sharing interface easily to control access for multiple users. <br />
<br />
=== Creating a Group === <!--T:39--><br />
Click on the [https://globus.computecanada.ca/groups "Groups" button] on the left hand sidebar. Click on the "Create New Group" button on the top right of the screen. Pressing this button brings up the ‘Create New Group’ window.<br />
[[File:Globus CreatingNewGroup-1024x717.png|thumbnail|Creating a Globus Group]]<br />
*Enter the name of the group in the ‘Group Name’ field<br />
*Enter the group description in the ‘Group Description’ field<br />
*Select if the group is visible to only group members (private group) or all Globus users.<br />
*Click ‘Create Group’ to add the group.<br />
<br />
=== Inviting Users === <!--T:40--><br />
Once a group has been created, users can be added by selecting ‘Invite users’, and then either entering an email address (preferred) or searching for the username. Once users have been selected for invitation, click the add button and they will be sent an email inviting them to join. Once they’ve accepted, they will be visible in the group.<br />
<br />
=== Modifying Membership === <!--T:41--><br />
Click on a user to modify their membership. You can change their Role and Status. Role allows you to grant permissions to the user, including Admin (Full access), Manager (Change user roles), or Member (no management functions). The ‘Save Changes’ button commits the changes.<br />
<br />
==Command Line Interface (CLI) == <!--T:45--><br />
===Installing===<br />
The Globus command line interface is a python module which can be installed using pip. Below are the steps to install Globus CLI on one of our clusters.<br />
# Create a virtual environment to install the Globus CLI into (see [[Python#Creating_and_using_a_virtual_environment|creating and using a virtual environment]]).<source lang='console'>$ virtualenv $HOME/.globus-cli-virtualenv</source><br />
# Activate the virtual environment. <source lang='console'>$ source $HOME/.globus-cli-virtualenv/bin/activate</source><br />
# Install the Globus CLI into the virtual environment (see [[Python#Installing_modules|installing modules]]).<source lang='console'>$ pip install globus-cli</source><br />
# Then deactivate the virtual environment.<source lang='console'>$ deactivate</source><br />
# To avoid having to load the virtual environment every time before using Globus, you can add it to your path. <source lang='console'>$ export PATH=$PATH:$HOME/.globus-cli-virtualenv/bin<br />
$ echo 'export PATH=$PATH:$HOME/.globus-cli-virtualenv/bin'>>$HOME/.bashrc</source><br />
See the Globus docs page on [https://docs.globus.org/cli/installation/ installation] for information on installing on different platforms, updating, and uninstalling.<br />
===Using===<br />
* See the Globus [https://docs.globus.org/cli/ Command Line Interface (CLI) documentation] to learn about using the CLI.<br />
* Also see the Globus [https://docs.globus.org/cli/legacy/ Hosted Command Line Interface (Legacy) documentation], which allows authentication with SSH keys instead of requiring web based authentication.<br />
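<br />
A short sketch of common CLI operations, assuming you have already authenticated with <code>globus login</code>; the endpoint UUIDs are placeholders:<br />
<source lang="bash"><br />
# Find an endpoint's UUID by name.<br />
globus endpoint search "computecanada#graham-globus"<br />
<br />
# List files on an endpoint.<br />
globus ls ENDPOINT_UUID:/home/username<br />
<br />
# Submit a transfer between two endpoints.<br />
globus transfer SRC_UUID:/path/file DST_UUID:/path/file --label "example transfer"<br />
</source><br />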
===Scripting===<br />
* There is also a Python API, see the [https://globus-sdk-python.readthedocs.io/en/stable/ globus sdk python documentation].<br />
<br />
== Virtual Machine (Cloud VMs such as Arbutus, Cedar, East Cloud, Graham) == <!--T:79--><br />
Globus Endpoints exist for the cluster systems (Béluga, Cedar, Graham, Niagara, etc.) but not for cloud VMs. The reason is that there is no single storage system behind each VM, so we cannot create one endpoint that serves everyone.<br />
<br />
<!--T:80--><br />
If you need a Globus Endpoint on your VM and can't use another transfer mechanism there are two options for installing a Globus Endpoint: Globus Connect Personal, and Globus Connect Server.<br />
<br />
=== Globus Connect Personal === <!--T:81--><br />
Globus Connect Personal is easier to install, manage, and get through the firewall, but it is designed to be installed on laptops and desktops.<br />
<br />
<!--T:82--><br />
Install Globus Connect Personal on Windows:<br />
https://docs.globus.org/how-to/globus-connect-personal-windows/<br />
<br />
<!--T:83--><br />
Install Globus Connect Personal on Linux:<br />
https://docs.globus.org/how-to/globus-connect-personal-linux/<br />
<br />
=== Globus Connect Server === <!--T:84--><br />
Globus Connect Server is designed for headless (command-line only, no GUI) installations and has some additional features that most users will not need (such as the ability to add multiple servers to an endpoint). It does require opening some ports to allow transfers to occur (see https://docs.globus.org/globus-connect-server/v5/#open-tcp-ports_section).<br />
<br />
== Support and More Information == <!--T:21--><br />
If you would like more information on Compute Canada’s use of Globus, or require support in using this service, please send an email to [mailto:globus@computecanada.ca globus@computecanada.ca] and provide the following information:<br />
<br />
<!--T:22--><br />
* Name<br />
* Compute Canada Role Identifier (CCRI)<br />
* Institution<br />
* Inquiry or issue. Be sure to indicate which sites you want to transfer to and from.<br />
<br />
<!--T:44--><br />
[[Category:Connecting]]<br />
</translate>
<hr />
<div><noinclude><br />
<languages /><br />
<br />
<translate><br />
<!--T:27--><br />
</noinclude><br />
{| class="wikitable"<br />
|-<br />
| Availability: In production since June 2017<br />
|-<br />
| Login node: '''graham.computecanada.ca'''<br />
|-<br />
| Globus endpoint: '''computecanada#graham-globus'''<br />
|-<br />
| Data transfer node (rsync, scp, sftp,...): '''gra-dtn1.computecanada.ca'''<br />
|}<br />
<br />
<!--T:2--><br />
Graham is a heterogeneous cluster, suitable for a variety of workloads, and located at the University of Waterloo. It is named after [https://en.wikipedia.org/wiki/Wes_Graham Wes Graham], the first director of the Computing Centre at Waterloo.<br />
<br />
<!--T:4--><br />
The parallel filesystem and external persistent storage (called "NDC-Waterloo" in some documents) are similar to [[Cedar|Cedar's]]. The interconnect is different and there is a slightly different mix of compute nodes.<br />
<br />
<!--T:28--><br />
It is entirely liquid cooled, using rear-door heat exchangers.<br />
<br />
<!--T:33--><br />
[[Getting started with the new national systems|Getting started with Graham]]<br />
<br />
<!--T:36--><br />
[[Running_jobs|How to run jobs]]<br />
<br />
<!--T:37--><br />
[[Transferring_data|Transferring data]]<br />
<br />
= Site-specific policies = <!--T:39--><br />
<br />
<!--T:40--><br />
* By policy, Graham's compute nodes cannot access the internet. If you need an exception to this rule, contact [[Technical Support|technical support]] with the following information:<br />
<br />
<!--T:42--><br />
<pre><br />
IP: <br />
Port/s: <br />
Protocol: TCP or UDP<br />
Contact: <br />
Removal Date: <br />
</pre><br />
<br />
<!--T:43--><br />
On or after the Removal Date we will follow up with the Contact to confirm if the exception is still required.<br />
<br />
<!--T:41--><br />
* Crontab is not offered on Graham. <br />
* Each job on Graham should have a duration of at least one hour (five minutes for test jobs).<br />
* A user cannot have more than 1000 jobs, running and queued, at any given moment. An array job is counted as the number of tasks in the array.<br />
<br />
=Storage= <!--T:23--><br />
<br />
<!--T:24--><br />
{| class="wikitable sortable"<br />
|-<br />
| '''Home space'''<br />64TB total volume ||<br />
* Location of home directories.<br />
* Each home directory has a small, fixed [[Storage and file management#Filesystem_quotas_and_policies|quota]]. <br />
* Not allocated via [https://www.computecanada.ca/research-portal/accessing-resources/rapid-access-service/ RAS] or [https://www.computecanada.ca/research-portal/accessing-resources/resource-allocation-competitions/ RAC]. Larger requests go to Project space.<br />
* Has daily backup.<br />
|-<br />
| '''Scratch space'''<br />3.6PB total volume<br />Parallel high-performance filesystem ||<br />
* For active or temporary (<code>/scratch</code>) storage.<br />
* Not allocated.<br />
* Large fixed [[Storage and file management#Filesystem_quotas_and_policies|quota]] per user.<br />
* Inactive data will be purged.<br />
|-<br />
|'''Project space'''<br />16PB total volume<br />External persistent storage<br />
||<br />
* Allocated via [https://www.computecanada.ca/research-portal/accessing-resources/rapid-access-service/ RAS] or [https://www.computecanada.ca/research-portal/accessing-resources/resource-allocation-competitions/ RAC].<br />
* Not designed for parallel I/O workloads. Use Scratch space instead.<br />
* Large adjustable [[Storage and file management#Filesystem_quotas_and_policies|quota]] per project.<br />
* Has daily backup.<br />
|}<br />
<br />
=High-performance interconnect= <!--T:19--><br />
<br />
<!--T:21--><br />
Mellanox FDR (56Gb/s) and EDR (100Gb/s) InfiniBand interconnect. FDR is used for GPU and cloud nodes, EDR for other node types. A central 324-port director switch aggregates connections from islands of 1024 cores each for CPU and GPU nodes. The 56 cloud nodes are a variation on CPU nodes, and are on a single larger island sharing 8 FDR uplinks to the director switch.<br />
<br />
<!--T:29--><br />
A low-latency, high-bandwidth InfiniBand fabric connects all nodes and scratch storage.<br />
<br />
<!--T:30--><br />
Nodes configurable for cloud provisioning also have a 10Gb/s Ethernet network, with 40Gb/s uplinks to scratch storage.<br />
<br />
<!--T:22--><br />
Graham is designed to support multiple simultaneous parallel jobs of up to 1024 cores each in a fully non-blocking manner.<br />
<br />
<!--T:31--><br />
For larger jobs the interconnect has an 8:1 blocking factor; even for jobs running across multiple islands, Graham still provides a high-performance interconnect.<br />
<br />
<!--T:32--><br />
[https://docs.computecanada.ca/mediawiki/images/b/b3/Gp3-network-topo.png Graham high performance interconnect diagram]<br />
<br />
=Visualization on Graham= <!--T:44--><br />
<br />
<!--T:45--><br />
Graham has dedicated visualization nodes available at '''gra-vdi.computecanada.ca''' that allow only VNC connections. For instructions on how to use them, see the [[VNC]] page.<br />
<br />
=Node characteristics= <!--T:5--><br />
A total of 41,548 cores and 520 GPU devices, spread across 1,185 nodes of different types; note that Turbo Boost is activated on all Graham nodes.<br />
<br />
<!--T:55--><br />
{| class="wikitable sortable"<br />
! nodes !! cores !! available memory !! CPU !! storage !! GPU<br />
|-<br />
| 903 || 32 || 125G or 128000M || 2 x Intel E5-2683 v4 Broadwell @ 2.1GHz || 960GB SATA SSD || -<br />
|-<br />
| 24 || 32 || 502G or 514500M || 2 x Intel E5-2683 v4 Broadwell @ 2.1GHz || 960GB SATA SSD || -<br />
|-<br />
| 56 || 32 || 250G or 256500M || 2 x Intel E5-2683 v4 Broadwell @ 2.1GHz || 960GB SATA SSD || -<br />
|-<br />
| 3 || 64 || 3022G or 3095000M || 4 x Intel E7-4850 v4 Broadwell @ 2.1GHz || 960GB SATA SSD || -<br />
|-<br />
| 160 || 32 || 124G or 127518M || 2 x Intel E5-2683 v4 Broadwell @ 2.1GHz || 1.6TB NVMe SSD || 2 x NVIDIA P100 Pascal (12GB HBM2 memory)<br />
|-<br />
| 7 || 28 || 178G or 183105M || 2 x Intel Xeon Gold 5120 Skylake @ 2.2GHz || 4.0TB NVMe SSD || 8 x NVIDIA V100 Volta (16GB HBM2 memory). <br />
Note that one node is only populated with 6 GPUs.<br />
|-<br />
| 2 || 40 || 377G or 386048M || 2 x Intel Xeon Gold 6248 Cascade Lake @ 2.5GHz || 5.0TB NVMe SSD || 8 x NVIDIA V100 Volta (32GB HBM2 memory), NVLINK<br />
|-<br />
| 6 || 16 || 187G or 191840M || 2 x Intel Xeon Silver 4110 Skylake @ 2.10GHz || 11.0TB SATA SSD || 4 x NVIDIA T4 Turing (16GB GDDR6 memory)<br />
|-<br />
| 30 || 44 || 187G or 191840M || 2 x Intel Xeon Gold 6238 Cascade Lake @ 2.10GHz || 5.8TB NVMe SSD || 4 x NVIDIA T4 Turing (16GB GDDR6 memory)<br />
|-<br />
| 136 || 44 || 187G or 191840M || 2 x Intel Xeon Gold 6238 Cascade Lake @ 2.10GHz || 879GB SATA SSD || -<br />
|}<br />
<br />
<!--T:64--><br />
Most applications will run on either Broadwell, Skylake, or Cascade Lake nodes, and performance differences are expected to be small compared to job waiting times. Therefore we recommend that you do not select a specific node type for your jobs. If it is necessary, note that for CPU jobs only two constraints are available: use either <code>--constraint=broadwell</code> or <code>--constraint=cascade</code>. See [[Running_jobs#Cluster_particularities|how to specify the CPU architecture]].<br />
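For example, a minimal sketch of a CPU job restricted to Broadwell nodes (<code>def-someuser</code> and <code>my_program</code> are placeholders):<br />
{{File<br />
|name=cpu_broadwell_job.sh<br />
|lang="sh"<br />
|contents=<br />
#!/bin/bash<br />
#SBATCH --account=def-someuser    # placeholder account<br />
#SBATCH --constraint=broadwell    # run only on Broadwell nodes<br />
#SBATCH --cpus-per-task=1<br />
#SBATCH --mem-per-cpu=1000M<br />
#SBATCH --time=1:00:00<br />
./my_program                      # placeholder for your command<br />
}}<br />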
<br />
<!--T:7--><br />
Best practice for local on-node storage is to use the temporary directory generated by [[Running jobs|Slurm]], <tt>$SLURM_TMPDIR</tt>. Note that this directory and its contents will disappear upon job completion.<br />
<br />
<!--T:38--><br />
Note that the amount of available memory is less than the "round number" suggested by hardware configuration. For instance, "base" nodes do have 128 GiB of RAM, but some of it is permanently occupied by the kernel and OS. To avoid wasting time by swapping/paging, the scheduler will never allocate jobs whose memory requirements exceed the specified amount of "available" memory. Please also note that the memory allocated to the job must be sufficient for IO buffering performed by the kernel and filesystem - this means that an IO-intensive job will often benefit from requesting somewhat more memory than the aggregate size of processes.<br />
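For example, on a 32-core base node with 125G (128000M) available, either of the following fits, while requesting 128G would exclude these nodes:<br />
<pre><br />
#SBATCH --mem=125G           # whole-node request that fits a base node<br />
#SBATCH --mem-per-cpu=3900M  # per-core request: 32 x 3900M = 124800M < 128000M<br />
</pre><br />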
<br />
= GPUs on Graham = <!--T:56--><br />
Graham contains Tesla GPUs from three different generations, listed here in order of age, from oldest to newest.<br />
* P100 Pascal GPUs<br />
* V100 Volta GPUs (including 2 nodes with NVLINK interconnect)<br />
* T4 Turing GPUs<br />
<br />
<!--T:57--><br />
P100 is NVIDIA's all-purpose high performance card. V100 is its successor, with about double the performance for standard computation, and about 8X performance for deep learning computations which can utilize its tensor core computation units. T4 Turing is the latest card targeted specifically at deep learning workloads - it does not support efficient double precision computations, but it has good performance for single precision, and it also has tensor cores, plus support for reduced precision integer calculations.<br />
<br />
== Pascal GPU nodes on Graham == <!--T:58--><br />
<br />
<!--T:59--><br />
These are Graham's default GPU cards. Job submission for these cards is described on the [[Using GPUs with Slurm]] page. When a job simply requests a GPU with <code>--gres=gpu:1</code> or <code>--gres=gpu:2</code>, it will be assigned any type of available GPU. If you require a specific type of GPU, please request it explicitly. As all Pascal nodes have only 2 P100 GPUs, configuring jobs for these cards is relatively simple.<br />
<br />
==Volta GPU nodes on Graham== <!--T:46--><br />
Graham has a total of 9 Volta nodes.<br />
In 7 of these, four GPUs are connected to each CPU socket (except for one node, which is only populated with 6 GPUs, three per socket). The other 2 have high bandwidth NVLINK interconnect.<br />
<br />
<!--T:50--><br />
'''The nodes are available to all users with a maximum 7 days job runtime limit.''' <br />
<br />
<!--T:51--><br />
The following is an example job script to submit a job to one of these nodes (with 8 GPUs). The <code>module load</code> command will ensure that modules compiled for the Skylake architecture are used. Replace <code>nvidia-smi</code> with the command you want to run.<br />
<br />
<!--T:52--><br />
'''Important''': You should scale the number of CPUs requested, keeping the ratio of CPUs to GPUs at 3.5 or less on 28-core nodes. For example, if you want to run a job using 4 GPUs, you should request '''at most 14 CPU cores'''. For a job with 1 GPU, you should request '''at most 3 CPU cores'''. Users are allowed to run a few short test jobs (shorter than 1 hour) that break this rule to see how their code performs.<br />
<br />
<!--T:65--><br />
The two newest Volta nodes have 40 cores so the number of cores requested per GPU should be adjusted upwards accordingly, i.e. you can use 5 CPU cores per GPU. They also have NVLINK, which can provide huge benefits for situations where memory bandwidth between GPUs is the bottleneck. To use one of these NVLINK nodes, it should be requested directly, by adding the option '''--nodelist=gra1337''' or '''--nodelist=gra1338''' to the job submission script.<br />
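For example, to direct a job to one of these nodes you could add the line below to one of the job scripts that follow (a sketch):<br />
<br />
 #SBATCH --nodelist=gra1337   # or gra1338; requests an NVLINK Volta node<br />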
<br />
<!--T:53--><br />
Single-GPU example:<br />
{{File<br />
|name=gpu_single_GPU_job.sh<br />
|lang="sh"<br />
|contents=<br />
#!/bin/bash<br />
#SBATCH --account=def-someuser<br />
#SBATCH --gres=gpu:v100:1<br />
#SBATCH --cpus-per-task=3<br />
#SBATCH --mem=12G<br />
#SBATCH --time=1-00:00<br />
module load arch/avx512 StdEnv/2018.3<br />
nvidia-smi<br />
}}<br />
Full-node example:<br />
{{File<br />
|name=gpu_single_node_job.sh<br />
|lang="sh"<br />
|contents=<br />
#!/bin/bash<br />
#SBATCH --account=def-someuser<br />
#SBATCH --nodes=1<br />
#SBATCH --gres=gpu:v100:8    # all 8 GPUs of the node<br />
#SBATCH --cpus-per-task=28   # all 28 cores of the node<br />
#SBATCH --mem=0              # request all available memory on the node<br />
#SBATCH --time=1-00:00<br />
module load arch/avx512 StdEnv/2018.3<br />
nvidia-smi<br />
}}<br />
<br />
<!--T:54--><br />
The Volta nodes have a fast local disk, which should be used for jobs if the amount of I/O performed by your job is significant. Inside the job, the location of the temporary directory on fast local disk is specified by the environment variable $SLURM_TMPDIR. You can copy your input files there at the start of your job script before you run your program and your output files out at the end of your job script. All the files in $SLURM_TMPDIR will be removed once the job ends, so you do not have to clean up that directory yourself. You can even create Python virtual environments in this temporary space for greater efficiency. Please see the [[Python#Creating_virtual_environments_inside_of_your_jobs|information on how to do this]].<br />
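A minimal sketch of this pattern (the paths, file names and <code>my_program</code> are placeholders):<br />
{{File<br />
|name=gpu_local_disk_job.sh<br />
|lang="sh"<br />
|contents=<br />
#!/bin/bash<br />
#SBATCH --account=def-someuser<br />
#SBATCH --gres=gpu:v100:1<br />
#SBATCH --cpus-per-task=3<br />
#SBATCH --mem=12G<br />
#SBATCH --time=1-00:00<br />
# stage input onto the fast node-local disk<br />
cp /project/def-someuser/input.dat $SLURM_TMPDIR/<br />
cd $SLURM_TMPDIR<br />
my_program input.dat > output.dat   # placeholder for your program<br />
# copy results back before the job ends; $SLURM_TMPDIR is wiped afterwards<br />
cp output.dat /project/def-someuser/<br />
}}<br />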
<br />
==Turing GPU nodes on Graham== <!--T:60--><br />
<br />
<!--T:61--><br />
The usage of these nodes is similar to using the Volta nodes, except that when requesting them, you should specify:<br />
<br />
<!--T:62--><br />
--gres=gpu:t4:2<br />
<br />
<!--T:63--><br />
In this example 2 T4 cards per node are requested.<br />
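For example, a minimal sketch of such a job (account, memory and command are placeholder values):<br />
{{File<br />
|name=gpu_t4_job.sh<br />
|lang="sh"<br />
|contents=<br />
#!/bin/bash<br />
#SBATCH --account=def-someuser<br />
#SBATCH --gres=gpu:t4:2      # request 2 T4 GPUs on one node<br />
#SBATCH --cpus-per-task=6<br />
#SBATCH --mem=24G<br />
#SBATCH --time=1-00:00<br />
nvidia-smi                   # replace with your command<br />
}}<br />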
<br />
<br />
<br />
<!--T:14--><br />
<noinclude><br />
</translate><br />
</noinclude></div>Kaizaadhttps://docs.alliancecan.ca/mediawiki/index.php?title=GROMACS&diff=102505GROMACS2021-08-04T17:34:07Z<p>Kaizaad: </p>
<hr />
<div><languages /><br />
[[Category:Software]][[Category:BiomolecularSimulation]]<br />
<br />
<translate><br />
=General= <!--T:1--><br />
<br />
<!--T:3--><br />
[http://www.gromacs.org/ GROMACS] is a versatile package to perform molecular dynamics for systems with hundreds to millions of particles.<br />
It is primarily designed for biochemical molecules like proteins, lipids and nucleic acids that have a lot of complicated bonded interactions, <br />
but since GROMACS is extremely fast at calculating the nonbonded interactions <br />
(that usually dominate simulations) many groups are also using it for research on non-biological systems, e.g. polymers.<br />
<br />
== Strengths == <!--T:4--><br />
<br />
<!--T:5--><br />
* GROMACS provides extremely high performance compared to all other programs. <br />
* Since GROMACS 4.6, we have excellent CUDA-based GPU acceleration on GPUs that have Nvidia compute capability >= 2.0 (e.g. Fermi or later).<br />
* GROMACS comes with a large selection of flexible tools for trajectory analysis.<br />
* GROMACS can be run in parallel, using either the standard MPI communication protocol, or via our own "Thread MPI" library for single-node workstations.<br />
* GROMACS is Free Software, available under the GNU Lesser General Public License (LGPL), version 2.1.<br />
<br />
== Weak points == <!--T:6--><br />
<br />
<!--T:7--><br />
* To get very high simulation speed, GROMACS does not do much additional analysis and / or data collection on the fly. It may be a challenge to obtain somewhat non-standard information about the simulated system from a GROMACS simulation.<br />
<br />
<!--T:8--><br />
* Different versions may have significant differences in simulation methods and default parameters. Reproducing results of older versions with a newer version may not be straightforward.<br />
<br />
<!--T:9--><br />
* Additional tools and utilities that come with GROMACS are not always of the highest quality, may contain bugs and may implement poorly documented methods. Reconfirming the results of such tools with independent methods is always a good idea.<br />
<br />
== GPU support == <!--T:10--><br />
<br />
<!--T:11--><br />
The top part of any log file will describe the configuration, <br />
and in particular whether your version has GPU support compiled in. <br />
GROMACS will automatically use any GPUs it finds. <br />
<br />
<!--T:12--><br />
GROMACS uses both CPUs and GPUs; it relies on a reasonable balance between CPU and GPU performance.<br />
<br />
<!--T:13--><br />
The new neighbor structure required the introduction of a new variable called "cutoff-scheme" in the mdp file.<br />
The behaviour of older GROMACS versions (before 4.6) corresponds to <tt>cutoff-scheme = group</tt>, while in order to use<br />
GPU acceleration you must change it to <tt>cutoff-scheme = verlet</tt>, which has become the new default in version 5.0.<br />
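For example, the relevant fragment of an ''.mdp'' file would read:<br />
<br />
 cutoff-scheme = verlet    ; required for GPU acceleration; the default since version 5.0<br />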
<br />
= Quickstart guide = <!--T:14--><br />
This section summarizes configuration details.<br />
<br />
=== Environment modules === <!--T:15--><br />
<br />
<!--T:16--><br />
The following versions have been installed:<br />
<br />
<!--T:104--><br />
<tabs><br />
<tab name="StdEnv/2020"><br />
{| class="wikitable sortable"<br />
|-<br />
! GROMACS version !! modules for running on CPUs !! modules for running on GPUs (CUDA) !! Notes<br />
|-<br />
| gromacs/2021.2 || <code>StdEnv/2020 gcc/9.3.0 openmpi/4.0.3 gromacs/2021.2</code> || <code>StdEnv/2020 gcc/9.3.0 cuda/11.0 openmpi/4.0.3 gromacs/2021.2</code> || GCC & MKL<br />
|-<br />
| gromacs/2020.4 || <code>StdEnv/2020 gcc/9.3.0 openmpi/4.0.3 gromacs/2020.4</code> || <code>StdEnv/2020 gcc/9.3.0 cuda/11.0 openmpi/4.0.3 gromacs/2020.4</code> || GCC & MKL<br />
|}</tab><br />
<tab name="StdEnv/2018.3"><br />
{| class="wikitable sortable"<br />
|-<br />
! GROMACS version !! modules for running on CPUs !! modules for running on GPUs (CUDA) !! Notes<br />
|-<br />
| gromacs/2020.2 || <code>StdEnv/2018.3 gcc/7.3.0 openmpi/3.1.2 gromacs/2020.2</code> || <code>StdEnv/2018.3 gcc/7.3.0 cuda/10.0.130 openmpi/3.1.2 gromacs/2020.2</code> || GCC & MKL<br />
|-<br />
| gromacs/2019.6 || <code>StdEnv/2018.3 gcc/7.3.0 openmpi/3.1.2 gromacs/2019.6</code> || <code>StdEnv/2018.3 gcc/7.3.0 cuda/10.0.130 openmpi/3.1.2 gromacs/2019.6</code> || GCC & MKL<br />
|-<br />
| gromacs/2019.3 || <code>StdEnv/2018.3 gcc/7.3.0 openmpi/3.1.2 gromacs/2019.3</code> || <code>StdEnv/2018.3 gcc/7.3.0 cuda/10.0.130 openmpi/3.1.2 gromacs/2019.3</code> || GCC & MKL &nbsp;&#8225;<br />
|-<br />
| gromacs/2018.7 || <code>StdEnv/2018.3 gcc/7.3.0 openmpi/3.1.2 gromacs/2018.7</code> || <code>StdEnv/2018.3 gcc/7.3.0 cuda/10.0.130 openmpi/3.1.2 gromacs/2018.7</code> || GCC & MKL<br />
|}</tab><br />
<tab name="StdEnv/2016.4"><br />
{| class="wikitable sortable"<br />
|-<br />
! GROMACS version !! modules for running on CPUs !! modules for running on GPUs (CUDA) !! Notes<br />
|-<br />
| gromacs/2018.3 || <code>StdEnv/2016.4 gcc/6.4.0 openmpi/2.1.1 gromacs/2018.3</code> || <code>StdEnv/2016.4 gcc/6.4.0 cuda/9.0.176 openmpi/2.1.1 gromacs/2018.3</code> || GCC & FFTW<br />
|-<br />
| gromacs/2018.2 || <code>StdEnv/2016.4 gcc/6.4.0 openmpi/2.1.1 gromacs/2018.2</code> || <code>StdEnv/2016.4 gcc/6.4.0 cuda/9.0.176 openmpi/2.1.1 gromacs/2018.2</code> || GCC & FFTW<br />
|-<br />
| gromacs/2018.1 || <code>StdEnv/2016.4 gcc/6.4.0 openmpi/2.1.1 gromacs/2018.1</code> || <code>StdEnv/2016.4 gcc/6.4.0 cuda/9.0.176 openmpi/2.1.1 gromacs/2018.1</code> || GCC & FFTW<br />
|-<br />
| gromacs/2018 || <code>StdEnv/2016.4 gromacs/2018</code> || <code>StdEnv/2016.4 cuda/9.0.176 gromacs/2018</code> || Intel & MKL<br />
|-<br />
| gromacs/2016.5 || <code>StdEnv/2016.4 gcc/6.4.0 openmpi/2.1.1 gromacs/2016.5</code> || <code>StdEnv/2016.4 gcc/6.4.0 cuda/9.0.176 openmpi/2.1.1 gromacs/2016.5</code> || GCC & FFTW<br />
|-<br />
| gromacs/2016.3 || <code>StdEnv/2016.4 gromacs/2016.3</code> || <code>StdEnv/2016.4 cuda/8.0.44 gromacs/2016.3</code> || Intel & MKL<br />
|-<br />
| gromacs/5.1.5 || <code>StdEnv/2016.4 gromacs/5.1.5</code> || <code>StdEnv/2016.4 cuda/8.0.44 gromacs/5.1.5</code> || Intel & MKL<br />
|-<br />
| gromacs/5.1.4 || <code>StdEnv/2016.4 gromacs/5.1.4</code> || <code>StdEnv/2016.4 cuda/8.0.44 gromacs/5.1.4</code> || Intel & MKL<br />
|-<br />
| gromacs/5.0.7 || <code>StdEnv/2016.4 gromacs/5.0.7</code> || <code>StdEnv/2016.4 cuda/8.0.44 gromacs/5.0.7</code> || Intel & MKL<br />
|-<br />
| gromacs/4.6.7 || <code>StdEnv/2016.4 gromacs/4.6.7</code> || <code>StdEnv/2016.4 cuda/8.0.44 gromacs/4.6.7</code> || Intel & MKL<br />
|}</tab><br />
</tabs><br />
<br />
<!--T:17--><br />
'''Notes:'''<br />
* Version 2020.4 and newer have been compiled for the new [[Standard software environments|Standard software environment]] <code>StdEnv/2020</code>.<br />
* Version 2018.7 and newer have been compiled with GCC compilers and the MKL library, as this combination runs a bit faster.<br />
* Older versions have been compiled with either GCC compilers and FFTW, or Intel compilers with Intel MKL and Open MPI 2.1.1 libraries from the default environment, as indicated in the table above.<br />
* CPU (non-GPU) versions are available in both single and double precision, with the exception of 2019.3 ('''&#8225;'''), where double precision is not available for AVX512.<br />
<br />
<!--T:18--><br />
These modules can be loaded by using a <code>module load</code> command with the modules listed in the second column of the above table.<br />
For example:<br />
</translate><br />
<br />
$ module load StdEnv/2020 gcc/9.3.0 openmpi/4.0.3 gromacs/2021.2<br />
or <br />
$ module load StdEnv/2018.3 gcc/7.3.0 openmpi/3.1.2 gromacs/2020.2<br />
<br />
<translate><br />
<!--T:19--><br />
These versions are also available with GPU support, albeit only in single precision. In order to load the GPU-enabled version, the <code>cuda</code> module needs to be loaded first. The required modules are listed in the third column of the above table, e.g.:<br />
</translate><br />
<br />
$ module load StdEnv/2020 gcc/9.3.0 cuda/11.0 openmpi/4.0.3 gromacs/2021.2 <br />
or<br />
$ module load StdEnv/2018.3 gcc/7.3.0 cuda/10.0.130 openmpi/3.1.2 gromacs/2020.2<br />
<br />
<translate><br />
<!--T:20--><br />
For more information on environment modules, please refer to the [[Using modules]] page.<br />
<br />
=== Suffixes === <!--T:21--><br />
<br />
==== GROMACS 5.x, 2016.x and newer ==== <!--T:22--><br />
GROMACS 5 and newer releases consist of only four binaries that contain the full functionality. <br />
All GROMACS tools from previous versions have been implemented as sub-commands of the gmx binaries.<br />
Please refer to [http://www.gromacs.org/Documentation/How-tos/Tool_Changes_for_5.0 GROMACS 5.0 Tool Changes] and the [http://manual.gromacs.org/documentation/ GROMACS documentation manuals] for your version.<br />
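For example, the former <code>g_rms</code> tool is now invoked as a sub-command (the file names below are placeholders):<br />
<br />
 $ gmx rms -s md.tpr -f md.xtc -o rmsd.xvg<br />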
<br />
<!--T:47--><br />
:* '''<code>gmx</code>''' - single precision GROMACS with OpenMP (threading) but without MPI.<br />
:* '''<code>gmx_mpi</code>''' - single precision GROMACS with OpenMP and MPI.<br />
:* '''<code>gmx_d</code>''' - double precision GROMACS with OpenMP but without MPI.<br />
:* '''<code>gmx_mpi_d</code>''' - double precision GROMACS with OpenMP and MPI.<br />
<br />
==== GROMACS 4.6.7 ==== <!--T:23--><br />
* The double precision binaries have the suffix <code>_d</code>.<br />
* The parallel single and double precision <code>mdrun</code> binaries are:<br />
</translate><br />
<br />
:* '''<code>mdrun_mpi</code>'''<br />
:* '''<code>mdrun_mpi_d</code>'''<br />
<br />
<translate><br />
=== Submission scripts === <!--T:24--><br />
Please refer to the page [[Running jobs]] for help on using the SLURM workload manager.<br />
<br />
==== Serial jobs ==== <!--T:25--><br />
Here's a simple job script for serial mdrun:<br />
<br />
<!--T:26--><br />
{{File<br />
|name=serial_gromacs_job.sh<br />
|lang="sh"<br />
|contents=<br />
#!/bin/bash<br />
#SBATCH --time=0-00:30 # time limit (D-HH:MM)<br />
#SBATCH --mem-per-cpu=1000M # memory per CPU (in MB)<br />
module purge <br />
module load StdEnv/2020 gcc/9.3.0 openmpi/4.0.3 gromacs/2020.4<br />
export OMP_NUM_THREADS="${SLURM_CPUS_PER_TASK:-1}"<br />
<br />
<!--T:50--><br />
gmx mdrun -deffnm em<br />
}}<br />
<br />
<!--T:27--><br />
This will run the simulation of the molecular system in the file <code>em.tpr</code>.<br />
<br />
==== Whole nodes ==== <!--T:30--><br />
Commonly, the systems simulated with GROMACS are so large that you will want to use a number of whole nodes for the simulation.<br />
<br />
<!--T:90--><br />
Generally the product of <code>--ntasks-per-node=</code> and <code>--cpus-per-task</code> has to match the number of CPU-cores in the<br />
compute-nodes of the cluster. Please see section [[GROMACS#Performance_Considerations|Performance Considerations]] below.<br />
<br />
</translate><br />
<tabs><br />
<tab name="Graham"><br />
{{File<br />
|name=gromacs_whole_node_graham.sh<br />
|lang="sh"<br />
|contents=<br />
#!/bin/bash<br />
#SBATCH --nodes=1 # number of nodes<br />
#SBATCH --ntasks-per-node=8 # request 8 MPI tasks per node<br />
#SBATCH --cpus-per-task=4 # 4 OpenMP threads per MPI task => total: 8 x 4 = 32 CPUs/node<br />
#SBATCH --mem=0 # request all available memory on the node<br />
#SBATCH --time=0-01:00 # time limit (D-HH:MM)<br />
module purge <br />
module load StdEnv/2020 gcc/9.3.0 openmpi/4.0.3 gromacs/2020.4<br />
export OMP_NUM_THREADS="${SLURM_CPUS_PER_TASK:-1}"<br />
<br />
srun gmx_mpi mdrun -deffnm md<br />
}}</tab><br />
<tab name="Cedar"><br />
{{File<br />
|name=gromacs_whole_node_cedar.sh<br />
|lang="sh"<br />
|contents=<br />
#!/bin/bash<br />
#SBATCH --nodes=1 # number of nodes<br />
#SBATCH --ntasks-per-node=12 # request 12 MPI tasks per node<br />
#SBATCH --cpus-per-task=4 # 4 OpenMP threads per MPI task => total: 12 x 4 = 48 CPUs/node<br />
#SBATCH --constraint="[skylake{{!}}cascade]" # restrict to AVX512 capable nodes.<br />
#SBATCH --mem=0 # request all available memory on the node<br />
#SBATCH --time=0-01:00 # time limit (D-HH:MM)<br />
module purge<br />
module load arch/avx512 # switch architecture for up to 30% speedup<br />
module load StdEnv/2020 gcc/9.3.0 openmpi/4.0.3 gromacs/2020.4<br />
export OMP_NUM_THREADS="${SLURM_CPUS_PER_TASK:-1}"<br />
<br />
srun gmx_mpi mdrun -deffnm md<br />
}}</tab><br />
<tab name="Béluga"><br />
{{File<br />
|name=gromacs_whole_node_beluga.sh<br />
|lang="sh"<br />
|contents=<br />
#!/bin/bash<br />
#SBATCH --nodes=1 # number of nodes<br />
#SBATCH --ntasks-per-node=10 # request 10 MPI tasks per node<br />
#SBATCH --cpus-per-task=4 # 4 OpenMP threads per MPI task => total: 10 x 4 = 40 CPUs/node<br />
#SBATCH --mem=0 # request all available memory on the node<br />
#SBATCH --time=0-01:00 # time limit (D-HH:MM)<br />
module purge <br />
module load StdEnv/2020 gcc/9.3.0 openmpi/4.0.3 gromacs/2020.4<br />
export OMP_NUM_THREADS="${SLURM_CPUS_PER_TASK:-1}"<br />
<br />
srun gmx_mpi mdrun -deffnm md<br />
}}</tab><br />
<tab name="Niagara"><br />
{{File<br />
|name=gromacs_whole_node_niagara.sh<br />
|lang="sh"<br />
|contents=<br />
#!/bin/bash<br />
#SBATCH --nodes=1 # number of nodes<br />
#SBATCH --ntasks-per-node=10 # request 10 MPI tasks per node<br />
#SBATCH --cpus-per-task=4 # 4 OpenMP threads per MPI task => total: 10 x 4 = 40 CPUs/node<br />
#SBATCH --mem=0 # request all available memory on the node<br />
#SBATCH --time=0-01:00 # time limit (D-HH:MM)<br />
module purge --force<br />
module load CCEnv<br />
module load StdEnv/2020 gcc/9.3.0 openmpi/4.0.3 gromacs/2020.4<br />
export OMP_NUM_THREADS="${SLURM_CPUS_PER_TASK:-1}"<br />
<br />
srun gmx_mpi mdrun -deffnm md<br />
}}</tab><br />
</tabs><br />
<translate><br />
<br />
==== GPU job ==== <!--T:32--><br />
This is a job script for mdrun using 4 OpenMP threads and one GPU:<br />
{{File<br />
|name=gpu_gromacs_job.sh<br />
|lang="sh"<br />
|contents=<br />
#!/bin/bash<br />
#SBATCH --gres=gpu:p100:1 # request 1 GPU as "generic resource"<br />
#SBATCH --cpus-per-task 4 # number of OpenMP threads per MPI process<br />
#SBATCH --mem-per-cpu 1000 # memory limit per CPU core (megabytes)<br />
#SBATCH --time 0:30:00 # time limit (D-HH:MM:ss)<br />
module purge <br />
module load StdEnv/2020 gcc/9.3.0 cuda/11.0 openmpi/4.0.3 gromacs/2020.4<br />
export OMP_NUM_THREADS="${SLURM_CPUS_PER_TASK:-1}"<br />
<br />
<!--T:53--><br />
gmx mdrun -ntomp ${SLURM_CPUS_PER_TASK:-1} -deffnm md<br />
}}<br />
<br />
==== GPU job - whole node ==== <!--T:33--><br />
These are job scripts for mdrun using all GPUs and CPUs within a GPU node.<br />
</translate><br />
<br />
<tabs><br />
<tab name="Graham"><br />
{{File<br />
|name=gromacs_job_GPU_MPI_Graham.sh<br />
|lang="sh"<br />
|contents=<br />
#!/bin/bash<br />
#SBATCH --nodes=1 # number of nodes<br />
#SBATCH --gres=gpu:p100:2 # request 2 GPUs per node (Graham)<br />
#SBATCH --ntasks-per-node=4 # request 4 MPI tasks per node<br />
#SBATCH --cpus-per-task=8 # 8 OpenMP threads per MPI process<br />
#SBATCH --mem=0 # Request all available memory in the node<br />
#SBATCH --time=1:00:00 # time limit (D-HH:MM:ss)<br />
module purge <br />
module load StdEnv/2020 gcc/9.3.0 cuda/11.0 openmpi/4.0.3 gromacs/2020.4<br />
export OMP_NUM_THREADS="${SLURM_CPUS_PER_TASK:-1}"<br />
<br />
mpiexec gmx_mpi mdrun -deffnm md<br />
}}</tab><br />
<tab name="Cedar"><br />
{{File<br />
|name=gromacs_job_GPU_MPI_Cedar.sh<br />
|lang="sh"<br />
|contents=<br />
#!/bin/bash<br />
#SBATCH --nodes=1 # number of nodes<br />
#SBATCH --gres=gpu:p100:4 # request 4 GPUs per node (Cedar)<br />
#SBATCH --ntasks-per-node=4 # request 4 MPI tasks per node<br />
#SBATCH --cpus-per-task=6 # 6 OpenMP threads per MPI process<br />
#SBATCH --mem=0 # Request all available memory in the node<br />
#SBATCH --time=1:00:00 # time limit (D-HH:MM:ss)<br />
module purge <br />
module load StdEnv/2020 gcc/9.3.0 cuda/11.0 openmpi/4.0.3 gromacs/2020.4<br />
export OMP_NUM_THREADS="${SLURM_CPUS_PER_TASK:-1}"<br />
<br />
mpiexec gmx_mpi mdrun -deffnm md<br />
}}</tab><br />
<tab name="Beluga"><br />
{{File<br />
|name=gromacs_job_GPU_MPI_Beluga.sh<br />
|lang="sh"<br />
|contents=<br />
#!/bin/bash<br />
#SBATCH --nodes=1 # number of nodes<br />
#SBATCH --gres=gpu:p100:4 # request 4 GPUs per node (Beluga)<br />
#SBATCH --ntasks-per-node=4 # request 4 MPI tasks per node<br />
#SBATCH --cpus-per-task=5 # 5 OpenMP threads per MPI process<br />
#SBATCH --mem=0 # Request all available memory in the node<br />
#SBATCH --time=1:00:00 # time limit (D-HH:MM:ss)<br />
module purge <br />
module load StdEnv/2020 gcc/9.3.0 cuda/11.0 openmpi/4.0.3 gromacs/2020.4<br />
export OMP_NUM_THREADS="${SLURM_CPUS_PER_TASK:-1}"<br />
<br />
srun gmx_mpi mdrun -deffnm md<br />
}}</tab><br />
</tabs><br />
<br />
<translate><br />
===== Notes on running GROMACS in GPUs ===== <!--T:34--><br />
<br />
<!--T:35--><br />
* The new national clusters (Cedar and Graham) have differently configured GPU nodes:<br />
::* Cedar has 4 GPUs and 24 CPU cores per node<br />
::* Graham has 2 GPUs and 32 CPU cores per node <br />
::Therefore one needs to use different settings to make use of all GPUs and CPU-cores in a node. <br />
::* Cedar: <code>--gres=gpu:p100:4 --ntasks-per-node=4 --cpus-per-task=6</code> <br />
::* Graham: <code>--gres=gpu:p100:2 --ntasks-per-node=4 --cpus-per-task=8</code><br />
::Of course the simulated system needs to be large enough to utilize the resources.<br />
* GROMACS imposes a number of constraints for choosing the number of GPUs, tasks (MPI ranks) and OpenMP threads.<br>For GROMACS 2018.2 the constraints are:<br />
::* The number of <code>--ntasks-per-node</code> always needs to be a multiple of the number of GPUs (<code>--gres=gpu:</code>)<br />
::* GROMACS will not run GPU runs with only 1 OpenMP thread, unless forced by setting the <code>-ntomp</code> option.<br>According to GROMACS developers, the optimum number of <code>--cpus-per-task</code> is between 2 and 6.<br />
* Avoid using a larger fraction of CPUs and memory than the fraction of GPUs you have requested in a node.<br />
* While, according to the developers of the SLURM scheduler, using <code>srun</code> as a replacement for <code>mpiexec</code>/<code>mpirun</code> is the preferred way to start MPI jobs, we have seen evidence of jobs failing on startup when two jobs using <code>srun</code> are started on the same compute node.<br>At this time we therefore recommend using <code>mpiexec</code>, especially when utilizing only partial nodes.<br />
<br />
= Usage = <!--T:36--><br />
<div style="color: red; border: 1px dashed #2f6fab"><br />
<br />
<!--T:45--><br />
More content for this section will be added at a later time.<br />
<br />
<!--T:46--><br />
</div><br />
<br />
=== System preparation === <!--T:37--><br />
In order to run a simulation, one needs to create a ''tpr'' file (portable binary run input file). This file contains the starting structure of the simulation, the molecular topology and all the simulation parameters.<br />
<br />
<!--T:38--><br />
''Tpr'' files are created with the <code>gmx grompp</code> command (or simply <code>grompp</code> for versions older than 5.0). Therefore one needs the following files (see the example after this list):<br />
* The coordinate file with the starting structure. GROMACS can read the starting structure from various file-formats, such as ''.gro'', ''.pdb'' or ''.cpt'' (checkpoint).<br />
* The (system) topology (''.top'') file. It defines which force field is used and how the force-field parameters are applied to the simulated system. Often the topologies for individual parts of the simulated system (e.g. molecules) are placed in separate ''.itp'' files and included in the ''.top'' file using a <code>#include</code> directive.<br />
* The run-parameter (''.mdp'') file. See the GROMACS user guide for a detailed description of the options.<br />
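For example, a typical invocation could look like this (the file names are placeholders):<br />
<br />
 $ gmx grompp -f md.mdp -c system.gro -p topol.top -o md.tpr<br />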
<br />
<!--T:39--><br />
''Tpr'' files are portable, that is they can be ''grompp'''ed on one machine, copied over to a different machine and used as an input file for ''mdrun''. One should always use the same version for both ''grompp'' and ''mdrun''. Although ''mdrun'' is able to use ''tpr'' files that have been created with an older version of ''grompp'', this can lead to unexpected simulation results.<br />
<br />
=== Running simulations === <!--T:40--><br />
<br />
<!--T:55--><br />
MD Simulations often take much longer than the maximum walltime for a job<br />
to complete and therefore need to be restarted. <br />
To minimize the time a job needs to wait before it starts, you should maximise <br />
[[Job_scheduling_policies#Percentage_of_the_nodes_you_have_access_to|the number of nodes you have access to]]<br />
by choosing a shorter running time for your job. Requesting a walltime of <br />
24 hours or 72 hours (three days) is often a good trade-off between waiting- <br />
and running-time.<br />
<br />
<!--T:56--><br />
You should use the <code>mdrun</code> parameter <code>-maxh</code> to tell<br />
the program the requested walltime so that it gracefully finishes the <br />
current timestep when reaching 99% of this walltime. <br />
This causes <code>mdrun</code> to create a new checkpoint file at this <br />
final timestep and gives it the chance to properly close all output-files<br />
(trajectories, energy- and log-files, etc.).<br />
<br />
<!--T:57--><br />
For example use <code>#SBATCH --time=24:00:00</code> along with <code>gmx mdrun -maxh 24 ...</code><br />
or <code>#SBATCH --time=3-00:00</code> along with <code>gmx mdrun -maxh 72 ...</code>.<br />
<br />
<br />
<!--T:58--><br />
{{File<br />
|name=gromacs_job.sh<br />
|lang="sh"<br />
|contents=<br />
#!/bin/bash<br />
#SBATCH --nodes=1 # number of Nodes<br />
#SBATCH --tasks-per-node=32 # number of MPI processes per node<br />
#SBATCH --mem-per-cpu=4000 # memory limit per CPU (megabytes)<br />
#SBATCH --time=24:00:00 # time limit (D-HH:MM:ss)<br />
module purge<br />
module load StdEnv/2020 gcc/9.3.0 openmpi/4.0.3 gromacs/2020.4<br />
export OMP_NUM_THREADS="${SLURM_CPUS_PER_TASK:-1}"<br />
<br />
<!--T:59--><br />
srun gmx_mpi mdrun -deffnm md -maxh 24<br />
}}<br />
<br />
<br />
==== Restarting simulations ==== <!--T:60--><br />
<br />
<!--T:61--><br />
You can restart a simulation by using the same <code>mdrun</code> <br />
command as the original simulation and adding the <code>-cpi state.cpt</code><br />
parameter where <code>state.cpt</code> is the filename of the most recent<br />
checkpoint file. Mdrun will by default (since version 4.5) try to append<br />
to the existing files (trajectories, energy- and log-files, etc.).<br />
GROMACS will check the consistency of the output files and - if needed - <br />
discard timesteps that are newer than that of the checkpoint file.<br />
<br />
<!--T:62--><br />
Using the <code>-maxh</code> parameter ensures that the checkpoint and output<br />
files are written in a consistent state when the simulation reaches the time <br />
limit.<br />
<br />
<!--T:63--><br />
The GROMACS manual contains more detailed information<br />
<ref>[http://manual.gromacs.org/documentation/current/user-guide/managing-simulations.html GROMACS User-Guide: Managing long simulations.]</ref><br />
<ref>[http://manual.gromacs.org/documentation/current/onlinehelp/gmx-mdrun.html#gmx-mdrun GROMACS Manual page: gmx mdrun]</ref>.<br />
<br />
<!--T:64--><br />
{{File<br />
|name=gromacs_job_restart.sh<br />
|lang="sh"<br />
|contents=<br />
#!/bin/bash<br />
#SBATCH --nodes=1 # number of Nodes<br />
#SBATCH --tasks-per-node=32 # number of MPI processes per node<br />
#SBATCH --mem-per-cpu=4000 # memory limit per CPU (megabytes)<br />
#SBATCH --time=24:00:00 # time limit (D-HH:MM:ss)<br />
module purge<br />
module load StdEnv/2020 gcc/9.3.0 openmpi/4.0.3 gromacs/2020.4<br />
export OMP_NUM_THREADS="${SLURM_CPUS_PER_TASK:-1}"<br />
<br />
<!--T:65--><br />
srun gmx_mpi mdrun -deffnm md -maxh 24.0 -cpi md.cpt<br />
}}<br />
<br />
=== Performance considerations === <!--T:49--><br />
<br />
<!--T:66--><br />
Getting the best mdrun performance with GROMACS is not a straightforward <br />
task. The GROMACS developers are maintaining a long section in their user-guide<br />
dedicated to mdrun-performance<ref name="performance">[http://manual.gromacs.org/documentation/current/user-guide/mdrun-performance.html GROMACS User-Guide: Getting good performance from mdrun]</ref><br />
which explains all relevant options/parameters and strategies.<br />
<br />
<!--T:67--><br />
There is no "One size fits all", but the best parameters to choose highly<br />
depend on the size of the system (number of particles as well as size and <br />
shape of the simulation box) and the simulation parameters (cut-offs, use of <br />
Particle-Mesh-Ewald<ref name="perf-background"> [http://manual.gromacs.org/documentation/current/user-guide/mdrun-performance.html#gromacs-background-information GROMACS User-Guide: Performance background information]</ref><br />
(PME) method for long-range electrostatics).<br />
<br />
<!--T:68--><br />
GROMACS prints performance information and statistics at the end of the <br />
<code>md.log</code> file, which is helpful in identifying bottlenecks. <br />
This section often contains notes on how to further improve the performance.<br />
<br />
<!--T:69--><br />
The '''simulation performance''' is typically quantified by the number of <br />
nanoseconds of MD-trajectory that can be simulated within a day (ns/day).<br />
<br />
<!--T:70--><br />
'''Parallel scaling''' is a measure of how effectively the compute resources<br />
are used. It is defined as:<br />
<br />
<!--T:71--><br />
: <span>S = p<sub>N</sub> / ( N * p<sub>1</sub> )</span><br />
<br />
<!--T:72--><br />
Where ''p<sub>N</sub>'' is the performance using ''N'' CPU cores.<br />
<br />
<!--T:73--><br />
Ideally, the performance increases linearly with the number of CPU cores <br />
("linear scaling"; S = 1).<br />
<br />
<br />
==== MPI processes / Slurm tasks / Domain decomposition ==== <!--T:74--><br />
<br />
<!--T:75--><br />
The most straightforward way to improve performance is to increase the number of MPI processes (called<br />
MPI ranks in the GROMACS documentation) by using Slurm's<br />
<code>--ntasks</code> or <code>--ntasks-per-node</code> in the job script.<br />
<br />
<!--T:76--><br />
GROMACS uses '''Domain Decomposition'''<ref name="perf-background" /> (DD) <br />
to distribute the work of solving the non-bonded Particle-Particle (PP) <br />
interactions across multiple CPU cores. This is done by effectively cutting<br />
the simulation box along the X, Y and/or Z axes into domains and assigning<br />
each domain to one MPI process.<br />
<br />
<!--T:77--><br />
This works well until the time needed for communication becomes large with<br />
respect to the size (in terms of ''number of particles'' as well as ''volume'')<br />
of the domain. In that case the parallel scaling will drop significantly<br />
below 1, and in extreme cases the performance drops when increasing the<br />
number of domains.<br />
<br />
<!--T:78--><br />
GROMACS can use '''Dynamic Load Balancing''' to shift the boundaries between<br />
domains to some extent, in order to avoid certain domains taking significantly<br />
longer to solve than others. The <code>mdrun</code> parameter <br />
<code>-dlb auto</code> is the default.<br />
<br />
<!--T:79--><br />
Domains cannot be smaller in any direction than the longest cut-off radius.<br />
<br />
<br />
===== Long-range interactions with PME ===== <!--T:80--><br />
<br />
<!--T:81--><br />
The Particle-Mesh-Ewald method (PME) is often used to calculate the long-range<br />
non-bonded interactions (interactions beyond the cut-off radius). As PME<br />
requires global communication, the performance can degrade quickly when <br />
many MPI processes are involved that are calculating both the short-range <br />
(PP) as well as the long-range (PME) interactions. This is avoided by having<br />
dedicated MPI processes that only perform PME (PME-ranks).<br />
<br />
<!--T:82--><br />
GROMACS mdrun by default uses heuristics to dedicate a number of MPI<br />
processes to PME when the total number of MPI processes is 12 or greater.<br />
The mdrun parameter <code>-npme</code> can be used to select the number of <br />
PME ranks manually.<br />
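For example, to dedicate 4 of the MPI ranks to PME (a sketch; the optimal number depends on your system):<br />
<br />
 srun gmx_mpi mdrun -deffnm md -npme 4<br />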
<br />
<!--T:83--><br />
In case there is a significant "Load Imbalance" between the PP and PME ranks<br />
(e.g. the PP ranks have more work per timestep than the PME ranks), one can<br />
shift work from the PP ranks to the PME ranks by increasing the cut-off radius.<br />
This will not affect the result, as the sum of short-range + long-range forces<br />
(or energies) will be the same for a given timestep. Since version 4.6, mdrun will<br />
attempt to do this automatically, unless the mdrun parameter<br />
<code>-notunepme</code> is used.<br />
<br />
<!--T:84--><br />
Since version 2018, PME can be offloaded to the GPU (see below);<br />
however, the implementation as of version 2018.1 still has several limitations<br />
<ref name="gpu-pme-2018.1">[http://manual.gromacs.org/documentation/2018.1/user-guide/mdrun-performance.html#gpu-accelerated-calculation-of-pme GROMACS User-Guide: GPU accelerated calculation of PME]</ref>, among them that only<br />
a single GPU rank can be dedicated to PME.<br />
<br />
<br />
==== OpenMP threads / CPUs-per-task ==== <!--T:85--><br />
<br />
<!--T:86--><br />
Once Domain Decomposition with MPI processes reaches the scaling limit <br />
(parallel scaling starts dropping), performance can be further improved by<br />
using '''OpenMP threads''' to spread the work of an MPI process (rank) over more<br />
than one CPU core. To use OpenMP threads, use Slurm's <code>--cpus-per-task</code><br />
parameter in the job script and either set the ''OMP_NUM_THREADS'' variable with:<br />
<code>export OMP_NUM_THREADS="${SLURM_CPUS_PER_TASK:-1}"</code> (recommended)<br />
or the mdrun parameter <code>-ntomp ${SLURM_CPUS_PER_TASK:-1}</code>.<br />
<br />
<!--T:87--><br />
According to GROMACS developers, the optimum is usually between 2 and 6 OpenMP threads<br />
per MPI process (cpus-per-task). However, for jobs running on a very large<br />
number of nodes it might be worth trying an even larger number of ''cpus-per-task''.<br />
<br />
<!--T:94--><br />
Especially for systems that don't use PME, we don't have to worry about a<br />
"PP-PME load imbalance". In those cases we can choose 2 or 4 ''ntasks-per-node''<br />
and set ''cpus-per-task'' to a value such that ''ntasks-per-node * cpus-per-task''<br />
matches the number of CPU cores in a compute node.<br />
<br />
==== CPU architecture ==== <!--T:105--><br />
<br />
<!--T:106--><br />
GROMACS uses optimised kernel functions to compute the real-space portion of short-range, non-bonded interactions. Kernel functions are available for a variety of SIMD instruction sets, such as AVX, AVX2, and AVX512. Kernel functions are chosen when compiling GROMACS, and should match the capabilities of the CPUs that will be used to run the simulations. This is done for you by the Compute Canada team: when you load a GROMACS module into your environment, an appropriate AVX/AVX2/AVX512 version is chosen depending on the architecture of the cluster. GROMACS reports what SIMD instruction set it supports in its log file, and will warn you if the selected kernel function is suboptimal.<br />
<br />
<!--T:107--><br />
However, certain clusters contain a mix of CPUs that have different levels of SIMD support. When that is the case, the smallest common denominator is used. For instance, if the cluster has Skylake (AVX/AVX2/AVX512) and Broadwell (AVX/AVX2) CPUs, as Cedar currently (May 2020) does, a version of GROMACS compiled for the AVX2 instruction set will be used. This means that you may end up with a suboptimal choice of kernel function, depending on which compute nodes the scheduler allocates for your job.<br />
<br />
<!--T:108--><br />
You can explicitly request nodes that support AVX512 with the <code>--constraint="[cascade|skylake]"</code> SLURM option on clusters that offer these node types. <br />
This will make sure that your job will be assigned to nodes based on either the "Cascade Lake" or the "Skylake" architecture (but not a mix of both types).<br />
If working on the command-line, make sure to not forget the quotation marks (<code>"</code>) to protect the special characters <code>[</code>, <code>|</code> and <code>]</code>. <br />
You can then explicitly request AVX512 software using <code>module load arch/avx512</code> before loading any other module. <br />
<br />
<!--T:111--><br />
For example, a simple job script could look like the following:<br />
<br />
<!--T:109--><br />
{{File<br />
|name=gromacs_job_cedar_avx512.sh<br />
|lang="sh"<br />
|contents=<br />
#!/bin/bash<br />
#SBATCH --nodes=4<br />
#SBATCH --ntasks-per-node=48<br />
#SBATCH --constraint="[skylake|cascade]"<br />
#SBATCH --time=24:00:00<br />
module load arch/avx512<br />
module load StdEnv/2020 gcc/9.3.0 openmpi/4.0.3 gromacs/2021.2<br />
export OMP_NUM_THREADS="${SLURM_CPUS_PER_TASK:-1}"<br />
srun gmx_mpi mdrun<br />
}}<br />
<br />
<!--T:110--><br />
In our measurements, going from AVX2 to AVX512 on Skylake or Cascade nodes resulted in a 20−30% performance increase. However, you should also consider that restricting yourself to only AVX512-capable nodes will result in longer wait times in the queue.<br />
<br />
==== GPUs ==== <!--T:88--><br />
<br />
<!--T:89--><br />
<div style="color: red; border: 1px dashed #2f6fab"><br />
Tips how to use GPUs efficiently will be added soon.<br />
</div><br />
<br />
=== Analyzing results === <!--T:41--><br />
<br />
=== Common pitfalls === <!--T:42--><br />
<br />
= Related Modules = <!--T:95--><br />
<br />
== Gromacs-Plumed == <!--T:96--><br />
PLUMED<ref name="PLUMED">[http://www.plumed.org/home PLUMED Home]</ref> is an open source library for free energy calculations in molecular systems which works together with some of the most popular molecular dynamics engines.<br />
<br />
<!--T:97--><br />
The <code>gromacs-plumed</code> modules are versions of GROMACS that have been patched with PLUMED's modifications, so that they can run meta-dynamics simulations.<br />
<br />
<!--T:103--><br />
{| class="wikitable sortable"<br />
|-<br />
! GROMACS !! PLUMED !! modules for running on CPUs !! modules for running on GPUs (CUDA)<br />
|-<br />
| v2021.2 || v2.7.1 || <code>StdEnv/2020 gcc/9.3.0 openmpi/4.0.3 gromacs-plumed/2021.2</code> || <code>StdEnv/2020 gcc/9.3.0 cuda/11.0 openmpi/4.0.3 gromacs-plumed/2021.2</code><br />
|-<br />
| v2019.6 || v2.6.2 || <code>StdEnv/2020 gcc/9.3.0 openmpi/4.0.3 gromacs-plumed/2019.6</code> || <code>StdEnv/2020 gcc/9.3.0 cuda/11.0 openmpi/4.0.3 gromacs-plumed/2019.6</code><br />
|-<br />
| v2019.6 || v2.5.4 || <code>StdEnv/2018.3 gcc/7.3.0 openmpi/3.1.2 gromacs-plumed/2019.6</code> || <code>StdEnv/2018.3 gcc/7.3.0 cuda/10.0.130 openmpi/3.1.2 gromacs-plumed/2019.6</code><br />
|-<br />
| v2019.5 || v2.5.3 || <code>StdEnv/2018.3 gcc/7.3.0 openmpi/3.1.2 gromacs-plumed/2019.5</code> || <code>StdEnv/2018.3 gcc/7.3.0 cuda/10.0.130 openmpi/3.1.2 gromacs-plumed/2019.5</code><br />
|-<br />
| v2018.1 || v2.4.2 || <code>StdEnv/2016.4 gcc/6.4.0 openmpi/2.1.1 gromacs-plumed/2018.1</code> || <code>StdEnv/2016.4 gcc/6.4.0 cuda/9.0.176 openmpi/2.1.1 gromacs-plumed/2018.1</code><br />
|-<br />
| v2016.3 || v2.3.2 || <code>StdEnv/2016.4 intel/2016.4 openmpi/2.1.1 gromacs-plumed/2016.3</code> || <code>StdEnv/2016.4 intel/2016.4 cuda/8.0.44 openmpi/2.1.1 gromacs-plumed/2016.3</code><br />
|}<br />
<br />
== G_MMPBSA == <!--T:98--><br />
<br />
<!--T:99--><br />
G_MMPBSA<ref name="g_mmpbsa">[http://rashmikumari.github.io/g_mmpbsa/ G_MMPBSA Homepage]</ref> is a tool that calculates components of the binding energy using the MM-PBSA method (except the entropic term), as well as the energetic contribution of each residue to the binding using an energy-decomposition scheme.<br />
<br />
<!--T:100--><br />
Development of this tool seems to have stalled in April 2016 and no changes have been made since then. Therefore it is only compatible with GROMACS 5.1.x.<br />
<br />
<!--T:101--><br />
The installed version can be loaded with <code>module load StdEnv/2016.4 gcc/5.4.0 g_mmpbsa/2016-04-19</code>, which represents the most up-to-date version: it consists of version 1.6 plus the change that makes it compatible with GROMACS 5.1.x. It has been compiled with <code>gromacs/5.1.5</code> and <code>apbs/1.3</code>.<br />
<br />
<!--T:102--><br />
Please be aware that G_MMPBSA uses implicit solvents and there have been studies<ref>[http://pubs.acs.org/doi/abs/10.1021/acs.jctc.7b00169 Comparison of Implicit and Explicit Solvent Models for the Calculation of Solvation Free Energy in Organic Solvents]</ref> that conclude that there are issues with the accuracy of these methods for calculating binding free energies.<br />
<br />
= Links = <!--T:43--><br />
[[Biomolecular simulation]]<br />
<br />
<!--T:44--><br />
* Project Resources<br />
** Main Website: http://www.gromacs.org/<br />
** Documentation & GROMACS Manuals: http://manual.gromacs.org/documentation/<br />
** GROMACS Community Forums: https://gromacs.bioexcel.eu/ <br />The forums are the successors to the GROMACS email lists.<br />
* Tutorials<br />
** Set of 7 very good Tutorials: http://www.mdtutorials.com/gmx/<br />
** Link collection to more tutorials: http://www.gromacs.org/Documentation/Tutorials<br />
* External Resources<br />
**Tool to generate small molecule topology files: http://www.ccpn.ac.uk/v2-software/software/ACPYPE-folder<br />
** Database with Force Field topologies (CGenFF, GAFF and OPLS/AA) for small molecules: http://www.virtualchemistry.org/<br />
** Webservice to generate small-molecule topologies for GROMOS force fields: https://atb.uq.edu.au/<br />
** Discussion of best GPU configurations for running GROMACS: [https://arxiv.org/abs/1507.00898 Best bang for your buck: GPU nodes for GROMACS biomolecular simulations]<br />
<br />
= References = <!--T:51--><br />
<references /><br />
</translate></div>Kaizaadhttps://docs.alliancecan.ca/mediawiki/index.php?title=NAMD&diff=102504NAMD2021-08-04T17:31:45Z<p>Kaizaad: </p>
<hr />
<div><languages /><br />
[[Category:Software]][[Category:BiomolecularSimulation]]<br />
<br />
<translate><br />
<br />
<!--T:24--><br />
[http://www.ks.uiuc.edu/Research/namd/ NAMD] is a parallel, object-oriented molecular dynamics code designed for high-performance simulation of large biomolecular systems. <br />
Simulation preparation and analysis are integrated into the [[VMD]] visualization package.<br />
<br />
<br />
= Installation = <!--T:22--><br />
NAMD is installed by the Compute Canada software team and is available as a module. If a new version is required or if for some reason you need to do your own installation, please contact [[Technical support]]. You can also ask for details of how our NAMD modules were compiled.<br />
<br />
= Environment modules = <!--T:4--><br />
<br />
<!--T:48--><br />
The latest version of NAMD is 2.14 and it has been installed on all clusters. We recommend users run the newest version.<br />
<br />
<!--T:49--><br />
Older versions 2.13 and 2.12 are also available.<br />
<br />
<!--T:50--><br />
To run jobs that span nodes, use OFI versions on Cedar and UCX versions on other clusters.<br />
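To see which NAMD modules are available on the cluster you are logged into, you can search the installed modules, e.g.:<br />
{{Command|module spider namd}}<br />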
<br />
= Submission scripts = <!--T:13--><br />
<br />
<!--T:14--><br />
Please refer to the [[Running jobs]] page for help on using the SLURM workload manager.<br />
<br />
== Serial and threaded jobs == <!--T:15--><br />
Below is a simple job script for a serial simulation (using only one core). You can increase the value of <code>--cpus-per-task</code> to use more cores, up to the maximum number of cores available on a cluster node.<br />
</translate><br />
{{File<br />
|name=serial_namd_job.sh<br />
|lang="sh"<br />
|contents=<br />
#!/bin/bash<br />
#<br />
#SBATCH --cpus-per-task=1<br />
#SBATCH --mem 2048 # memory in MB, increase as needed <br />
#SBATCH -o slurm.%N.%j.out # STDOUT file<br />
#SBATCH -t 0:05:00 # time (D-HH:MM), increase as needed<br />
#SBATCH --account=def-specifyaccount<br />
<br />
module load StdEnv/2020<br />
module load namd-multicore/2.14<br />
namd2 +p$SLURM_CPUS_PER_TASK +idlepoll apoa1.namd<br />
}}<br />
<translate><br />
<br />
== Parallel CPU jobs == <!--T:61--><br />
<br />
=== MPI jobs === <!--T:18--><br />
'''NOTE''': MPI should not be used. Instead use OFI on Cedar and UCX on other clusters.<br />
<br />
=== Verbs jobs === <!--T:16--><br />
<br />
<!--T:51--><br />
NOTE: For NAMD 2.14, use OFI versions on Cedar and UCX versions on other clusters. Instructions below apply only to NAMD versions 2.13 and 2.12.<br />
<br />
<!--T:52--><br />
These provisional instructions will be refined further once this configuration can be fully tested on the new clusters.<br />
This example uses 64 processes in total on 2 nodes, each node running 32 processes, thus fully utilizing its 32 cores. This script assumes full nodes are used, thus <code>ntasks-per-node</code> should be 32 (on Graham). For best performance, NAMD jobs should use full nodes.<br />
<br />
<!--T:17--><br />
'''NOTES''':<br />
*Verbs versions will not run on Cedar because of its different interconnect; use the MPI version instead.<br />
*Verbs versions will not run on Béluga either because of its incompatible infiniband kernel drivers; use the UCX version instead.<br />
</translate><br />
{{File<br />
|name=verbs_namd_job.sh<br />
|lang="sh"<br />
|contents=<br />
#!/bin/bash<br />
#<br />
#SBATCH --ntasks 64 # number of tasks<br />
#SBATCH --nodes=2<br />
#SBATCH --ntasks-per-node=32<br />
#SBATCH --mem=0 # memory per node, 0 means all memory<br />
#SBATCH -o slurm.%N.%j.out # STDOUT<br />
#SBATCH -t 0:05:00 # time (D-HH:MM)<br />
#SBATCH --account=def-specifyaccount<br />
<br />
NODEFILE=nodefile.dat<br />
slurm_hl2hl.py --format CHARM > $NODEFILE<br />
P=$SLURM_NTASKS<br />
<br />
module load namd-verbs/2.12<br />
CHARMRUN=`which charmrun`<br />
NAMD2=`which namd2`<br />
$CHARMRUN ++p $P ++nodelist $NODEFILE $NAMD2 +idlepoll apoa1.namd<br />
}}<br />
<translate><br />
<br />
=== UCX jobs === <!--T:42--><br />
This example uses 80 processes in total on 2 nodes, each node running 40 processes, thus fully utilizing its 80 cores. This script assumes full nodes are used, thus <code>ntasks-per-node</code> should be 40 (on Béluga). For best performance, NAMD jobs should use full nodes.<br />
<br />
<br />
<!--T:43--><br />
'''NOTE''': UCX versions will not run on Cedar because of its different interconnect. Use the OFI version instead.<br />
</translate><br />
{{File<br />
|name=ucx_namd_job.sh<br />
|lang="sh"<br />
|contents=<br />
#!/bin/bash<br />
#<br />
#SBATCH --ntasks 80 # number of tasks<br />
#SBATCH --nodes=2<br />
#SBATCH --ntasks-per-node=40<br />
#SBATCH --mem=0 # memory per node, 0 means all memory<br />
#SBATCH -o slurm.%N.%j.out # STDOUT<br />
#SBATCH -t 0:05:00 # time (D-HH:MM)<br />
#SBATCH --account=def-specifyaccount<br />
<br />
module load StdEnv/2020 namd-ucx/2.14<br />
srun --mpi=pmi2 namd2 apoa1.namd<br />
}}<br />
<translate><br />
<br />
=== OFI jobs === <!--T:53--><br />
<br />
<!--T:54--><br />
'''NOTE''': OFI versions will run '''ONLY''' on Cedar because of its different interconnect. <br />
</translate><br />
{{File<br />
|name=ofi_namd_job.sh<br />
|lang="sh"<br />
|contents=<br />
#!/bin/bash<br />
#SBATCH --account=def-specifyaccount<br />
#SBATCH --ntasks 64 # number of tasks<br />
#SBATCH --nodes=2<br />
#SBATCH --ntasks-per-node=32<br />
#SBATCH -t 0:05:00 # time (D-HH:MM)<br />
#SBATCH --mem=0 # memory per node, 0 means all memory<br />
#SBATCH -o slurm.%N.%j.out # STDOUT<br />
<br />
<!--T:55--><br />
module load StdEnv/2020 namd-ofi/2.14<br />
srun --mpi=pmi2 namd2 stmv.namd <br />
}}<br />
<translate><br />
<br />
== Single GPU jobs == <!--T:19--><br />
This example uses 8 CPU cores and 1 P100 GPU on a single node.<br />
</translate><br />
{{File<br />
|name=multicore_gpu_namd_job.sh<br />
|lang="sh"<br />
|contents=<br />
#!/bin/bash<br />
#<br />
#SBATCH --cpus-per-task=8 <br />
#SBATCH --mem 2048 <br />
#SBATCH -o slurm.%N.%j.out # STDOUT<br />
#SBATCH -t 0:05:00 # time (D-HH:MM)<br />
#SBATCH --gres=gpu:p100:1<br />
#SBATCH --account=def-specifyaccount<br />
<br />
<br />
module load StdEnv/2020<br />
module load cuda/11.0<br />
module load namd-multicore/2.14<br />
namd2 +p$SLURM_CPUS_PER_TASK +idlepoll apoa1.namd<br />
}}<br />
<br />
<translate><br />
<br />
== Parallel GPU jobs == <!--T:44--><br />
=== UCX GPU jobs ===<br />
This example is for Béluga and assumes that full nodes are used, which gives the best performance for NAMD jobs. It uses 8 processes in total on 2 nodes, each process (task) using 10 threads and 1 GPU. This fully utilizes Béluga GPU nodes, which have 40 cores and 4 GPUs per node. Note that 1 core per task has to be reserved for a communication thread, so NAMD will report that only 72 cores are being used; this is normal.<br />
<br />
<!--T:45--><br />
To use this script on other clusters, please look up the specifications of their available nodes and adjust the <code>--cpus-per-task</code> and <code>--gres=gpu:</code> options accordingly, as in the sketch following the script below.<br />
<br />
<!--T:46--><br />
'''NOTE''': UCX versions will not run on Cedar because of its different interconnect. Use the OFI version instead.<br />
{{File<br />
|name=ucx_namd_job.sh<br />
|lang="sh"<br />
|contents=<br />
#!/bin/bash<br />
#SBATCH --ntasks 8 # number of tasks<br />
#SBATCH --nodes 2 <br />
#SBATCH --cpus-per-task=10 # number of threads per task (process)<br />
#SBATCH --gres=gpu:p100:4<br />
#SBATCH --mem=0 # memory per node, 0 means all memory<br />
#SBATCH -o slurm.%N.%j.out # STDOUT<br />
#SBATCH -t 0:05:00 # time (D-HH:MM)<br />
#SBATCH --account=def-specifyaccount<br />
<br />
<br />
<!--T:47--><br />
module load StdEnv/2020 intel/2020.1.217 cuda/11.0 namd-ucx-smp/2.14<br />
NUM_PES=$(expr $SLURM_CPUS_PER_TASK - 1 ) # one core per task is reserved for the communication thread<br />
srun --mpi=pmi2 namd2 ++ppn $NUM_PES apoa1.namd<br />
}}<br />
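<br />
To make that adjustment concrete, below is a minimal sketch for a hypothetical cluster whose GPU nodes have 48 cores and 4 GPUs each; the core count, GPU count and omitted GPU type are illustrative assumptions rather than the specification of any real cluster, so check the documentation of the cluster you are targeting.<br />
{{File<br />
|name=ucx_gpu_other_cluster.sh<br />
|lang="sh"<br />
|contents=<br />
#!/bin/bash<br />
#SBATCH --ntasks 8 # 4 tasks per node, one per GPU (assumed 4 GPUs per node)<br />
#SBATCH --nodes 2<br />
#SBATCH --cpus-per-task=12 # assumed 48 cores / 4 GPUs = 12 cores per task<br />
#SBATCH --gres=gpu:4 # all GPUs of each node; add the GPU type if your cluster requires it<br />
#SBATCH --mem=0 # memory per node, 0 means all memory<br />
#SBATCH -t 0:05:00 # time (D-HH:MM)<br />
#SBATCH --account=def-specifyaccount<br />
<br />
module load StdEnv/2020 intel/2020.1.217 cuda/11.0 namd-ucx-smp/2.14<br />
NUM_PES=$(expr $SLURM_CPUS_PER_TASK - 1 ) # one core per task is reserved for the communication thread<br />
srun --mpi=pmi2 namd2 ++ppn $NUM_PES apoa1.namd<br />
}}<br />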
<br />
=== OFI GPU jobs === <!--T:56--><br />
<br />
<!--T:57--><br />
'''NOTE''': OFI versions will run '''ONLY''' on Cedar because of its different interconnect. <br />
{{File<br />
|name=ofi_namd_job.sh<br />
|lang="sh"<br />
|contents=<br />
#!/bin/bash<br />
#SBATCH --account=def-specifyaccount<br />
#SBATCH --ntasks 8 # number of tasks<br />
#SBATCH --nodes=2<br />
#SBATCH --cpus-per-task=6<br />
#SBATCH --gres=gpu:p100:4<br />
#SBATCH -t 0:05:00 # time (D-HH:MM)<br />
#SBATCH --mem=0 # memory per node, 0 means all memory<br />
<br />
<!--T:58--><br />
module load StdEnv/2020 cuda/11.0 namd-ofi-smp/2.14<br />
NUM_PES=$(expr $SLURM_CPUS_PER_TASK - 1 ) # one core per task is reserved for the communication thread<br />
srun --mpi=pmi2 namd2 ++ppn $NUM_PES stmv.namd<br />
}}<br />
<br />
=== Verbs-GPU jobs === <!--T:20--><br />
<br />
<!--T:59--><br />
'''NOTE''': For NAMD 2.14, use the OFI GPU version on Cedar and the UCX GPU version on other clusters. The instructions below apply only to NAMD versions 2.13 and 2.12.<br />
<br />
<!--T:60--><br />
This example uses 64 processes in total on 2 nodes, each node running 32 processes, thus fully utilizing its 32 cores. Each node uses 2 GPUs, so the job uses 4 GPUs in total. This script assumes full nodes are used, thus <code>ntasks-per-node</code> should be 32 (on Graham). For best performance, NAMD jobs should use full nodes.<br />
<br />
<!--T:21--><br />
'''NOTE''': Verbs versions will not run on Cedar because of its different interconnect. <br />
</translate><br />
{{File<br />
|name=verbsgpu_namd_job.sh<br />
|lang="sh"<br />
|contents=<br />
#!/bin/bash<br />
#<br />
#SBATCH --ntasks 64 # number of tasks<br />
#SBATCH --nodes=2<br />
#SBATCH --ntasks-per-node=32<br />
#SBATCH --mem 0 # memory per node, 0 means all memory<br />
#SBATCH --gres=gpu:p100:2<br />
#SBATCH -o slurm.%N.%j.out # STDOUT<br />
#SBATCH -t 0:05:00 # time (D-HH:MM)<br />
#SBATCH --account=def-specifyaccount<br />
<br />
slurm_hl2hl.py --format CHARM > nodefile.dat<br />
NODEFILE=nodefile.dat<br />
OMP_NUM_THREADS=32<br />
P=$SLURM_NTASKS<br />
<br />
module load cuda/8.0.44<br />
module load namd-verbs-smp/2.12<br />
CHARMRUN=`which charmrun`<br />
NAMD2=`which namd2`<br />
$CHARMRUN ++p $P ++ppn $OMP_NUM_THREADS ++nodelist $NODEFILE $NAMD2 +idlepoll apoa1.namd<br />
}}<br />
<translate><br />
<br />
=Benchmarking NAMD= <!--T:31--><br />
<br />
<!--T:32--><br />
This section shows an example of how you should conduct benchmarking of NAMD. Performance of NAMD will be different for different systems you are simulating, depending especially on the number of atoms in the simulation. Therefore, if you plan to spend a significant amount of time simulating a particular system, it would be very useful to conduct the kind of benchmarking shown below. Collecting and providing this kind of data is also very useful if you are applying for a RAC award.<br />
<br />
<!--T:33--><br />
For a good benchmark, please vary the number of steps so that your system runs for a few minutes and timing information is collected at reasonable intervals of at least a few seconds. If your run is too short, you might see fluctuations in your timing results.<br />
<br />
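To extract these timings from the job's output file, you can filter the log (a sketch, assuming the usual NAMD log format, in which <code>Info: Benchmark time:</code> lines report seconds per step, and the <code>slurm.%N.%j.out</code> output pattern used in the scripts above):<br />
{{Command|grep "Benchmark time" slurm.*.out}}<br />
<br />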
<!--T:34--><br />
The numbers below were obtained for the standard NAMD apoa1 benchmark. The benchmarking was conducted on the Graham cluster, which has CPU nodes with 32 cores and GPU nodes with 32 cores and 2 GPUs. Benchmarking on other clusters must take into account the different structure of their nodes.<br />
<br />
<!--T:35--><br />
In the results shown in the first table below, we used NAMD 2.12 from the verbs module. Efficiency is computed as (time with 1 core) / (N × (time with N cores)).<br />
<br />
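As a worked instance of this formula, take the 64-core row of the table below: 0.8313 / (64 × 0.0133) ≈ 0.98, i.e. the 98% shown. The arithmetic can be checked from the shell (a minimal sketch):<br />
{{Command|bc -l <<< "100 * 0.8313 / (64 * 0.0133)"}}<br />
<br />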
<!--T:36--><br />
{| class="wikitable sortable"<br />
|-<br />
! # cores !! Wall time (s) per step !! Efficiency<br />
|-<br />
| 1 || 0.8313||100%<br />
|-<br />
| 2 || 0.4151||100%<br />
|-<br />
| 4 || 0.1945|| 107%<br />
|-<br />
| 8 || 0.0987|| 105%<br />
|-<br />
| 16 || 0.0501|| 104%<br />
|-<br />
| 32 || 0.0257|| 101%<br />
|-<br />
| 64 || 0.0133|| 98%<br />
|-<br />
| 128 || 0.0074|| 88%<br />
|-<br />
| 256 || 0.0036|| 90%<br />
|-<br />
| 512 || 0.0021|| 77%<br />
|-<br />
|}<br />
<br />
<!--T:37--><br />
These results show that for this system it is acceptable to use up to 256 cores. Keep in mind that if you ask for more cores, your jobs will wait in the queue for a longer time, affecting your overall throughput.<br />
<br />
<!--T:38--><br />
Now we perform benchmarking with GPUs. The NAMD multicore module is used for simulations that fit within 1 node, and the NAMD verbs-smp module is used for runs spanning nodes.<br />
<br />
<!--T:39--><br />
{| class="wikitable sortable"<br />
|-<br />
! # cores !! #GPUs !! Wall time (s) per step !! Notes<br />
|-<br />
| 4 || 1 || 0.0165 || 1 node, multicore<br />
|-<br />
| 8 || 1 || 0.0088 || 1 node, multicore<br />
|-<br />
| 16 || 1 || 0.0071 || 1 node, multicore<br />
|-<br />
| 32 || 2 || 0.0045 || 1 node, multicore<br />
|-<br />
| 64 || 4 || 0.0058 || 2 nodes, verbs-smp<br />
|-<br />
| 128 || 8 || 0.0051 || 2 nodes, verbs-smp<br />
|-<br />
|}<br />
<br />
<!--T:40--><br />
From this table it is clear that there is no point in using more than 1 node for this system, since performance actually becomes worse with 2 or more nodes. Using only 1 node, it is best to use 1 GPU and 16 cores, as that has the greatest efficiency, but it is also acceptable to use 2 GPUs and 32 cores if you need your results quickly. Since on Graham GPU nodes your priority is charged the same for any job using up to 16 cores and 1 GPU, there is no benefit in running with only 8 or 4 cores in this case.<br />
<br />
<!--T:41--><br />
Finally, you have to decide whether to run with or without GPUs for this simulation. From our numbers we can see that on a full GPU node of Graham (32 cores, 2 GPUs) the job runs faster than it would on 4 non-GPU nodes of Graham. Since a GPU node on Graham costs about twice what a non-GPU node costs, in this case it is more cost-effective to run with GPUs. You should run with GPUs if possible; however, given that there are fewer GPU than CPU nodes, you may need to consider submitting non-GPU jobs if your waiting time for GPU jobs is too long.<br />
<br />
= References = <!--T:23--><br />
* Downloads: http://www.ks.uiuc.edu/Development/Download/download.cgi?PackageName=NAMD -- Registration is required to download the software.<br />
*[http://www.ks.uiuc.edu/Research/namd/2.12/ug/ NAMD User's Guide for version 2.12]<br />
*[http://www.ks.uiuc.edu/Research/namd/2.12/notes.html NAMD version 2.12 release notes]<br />
* Tutorials: http://www.ks.uiuc.edu/Training/Tutorials/<br />
<br />
</translate></div>Kaizaadhttps://docs.alliancecan.ca/mediawiki/index.php?title=NAMD&diff=102503NAMD2021-08-04T17:30:30Z<p>Kaizaad: </p>
<hr />
<div><languages /><br />
[[Category:Software]][[Category:BiomolecularSimulation]]<br />
<br />
<translate><br />
<br />
<!--T:24--><br />
[http://www.ks.uiuc.edu/Research/namd/ NAMD] is a parallel, object-oriented molecular dynamics code designed for high-performance simulation of large biomolecular systems. <br />
Simulation preparation and analysis are integrated into the [[VMD]] visualization package.<br />
<br />
<br />
= Installation = <!--T:22--><br />
NAMD is installed by the Compute Canada software team and is available as a module. If a new version is required or if for some reason you need to do your own installation, please contact [[Technical support]]. You can also ask for details of how our NAMD modules were compiled.<br />
<br />
= Environment modules = <!--T:4--><br />
<br />
<!--T:48--><br />
The latest version of NAMD is 2.14 and it has been installed on all clusters. We recommend users run the newest version.<br />
<br />
<!--T:49--><br />
Older versions 2.13 and 2.12 are also available.<br />
<br />
<!--T:50--><br />
To run jobs that span nodes, use the OFI versions on Cedar and the UCX versions on other clusters.<br />
<br />
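To see which of these builds are available on the cluster you are logged into, query the module system (a sketch; <code>module spider</code> is the standard Lmod query command on these clusters):<br />
{{Command|module spider namd}}<br />
<br />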
= Submission scripts = <!--T:13--><br />
<br />
<!--T:14--><br />
Please refer to the [[Running jobs]] page for help on using the SLURM workload manager.<br />
<br />
== Serial and threaded jobs == <!--T:15--><br />
Below is a simple job script for a serial simulation (using only one core). You can increase the value of <code>--cpus-per-task</code> to use more cores, up to the maximum number of cores available on a cluster node; see the sketch after the script.<br />
</translate><br />
{{File<br />
|name=serial_namd_job.sh<br />
|lang="sh"<br />
|contents=<br />
#!/bin/bash<br />
#<br />
#SBATCH --cpus-per-task=1<br />
#SBATCH --mem 2048 # memory in MB, increase as needed <br />
#SBATCH -o slurm.%N.%j.out # STDOUT file<br />
#SBATCH -t 0:05:00 # time (D-HH:MM), increase as needed<br />
#SBATCH --account=def-specifyaccount<br />
<br />
module load StdEnv/2020<br />
module load namd-multicore/2.14<br />
namd2 +p$SLURM_CPUS_PER_TASK +idlepoll apoa1.namd<br />
}}<br />
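<br />
For example, to use 8 cores only the <code>--cpus-per-task</code> directive needs to change; the <code>namd2</code> line itself is unchanged because it reads <code>$SLURM_CPUS_PER_TASK</code>. A sketch of the modified directives (the memory value is an illustrative assumption and should be sized to your system):<br />
<pre><br />
#SBATCH --cpus-per-task=8<br />
#SBATCH --mem 8192 # memory in MB<br />
</pre><br />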
<translate><br />
<br />
== Parallel CPU jobs == <!--T:61--><br />
<br />
=== MPI jobs === <!--T:18--><br />
'''NOTE''': MPI should not be used. Instead use OFI on Cedar and UCX on other clusters.<br />
<br />
=== Verbs jobs === <!--T:16--><br />
<br />
<!--T:51--><br />
'''NOTE''': For NAMD 2.14, use the OFI version on Cedar and the UCX version on other clusters. The instructions below apply only to NAMD versions 2.13 and 2.12.<br />
<br />
<!--T:52--><br />
These provisional instructions will be refined further once this configuration can be fully tested on the new clusters.<br />
This example uses 64 processes in total on 2 nodes, each node running 32 processes, thus fully utilizing its 32 cores. This script assumes full nodes are used, thus <code>ntasks-per-node</code> should be 32 (on Graham). For best performance, NAMD jobs should use full nodes.<br />
<br />
<!--T:17--><br />
'''NOTES''':<br />
*Verbs versions will not run on Cedar because of its different interconnect; use the MPI version instead.<br />
*Verbs versions will not run on Béluga either because of its incompatible InfiniBand kernel drivers; use the UCX version instead.<br />
</translate><br />
{{File<br />
|name=verbs_namd_job.sh<br />
|lang="sh"<br />
|contents=<br />
#!/bin/bash<br />
#<br />
#SBATCH --ntasks 64 # number of tasks<br />
#SBATCH --nodes=2<br />
#SBATCH --ntasks-per-node=32<br />
#SBATCH --mem=0 # memory per node, 0 means all memory<br />
#SBATCH -o slurm.%N.%j.out # STDOUT<br />
#SBATCH -t 0:05:00 # time (D-HH:MM)<br />
#SBATCH --account=def-specifyaccount<br />
<br />
NODEFILE=nodefile.dat<br />
slurm_hl2hl.py --format CHARM > $NODEFILE<br />
P=$SLURM_NTASKS<br />
<br />
module load namd-verbs/2.12<br />
CHARMRUN=`which charmrun`<br />
NAMD2=`which namd2`<br />
$CHARMRUN ++p $P ++nodelist $NODEFILE $NAMD2 +idlepoll apoa1.namd<br />
}}<br />
<translate><br />
<br />
=== UCX jobs === <!--T:42--><br />
This example uses 80 processes in total on 2 nodes, each node running 40 processes, thus fully utilizing its 40 cores. This script assumes full nodes are used, thus <code>ntasks-per-node</code> should be 40 (on Béluga). For best performance, NAMD jobs should use full nodes.<br />
<br />
<br />
<!--T:43--><br />
'''NOTE''': UCX versions will not run on Cedar because of its different interconnect. Use the OFI version instead.<br />
</translate><br />
{{File<br />
|name=ucx_namd_job.sh<br />
|lang="sh"<br />
|contents=<br />
#!/bin/bash<br />
#<br />
#SBATCH --ntasks 80 # number of tasks<br />
#SBATCH --nodes=2<br />
#SBATCH --ntasks-per-node=40<br />
#SBATCH --mem=0 # memory per node, 0 means all memory<br />
#SBATCH -o slurm.%N.%j.out # STDOUT<br />
#SBATCH -t 0:05:00 # time (D-HH:MM)<br />
#SBATCH --account=def-specifyaccount<br />
<br />
module load StdEnv/2020 namd-ucx/2.14<br />
srun --mpi=pmi2 namd2 apoa1.namd<br />
}}<br />
<translate><br />
<br />
=== OFI jobs === <!--T:53--><br />
<br />
<!--T:54--><br />
'''NOTE''': OFI versions will run '''ONLY''' on Cedar because of its different interconnect. <br />
</translate><br />
{{File<br />
|name=ofi_namd_job.sh<br />
|lang="sh"<br />
|contents=<br />
#!/bin/bash<br />
#SBATCH --account=def-specifyaccount<br />
#SBATCH --ntasks 64 # number of tasks<br />
#SBATCH --nodes=2<br />
#SBATCH --ntasks-per-node=32<br />
#SBATCH -t 0:05:00 # time (D-HH:MM)<br />
#SBATCH --mem=0 # memory per node, 0 means all memory<br />
#SBATCH -o slurm.%N.%j.out # STDOUT<br />
<br />
<!--T:55--><br />
module load StdEnv/2020 namd-ofi/2.14<br />
srun --mpi=pmi2 namd2 stmv.namd <br />
}}<br />
<translate><br />
<br />
== Single GPU jobs == <!--T:19--><br />
This example uses 8 CPU cores and 1 P100 GPU on a single node.<br />
</translate><br />
{{File<br />
|name=multicore_gpu_namd_job.sh<br />
|lang="sh"<br />
|contents=<br />
#!/bin/bash<br />
#<br />
#SBATCH --cpus-per-task=8 <br />
#SBATCH --mem 2048 <br />
#SBATCH -o slurm.%N.%j.out # STDOUT<br />
#SBATCH -t 0:05:00 # time (D-HH:MM)<br />
#SBATCH --gres=gpu:p100:1<br />
#SBATCH --account=def-specifyaccount<br />
<br />
<br />
module load StdEnv/2020<br />
module load cuda/11.0<br />
module load namd-multicore/2.14<br />
namd2 +p$SLURM_CPUS_PER_TASK +idlepoll apoa1.namd<br />
}}<br />
<br />
<translate><br />
<br />
== Parallel GPU jobs == <!--T:44--><br />
=== UCX GPU jobs ===<br />
This example is for Béluga and assumes that full nodes are used, which gives the best performance for NAMD jobs. It uses 8 processes in total on 2 nodes, each process (task) using 10 threads and 1 GPU. This fully utilizes Béluga GPU nodes, which have 40 cores and 4 GPUs per node. Note that 1 core per task has to be reserved for a communication thread, so NAMD will report that only 72 cores are being used; this is normal.<br />
<br />
<!--T:45--><br />
To use this script on other clusters, please look up the specifications of their available nodes and adjust the <code>--cpus-per-task</code> and <code>--gres=gpu:</code> options accordingly.<br />
<br />
<!--T:46--><br />
'''NOTE''': UCX versions will not run on Cedar because of its different interconnect. Use the OFI version instead.<br />
{{File<br />
|name=ucx_namd_job.sh<br />
|lang="sh"<br />
|contents=<br />
#!/bin/bash<br />
#SBATCH --ntasks 8 # number of tasks<br />
#SBATCH --nodes 2 <br />
#SBATCH --cpus-per-task=10 # number of threads per task (process)<br />
#SBATCH --gres=gpu:p100:4<br />
#SBATCH --mem=0 # memory per node, 0 means all memory<br />
#SBATCH -o slurm.%N.%j.out # STDOUT<br />
#SBATCH -t 0:05:00 # time (D-HH:MM)<br />
#SBATCH --account=def-specifyaccount<br />
<br />
<br />
<!--T:47--><br />
module load StdEnv/2020 intel/2020.1.217 cuda/11.0 namd-ucx-smp/2.14<br />
NUM_PES=$(expr $SLURM_CPUS_PER_TASK - 1 ) # one core per task is reserved for the communication thread<br />
srun --mpi=pmi2 namd2 ++ppn $NUM_PES apoa1.namd<br />
}}<br />
<br />
=== OFI GPU jobs === <!--T:56--><br />
<br />
<!--T:57--><br />
'''NOTE''': OFI versions will run '''ONLY''' on Cedar because of its different interconnect. <br />
{{File<br />
|name=ofi_namd_job.sh<br />
|lang="sh"<br />
|contents=<br />
#!/bin/bash<br />
#SBATCH --account=def-specifyaccount<br />
#SBATCH --ntasks 8 # number of tasks<br />
#SBATCH --nodes=2<br />
#SBATCH --cpus-per-task=6<br />
#SBATCH --gres=gpu:p100:4<br />
#SBATCH -t 0:05:00 # time (D-HH:MM)<br />
#SBATCH --mem=0 # memory per node, 0 means all memory<br />
<br />
<!--T:58--><br />
module load StdEnv/2020 cuda/11.0 namd-ofi-smp/2.14<br />
NUM_PES=$(expr $SLURM_CPUS_PER_TASK - 1 ) # one core per task is reserved for the communication thread<br />
srun --mpi=pmi2 namd2 ++ppn $NUM_PES stmv.namd<br />
}}<br />
<br />
=== Verbs-GPU jobs === <!--T:20--><br />
<br />
<!--T:59--><br />
'''NOTE''': For NAMD 2.14, use the OFI GPU version on Cedar and the UCX GPU version on other clusters. The instructions below apply only to NAMD versions 2.13 and 2.12.<br />
<br />
<!--T:60--><br />
This example uses 64 processes in total on 2 nodes, each node running 32 processes, thus fully utilizing its 32 cores. Each node uses 2 GPUs, so the job uses 4 GPUs in total. This script assumes full nodes are used, thus <code>ntasks-per-node</code> should be 32 (on Graham). For best performance, NAMD jobs should use full nodes.<br />
<br />
<!--T:21--><br />
'''NOTE''': Verbs versions will not run on Cedar because of its different interconnect. <br />
</translate><br />
{{File<br />
|name=verbsgpu_namd_job.sh<br />
|lang="sh"<br />
|contents=<br />
#!/bin/bash<br />
#<br />
#SBATCH --ntasks 64 # number of tasks<br />
#SBATCH --nodes=2<br />
#SBATCH --ntasks-per-node=32<br />
#SBATCH --mem 0 # memory per node, 0 means all memory<br />
#SBATCH --gres=gpu:2<br />
#SBATCH -o slurm.%N.%j.out # STDOUT<br />
#SBATCH -t 0:05:00 # time (D-HH:MM)<br />
#SBATCH --account=def-specifyaccount<br />
<br />
slurm_hl2hl.py --format CHARM > nodefile.dat<br />
NODEFILE=nodefile.dat<br />
OMP_NUM_THREADS=32<br />
P=$SLURM_NTASKS<br />
<br />
module load cuda/8.0.44<br />
module load namd-verbs-smp/2.12<br />
CHARMRUN=`which charmrun`<br />
NAMD2=`which namd2`<br />
$CHARMRUN ++p $P ++ppn $OMP_NUM_THREADS ++nodelist $NODEFILE $NAMD2 +idlepoll apoa1.namd<br />
}}<br />
<translate><br />
<br />
=Benchmarking NAMD= <!--T:31--><br />
<br />
<!--T:32--><br />
This section shows an example of how you should conduct benchmarking of NAMD. Performance of NAMD will be different for different systems you are simulating, depending especially on the number of atoms in the simulation. Therefore, if you plan to spend a significant amount of time simulating a particular system, it would be very useful to conduct the kind of benchmarking shown below. Collecting and providing this kind of data is also very useful if you are applying for a RAC award.<br />
<br />
<!--T:33--><br />
For a good benchmark, please vary the number of steps so that your system runs for a few minutes and timing information is collected at reasonable intervals of at least a few seconds. If your run is too short, you might see fluctuations in your timing results.<br />
<br />
<!--T:34--><br />
The numbers below were obtained for the standard NAMD apoa1 benchmark. The benchmarking was conducted on the Graham cluster, which has CPU nodes with 32 cores and GPU nodes with 32 cores and 2 GPUs. Benchmarking on other clusters must take into account the different structure of their nodes.<br />
<br />
<!--T:35--><br />
In the results shown in the first table below, we used NAMD 2.12 from the verbs module. Efficiency is computed as (time with 1 core) / (N × (time with N cores)).<br />
<br />
<!--T:36--><br />
{| class="wikitable sortable"<br />
|-<br />
! # cores !! Wall time (s) per step !! Efficiency<br />
|-<br />
| 1 || 0.8313||100%<br />
|-<br />
| 2 || 0.4151||100%<br />
|-<br />
| 4 || 0.1945|| 107%<br />
|-<br />
| 8 || 0.0987|| 105%<br />
|-<br />
| 16 || 0.0501|| 104%<br />
|-<br />
| 32 || 0.0257|| 101%<br />
|-<br />
| 64 || 0.0133|| 98%<br />
|-<br />
| 128 || 0.0074|| 88%<br />
|-<br />
| 256 || 0.0036|| 90%<br />
|-<br />
| 512 || 0.0021|| 77%<br />
|-<br />
|}<br />
<br />
<!--T:37--><br />
These results show that for this system it is acceptable to use up to 256 cores. Keep in mind that if you ask for more cores, your jobs will wait in the queue for a longer time, affecting your overall throughput.<br />
<br />
<!--T:38--><br />
Now we perform benchmarking with GPUs. The NAMD multicore module is used for simulations that fit within 1 node, and the NAMD verbs-smp module is used for runs spanning nodes.<br />
<br />
<!--T:39--><br />
{| class="wikitable sortable"<br />
|-<br />
! # cores !! #GPUs !! Wall time (s) per step !! Notes<br />
|-<br />
| 4 || 1 || 0.0165 || 1 node, multicore<br />
|-<br />
| 8 || 1 || 0.0088 || 1 node, multicore<br />
|-<br />
| 16 || 1 || 0.0071 || 1 node, multicore<br />
|-<br />
| 32 || 2 || 0.0045 || 1 node, multicore<br />
|-<br />
| 64 || 4 || 0.0058 || 2 nodes, verbs-smp<br />
|-<br />
| 128 || 8 || 0.0051 || 2 nodes, verbs-smp<br />
|-<br />
|}<br />
<br />
<!--T:40--><br />
From this table it is clear that there is no point in using more than 1 node for this system, since performance actually becomes worse with 2 or more nodes. Using only 1 node, it is best to use 1 GPU and 16 cores, as that has the greatest efficiency, but it is also acceptable to use 2 GPUs and 32 cores if you need your results quickly. Since on Graham GPU nodes your priority is charged the same for any job using up to 16 cores and 1 GPU, there is no benefit in running with only 8 or 4 cores in this case.<br />
<br />
<!--T:41--><br />
Finally, you have to decide whether to run with or without GPUs for this simulation. From our numbers we can see that on a full GPU node of Graham (32 cores, 2 GPUs) the job runs faster than it would on 4 non-GPU nodes of Graham. Since a GPU node on Graham costs about twice what a non-GPU node costs, in this case it is more cost-effective to run with GPUs. You should run with GPUs if possible; however, given that there are fewer GPU than CPU nodes, you may need to consider submitting non-GPU jobs if your waiting time for GPU jobs is too long.<br />
<br />
= References = <!--T:23--><br />
* Downloads: http://www.ks.uiuc.edu/Development/Download/download.cgi?PackageName=NAMD -- Registration is required to download the software.<br />
*[http://www.ks.uiuc.edu/Research/namd/2.12/ug/ NAMD User's Guide for version 2.12]<br />
*[http://www.ks.uiuc.edu/Research/namd/2.12/notes.html NAMD version 2.12 release notes]<br />
* Tutorials: http://www.ks.uiuc.edu/Training/Tutorials/<br />
<br />
</translate></div>Kaizaadhttps://docs.alliancecan.ca/mediawiki/index.php?title=NAMD&diff=102502NAMD2021-08-04T17:29:55Z<p>Kaizaad: /* UCX GPU jobs */</p>
<hr />
<div><languages /><br />
[[Category:Software]][[Category:BiomolecularSimulation]]<br />
<br />
<translate><br />
<br />
<!--T:24--><br />
[http://www.ks.uiuc.edu/Research/namd/ NAMD] is a parallel, object-oriented molecular dynamics code designed for high-performance simulation of large biomolecular systems. <br />
Simulation preparation and analysis are integrated into the [[VMD]] visualization package.<br />
<br />
<br />
= Installation = <!--T:22--><br />
NAMD is installed by the Compute Canada software team and is available as a module. If a new version is required or if for some reason you need to do your own installation, please contact [[Technical support]]. You can also ask for details of how our NAMD modules were compiled.<br />
<br />
= Environment modules = <!--T:4--><br />
<br />
<!--T:48--><br />
The latest version of NAMD is 2.14 and it has been installed on all clusters. We recommend users run the newest version.<br />
<br />
<!--T:49--><br />
Older versions 2.13 and 2.12 are also available.<br />
<br />
<!--T:50--><br />
To run jobs that span nodes, use the OFI versions on Cedar and the UCX versions on other clusters.<br />
<br />
= Submission scripts = <!--T:13--><br />
<br />
<!--T:14--><br />
Please refer to the [[Running jobs]] page for help on using the SLURM workload manager.<br />
<br />
== Serial and threaded jobs == <!--T:15--><br />
Below is a simple job script for a serial simulation (using only one core). You can increase the value of <code>--cpus-per-task</code> to use more cores, up to the maximum number of cores available on a cluster node.<br />
</translate><br />
{{File<br />
|name=serial_namd_job.sh<br />
|lang="sh"<br />
|contents=<br />
#!/bin/bash<br />
#<br />
#SBATCH --cpus-per-task=1<br />
#SBATCH --mem 2048 # memory in MB, increase as needed <br />
#SBATCH -o slurm.%N.%j.out # STDOUT file<br />
#SBATCH -t 0:05:00 # time (D-HH:MM), increase as needed<br />
#SBATCH --account=def-specifyaccount<br />
<br />
module load StdEnv/2020<br />
module load namd-multicore/2.14<br />
namd2 +p$SLURM_CPUS_PER_TASK +idlepoll apoa1.namd<br />
}}<br />
<translate><br />
<br />
== Parallel CPU jobs == <!--T:61--><br />
<br />
=== MPI jobs === <!--T:18--><br />
'''NOTE''': MPI should not be used. Instead use OFI on Cedar and UCX on other clusters.<br />
<br />
=== Verbs jobs === <!--T:16--><br />
<br />
<!--T:51--><br />
'''NOTE''': For NAMD 2.14, use the OFI version on Cedar and the UCX version on other clusters. The instructions below apply only to NAMD versions 2.13 and 2.12.<br />
<br />
<!--T:52--><br />
These provisional instructions will be refined further once this configuration can be fully tested on the new clusters.<br />
This example uses 64 processes in total on 2 nodes, each node running 32 processes, thus fully utilizing its 32 cores. This script assumes full nodes are used, thus <code>ntasks-per-node</code> should be 32 (on Graham). For best performance, NAMD jobs should use full nodes.<br />
<br />
<!--T:17--><br />
'''NOTES''':<br />
*Verbs versions will not run on Cedar because of its different interconnect; use the MPI version instead.<br />
*Verbs versions will not run on Béluga either because of its incompatible InfiniBand kernel drivers; use the UCX version instead.<br />
</translate><br />
{{File<br />
|name=verbs_namd_job.sh<br />
|lang="sh"<br />
|contents=<br />
#!/bin/bash<br />
#<br />
#SBATCH --ntasks 64 # number of tasks<br />
#SBATCH --nodes=2<br />
#SBATCH --ntasks-per-node=32<br />
#SBATCH --mem=0 # memory per node, 0 means all memory<br />
#SBATCH -o slurm.%N.%j.out # STDOUT<br />
#SBATCH -t 0:05:00 # time (D-HH:MM)<br />
#SBATCH --account=def-specifyaccount<br />
<br />
NODEFILE=nodefile.dat<br />
slurm_hl2hl.py --format CHARM > $NODEFILE<br />
P=$SLURM_NTASKS<br />
<br />
module load namd-verbs/2.12<br />
CHARMRUN=`which charmrun`<br />
NAMD2=`which namd2`<br />
$CHARMRUN ++p $P ++nodelist $NODEFILE $NAMD2 +idlepoll apoa1.namd<br />
}}<br />
<translate><br />
<br />
=== UCX jobs === <!--T:42--><br />
This example uses 80 processes in total on 2 nodes, each node running 40 processes, thus fully utilizing its 40 cores. This script assumes full nodes are used, thus <code>ntasks-per-node</code> should be 40 (on Béluga). For best performance, NAMD jobs should use full nodes.<br />
<br />
<br />
<!--T:43--><br />
'''NOTE''': UCX versions will not run on Cedar because of its different interconnect. Use the OFI version instead.<br />
</translate><br />
{{File<br />
|name=ucx_namd_job.sh<br />
|lang="sh"<br />
|contents=<br />
#!/bin/bash<br />
#<br />
#SBATCH --ntasks 80 # number of tasks<br />
#SBATCH --nodes=2<br />
#SBATCH --ntasks-per-node=40<br />
#SBATCH --mem=0 # memory per node, 0 means all memory<br />
#SBATCH -o slurm.%N.%j.out # STDOUT<br />
#SBATCH -t 0:05:00 # time (D-HH:MM)<br />
#SBATCH --account=def-specifyaccount<br />
<br />
module load StdEnv/2020 namd-ucx/2.14<br />
srun --mpi=pmi2 namd2 apoa1.namd<br />
}}<br />
<translate><br />
<br />
=== OFI jobs === <!--T:53--><br />
<br />
<!--T:54--><br />
'''NOTE''': OFI versions will run '''ONLY''' on Cedar because of its different interconnect. <br />
</translate><br />
{{File<br />
|name=ofi_namd_job.sh<br />
|lang="sh"<br />
|contents=<br />
#!/bin/bash<br />
#SBATCH --account=def-specifyaccount<br />
#SBATCH --ntasks 64 # number of tasks<br />
#SBATCH --nodes=2<br />
#SBATCH --ntasks-per-node=32<br />
#SBATCH -t 0:05:00 # time (D-HH:MM)<br />
#SBATCH --mem=0 # memory per node, 0 means all memory<br />
#SBATCH -o slurm.%N.%j.out # STDOUT<br />
<br />
<!--T:55--><br />
module load StdEnv/2020 namd-ofi/2.14<br />
srun --mpi=pmi2 namd2 stmv.namd <br />
}}<br />
<translate><br />
<br />
== Single GPU jobs == <!--T:19--><br />
This example uses 8 CPU cores and 1 P100 GPU on a single node.<br />
</translate><br />
{{File<br />
|name=multicore_gpu_namd_job.sh<br />
|lang="sh"<br />
|contents=<br />
#!/bin/bash<br />
#<br />
#SBATCH --cpus-per-task=8 <br />
#SBATCH --mem 2048 <br />
#SBATCH -o slurm.%N.%j.out # STDOUT<br />
#SBATCH -t 0:05:00 # time (D-HH:MM)<br />
#SBATCH --gres=gpu:p100:1<br />
#SBATCH --account=def-specifyaccount<br />
<br />
<br />
module load StdEnv/2020<br />
module load cuda/11.0<br />
module load namd-multicore/2.14<br />
namd2 +p$SLURM_CPUS_PER_TASK +idlepoll apoa1.namd<br />
}}<br />
<br />
<translate><br />
<br />
== Parallel GPU jobs == <!--T:44--><br />
=== UCX GPU jobs ===<br />
This example is for Béluga and assumes that full nodes are used, which gives the best performance for NAMD jobs. It uses 8 processes in total on 2 nodes, each process (task) using 10 threads and 1 GPU. This fully utilizes Béluga GPU nodes, which have 40 cores and 4 GPUs per node. Note that 1 core per task has to be reserved for a communication thread, so NAMD will report that only 72 cores are being used; this is normal.<br />
<br />
<!--T:45--><br />
To use this script on other clusters, please look up the specifications of their available nodes and adjust the <code>--cpus-per-task</code> and <code>--gres=gpu:</code> options accordingly.<br />
<br />
<!--T:46--><br />
'''NOTE''': UCX versions will not run on Cedar because of its different interconnect. Use the OFI version instead.<br />
{{File<br />
|name=ucx_namd_job.sh<br />
|lang="sh"<br />
|contents=<br />
#!/bin/bash<br />
#SBATCH --ntasks 8 # number of tasks<br />
#SBATCH --nodes 2 <br />
#SBATCH --cpus-per-task=10 # number of threads per task (process)<br />
#SBATCH --gres=gpu:p100:4<br />
#SBATCH --mem=0 # memory per node, 0 means all memory<br />
#SBATCH -o slurm.%N.%j.out # STDOUT<br />
#SBATCH -t 0:05:00 # time (D-HH:MM)<br />
#SBATCH --account=def-specifyaccount<br />
<br />
<br />
<!--T:47--><br />
module load StdEnv/2020 intel/2020.1.217 cuda/11.0 namd-ucx-smp/2.14<br />
NUM_PES=$(expr $SLURM_CPUS_PER_TASK - 1 ) # one core per task is reserved for the communication thread<br />
srun --mpi=pmi2 namd2 ++ppn $NUM_PES apoa1.namd<br />
}}<br />
<br />
=== OFI GPU jobs === <!--T:56--><br />
<br />
<!--T:57--><br />
'''NOTE''': OFI versions will run '''ONLY''' on Cedar because of its different interconnect. <br />
{{File<br />
|name=ofi_namd_job.sh<br />
|lang="sh"<br />
|contents=<br />
#!/bin/bash<br />
#SBATCH --account=def-specifyaccount<br />
#SBATCH --ntasks 8 # number of tasks<br />
#SBATCH --nodes=2<br />
#SBATCH --cpus-per-task=6<br />
#SBATCH --gres=gpu:4<br />
#SBATCH -t 0:05:00 # time (D-HH:MM)<br />
#SBATCH --mem=0 # memory per node, 0 means all memory<br />
<br />
<!--T:58--><br />
module load StdEnv/2020 cuda/11.0 namd-ofi-smp/2.14<br />
NUM_PES=$(expr $SLURM_CPUS_PER_TASK - 1 ) # one core per task is reserved for the communication thread<br />
srun --mpi=pmi2 namd2 ++ppn $NUM_PES stmv.namd<br />
}}<br />
<br />
=== Verbs-GPU jobs === <!--T:20--><br />
<br />
<!--T:59--><br />
'''NOTE''': For NAMD 2.14, use the OFI GPU version on Cedar and the UCX GPU version on other clusters. The instructions below apply only to NAMD versions 2.13 and 2.12.<br />
<br />
<!--T:60--><br />
This example uses 64 processes in total on 2 nodes, each node running 32 processes, thus fully utilizing its 32 cores. Each node uses 2 GPUs, so the job uses 4 GPUs in total. This script assumes full nodes are used, thus <code>ntasks-per-node</code> should be 32 (on Graham). For best performance, NAMD jobs should use full nodes.<br />
<br />
<!--T:21--><br />
'''NOTE''': Verbs versions will not run on Cedar because of its different interconnect. <br />
</translate><br />
{{File<br />
|name=verbsgpu_namd_job.sh<br />
|lang="sh"<br />
|contents=<br />
#!/bin/bash<br />
#<br />
#SBATCH --ntasks 64 # number of tasks<br />
#SBATCH --nodes=2<br />
#SBATCH --ntasks-per-node=32<br />
#SBATCH --mem 0 # memory per node, 0 means all memory<br />
#SBATCH --gres=gpu:2<br />
#SBATCH -o slurm.%N.%j.out # STDOUT<br />
#SBATCH -t 0:05:00 # time (D-HH:MM)<br />
#SBATCH --account=def-specifyaccount<br />
<br />
slurm_hl2hl.py --format CHARM > nodefile.dat<br />
NODEFILE=nodefile.dat<br />
OMP_NUM_THREADS=32<br />
P=$SLURM_NTASKS<br />
<br />
module load cuda/8.0.44<br />
module load namd-verbs-smp/2.12<br />
CHARMRUN=`which charmrun`<br />
NAMD2=`which namd2`<br />
$CHARMRUN ++p $P ++ppn $OMP_NUM_THREADS ++nodelist $NODEFILE $NAMD2 +idlepoll apoa1.namd<br />
}}<br />
<translate><br />
<br />
=Benchmarking NAMD= <!--T:31--><br />
<br />
<!--T:32--><br />
This section shows an example of how you should conduct benchmarking of NAMD. Performance of NAMD will be different for different systems you are simulating, depending especially on the number of atoms in the simulation. Therefore, if you plan to spend a significant amount of time simulating a particular system, it would be very useful to conduct the kind of benchmarking shown below. Collecting and providing this kind of data is also very useful if you are applying for a RAC award.<br />
<br />
<!--T:33--><br />
For a good benchmark, please vary the number of steps so that your system runs for a few minutes and timing information is collected at reasonable intervals of at least a few seconds. If your run is too short, you might see fluctuations in your timing results.<br />
<br />
<!--T:34--><br />
The numbers below were obtained for the standard NAMD apoa1 benchmark. The benchmarking was conducted on the Graham cluster, which has CPU nodes with 32 cores and GPU nodes with 32 cores and 2 GPUs. Benchmarking on other clusters must take into account the different structure of their nodes.<br />
<br />
<!--T:35--><br />
In the results shown in the first table below, we used NAMD 2.12 from the verbs module. Efficiency is computed as (time with 1 core) / (N × (time with N cores)).<br />
<br />
<!--T:36--><br />
{| class="wikitable sortable"<br />
|-<br />
! # cores !! Wall time (s) per step !! Efficiency<br />
|-<br />
| 1 || 0.8313||100%<br />
|-<br />
| 2 || 0.4151||100%<br />
|-<br />
| 4 || 0.1945|| 107%<br />
|-<br />
| 8 || 0.0987|| 105%<br />
|-<br />
| 16 || 0.0501|| 104%<br />
|-<br />
| 32 || 0.0257|| 101%<br />
|-<br />
| 64 || 0.0133|| 98%<br />
|-<br />
| 128 || 0.0074|| 88%<br />
|-<br />
| 256 || 0.0036|| 90%<br />
|-<br />
| 512 || 0.0021|| 77%<br />
|-<br />
|}<br />
<br />
<!--T:37--><br />
These results show that for this system it is acceptable to use up to 256 cores. Keep in mind that if you ask for more cores, your jobs will wait in the queue for a longer time, affecting your overall throughput.<br />
<br />
<!--T:38--><br />
Now we perform benchmarking with GPUs. The NAMD multicore module is used for simulations that fit within 1 node, and the NAMD verbs-smp module is used for runs spanning nodes.<br />
<br />
<!--T:39--><br />
{| class="wikitable sortable"<br />
|-<br />
! # cores !! #GPUs !! Wall time (s) per step !! Notes<br />
|-<br />
| 4 || 1 || 0.0165 || 1 node, multicore<br />
|-<br />
| 8 || 1 || 0.0088 || 1 node, multicore<br />
|-<br />
| 16 || 1 || 0.0071 || 1 node, multicore<br />
|-<br />
| 32 || 2 || 0.0045 || 1 node, multicore<br />
|-<br />
| 64 || 4 || 0.0058 || 2 nodes, verbs-smp<br />
|-<br />
| 128 || 8 || 0.0051 || 2 nodes, verbs-smp<br />
|-<br />
|}<br />
<br />
<!--T:40--><br />
From this table it is clear that there is no point in using more than 1 node for this system, since performance actually becomes worse with 2 or more nodes. Using only 1 node, it is best to use 1 GPU and 16 cores, as that has the greatest efficiency, but it is also acceptable to use 2 GPUs and 32 cores if you need your results quickly. Since on Graham GPU nodes your priority is charged the same for any job using up to 16 cores and 1 GPU, there is no benefit in running with only 8 or 4 cores in this case.<br />
<br />
<!--T:41--><br />
Finally, you have to decide whether to run with or without GPUs for this simulation. From our numbers we can see that on a full GPU node of Graham (32 cores, 2 GPUs) the job runs faster than it would on 4 non-GPU nodes of Graham. Since a GPU node on Graham costs about twice what a non-GPU node costs, in this case it is more cost-effective to run with GPUs. You should run with GPUs if possible; however, given that there are fewer GPU than CPU nodes, you may need to consider submitting non-GPU jobs if your waiting time for GPU jobs is too long.<br />
<br />
= References = <!--T:23--><br />
* Downloads: http://www.ks.uiuc.edu/Development/Download/download.cgi?PackageName=NAMD -- Registration is required to download the software.<br />
*[http://www.ks.uiuc.edu/Research/namd/2.12/ug/ NAMD User's Guide for version 2.12]<br />
*[http://www.ks.uiuc.edu/Research/namd/2.12/notes.html NAMD version 2.12 release notes]<br />
* Tutorials: http://www.ks.uiuc.edu/Training/Tutorials/<br />
<br />
</translate></div>Kaizaadhttps://docs.alliancecan.ca/mediawiki/index.php?title=NAMD&diff=102501NAMD2021-08-04T17:28:58Z<p>Kaizaad: </p>
<hr />
<div><languages /><br />
[[Category:Software]][[Category:BiomolecularSimulation]]<br />
<br />
<translate><br />
<br />
<!--T:24--><br />
[http://www.ks.uiuc.edu/Research/namd/ NAMD] is a parallel, object-oriented molecular dynamics code designed for high-performance simulation of large biomolecular systems. <br />
Simulation preparation and analysis are integrated into the [[VMD]] visualization package.<br />
<br />
<br />
= Installation = <!--T:22--><br />
NAMD is installed by the Compute Canada software team and is available as a module. If a new version is required or if for some reason you need to do your own installation, please contact [[Technical support]]. You can also ask for details of how our NAMD modules were compiled.<br />
<br />
= Environment modules = <!--T:4--><br />
<br />
<!--T:48--><br />
The latest version of NAMD is 2.14 and it has been installed on all clusters. We recommend users run the newest version.<br />
<br />
<!--T:49--><br />
Older versions 2.13 and 2.12 are also available.<br />
<br />
<!--T:50--><br />
To run jobs that span nodes, use the OFI versions on Cedar and the UCX versions on other clusters.<br />
<br />
= Submission scripts = <!--T:13--><br />
<br />
<!--T:14--><br />
Please refer to the [[Running jobs]] page for help on using the SLURM workload manager.<br />
<br />
== Serial and threaded jobs == <!--T:15--><br />
Below is a simple job script for a serial simulation (using only one core). You can increase the value of <code>--cpus-per-task</code> to use more cores, up to the maximum number of cores available on a cluster node.<br />
</translate><br />
{{File<br />
|name=serial_namd_job.sh<br />
|lang="sh"<br />
|contents=<br />
#!/bin/bash<br />
#<br />
#SBATCH --cpus-per-task=1<br />
#SBATCH --mem 2048 # memory in MB, increase as needed <br />
#SBATCH -o slurm.%N.%j.out # STDOUT file<br />
#SBATCH -t 0:05:00 # time (D-HH:MM), increase as needed<br />
#SBATCH --account=def-specifyaccount<br />
<br />
module load StdEnv/2020<br />
module load namd-multicore/2.14<br />
namd2 +p$SLURM_CPUS_PER_TASK +idlepoll apoa1.namd<br />
}}<br />
<translate><br />
<br />
== Parallel CPU jobs == <!--T:61--><br />
<br />
=== MPI jobs === <!--T:18--><br />
'''NOTE''': MPI should not be used. Instead use OFI on Cedar and UCX on other clusters.<br />
<br />
=== Verbs jobs === <!--T:16--><br />
<br />
<!--T:51--><br />
'''NOTE''': For NAMD 2.14, use the OFI version on Cedar and the UCX version on other clusters. The instructions below apply only to NAMD versions 2.13 and 2.12.<br />
<br />
<!--T:52--><br />
These provisional instructions will be refined further once this configuration can be fully tested on the new clusters.<br />
This example uses 64 processes in total on 2 nodes, each node running 32 processes, thus fully utilizing its 32 cores. This script assumes full nodes are used, thus <code>ntasks-per-node</code> should be 32 (on Graham). For best performance, NAMD jobs should use full nodes.<br />
<br />
<!--T:17--><br />
'''NOTES''':<br />
*Verbs versions will not run on Cedar because of its different interconnect; use the MPI version instead.<br />
*Verbs versions will not run on Béluga either because of its incompatible InfiniBand kernel drivers; use the UCX version instead.<br />
</translate><br />
{{File<br />
|name=verbs_namd_job.sh<br />
|lang="sh"<br />
|contents=<br />
#!/bin/bash<br />
#<br />
#SBATCH --ntasks 64 # number of tasks<br />
#SBATCH --nodes=2<br />
#SBATCH --ntasks-per-node=32<br />
#SBATCH --mem=0 # memory per node, 0 means all memory<br />
#SBATCH -o slurm.%N.%j.out # STDOUT<br />
#SBATCH -t 0:05:00 # time (D-HH:MM)<br />
#SBATCH --account=def-specifyaccount<br />
<br />
NODEFILE=nodefile.dat<br />
slurm_hl2hl.py --format CHARM > $NODEFILE<br />
P=$SLURM_NTASKS<br />
<br />
module load namd-verbs/2.12<br />
CHARMRUN=`which charmrun`<br />
NAMD2=`which namd2`<br />
$CHARMRUN ++p $P ++nodelist $NODEFILE $NAMD2 +idlepoll apoa1.namd<br />
}}<br />
<translate><br />
<br />
=== UCX jobs === <!--T:42--><br />
This example uses 80 processes in total on 2 nodes, each node running 40 processes, thus fully utilizing its 40 cores. This script assumes full nodes are used, thus <code>ntasks-per-node</code> should be 40 (on Béluga). For best performance, NAMD jobs should use full nodes.<br />
<br />
<br />
<!--T:43--><br />
'''NOTE''': UCX versions will not run on Cedar because of its different interconnect. Use the OFI version instead.<br />
</translate><br />
{{File<br />
|name=ucx_namd_job.sh<br />
|lang="sh"<br />
|contents=<br />
#!/bin/bash<br />
#<br />
#SBATCH --ntasks 80 # number of tasks<br />
#SBATCH --nodes=2<br />
#SBATCH --ntasks-per-node=40<br />
#SBATCH --mem=0 # memory per node, 0 means all memory<br />
#SBATCH -o slurm.%N.%j.out # STDOUT<br />
#SBATCH -t 0:05:00 # time (D-HH:MM)<br />
#SBATCH --account=def-specifyaccount<br />
<br />
module load StdEnv/2020 namd-ucx/2.14<br />
srun --mpi=pmi2 namd2 apoa1.namd<br />
}}<br />
<translate><br />
<br />
=== OFI jobs === <!--T:53--><br />
<br />
<!--T:54--><br />
'''NOTE''': OFI versions will run '''ONLY''' on Cedar because of its different interconnect. <br />
</translate><br />
{{File<br />
|name=ofi_namd_job.sh<br />
|lang="sh"<br />
|contents=<br />
#!/bin/bash<br />
#SBATCH --account=def-specifyaccount<br />
#SBATCH --ntasks 64 # number of tasks<br />
#SBATCH --nodes=2<br />
#SBATCH --ntasks-per-node=32<br />
#SBATCH -t 0:05:00 # time (D-HH:MM)<br />
#SBATCH --mem=0 # memory per node, 0 means all memory<br />
#SBATCH -o slurm.%N.%j.out # STDOUT<br />
<br />
<!--T:55--><br />
module load StdEnv/2020 namd-ofi/2.14<br />
srun --mpi=pmi2 namd2 stmv.namd <br />
}}<br />
<translate><br />
<br />
== Single GPU jobs == <!--T:19--><br />
This example uses 8 CPU cores and 1 P100 GPU on a single node.<br />
</translate><br />
{{File<br />
|name=multicore_gpu_namd_job.sh<br />
|lang="sh"<br />
|contents=<br />
#!/bin/bash<br />
#<br />
#SBATCH --cpus-per-task=8 <br />
#SBATCH --mem 2048 <br />
#SBATCH -o slurm.%N.%j.out # STDOUT<br />
#SBATCH -t 0:05:00 # time (D-HH:MM)<br />
#SBATCH --gres=gpu:p100:1<br />
#SBATCH --account=def-specifyaccount<br />
<br />
<br />
module load StdEnv/2020<br />
module load cuda/11.0<br />
module load namd-multicore/2.14<br />
namd2 +p$SLURM_CPUS_PER_TASK +idlepoll apoa1.namd<br />
}}<br />
<br />
<translate><br />
<br />
== Parallel GPU jobs == <!--T:44--><br />
=== UCX GPU jobs ===<br />
This example is for Béluga and assumes that full nodes are used, which gives the best performance for NAMD jobs. It uses 8 processes in total on 2 nodes, each process (task) using 10 threads and 1 GPU. This fully utilizes Béluga GPU nodes, which have 40 cores and 4 GPUs per node. Note that 1 core per task has to be reserved for a communication thread, so NAMD will report that only 72 cores are being used; this is normal.<br />
<br />
<!--T:45--><br />
To use this script on other clusters, please look up the specifications of their available nodes and adjust the <code>--cpus-per-task</code> and <code>--gres=gpu:</code> options accordingly.<br />
<br />
<!--T:46--><br />
'''NOTE''': UCX versions will not run on Cedar because of its different interconnect. Use the OFI version instead.<br />
{{File<br />
|name=ucx_namd_job.sh<br />
|lang="sh"<br />
|contents=<br />
#!/bin/bash<br />
#SBATCH --ntasks 8 # number of tasks<br />
#SBATCH --nodes 2 <br />
#SBATCH --cpus-per-task=10 # number of threads per task (process)<br />
#SBATCH --gres=gpu:4<br />
#SBATCH --mem=0 # memory per node, 0 means all memory<br />
#SBATCH -o slurm.%N.%j.out # STDOUT<br />
#SBATCH -t 0:05:00 # time (D-HH:MM)<br />
#SBATCH --account=def-specifyaccount<br />
<br />
<br />
<!--T:47--><br />
module load StdEnv/2020 intel/2020.1.217 cuda/11.0 namd-ucx-smp/2.14<br />
NUM_PES=$(expr $SLURM_CPUS_PER_TASK - 1 ) # one core per task is reserved for the communication thread<br />
srun --mpi=pmi2 namd2 ++ppn $NUM_PES apoa1.namd<br />
}}<br />
<br />
=== OFI GPU jobs === <!--T:56--><br />
<br />
<!--T:57--><br />
'''NOTE''': OFI versions will run '''ONLY''' on Cedar because of its different interconnect. <br />
{{File<br />
|name=ofi_namd_job.sh<br />
|lang="sh"<br />
|contents=<br />
#!/bin/bash<br />
#SBATCH --account=def-specifyaccount<br />
#SBATCH --ntasks 8 # number of tasks<br />
#SBATCH --nodes=2<br />
#SBATCH --cpus-per-task=6<br />
#SBATCH --gres=gpu:4<br />
#SBATCH -t 0:05:00 # time (D-HH:MM)<br />
#SBATCH --mem=0 # memory per node, 0 means all memory<br />
<br />
<!--T:58--><br />
module load StdEnv/2020 cuda/11.0 namd-ofi-smp/2.14<br />
NUM_PES=$(expr $SLURM_CPUS_PER_TASK - 1 ) # one core per task is reserved for the communication thread<br />
srun --mpi=pmi2 namd2 ++ppn $NUM_PES stmv.namd<br />
}}<br />
<br />
=== Verbs-GPU jobs === <!--T:20--><br />
<br />
<!--T:59--><br />
'''NOTE''': For NAMD 2.14, use the OFI GPU version on Cedar and the UCX GPU version on other clusters. The instructions below apply only to NAMD versions 2.13 and 2.12.<br />
<br />
<!--T:60--><br />
This example uses 64 processes in total on 2 nodes, each node running 32 processes, thus fully utilizing its 32 cores. Each node uses 2 GPUs, so the job uses 4 GPUs in total. This script assumes full nodes are used, thus <code>ntasks-per-node</code> should be 32 (on Graham). For best performance, NAMD jobs should use full nodes.<br />
<br />
<!--T:21--><br />
'''NOTE''': Verbs versions will not run on Cedar because of its different interconnect. <br />
</translate><br />
{{File<br />
|name=verbsgpu_namd_job.sh<br />
|lang="sh"<br />
|contents=<br />
#!/bin/bash<br />
#<br />
#SBATCH --ntasks 64 # number of tasks<br />
#SBATCH --nodes=2<br />
#SBATCH --ntasks-per-node=32<br />
#SBATCH --mem 0 # memory per node, 0 means all memory<br />
#SBATCH --gres=gpu:2<br />
#SBATCH -o slurm.%N.%j.out # STDOUT<br />
#SBATCH -t 0:05:00 # time (D-HH:MM)<br />
#SBATCH --account=def-specifyaccount<br />
<br />
slurm_hl2hl.py --format CHARM > nodefile.dat<br />
NODEFILE=nodefile.dat<br />
OMP_NUM_THREADS=32<br />
P=$SLURM_NTASKS<br />
<br />
module load cuda/8.0.44<br />
module load namd-verbs-smp/2.12<br />
CHARMRUN=`which charmrun`<br />
NAMD2=`which namd2`<br />
$CHARMRUN ++p $P ++ppn $OMP_NUM_THREADS ++nodelist $NODEFILE $NAMD2 +idlepoll apoa1.namd<br />
}}<br />
<translate><br />
<br />
=Benchmarking NAMD= <!--T:31--><br />
<br />
<!--T:32--><br />
This section shows an example of how you should conduct benchmarking of NAMD. Performance of NAMD will be different for different systems you are simulating, depending especially on the number of atoms in the simulation. Therefore, if you plan to spend a significant amount of time simulating a particular system, it would be very useful to conduct the kind of benchmarking shown below. Collecting and providing this kind of data is also very useful if you are applying for a RAC award.<br />
<br />
<!--T:33--><br />
For a good benchmark, please vary the number of steps so that your system runs for a few minutes and timing information is collected at reasonable intervals of at least a few seconds. If your run is too short, you might see fluctuations in your timing results.<br />
<br />
<!--T:34--><br />
The numbers below were obtained for the standard NAMD apoa1 benchmark. The benchmarking was conducted on the Graham cluster, which has CPU nodes with 32 cores and GPU nodes with 32 cores and 2 GPUs. Benchmarking on other clusters must take into account the different structure of their nodes.<br />
<br />
<!--T:35--><br />
In the results shown in the first table below, we used NAMD 2.12 from the verbs module. Efficiency is computed as (time with 1 core) / (N × (time with N cores)).<br />
<br />
<!--T:36--><br />
{| class="wikitable sortable"<br />
|-<br />
! # cores !! Wall time (s) per step !! Efficiency<br />
|-<br />
| 1 || 0.8313||100%<br />
|-<br />
| 2 || 0.4151||100%<br />
|-<br />
| 4 || 0.1945|| 107%<br />
|-<br />
| 8 || 0.0987|| 105%<br />
|-<br />
| 16 || 0.0501|| 104%<br />
|-<br />
| 32 || 0.0257|| 101%<br />
|-<br />
| 64 || 0.0133|| 98%<br />
|-<br />
| 128 || 0.0074|| 88%<br />
|-<br />
| 256 || 0.0036|| 90%<br />
|-<br />
| 512 || 0.0021|| 77%<br />
|-<br />
|}<br />
<br />
<!--T:37--><br />
These results show that for this system it is acceptable to use up to 256 cores. Keep in mind that if you ask for more cores, your jobs will wait in the queue for a longer time, affecting your overall throughput.<br />
<br />
<!--T:38--><br />
Now we perform benchmarking with GPUs. The NAMD multicore module is used for simulations that fit within 1 node, and the NAMD verbs-smp module is used for runs spanning nodes.<br />
<br />
<!--T:39--><br />
{| class="wikitable sortable"<br />
|-<br />
! # cores !! #GPUs !! Wall time (s) per step !! Notes<br />
|-<br />
| 4 || 1 || 0.0165 || 1 node, multicore<br />
|-<br />
| 8 || 1 || 0.0088 || 1 node, multicore<br />
|-<br />
| 16 || 1 || 0.0071 || 1 node, multicore<br />
|-<br />
| 32 || 2 || 0.0045 || 1 node, multicore<br />
|-<br />
| 64 || 4 || 0.0058 || 2 nodes, verbs-smp<br />
|-<br />
| 128 || 8 || 0.0051 || 2 nodes, verbs-smp<br />
|-<br />
|}<br />
<br />
<!--T:40--><br />
From this table it is clear that there is no point in using more than 1 node for this system, since performance actually becomes worse with 2 or more nodes. Using only 1 node, it is best to use 1 GPU and 16 cores, as that has the greatest efficiency, but it is also acceptable to use 2 GPUs and 32 cores if you need your results quickly. Since on Graham GPU nodes your priority is charged the same for any job using up to 16 cores and 1 GPU, there is no benefit in running with only 8 or 4 cores in this case.<br />
<br />
<!--T:41--><br />
Finally, you have to decide whether to run with or without GPUs for this simulation. From our numbers we can see that on a full GPU node of Graham (32 cores, 2 GPUs) the job runs faster than it would on 4 non-GPU nodes of Graham. Since a GPU node on Graham costs about twice what a non-GPU node costs, in this case it is more cost-effective to run with GPUs. You should run with GPUs if possible; however, given that there are fewer GPU than CPU nodes, you may need to consider submitting non-GPU jobs if your waiting time for GPU jobs is too long.<br />
<br />
= References = <!--T:23--><br />
* Downloads: http://www.ks.uiuc.edu/Development/Download/download.cgi?PackageName=NAMD -- Registration is required to download the software.<br />
*[http://www.ks.uiuc.edu/Research/namd/2.12/ug/ NAMD User's Guide for version 2.12]<br />
*[http://www.ks.uiuc.edu/Research/namd/2.12/notes.html NAMD version 2.12 release notes]<br />
* Tutorials: http://www.ks.uiuc.edu/Training/Tutorials/<br />
<br />
</translate></div>Kaizaadhttps://docs.alliancecan.ca/mediawiki/index.php?title=Graham&diff=102499Graham2021-08-04T17:26:12Z<p>Kaizaad: </p>
<hr />
<div><noinclude><br />
<languages /><br />
<br />
<translate><br />
<!--T:27--><br />
</noinclude><br />
{| class="wikitable"<br />
|-<br />
| Availability: In production since June 2017<br />
|-<br />
| Login node: '''graham.computecanada.ca'''<br />
|-<br />
| Globus endpoint: '''computecanada#graham-dtn'''<br />
|-<br />
| Data transfer node (rsync, scp, sftp,...): '''gra-dtn1.computecanada.ca'''<br />
|}<br />
<br />
<!--T:2--><br />
Graham is a heterogeneous cluster, suitable for a variety of workloads, and located at the University of Waterloo. It is named after [https://en.wikipedia.org/wiki/Wes_Graham Wes Graham], the first director of the Computing Centre at Waterloo.<br />
<br />
<!--T:4--><br />
The parallel filesystem and external persistent storage (called "NDC-Waterloo" in some documents) are similar to [[Cedar|Cedar's]]. The interconnect is different and there is a slightly different mix of compute nodes.<br />
<br />
<!--T:28--><br />
The Graham cluster was purchased from Huawei Canada, Inc. in early 2017. It is entirely liquid cooled, using rear-door heat exchangers.<br />
<br />
<!--T:33--><br />
[[Getting started with the new national systems|Getting started with Graham]]<br />
<br />
<!--T:36--><br />
[[Running_jobs|How to run jobs]]<br />
<br />
<!--T:37--><br />
[[Transferring_data|Transferring data]]<br />
<br />
= Site-specific policies = <!--T:39--><br />
<br />
<!--T:40--><br />
* By policy, Graham's compute nodes cannot access the internet. If you need an exception to this rule, contact [[Technical Support|technical support]] with the following information:<br />
<br />
<!--T:42--><br />
<pre><br />
IP: <br />
Port/s: <br />
Protocol: TCP or UDP<br />
Contact: <br />
Removal Date: <br />
</pre><br />
<br />
<!--T:43--><br />
On or after the Removal Date we will follow up with the Contact to confirm if the exception is still required.<br />
<br />
<!--T:41--><br />
* Crontab is not offered on Graham. <br />
* Each job on Graham should have a duration of at least one hour (five minutes for test jobs).<br />
* A user cannot have more than 1000 jobs, running and queued, at any given moment. An array job is counted as the number of tasks in the array.<br />
<br />
=Storage= <!--T:23--><br />
<br />
<!--T:24--><br />
{| class="wikitable sortable"<br />
|-<br />
| '''Home space'''<br />64TB total volume ||<br />
* Location of home directories.<br />
* Each home directory has a small, fixed [[Storage and file management#Filesystem_quotas_and_policies|quota]]. <br />
* Not allocated via [https://www.computecanada.ca/research-portal/accessing-resources/rapid-access-service/ RAS] or [https://www.computecanada.ca/research-portal/accessing-resources/resource-allocation-competitions/ RAC]. Larger requests go to Project space.<br />
* Has daily backup.<br />
|-<br />
| '''Scratch space'''<br />3.6PB total volume<br />Parallel high-performance filesystem ||<br />
* For active or temporary (<code>/scratch</code>) storage.<br />
* Not allocated.<br />
* Large fixed [[Storage and file management#Filesystem_quotas_and_policies|quota]] per user.<br />
* Inactive data will be purged.<br />
|-<br />
|'''Project space'''<br />16PB total volume<br />External persistent storage<br />
||<br />
* Allocated via [https://www.computecanada.ca/research-portal/accessing-resources/rapid-access-service/ RAS] or [https://www.computecanada.ca/research-portal/accessing-resources/resource-allocation-competitions/ RAC].<br />
* Not designed for parallel I/O workloads. Use Scratch space instead.<br />
* Large adjustable [[Storage and file management#Filesystem_quotas_and_policies|quota]] per project.<br />
* Has daily backup.<br />
|}<br />
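<br />
To check how much of each of these quotas you are using, run the disk-usage utility from a login node (a sketch; see [[Storage and file management]] for details of the report):<br />
{{Command|diskusage_report}}<br />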
<br />
=High-performance interconnect= <!--T:19--><br />
<br />
<!--T:21--><br />
Graham uses Mellanox FDR (56Gb/s) and EDR (100Gb/s) InfiniBand interconnects. FDR is used for GPU and cloud nodes, EDR for other node types. A central 324-port director switch aggregates connections from islands of 1024 cores each for CPU and GPU nodes. The 56 cloud nodes are a variation on CPU nodes, and are on a single larger island sharing 8 FDR uplinks to the director switch.<br />
<br />
<!--T:29--><br />
A low-latency, high-bandwidth InfiniBand fabric connects all nodes and scratch storage.<br />
<br />
<!--T:30--><br />
Nodes configurable for cloud provisioning also have a 10Gb/s Ethernet network, with 40Gb/s uplinks to scratch storage.<br />
<br />
<!--T:22--><br />
Graham is designed to support multiple simultaneous parallel jobs of up to 1024 cores each in a fully non-blocking manner. <br />
<br />
<!--T:31--><br />
For larger jobs the interconnect has an 8:1 blocking factor; that is, even jobs running across multiple islands still benefit from a high-performance interconnect.<br />
<br />
<!--T:32--><br />
[https://docs.computecanada.ca/mediawiki/images/b/b3/Gp3-network-topo.png Graham high performance interconnect diagram]<br />
<br />
=Visualization on Graham= <!--T:44--><br />
<br />
<!--T:45--><br />
Graham has dedicated visualization nodes available at '''gra-vdi.computecanada.ca''' that allow only VNC connections. For instructions on how to use them, see the [[VNC]] page.<br />
<br />
=Node characteristics= <!--T:5--><br />
A total of 41,548 cores and 520 GPU devices, spread across 1,185 nodes of different types; note that Turbo Boost is activated for the ensemble of Graham nodes.<br />
<br />
<!--T:55--><br />
{| class="wikitable sortable"<br />
! nodes !! cores !! available memory !! CPU !! storage !! GPU<br />
|-<br />
| 903 || 32 || 125G or 128000M || 2 x Intel E5-2683 v4 Broadwell @ 2.1GHz || 960GB SATA SSD || -<br />
|-<br />
| 24 || 32 || 502G or 514500M || 2 x Intel E5-2683 v4 Broadwell @ 2.1GHz || 960GB SATA SSD || -<br />
|-<br />
| 56 || 32 || 250G or 256500M || 2 x Intel E5-2683 v4 Broadwell @ 2.1GHz || 960GB SATA SSD || -<br />
|-<br />
| 3 || 64 || 3022G or 3095000M || 4 x Intel E7-4850 v4 Broadwell @ 2.1GHz || 960GB SATA SSD || -<br />
|-<br />
| 160 || 32 || 124G or 127518M || 2 x Intel E5-2683 v4 Broadwell @ 2.1GHz || 1.6TB NVMe SSD || 2 x NVIDIA P100 Pascal (12GB HBM2 memory)<br />
|-<br />
| 7 || 28 || 178G or 183105M || 2 x Intel Xeon Gold 5120 Skylake @ 2.2GHz || 4.0TB NVMe SSD || 8 x NVIDIA V100 Volta (16GB HBM2 memory). <br />
Note that one node is only populated with 6 GPUs.<br />
|-<br />
| 2 || 40 || 377G or 386048M || 2 x Intel Xeon Gold 6248 Cascade Lake @ 2.5GHz || 5.0TB NVMe SSD || 8 x NVIDIA V100 Volta (32GB HBM2 memory), NVLINK<br />
|-<br />
| 6 || 16 || 192G or 196608M || 2 x Intel Xeon Silver 4110 Skylake @ 2.10GHz || 11.0TB SATA SSD || 4 x NVIDIA T4 Turing (16GB GDDR6 memory)<br />
|-<br />
| 30 || 44 || 192G or 196608M || 2 x Intel Xeon Gold 6238 Cascade Lake @ 2.10GHz || 5.8TB NVMe SSD || 4 x NVIDIA T4 Turing (16GB GDDR6 memory)<br />
|-<br />
| 72 || 44 || 192G or 196608M || 2 x Intel Xeon Gold 6238 Cascade Lake @ 2.10GHz || 879GB SATA SSD || -<br />
|}<br />
<br />
<!--T:64--><br />
Most applications will run on either Broadwell, Skylake, or Cascade Lake nodes, and performance differences are expected to be small compared to job waiting times. We therefore recommend that you do not select a specific node type for your jobs. If it is necessary, note that for CPU jobs only two constraints are available: use either <code>--constraint=broadwell</code> or <code>--constraint=cascade</code>. See [[Running_jobs#Cluster_particularities|how to specify the CPU architecture]].<br />
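<br />
For example, a minimal sketch of the start of a job script pinned to Cascade Lake nodes (the account name is a placeholder):<br />
 #!/bin/bash<br />
 #SBATCH --account=def-someuser # placeholder account<br />
 #SBATCH --constraint=cascade # run only on Cascade Lake nodes<br />
 #SBATCH --time=0-01:00<br />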
<br />
<!--T:7--><br />
Best practice for local on-node storage is to use the temporary directory generated by [[Running jobs|Slurm]], <tt>$SLURM_TMPDIR</tt>. Note that this directory and its contents will disappear upon job completion.<br />
<br />
<!--T:38--><br />
Note that the amount of available memory is less than the "round number" suggested by the hardware configuration. For instance, "base" nodes do have 128 GiB of RAM, but some of it is permanently occupied by the kernel and OS. To avoid wasting time on swapping/paging, the scheduler will never allocate a job whose memory requirements exceed the specified amount of "available" memory. Please also note that the memory allocated to the job must be sufficient for the I/O buffering performed by the kernel and filesystem; this means that an I/O-intensive job will often benefit from requesting somewhat more memory than the aggregate size of its processes.<br />
<br />
= GPUs on Graham = <!--T:56--><br />
Graham contains Tesla GPUs from three different generations, listed here in order of age, from oldest to newest.<br />
* P100 Pascal GPUs<br />
* V100 Volta GPUs (including 2 nodes with NVLINK interconnect)<br />
* T4 Turing GPUs<br />
<br />
<!--T:57--><br />
P100 is NVIDIA's all-purpose high-performance card. V100 is its successor, with about double the performance for standard computation and about 8x the performance for deep learning computations that can use its tensor cores. T4 Turing is the newest card, targeted specifically at deep learning workloads: it does not support efficient double-precision computation, but it offers good single-precision performance, tensor cores, and support for reduced-precision integer calculations.<br />
<br />
== Pascal GPU nodes on Graham == <!--T:58--><br />
<br />
<!--T:59--><br />
These are Graham's default GPU cards. Job submission for these cards is described on the [[Using GPUs with Slurm]] page. When a job simply requests a GPU with <code>--gres=gpu:1</code> or <code>--gres=gpu:2</code>, it may be assigned any type of available GPU. If you require a specific type of GPU, please request it. As all Pascal nodes have only two P100 GPUs, configuring jobs using these cards is relatively simple.<br />
<br />
==Volta GPU nodes on Graham== <!--T:46--><br />
Graham has a total of 9 Volta nodes.<br />
In 7 of these, four GPUs are connected to each CPU socket (except for one node, which is only populated with 6 GPUs, three per socket). The other 2 have high bandwidth NVLINK interconnect.<br />
<br />
<!--T:50--><br />
'''The nodes are available to all users with a maximum 7 days job runtime limit.''' <br />
<br />
<!--T:51--><br />
Below are example job scripts for these nodes; the first uses a single GPU and the second uses a whole 8-GPU node. The <code>module load</code> command ensures that modules compiled for the Skylake architecture are used. Replace <code>nvidia-smi</code> with the command you want to run.<br />
<br />
<!--T:52--><br />
'''Important''': You should scale the number of CPUs requested, keeping the ratio of CPUs to GPUs at 3.5 or less on 28-core nodes. For example, if you want to run a job using 4 GPUs, you should request '''at most 14 CPU cores'''. For a job with 1 GPU, you should request '''at most 3 CPU cores'''. Users are allowed to run a few short test jobs (shorter than 1 hour) that break this rule to see how their code performs.<br />
<br />
<!--T:65--><br />
The two newest Volta nodes have 40 cores so the number of cores requested per GPU should be adjusted upwards accordingly, i.e. you can use 5 CPU cores per GPU. They also have NVLINK, which can provide huge benefits for situations where memory bandwidth between GPUs is the bottleneck. To use one of these NVLINK nodes, it should be requested directly, by adding the option '''--nodelist=gra1337''' or '''--nodelist=gra1338''' to the job submission script.<br />
<br />
<!--T:53--><br />
Single-GPU example:<br />
{{File<br />
|name=gpu_single_GPU_job.sh<br />
|lang="sh"<br />
|contents=<br />
#!/bin/bash<br />
#SBATCH --account=def-someuser<br />
#SBATCH --gres=gpu:v100:1<br />
#SBATCH --cpus-per-task=3<br />
#SBATCH --mem=12G<br />
#SBATCH --time=1-00:00<br />
module load arch/avx512 StdEnv/2018.3<br />
nvidia-smi<br />
}}<br />
Full-node example:<br />
{{File<br />
|name=gpu_single_node_job.sh<br />
|lang="sh"<br />
|contents=<br />
#!/bin/bash<br />
#SBATCH --account=def-someuser<br />
#SBATCH --nodes=1<br />
#SBATCH --gres=gpu:v100:8       # all 8 GPUs of the node<br />
#SBATCH --cpus-per-task=28      # all 28 cores, i.e. 3.5 per GPU<br />
#SBATCH --mem=150G              # adjust as needed; these nodes have 178G of available memory<br />
#SBATCH --time=1-00:00<br />
module load arch/avx512 StdEnv/2018.3<br />
nvidia-smi<br />
}}<br />
<br />
<!--T:54--><br />
The Volta nodes have a fast local disk, which should be used for jobs if the amount of I/O performed by your job is significant. Inside the job, the location of the temporary directory on fast local disk is specified by the environment variable $SLURM_TMPDIR. You can copy your input files there at the start of your job script before you run your program and your output files out at the end of your job script. All the files in $SLURM_TMPDIR will be removed once the job ends, so you do not have to clean up that directory yourself. You can even create Python virtual environments in this temporary space for greater efficiency. Please see the [[Python#Creating_virtual_environments_inside_of_your_jobs|information on how to do this]].<br />
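<br />
As a sketch of this staging pattern inside a job script (the paths and the program name are placeholders):<br />
 cp /project/def-someuser/input.dat $SLURM_TMPDIR/ # stage input onto fast local disk<br />
 cd $SLURM_TMPDIR<br />
 ./program input.dat output.dat # run against the local copies<br />
 cp output.dat /project/def-someuser/ # copy results back before the job ends<br />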
<br />
==Turing GPU nodes on Graham== <!--T:60--><br />
<br />
<!--T:61--><br />
Using these nodes is similar to using the Volta nodes, except that when requesting them you should specify: <br />
<br />
<!--T:62--><br />
--gres=gpu:t4:2<br />
<br />
<!--T:63--><br />
In this example 2 T4 cards per node are requested.<br />
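<br />
For instance, a minimal sketch of a job script using two T4 cards (the account and runtime are placeholders):<br />
 #!/bin/bash<br />
 #SBATCH --account=def-someuser # placeholder account<br />
 #SBATCH --gres=gpu:t4:2 # two T4 GPUs on one node<br />
 #SBATCH --time=0-03:00<br />
 nvidia-smi<br />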
<br />
<br />
<br />
<!--T:14--><br />
<noinclude><br />
</translate><br />
</noinclude></div>Kaizaadhttps://docs.alliancecan.ca/mediawiki/index.php?title=Job_scheduling_policies&diff=97788Job scheduling policies2021-03-24T20:55:02Z<p>Kaizaad: Fix syntax</p>
<hr />
<div><languages/><br />
<translate><br />
<br />
<!--T:1--><br />
''Parent page: [[Running jobs]]''<br />
<br />
<!--T:2--><br />
You can do much work on Compute Canada clusters by [[Running jobs|submitting jobs]] <br />
that specify only the number of cores and a runtime limit.<br />
However if you submit large numbers of jobs, or jobs that require large<br />
amounts of resources, you may be able to improve your productivity<br />
by understanding the policies affecting job scheduling.<br />
<br />
===Priority and fair-share === <!--T:7--><br />
<br />
<!--T:4--><br />
The order in which jobs are considered for scheduling is determined by ''priority''. Priority on our systems is determined using the [https://slurm.schedmd.com/fair_tree.html Fair Tree] algorithm.<ref>A detailed description of Fair Tree can be found at https://slurm.schedmd.com/SC14/BYU_Fair_Tree.pdf, with references to early rock'n'roll music.</ref><br />
<br />
<!--T:38--><br />
Each job is charged to a Resource Allocation Project (RAP). <br />
You specify the project with the <code>--account</code> argument to <code>sbatch</code>.<br />
The project might hold a grant of CPU or GPU time from a [https://www.computecanada.ca/research-portal/accessing-resources/resource-allocation-competitions/ Resource Allocation Competition], in which case the account code will probably begin with <code>rrg-</code> or <code>rpp-</code>. Or it could be a non-RAC project, also known as a Rapid Access Service project, in which case the account code will probably begin with <code>def-</code>. See [[Running_jobs#Accounts_and_projects|Accounts and Projects]] for how to determine what account codes you can use.<br />
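<br />
For example, to charge a job to the imaginary project introduced below:<br />
 sbatch --account=def-prof1 my_job_script.sh<br />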
<br />
<!--T:39--><br />
Every project has a target usage level. Non-RAC projects all have equal target usage, while RAC projects have target usages determined by the number of CPU-years or GPU-years granted with each RAC award.<br />
<br />
<!--T:42--><br />
As an example let us imagine a research group with the account code <code>def-prof1</code>. Members of this imaginary group have user names <code>prof1, grad2</code> and <code>postdoc3</code>. We can examine the group's usage and share information with the <code>sshare</code> command as shown below. Note that we must append <code>_cpu</code> or <code>_gpu</code> to the end of the account code, as appropriate, since CPU and GPU use are tracked separately.<br />
<br />
<!--T:59--><br />
[prof1@gra-login4 ~]$ sshare -l -A def-prof1_cpu -u prof1,grad2,postdoc3<br />
Account User RawShares NormShares RawUsage ... EffectvUsage ... LevelFS ...<br />
-------------- ---------- ---------- ----------- -------- ... ------------ ... ---------- ...<br />
<span style="color:#ff0000">def-prof1_cpu 434086 0.001607 1512054 ... 0.000043 ... 37.357207 ...</span><br />
def-prof1_cpu prof1 1 0.100000 0 ... 0.000000 ... inf ... <br />
def-prof1_cpu grad2 1 0.100000 54618 ... 0.036122 ... 2.768390 ...<br />
def-prof1_cpu postdoc3 1 0.100000 855517 ... 0.565798 ... 0.176741 ...<br />
<br />
<!--T:43--><br />
The output shown above has been simplified by removing several fields which are not relevant to this discussion. <br />
Furthermore the line that is ''most'' important for scheduling is the first one, highlighted in red.<br />
This line describes the status of the project relative to all other projects using the cluster. <br />
In this example, the research group's share is 0.1607% and it has used 0.0043% of the cluster's resources. The group's LevelFS is about 37, which is quite high, since the group has used only a small fraction of its allocated share of resources. We would therefore expect jobs submitted by this group to have a fairly high priority.<br />
<br />
<!--T:60--><br />
Successive lines describe the status of each user relative to other users ''in this project''. <br />
Reading the third line: grad2 has 1 share allocated within the group, representing 10% of the group's allocation, but is responsible for only 3.6122% of the group's recent resource use, and therefore has a higher-than-average LevelFS within the group. We would expect jobs submitted by grad2 to have slightly higher priority than jobs submitted by postdoc3, but lower priority than jobs submitted by prof1.<br />
The priority of jobs belonging to the def-prof1 group, compared with jobs belonging to other research groups, is determined solely by the group's fair-share, not by each user's fair-share within the group.<br />
<br />
<!--T:61--><br />
The project by itself, or the user within a project, is referred to as an "association" in the Slurm documentation.<br />
* <code>Account</code>, obviously, is the project name with <code>_cpu</code> or <code>_gpu</code> appended.<br />
* <code>User</code>: Notice that the first line of output, the highlighted line, does not include a user name. <br />
* <code>RawShares</code> is proportional to the number of CPU-years that was granted to the project for use on this cluster in the Resource Allocation Competition. All non-RAC accounts have small equal numbers of shares. For numeric reasons, inactive accounts (which do not have pending or running jobs) are given only one share. Activity is checked periodically, so if you submit a job with an inactive account, it may take up to 15 minutes before the account shows the expected <code>RawShares</code> and <code>LevelFS</code>.<br />
* <code>NormShares</code> is the number of shares assigned to the user or account divided by the total number of assigned shares within the level. So for the first line, the NormShares of 0.001607 is the fraction of the shares held by the project, relative to all other projects. The NormShares of 0.10000 on the other three lines are the fraction of shares held by each member of the project relative to the other members. (This project has ten members, but we only asked for information about three.)<br />
* <code>RawUsage</code> is calculated from the total number of resource-seconds (that is, CPU time, GPU time, and memory) that have been charged to this account. Past usage is discounted with a [https://en.wikipedia.org/wiki/Half-life half-life] of one week, so usage more than a few weeks in the past will have only a small effect on priority.<br />
* <code>EffectvUsage</code> is the association's usage normalized with its parent; that is, the project's usage relative to other projects, the user's relative to other users in that project. In this example, <code>postdoc3</code> has 56.6% of the project's usage, and <code>grad2</code> has 3.6%.<br />
* <code>LevelFS</code> is the association's fairshare value compared to its siblings, calculated as NormShares / EffectvUsage. If an association is over-served, the value is between 0 and 1. If an association is under-served, the value is greater than 1. Associations with no usage receive the highest possible value, infinity. For inactive accounts, as described above for <code>RawShares</code>, this value equals a meaningless small number close to 0.0001.<br />
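<br />
As a check, we can recompute the highlighted project-level <code>LevelFS</code> from the <code>sshare</code> output above: NormShares / EffectvUsage = 0.001607 / 0.000043 ≈ 37.4, which matches the reported 37.357207 up to the rounding of the displayed values. Likewise for grad2: 0.100000 / 0.036122 ≈ 2.77, matching the reported 2.768390.<br />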
<br />
<!--T:40--><br />
A project which consistently uses its target amount will have a LevelFS near 1.0. If the project uses more than its target, then its LevelFS will be below 1.0 and the priority of new jobs belonging to that project will also be low. If the project uses less than its target usage then its LevelFS will be greater than 1.0 and new jobs will enjoy high priority. <br />
<br />
<!--T:54--><br />
'''See also:''' [[Allocations and resource scheduling]].<br />
<br />
=== Whole nodes versus cores === <!--T:12--><br />
<br />
<!--T:13--><br />
Parallel calculations which can efficiently use 32 or more cores may benefit from being scheduled on '''whole nodes'''. Part of a cluster may be reserved for jobs which request one or more entire nodes. See [[Advanced MPI scheduling#Whole_nodes|whole nodes]] on the page [[Advanced MPI scheduling]] for example scripts and further discussion.<br />
<br />
<!--T:15--><br />
Note that requesting an inefficient number of processors for a calculation simply to take advantage of whole-node scheduling will be construed as abuse of the system. For example, a program which takes just as long to run on 32 cores as on 16 cores should request <code>--ntasks=16</code>, not <code>--nodes=1 --ntasks-per-node=32</code>. (Although <code>--nodes=1 --ntasks-per-node=16</code> is fine if you need all the tasks to be on the same node.) Similarly, using whole nodes commits the user to a specific amount of memory: submitting whole-node jobs that underutilize memory is as abusive as underutilizing cores.<br />
<br />
<!--T:14--><br />
If you have huge amounts of serial work and can efficiently use [[GNU Parallel]], [[GLOST]], <br />
or [https://docs.scinet.utoronto.ca/index.php/Running_Serial_Jobs_on_Niagara other techniques] to pack <br />
serial processes onto a single node, you are also welcome to use whole-node scheduling.<br />
<br />
=== Time limits === <!--T:16--><br />
<br />
<!--T:17--><br />
[[Niagara]] accepts jobs of up to 24 hours run-time, [[Béluga/en|Béluga]] up to 7 days, and [[Cedar]] and [[Graham]] up to 28 days. <br />
<br />
<!--T:18--><br />
On the three general-purpose clusters, longer jobs are restricted to use only a fraction of the cluster by ''partitions''. There are partitions for jobs of<br />
* 3 hours or less,<br />
* 12 hours or less,<br />
* 24 hours (1 day) or less,<br />
* 72 hours (3 days) or less,<br />
* 7 days or less, and<br />
* 28 days or less<br />
Because any job of 3 hours is also less than 12 hours, 24 hours, and so on, shorter jobs can always run in partitions with longer time-limits. A shorter job will have more scheduling opportunities than an otherwise-identical longer job.<br />
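<br />
For example, a job submitted with<br />
 #SBATCH --time=0-03:00<br />
is eligible for all six partitions, while an otherwise identical job submitted with <code>--time=1-00:00</code> can only be scheduled in the 24-hour and longer partitions.<br />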
<br />
<!--T:55--><br />
On Béluga, a minimum time limit of one hour is also imposed. <br />
<br />
=== Backfilling === <!--T:19--><br />
<br />
<!--T:20--><br />
The scheduler employs [https://slurm.schedmd.com/sched_config.html backfilling] to improve<br />
overall system usage.<br />
<br />
<!--T:21--><br />
<blockquote><br />
Without backfill scheduling, each partition is scheduled strictly in priority order, which typically results in significantly lower system utilization and responsiveness than otherwise possible. Backfill scheduling will start lower priority jobs if doing so does not delay the expected start time of any higher priority jobs. Since the expected start time of pending jobs depends upon the expected completion time of running jobs, reasonably accurate time limits are important for backfill scheduling to work well.<br />
</blockquote><br />
<br />
<!--T:22--><br />
Backfilling will primarily benefit jobs with short time limits, e.g. under 3 hours.<br />
<br />
== Percentage of the nodes you have access to == <!--T:26--><br />
This section aims at giving some insight into how the general-purpose clusters (Cedar and Graham) are partitioned. <br />
<br />
<!--T:27--><br />
First, the nodes are partitioned into four different categories: <br />
* Base nodes, which have 4 or 8 GB of memory per core<br />
* Large memory nodes, which have 16 to 96 GB of memory per core<br />
* GPU nodes<br />
* Large GPU nodes (on Cedar only)<br />
Upon submission, your job will be routed to one of these categories based on what resources are requested. <br />
<br />
<!--T:28--><br />
Second, within each of the above categories, some nodes are reserved for jobs which can make use of complete nodes (i.e. jobs which use all of the resources available on the allocated nodes). If your job only uses a few cores (or a single core) out of each node, it is only allowed to use a subset of the category. These are referred to as "by-node" and "by-core" partitions.<br />
<br />
<!--T:29--><br />
Finally, the nodes are partitioned based on the walltime requested by your job. Shorter jobs have access to more resources. For example, a job with less than 3 hours of requested walltime can run on any node that allows 12-hour jobs, but there are nodes which accept 3-hour jobs that do ''not'' accept 12-hour jobs.<br />
<br />
<!--T:30--><br />
The utility <code>partition-stats</code> shows<br />
* how many jobs are waiting to run ("queued") in each partition,<br />
* how many jobs are currently running,<br />
* how many nodes are currently idle, and<br />
* how many nodes are assigned to each partition.<br />
Here is some sample output from <code>partition-stats</code>:<br />
<br />
<!--T:47--><br />
<pre><br />
[user@gra-login3 ~]$ partition-stats<br />
<br />
<!--T:48--><br />
Node type | Max walltime<br />
| 3 hr | 12 hr | 24 hr | 72 hr | 168 hr | 672 hr |<br />
----------|-------------------------------------------------------------<br />
Number of Queued Jobs by partition Type (by node:by core)<br />
----------|-------------------------------------------------------------<br />
Regular | 12:170 | 69:7066| 70:7335| 386:961 | 59:509 | 5:165 |<br />
Large Mem | 0:0 | 0:0 | 0:0 | 0:15 | 0:1 | 0:4 |<br />
GPU | 5:14 | 3:8 | 21:1 | 177:110 | 1:5 | 1:1 |<br />
----------|-------------------------------------------------------------<br />
Number of Running Jobs by partition Type (by node:by core)<br />
----------|-------------------------------------------------------------<br />
Regular | 8:32 | 10:854 | 84:10 | 15:65 | 0:674 | 1:26 |<br />
Large Mem | 0:0 | 0:0 | 0:0 | 0:1 | 0:0 | 0:0 |<br />
GPU | 5:0 | 2:13 | 47:20 | 19:18 | 0:3 | 0:0 |<br />
----------|-------------------------------------------------------------<br />
Number of Idle nodes by partition Type (by node:by core)<br />
----------|-------------------------------------------------------------<br />
Regular | 16:9 | 15:8 | 15:8 | 7:0 | 2:0 | 0:0 |<br />
Large Mem | 3:1 | 3:1 | 0:0 | 0:0 | 0:0 | 0:0 |<br />
GPU | 0:0 | 0:0 | 0:0 | 0:0 | 0:0 | 0:0 |<br />
----------|-------------------------------------------------------------<br />
Total Number of nodes by partition Type (by node:by core)<br />
----------|-------------------------------------------------------------<br />
Regular | 871:431 | 851:411 | 821:391 | 636:276 | 281:164 | 90:50 |<br />
Large Mem | 27:12 | 27:12 | 24:11 | 20:3 | 4:3 | 3:2 |<br />
GPU | 156:78 | 156:78 | 144:72 | 104:52 | 13:12 | 13:12 |<br />
----------|-------------------------------------------------------------<br />
</pre><br />
<br />
<!--T:49--><br />
Looking at the first entry in the table, at the upper left, the numbers <tt>12:170, 0:0</tt>, and <tt>5:14</tt> mean that there were<br />
* 12 jobs waiting to run which requested <br />
** whole nodes,<br />
** less than 8GB of memory per core, and <br />
** 3 hours or less of run time. <br />
* 170 jobs waiting which requested<br />
** less than whole nodes and were therefore waiting to be scheduled on individual cores,<br />
** less than 8GB memory per core, and<br />
** 3 hours or less of run time. <br />
* 5 jobs waiting which requested <br />
** a whole node equipped with GPUs and<br />
** 3 hours or less of run time.<br />
* 14 jobs waiting which requested<br />
** single GPUs and<br />
** 3 hours or less of run time.<br />
There were no jobs running or waiting which requested large-memory nodes and 3 hours of run time.<br />
<br />
<!--T:50--><br />
At the bottom of the table we find the division of resources by policy, independent of the immediate number of jobs. Hence there are 871 base nodes, called "regular" here (that is, nodes with 4 to 8 GB memory per core), which may receive whole-node jobs of up to 3 hours. Of those 871, <br />
* 431 of them may also receive by-core jobs of up to three hours, <br />
* 851 of them may receive whole-node jobs of up to 12 hours, <br />
* and so on.<br />
<br />
<!--T:51--><br />
It may help to think of these partitions as being like [https://en.wikipedia.org/wiki/Matryoshka_doll Matryoshka (Russian) dolls]. The 3-hour partition contains the nodes for the 12-hour partition as a subset. The 12-hour partition in turn contains the 24-hour partition, and so on.<br />
<br />
<!--T:52--><br />
The <code>partition-stats</code> utility does not give information about the number of cores represented by running or waiting jobs, nor the number of cores free in partly-assigned nodes in by-core partitions, nor about available memory associated with free cores in by-core partitions. <br />
<br />
<!--T:53--><br />
Running <code>partition-stats</code> is somewhat costly to the scheduler. Please do not write a script which automatically calls <code>partition-stats</code> repeatedly. If you have a workflow which you believe would benefit from automatic parsing of the information from <code>partition-stats</code>, please contact [[Technical support]] and ask for guidance.<br />
<br />
== Number of jobs == <!--T:56--><br />
<br />
<!--T:57--><br />
There may be a limit on the number of jobs you can have in the system at any one time. <br />
* On [[Graham]] and [[Béluga/en|Béluga]], normal accounts have a MaxSubmit limit of 1000.<br />
<br />
<!--T:58--><br />
[[Category:SLURM]]<br />
</translate></div>Kaizaadhttps://docs.alliancecan.ca/mediawiki/index.php?title=Graham&diff=96424Graham2021-03-03T19:58:15Z<p>Kaizaad: </p>
<hr />
<div><noinclude><br />
<languages /><br />
<br />
<translate><br />
<!--T:27--><br />
</noinclude><br />
{| class="wikitable"<br />
|-<br />
| Availability: In production since June 2017<br />
|-<br />
| Login node: '''graham.computecanada.ca'''<br />
|-<br />
| Globus endpoint: '''computecanada#graham-dtn'''<br />
|-<br />
| Data transfer node (rsync, scp, sftp,...): '''gra-dtn1.computecanada.ca'''<br />
|}<br />
<br />
<!--T:2--><br />
Graham is a heterogeneous cluster, suitable for a variety of workloads, and located at the University of Waterloo. It is named after [https://en.wikipedia.org/wiki/Wes_Graham Wes Graham], the first director of the Computing Centre at Waterloo.<br />
<br />
<!--T:4--><br />
The parallel filesystem and external persistent storage ([[National Data Cyberinfrastructure|NDC-Waterloo]]) are similar to [[Cedar|Cedar's]]. The interconnect is different and there is a slightly different mix of compute nodes.<br />
<br />
<!--T:28--><br />
The Graham cluster was purchased from Huawei Canada, Inc. in early 2017. It is entirely liquid cooled, using rear-door heat exchangers.<br />
<br />
<!--T:33--><br />
[[Getting started with the new national systems|Getting started with Graham]]<br />
<br />
<!--T:36--><br />
[[Running_jobs|How to run jobs]]<br />
<br />
<!--T:37--><br />
[[Transferring_data|Transferring data]]<br />
<br />
= Site-specific policies = <!--T:39--><br />
<br />
<!--T:40--><br />
* By policy, Graham's compute nodes cannot access the internet. If you need an exception to this rule, contact [[Technical Support|technical support]] with the following information:<br />
<br />
<!--T:42--><br />
<pre><br />
IP: <br />
Port/s: <br />
Protocol: TCP or UDP<br />
Contact: <br />
Removal Date: <br />
</pre><br />
<br />
<!--T:43--><br />
On or after the Removal Date we will follow up with the Contact to confirm if the exception is still required.<br />
<br />
<!--T:41--><br />
* Crontab is not offered on Graham. <br />
* Each job on Graham should have a duration of at least one hour (five minutes for test jobs).<br />
* A user cannot have more than 1000 jobs, running and queued, at any given moment. An array job is counted as the number of tasks in the array.<br />
<br />
=Storage= <!--T:23--><br />
<br />
<!--T:24--><br />
{| class="wikitable sortable"<br />
|-<br />
| '''Home space'''<br />64TB total volume ||<br />
* Location of home directories.<br />
* Each home directory has a small, fixed [[Storage and file management#Filesystem_quotas_and_policies|quota]]. <br />
* Not allocated via [https://www.computecanada.ca/research-portal/accessing-resources/rapid-access-service/ RAS] or [https://www.computecanada.ca/research-portal/accessing-resources/resource-allocation-competitions/ RAC]. Larger requests go to Project space.<br />
* Has daily backup.<br />
|-<br />
| '''Scratch space'''<br />3.6PB total volume<br />Parallel high-performance filesystem ||<br />
* For active or temporary (<code>/scratch</code>) storage.<br />
* Not allocated.<br />
* Large fixed [[Storage and file management#Filesystem_quotas_and_policies|quota]] per user.<br />
* Inactive data will be purged.<br />
|-<br />
|'''Project space'''<br />16PB total volume<br />External persistent storage<br />
||<br />
* Allocated via [https://www.computecanada.ca/research-portal/accessing-resources/rapid-access-service/ RAS] or [https://www.computecanada.ca/research-portal/accessing-resources/resource-allocation-competitions/ RAC].<br />
* Not designed for parallel I/O workloads. Use Scratch space instead.<br />
* Large adjustable [[Storage and file management#Filesystem_quotas_and_policies|quota]] per project.<br />
* Has daily backup.<br />
|}<br />
<br />
=High-performance interconnect= <!--T:19--><br />
<br />
<!--T:21--><br />
Mellanox FDR (56Gb/s) and EDR (100Gb/s) InfiniBand interconnect. FDR is used for GPU and cloud nodes, EDR for other node types. A central 324-port director switch aggregates connections from islands of 1024 cores each for CPU and GPU nodes. The 56 cloud nodes are a variation on CPU nodes, and are on a single larger island sharing 8 FDR uplinks to the director switch.<br />
<br />
<!--T:29--><br />
A low-latency, high-bandwidth InfiniBand fabric connects all nodes and scratch storage.<br />
<br />
<!--T:30--><br />
Nodes configurable for cloud provisioning also have a 10Gb/s Ethernet network, with 40Gb/s uplinks to scratch storage.<br />
<br />
<!--T:22--><br />
Graham is designed to support multiple simultaneous parallel jobs of up to 1024 cores each in a fully non-blocking manner. <br />
<br />
<!--T:31--><br />
For larger jobs the interconnect has an 8:1 blocking factor; that is, even jobs running across multiple islands still benefit from a high-performance interconnect.<br />
<br />
<!--T:32--><br />
[https://docs.computecanada.ca/mediawiki/images/b/b3/Gp3-network-topo.png Graham high performance interconnect diagram]<br />
<br />
=Visualization on Graham= <!--T:44--><br />
<br />
<!--T:45--><br />
Graham has dedicated visualization nodes available at '''gra-vdi.computecanada.ca''' that allow only VNC connections. For instructions on how to use them, see the [[VNC]] page.<br />
<br />
=Node characteristics= <!--T:5--><br />
A total of 41,548 cores and 520 GPU devices, spread across 1,185 nodes of different types; note that Turbo Boost is activated for the ensemble of Graham nodes.<br />
<br />
<!--T:55--><br />
{| class="wikitable sortable"<br />
! nodes !! cores !! available memory !! CPU !! storage !! GPU<br />
|-<br />
| 903 || 32 || 125G or 128000M || 2 x Intel E5-2683 v4 Broadwell @ 2.1GHz || 960GB SATA SSD || -<br />
|-<br />
| 24 || 32 || 502G or 514500M || 2 x Intel E5-2683 v4 Broadwell @ 2.1GHz || 960GB SATA SSD || -<br />
|-<br />
| 56 || 32 || 250G or 256500M || 2 x Intel E5-2683 v4 Broadwell @ 2.1GHz || 960GB SATA SSD || -<br />
|-<br />
| 3 || 64 || 3022G or 3095000M || 4 x Intel E7-4850 v4 Broadwell @ 2.1GHz || 960GB SATA SSD || -<br />
|-<br />
| 160 || 32 || 124G or 127518M || 2 x Intel E5-2683 v4 Broadwell @ 2.1GHz || 1.6TB NVMe SSD || 2 x NVIDIA P100 Pascal (12GB HBM2 memory)<br />
|-<br />
| 7 || 28 || 178G or 183105M || 2 x Intel Xeon Gold 5120 Skylake @ 2.2GHz || 4.0TB NVMe SSD || 8 x NVIDIA V100 Volta (16GB HBM2 memory). <br />
Note that one node is only populated with 6 GPUs.<br />
|-<br />
| 2 || 40 || 377G or 386048M || 2 x Intel Xeon Gold 6248 Cascade Lake @ 2.5GHz || 5.0TB NVMe SSD || 8 x NVIDIA V100 Volta (32GB HBM2 memory), NVLINK<br />
|-<br />
| 6 || 16 || 192G or 196608M || 2 x Intel Xeon Silver 4110 Skylake @ 2.10GHz || 11.0TB SATA SSD || 4 x NVIDIA T4 Turing (16GB GDDR6 memory)<br />
|-<br />
| 30 || 44 || 192G or 196608M || 2 x Intel Xeon Gold 6238 Cascade Lake @ 2.10GHz || 5.8TB NVMe SSD || 4 x NVIDIA T4 Turing (16GB GDDR6 memory)<br />
|-<br />
| 72 || 44 || 192G or 196608M || 2 x Intel Xeon Gold 6238 Cascade Lake @ 2.10GHz || 879GB SATA SSD || -<br />
|}<br />
<br />
<!--T:64--><br />
Most applications will run on either Broadwell, Skylake, or Cascade Lake nodes, and performance differences are expected to be small compared to job waiting times. We therefore recommend that you do not select a specific node type for your jobs. If it is necessary, note that for CPU jobs only two constraints are available: use either <code>--constraint=broadwell</code> or <code>--constraint=cascade</code>. See [[Running_jobs#Cluster_particularities|how to specify the CPU architecture]].<br />
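<br />
For example, a minimal sketch of the start of a job script pinned to Broadwell nodes (the account name is a placeholder):<br />
 #!/bin/bash<br />
 #SBATCH --account=def-someuser # placeholder account<br />
 #SBATCH --constraint=broadwell # run only on Broadwell nodes<br />
 #SBATCH --time=0-01:00<br />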
<br />
<!--T:7--><br />
Best practice for local on-node storage is to use the temporary directory generated by [[Running jobs|Slurm]], <tt>$SLURM_TMPDIR</tt>. Note that this directory and its contents will disappear upon job completion.<br />
<br />
<!--T:38--><br />
Note that the amount of available memory is less than the "round number" suggested by the hardware configuration. For instance, "base" nodes do have 128 GiB of RAM, but some of it is permanently occupied by the kernel and OS. To avoid wasting time on swapping/paging, the scheduler will never allocate a job whose memory requirements exceed the specified amount of "available" memory. Please also note that the memory allocated to the job must be sufficient for the I/O buffering performed by the kernel and filesystem; this means that an I/O-intensive job will often benefit from requesting somewhat more memory than the aggregate size of its processes.<br />
<br />
= GPUs on Graham = <!--T:56--><br />
Graham contains Tesla GPUs from three different generations, listed here in order of age, from oldest to newest.<br />
* P100 Pascal GPUs<br />
* V100 Volta GPUs (including 2 nodes with NVLINK interconnect)<br />
* T4 Turing GPUs<br />
<br />
<!--T:57--><br />
P100 is NVIDIA's all-purpose high-performance card. V100 is its successor, with about double the performance for standard computation and about 8x the performance for deep learning computations that can use its tensor cores. T4 Turing is the newest card, targeted specifically at deep learning workloads: it does not support efficient double-precision computation, but it offers good single-precision performance, tensor cores, and support for reduced-precision integer calculations.<br />
<br />
== Pascal GPU nodes on Graham == <!--T:58--><br />
<br />
<!--T:59--><br />
These are Graham's default GPU cards. Job submission for these cards is described on the [[Using GPUs with Slurm]] page. When a job simply requests a GPU with <code>--gres=gpu:1</code> or <code>--gres=gpu:2</code>, it will be assigned Pascal P100 cards. As all Pascal nodes have only two P100 GPUs, configuring jobs using these cards is relatively simple.<br />
<br />
==Volta GPU nodes on Graham== <!--T:46--><br />
Graham has a total of 9 Volta nodes.<br />
In 7 of these, four GPUs are connected to each CPU socket (except for one node, which is only populated with 6 GPUs, three per socket). The other 2 have high bandwidth NVLINK interconnect.<br />
<br />
<!--T:50--><br />
'''The nodes are available to all users with a maximum 7 days job runtime limit.''' <br />
<br />
<!--T:51--><br />
Below are example job scripts for these nodes; the first uses a single GPU and the second uses a whole 8-GPU node. The <code>module load</code> command ensures that modules compiled for the Skylake architecture are used. Replace <code>nvidia-smi</code> with the command you want to run.<br />
<br />
<!--T:52--><br />
'''Important''': You should scale the number of CPUs requested, keeping the ratio of CPUs to GPUs at 3.5 or less on 28-core nodes. For example, if you want to run a job using 4 GPUs, you should request '''at most 14 CPU cores'''. For a job with 1 GPU, you should request '''at most 3 CPU cores'''. Users are allowed to run a few short test jobs (shorter than 1 hour) that break this rule to see how their code performs.<br />
<br />
<!--T:65--><br />
The two newest Volta nodes have 40 cores so the number of cores requested per GPU should be adjusted upwards accordingly, i.e. you can use 5 CPU cores per GPU. They also have NVLINK, which can provide huge benefits for situations where memory bandwidth between GPUs is the bottleneck. To use one of these NVLINK nodes, it should be requested directly, by adding the option '''--nodelist=gra1337''' or '''--nodelist=gra1338''' to the job submission script.<br />
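<br />
For instance, a sketch of the directives pinning an 8-GPU job to one of these NVLINK nodes (apart from the node name, the values are placeholders following the 5-cores-per-GPU guideline above):<br />
 #SBATCH --nodelist=gra1337 # one of the two NVLINK nodes<br />
 #SBATCH --gres=gpu:v100:8 # all 8 GPUs of the node<br />
 #SBATCH --cpus-per-task=40 # 5 cores per GPU on these 40-core nodes<br />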
<br />
<!--T:53--><br />
Single-GPU example:<br />
{{File<br />
|name=gpu_single_GPU_job.sh<br />
|lang="sh"<br />
|contents=<br />
#!/bin/bash<br />
#SBATCH --account=def-someuser<br />
#SBATCH --gres=gpu:v100:1<br />
#SBATCH --cpus-per-task=3<br />
#SBATCH --mem=12G<br />
#SBATCH --time=1-00:00<br />
module load arch/avx512 StdEnv/2018.3<br />
nvidia-smi<br />
}}<br />
Full-node example:<br />
{{File<br />
|name=gpu_single_node_job.sh<br />
|lang="sh"<br />
|contents=<br />
#!/bin/bash<br />
#SBATCH --account=def-someuser<br />
#SBATCH --nodes=1<br />
#SBATCH --gres=gpu:v100:8       # all 8 GPUs of the node<br />
#SBATCH --cpus-per-task=28      # all 28 cores, i.e. 3.5 per GPU<br />
#SBATCH --mem=150G              # adjust as needed; these nodes have 178G of available memory<br />
#SBATCH --time=1-00:00<br />
module load arch/avx512 StdEnv/2018.3<br />
nvidia-smi<br />
}}<br />
<br />
<!--T:54--><br />
The Volta nodes have a fast local disk, which should be used for jobs if the amount of I/O performed by your job is significant. Inside the job, the location of the temporary directory on fast local disk is specified by the environment variable $SLURM_TMPDIR. You can copy your input files there at the start of your job script before you run your program and your output files out at the end of your job script. All the files in $SLURM_TMPDIR will be removed once the job ends, so you do not have to clean up that directory yourself. You can even create Python virtual environments in this temporary space for greater efficiency. Please see the [[Python#Creating_virtual_environments_inside_of_your_jobs|information on how to do this]].<br />
<br />
==Turing GPU nodes on Graham== <!--T:60--><br />
<br />
<!--T:61--><br />
Using these nodes is similar to using the Volta nodes, except that when requesting them you should specify: <br />
<br />
<!--T:62--><br />
--gres=gpu:t4:2<br />
<br />
<!--T:63--><br />
In this example 2 T4 cards per node are requested.<br />
<br />
<br />
<br />
<!--T:14--><br />
<noinclude><br />
</translate><br />
</noinclude></div>Kaizaadhttps://docs.alliancecan.ca/mediawiki/index.php?title=Frequently_Asked_Questions&diff=86678Frequently Asked Questions2020-07-22T19:18:28Z<p>Kaizaad: </p>
<hr />
<div><languages /><br />
__TOC__<br />
<br />
<translate><br />
== Forgot my password == <!--T:19--><br />
To reset your password for any Compute Canada national cluster, visit https://ccdb.computecanada.ca/security/forgot. <br />
<br />
==Text file line endings == <!--T:46--><br />
For historical reasons, Windows disagrees with most other operating systems, including Linux and OS X, on the convention used to denote the end of a line in a plain text ASCII file. Text files prepared in a Windows environment therefore have an additional invisible "carriage return" character at the end of each line, which can cause problems when the file is read in a Linux environment. For this reason you should either create and edit your text files on the cluster itself, using standard Linux text editors like emacs, vim and nano, or, if you prefer Windows, run the command <tt>dos2unix <filename></tt> on the cluster login node to convert the line endings of your file to the appropriate convention. <br />
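<br />
For example, to convert a script you uploaded from Windows (the file name is a placeholder):<br />
{{Command|dos2unix my_script.sh}}<br />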
<br />
== Moving files across the project, scratch and home filesystems == <!--T:40--><br />
On [[Béluga/en|Béluga]], [[Cedar]] and [[Graham]], the scratch and home filesystems have quotas that are per-user, while the [[Project layout|project filesystem]] has quotas that are per-project. Because the underlying implementation of quotas on the [[Lustre]] filesystem <br />
is currently based on group ownership of files, it is important to ensure that your files have the right group. On the scratch and home filesystems, the correct group is typically the group with the same name as your username. On the project filesystem, the group name should follow the pattern <tt>prefix-piusername</tt>, where <tt>prefix</tt> is typically one of <tt>def</tt>, <tt>rrg</tt> or <tt>rpp</tt>.<br />
<br />
=== Moving files between scratch and home filesystems === <!--T:41--><br />
Since the quotas of these two filesystems are based on your personal group, you should be able to move files across the two using <br />
{{Command|mv $HOME/scratch/some_file $HOME/some_file}}<br />
<br />
=== Moving files from scratch or home filesystems to project === <!--T:42--><br />
If you want to move files from your scratch or home space into a project space, you '''should not''' use the <tt>mv</tt> command. Instead, we recommend using the regular <tt>cp</tt> or <tt>rsync</tt> commands.<br />
<br />
<!--T:47--><br />
It is very important to run <tt>cp</tt> and <tt>rsync</tt> correctly to ensure that the files copied over to the project space have the correct group ownership. With <tt>cp</tt>, do not use the archive <tt>-a</tt> option. And when using <tt>rsync</tt>, make sure you specify the <tt>--no-g --no-p</tt> options, like so:<br />
<br />
<!--T:43--><br />
{{Command|rsync -axvH --no-g --no-p $HOME/scratch/some_directory $HOME/projects/<project>/some_other_directory}}<br />
<br />
<!--T:45--><br />
Once the files are copied, you can then delete them from your scratch space.<br />
<br />
=== Moving files from project to scratch or home filesystems === <!--T:55--><br />
If you want to move files from your project into your scratch or home space, you '''should not''' use the <tt>mv</tt> command. Instead, we recommend using the regular <tt>cp</tt> or <tt>rsync</tt> commands.<br />
<br />
<!--T:56--><br />
It is very important to run <tt>cp</tt> and <tt>rsync</tt> correctly to ensure that the files copied over to the project space have the correct group ownership. With <tt>cp</tt>, do not use the archive <tt>-a</tt> option. And when using <tt>rsync</tt>, make sure you specify the <tt>--no-g --no-p</tt> options, like so:<br />
<br />
<!--T:57--><br />
{{Command|rsync -axvH --no-g --no-p $HOME/projects/<project>/some_other_directory $HOME/scratch/some_directory}}<br />
<br />
=== Moving files between two project spaces === <!--T:60--><br />
If you want to move files between two project spaces, you '''should not''' use the <tt>mv</tt> command. Instead, we recommend using the regular <tt>cp</tt> or <tt>rsync</tt> commands.<br />
<br />
<!--T:61--><br />
It is very important to run <tt>cp</tt> or <tt>rsync</tt> correctly to ensure that the files copied over have the correct group ownership. With <tt>cp</tt>, do not use the archive <tt>-a</tt> option. And when using <tt>rsync</tt>, make sure you specify the <tt>--no-g --no-p</tt> options, like so:<br />
<br />
<!--T:62--><br />
{{Command|rsync -axvH --no-g --no-p $HOME/projects/<project>/some_other_directory $HOME/projects/<project2>/some_directory}}<br />
<br />
<!--T:63--><br />
'''Once you have copied your data over, please delete the old data.'''<br />
<br />
== ''Disk quota exceeded'' error on /project filesystems == <!--T:12--><br />
:''Also see: [[Project layout]]''<br />
Some users have seen this message or some similar quota error on their [[Project layout|project]] folders. Other users have reported obscure failures while transferring files into their <code>/project</code> folder from another cluster. Many of the problems reported are due to bad file ownership.<br />
<br />
<!--T:5--><br />
Use <code>diskusage_report</code> to see if you are at or over your quota:<br />
<source lang="bash"><br />
[ymartin@cedar5 ~]$ diskusage_report<br />
Description Space # of files<br />
Home (user ymartin) 345M/50G 9518/500k<br />
Scratch (user ymartin) 93M/20T 6532/1000k<br />
Project (group ymartin) 5472k/2048k 158/5000k<br />
Project (group def-zrichard) 20k/1000G 4/5000k<br />
</source><br />
<br />
<!--T:6--><br />
The example above illustrates a frequent problem: <code>/project</code> for user <code>ymartin</code> contains too much data in files belonging to group <code>ymartin</code>. The data should instead be in files belonging to <code>def-zrichard</code>. To see the project groups you may use, run the following command:<br />
stat -c %G $HOME/projects/*/<br />
<br />
<!--T:8--><br />
Note the two lines labelled <code>Project</code>.<br />
*<code>Project (group ymartin)</code> describes files belonging to group <code>ymartin</code>, which has the same name as the user. This user is the only member of this group, which has a very small quota (2048k). <br />
*<code>Project (group def-zrichard)</code> describes files belonging to a '''project group'''. Your account may be associated with one or more project groups, and they will typically have names like <code>def-zrichard</code>, <code>rrg-someprof-ab</code>, or <code>rpp-someprof</code>. <br />
<br />
<!--T:9--><br />
In this example, files have somehow been created belonging to group <code>ymartin</code> instead of group <code>def-zrichard</code>. This is neither the desired nor the expected behaviour. <br />
<br />
<!--T:2--><br />
By design, new files and directories in <code>/project</code> will normally be created belonging to a project group. The main reasons why files may be associated with the wrong group are<br />
* files were moved from <code>/home</code> to <code>/project</code> with the <code>mv</code> command; to avoid this, see the [[#Moving files between scratch and home filesystems | advice above]];<br />
* files were transferred from another cluster using [[Transferring_data#Rsync|rsync]] or [[Transferring_data#SCP|scp]] with an option to preserve the original group ownership. If you have a recurring problem with ownership, check the options you are using with your file transfer program;<br />
* the <tt>setgid</tt> bit is not set on your project directories.<br />
<br />
=== How to fix the problem === <!--T:48--><br />
If you already have data in your <code>/project</code> directory with the wrong group ownership, you can use <code>lfs find</code> to display those files:<br />
lfs find ~/projects/*/ -group $USER<br />
<br />
<!--T:49--><br />
Next, change group ownership from $USER to the project group, for example:<br />
chown -h -R $USER:def-professor -- ~/projects/def-professor/$USER/<br />
<br />
<!--T:50--><br />
Then, set the [[Sharing_data#Set_Group_ID_.28SGID.29|Set Group ID]] (SGID or <code>setgid</code>) bit on all directories to ensure that newly created files will inherit the directory's group membership, for example:<br />
lfs find ~/projects/def-professor/$USER -type d -print0 | xargs -0 chmod g+s<br />
<br />
<!--T:51--><br />
Finally, verify that the project space directories have the correct permissions set:<br />
chmod 2770 ~/projects/def-professor/<br />
chmod 2700 ~/projects/def-professor/$USER<br />
<br />
=== Another explanation === <!--T:23--><br />
Each file in Linux belongs to a user and to a group at the same time. By default, a file you create belongs to you, user '''username''', and to your group, which has the same name, '''username'''; that is, it is owned by '''username:username'''. Your group was created at the same time as your account, and you are the only user in that group. <br />
<br />
<!--T:24--><br />
This file ownership is good for your home directory and the scratch space, as shown here: <br />
<br />
<!--T:39--><br />
<pre><br />
Description Space # of files<br />
Home (user username) 15G/53G 74k/500k<br />
Scratch (user username) 1522G/100T 65k/1000k<br />
Project (group username) 34G/2048k 330/2048<br />
Project (group def-professor) 28k/1000G 9/500k<br />
</pre><br />
<br />
<!--T:26--><br />
The quotas on the first two lines are set for the user '''username'''.<br />
<br />
<!--T:27--><br />
The quotas on the other two lines are set for the groups '''username''' and '''def-professor''' in the project space. There, it is not important which users own the files; the group the files belong to determines which quota limit applies. <br />
<br />
<!--T:28--><br />
Files owned by your '''username''' group (your default group) have a very small quota in the project space, only 2MB, and you already have 34GB of data owned by that group (your files). This is why you cannot write more data there: you are trying to place data owned by a group that has almost no allocation in that space.<br />
<br />
<!--T:29--><br />
The allocation for the group '''def-professor''', your professor's group, on the other hand, uses almost no space and has a 1TB limit. Files placed there should have '''username:def-professor''' ownership. <br />
<br />
<!--T:30--><br />
Now, depending on how you copy your files and what software you use, that software will either respect the group ownership of the destination directory and apply the correct group, or insist on retaining the group ownership of the source data. In the latter case you will have a problem like the one described here.<br />
<br />
<!--T:31--><br />
Most probably your original data belongs to '''username:username'''; after moving it, it should properly belong to '''username:def-professor''', but your software insists on keeping the original ownership, and this causes the problem.<br />
<br />
== ''sbatch: error: Batch job submission failed: Socket timed out on send/recv operation'' == <!--T:10--><br />
<br />
<!--T:11--><br />
You may see this message when the load on the [[Running jobs|Slurm]] manager or scheduler process is too high. We are working both to improve Slurm's tolerance of that and to identify and eliminate the sources of load spikes, but that is a long-term project. The best advice we have currently is to wait a minute or so. Then run <code>squeue -u $USER</code> and see if the job you were trying to submit appears: in some cases the error message is delivered even though the job was accepted by Slurm. If it doesn't appear, simply submit it again.<br />
<br />
== Why are my jobs taking so long to start? == <!--T:20--><br />
You can see why your jobs are in the <tt>PD</tt> (pending) state by running the <tt>squeue -u <username></tt> command on the cluster.<br><br><br />
The <tt>(REASON)</tt> column typically has the values <tt>Resources</tt> or <tt>Priority</tt>.<br />
* <tt>Resources</tt>: The cluster is simply very busy and you will have to be patient, or perhaps consider whether you can submit a job that asks for fewer resources (e.g. CPUs/nodes, GPUs, memory, time).<br />
* <tt>Priority</tt>: Your job is waiting to start due to its lower priority. This is because you and other members of your research group have been over-consuming your fair share of the cluster resources in the recent past, something you can track using the command <tt>sshare</tt> as explained in [[Job scheduling policies]]. The <tt>LevelFS</tt> column gives you information about your over- or under-consumption of cluster resources: when <tt>LevelFS</tt> is greater than one, you are consuming fewer resources than your fair share, while if it is less than one you are consuming more. The more you over-consume resources, the closer the value gets to zero and the lower the priority of your pending jobs. There is a memory effect in this calculation, so the scheduler gradually "forgets" any over- or under-consumption of resources from months past. Finally, note that the value of <tt>LevelFS</tt> is specific to each cluster.<br />
<br />
== Why do my jobs show "Nodes required for job are DOWN, DRAINED or RESERVED for jobs in higher priority partitions" or "ReqNodeNotAvailable"? == <!--T:58--><br />
<br />
<!--T:59--><br />
This string may appear in the "Reason" field of <tt>squeue</tt> output for a waiting job, and is new to Slurm 19.05.<br />
It means just what it says: One or more of the nodes Slurm considered for the job are down, or deliberately taken offline,<br />
or are being reserved for other jobs. On a large busy cluster there will almost always be such nodes. The message means <br />
effectively the same thing as the reason "Resources" that appeared in Slurm version 17.11.<br />
<br />
== How accurate is START_TIME in <tt>squeue</tt> output? == <!--T:33--><br />
We don't show the start time by default with <tt>squeue</tt>, but it can be printed with an option. The start times Slurm forecasts depend on rapidly-changing conditions, and are therefore not very useful.<br />
<br />
<!--T:34--><br />
[[Running jobs|Slurm]] computes START_TIME for high-priority pending jobs. These expected start times are computed from currently-available information: <br />
* What resources will be freed by running jobs that complete; and<br />
* what resources will be needed by other, higher-priority jobs waiting to run.<br />
<br />
<!--T:35--><br />
Slurm invalidates these future plans: <br />
* if jobs end early, changing which resources become available; and<br />
* if prioritization changes, due to submission of higher-priority jobs or cancellation of queued jobs for example.<br />
<br />
<!--T:36--><br />
On Compute Canada general purpose clusters, new jobs are submitted about every five seconds, and 30-50% of jobs end early,<br />
so Slurm often discards and recomputes its future plans.<br />
<br />
<!--T:37--><br />
Most waiting jobs have a START_TIME of "N/A", which stands for "not available", meaning <tt>Slurm</tt> is not attempting to project a start time for them.<br />
<br />
<!--T:38--><br />
For jobs which are already running, the start time reported by <tt>squeue</tt> is perfectly accurate.<br />
<br />
==What are the .core files that I find in my directory?== <!--T:52--><br />
<br />
<!--T:53--><br />
In some instances a program which crashes or otherwise exits abnormally will leave behind a binary file called a core file, typically with the extension ".core", containing a snapshot of the program's state at the moment it crashed. While such files can be useful to programmers debugging the software in question, they are normally of no interest to regular users beyond indicating that something went wrong with the execution, which the job's output normally indicates already. You can therefore delete these files if you wish, and add the line <tt>ulimit -c 0</tt> to the end of your $HOME/.bashrc file to ensure that they are no longer created.<br />
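<br />
For example, you can append that setting to your <tt>.bashrc</tt> with:<br />
{{Command|echo "ulimit -c 0" >> $HOME/.bashrc}}<br />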
<br />
==How to fix library not found error== <!--T:54--><br />
When installing pre-compiled binary packages in your <code>$HOME</code>, they may fail with an error such as <tt>/lib64/libc.so.6: version `GLIBC_2.18' not found</tt> at runtime. See [https://docs.computecanada.ca/wiki/Installing_software_in_your_home_directory#Installing_binary_packages Installing binary packages] for how to fix this kind of issue.<br />
</translate></div>Kaizaadhttps://docs.alliancecan.ca/mediawiki/index.php?title=Using_GPUs_with_Slurm&diff=86271Using GPUs with Slurm2020-07-16T15:53:16Z<p>Kaizaad: </p>
<hr />
<div><languages /><br />
<translate><br />
<br />
<!--T:15--><br />
For general advice on job scheduling, see [[Running jobs]].<br />
<br />
== Available hardware == <!--T:1--><br />
These are the node types containing GPUs currently available on [[Béluga/en|Béluga]], [[Cedar]], [[Graham]] and [[Hélios/en|Hélios]]:<br />
<br />
<!--T:2--><br />
{| class="wikitable"<br />
|-<br />
! # of Nodes !!Node type !! CPU cores !! CPU memory !! # of GPUs !! NVIDIA GPU type !! PCIe bus topology<br />
|-<br />
| 172 || Béluga V100 GPU || 40 || 191000M || 4 || V100-SXM2-16GB || All GPUs associated with the same CPU socket<br />
|-<br />
| 114 || Cedar P100 GPU || 24 || 128000M || 4 || P100-PCIE-12GB || Two GPUs per CPU socket<br />
|-<br />
| 32 || Cedar P100L GPU || 24|| 257000M || 4 || P100-PCIE-16GB || All GPUs associated with the same CPU socket<br />
|-<br />
| 192 || Cedar V100L GPU || 32|| 192000M || 4 || V100-PCIE-32GB || Two GPUs per CPU socket; all GPUs connected via NVLink<br />
|-<br />
| 160 || Graham Base GPU || 32|| 127518M || 2 || P100-PCIE-12GB || One GPU per CPU socket<br />
|-<br />
| 30 || Graham T4 GPU || 44 || 192000M || 4 || Tesla T4 16GB || Two GPUs per CPU socket<br />
|-<br />
| 15 || Hélios K20 || 20 || 110000M || 8 || K20 5GB || Four GPUs per CPU socket<br />
|- <br />
| 6 || Hélios K80 || 24 || 257000M || 16 || K80 12GB || Eight GPUs per CPU socket<br />
|- <br />
| 54 || Niagara IBM AC922 || 32 Power9 || 256GB || 4 || V100-SXM2-32GB || All GPUs connected via NVLink <br />
|}<br />
<br />
== Specifying the type of GPU to use == <!--T:16--><br />
Most clusters have multiple types of GPUs available. You can specify the type of GPU to use by adding a specifier to the <code>--gres=gpu</code> option. The following options are available: <br />
<br />
=== On Cedar === <!--T:17--><br />
You can request a 12G P100 using<br />
<br />
<!--T:18--><br />
#SBATCH --gres=gpu:p100:1<br />
<br />
<!--T:19--><br />
or a 16G P100 using <br />
<br />
<!--T:20--><br />
#SBATCH --gres=gpu:p100l:1<br />
<br />
<!--T:21--><br />
or a 32G V100 using <br />
<br />
<!--T:34--><br />
#SBATCH --gres=gpu:v100l:1<br />
<br />
<!--T:35--><br />
Unless a type is specified, all GPU jobs requesting <= 125G of memory will run on 12G P100s.<br />
<br />
=== On Graham === <!--T:22--><br />
You can request a P100 using<br />
<br />
<!--T:23--><br />
#SBATCH --gres=gpu:p100:1<br />
<br />
<!--T:24--><br />
or a V100 using <br />
<br />
<!--T:25--><br />
#SBATCH --gres=gpu:v100:1<br />
<br />
<!--T:26--><br />
or a T4 using <br />
<br />
<!--T:27--><br />
#SBATCH --gres=gpu:t4:1<br />
<br />
<!--T:28--><br />
Unless specified, all GPU jobs will run on a P100.<br />
<br />
=== On Béluga === <!--T:29--><br />
Béluga has only one type of GPU, so there is no need to specify a type. <br />
<br />
=== On Hélios === <!--T:30--><br />
You can request a K20 using<br />
<br />
<!--T:31--><br />
#SBATCH --gres=gpu:k20:1<br />
<br />
<!--T:32--><br />
or a K80 using <br />
<br />
<!--T:33--><br />
#SBATCH --gres=gpu:k80:1<br />
<br />
== Single-core job == <!--T:3--><br />
If you need only a single CPU core and one GPU:<br />
{{File<br />
|name=gpu_serial_job.sh<br />
|lang="sh"<br />
|contents=<br />
#!/bin/bash<br />
#SBATCH --account=def-someuser<br />
#SBATCH --gres=gpu:1 # Number of GPUs (per node)<br />
#SBATCH --mem=4000M # memory (per node)<br />
#SBATCH --time=0-03:00 # time (DD-HH:MM)<br />
./program # you can use 'nvidia-smi' for a test<br />
}}<br />
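<br />
You can then submit the script with <code>sbatch</code>:<br />
<br />
{{Command|sbatch gpu_serial_job.sh}}<br />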
<br />
== Multi-threaded job == <!--T:4--><br />
For GPU jobs asking for multiple CPUs in a single node:<br />
{{File<br />
|name=gpu_threaded_job.sh<br />
|lang="sh"<br />
|contents=<br />
#!/bin/bash<br />
#SBATCH --account=def-someuser<br />
#SBATCH --gres=gpu:1 # Number of GPU(s) per node<br />
#SBATCH --cpus-per-task=6 # CPU cores/threads<br />
#SBATCH --mem=4000M # memory per node<br />
#SBATCH --time=0-03:00 # time (DD-HH:MM)<br />
export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK<br />
./program<br />
}}<br />
For each GPU requested on:<br />
* Béluga, we recommend no more than 10 CPU cores.<br />
* Cedar, we recommend no more than 6 CPU cores per P100 GPU (p100 and p100l) and no more than 8 CPU cores per V100 GPU (v100l); see the sketch after this list.<br />
* Graham, we recommend no more than 16 CPU cores.<br />
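<br />
For example, a minimal sketch of a Cedar job that follows the V100L recommendation above, requesting one GPU and eight CPU cores (adjust the account, memory and program for your own case):<br />
{{File<br />
|name=gpu_v100l_threaded_job.sh<br />
|lang="sh"<br />
|contents=<br />
#!/bin/bash<br />
#SBATCH --account=def-someuser<br />
#SBATCH --gres=gpu:v100l:1       # One V100L GPU<br />
#SBATCH --cpus-per-task=8        # At most 8 CPU cores per V100L GPU<br />
#SBATCH --mem=16000M             # memory (per node)<br />
#SBATCH --time=0-03:00           # time (DD-HH:MM)<br />
export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK<br />
./program<br />
}}<br />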
<br />
== MPI job == <!--T:5--><br />
{{File<br />
|name=gpu_mpi_job.sh<br />
|lang="sh"<br />
|contents=<br />
#!/bin/bash<br />
#SBATCH --account=def-someuser<br />
#SBATCH --gres=gpu:4 # Number of GPUs per node<br />
#SBATCH --nodes=2 # Number of nodes<br />
#SBATCH --ntasks=48 # Number of MPI processes<br />
#SBATCH --cpus-per-task=1 # CPU cores per MPI process<br />
#SBATCH --mem=120G # memory per node<br />
#SBATCH --time=0-03:00 # time (DD-HH:MM)<br />
export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK<br />
srun ./program<br />
}}<br />
<br />
== Whole nodes == <!--T:6--><br />
If your application can efficiently use an entire node and its associated GPUs, you will probably experience shorter wait times if you ask Slurm for a whole node. Use one of the following job scripts as a template. <br />
<br />
=== Requesting a GPU node on Graham === <!--T:7--><br />
{{File<br />
|name=graham_gpu_node_job.sh<br />
|lang="sh"<br />
|contents=<br />
#!/bin/bash<br />
#SBATCH --nodes=1<br />
#SBATCH --gres=gpu:2<br />
#SBATCH --ntasks-per-node=32<br />
#SBATCH --mem=127000M<br />
#SBATCH --time=3:00<br />
#SBATCH --account=def-someuser<br />
nvidia-smi<br />
}}<br />
<br />
=== Requesting a P100 GPU node on Cedar === <!--T:8--><br />
{{File<br />
|name=cedar_gpu_node_job.sh<br />
|lang="sh"<br />
|contents=<br />
#!/bin/bash<br />
#SBATCH --nodes=1<br />
#SBATCH --gres=gpu:p100:4<br />
#SBATCH --ntasks-per-node=24<br />
#SBATCH --exclusive<br />
#SBATCH --mem=125G<br />
#SBATCH --time=3:00<br />
#SBATCH --account=def-someuser<br />
nvidia-smi<br />
}}<br />
<br />
=== Requesting a P100-16G GPU node on Cedar === <!--T:9--><br />
<br />
<!--T:10--><br />
There is a special group of GPU nodes on [[Cedar]] which have four Tesla P100 16GB cards each. (Other P100 GPUs in the cluster have 12GB and the V100 GPUs have 32GB.) The GPUs in a P100L node all use the same PCI switch, so inter-GPU communication latency is lower, but bandwidth between CPU and GPU is lower than on the regular GPU nodes. These nodes also have 256GB of RAM. They may only be requested as whole nodes, so you must specify <code>--gres=gpu:p100l:4</code>. P100L GPU jobs on Cedar may run for up to 28 days.<br />
<br />
<!--T:11--><br />
{{File<br />
|name=p100l_gpu_job.sh<br />
|lang="sh"<br />
|contents=<br />
#!/bin/bash<br />
#SBATCH --nodes=1 <br />
#SBATCH --gres=gpu:p100l:4 <br />
#SBATCH --ntasks=1<br />
#SBATCH --cpus-per-task=24 # There are 24 CPU cores on P100 Cedar GPU nodes<br />
#SBATCH --mem=0 # Request the full memory of the node<br />
#SBATCH --time=3:00<br />
#SBATCH --account=def-someuser<br />
hostname<br />
nvidia-smi<br />
}}<br />
<br />
===Packing single-GPU jobs within one SLURM job=== <!--T:12--><br />
<br />
<!--T:13--><br />
If you need to run four single-GPU programs or two 2-GPU programs for longer than 24 hours, [[GNU Parallel]] is recommended. A simple example is given below:<br />
<pre><br />
cat params.input | parallel -j4 'CUDA_VISIBLE_DEVICES=$(({%} - 1)) python {} &> {#}.out'<br />
</pre><br />
In this example, the GPU ID assigned to each task is calculated by subtracting 1 from the GNU Parallel slot number {%}, so the four slots map to GPUs 0 through 3. {#} is the sequential job number, starting from 1, and is used here to name the output files.<br />
<br />
<!--T:14--><br />
A params.input file should include input parameters in each line, like this:<br />
<pre><br />
code1.py<br />
code2.py<br />
code3.py<br />
code4.py<br />
...<br />
</pre><br />
With this method, users can run multiple tasks in one submission. The <code>-j4</code> parameter means GNU Parallel can run at most four concurrent tasks, launching another as soon as one finishes. CUDA_VISIBLE_DEVICES ensures that two tasks do not try to use the same GPU at the same time.<br />
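<br />
A minimal sketch of a complete submission script built around this command, assuming <tt>params.input</tt> and the Python scripts are in the submission directory and that a node type with four GPUs is available:<br />
<pre><br />
#!/bin/bash<br />
#SBATCH --account=def-someuser<br />
#SBATCH --gres=gpu:4           # Four GPUs on one node, one per task<br />
#SBATCH --cpus-per-task=4      # One CPU core per concurrent task<br />
#SBATCH --mem=16000M<br />
#SBATCH --time=1-00:00<br />
cat params.input | parallel -j4 'CUDA_VISIBLE_DEVICES=$(({%} - 1)) python {} &> {#}.out'<br />
</pre><br />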
<br />
<!--T:36--><br />
[[Category:SLURM]]<br />
</translate></div>Kaizaadhttps://docs.alliancecan.ca/mediawiki/index.php?title=Graham&diff=78682Graham2019-12-20T19:16:49Z<p>Kaizaad: /* Choosing a node type */</p>
<hr />
<div><noinclude><br />
<languages /><br />
<br />
<translate><br />
<!--T:27--><br />
</noinclude><br />
{| class="wikitable"<br />
|-<br />
| Availability: In production since June 2017<br />
|-<br />
| Login node: '''graham.computecanada.ca'''<br />
|-<br />
| Globus endpoint: '''computecanada#graham-dtn'''<br />
|-<br />
| Data mover node (rsync, scp, sftp,...): '''gra-dtn1.computecanada.ca'''<br />
|}<br />
<br />
<!--T:2--><br />
Graham is a heterogeneous cluster, suitable for a variety of workloads, and located at the University of Waterloo. It is named after [https://en.wikipedia.org/wiki/Wes_Graham Wes Graham], the first director of the Computing Centre at Waterloo.<br />
<br />
<!--T:4--><br />
The parallel filesystem and external persistent storage ([[National Data Cyberinfrastructure|NDC-Waterloo]]) are similar to [[Cedar|Cedar's]]. The interconnect is different and there is a slightly different mix of compute nodes.<br />
<br />
<!--T:28--><br />
The Graham system is sold and supported by Huawei Canada, Inc. It is entirely liquid cooled, using rear-door heat exchangers.<br />
<br />
<!--T:33--><br />
[[Getting started with the new national systems|Getting started with Graham]]<br />
<br />
<!--T:36--><br />
[[Running_jobs|How to run jobs]]<br />
<br />
<!--T:37--><br />
[[Transferring_data|Transferring data]]<br />
<br />
= Site-specific policies = <!--T:39--><br />
<br />
<!--T:40--><br />
By policy, Graham's compute nodes cannot access the internet. If you need an exception to this rule, <br />
contact [[Technical Support|technical support]] with the following information:<br />
<br />
<!--T:42--><br />
<pre><br />
IP: <br />
Port/s: <br />
Protocol: TCP or UDP<br />
Contact: <br />
Removal Date: <br />
</pre><br />
<br />
<!--T:43--><br />
We will follow up with the contact before removing the exception, to confirm whether it is still required.<br />
<br />
<!--T:41--><br />
Crontab is not offered on Graham.<br />
<br />
=Attached storage systems= <!--T:23--><br />
<br />
<!--T:24--><br />
{| class="wikitable sortable"<br />
|-<br />
| '''Home space''' ||<br />
* Location of home directories.<br />
* Each home directory has a small, fixed [[Storage and file management#Filesystem_quotas_and_policies|quota]]. <br />
* Not allocated via [https://www.computecanada.ca/research-portal/accessing-resources/rapid-access-service/ RAS] or [https://www.computecanada.ca/research-portal/accessing-resources/resource-allocation-competitions/ RAC]. Larger requests go to Project space.<br />
* Has daily backup.<br />
|-<br />
| '''Scratch space'''<br />3.6PB total volume<br />Parallel high-performance filesystem ||<br />
* For active or temporary (<code>/scratch</code>) storage.<br />
* Not allocated.<br />
* Large fixed [[Storage and file management#Filesystem_quotas_and_policies|quota]] per user.<br />
* Inactive data will be purged.<br />
|-<br />
|'''Project space'''<br />External persistent storage<br />
||<br />
* Part of the [[National Data Cyberinfrastructure]].<br />
* Allocated via [https://www.computecanada.ca/research-portal/accessing-resources/rapid-access-service/ RAS] or [https://www.computecanada.ca/research-portal/accessing-resources/resource-allocation-competitions/ RAC].<br />
* Not designed for parallel I/O workloads. Use Scratch space instead.<br />
* Large adjustable [[Storage and file management#Filesystem_quotas_and_policies|quota]] per project.<br />
* Has daily backup.<br />
|}<br />
<br />
=High-performance interconnect= <!--T:19--><br />
<br />
<!--T:21--><br />
Mellanox FDR (56Gb/s) and EDR (100Gb/s) InfiniBand interconnect. FDR is used for GPU and cloud nodes, EDR for other node types. A central 324-port director switch aggregates connections from islands of 1024 cores each for CPU and GPU nodes. The 56 cloud nodes are a variation on CPU nodes, and are on a single larger island sharing 8 FDR uplinks to the director switch.<br />
<br />
<!--T:29--><br />
A low-latency, high-bandwidth InfiniBand fabric connects all nodes and scratch storage.<br />
<br />
<!--T:30--><br />
Nodes configurable for cloud provisioning also have a 10Gb/s Ethernet network, with 40Gb/s uplinks to scratch storage.<br />
<br />
<!--T:22--><br />
Graham is designed to support multiple simultaneous parallel jobs of up to 1024 cores each in a fully non-blocking manner. <br />
<br />
<!--T:31--><br />
For larger jobs the interconnect has an 8:1 blocking factor; i.e., even for jobs running on multiple islands, Graham provides a high-performance interconnect.<br />
<br />
<!--T:32--><br />
[https://docs.computecanada.ca/mediawiki/images/b/b3/Gp3-network-topo.png Graham high performance interconnect diagram]<br />
<br />
=Visualization on Graham= <!--T:44--><br />
<br />
<!--T:45--><br />
Graham has dedicated visualization nodes available at '''gra-vdi.computecanada.ca''' that allow only VNC connections. For instructions on how to use them, see the [[VNC]] page.<br />
<br />
=Node characteristics= <!--T:5--><br />
A total of 41,548 cores and 520 GPU devices, spread across 1,185 nodes of different types.<br />
<br />
<!--T:55--><br />
{| class="wikitable sortable"<br />
! nodes !! cores !! available memory !! CPU !! storage !! GPU<br />
|-<br />
| 903 || 32 || 125G or 128000M || 2 x Intel E5-2683 v4 Broadwell @ 2.1GHz || 960GB SATA SSD || -<br />
|-<br />
| 24 || 32 || 502G or 514500M || 2 x Intel E5-2683 v4 Broadwell @ 2.1GHz || 960GB SATA SSD || -<br />
|-<br />
| 56 || 32 || 250G or 256500M || 2 x Intel E5-2683 v4 Broadwell @ 2.1GHz || 960GB SATA SSD || -<br />
|-<br />
| 3 || 64 || 3022G or 3095000M || 4 x Intel E7-4850 v4 Broadwell @ 2.1GHz || 960GB SATA SSD || -<br />
|-<br />
| 160 || 32 || 124G or 127518M || 2 x Intel E5-2683 v4 Broadwell @ 2.1GHz || 1.6TB NVMe SSD || 2 x NVIDIA P100 Pascal (12GB HBM2 memory)<br />
|-<br />
| 7 || 28 || 178G or 183105M || 2 x Intel Xeon Gold 5120 Skylake @ 2.2GHz || 4.0TB NVMe SSD || 8 x NVIDIA V100 Volta (16GB HBM2 memory)<br />
|-<br />
| 6 || 16 || 192G or 196608M || 2 x Intel Xeon Silver 4110 Skylake @ 2.10GHz || 11.0TB SATA SSD || 4 x NVIDIA T4 Turing (16GB GDDR6 memory)<br />
|-<br />
| 30 || 44 || 192G or 196608M || 2 x Intel Xeon Gold 6238 Cascade Lake @ 2.10GHz || 5.8TB NVMe SSD || 4 x NVIDIA T4 Turing (16GB GDDR6 memory)<br />
|-<br />
| 72 || 44 || 192G or 196608M || 2 x Intel Xeon Gold 6238 Cascade Lake @ 2.10GHz || 879GB SATA SSD || -<br />
|}<br />
<br />
=== Choosing a node type === <br />
Most applications will run on either Broadwell, Skylake, or Cascade Lake nodes, and performance differences are expected to be small compared to job waiting times. We therefore recommend that you do not select a specific node type for your jobs. If it is necessary, note that for CPU jobs only two constraints are available: use either <code>--constraint=broadwell</code> or <code>--constraint=cascade</code>. See [[Running_jobs#Specifying_a_CPU_architecture|Specifying a CPU architecture]].<br />
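<br />
For example, a minimal sketch of a job restricted to the Broadwell nodes:<br />
<pre><br />
#!/bin/bash<br />
#SBATCH --account=def-someuser<br />
#SBATCH --constraint=broadwell   # run only on Broadwell nodes<br />
#SBATCH --time=0-01:00<br />
./program<br />
</pre><br />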
<br />
<!--T:7--><br />
Best practice for local on-node storage is to use the temporary directory generated by [[Running jobs|Slurm]], <tt>$SLURM_TMPDIR</tt>. Note that this directory and its contents will disappear upon job completion.<br />
<br />
<!--T:38--><br />
Note that the amount of available memory is less than the "round number" suggested by the hardware configuration. For instance, "base" nodes do have 128 GiB of RAM, but some of it is permanently occupied by the kernel and OS. To avoid wasting time on swapping/paging, the scheduler will never allocate a job whose memory requirements exceed the specified amount of "available" memory. Please also note that the memory allocated to the job must be sufficient for the IO buffering performed by the kernel and filesystem; this means that an IO-intensive job will often benefit from requesting somewhat more memory than the aggregate size of its processes.<br />
<br />
= GPUs on Graham = <!--T:56--><br />
Graham contains Tesla GPUs from three different generations, listed here in order of age, from oldest to newest.<br />
* P100 Pascal GPUs<br />
* V100 Volta GPUs<br />
* T4 Turing GPUs<br />
<br />
<!--T:57--><br />
The P100 is NVIDIA's all-purpose high-performance card. The V100 is its successor, offering about twice the performance for standard computation and about eight times the performance for deep learning computations that can use its tensor core units. The T4 is the newest card, targeted specifically at deep learning workloads; it does not support efficient double-precision computation, but offers good single-precision performance, tensor cores, and support for reduced-precision integer calculations.<br />
<br />
== Pascal GPU nodes on Graham == <!--T:58--><br />
<br />
<!--T:59--><br />
These are Graham's default GPU cards. Job submission for these cards is described on the [[Using GPUs with Slurm]] page. When a job simply requests a GPU with --gres=gpu:1 or --gres=gpu:2, it will be assigned Pascal P100 cards. As all Pascal nodes have only 2 P100 GPUs, configuring jobs for these cards is relatively simple.<br />
<br />
==Volta GPU nodes on Graham== <!--T:46--><br />
In the first quarter of 2019, new Volta GPU nodes were added, as described in the table above.<br />
Four GPUs are connected to each CPU socket (except for one node, which is only populated with 6 GPUs, three per socket).<br />
<br />
<!--T:50--><br />
These nodes are available to all users, with a job runtime limit of 24 hours. Higher-priority access with longer job runtimes can be granted to Ontario researchers by request. <br />
<br />
<!--T:51--><br />
Below are example job scripts for these nodes (each node has 8 GPUs). The module load command ensures that modules compiled for the Skylake architecture are used. Replace nvidia-smi with the command you want to run.<br />
<br />
<!--T:52--><br />
'''Important''': You should scale the number of CPUs requested, keeping the ratio of CPUs to GPUs at 3.5 or less. For example, if you want to run a job using 4 GPUs, you should request at most 14 CPU cores; for a job with 1 GPU, at most 3 CPU cores. Users are allowed to run a few short test jobs (shorter than 1 hour) that break this rule to see how their code performs.<br />
<br />
<!--T:53--><br />
Single-GPU example for default users:<br />
{{File<br />
|name=gpu_single_GPU_job.sh<br />
|lang="sh"<br />
|contents=<br />
#!/bin/bash<br />
#SBATCH --account=def-someuser<br />
#SBATCH --gres=gpu:v100:1<br />
#SBATCH --cpus-per-task=3<br />
#SBATCH --mem=12G<br />
#SBATCH --time=1-00:00<br />
module load arch/avx512 StdEnv/2018.3<br />
nvidia-smi<br />
}}<br />
Full-node example for default users (requesting all 8 GPUs and 28 CPU cores of a Volta node):<br />
{{File<br />
|name=gpu_single_node_job.sh<br />
|lang="sh"<br />
|contents=<br />
#!/bin/bash<br />
#SBATCH --account=def-someuser<br />
#SBATCH --nodes=1<br />
#SBATCH --gres=gpu:v100:8<br />
#SBATCH --cpus-per-task=28<br />
#SBATCH --mem=0               # Request the full memory of the node<br />
#SBATCH --time=1-00:00<br />
module load arch/avx512 StdEnv/2018.3<br />
nvidia-smi<br />
}}<br />
Full-node example for Ontario users who have been granted higher-priority access:<br />
{{File<br />
|name=gpu_single_node_job.sh<br />
|lang="sh"<br />
|contents=<br />
#!/bin/bash<br />
#SBATCH --account=ctb-ontario<br />
#SBATCH --partition=c-ontario<br />
#SBATCH --nodes=1<br />
#SBATCH --gres=gpu:v100:8<br />
#SBATCH --cpus-per-task=28<br />
#SBATCH --mem=0               # Request the full memory of the node<br />
#SBATCH --time=1-00:00<br />
module load arch/avx512 StdEnv/2018.3<br />
nvidia-smi<br />
}}<br />
<br />
<!--T:54--><br />
The Volta nodes have a fast local disk, which should be used if your job performs a significant amount of I/O. Inside the job, the environment variable $SLURM_TMPDIR gives the location of a temporary directory on this fast local disk. You can copy your input files there at the start of your job script, before you run your program, and copy your output files out at the end. All files in $SLURM_TMPDIR are removed once the job ends, so you do not have to clean up that directory yourself. You can even create Python virtual environments in this temporary space for greater efficiency; see [[Python#Creating_virtual_environments_inside_of_your_jobs]] for how to do that.<br />
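<br />
A minimal sketch of this pattern, where <tt>input</tt> and <tt>results</tt> are placeholder names for your data directories:<br />
<pre><br />
# Copy input data to the fast local disk<br />
cp -r ~/projects/def-someuser/$USER/input $SLURM_TMPDIR/<br />
cd $SLURM_TMPDIR<br />
# Run the program from the submission directory; temporary files stay on local disk<br />
$SLURM_SUBMIT_DIR/program input<br />
# Copy results back before the job ends; $SLURM_TMPDIR is deleted afterwards<br />
cp -r $SLURM_TMPDIR/results ~/projects/def-someuser/$USER/<br />
</pre><br />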
<br />
==Turing GPU nodes on Graham== <!--T:60--><br />
<br />
<!--T:61--><br />
The usage of these nodes is similar to using the Volta nodes, except when requesting them, you should specify: <br />
<br />
<!--T:62--><br />
--gres=gpu:t4:2<br />
<br />
<!--T:63--><br />
In this example 2 T4 cards per node are requested.<br />
<br />
<br />
<br />
<!--T:14--><br />
<noinclude><br />
</translate><br />
</noinclude></div>Kaizaadhttps://docs.alliancecan.ca/mediawiki/index.php?title=Graham&diff=78681Graham2019-12-20T19:16:20Z<p>Kaizaad: </p>
<hr />
<div><noinclude><br />
<languages /><br />
<br />
<translate><br />
<!--T:27--><br />
</noinclude><br />
{| class="wikitable"<br />
|-<br />
| Availability: In production since June 2017F<br />
|-<br />
| Login node: '''graham.computecanada.ca'''<br />
|-<br />
| Globus endpoint: '''computecanada#graham-dtn'''<br />
|-<br />
| Data mover node (rsync, scp, sftp,...): '''gra-dtn1.computecanada.ca'''<br />
|}<br />
<br />
<!--T:2--><br />
Graham is a heterogeneous cluster, suitable for a variety of workloads, and located at the University of Waterloo. It is named after [https://en.wikipedia.org/wiki/Wes_Graham Wes Graham], the first director of the Computing Centre at Waterloo.<br />
<br />
<!--T:4--><br />
The parallel filesystem and external persistent storage ([[National Data Cyberinfrastructure|NDC-Waterloo]]) are similar to [[Cedar|Cedar's]]. The interconnect is different and there is a slightly different mix of compute nodes.<br />
<br />
<!--T:28--><br />
The Graham system is sold and supported by Huawei Canada, Inc. It is entirely liquid cooled, using rear-door heat exchangers.<br />
<br />
<!--T:33--><br />
[[Getting started with the new national systems|Getting started with Graham]]<br />
<br />
<!--T:36--><br />
[[Running_jobs|How to run jobs]]<br />
<br />
<!--T:37--><br />
[[Transferring_data|Transfering data]]<br />
<br />
= Site-specific policies = <!--T:39--><br />
<br />
<!--T:40--><br />
By policy, Graham's compute nodes cannot access the internet. If you need an exception to this rule, <br />
contact [[Technical Support|technical support]] with the following information:<br />
<br />
<!--T:42--><br />
<pre><br />
IP: <br />
Port/s: <br />
Protocol: TCP or UDP<br />
Contact: <br />
Removal Date: <br />
</pre><br />
<br />
<!--T:43--><br />
We will follow up with the contact before removing to confirm if this rule is still required.<br />
<br />
<!--T:41--><br />
Crontab is not offered on Graham.<br />
<br />
=Attached storage systems= <!--T:23--><br />
<br />
<!--T:24--><br />
{| class="wikitable sortable"<br />
|-<br />
| '''Home space''' ||<br />
* Location of home directories.<br />
* Each home directory has a small, fixed [[Storage and file management#Filesystem_quotas_and_policies|quota]]. <br />
* Not allocated via [https://www.computecanada.ca/research-portal/accessing-resources/rapid-access-service/ RAS] or [https://www.computecanada.ca/research-portal/accessing-resources/resource-allocation-competitions/ RAC]. Larger requests go to Project space.<br />
* Has daily backup.<br />
|-<br />
| '''Scratch space'''<br />3.6PB total volume<br />Parallel high-performance filesystem ||<br />
* For active or temporary (<code>/scratch</code>) storage.<br />
* Not allocated.<br />
* Large fixed [[Storage and file management#Filesystem_quotas_and_policies|quota]] per user.<br />
* Inactive data will be purged.<br />
|-<br />
|'''Project space'''<br />External persistent storage<br />
||<br />
* Part of the [[National Data Cyberinfrastructure]].<br />
* Allocated via [https://www.computecanada.ca/research-portal/accessing-resources/rapid-access-service/ RAS] or [https://www.computecanada.ca/research-portal/accessing-resources/resource-allocation-competitions/ RAC].<br />
* Not designed for parallel I/O workloads. Use Scratch space instead.<br />
* Large adjustable [[Storage and file management#Filesystem_quotas_and_policies|quota]] per project.<br />
* Has daily backup.<br />
|}<br />
<br />
=High-performance interconnect= <!--T:19--><br />
<br />
<!--T:21--><br />
Mellanox FDR (56Gb/s) and EDR (100Gb/s) InfiniBand interconnect. FDR is used for GPU and cloud nodes, EDR for other node types. A central 324-port director switch aggregates connections from islands of 1024 cores each for CPU and GPU nodes. The 56 cloud nodes are a variation on CPU nodes, and are on a single larger island sharing 8 FDR uplinks to the director switch.<br />
<br />
<!--T:29--><br />
A low-latency high-bandwidth Infiniband fabric connects all nodes and scratch storage.<br />
<br />
<!--T:30--><br />
Nodes configurable for cloud provisioning also have a 10Gb/s Ethernet network, with 40Gb/s uplinks to scratch storage.<br />
<br />
<!--T:22--><br />
The design of Graham is to support multiple simultaneous parallel jobs of up to 1024 cores in a fully non-blocking manner. <br />
<br />
<!--T:31--><br />
For larger jobs the interconnect has a 8:1 blocking factor, i.e., even for jobs running on multiple islands the Graham system provides a high-performance interconnect.<br />
<br />
<!--T:32--><br />
[https://docs.computecanada.ca/mediawiki/images/b/b3/Gp3-network-topo.png Graham high performance interconnect diagram]<br />
<br />
=Visualization on Graham= <!--T:44--><br />
<br />
<!--T:45--><br />
Graham has dedicated visualization nodes available at '''gra-vdi.computecanada.ca''' that allow only VNC connections. For instructions on how to use them, see the [[VNC]] page.<br />
<br />
=Node characteristics= <!--T:5--><br />
A total of 41,548 cores and 520 GPU devices, spread across 1,185 nodes of different types.<br />
<br />
<!--T:55--><br />
{| class="wikitable sortable"<br />
! nodes !! cores !! available memory !! CPU !! storage !! GPU<br />
|-<br />
| 903 || 32 || 125G or 128000M || 2 x Intel E5-2683 v4 Broadwell @ 2.1GHz || 960GB SATA SSD || -<br />
|-<br />
| 24 || 32 || 502G or 514500M || 2 x Intel E5-2683 v4 Broadwell @ 2.1GHz || 960GB SATA SSD || -<br />
|-<br />
| 56 || 32 || 250G or 256500M || 2 x Intel E5-2683 v4 Broadwell @ 2.1GHz || 960GB SATA SSD || -<br />
|-<br />
| 3 || 64 || 3022G or 3095000M || 4 x Intel E7-4850 v4 Broadwell @ 2.1GHz || 960GB SATA SSD || -<br />
|-<br />
| 160 || 32 || 124G or 127518M || 2 x Intel E5-2683 v4 Broadwell @ 2.1GHz || 1.6TB NVMe SSD || 2 x NVIDIA P100 Pascal (12GB HBM2 memory)<br />
|-<br />
| 7 || 28 || 178G or 183105M || 2 x Intel Xeon Gold 5120 Skylake @ 2.2GHz || 4.0TB NVMe SSD || 8 x NVIDIA V100 Volta (16GB HBM2 memory)<br />
|-<br />
| 6 || 16 || 192G or 196608M || 2 x Intel Xeon Silver 4110 Skylake @ 2.10GHz || 11.0TB SATA SSD || 4 x NVIDIA T4 Turing (16GB GDDR6 memory)<br />
|-<br />
| 30 || 44 || 192G or 196608M || 2 x Intel Xeon Gold 6238 Cascade Lake @ 2.10GHz || 5.8TB NVMe SSD || 4 x NVIDIA T4 Turing (16GB GDDR6 memory)<br />
|-<br />
| 72 || 44 || 192G or 196608M || 2 x Intel Xeon Gold 6238 Cascade Lake @ 2.10GHz || 879GB SATA SSD || -<br />
|}<br />
<br />
=== Choosing a node type === <br />
Most applications will run on either Broadwell, Skylake, or Cascade Lake nodes, and performance differences are expected to be small compared to job waiting times. Therefore we recommend that you do not select a specific node type for your jobs. If it is necessary, for CPU jobs there are only two constraints available, use either <code>--constraint=broadwell</code> or <code>--constraint=cascade</code>. See [[Running_jobs#Specifying_a_CPU_architecture|Specifying a CPU architecture]].<br />
<br />
<!--T:7--><br />
Best practice for local on-node storage is to use the temporary directory generated by [[Running jobs|Slurm]], <tt>$SLURM_TMPDIR</tt>. Note that this directory and its contents will disappear upon job completion.<br />
<br />
<!--T:38--><br />
Note that the amount of available memory is less than the "round number" suggested by hardware configuration. For instance, "base" nodes do have 128 GiB of RAM, but some of it is permanently occupied by the kernel and OS. To avoid wasting time by swapping/paging, the scheduler will never allocate jobs whose memory requirements exceed the specified amount of "available" memory. Please also note that the memory allocated to the job must be sufficient for IO buffering performed by the kernel and filesystem - this means that an IO-intensive job will often benefit from requesting somewhat more memory than the aggregate size of processes.<br />
<br />
= GPUs on Graham = <!--T:56--><br />
Graham contains Tesla GPUs from three different generations, listed here in order of age, from oldest to newest.<br />
* P100 Pascal GPUs<br />
* V100 Volta GPUs<br />
* T4 Turing GPUs<br />
<br />
<!--T:57--><br />
P100 is NVIDIA's all-purpose high performance card. V100 is its successor, with about double the performance for standard computation, and about 8X performance for deep learning computations which can utilize its tensor core computation units. T4 Turing is the latest card targeted specifically at deep learning workloads - it does not support efficient double precision computations, but it has good performance for single precision, and it also has tensor cores, plus support for reduced precision integer calculations.<br />
<br />
== Pascal GPU nodes on Graham == <!--T:58--><br />
<br />
<!--T:59--><br />
These are Graham's default GPU cards. Job submission for these cards is described on page: [[Using GPUs with Slurm]]. When a job simply request a GPU with --gres=gpu:1 or --gres=gpu:2, it will be assigned Pascal P100 cards. As all Pascal nodes have only 2 P100 GPUs, configuring jobs using these cards is relatively simple.<br />
<br />
==Volta GPU nodes on Graham== <!--T:46--><br />
In the first quarter of 2019, new Volta GPU nodes were added, as described in the table above.<br />
Four GPUs are connected to each CPU socket (except for one node, which is only populated with 6 GPUs, three per socket).<br />
<br />
<!--T:50--><br />
The nodes are available to all users with a 24 hour job runtime limit. Higher priority access with longer job runtimes can be granted to Ontario researchers by request. <br />
<br />
<!--T:51--><br />
Following is an example job script to submit a job to one of the nodes (with 8 GPUs). The module load command will ensure that modules compiled for Skylake architecture will be used. Replace nvidia-smi with the command you want to run.<br />
<br />
<!--T:52--><br />
'''Important''': You should scale the number of CPUs requested, keeping the ratio of CPUs to GPUs at 3.5 or less. For example, if you want to run a job using 4 GPUs, you should request at most 14 CPU cores. For a job with 1 GPU, you should request at most 3 CPU cores. Users are allowed to run a few short test jobs (shorter than 1 hour) that break this rule to see how your code performs.<br />
<br />
<!--T:53--><br />
Single-GPU example for default users:<br />
{{File<br />
|name=gpu_single_GPU_job.sh<br />
|lang="sh"<br />
|contents=<br />
#!/bin/bash<br />
#SBATCH --account=def-someuser<br />
#SBATCH --gres=gpu:v100:1<br />
#SBATCH --cpus-per-task=3<br />
#SBATCH --mem=12G<br />
#SBATCH --time=1-00:00<br />
module load arch/avx512 StdEnv/2018.3<br />
nvidia-smi<br />
}}<br />
Full-node example for default users:<br />
{{File<br />
|name=gpu_single_node_job.sh<br />
|lang="sh"<br />
|contents=<br />
#!/bin/bash<br />
#SBATCH --account=def-someuser<br />
#SBATCH --nodes=1<br />
#SBATCH --gres=gpu:v100:1<br />
#SBATCH --cpus-per-task=3<br />
#SBATCH --mem=12G<br />
#SBATCH --time=1-00:00<br />
module load arch/avx512 StdEnv/2018.3<br />
nvidia-smi<br />
}}<br />
Full-node example for Ontario users who have been granted higher priority access:<br />
{{File<br />
|name=gpu_single_node_job.sh<br />
|lang="sh"<br />
|contents=<br />
#!/bin/bash<br />
#SBATCH --account=ctb-ontario<br />
#SBATCH --partition=c-ontario<br />
#SBATCH --nodes=1<br />
#SBATCH --gres=gpu:v100:1<br />
#SBATCH --cpus-per-task=3<br />
#SBATCH --mem=12G<br />
#SBATCH --time=1-00:00<br />
module load arch/avx512 StdEnv/2018.3<br />
nvidia-smi<br />
}}<br />
<br />
<!--T:54--><br />
The Volta nodes have a fast local disk, which should be used for jobs if the amount of I/O performed by your job is significant. Inside the job, the location of the temporary directory on fast local disk is specified by the environment variable $SLURM_TMPDIR. You can copy your input files there at the start of your job script before you run your program and your output files out at the end of your job script. All the files in $SLURM_TMPDIR will be removed once the job ends, so you do not have to clean up that directory yourself. You can even create Python virtual environments in this temporary space for greater efficiency. Please see [[Python#Creating_virtual_environments_inside_of_your_jobs]] for information on how to that.<br />
<br />
==Turing GPU nodes on Graham== <!--T:60--><br />
<br />
<!--T:61--><br />
The usage of these nodes is similar to using the Volta nodes, except when requesting them, you should specify: <br />
<br />
<!--T:62--><br />
--gres=gpu:t4:2<br />
<br />
<!--T:63--><br />
In this example 2 T4 cards per node are requested.<br />
<br />
<br />
<br />
<!--T:14--><br />
<noinclude><br />
</translate><br />
</noinclude></div>Kaizaadhttps://docs.alliancecan.ca/mediawiki/index.php?title=Graham&diff=78680Graham2019-12-20T19:09:07Z<p>Kaizaad: </p>
<hr />
<div><noinclude><br />
<languages /><br />
<br />
<translate><br />
<!--T:27--><br />
</noinclude><br />
{| class="wikitable"<br />
|-<br />
| Availability: In production since June 2017F<br />
|-<br />
| Login node: '''graham.computecanada.ca'''<br />
|-<br />
| Globus endpoint: '''computecanada#graham-dtn'''<br />
|-<br />
| Data mover node (rsync, scp, sftp,...): '''gra-dtn1.computecanada.ca'''<br />
|}<br />
<br />
<!--T:2--><br />
Graham is a heterogeneous cluster, suitable for a variety of workloads, and located at the University of Waterloo. It is named after [https://en.wikipedia.org/wiki/Wes_Graham Wes Graham], the first director of the Computing Centre at Waterloo.<br />
<br />
<!--T:4--><br />
The parallel filesystem and external persistent storage ([[National Data Cyberinfrastructure|NDC-Waterloo]]) are similar to [[Cedar|Cedar's]]. The interconnect is different and there is a slightly different mix of compute nodes.<br />
<br />
<!--T:28--><br />
The Graham system is sold and supported by Huawei Canada, Inc. It is entirely liquid cooled, using rear-door heat exchangers.<br />
<br />
<!--T:33--><br />
[[Getting started with the new national systems|Getting started with Graham]]<br />
<br />
<!--T:36--><br />
[[Running_jobs|How to run jobs]]<br />
<br />
<!--T:37--><br />
[[Transferring_data|Transfering data]]<br />
<br />
= Site-specific policies = <!--T:39--><br />
<br />
<!--T:40--><br />
By policy, Graham's compute nodes cannot access the internet. If you need an exception to this rule, <br />
contact [[Technical Support|technical support]] with the following information:<br />
<br />
<!--T:42--><br />
<pre><br />
IP: <br />
Port/s: <br />
Protocol: TCP or UDP<br />
Contact: <br />
Removal Date: <br />
</pre><br />
<br />
<!--T:43--><br />
We will follow up with the contact before removing to confirm if this rule is still required.<br />
<br />
<!--T:41--><br />
Crontab is not offered on Graham.<br />
<br />
=Attached storage systems= <!--T:23--><br />
<br />
<!--T:24--><br />
{| class="wikitable sortable"<br />
|-<br />
| '''Home space''' ||<br />
* Location of home directories.<br />
* Each home directory has a small, fixed [[Storage and file management#Filesystem_quotas_and_policies|quota]]. <br />
* Not allocated via [https://www.computecanada.ca/research-portal/accessing-resources/rapid-access-service/ RAS] or [https://www.computecanada.ca/research-portal/accessing-resources/resource-allocation-competitions/ RAC]. Larger requests go to Project space.<br />
* Has daily backup.<br />
|-<br />
| '''Scratch space'''<br />3.6PB total volume<br />Parallel high-performance filesystem ||<br />
* For active or temporary (<code>/scratch</code>) storage.<br />
* Not allocated.<br />
* Large fixed [[Storage and file management#Filesystem_quotas_and_policies|quota]] per user.<br />
* Inactive data will be purged.<br />
|-<br />
|'''Project space'''<br />External persistent storage<br />
||<br />
* Part of the [[National Data Cyberinfrastructure]].<br />
* Allocated via [https://www.computecanada.ca/research-portal/accessing-resources/rapid-access-service/ RAS] or [https://www.computecanada.ca/research-portal/accessing-resources/resource-allocation-competitions/ RAC].<br />
* Not designed for parallel I/O workloads. Use Scratch space instead.<br />
* Large adjustable [[Storage and file management#Filesystem_quotas_and_policies|quota]] per project.<br />
* Has daily backup.<br />
|}<br />
<br />
=High-performance interconnect= <!--T:19--><br />
<br />
<!--T:21--><br />
Mellanox FDR (56Gb/s) and EDR (100Gb/s) InfiniBand interconnect. FDR is used for GPU and cloud nodes, EDR for other node types. A central 324-port director switch aggregates connections from islands of 1024 cores each for CPU and GPU nodes. The 56 cloud nodes are a variation on CPU nodes, and are on a single larger island sharing 8 FDR uplinks to the director switch.<br />
<br />
<!--T:29--><br />
A low-latency high-bandwidth Infiniband fabric connects all nodes and scratch storage.<br />
<br />
<!--T:30--><br />
Nodes configurable for cloud provisioning also have a 10Gb/s Ethernet network, with 40Gb/s uplinks to scratch storage.<br />
<br />
<!--T:22--><br />
The design of Graham is to support multiple simultaneous parallel jobs of up to 1024 cores in a fully non-blocking manner. <br />
<br />
<!--T:31--><br />
For larger jobs the interconnect has a 8:1 blocking factor, i.e., even for jobs running on multiple islands the Graham system provides a high-performance interconnect.<br />
<br />
<!--T:32--><br />
[https://docs.computecanada.ca/mediawiki/images/b/b3/Gp3-network-topo.png Graham high performance interconnect diagram]<br />
<br />
=Visualization on Graham= <!--T:44--><br />
<br />
<!--T:45--><br />
Graham has dedicated visualization nodes available at '''gra-vdi.computecanada.ca''' that allow only VNC connections. For instructions on how to use them, see the [[VNC]] page.<br />
<br />
=Node characteristics= <!--T:5--><br />
A total of 41,548 cores and 520 GPU devices, spread across 1,185 nodes of different types.<br />
<br />
<!--T:55--><br />
{| class="wikitable sortable"<br />
! nodes !! cores !! available memory !! CPU !! storage !! GPU<br />
|-<br />
| 903 || 32 || 125G or 128000M || 2 x Intel E5-2683 v4 Broadwell @ 2.1GHz || 960GB SATA SSD || -<br />
|-<br />
| 24 || 32 || 502G or 514500M || 2 x Intel E5-2683 v4 Broadwell @ 2.1GHz || 960GB SATA SSD || -<br />
|-<br />
| 56 || 32 || 250G or 256500M || 2 x Intel E5-2683 v4 Broadwell @ 2.1GHz || 960GB SATA SSD || -<br />
|-<br />
| 3 || 64 || 3022G or 3095000M || 4 x Intel E7-4850 v4 Broadwell @ 2.1GHz || 960GB SATA SSD || -<br />
|-<br />
| 160 || 32 || 124G or 127518M || 2 x Intel E5-2683 v4 Broadwell @ 2.1GHz || 1.6TB NVMe SSD || 2 x NVIDIA P100 Pascal (12GB HBM2 memory)<br />
|-<br />
| 7 || 28 || 178G or 183105M || 2 x Intel Xeon Gold 5120 Skylake @ 2.2GHz || 4.0TB NVMe SSD || 8 x NVIDIA V100 Volta (16GB HBM2 memory)<br />
|-<br />
| 6 || 16 || 192G or 196608M || 2 x Intel Xeon Silver 4110 Skylake @ 2.10GHz || 11.0TB SATA SSD || 4 x NVIDIA T4 Turing (16GB GDDR6 memory)<br />
|-<br />
| 30 || 44 || 192G or 196608M || 2 x Intel Xeon Gold 6238 Cascade Lake @ 2.10GHz || 5.8TB NVMe SSD || 4 x NVIDIA T4 Turing (16GB GDDR6 memory)<br />
|-<br />
| 72 || 44 || 192G or 196608M || 2 x Intel Xeon Gold 6238 Cascade Lake @ 2.10GHz || 879GB SATA SSD || -<br />
|}<br />
<br />
<!--T:7--><br />
Best practice for local on-node storage is to use the temporary directory generated by [[Running jobs|Slurm]], <tt>$SLURM_TMPDIR</tt>. Note that this directory and its contents will disappear upon job completion.<br />
<br />
<!--T:38--><br />
Note that the amount of available memory is less than the "round number" suggested by hardware configuration. For instance, "base" nodes do have 128 GiB of RAM, but some of it is permanently occupied by the kernel and OS. To avoid wasting time by swapping/paging, the scheduler will never allocate jobs whose memory requirements exceed the specified amount of "available" memory. Please also note that the memory allocated to the job must be sufficient for IO buffering performed by the kernel and filesystem - this means that an IO-intensive job will often benefit from requesting somewhat more memory than the aggregate size of processes.<br />
<br />
= GPUs on Graham = <!--T:56--><br />
Graham contains Tesla GPUs from three different generations, listed here in order of age, from oldest to newest.<br />
* P100 Pascal GPUs<br />
* V100 Volta GPUs<br />
* T4 Turing GPUs<br />
<br />
<!--T:57--><br />
P100 is NVIDIA's all-purpose high performance card. V100 is its successor, with about double the performance for standard computation, and about 8X performance for deep learning computations which can utilize its tensor core computation units. T4 Turing is the latest card targeted specifically at deep learning workloads - it does not support efficient double precision computations, but it has good performance for single precision, and it also has tensor cores, plus support for reduced precision integer calculations.<br />
<br />
== Pascal GPU nodes on Graham == <!--T:58--><br />
<br />
<!--T:59--><br />
These are Graham's default GPU cards. Job submission for these cards is described on page: [[Using GPUs with Slurm]]. When a job simply request a GPU with --gres=gpu:1 or --gres=gpu:2, it will be assigned Pascal P100 cards. As all Pascal nodes have only 2 P100 GPUs, configuring jobs using these cards is relatively simple.<br />
<br />
==Volta GPU nodes on Graham== <!--T:46--><br />
In the first quarter of 2019, new Volta GPU nodes were added, as described in the table above.<br />
Four GPUs are connected to each CPU socket (except for one node, which is only populated with 6 GPUs, three per socket).<br />
<br />
<!--T:50--><br />
The nodes are available to all users with a 24 hour job runtime limit. Higher priority access with longer job runtimes can be granted to Ontario researchers by request. <br />
<br />
<!--T:51--><br />
Following is an example job script to submit a job to one of the nodes (with 8 GPUs). The module load command will ensure that modules compiled for Skylake architecture will be used. Replace nvidia-smi with the command you want to run.<br />
<br />
<!--T:52--><br />
'''Important''': You should scale the number of CPUs requested, keeping the ratio of CPUs to GPUs at 3.5 or less. For example, if you want to run a job using 4 GPUs, you should request at most 14 CPU cores. For a job with 1 GPU, you should request at most 3 CPU cores. Users are allowed to run a few short test jobs (shorter than 1 hour) that break this rule to see how your code performs.<br />
<br />
<!--T:53--><br />
Single-GPU example for default users:<br />
{{File<br />
|name=gpu_single_GPU_job.sh<br />
|lang="sh"<br />
|contents=<br />
#!/bin/bash<br />
#SBATCH --account=def-someuser<br />
#SBATCH --gres=gpu:v100:1<br />
#SBATCH --cpus-per-task=3<br />
#SBATCH --mem=12G<br />
#SBATCH --time=1-00:00<br />
module load arch/avx512 StdEnv/2018.3<br />
nvidia-smi<br />
}}<br />
Full-node example for default users:<br />
{{File<br />
|name=gpu_single_node_job.sh<br />
|lang="sh"<br />
|contents=<br />
#!/bin/bash<br />
#SBATCH --account=def-someuser<br />
#SBATCH --nodes=1<br />
#SBATCH --gres=gpu:v100:1<br />
#SBATCH --cpus-per-task=3<br />
#SBATCH --mem=12G<br />
#SBATCH --time=1-00:00<br />
module load arch/avx512 StdEnv/2018.3<br />
nvidia-smi<br />
}}<br />
Full-node example for Ontario users who have been granted higher priority access:<br />
{{File<br />
|name=gpu_single_node_job.sh<br />
|lang="sh"<br />
|contents=<br />
#!/bin/bash<br />
#SBATCH --account=ctb-ontario<br />
#SBATCH --partition=c-ontario<br />
#SBATCH --nodes=1<br />
#SBATCH --gres=gpu:v100:1<br />
#SBATCH --cpus-per-task=3<br />
#SBATCH --mem=12G<br />
#SBATCH --time=1-00:00<br />
module load arch/avx512 StdEnv/2018.3<br />
nvidia-smi<br />
}}<br />
<br />
<!--T:54--><br />
The Volta nodes have a fast local disk, which should be used for jobs if the amount of I/O performed by your job is significant. Inside the job, the location of the temporary directory on fast local disk is specified by the environment variable $SLURM_TMPDIR. You can copy your input files there at the start of your job script before you run your program and your output files out at the end of your job script. All the files in $SLURM_TMPDIR will be removed once the job ends, so you do not have to clean up that directory yourself. You can even create Python virtual environments in this temporary space for greater efficiency. Please see [[Python#Creating_virtual_environments_inside_of_your_jobs]] for information on how to that.<br />
<br />
==Turing GPU nodes on Graham== <!--T:60--><br />
<br />
<!--T:61--><br />
The usage of these nodes is similar to using the Volta nodes, except when requesting them, you should specify: <br />
<br />
<!--T:62--><br />
--gres=gpu:t4:2<br />
<br />
<!--T:63--><br />
In this example 2 T4 cards per node are requested.<br />
<br />
<br />
<br />
<!--T:14--><br />
<noinclude><br />
</translate><br />
</noinclude></div>Kaizaadhttps://docs.alliancecan.ca/mediawiki/index.php?title=Graham&diff=78506Graham2019-12-11T17:13:26Z<p>Kaizaad: </p>
<hr />
<div><noinclude><br />
<languages /><br />
<br />
<translate><br />
<!--T:27--><br />
</noinclude><br />
{| class="wikitable"<br />
|-<br />
| Availability: In production since June 2017F<br />
|-<br />
| Login node: '''graham.computecanada.ca'''<br />
|-<br />
| Globus endpoint: '''computecanada#graham-dtn'''<br />
|-<br />
| Data mover node (rsync, scp, sftp,...): '''gra-dtn1.computecanada.ca'''<br />
|}<br />
<br />
<!--T:2--><br />
Graham is a heterogeneous cluster, suitable for a variety of workloads, and located at the University of Waterloo. It is named after [https://en.wikipedia.org/wiki/Wes_Graham Wes Graham], the first director of the Computing Centre at Waterloo.<br />
<br />
<!--T:4--><br />
The parallel filesystem and external persistent storage ([[National Data Cyberinfrastructure|NDC-Waterloo]]) are similar to [[Cedar|Cedar's]]. The interconnect is different and there is a slightly different mix of compute nodes.<br />
<br />
<!--T:28--><br />
The Graham system is sold and supported by Huawei Canada, Inc. It is entirely liquid cooled, using rear-door heat exchangers.<br />
<br />
<!--T:33--><br />
[[Getting started with the new national systems|Getting started with Graham]]<br />
<br />
<!--T:36--><br />
[[Running_jobs|How to run jobs]]<br />
<br />
<!--T:37--><br />
[[Transferring_data|Transfering data]]<br />
<br />
= Site-specific policies = <!--T:39--><br />
<br />
<!--T:40--><br />
By policy, Graham's compute nodes cannot access the internet. If you need an exception to this rule, <br />
contact [[Technical Support|technical support]] with the following information:<br />
<br />
<!--T:42--><br />
<pre><br />
IP: <br />
Port/s: <br />
Protocol: TCP or UDP<br />
Contact: <br />
Removal Date: <br />
</pre><br />
<br />
<!--T:43--><br />
We will follow up with the contact before removing to confirm if this rule is still required.<br />
<br />
<!--T:41--><br />
Crontab is not offered on Graham.<br />
<br />
=Attached storage systems= <!--T:23--><br />
<br />
<!--T:24--><br />
{| class="wikitable sortable"<br />
|-<br />
| '''Home space''' ||<br />
* Location of home directories.<br />
* Each home directory has a small, fixed [[Storage and file management#Filesystem_quotas_and_policies|quota]]. <br />
* Not allocated via [https://www.computecanada.ca/research-portal/accessing-resources/rapid-access-service/ RAS] or [https://www.computecanada.ca/research-portal/accessing-resources/resource-allocation-competitions/ RAC]. Larger requests go to Project space.<br />
* Has daily backup.<br />
|-<br />
| '''Scratch space'''<br />3.6PB total volume<br />Parallel high-performance filesystem ||<br />
* For active or temporary (<code>/scratch</code>) storage.<br />
* Not allocated.<br />
* Large fixed [[Storage and file management#Filesystem_quotas_and_policies|quota]] per user.<br />
* Inactive data will be purged.<br />
|-<br />
|'''Project space'''<br />External persistent storage<br />
||<br />
* Part of the [[National Data Cyberinfrastructure]].<br />
* Allocated via [https://www.computecanada.ca/research-portal/accessing-resources/rapid-access-service/ RAS] or [https://www.computecanada.ca/research-portal/accessing-resources/resource-allocation-competitions/ RAC].<br />
* Not designed for parallel I/O workloads. Use Scratch space instead.<br />
* Large adjustable [[Storage and file management#Filesystem_quotas_and_policies|quota]] per project.<br />
* Has daily backup.<br />
|}<br />
<br />
=High-performance interconnect= <!--T:19--><br />
<br />
<!--T:21--><br />
Mellanox FDR (56Gb/s) and EDR (100Gb/s) InfiniBand interconnect. FDR is used for GPU and cloud nodes, EDR for other node types. A central 324-port director switch aggregates connections from islands of 1024 cores each for CPU and GPU nodes. The 56 cloud nodes are a variation on CPU nodes, and are on a single larger island sharing 8 FDR uplinks to the director switch.<br />
<br />
<!--T:29--><br />
A low-latency high-bandwidth Infiniband fabric connects all nodes and scratch storage.<br />
<br />
<!--T:30--><br />
Nodes configurable for cloud provisioning also have a 10Gb/s Ethernet network, with 40Gb/s uplinks to scratch storage.<br />
<br />
<!--T:22--><br />
The design of Graham is to support multiple simultaneous parallel jobs of up to 1024 cores in a fully non-blocking manner. <br />
<br />
<!--T:31--><br />
For larger jobs the interconnect has a 8:1 blocking factor, i.e., even for jobs running on multiple islands the Graham system provides a high-performance interconnect.<br />
<br />
<!--T:32--><br />
[https://docs.computecanada.ca/mediawiki/images/b/b3/Gp3-network-topo.png Graham high performance interconnect diagram]<br />
<br />
=Visualization on Graham= <!--T:44--><br />
<br />
<!--T:45--><br />
Graham has dedicated visualization nodes available at '''gra-vdi.computecanada.ca''' that allow only VNC connections. For instructions on how to use them, see the [[VNC]] page.<br />
<br />
=Node characteristics= <!--T:5--><br />
A total of 38,380 cores and 520 GPU devices, spread across 1,139 nodes of different types.<br />
<br />
<!--T:55--><br />
{| class="wikitable sortable"<br />
! nodes !! cores !! available memory !! CPU !! storage !! GPU<br />
|-<br />
| 903 || 32 || 125G or 128000M || 2 x Intel E5-2683 v4 Broadwell @ 2.1GHz || 960GB SATA SSD || -<br />
|-<br />
| 24 || 32 || 502G or 514500M || 2 x Intel E5-2683 v4 Broadwell @ 2.1GHz || 960GB SATA SSD || -<br />
|-<br />
| 56 || 32 || 250G or 256500M || 2 x Intel E5-2683 v4 Broadwell @ 2.1GHz || 960GB SATA SSD || -<br />
|-<br />
| 3 || 64 || 3022G or 3095000M || 4 x Intel E7-4850 v4 Broadwell @ 2.1GHz || 960GB SATA SSD || -<br />
|-<br />
| 160 || 32 || 124G or 127518M || 2 x Intel E5-2683 v4 Broadwell @ 2.1GHz || 1.6TB NVMe SSD || 2 x NVIDIA P100 Pascal (12GB HBM2 memory)<br />
|-<br />
| 7 || 28 || 178G or 183105M || 2 x Intel Xeon Gold 5120 Skylake @ 2.2GHz || 4.0TB NVMe SSD || 8 x NVIDIA V100 Volta (16GB HBM2 memory)<br />
|-<br />
| 6 || 16 || 192G or 196608M || 2 x Intel Xeon Silver 4110 Skylake @ 2.10GHz || 11.0TB SATA SSD || 4 x NVIDIA T4 Turing (16GB GDDR6 memory)<br />
|-<br />
| 30 || 44 || 192G or 196608M || 2 x Intel Xeon Gold 6238 Cascade Lake @ 2.10GHz || 5.8TB NVMe SSD || 4 x NVIDIA T4 Turing (16GB GDDR6 memory)<br />
|}<br />
<br />
<!--T:7--><br />
Best practice for local on-node storage is to use the temporary directory generated by [[Running jobs|Slurm]], <tt>$SLURM_TMPDIR</tt>. Note that this directory and its contents will disappear upon job completion.<br />
<br />
<!--T:38--><br />
Note that the amount of available memory is less than the "round number" suggested by hardware configuration. For instance, "base" nodes do have 128 GiB of RAM, but some of it is permanently occupied by the kernel and OS. To avoid wasting time by swapping/paging, the scheduler will never allocate jobs whose memory requirements exceed the specified amount of "available" memory. Please also note that the memory allocated to the job must be sufficient for IO buffering performed by the kernel and filesystem - this means that an IO-intensive job will often benefit from requesting somewhat more memory than the aggregate size of processes.<br />
<br />
= GPUs on Graham = <!--T:56--><br />
Graham contains Tesla GPUs from three different generations, listed here in order of age, from oldest to newest.<br />
* P100 Pascal GPUs<br />
* V100 Volta GPUs<br />
* T4 Turing GPUs<br />
<br />
<!--T:57--><br />
P100 is NVIDIA's all-purpose high performance card. V100 is its successor, with about double the performance for standard computation, and about 8X performance for deep learning computations which can utilize its tensor core computation units. T4 Turing is the latest card targeted specifically at deep learning workloads - it does not support efficient double precision computations, but it has good performance for single precision, and it also has tensor cores, plus support for reduced precision integer calculations.<br />
<br />
== Pascal GPU nodes on Graham == <!--T:58--><br />
<br />
<!--T:59--><br />
These are Graham's default GPU cards. Job submission for these cards is described on page: [[Using GPUs with Slurm]]. When a job simply request a GPU with --gres=gpu:1 or --gres=gpu:2, it will be assigned Pascal P100 cards. As all Pascal nodes have only 2 P100 GPUs, configuring jobs using these cards is relatively simple.<br />
<br />
==Volta GPU nodes on Graham== <!--T:46--><br />
In the first quarter of 2019, new Volta GPU nodes were added, as described in the table above.<br />
Four GPUs are connected to each CPU socket (except for one node, which is only populated with 6 GPUs, three per socket).<br />
<br />
<!--T:50--><br />
The nodes are available to all users with a 24 hour job runtime limit. Higher priority access with longer job runtimes can be granted to Ontario researchers by request. <br />
<br />
<!--T:51--><br />
Following is an example job script to submit a job to one of the nodes (with 8 GPUs). The module load command will ensure that modules compiled for Skylake architecture will be used. Replace nvidia-smi with the command you want to run.<br />
<br />
<!--T:52--><br />
'''Important''': You should scale the number of CPUs requested, keeping the ratio of CPUs to GPUs at 3.5 or less. For example, if you want to run a job using 4 GPUs, you should request at most 14 CPU cores. For a job with 1 GPU, you should request at most 3 CPU cores. Users are allowed to run a few short test jobs (shorter than 1 hour) that break this rule to see how your code performs.<br />
<br />
<!--T:53--><br />
Single-GPU example for default users:<br />
{{File<br />
|name=gpu_single_GPU_job.sh<br />
|lang="sh"<br />
|contents=<br />
#!/bin/bash<br />
#SBATCH --account=def-someuser<br />
#SBATCH --gres=gpu:v100:1<br />
#SBATCH --cpus-per-task=3<br />
#SBATCH --mem=12G<br />
#SBATCH --time=1-00:00<br />
module load arch/avx512 StdEnv/2018.3<br />
nvidia-smi<br />
}}<br />
Full-node example for default users:<br />
{{File<br />
|name=gpu_single_node_job.sh<br />
|lang="sh"<br />
|contents=<br />
#!/bin/bash<br />
#SBATCH --account=def-someuser<br />
#SBATCH --nodes=1<br />
#SBATCH --gres=gpu:v100:1<br />
#SBATCH --cpus-per-task=3<br />
#SBATCH --mem=12G<br />
#SBATCH --time=1-00:00<br />
module load arch/avx512 StdEnv/2018.3<br />
nvidia-smi<br />
}}<br />
Full-node example for Ontario users who have been granted higher priority access:<br />
{{File<br />
|name=gpu_single_node_job.sh<br />
|lang="sh"<br />
|contents=<br />
#!/bin/bash<br />
#SBATCH --account=ctb-ontario<br />
#SBATCH --partition=c-ontario<br />
#SBATCH --nodes=1<br />
#SBATCH --gres=gpu:v100:1<br />
#SBATCH --cpus-per-task=3<br />
#SBATCH --mem=12G<br />
#SBATCH --time=1-00:00<br />
module load arch/avx512 StdEnv/2018.3<br />
nvidia-smi<br />
}}<br />
<br />
<!--T:54--><br />
The Volta nodes have a fast local disk, which should be used for jobs if the amount of I/O performed by your job is significant. Inside the job, the location of the temporary directory on fast local disk is specified by the environment variable $SLURM_TMPDIR. You can copy your input files there at the start of your job script before you run your program and your output files out at the end of your job script. All the files in $SLURM_TMPDIR will be removed once the job ends, so you do not have to clean up that directory yourself. You can even create Python virtual environments in this temporary space for greater efficiency. Please see [[Python#Creating_virtual_environments_inside_of_your_jobs]] for information on how to that.<br />
<br />
==Turing GPU nodes on Graham== <!--T:60--><br />
<br />
<!--T:61--><br />
The usage of these nodes is similar to using the Volta nodes, except when requesting them, you should specify: <br />
<br />
<!--T:62--><br />
--gres=gpu:t4:2<br />
<br />
<!--T:63--><br />
In this example 2 T4 cards per node are requested.<br />
<br />
<br />
<br />
<!--T:14--><br />
<noinclude><br />
</translate><br />
</noinclude></div>Kaizaadhttps://docs.alliancecan.ca/mediawiki/index.php?title=Graham&diff=78505Graham2019-12-11T17:11:59Z<p>Kaizaad: Processor type change</p>
<hr />
<div><noinclude><br />
<languages /><br />
<br />
<translate><br />
<!--T:27--><br />
</noinclude><br />
{| class="wikitable"<br />
|-<br />
| Availability: In production since June 2017<br />
|-<br />
| Login node: '''graham.computecanada.ca'''<br />
|-<br />
| Globus endpoint: '''computecanada#graham-dtn'''<br />
|-<br />
| Data mover node (rsync, scp, sftp,...): '''gra-dtn1.computecanada.ca'''<br />
|}<br />
<br />
<!--T:2--><br />
Graham is a heterogeneous cluster, suitable for a variety of workloads, and located at the University of Waterloo. It is named after [https://en.wikipedia.org/wiki/Wes_Graham Wes Graham], the first director of the Computing Centre at Waterloo.<br />
<br />
<!--T:4--><br />
The parallel filesystem and external persistent storage ([[National Data Cyberinfrastructure|NDC-Waterloo]]) are similar to [[Cedar|Cedar's]]. The interconnect is different and there is a slightly different mix of compute nodes.<br />
<br />
<!--T:28--><br />
The Graham system is sold and supported by Huawei Canada, Inc. It is entirely liquid cooled, using rear-door heat exchangers.<br />
<br />
<!--T:33--><br />
[[Getting started with the new national systems|Getting started with Graham]]<br />
<br />
<!--T:36--><br />
[[Running_jobs|How to run jobs]]<br />
<br />
<!--T:37--><br />
[[Transferring_data|Transferring data]]<br />
<br />
= Site-specific policies = <!--T:39--><br />
<br />
<!--T:40--><br />
By policy, Graham's compute nodes cannot access the internet. If you need an exception to this rule, <br />
contact [[Technical Support|technical support]] with the following information:<br />
<br />
<!--T:42--><br />
<pre><br />
IP: <br />
Port/s: <br />
Protocol: TCP or UDP<br />
Contact: <br />
Removal Date: <br />
</pre><br />
<br />
<!--T:43--><br />
We will follow up with the contact before the removal date to confirm whether the exception is still required.<br />
<br />
<!--T:41--><br />
Crontab is not offered on Graham.<br />
<br />
=Attached storage systems= <!--T:23--><br />
<br />
<!--T:24--><br />
{| class="wikitable sortable"<br />
|-<br />
| '''Home space''' ||<br />
* Location of home directories.<br />
* Each home directory has a small, fixed [[Storage and file management#Filesystem_quotas_and_policies|quota]]. <br />
* Not allocated via [https://www.computecanada.ca/research-portal/accessing-resources/rapid-access-service/ RAS] or [https://www.computecanada.ca/research-portal/accessing-resources/resource-allocation-competitions/ RAC]. Larger requests go to Project space.<br />
* Has daily backup.<br />
|-<br />
| '''Scratch space'''<br />3.6PB total volume<br />Parallel high-performance filesystem ||<br />
* For active or temporary (<code>/scratch</code>) storage.<br />
* Not allocated.<br />
* Large fixed [[Storage and file management#Filesystem_quotas_and_policies|quota]] per user.<br />
* Inactive data will be purged.<br />
|-<br />
|'''Project space'''<br />External persistent storage<br />
||<br />
* Part of the [[National Data Cyberinfrastructure]].<br />
* Allocated via [https://www.computecanada.ca/research-portal/accessing-resources/rapid-access-service/ RAS] or [https://www.computecanada.ca/research-portal/accessing-resources/resource-allocation-competitions/ RAC].<br />
* Not designed for parallel I/O workloads. Use Scratch space instead.<br />
* Large adjustable [[Storage and file management#Filesystem_quotas_and_policies|quota]] per project.<br />
* Has daily backup.<br />
|}<br />
<br />
=High-performance interconnect= <!--T:19--><br />
<br />
<!--T:21--><br />
Mellanox FDR (56Gb/s) and EDR (100Gb/s) InfiniBand interconnect. FDR is used for GPU and cloud nodes, EDR for other node types. A central 324-port director switch aggregates connections from islands of 1024 cores each for CPU and GPU nodes. The 56 cloud nodes are a variation on CPU nodes, and are on a single larger island sharing 8 FDR uplinks to the director switch.<br />
<br />
<!--T:29--><br />
A low-latency, high-bandwidth InfiniBand fabric connects all nodes and scratch storage.<br />
<br />
<!--T:30--><br />
Nodes configurable for cloud provisioning also have a 10Gb/s Ethernet network, with 40Gb/s uplinks to scratch storage.<br />
<br />
<!--T:22--><br />
Graham is designed to support multiple simultaneous parallel jobs of up to 1024 cores each in a fully non-blocking manner.<br />
<br />
<!--T:31--><br />
For larger jobs the interconnect has an 8:1 blocking factor; even for jobs running on multiple islands, Graham still provides a high-performance interconnect.<br />
<br />
<!--T:32--><br />
[https://docs.computecanada.ca/mediawiki/images/b/b3/Gp3-network-topo.png Graham high performance interconnect diagram]<br />
<br />
=Visualization on Graham= <!--T:44--><br />
<br />
<!--T:45--><br />
Graham has dedicated visualization nodes available at '''gra-vdi.computecanada.ca''' that allow only VNC connections. For instructions on how to use them, see the [[VNC]] page.<br />
<br />
=Node characteristics= <!--T:5--><br />
A total of 38,380 cores and 520 GPU devices, spread across 1,139 nodes of different types.<br />
<br />
<!--T:55--><br />
{| class="wikitable sortable"<br />
! nodes !! cores !! available memory !! CPU !! storage !! GPU<br />
|-<br />
| 903 || 32 || 125G or 128000M || 2 x Intel E5-2683 v4 Broadwell @ 2.1GHz || 960GB SATA SSD || -<br />
|-<br />
| 24 || 32 || 502G or 514500M || 2 x Intel E5-2683 v4 Broadwell @ 2.1GHz || 960GB SATA SSD || -<br />
|-<br />
| 56 || 32 || 250G or 256500M || 2 x Intel E5-2683 v4 Broadwell @ 2.1GHz || 960GB SATA SSD || -<br />
|-<br />
| 3 || 64 || 3022G or 3095000M || 4 x Intel E7-4850 v4 Broadwell @ 2.1GHz || 960GB SATA SSD || -<br />
|-<br />
| 160 || 32 || 124G or 127518M || 2 x Intel E5-2683 v4 Broadwell @ 2.1GHz || 1.6TB NVMe SSD || 2 x NVIDIA P100 Pascal (12GB HBM2 memory)<br />
|-<br />
| 7 || 28 || 178G or 183105M || 2 x Intel Xeon Gold 5120 Skylake @ 2.2GHz || 4.0TB NVMe SSD || 8 x NVIDIA V100 Volta (16GB HBM2 memory)<br />
|-<br />
| 6 || 16 || 192G or 196608M || 2 x Intel Xeon Silver 4110 Skylake @ 2.10GHz || 11.0TB SATA SSD || 4 x NVIDIA T4 Turing (16GB GDDR6 memory)<br />
|-<br />
| 30 || 44 || 192G or 196608M || 2 x Intel Xeon Gold 6238 Cascade Lake @ 2.10GHz || 5.8TB NVMe SSD || 4 x NVIDIA T4 Turing (16GB GDDR6 memory)<br />
|}<br />
<br />
<!--T:7--><br />
Best practice for local on-node storage is to use the temporary directory generated by [[Running jobs|Slurm]], <tt>$SLURM_TMPDIR</tt>. Note that this directory and its contents will disappear upon job completion.<br />
<br />
<!--T:38--><br />
Note that the amount of available memory is less than the "round number" suggested by the hardware configuration. For instance, "base" nodes do have 128 GiB of RAM, but some of it is permanently occupied by the kernel and OS. To avoid wasting time on swapping/paging, the scheduler will never allocate jobs whose memory requirements exceed the specified amount of "available" memory. Please also note that the memory allocated to the job must be sufficient for I/O buffering performed by the kernel and filesystem; this means that an I/O-intensive job will often benefit from requesting somewhat more memory than the aggregate size of its processes.<br />
<br />
= GPUs on Graham = <!--T:56--><br />
Graham contains Tesla GPUs from three different generations, listed here in order of age, from oldest to newest.<br />
* P100 Pascal - 320 GPUs<br />
* V100 Volta - 54 GPUs<br />
* T4 Turing - 144 GPUs<br />
<br />
<!--T:57--><br />
The P100 is NVIDIA's all-purpose high-performance card. The V100 is its successor, with about double the performance for standard computation and about 8x the performance for deep learning computations that can use its tensor cores. The T4 is the newest card, targeted specifically at deep learning workloads; it does not support efficient double-precision computation, but it has good single-precision performance, tensor cores, and support for reduced-precision integer calculations.<br />
<br />
== Pascal GPU nodes on Graham == <!--T:58--><br />
<br />
<!--T:59--><br />
These are Graham's default GPU cards. Job submission for these cards is described on the page [[Using GPUs with Slurm]]. When a job simply requests a GPU with --gres=gpu:1 or --gres=gpu:2, it will be assigned Pascal P100 cards. Since each Pascal node has only 2 P100 GPUs, configuring jobs for these cards is relatively simple.<br />
<br />
==Volta GPU nodes on Graham== <!--T:46--><br />
In the first quarter of 2019, new Volta GPU nodes were added, as described in the table above.<br />
Four GPUs are connected to each CPU socket (except for one node, which is only populated with 6 GPUs, three per socket).<br />
<br />
<!--T:50--><br />
These nodes are available to all users, with a 24-hour job runtime limit. Higher-priority access with longer job runtimes can be granted to Ontario researchers by request.<br />
<br />
<!--T:51--><br />
Following are example job scripts for submitting to one of these nodes (with 8 GPUs). The module load command ensures that modules compiled for the Skylake architecture are used. Replace nvidia-smi with the command you want to run.<br />
<br />
<!--T:52--><br />
'''Important''': You should scale the number of CPUs requested, keeping the ratio of CPUs to GPUs at 3.5 or less. For example, if you want to run a job using 4 GPUs, you should request at most 14 CPU cores. For a job with 1 GPU, you should request at most 3 CPU cores. You may run a few short test jobs (shorter than 1 hour) that break this rule to see how your code performs.<br />
<br />
<!--T:53--><br />
Single-GPU example for default users:<br />
{{File<br />
|name=gpu_single_GPU_job.sh<br />
|lang="sh"<br />
|contents=<br />
#!/bin/bash<br />
#SBATCH --account=def-someuser<br />
#SBATCH --gres=gpu:v100:1<br />
#SBATCH --cpus-per-task=3<br />
#SBATCH --mem=12G<br />
#SBATCH --time=1-00:00<br />
module load arch/avx512 StdEnv/2018.3<br />
nvidia-smi<br />
}}<br />
Full-node example for default users:<br />
{{File<br />
|name=gpu_single_node_job.sh<br />
|lang="sh"<br />
|contents=<br />
#!/bin/bash<br />
#SBATCH --account=def-someuser<br />
#SBATCH --nodes=1<br />
#SBATCH --gres=gpu:v100:8<br />
#SBATCH --cpus-per-task=28<br />
#SBATCH --mem=150G<br />
#SBATCH --time=1-00:00<br />
module load arch/avx512 StdEnv/2018.3<br />
nvidia-smi<br />
}}<br />
Full-node example for Ontario users who have been granted higher priority access:<br />
{{File<br />
|name=gpu_single_node_job.sh<br />
|lang="sh"<br />
|contents=<br />
#!/bin/bash<br />
#SBATCH --account=ctb-ontario<br />
#SBATCH --partition=c-ontario<br />
#SBATCH --nodes=1<br />
#SBATCH --gres=gpu:v100:8<br />
#SBATCH --cpus-per-task=28<br />
#SBATCH --mem=150G<br />
#SBATCH --time=1-00:00<br />
module load arch/avx512 StdEnv/2018.3<br />
nvidia-smi<br />
}}<br />
<br />
<!--T:54--><br />
The Volta nodes have a fast local disk, which should be used if your job performs a significant amount of I/O. Inside the job, the location of the temporary directory on the fast local disk is given by the environment variable $SLURM_TMPDIR. You can copy your input files there at the start of your job script, before you run your program, and copy your output files out at the end. All files in $SLURM_TMPDIR will be removed once the job ends, so you do not have to clean up that directory yourself. You can even create Python virtual environments in this temporary space for greater efficiency; see [[Python#Creating_virtual_environments_inside_of_your_jobs]] for information on how to do that.<br />
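For example, a minimal sketch of this staging pattern (my_program and the file paths are hypothetical placeholders):<br />
{{File<br />
|name=gpu_tmpdir_staging_job.sh<br />
|lang="sh"<br />
|contents=<br />
#!/bin/bash<br />
#SBATCH --account=def-someuser<br />
#SBATCH --gres=gpu:v100:1<br />
#SBATCH --cpus-per-task=3<br />
#SBATCH --mem=12G<br />
#SBATCH --time=1-00:00<br />
# Stage input data onto the fast node-local disk.<br />
cp /project/def-someuser/input.dat $SLURM_TMPDIR/<br />
cd $SLURM_TMPDIR<br />
# Run your program (my_program is a placeholder for your own executable).<br />
./my_program input.dat output.dat<br />
# Copy results back before the job ends; $SLURM_TMPDIR is wiped afterwards.<br />
cp output.dat /project/def-someuser/<br />
}}<br />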
<br />
==Turing GPU nodes on Graham== <!--T:60--><br />
<br />
<!--T:61--><br />
Using these nodes is similar to using the Volta nodes, except that when requesting them you should specify:<br />
<br />
<!--T:62--><br />
--gres=gpu:t4:2<br />
<br />
<!--T:63--><br />
In this example 2 T4 cards per node are requested.<br />
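As a sketch, a job script for these nodes can follow the Volta examples above with only the GPU specification changed (the CPU and memory values here are illustrative, not prescribed):<br />
{{File<br />
|name=gpu_t4_job.sh<br />
|lang="sh"<br />
|contents=<br />
#!/bin/bash<br />
#SBATCH --account=def-someuser<br />
#SBATCH --gres=gpu:t4:2<br />
#SBATCH --cpus-per-task=8<br />
#SBATCH --mem=32G<br />
#SBATCH --time=1-00:00<br />
nvidia-smi<br />
}}<br />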
<br />
<br />
<br />
<!--T:14--><br />
<noinclude><br />
</translate><br />
</noinclude></div>Kaizaadhttps://docs.alliancecan.ca/mediawiki/index.php?title=Graham&diff=78504Graham2019-12-11T16:57:20Z<p>Kaizaad: Update total cores, gpu devices, and nodes. -> sinfo -o '%F %C' NODES(A/I/O/T) CPUS(A/I/O/T) 1139/42/8/1189 34706/</p>
<hr />
<div><noinclude><br />
<languages /><br />
<br />
<translate><br />
<!--T:27--><br />
</noinclude><br />
{| class="wikitable"<br />
|-<br />
| Availability: In production since June 2017<br />
|-<br />
| Login node: '''graham.computecanada.ca'''<br />
|-<br />
| Globus endpoint: '''computecanada#graham-dtn'''<br />
|-<br />
| Data mover node (rsync, scp, sftp,...): '''gra-dtn1.computecanada.ca'''<br />
|}<br />
<br />
<!--T:2--><br />
Graham is a heterogeneous cluster, suitable for a variety of workloads, and located at the University of Waterloo. It is named after [https://en.wikipedia.org/wiki/Wes_Graham Wes Graham], the first director of the Computing Centre at Waterloo.<br />
<br />
<!--T:4--><br />
The parallel filesystem and external persistent storage ([[National Data Cyberinfrastructure|NDC-Waterloo]]) are similar to [[Cedar|Cedar's]]. The interconnect is different and there is a slightly different mix of compute nodes.<br />
<br />
<!--T:28--><br />
The Graham system is sold and supported by Huawei Canada, Inc. It is entirely liquid cooled, using rear-door heat exchangers.<br />
<br />
<!--T:33--><br />
[[Getting started with the new national systems|Getting started with Graham]]<br />
<br />
<!--T:36--><br />
[[Running_jobs|How to run jobs]]<br />
<br />
<!--T:37--><br />
[[Transferring_data|Transferring data]]<br />
<br />
= Site-specific policies = <!--T:39--><br />
<br />
<!--T:40--><br />
By policy, Graham's compute nodes cannot access the internet. If you need an exception to this rule, <br />
contact [[Technical Support|technical support]] with the following information:<br />
<br />
<!--T:42--><br />
<pre><br />
IP: <br />
Port/s: <br />
Protocol: TCP or UDP<br />
Contact: <br />
Removal Date: <br />
</pre><br />
<br />
<!--T:43--><br />
We will follow up with the contact before the removal date to confirm whether the exception is still required.<br />
<br />
<!--T:41--><br />
Crontab is not offered on Graham.<br />
<br />
=Attached storage systems= <!--T:23--><br />
<br />
<!--T:24--><br />
{| class="wikitable sortable"<br />
|-<br />
| '''Home space''' ||<br />
* Location of home directories.<br />
* Each home directory has a small, fixed [[Storage and file management#Filesystem_quotas_and_policies|quota]]. <br />
* Not allocated via [https://www.computecanada.ca/research-portal/accessing-resources/rapid-access-service/ RAS] or [https://www.computecanada.ca/research-portal/accessing-resources/resource-allocation-competitions/ RAC]. Larger requests go to Project space.<br />
* Has daily backup.<br />
|-<br />
| '''Scratch space'''<br />3.6PB total volume<br />Parallel high-performance filesystem ||<br />
* For active or temporary (<code>/scratch</code>) storage.<br />
* Not allocated.<br />
* Large fixed [[Storage and file management#Filesystem_quotas_and_policies|quota]] per user.<br />
* Inactive data will be purged.<br />
|-<br />
|'''Project space'''<br />External persistent storage<br />
||<br />
* Part of the [[National Data Cyberinfrastructure]].<br />
* Allocated via [https://www.computecanada.ca/research-portal/accessing-resources/rapid-access-service/ RAS] or [https://www.computecanada.ca/research-portal/accessing-resources/resource-allocation-competitions/ RAC].<br />
* Not designed for parallel I/O workloads. Use Scratch space instead.<br />
* Large adjustable [[Storage and file management#Filesystem_quotas_and_policies|quota]] per project.<br />
* Has daily backup.<br />
|}<br />
<br />
=High-performance interconnect= <!--T:19--><br />
<br />
<!--T:21--><br />
Mellanox FDR (56Gb/s) and EDR (100Gb/s) InfiniBand interconnect. FDR is used for GPU and cloud nodes, EDR for other node types. A central 324-port director switch aggregates connections from islands of 1024 cores each for CPU and GPU nodes. The 56 cloud nodes are a variation on CPU nodes, and are on a single larger island sharing 8 FDR uplinks to the director switch.<br />
<br />
<!--T:29--><br />
A low-latency, high-bandwidth InfiniBand fabric connects all nodes and scratch storage.<br />
<br />
<!--T:30--><br />
Nodes configurable for cloud provisioning also have a 10Gb/s Ethernet network, with 40Gb/s uplinks to scratch storage.<br />
<br />
<!--T:22--><br />
Graham is designed to support multiple simultaneous parallel jobs of up to 1024 cores each in a fully non-blocking manner.<br />
<br />
<!--T:31--><br />
For larger jobs the interconnect has an 8:1 blocking factor; even for jobs running on multiple islands, Graham still provides a high-performance interconnect.<br />
<br />
<!--T:32--><br />
[https://docs.computecanada.ca/mediawiki/images/b/b3/Gp3-network-topo.png Graham high performance interconnect diagram]<br />
<br />
=Visualization on Graham= <!--T:44--><br />
<br />
<!--T:45--><br />
Graham has dedicated visualization nodes available at '''gra-vdi.computecanada.ca''' that allow only VNC connections. For instructions on how to use them, see the [[VNC]] page.<br />
<br />
=Node characteristics= <!--T:5--><br />
A total of 38,380 cores and 520 GPU devices, spread across 1,139 nodes of different types.<br />
<br />
<!--T:55--><br />
{| class="wikitable sortable"<br />
! nodes !! cores !! available memory !! CPU !! storage !! GPU<br />
|-<br />
| 903 || 32 || 125G or 128000M || 2 x Intel E5-2683 v4 Broadwell @ 2.1GHz || 960GB SATA SSD || -<br />
|-<br />
| 24 || 32 || 502G or 514500M || 2 x Intel E5-2683 v4 Broadwell @ 2.1GHz || 960GB SATA SSD || -<br />
|-<br />
| 56 || 32 || 250G or 256500M || 2 x Intel E5-2683 v4 Broadwell @ 2.1GHz || 960GB SATA SSD || -<br />
|-<br />
| 3 || 64 || 3022G or 3095000M || 4 x Intel E7-4850 v4 Broadwell @ 2.1GHz || 960GB SATA SSD || -<br />
|-<br />
| 160 || 32 || 124G or 127518M || 2 x Intel E5-2683 v4 Broadwell @ 2.1GHz || 1.6TB NVMe SSD || 2 x NVIDIA P100 Pascal (12GB HBM2 memory)<br />
|-<br />
| 7 || 28 || 178G or 183105M || 2 x Intel Xeon Gold 5120 Skylake @ 2.2GHz || 4.0TB NVMe SSD || 8 x NVIDIA V100 Volta (16GB HBM2 memory)<br />
|-<br />
| 6 || 16 || 192G or 196608M || 2 x Intel Xeon Silver 4110 Skylake @ 2.10GHz || 11.0TB SATA SSD || 4 x NVIDIA T4 Turing (16GB GDDR6 memory)<br />
|-<br />
| 30 || 44 || 192G or 196608M || 2 x Intel Xeon Gold 6238 Skylake @ 2.10GHz || 5.8TB NVMe SSD || 4 x NVIDIA T4 Turing (16GB GDDR6 memory)<br />
|}<br />
<br />
<!--T:7--><br />
Best practice for local on-node storage is to use the temporary directory generated by [[Running jobs|Slurm]], <tt>$SLURM_TMPDIR</tt>. Note that this directory and its contents will disappear upon job completion.<br />
<br />
<!--T:38--><br />
Note that the amount of available memory is less than the "round number" suggested by the hardware configuration. For instance, "base" nodes do have 128 GiB of RAM, but some of it is permanently occupied by the kernel and OS. To avoid wasting time on swapping/paging, the scheduler will never allocate jobs whose memory requirements exceed the specified amount of "available" memory. Please also note that the memory allocated to the job must be sufficient for I/O buffering performed by the kernel and filesystem; this means that an I/O-intensive job will often benefit from requesting somewhat more memory than the aggregate size of its processes.<br />
<br />
= GPUs on Graham = <!--T:56--><br />
Graham contains Tesla GPUs from three different generations, listed here in order of age, from oldest to newest.<br />
* P100 Pascal - 320 GPUs<br />
* V100 Volta - 54 GPUs<br />
* T4 Turing - 144 GPUs<br />
<br />
<!--T:57--><br />
The P100 is NVIDIA's all-purpose high-performance card. The V100 is its successor, with about double the performance for standard computation and about 8x the performance for deep learning computations that can use its tensor cores. The T4 is the newest card, targeted specifically at deep learning workloads; it does not support efficient double-precision computation, but it has good single-precision performance, tensor cores, and support for reduced-precision integer calculations.<br />
<br />
== Pascal GPU nodes on Graham == <!--T:58--><br />
<br />
<!--T:59--><br />
These are Graham's default GPU cards. Job submission for these cards is described on the page [[Using GPUs with Slurm]]. When a job simply requests a GPU with --gres=gpu:1 or --gres=gpu:2, it will be assigned Pascal P100 cards. Since each Pascal node has only 2 P100 GPUs, configuring jobs for these cards is relatively simple.<br />
<br />
==Volta GPU nodes on Graham== <!--T:46--><br />
In the first quarter of 2019, new Volta GPU nodes were added, as described in the table above.<br />
Four GPUs are connected to each CPU socket (except for one node, which is only populated with 6 GPUs, three per socket).<br />
<br />
<!--T:50--><br />
These nodes are available to all users, with a 24-hour job runtime limit. Higher-priority access with longer job runtimes can be granted to Ontario researchers by request.<br />
<br />
<!--T:51--><br />
Following are example job scripts for submitting to one of these nodes (with 8 GPUs). The module load command ensures that modules compiled for the Skylake architecture are used. Replace nvidia-smi with the command you want to run.<br />
<br />
<!--T:52--><br />
'''Important''': You should scale the number of CPUs requested, keeping the ratio of CPUs to GPUs at 3.5 or less. For example, if you want to run a job using 4 GPUs, you should request at most 14 CPU cores. For a job with 1 GPU, you should request at most 3 CPU cores. You may run a few short test jobs (shorter than 1 hour) that break this rule to see how your code performs.<br />
<br />
<!--T:53--><br />
Single-GPU example for default users:<br />
{{File<br />
|name=gpu_single_GPU_job.sh<br />
|lang="sh"<br />
|contents=<br />
#!/bin/bash<br />
#SBATCH --account=def-someuser<br />
#SBATCH --gres=gpu:v100:1<br />
#SBATCH --cpus-per-task=3<br />
#SBATCH --mem=12G<br />
#SBATCH --time=1-00:00<br />
module load arch/avx512 StdEnv/2018.3<br />
nvidia-smi<br />
}}<br />
Full-node example for default users:<br />
{{File<br />
|name=gpu_single_node_job.sh<br />
|lang="sh"<br />
|contents=<br />
#!/bin/bash<br />
#SBATCH --account=def-someuser<br />
#SBATCH --nodes=1<br />
#SBATCH --gres=gpu:v100:8<br />
#SBATCH --cpus-per-task=28<br />
#SBATCH --mem=150G<br />
#SBATCH --time=1-00:00<br />
module load arch/avx512 StdEnv/2018.3<br />
nvidia-smi<br />
}}<br />
Full-node example for Ontario users who have been granted higher priority access:<br />
{{File<br />
|name=gpu_single_node_job.sh<br />
|lang="sh"<br />
|contents=<br />
#!/bin/bash<br />
#SBATCH --account=ctb-ontario<br />
#SBATCH --partition=c-ontario<br />
#SBATCH --nodes=1<br />
#SBATCH --gres=gpu:v100:8<br />
#SBATCH --cpus-per-task=28<br />
#SBATCH --mem=150G<br />
#SBATCH --time=1-00:00<br />
module load arch/avx512 StdEnv/2018.3<br />
nvidia-smi<br />
}}<br />
<br />
<!--T:54--><br />
The Volta nodes have a fast local disk, which should be used if your job performs a significant amount of I/O. Inside the job, the location of the temporary directory on the fast local disk is given by the environment variable $SLURM_TMPDIR. You can copy your input files there at the start of your job script, before you run your program, and copy your output files out at the end. All files in $SLURM_TMPDIR will be removed once the job ends, so you do not have to clean up that directory yourself. You can even create Python virtual environments in this temporary space for greater efficiency; see [[Python#Creating_virtual_environments_inside_of_your_jobs]] for information on how to do that.<br />
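For example, a minimal sketch of this staging pattern (my_program and the file paths are hypothetical placeholders):<br />
{{File<br />
|name=gpu_tmpdir_staging_job.sh<br />
|lang="sh"<br />
|contents=<br />
#!/bin/bash<br />
#SBATCH --account=def-someuser<br />
#SBATCH --gres=gpu:v100:1<br />
#SBATCH --cpus-per-task=3<br />
#SBATCH --mem=12G<br />
#SBATCH --time=1-00:00<br />
# Stage input data onto the fast node-local disk.<br />
cp /project/def-someuser/input.dat $SLURM_TMPDIR/<br />
cd $SLURM_TMPDIR<br />
# Run your program (my_program is a placeholder for your own executable).<br />
./my_program input.dat output.dat<br />
# Copy results back before the job ends; $SLURM_TMPDIR is wiped afterwards.<br />
cp output.dat /project/def-someuser/<br />
}}<br />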
<br />
==Turing GPU nodes on Graham== <!--T:60--><br />
<br />
<!--T:61--><br />
Using these nodes is similar to using the Volta nodes, except that when requesting them you should specify:<br />
<br />
<!--T:62--><br />
--gres=gpu:t4:2<br />
<br />
<!--T:63--><br />
In this example 2 T4 cards per node are requested.<br />
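As a sketch, a job script for these nodes can follow the Volta examples above with only the GPU specification changed (the CPU and memory values here are illustrative, not prescribed):<br />
{{File<br />
|name=gpu_t4_job.sh<br />
|lang="sh"<br />
|contents=<br />
#!/bin/bash<br />
#SBATCH --account=def-someuser<br />
#SBATCH --gres=gpu:t4:2<br />
#SBATCH --cpus-per-task=8<br />
#SBATCH --mem=32G<br />
#SBATCH --time=1-00:00<br />
nvidia-smi<br />
}}<br />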
<br />
<br />
<br />
<!--T:14--><br />
<noinclude><br />
</translate><br />
</noinclude></div>Kaizaadhttps://docs.alliancecan.ca/mediawiki/index.php?title=Graham&diff=78503Graham2019-12-11T16:50:23Z<p>Kaizaad: </p>
<hr />
<div><noinclude><br />
<languages /><br />
<br />
<translate><br />
<!--T:27--><br />
</noinclude><br />
{| class="wikitable"<br />
|-<br />
| Availability: In production since June 2017<br />
|-<br />
| Login node: '''graham.computecanada.ca'''<br />
|-<br />
| Globus endpoint: '''computecanada#graham-dtn'''<br />
|-<br />
| Data mover node (rsync, scp, sftp,...): '''gra-dtn1.computecanada.ca'''<br />
|}<br />
<br />
<!--T:2--><br />
Graham is a heterogeneous cluster, suitable for a variety of workloads, and located at the University of Waterloo. It is named after [https://en.wikipedia.org/wiki/Wes_Graham Wes Graham], the first director of the Computing Centre at Waterloo.<br />
<br />
<!--T:4--><br />
The parallel filesystem and external persistent storage ([[National Data Cyberinfrastructure|NDC-Waterloo]]) are similar to [[Cedar|Cedar's]]. The interconnect is different and there is a slightly different mix of compute nodes.<br />
<br />
<!--T:28--><br />
The Graham system is sold and supported by Huawei Canada, Inc. It is entirely liquid cooled, using rear-door heat exchangers.<br />
<br />
<!--T:33--><br />
[[Getting started with the new national systems|Getting started with Graham]]<br />
<br />
<!--T:36--><br />
[[Running_jobs|How to run jobs]]<br />
<br />
<!--T:37--><br />
[[Transferring_data|Transferring data]]<br />
<br />
= Site-specific policies = <!--T:39--><br />
<br />
<!--T:40--><br />
By policy, Graham's compute nodes cannot access the internet. If you need an exception to this rule, <br />
contact [[Technical Support|technical support]] with the following information:<br />
<br />
<!--T:42--><br />
<pre><br />
IP: <br />
Port/s: <br />
Protocol: TCP or UDP<br />
Contact: <br />
Removal Date: <br />
</pre><br />
<br />
<!--T:43--><br />
We will follow up with the contact before the removal date to confirm whether the exception is still required.<br />
<br />
<!--T:41--><br />
Crontab is not offered on Graham.<br />
<br />
=Attached storage systems= <!--T:23--><br />
<br />
<!--T:24--><br />
{| class="wikitable sortable"<br />
|-<br />
| '''Home space''' ||<br />
* Location of home directories.<br />
* Each home directory has a small, fixed [[Storage and file management#Filesystem_quotas_and_policies|quota]]. <br />
* Not allocated via [https://www.computecanada.ca/research-portal/accessing-resources/rapid-access-service/ RAS] or [https://www.computecanada.ca/research-portal/accessing-resources/resource-allocation-competitions/ RAC]. Larger requests go to Project space.<br />
* Has daily backup.<br />
|-<br />
| '''Scratch space'''<br />3.6PB total volume<br />Parallel high-performance filesystem ||<br />
* For active or temporary (<code>/scratch</code>) storage.<br />
* Not allocated.<br />
* Large fixed [[Storage and file management#Filesystem_quotas_and_policies|quota]] per user.<br />
* Inactive data will be purged.<br />
|-<br />
|'''Project space'''<br />External persistent storage<br />
||<br />
* Part of the [[National Data Cyberinfrastructure]].<br />
* Allocated via [https://www.computecanada.ca/research-portal/accessing-resources/rapid-access-service/ RAS] or [https://www.computecanada.ca/research-portal/accessing-resources/resource-allocation-competitions/ RAC].<br />
* Not designed for parallel I/O workloads. Use Scratch space instead.<br />
* Large adjustable [[Storage and file management#Filesystem_quotas_and_policies|quota]] per project.<br />
* Has daily backup.<br />
|}<br />
<br />
=High-performance interconnect= <!--T:19--><br />
<br />
<!--T:21--><br />
Mellanox FDR (56Gb/s) and EDR (100Gb/s) InfiniBand interconnect. FDR is used for GPU and cloud nodes, EDR for other node types. A central 324-port director switch aggregates connections from islands of 1024 cores each for CPU and GPU nodes. The 56 cloud nodes are a variation on CPU nodes, and are on a single larger island sharing 8 FDR uplinks to the director switch.<br />
<br />
<!--T:29--><br />
A low-latency, high-bandwidth InfiniBand fabric connects all nodes and scratch storage.<br />
<br />
<!--T:30--><br />
Nodes configurable for cloud provisioning also have a 10Gb/s Ethernet network, with 40Gb/s uplinks to scratch storage.<br />
<br />
<!--T:22--><br />
Graham is designed to support multiple simultaneous parallel jobs of up to 1024 cores each in a fully non-blocking manner.<br />
<br />
<!--T:31--><br />
For larger jobs the interconnect has an 8:1 blocking factor; even for jobs running on multiple islands, Graham still provides a high-performance interconnect.<br />
<br />
<!--T:32--><br />
[https://docs.computecanada.ca/mediawiki/images/b/b3/Gp3-network-topo.png Graham high performance interconnect diagram]<br />
<br />
=Visualization on Graham= <!--T:44--><br />
<br />
<!--T:45--><br />
Graham has dedicated visualization nodes available at '''gra-vdi.computecanada.ca''' that allow only VNC connections. For instructions on how to use them, see the [[VNC]] page.<br />
<br />
=Node characteristics= <!--T:5--><br />
A total of 36,160 cores and 320 GPU devices, spread across 1,127 nodes of different types.<br />
<br />
<!--T:55--><br />
{| class="wikitable sortable"<br />
! nodes !! cores !! available memory !! CPU !! storage !! GPU<br />
|-<br />
| 903 || 32 || 125G or 128000M || 2 x Intel E5-2683 v4 Broadwell @ 2.1GHz || 960GB SATA SSD || -<br />
|-<br />
| 24 || 32 || 502G or 514500M || 2 x Intel E5-2683 v4 Broadwell @ 2.1GHz || 960GB SATA SSD || -<br />
|-<br />
| 56 || 32 || 250G or 256500M || 2 x Intel E5-2683 v4 Broadwell @ 2.1GHz || 960GB SATA SSD || -<br />
|-<br />
| 3 || 64 || 3022G or 3095000M || 4 x Intel E7-4850 v4 Broadwell @ 2.1GHz || 960GB SATA SSD || -<br />
|-<br />
| 160 || 32 || 124G or 127518M || 2 x Intel E5-2683 v4 Broadwell @ 2.1GHz || 1.6TB NVMe SSD || 2 x NVIDIA P100 Pascal (12GB HBM2 memory)<br />
|-<br />
| 7 || 28 || 178G or 183105M || 2 x Intel Xeon Gold 5120 Skylake @ 2.2GHz || 4.0TB NVMe SSD || 8 x NVIDIA V100 Volta (16GB HBM2 memory)<br />
|-<br />
| 6 || 16 || 192G or 196608M || 2 x Intel Xeon Silver 4110 Skylake @ 2.10GHz || 11.0TB SATA SSD || 4 x NVIDIA T4 Turing (16GB GDDR6 memory)<br />
|-<br />
| 30 || 44 || 192G or 196608M || 2 x Intel Xeon Gold 6238 Skylake @ 2.10GHz || 5.8TB NVMe SSD || 4 x NVIDIA T4 Turing (16GB GDDR6 memory)<br />
|}<br />
<br />
<!--T:7--><br />
Best practice for local on-node storage is to use the temporary directory generated by [[Running jobs|Slurm]], <tt>$SLURM_TMPDIR</tt>. Note that this directory and its contents will disappear upon job completion.<br />
<br />
<!--T:38--><br />
Note that the amount of available memory is less than the "round number" suggested by the hardware configuration. For instance, "base" nodes do have 128 GiB of RAM, but some of it is permanently occupied by the kernel and OS. To avoid wasting time on swapping/paging, the scheduler will never allocate jobs whose memory requirements exceed the specified amount of "available" memory. Please also note that the memory allocated to the job must be sufficient for I/O buffering performed by the kernel and filesystem; this means that an I/O-intensive job will often benefit from requesting somewhat more memory than the aggregate size of its processes.<br />
<br />
= GPUs on Graham = <!--T:56--><br />
Graham contains Tesla GPUs from three different generations, listed here in order of age, from oldest to newest.<br />
* P100 Pascal - 320 GPUs<br />
* V100 Volta - 54 GPUs<br />
* T4 Turing - 144 GPUs<br />
<br />
<!--T:57--><br />
The P100 is NVIDIA's all-purpose high-performance card. The V100 is its successor, with about double the performance for standard computation and about 8x the performance for deep learning computations that can use its tensor cores. The T4 is the newest card, targeted specifically at deep learning workloads; it does not support efficient double-precision computation, but it has good single-precision performance, tensor cores, and support for reduced-precision integer calculations.<br />
<br />
== Pascal GPU nodes on Graham == <!--T:58--><br />
<br />
<!--T:59--><br />
These are Graham's default GPU cards. Job submission for these cards is described on the page [[Using GPUs with Slurm]]. When a job simply requests a GPU with --gres=gpu:1 or --gres=gpu:2, it will be assigned Pascal P100 cards. Since each Pascal node has only 2 P100 GPUs, configuring jobs for these cards is relatively simple.<br />
<br />
==Volta GPU nodes on Graham== <!--T:46--><br />
In the first quarter of 2019, new Volta GPU nodes were added, as described in the table above.<br />
Four GPUs are connected to each CPU socket (except for one node, which is only populated with 6 GPUs, three per socket).<br />
<br />
<!--T:50--><br />
These nodes are available to all users, with a 24-hour job runtime limit. Higher-priority access with longer job runtimes can be granted to Ontario researchers by request.<br />
<br />
<!--T:51--><br />
Following are example job scripts for submitting to one of these nodes (with 8 GPUs). The module load command ensures that modules compiled for the Skylake architecture are used. Replace nvidia-smi with the command you want to run.<br />
<br />
<!--T:52--><br />
'''Important''': You should scale the number of CPUs requested, keeping the ratio of CPUs to GPUs at 3.5 or less. For example, if you want to run a job using 4 GPUs, you should request at most 14 CPU cores. For a job with 1 GPU, you should request at most 3 CPU cores. You may run a few short test jobs (shorter than 1 hour) that break this rule to see how your code performs.<br />
<br />
<!--T:53--><br />
Single-GPU example for default users:<br />
{{File<br />
|name=gpu_single_GPU_job.sh<br />
|lang="sh"<br />
|contents=<br />
#!/bin/bash<br />
#SBATCH --account=def-someuser<br />
#SBATCH --gres=gpu:v100:1<br />
#SBATCH --cpus-per-task=3<br />
#SBATCH --mem=12G<br />
#SBATCH --time=1-00:00<br />
module load arch/avx512 StdEnv/2018.3<br />
nvidia-smi<br />
}}<br />
Full-node example for default users:<br />
{{File<br />
|name=gpu_single_node_job.sh<br />
|lang="sh"<br />
|contents=<br />
#!/bin/bash<br />
#SBATCH --account=def-someuser<br />
#SBATCH --nodes=1<br />
#SBATCH --gres=gpu:v100:8<br />
#SBATCH --cpus-per-task=28<br />
#SBATCH --mem=150G<br />
#SBATCH --time=1-00:00<br />
module load arch/avx512 StdEnv/2018.3<br />
nvidia-smi<br />
}}<br />
Full-node example for Ontario users who have been granted higher priority access:<br />
{{File<br />
|name=gpu_single_node_job.sh<br />
|lang="sh"<br />
|contents=<br />
#!/bin/bash<br />
#SBATCH --account=ctb-ontario<br />
#SBATCH --partition=c-ontario<br />
#SBATCH --nodes=1<br />
#SBATCH --gres=gpu:v100:8<br />
#SBATCH --cpus-per-task=28<br />
#SBATCH --mem=150G<br />
#SBATCH --time=1-00:00<br />
module load arch/avx512 StdEnv/2018.3<br />
nvidia-smi<br />
}}<br />
<br />
<!--T:54--><br />
The Volta nodes have a fast local disk, which should be used if your job performs a significant amount of I/O. Inside the job, the location of the temporary directory on the fast local disk is given by the environment variable $SLURM_TMPDIR. You can copy your input files there at the start of your job script, before you run your program, and copy your output files out at the end. All files in $SLURM_TMPDIR will be removed once the job ends, so you do not have to clean up that directory yourself. You can even create Python virtual environments in this temporary space for greater efficiency; see [[Python#Creating_virtual_environments_inside_of_your_jobs]] for information on how to do that.<br />
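For example, a minimal sketch of this staging pattern (my_program and the file paths are hypothetical placeholders):<br />
{{File<br />
|name=gpu_tmpdir_staging_job.sh<br />
|lang="sh"<br />
|contents=<br />
#!/bin/bash<br />
#SBATCH --account=def-someuser<br />
#SBATCH --gres=gpu:v100:1<br />
#SBATCH --cpus-per-task=3<br />
#SBATCH --mem=12G<br />
#SBATCH --time=1-00:00<br />
# Stage input data onto the fast node-local disk.<br />
cp /project/def-someuser/input.dat $SLURM_TMPDIR/<br />
cd $SLURM_TMPDIR<br />
# Run your program (my_program is a placeholder for your own executable).<br />
./my_program input.dat output.dat<br />
# Copy results back before the job ends; $SLURM_TMPDIR is wiped afterwards.<br />
cp output.dat /project/def-someuser/<br />
}}<br />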
<br />
==Turing GPU nodes on Graham== <!--T:60--><br />
<br />
<!--T:61--><br />
Using these nodes is similar to using the Volta nodes, except that when requesting them you should specify:<br />
<br />
<!--T:62--><br />
--gres=gpu:t4:2<br />
<br />
<!--T:63--><br />
In this example 2 T4 cards per node are requested.<br />
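As a sketch, a job script for these nodes can follow the Volta examples above with only the GPU specification changed (the CPU and memory values here are illustrative, not prescribed):<br />
{{File<br />
|name=gpu_t4_job.sh<br />
|lang="sh"<br />
|contents=<br />
#!/bin/bash<br />
#SBATCH --account=def-someuser<br />
#SBATCH --gres=gpu:t4:2<br />
#SBATCH --cpus-per-task=8<br />
#SBATCH --mem=32G<br />
#SBATCH --time=1-00:00<br />
nvidia-smi<br />
}}<br />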
<br />
<br />
<br />
<!--T:14--><br />
<noinclude><br />
</translate><br />
</noinclude></div>Kaizaadhttps://docs.alliancecan.ca/mediawiki/index.php?title=Graham&diff=77337Graham2019-10-21T18:49:35Z<p>Kaizaad: </p>
<hr />
<div><noinclude><br />
<languages /><br />
<br />
<translate><br />
<!--T:27--><br />
</noinclude><br />
{| class="wikitable"<br />
|-<br />
| Availability: In production since June 2017<br />
|-<br />
| Login node: '''graham.computecanada.ca'''<br />
|-<br />
| Globus endpoint: '''computecanada#graham-dtn'''<br />
|-<br />
| Data mover node (rsync, scp, sftp,...): '''gra-dtn1.computecanada.ca'''<br />
|}<br />
<br />
<!--T:2--><br />
Graham is a heterogeneous cluster, suitable for a variety of workloads, and located at the University of Waterloo. It is named after [https://en.wikipedia.org/wiki/Wes_Graham Wes Graham], the first director of the Computing Centre at Waterloo.<br />
<br />
<!--T:4--><br />
The parallel filesystem and external persistent storage ([[National Data Cyberinfrastructure|NDC-Waterloo]]) are similar to [[Cedar|Cedar's]]. The interconnect is different and there is a slightly different mix of compute nodes.<br />
<br />
<!--T:28--><br />
The Graham system is sold and supported by Huawei Canada, Inc. It is entirely liquid cooled, using rear-door heat exchangers.<br />
<br />
<!--T:33--><br />
[[Getting started with the new national systems|Getting started with Graham]]<br />
<br />
<!--T:36--><br />
[[Running_jobs|How to run jobs]]<br />
<br />
<!--T:37--><br />
[[Transferring_data|Transferring data]]<br />
<br />
= Site-specific policies = <!--T:39--><br />
<br />
<!--T:40--><br />
By policy, Graham's compute nodes cannot access the internet. If you need an exception to this rule, <br />
contact [[Technical Support|technical support]] with the following information:<br />
<br />
<!--T:42--><br />
<pre><br />
IP: <br />
Port/s: <br />
Protocol: TCP or UDP<br />
Contact: <br />
Removal Date: <br />
</pre><br />
<br />
<!--T:43--><br />
We will follow up with the contact before the removal date to confirm whether the exception is still required.<br />
<br />
<!--T:41--><br />
Crontab is not offered on Graham.<br />
<br />
=Attached storage systems= <!--T:23--><br />
<br />
<!--T:24--><br />
{| class="wikitable sortable"<br />
|-<br />
| '''Home space''' ||<br />
* Location of home directories.<br />
* Each home directory has a small, fixed [[Storage and file management#Filesystem_quotas_and_policies|quota]]. <br />
* Not allocated via [https://www.computecanada.ca/research-portal/accessing-resources/rapid-access-service/ RAS] or [https://www.computecanada.ca/research-portal/accessing-resources/resource-allocation-competitions/ RAC]. Larger requests go to Project space.<br />
* Has daily backup.<br />
|-<br />
| '''Scratch space'''<br />3.6PB total volume<br />Parallel high-performance filesystem ||<br />
* For active or temporary (<code>/scratch</code>) storage.<br />
* Not allocated.<br />
* Large fixed [[Storage and file management#Filesystem_quotas_and_policies|quota]] per user.<br />
* Inactive data will be purged.<br />
|-<br />
|'''Project space'''<br />External persistent storage<br />
||<br />
* Part of the [[National Data Cyberinfrastructure]].<br />
* Allocated via [https://www.computecanada.ca/research-portal/accessing-resources/rapid-access-service/ RAS] or [https://www.computecanada.ca/research-portal/accessing-resources/resource-allocation-competitions/ RAC].<br />
* Not designed for parallel I/O workloads. Use Scratch space instead.<br />
* Large adjustable [[Storage and file management#Filesystem_quotas_and_policies|quota]] per project.<br />
* Has daily backup.<br />
|}<br />
<br />
=High-performance interconnect= <!--T:19--><br />
<br />
<!--T:21--><br />
Mellanox FDR (56Gb/s) and EDR (100Gb/s) InfiniBand interconnect. FDR is used for GPU and cloud nodes, EDR for other node types. A central 324-port director switch aggregates connections from islands of 1024 cores each for CPU and GPU nodes. The 56 cloud nodes are a variation on CPU nodes, and are on a single larger island sharing 8 FDR uplinks to the director switch.<br />
<br />
<!--T:29--><br />
A low-latency, high-bandwidth InfiniBand fabric connects all nodes and scratch storage.<br />
<br />
<!--T:30--><br />
Nodes configurable for cloud provisioning also have a 10Gb/s Ethernet network, with 40Gb/s uplinks to scratch storage.<br />
<br />
<!--T:22--><br />
Graham is designed to support multiple simultaneous parallel jobs of up to 1024 cores each in a fully non-blocking manner.<br />
<br />
<!--T:31--><br />
For larger jobs the interconnect has an 8:1 blocking factor; even for jobs running on multiple islands, Graham still provides a high-performance interconnect.<br />
<br />
<!--T:32--><br />
[https://docs.computecanada.ca/mediawiki/images/b/b3/Gp3-network-topo.png Graham high performance interconnect diagram]<br />
<br />
=Visualization on Graham= <!--T:44--><br />
<br />
<!--T:45--><br />
Graham has dedicated visualization nodes available at '''gra-vdi.computecanada.ca''' that allow only VNC connections. For instructions on how to use them, see the [[VNC]] page.<br />
<br />
=Node characteristics= <!--T:5--><br />
A total of 36,160 cores and 320 GPU devices, spread across 1,127 nodes of different types.<br />
<br />
<!--T:55--><br />
{| class="wikitable sortable"<br />
! nodes !! cores !! available memory !! CPU !! storage !! GPU<br />
|-<br />
| 903 || 32 || 125G or 128000M || 2 x Intel E5-2683 v4 Broadwell @ 2.1GHz || 960GB SATA SSD || -<br />
|-<br />
| 24 || 32 || 502G or 514500M || 2 x Intel E5-2683 v4 Broadwell @ 2.1GHz || 960GB SATA SSD || -<br />
|-<br />
| 56 || 32 || 250G or 256500M || 2 x Intel E5-2683 v4 Broadwell @ 2.1GHz || 960GB SATA SSD || -<br />
|-<br />
| 3 || 64 || 3022G or 3095000M || 4 x Intel E7-4850 v4 Broadwell @ 2.1GHz || 960GB SATA SSD || -<br />
|-<br />
| 160 || 32 || 124G or 127518M || 2 x Intel E5-2683 v4 Broadwell @ 2.1GHz || 1.6TB NVMe SSD || 2 x NVIDIA P100 Pascal (12GB HBM2 memory)<br />
|-<br />
| 7 || 28 || 178G or 183105M || 2 x Intel Xeon Gold 5120 Skylake @ 2.2GHz || 4.0TB NVMe SSD || 8 x NVIDIA V100 Volta (16GB HBM2 memory)<br />
|-<br />
| 6 || 16 || 192G or 196608M || 2 x Intel Xeon Silver 4110 Skylake @ 2.10GHz || 11.0TB SATA SSD || 4 x NVIDIA T4 Turing (16GB GDDR6 memory)<br />
|}<br />
<br />
<!--T:7--><br />
Best practice for local on-node storage is to use the temporary directory generated by [[Running jobs|Slurm]], <tt>$SLURM_TMPDIR</tt>. Note that this directory and its contents will disappear upon job completion.<br />
<br />
<!--T:38--><br />
Note that the amount of available memory is less than the "round number" suggested by the hardware configuration. For instance, "base" nodes do have 128 GiB of RAM, but some of it is permanently occupied by the kernel and OS. To avoid wasting time on swapping/paging, the scheduler will never allocate jobs whose memory requirements exceed the specified amount of "available" memory. Please also note that the memory allocated to the job must be sufficient for I/O buffering performed by the kernel and filesystem; this means that an I/O-intensive job will often benefit from requesting somewhat more memory than the aggregate size of its processes.<br />
<br />
= GPUs on Graham =<br />
Graham contains Tesla GPUs from three different generations, listed here in order of age, from oldest to newest.<br />
* P100 Pascal - 320 GPUs<br />
* V100 Volta - 54 GPUs<br />
* T4 Turing - 24 GPUs (more to be added soon)<br />
<br />
The P100 is NVIDIA's all-purpose high-performance card. The V100 is its successor, with about double the performance for standard computation and about 8x the performance for deep learning computations that can use its tensor cores. The T4 is the newest card, targeted specifically at deep learning workloads; it does not support efficient double-precision computation, but it has good single-precision performance, tensor cores, and support for reduced-precision integer calculations.<br />
<br />
== Pascal GPU nodes on Graham ==<br />
<br />
These are Graham's default GPU cards. Job submission for these cards is described on the page [[Using GPUs with Slurm]]. When a job simply requests a GPU with --gres=gpu:1 or --gres=gpu:2, it will be assigned Pascal P100 cards. Since each Pascal node has only 2 P100 GPUs, configuring jobs for these cards is relatively simple.<br />
<br />
==Volta GPU nodes on Graham== <!--T:46--><br />
In the first quarter of 2019, new Volta GPU nodes were added, as described in the table above.<br />
Four GPUs are connected to each CPU socket (except for one node, which is only populated with 6 GPUs, three per socket).<br />
<br />
<!--T:50--><br />
These nodes are available to all users, with a 24-hour job runtime limit. Higher-priority access with longer job runtimes can be granted to Ontario researchers by request.<br />
<br />
<!--T:51--><br />
Following are example job scripts for submitting to one of these nodes (with 8 GPUs). The module load command ensures that modules compiled for the Skylake architecture are used. Replace nvidia-smi with the command you want to run.<br />
<br />
<!--T:52--><br />
'''Important''': You should scale the number of CPUs requested, keeping the ratio of CPUs to GPUs at 3.5 or less. For example, if you want to run a job using 4 GPUs, you should request at most 14 CPU cores. For a job with 1 GPU, you should request at most 3 CPU cores. You may run a few short test jobs (shorter than 1 hour) that break this rule to see how your code performs.<br />
<br />
<!--T:53--><br />
Example for default users:<br />
{{File<br />
|name=gpu_single_node_job.sh<br />
|lang="sh"<br />
|contents=<br />
#!/bin/bash<br />
#SBATCH --account=def-someuser<br />
#SBATCH --nodes=1<br />
#SBATCH --gres=gpu:v100:8<br />
#SBATCH --exclusive<br />
#SBATCH --cpus-per-task=28<br />
#SBATCH --mem=150G<br />
#SBATCH --time=1-00:00<br />
module load arch/avx512 StdEnv/2018.3<br />
nvidia-smi<br />
}}<br />
Example for Ontario users who have been granted higher priority access:<br />
{{File<br />
|name=gpu_single_node_job.sh<br />
|lang="sh"<br />
|contents=<br />
#!/bin/bash<br />
#SBATCH --account=ctb-ontario<br />
#SBATCH --partition=c-ontario<br />
#SBATCH --nodes=1<br />
#SBATCH --gres=gpu:v100:8<br />
#SBATCH --exclusive<br />
#SBATCH --cpus-per-task=28<br />
#SBATCH --mem=150G<br />
#SBATCH --time=3-00:00<br />
module load arch/avx512 StdEnv/2018.3<br />
nvidia-smi<br />
}}<br />
<br />
<!--T:54--><br />
The Volta nodes have a fast local disk, which should be used if your job performs a significant amount of I/O. Inside the job, the location of the temporary directory on the fast local disk is given by the environment variable $SLURM_TMPDIR. You can copy your input files there at the start of your job script, before you run your program, and copy your output files out at the end. All files in $SLURM_TMPDIR will be removed once the job ends, so you do not have to clean up that directory yourself. You can even create Python virtual environments in this temporary space for greater efficiency; see [[Python#Creating_virtual_environments_inside_of_your_jobs]] for information on how to do that.<br />
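For example, a minimal sketch of this staging pattern (my_program and the file paths are hypothetical placeholders):<br />
{{File<br />
|name=gpu_tmpdir_staging_job.sh<br />
|lang="sh"<br />
|contents=<br />
#!/bin/bash<br />
#SBATCH --account=def-someuser<br />
#SBATCH --gres=gpu:v100:1<br />
#SBATCH --cpus-per-task=3<br />
#SBATCH --mem=12G<br />
#SBATCH --time=1-00:00<br />
# Stage input data onto the fast node-local disk.<br />
cp /project/def-someuser/input.dat $SLURM_TMPDIR/<br />
cd $SLURM_TMPDIR<br />
# Run your program (my_program is a placeholder for your own executable).<br />
./my_program input.dat output.dat<br />
# Copy results back before the job ends; $SLURM_TMPDIR is wiped afterwards.<br />
cp output.dat /project/def-someuser/<br />
}}<br />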
<br />
==Turing GPU nodes on Graham==<br />
<br />
Using these nodes is similar to using the Volta nodes, except that when requesting them you should specify:<br />
<br />
--gres=gpu:t4:2<br />
<br />
In this example 2 T4 cards per node are requested.<br />
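As a sketch, a job script for these nodes can follow the Volta examples above with only the GPU specification changed (the CPU and memory values here are illustrative, not prescribed):<br />
{{File<br />
|name=gpu_t4_job.sh<br />
|lang="sh"<br />
|contents=<br />
#!/bin/bash<br />
#SBATCH --account=def-someuser<br />
#SBATCH --gres=gpu:t4:2<br />
#SBATCH --cpus-per-task=8<br />
#SBATCH --mem=32G<br />
#SBATCH --time=1-00:00<br />
nvidia-smi<br />
}}<br />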
<br />
<br />
<br />
<!--T:14--><br />
<noinclude><br />
</translate><br />
</noinclude></div>Kaizaadhttps://docs.alliancecan.ca/mediawiki/index.php?title=Storage_and_file_management&diff=72902Storage and file management2019-06-03T14:03:52Z<p>Kaizaad: Updated /scratch to match other CC GP systems</p>
<hr />
<div><languages /><br />
<translate><br />
==Overview== <!--T:1--><br />
<br />
<!--T:2--><br />
Compute Canada provides a wide range of storage options to cover the needs of our very diverse users. These storage solutions range from high-speed temporary local storage to different kinds of long-term storage, so you can choose the storage medium that best corresponds to your needs and usage patterns. In most cases the [https://en.wikipedia.org/wiki/File_system filesystems] on Compute Canada systems are a ''shared'' resource and for this reason should be used responsibly - unwise behaviour can negatively affect dozens or hundreds of other users. These filesystems are also designed to store a limited number of very large files, which are typically binary, since at very large sizes (hundreds of MB or more) a text file loses most of the benefit of being human-readable. You should therefore avoid storing tens of thousands of small files, where small means less than a few megabytes, particularly in the same directory. A better approach is to use commands like [[Archiving and compressing files|<tt>tar</tt>]] or <tt>zip</tt> to convert a directory containing many small files into a single very large archive file.<br />
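For example, a directory containing many small files can be bundled into a single archive with a command like the following (the directory and archive names are placeholders):<br />
{{Command|tar cf small_files.tar small_files_directory/}}<br />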
<br />
<!--T:3--><br />
It is also your responsibility to manage the age of your stored data: most of the filesystems are not intended to provide an indefinite archiving service, so when a given file or directory is no longer needed, you should move it to a more appropriate location, which may well mean your personal workstation or some other storage system under your control. Moving significant amounts of data between your workstation and a Compute Canada system, or between two Compute Canada systems, should generally be done using [[Globus]].<br />
<br />
<!--T:4--><br />
Note that Compute Canada storage systems are not for personal use and should only be used to store research data.<br />
<br />
<!--T:17--><br />
When your account is created on a Compute Canada cluster, your home directory will not be entirely empty. It will contain references to your scratch and [[Project layout|project]] spaces through the mechanism of a [https://en.wikipedia.org/wiki/Symbolic_link symbolic link], a kind of shortcut that allows easy access to these other filesystems from your home directory. Note that these symbolic links may appear up to a few hours after you first connect to the cluster. While your home and scratch spaces are unique to you as an individual user, the project space is shared by a research group. This group may consist of those individuals with a Compute Canada account sponsored by a particular faculty member, or members of a [https://www.computecanada.ca/research-portal/accessing-resources/resource-allocation-competitions/ RAC allocation]. A given individual may thus have access to several different project spaces, associated with one or more faculty members, with symbolic links to these different project spaces in the <tt>projects</tt> directory of your home. Every account has one or more projects. In the folder <tt>projects</tt> within their home directory, each user has a link to each of the projects they have access to. For users with a single active sponsored role, the default project is that of your sponsor; users with more than one active sponsored role will have a default project that corresponds to the default project of the faculty member with the most sponsored accounts.<br />
<br />
<!--T:16--><br />
All users can check the available disk space and the current disk utilization for the ''project'', ''home'' and ''scratch'' filesystems with the command-line utility '''''diskusage_report''''', available on Compute Canada clusters. To use this utility, log into the cluster using SSH, then at the command prompt type diskusage_report and press Enter. Following is typical output from this utility:<br />
<pre><br />
# diskusage_report<br />
Description Space # of files<br />
Home (username) 280 kB/47 GB 25/500k<br />
Scratch (username) 4096 B/18 TB 1/1000k<br />
Project (def-username-ab) 4096 B/9536 GB 2/500k<br />
Project (def-username) 4096 B/9536 GB 2/500k<br />
</pre><br />
<br />
== Storage types == <!--T:5--><br />
Unlike your personal computer, a Compute Canada system will typically have several storage spaces or filesystems and you should ensure that you are using the right space for the right task. In this section we will discuss the principal filesystems available on most Compute Canada systems and the intended use of each one along with some of its characteristics. <br />
* '''HOME:''' While your home directory may seem like the logical place to store all your files and do all your work, in general this isn't the case - your home normally has a relatively small quota and doesn't have especially good performance for the writing and reading of large amounts of data. The most logical use of your home directory is typically source code, small parameter files and job submission scripts. <br />
* '''PROJECT:''' The project space has a significantly larger quota and is well-adapted to [[Sharing data | sharing data]] among members of a research group since it, unlike the home or scratch, is linked to a professor's account rather than an individual user. <br />
* '''SCRATCH''': For intensive read/write operations, scratch is the best choice. Remember however that important files must be copied off scratch since they are not backed up there, and older files are subject to [[Scratch purging policy|purging]]. The scratch storage should therefore only be used for transient files.<br />
<br />
== Best practices == <!--T:9--><br />
* Only use text format for files that are smaller than a few megabytes.<br />
* As far as possible, use scratch and local storage for temporary files. For local storage, you can use the temporary directory created by the [[Running jobs|job scheduler]], named <code>$SLURM_TMPDIR</code>.<br />
* If your program must search within a file, it is fastest to read the file completely into memory first and then search it.<br />
* Regularly clean up your data in the scratch and project spaces, because those filesystems are used for huge data collections.<br />
* If you no longer use certain files but they must be retained, [[Archiving and compressing files|archive and compress]] them, and if possible copy them elsewhere.<br />
* If your needs are not well served by the available storage options please contact [[technical support]].<br />
<br />
==Filesystem quotas and policies== <!--T:10--><br />
<br />
<!--T:11--><br />
In order to ensure that there is adequate space for all Compute Canada users, there are a variety of quotas and policy restrictions concerning back-ups and automatic purging of certain filesystems. <br />
By default on our clusters each user has access to the home and scratch spaces, and each group has access to 1 TB of project space. Small increases in project and scratch spaces are available through our Rapid Access Service ([https://www.computecanada.ca/research-portal/accessing-resources/rapid-access-service/ RAS]). Larger increases in project spaces are available through the annual Resource Allocation Competitions ([https://www.computecanada.ca/research-portal/accessing-resources/resource-allocation-competitions RAC]). You can see your current quota usage for various filesystems on Cedar and Graham using the command [[Storage and file management#Overview|<tt>diskusage_report</tt>]].<br />
<br />
<!--T:12--><br />
<tabs><br />
<tab name="Cedar"><br />
{| class="wikitable" style="font-size: 95%; text-align: center;"<br />
|+Filesystem Characteristics <br />
! Filesystem<br />
! Default Quota<br />
! Lustre-based?<br />
! Backed up?<br />
! Purged?<br />
! Available by Default?<br />
! Mounted on Compute Nodes?<br />
|-<br />
|Home Space<br />
|50 GB and 500K files per user<ref>This quota is fixed and cannot be changed.</ref><br />
|Yes<br />
|Yes<br />
|No<br />
|Yes<br />
|Yes<br />
|-<br />
|Scratch Space<br />
|20 TB and 1M files per user<br />
|Yes<br />
|No<br />
|Files older than 60 days are purged.<ref>See [[Scratch purging policy]] for more information.</ref><br />
|Yes<br />
|Yes<br />
|-<br />
|Project Space<br />
|1 TB and 5M files per group<ref>Project space can be increased to 10 TB per group by a RAS request. The group's sponsoring PI should write to [[technical support]] to make the request.</ref><br />
|Yes<br />
|Yes<br />
|No<br />
|Yes<br />
|Yes<br />
<br />
<!--T:19--><br />
|}<br />
<references /><br />
</tab><br />
<tab name="Graham"><br />
{| class="wikitable" style="font-size: 95%; text-align: center;"<br />
|+Filesystem Characteristics <br />
! Filesystem<br />
! Default Quota<br />
! Lustre-based?<br />
! Backed up?<br />
! Purged?<br />
! Available by Default?<br />
! Mounted on Compute Nodes?<br />
|-<br />
|Home Space<br />
|50 GB and 500K files per user<ref>This quota is fixed and cannot be changed.</ref><br />
|No<br />
|Yes<br />
|No<br />
|Yes<br />
|Yes<br />
|-<br />
|Scratch Space<br />
|20 TB and 1M files per user<br />
|Yes<br />
|No<br />
|Files older than 60 days are purged.<ref>See [[Scratch purging policy]] for more information.</ref><br />
|Yes<br />
|Yes<br />
|-<br />
|Project Space<br />
|1 TB and 500k files per group<ref>Project space can be increased to 10 TB per group by a RAS request. The group's sponsoring PI should write to [[technical support]] to make the request.</ref><br />
|Yes<br />
|Yes<br />
|No<br />
|Yes<br />
|Yes<br />
<br />
<!--T:20--><br />
|}<br />
<references /><br />
</tab><br />
<tab name="Béluga"><br />
{| class="wikitable" style="font-size: 95%; text-align: center;"<br />
|+Filesystem Characteristics <br />
! Filesystem<br />
! Default Quota<br />
! Lustre-based?<br />
! Backed up?<br />
! Purged?<br />
! Available by Default?<br />
! Mounted on Compute Nodes?<br />
|-<br />
|Home Space<br />
|50 GB and 500K files per user<ref>This quota is fixed and cannot be changed.</ref><br />
|Yes<br />
|Yes<br />
|No<br />
|Yes<br />
|Yes<br />
|-<br />
|Scratch Space<br />
|20 TB and 1M files per user<br />
|Yes<br />
|No<br />
|Files older than 60 days are purged.<ref>See [[Scratch purging policy]] for more information.</ref><br />
|Yes<br />
|Yes<br />
|-<br />
|Project Space<br />
|1 TB and 500k files per group<ref>Project space can be increased to 10 TB per group by a RAS request. The group's sponsoring PI should write to [[technical support]] to make the request.</ref><br />
|Yes<br />
|Yes<br />
|No<br />
|Yes<br />
|Yes<br />
|}<br />
<references /><br />
</tab><br />
<tab name="Niagara"><br />
{| class="wikitable"<br />
! location<br />
!colspan="2"| quota<br />
!align="right"| block size<br />
! expiration time<br />
! backed up<br />
! on login nodes<br />
! on compute nodes<br />
|-<br />
| $HOME<br />
|colspan="2"| 100 GB per user<br />
|align="right"| 1 MB<br />
| <br />
| yes<br />
| yes<br />
| read-only<br />
|-<br />
|rowspan="6"| $SCRATCH<br />
|colspan="2"| 25 TB per user (dynamic per group)<br />
|align="right" rowspan="6" | 16 MB<br />
|rowspan="6"| 2 months<br />
|rowspan="6"| no<br />
|rowspan="6"| yes<br />
|rowspan="6"| yes<br />
|-<br />
|align="right"|up to 4 users per group<br />
|align="right"|50TB<br />
|-<br />
|align="right"|up to 11 users per group<br />
|align="right"|125TB<br />
|-<br />
|align="right"|up to 28 users per group<br />
|align="right"|250TB<br />
|-<br />
|align="right"|up to 60 users per group<br />
|align="right"|400TB<br />
|-<br />
|align="right"|above 60 users per group<br />
|align="right"|500TB<br />
|-<br />
| $PROJECT<br />
|colspan="2"| by group allocation (RRG or RPP)<br />
|align="right"| 16 MB<br />
| <br />
| yes<br />
| yes<br />
| yes<br />
|-<br />
| $ARCHIVE<br />
|colspan="2"| by group allocation<br />
|align="right"| <br />
|<br />
| dual-copy<br />
| no<br />
| no<br />
|-<br />
| $BBUFFER<br />
|colspan="2"| 10 TB per user<br />
|align="right"| 1 MB<br />
| very short<br />
| no<br />
| yes<br />
| yes<br />
|}<br />
<ul><br />
<li>[https://docs.scinet.utoronto.ca/images/9/9a/Inode_vs._Space_quota_-_v2x.pdf Inode vs. Space quota (PROJECT and SCRATCH)]</li><br />
<li>[https://docs.scinet.utoronto.ca/images/0/0e/Scratch-quota.pdf dynamic quota per group (SCRATCH)]</li><br />
<li>Compute nodes do not have local storage.</li><br />
<li>Archive (a.k.a. nearline) space is on [https://docs.scinet.utoronto.ca/index.php/HPSS HPSS].</li><br />
<li>Backup means a recent snapshot, not an archive of all data that ever was.</li><br />
<li><code>$BBUFFER</code> stands for [https://docs.scinet.utoronto.ca/index.php/Burst_Buffer Burst Buffer], a faster parallel storage tier for temporary data.</li></ul><br />
<br />
<!--T:21--><br />
</tab><br />
</tabs><br />
<br />
<!--T:22--><br />
The backup policy on the home and project spaces is nightly backups, retained for 30 days; deleted files are retained for a further 60 days. Note that this is entirely distinct from the age limit for purging files from the scratch space. If you wish to recover a previous version of a file or directory, contact [[technical support]] with the full path of the file(s) and the desired version (by date).<br />
<br />
== See also == <!--T:13--><br />
<br />
<!--T:14--><br />
* [[Project layout]]<br />
* [[Sharing data]]<br />
* [[National Data Cyberinfrastructure]]<br />
* [[Tuning Lustre]]<br />
* [[Archiving and compressing files]]<br />
</translate></div>Kaizaadhttps://docs.alliancecan.ca/mediawiki/index.php?title=Graham&diff=51462Graham2018-04-27T15:04:42Z<p>Kaizaad: </p>
<hr />
<div><noinclude><br />
<languages /><br />
<br />
<translate><br />
<!--T:27--><br />
</noinclude><br />
{| class="wikitable"<br />
|-<br />
| Availability: In production. RAC 2017 allocations implemented June 30, 2017<br />
|-<br />
| Login node: '''graham.computecanada.ca'''<br />
|-<br />
| Globus endpoint: '''computecanada#graham-dtn'''<br />
|-<br />
| Data mover node (rsync, cp, mv,...): '''gra-dtn1.computecanada.ca'''<br />
|}<br />
<br />
<!--T:2--><br />
GRAHAM is a heterogeneous cluster, suitable for a variety of workloads, and located at the University of Waterloo. It is named after [https://en.wikipedia.org/wiki/Wes_Graham Wes Graham], the first director of the Computing Centre at Waterloo. It was previously known as "GP3" and is still identified as such in the [https://www.computecanada.ca/research-portal/accessing-resources/resource-allocation-competitions/ 2017 RAC] documentation.<br />
<br />
<!--T:4--><br />
The parallel filesystem and external persistent storage ([[National Data Cyberinfrastructure|NDC-Waterloo]]) are similar to [[Cedar|Cedar's]]. The interconnect is different and there is a slightly different mix of compute nodes.<br />
<br />
<!--T:28--><br />
The Graham system is sold and supported by Huawei Canada, Inc. It is entirely liquid cooled, using rear-door heat exchangers.<br />
<br />
<!--T:33--><br />
[https://docs.computecanada.ca/wiki/Getting_Started_with_the_new_National_Systems Getting started with Graham]<br />
<br />
<!--T:34--><br />
[https://docs.computecanada.ca/wiki/Running_jobs How to run jobs]<br />
<br />
<!--T:34--><br />
[https://docs.computecanada.ca/wiki/Transferring_data Transferring data]<br />
<br />
=Attached storage systems= <!--T:23--><br />
<br />
<!--T:24--><br />
{| class="wikitable sortable"<br />
|-<br />
| '''Home space''' ||<br />
* Location of home directories.<br />
* Each home directory has a small, fixed [[Storage and file management#Filesystem_Quotas_and_Policies|quota]]. <br />
* Not allocated via [https://www.computecanada.ca/research-portal/accessing-resources/rapid-access-service/ RAS] or [https://www.computecanada.ca/research-portal/accessing-resources/resource-allocation-competitions/ RAC]. Larger requests go to Project space.<br />
* Has daily backup.<br />
|-<br />
| '''Scratch space'''<br />3.6PB total volume<br />Parallel high-performance filesystem ||<br />
* For active or temporary (<code>/scratch</code>) storage.<br />
* Not allocated.<br />
* Large fixed [[Storage and file management#Filesystem_Quotas_and_Policies|quota]] per user.<br />
* Inactive data will be purged.<br />
|-<br />
|'''Project space'''<br />External persistent storage<br />
||<br />
* Part of the [[National Data Cyberinfrastructure]].<br />
* Allocated via [https://www.computecanada.ca/research-portal/accessing-resources/rapid-access-service/ RAS] or [https://www.computecanada.ca/research-portal/accessing-resources/resource-allocation-competitions/ RAC].<br />
* Not designed for parallel I/O workloads. Use Scratch space instead.<br />
* Large adjustable [[Storage and file management#Filesystem_Quotas_and_Policies|quota]] per project.<br />
* Has daily backup.<br />
|}<br />
<br />
=High-performance interconnect= <!--T:19--><br />
<br />
<!--T:21--><br />
Mellanox FDR (56Gb/s) and EDR (100Gb/s) InfiniBand interconnect. FDR is used for GPU and cloud nodes, EDR for other node types. A central 324-port director switch aggregates connections from islands of 1024 cores each for CPU and GPU nodes. The 56 cloud nodes are a variation on CPU nodes, and are on a single larger island sharing 8 FDR uplinks to the director switch.<br />
<br />
<!--T:29--><br />
A low-latency, high-bandwidth InfiniBand fabric connects all nodes and scratch storage.<br />
<br />
<!--T:30--><br />
Nodes configurable for cloud provisioning also have a 10Gb/s Ethernet network, with 40Gb/s uplinks to scratch storage.<br />
<br />
<!--T:22--><br />
Graham is designed to support multiple simultaneous parallel jobs of up to 1024 cores each in a fully non-blocking manner. <br />
<br />
<!--T:31--><br />
For larger jobs the interconnect has an 8:1 blocking factor; i.e., even for jobs running on multiple islands, the Graham system provides a high-performance interconnect.<br />
<br />
<!--T:32--><br />
[https://docs.computecanada.ca/mediawiki/images/b/b3/Gp3-network-topo.png Graham high performance interconnect diagram]<br />
<br />
=Node types and characteristics= <!--T:5--><br />
A total of 35,520 cores and 320 GPU devices, spread across 1,107 nodes of different types.<br />
<br />
<!--T:25--><br />
''Processor type:'' All nodes except bigmem3000 have Intel E5-2683 v4 CPUs, running at 2.1 GHz.<br />
<br />
<!--T:26--><br />
''GPU type:'' P100 12g<br />
<br />
<!--T:6--><br />
{| class="wikitable sortable"<br />
|-<br />
| base nodes || 864 nodes || 128 GiB of memory (125 GiB usable), 16 cores/socket, 2 sockets/node. Intel "Broadwell" CPUs at 2.1 GHz, model E5-2683 v4. 960 GB SATA SSD.<br />
|-<br />
| large nodes (cloud configuration) || 56 nodes || 256 GiB of memory (251 GiB usable), 16 cores/socket, 2 sockets/node. Intel "Broadwell" CPUs at 2.1 GHz, model E5-2683 v4. 960 GB SATA SSD.<br />
|-<br />
| GPU nodes || 160 nodes || 128 GiB of memory (125 GiB usable), 16 cores/socket, 2 sockets/node, 2 NVIDIA P100 Pascal GPUs/node (12 GB HBM2 memory). Intel "Broadwell" CPUs at 2.1 GHz, model E5-2683 v4. 1.6 TB NVMe SSD.<br />
|-<br />
| bigmem500 nodes || 24 nodes || 512 GiB of memory (503 GiB usable), 16 cores/socket, 2 sockets/node. Intel "Broadwell" CPUs at 2.1 GHz, model E5-2683 v4. 960 GB SATA SSD.<br />
|-<br />
| bigmem3000 nodes || 3 nodes || 3 TiB of memory (3023 GiB usable), 16 cores/socket, 4 sockets/node. Intel "Broadwell" CPUs at 2.1 GHz, model E7-4850 v4. 960 GB SATA SSD.<br />
<br />
<!--T:35--><br />
|}<br />
<br />
<!--T:7--><br />
Best practice for local on-node storage is to use the temporary directory generated by [[Running jobs|Slurm]], $SLURM_TMPDIR. Note that this directory and its contents will disappear upon job completion.<br />
<br />
<!--T:14--><br />
<noinclude><br />
</translate><br />
</noinclude></div>Kaizaadhttps://docs.alliancecan.ca/mediawiki/index.php?title=Project_layout&diff=47890Project layout2018-03-07T15:10:13Z<p>Kaizaad: </p>
<hr />
<div><languages /><br />
<translate><br />
<!--T:18--><br />
''Parent page: [[Storage and file management]]''<br />
<br />
<!--T:1--><br />
The project filesystem on [[Cedar]] and [[Graham]] is organized on the basis of ''groups'' though with an easy user-based interface. The normal method to access the project space is by means of symbolic links which exist in your home directory. These will have the form <tt>$HOME/projects/group_name</tt> and, in older accounts, another symbolic link <tt>$HOME/project</tt> that points to the project directory for your default group (for those users who belong to more than one group). <br />
<br />
<!--T:2--><br />
The permissions on the group space are such that it is owned by the principal investigator (PI) for the group, and members have read and write permission on the directory itself. However, by default a newly created file will be readable, but not writable, by other group members. If the group wishes to share writable files, the best approach is to create a special directory for them, for example<br />
{{Command|mkdir $HOME/projects/def-profname/group_writable}}<br />
followed by<br />
{{Command|setfacl -d -m g::rwx $HOME/projects/def-profname/group_writable}}<br />
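You can verify that the default ACL was applied as intended with <code>getfacl</code>, using the same hypothetical group name:<br />
{{Command|getfacl $HOME/projects/def-profname/group_writable}}<br />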
<br />
<!--T:3--><br />
For more on sharing data, file ownership, and access control lists (ACLs), see [[Sharing data]].<br />
<br />
<!--T:4--><br />
The project space is subject to a default quota of 1 TB and five million files per group, which can be increased up to 10 TB upon request to [mailto:support@computecanada.ca Compute Canada support]. Certain groups may have been awarded significantly higher quotas through the annual [https://www.computecanada.ca/research-portal/accessing-resources/resource-allocation-competitions/ resource allocation competition]. In this case, you will already have been notified of your group's quota for the coming year. Note that this storage allocation is specific to a particular cluster and cannot normally be transferred to another cluster. <br />
<br />
<!--T:5--><br />
To check current usage and available disk space for scratch and project on Cedar and Graham, or for home on Cedar, use<br />
{{Command|diskusage_report}}<br />
<br />
<!--T:16--><br />
In order to ensure that files which are copied or moved to a given project space acquire the appropriate group membership, and thus are counted against the expected quota, it can be useful to set the <tt>setgid</tt> bit on the directory in question. This ensures that every new file and subdirectory created below that directory inherits the directory's group; new subdirectories also inherit the <tt>setgid</tt> bit itself. However, existing files and subdirectories will not have their group membership changed (this should be done with the <tt>chgrp</tt> command; see the example below), and files moved into the directory also retain their existing group membership. You can set the <tt>setgid</tt> bit on a directory with the command<br />
{{Command|chmod g+s <directory name>}}<br />
If you want to apply this command to the existing subdirectories of a directory, you can use the command<br />
{{Command|find <directory name> -type d -exec chmod g+s '{}' \;}}<br />
More information on the <tt>setgid</tt> bit is available on this [https://en.wikipedia.org/wiki/Setuid#setuid_and_setgid_on_directories page]. <br />
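For example, to change the group of everything already under a directory to the project group (a sketch, again using the hypothetical group name <tt>def-profname</tt>):<br />
{{Command|chgrp -R def-profname <directory name>}}<br />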
<br />
<!--T:17--><br />
You can also use the command <tt>newgrp</tt> to modify your default group during an interactive session, for example<br />
{{Command|newgrp rrg-profname-ab}}<br />
and then copy any data to the appropriate project directory. This changes your default group only for the current session, however; at your next login you will need to run the <tt>newgrp</tt> command again if you wish to change the default group. <br />
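For example, a minimal session (the data path is hypothetical) might be:<br />
{{Commands<br />
|newgrp rrg-profname-ab<br />
|cp -r ~/scratch/results ~/projects/rrg-profname-ab/<br />
}}<br />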
<br />
=== An explanatory example === <!--T:6--><br />
<br />
<!--T:7--><br />
Imagine that we have a PI (“Sue”) who has a sponsored user under her (“Bob”). Both Sue and Bob start with a directory structure that on the surface looks similar:<br />
<br />
<!--T:8--><br />
<div style="column-count:2;-moz-column-count:2;-webkit-column-count:2"><br />
* <code>/home/sue/scratch</code> (symbolic link)<br />
* <code>/home/sue/projects</code> (directory)<br />
* <code>/home/bob/scratch</code> (symbolic link)<br />
* <code>/home/bob/projects</code> (directory)<br />
</div><br />
<br />
<!--T:9--><br />
The scratch link points to a different location for Sue (<code>/scratch/sue</code>) and Bob (<code>/scratch/bob</code>). <br />
<br />
<!--T:10--><br />
If Bob's only role was the one sponsored by Sue, then Bob's <code>projects</code> directory would have the same contents as Sue's <code>projects</code> directory. Further, if neither Sue nor Bob have any other roles or projects with Compute Canada, then each one's <code>projects</code> directory would just contain one subdirectory, <code>def-sue</code>.<br />
<br />
<!--T:11--><br />
Each of <code>/home/sue/projects/def-sue</code> and <code>/home/bob/projects/def-sue</code> would point to the same location, <code>/project/<some random number></code>. This project directory is the best place for Sue and Bob to share data. They can both create directories in it, read it, and write to it. Sue for instance could do<br />
$ cd ~/projects/def-sue<br />
$ mkdir foo<br />
and Bob could then copy a file into the directory <code>~/projects/def-sue/foo</code>, where it will be visible to both of them.<br />
<br />
<!--T:12--><br />
If Sue were to get a RAC award with storage (as is often the case these days), both she and Bob would find that there is a new entry in their respective <code>projects</code> directory, something like<br />
~/projects/rrg-sue-ab<br />
They should use this directory to store and share data related to the research carried out under the RAC award.<br />
<br />
For sharing data with someone who doesn't have a role sponsored by Sue, let's say Heather, the simplest thing to do is to change the file permissions so that Heather can read a particular directory or file. See [https://docs.computecanada.ca/wiki/Sharing_data Sharing data] for more details. The best idea is usually to use ACLs to let Heather read a directory; a sketch is given below. Note that these filesystem permissions can be changed for almost any directory or file, not just those in your <tt>project</tt> space: you could share a directory in your <code>scratch</code> too, or just a particular subdirectory of <code>projects</code>, if you have several (a default one, one for a RAC, ''etc.''). Best practice is to restrict file sharing to <code>/project</code> and <code>/scratch</code>.<br />
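As a sketch, assuming Heather's username is <code>heather</code> (adjust to the real username), Sue could grant access with ACLs; note that Heather also needs permission to traverse every directory on the path:<br />
{{Commands<br />
|setfacl -m u:heather:x ~/projects/def-sue # traversal only on the parent directory<br />
|setfacl -R -m u:heather:rX ~/projects/def-sue/foo # read access to the shared directory and its contents<br />
}}<br />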
<br />
<!--T:13--><br />
One thing to keep in mind when sharing a directory is that Heather will need to be able to descend the entire filesystem structure down to this directory and so she will need to have read and execute permission on each of the directories between <code>~/projects/def-sue</code> and the directory containing the file(s) to be shared. We have implicitly assumed here that Heather has an account on the cluster but you can even share with researchers who don't have a Compute Canada account using a [https://docs.computecanada.ca/wiki/Globus#Globus_Sharing Globus shared endpoint].<br />
<br />
<!--T:14--><br />
If Heather is pursuing a serious and ongoing collaboration with Sue then it may naturally make sense for Sue to sponsor a role for Heather, thereby giving Heather access similar to Bob's, described earlier. <br />
<br />
<!--T:15--><br />
To summarize:<br />
* <code>scratch</code> space is for (private) temporary files<br />
* <code>home</code> space is normally for small amounts of relatively private data (e.g. a job script)<br />
* Shared data for a research group should normally go in that group's <code>project</code> space, as it is persistent, backed-up, and fairly large (up to 10 TB, or more with a RAC).<br />
</translate></div>Kaizaadhttps://docs.alliancecan.ca/mediawiki/index.php?title=Frequently_Asked_Questions&diff=45517Frequently Asked Questions2018-01-16T21:15:02Z<p>Kaizaad: </p>
<hr />
<div><languages /><br />
__TOC__<br />
<br />
<translate><br />
<br />
== ''Disk quota exceeded'' error on /project filesystems == <!--T:12--><br />
Some users have seen this message or some similar quota error on their [[Project layout|project]] folders. Other users have reported obscure failures while transferring files into their <code>/project</code> folder from another cluster. Many of the reported problems are due to files having the wrong group ownership.<br />
<br />
<!--T:5--><br />
Use <code>diskusage_report</code> to see if you are at or over your quota:<br />
<source lang="bash"><br />
[ymartin@cedar5 ~]$ diskusage_report<br />
Description Space # of files<br />
Home (user ymartin) 345M/50G 9518/500k<br />
Scratch (user ymartin) 93M/20T 6532/1000k<br />
Project (group ymartin) 5472k/2048k 158/5000k<br />
Project (group/def-zrichard) 20k/1000G 4/5000k<br />
</source><br />
<br />
<!--T:6--><br />
The example above illustrates a frequent problem: <code>/project</code> for user <code>ymartin</code> contains too much data in files belonging to group <code>ymartin</code>. The data should instead be in files belonging to <code>def-zrichard</code>.<br />
<br />
<!--T:8--><br />
Note the two lines labelled <code>Project</code>.<br />
*<code>Project (group ymartin)</code> describes files belonging to group <code>ymartin</code>, which has the same name as the user. This user is the only member of this group, which has a very small quota (2048k). <br />
*<code>Project (group def-zrichard)</code> describes files belonging to a '''project group'''. Your account may be associated with one or more project groups, and they will typically have names like <code>def-zrichard</code>, <code>rrg-someprof-ab</code>, or <code>rpp-someprof</code>. <br />
<br />
<!--T:9--><br />
In this example, files have somehow been created belonging to group <code>ymartin</code> instead of group <code>def-zrichard</code>. This is neither the desired nor the expected behaviour. <br />
<br />
<!--T:2--><br />
By design, new files and directories in <code>/project</code> will normally be created belonging to a project group. The two main reasons why files may be associated with the wrong group are that<br />
*files were moved from <code>/home</code> to <code>/project</code> with the <code>mv</code> command; to avoid this, use <code>cp</code> instead (see the example after this list);<br />
*files were transferred from another cluster using [[Transferring_data#Rsync|rsync]] or [[Transferring_data#SCP|scp]] with an option that preserves the original group ownership. If you have a recurring problem with ownership, check the options you are using with your file transfer program.<br />
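For example, instead of <code>mv</code>, copy the files and then remove the originals (the path <code>~/some_data</code> is hypothetical); copies created under <code>/project</code> acquire the project group:<br />
<pre><br />
$ cp -r ~/some_data ~/projects/def-zrichard/<br />
$ rm -r ~/some_data<br />
</pre><br />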
<br />
<!--T:13--><br />
For [[Transferring_data#Rsync|rsync]] you can use the following command to transfer a directory from a remote location to your project directory:<br />
<pre><br />
$ rsync -axvpH --no-g --no-p remote_user@remote.system:remote/dir/path $HOME/project/$USER/<br />
</pre><br />
You can also compress the data to get a better transfer rate.<br />
<pre><br />
$ rsync -axvpH --no-g --no-p --compress-level=5 remote_user@remote.system:remote/dir/path $HOME/project/$USER/<br />
</pre><br />
<br />
<!--T:3--><br />
To see the project groups you may use, run the following command:<br />
{{Command|stat -c %G $HOME/projects/*/}}<br />
<br />
<!--T:4--><br />
If you are the owner of the files, you can run the <code>chgrp</code> command to change their group ownership to the appropriate project group. To have the group ownership changed on files belonging to several users, contact [[Technical Support|technical support]].<br />
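For example, to find files under the project directory that still belong to your personal group and reassign them to the project group (a sketch using the group names from the example above):<br />
{{Command|find ~/projects/def-zrichard -group ymartin -exec chgrp def-zrichard '{}' \;}}<br />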
<br />
<!--T:7--><br />
See [[Project layout]] for further explanations.<br />
<br />
== ''sbatch: error: Batch job submission failed: Socket timed out on send/recv operation'' == <!--T:10--><br />
<br />
<!--T:11--><br />
You may see this message when the load on the [[Running jobs|Slurm]] manager or scheduler process is too high. We are working both to improve Slurm's tolerance of that and to identify and eliminate the sources of load spikes, but that is a long-term project. The best advice we have currently is to wait a minute or so. Then run <code>squeue -u $USER</code> and see if the job you were trying to submit appears: in some cases the error message is delivered even though the job was accepted by Slurm. If it doesn't appear, simply submit it again.<br />
<br />
== ''slurmstepd: error: Exceeded step memory limit at some point'' == <!--T:14--><br />
<br />
<!--T:15--><br />
This message and the similar "slurmstepd: error: Exceeded job memory limit at some point" are potentially misleading. In some, but not all, cases they signify a harmless condition. If your job otherwise appears to have terminated normally, that is, if all expected output is present, then you should ignore these messages. Do not increase your memory requests simply to suppress these messages!<br />
<br />
<!--T:16--><br />
If your job was actually killed for exceeding the requested memory, the key word "Killed" should appear in the standard error output of the job. <br />
<br />
<!--T:17--><br />
However, if you are using job dependencies (<code>dependency=afterok:<jobid></code>), then either of the messages "Exceeded job memory limit" or "Exceeded step memory limit" probably means that the dependent job was cancelled. We are [https://bugs.schedmd.com/show_bug.cgi?id=3820 in discussion] with the Slurm development team about fixing this behaviour, as well as suppressing the misleading messages in non-fatal circumstances.<br />
<br />
</translate></div>Kaizaadhttps://docs.alliancecan.ca/mediawiki/index.php?title=Gaussian&diff=44822Gaussian2017-12-19T14:47:23Z<p>Kaizaad: Remove module load via command line example</p>
<hr />
<div><languages /><br />
[[Category:Software]]<br />
<br />
<translate><br />
<!--T:1--><br />
Gaussian is a computational chemistry application produced by [http://gaussian.com/ Gaussian, Inc.]<br />
<br />
== License limitations == <!--T:2--><br />
<br />
<!--T:3--><br />
Compute Canada currently supports Gaussian only on [[Graham]] and certain legacy systems. <br />
<br />
<!--T:4--><br />
In order to use Gaussian you must agree to certain conditions. Send an email with a copy of the following statement to [mailto:support@computecanada.ca support@computecanada.ca].<br />
# I am not a member of a research group developing software competitive to Gaussian.<br />
# I will not copy the Gaussian software, nor make it available to anyone else.<br />
# I will properly acknowledge Gaussian Inc. and Compute Canada in publications.<br />
# I will notify Compute Canada of any change in the above acknowledgement.<br />
<br />
<!--T:5--><br />
We will then grant you access to Gaussian.<br />
<br />
==Running Gaussian on Graham== <!--T:6--><br />
Gaussian g16.a03, g09.e01 and g03.d01 are installed on the Graham cluster and available through the modules system. You should load the required version in your job script, as shown in the mysub.sh example below.<br />
</translate> <br />
<br />
<translate><br />
===Job submission=== <!--T:7--><br />
Graham uses the Slurm scheduler; for details about submitting jobs, see [[Running jobs]].<br />
<br />
<!--T:8--><br />
Besides your input file (in our example name.com), you have to prepare a job script to define the compute resources for the job; both input file and job script must be in the same directory.<br />
<br />
<!--T:9--><br />
There are two options to run your Gaussian job on Graham based on the size of your job files:<br />
* g16, g09, g03 for regular size jobs<br />
* G16, G09, G03 for large jobs<br />
<br />
====g16 (g09, g03) for regular size jobs==== <!--T:10--><br />
<br />
<!--T:11--><br />
This option saves the runtime files (.rwf, .inp, .d2e, .int, .skr) to local scratch (/localscratch/username/) on the compute node to which the job was scheduled. The files on local scratch are deleted by the scheduler afterwards; to keep track of them, we recommend that users note the compute node name; see the example below.<br />
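One simple way to record the node afterwards (a sketch, assuming Slurm's <code>sacct</code> accounting command is available) is:<br />
{{Command|sacct -j <jobid> --format{{=}}JobID,NodeList}}<br />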
<br />
<!--T:12--><br />
The following example is a g16 job script; for a g09 or g03 job, simply change the module load line and g16 to g09 or g03.<br />
<br />
<!--T:31--><br />
Note that for consistency, we use the same name for each file, changing only the extension (name.sh, name.com, name.log).<br />
</translate><br />
{{File<br />
|name=mysub.sh<br />
|lang="bash"<br />
|contents=<br />
#!/bin/bash<br />
#SBATCH --mem=16G # <translate><!--T:13--><br />
memory, roughly 2 times %mem defined in the input name.com file</translate><br />
#SBATCH --time=02-00:00 # <translate><!--T:14--><br />
expect run time (DD-HH:MM)</translate><br />
#SBATCH --cpus-per-task=16 # <translate><!--T:15--><br />
No. of cpus for the job as defined by %nprocs in the name.com file</translate><br />
module load gaussian/g16.a03<br />
g16 < name.com >& name.log # <translate><!--T:16--><br />
g16 command, input: name.com, output: name.log</translate><br />
}}<br />
<translate><br />
<!--T:17--><br />
You can modify the script to fit your job's requirements for compute resources.<br />
<br />
====G16 (G09, G03) for large size jobs==== <!--T:18--><br />
<br />
<!--T:19--><br />
/localscratch provides roughly 800 GB, shared by all jobs running on the node. If your job files are larger than, or close to, that size, use this option instead to save files to your /scratch space. It is difficult to define exactly what counts as a large job, since we cannot predict how many jobs will be running on a node at a given time, how many of them will save files to /localscratch, or how large those files will be. It is, however, possible for multiple Gaussian jobs running on the same node to share the ~800 GB space. <br />
<br />
<!--T:20--><br />
G16 provides a better way to manage your files, as they are located within the /scratch/username/jobid/ directory, and it is easier to locate the .rwf file to restart a job at a later time.<br />
<br />
<!--T:21--><br />
The following example is a G16 job script; for a G09 or G03 job, simply change the module load line and G16 to G09 or G03.<br />
<br />
<!--T:32--><br />
Note that for consistency, we use the same name for each file, changing only the extension (name.sh, name.com, name.log).<br />
</translate><br />
{{File<br />
|name=mysub.sh<br />
|lang="bash"<br />
|contents=<br />
#!/bin/bash<br />
#SBATCH --mem=16G # <translate><!--T:22--><br />
memory, roughly 2 times %mem defined in the input name.com file</translate><br />
#SBATCH --time=02-00:00 # <translate><!--T:23--><br />
expect run time (DD-HH:MM)</translate><br />
#SBATCH --cpus-per-task=16 # <translate><!--T:24--><br />
No. of cpus for the job as defined by %nprocs in the name.com file</translate><br />
module load gaussian/g16.a03<br />
G16 name.com # <translate><!--T:25--><br />
G16 command, input: name.com, output: name.log by default</translate><br />
}}<br />
<translate><br />
====Submit the job==== <!--T:33--><br />
sbatch mysub.sh<br />
<br />
=== Interactive jobs === <!--T:26--><br />
You can run interactive Gaussian jobs on Graham for testing purposes. It is not good practice to run them on a login node; instead, start an interactive session on a compute node with salloc. The example below requests one hour, 8 CPUs and 10 GB of memory for a Gaussian job.<br />
Go to the input file directory first, then run the salloc command:<br />
</translate><br />
{{Command|salloc --time{{=}}1:0:0 --cpus-per-task{{=}}8 --mem{{=}}10g}}<br />
<br />
<translate><br />
<!--T:27--><br />
Then use either<br />
</translate><br />
{{Commands<br />
|module load gaussian/g16.a03<br />
|G16 g16_test2.com # <translate><!--T:28--><br />
G16 saves runtime file (.rwf etc.) to /scratch/yourid/93288/</translate><br />
}}<br />
<br />
<translate><!--T:29--><br />
or </translate><br />
{{Commands<br />
|module load gaussian/g16.a03<br />
|g16 < g16_test2.com >& g16_test2.log & # <translate><!--T:30--><br />
g16 saves runtime file to /localscratch/yourid/</translate><br />
}}<br />
<br />
===Examples===<br />
Sample scripts (*.sh) and input files can be found on Graham under<br />
/home/jemmyhu/tests/test_Gaussian/</div>Kaizaadhttps://docs.alliancecan.ca/mediawiki/index.php?title=Graham&diff=44811Graham2017-12-18T02:52:59Z<p>Kaizaad: </p>
<hr />
<div><noinclude><br />
<languages /><br />
<br />
<translate><br />
<!--T:27--><br />
</noinclude><br />
{| class="wikitable"<br />
|-<br />
| Availability: In production. RAC 2017 allocations implemented June 30, 2017<br />
|-<br />
| Login node: '''graham.computecanada.ca'''<br />
|-<br />
| Globus endpoint: '''computecanada#graham-dtn'''<br />
|}<br />
<br />
<!--T:2--><br />
GRAHAM is a heterogeneous cluster, suitable for a variety of workloads, and located at the University of Waterloo. It is named after [https://en.wikipedia.org/wiki/Wes_Graham Wes Graham], the first director of the Computing Centre at Waterloo. It was previously known as "GP3" and is still identified as such in the [https://www.computecanada.ca/research-portal/accessing-resources/resource-allocation-competitions/ 2017 RAC] documentation.<br />
<br />
<!--T:4--><br />
The parallel filesystem and external persistent storage ([[National Data Cyberinfrastructure|NDC-Waterloo]]) are similar to [[Cedar|Cedar's]]. The interconnect is different and there is a slightly different mix of compute nodes.<br />
<br />
<!--T:28--><br />
The Graham system is sold and supported by Huawei Canada, Inc. It is entirely liquid cooled, using rear-door heat exchangers.<br />
<br />
<!--T:33--><br />
[https://docs.computecanada.ca/wiki/Getting_Started_with_the_new_National_Systems Getting started with Graham]<br />
<br />
<!--T:34--><br />
[https://docs.computecanada.ca/wiki/Running_jobs How to run jobs]<br />
<br />
=Attached storage systems= <!--T:23--><br />
<br />
<!--T:24--><br />
{| class="wikitable sortable"<br />
|-<br />
| '''Home space''' ||<br />
* Location of home directories.<br />
* Each home directory has a small, fixed [[Storage and file management#Filesystem_Quotas_and_Policies|quota]]. <br />
* Not allocated via [https://www.computecanada.ca/research-portal/accessing-resources/rapid-access-service/ RAS] or [https://www.computecanada.ca/research-portal/accessing-resources/resource-allocation-competitions/ RAC]. Larger requests go to Project space.<br />
* Has daily backup.<br />
|-<br />
| '''Scratch space'''<br />3.6PB total volume<br />Parallel high-performance filesystem ||<br />
* For active or temporary (<code>/scratch</code>) storage.<br />
* Not allocated.<br />
* Large fixed [[Storage and file management#Filesystem_Quotas_and_Policies|quota]] per user.<br />
* Inactive data will be purged.<br />
|-<br />
|'''Project space'''<br />External persistent storage<br />
||<br />
* Part of the [[National Data Cyberinfrastructure]].<br />
* Allocated via [https://www.computecanada.ca/research-portal/accessing-resources/rapid-access-service/ RAS] or [https://www.computecanada.ca/research-portal/accessing-resources/resource-allocation-competitions/ RAC].<br />
* Not designed for parallel I/O workloads. Use Scratch space instead.<br />
* Large adjustable [[Storage and file management#Filesystem_Quotas_and_Policies|quota]] per project.<br />
* Has daily backup.<br />
|}<br />
<br />
=High-performance interconnect= <!--T:19--><br />
<br />
<!--T:21--><br />
Mellanox FDR (56Gb/s) and EDR (100Gb/s) InfiniBand interconnect. FDR is used for GPU and cloud nodes, EDR for other node types. A central 324-port director switch aggregates connections from islands of 1024 cores each for CPU and GPU nodes. The 56 cloud nodes are a variation on CPU nodes, and are on a single larger island sharing 8 FDR uplinks to the director switch.<br />
<br />
<!--T:29--><br />
A low-latency, high-bandwidth InfiniBand fabric connects all nodes and scratch storage.<br />
<br />
<!--T:30--><br />
Nodes configurable for cloud provisioning also have a 10Gb/s Ethernet network, with 40Gb/s uplinks to scratch storage.<br />
<br />
<!--T:22--><br />
Graham is designed to support multiple simultaneous parallel jobs of up to 1024 cores each in a fully non-blocking manner. <br />
<br />
<!--T:31--><br />
For larger jobs the interconnect has an 8:1 blocking factor; i.e., even for jobs running on multiple islands, the Graham system provides a high-performance interconnect.<br />
<br />
<!--T:32--><br />
[https://docs.computecanada.ca/mediawiki/images/b/b3/Gp3-network-topo.png Graham high performance interconnect diagram]<br />
<br />
=Node types and characteristics= <!--T:5--><br />
A total of 35,520 cores and 320 GPU devices, spread across 1,107 nodes of different types.<br />
<br />
<!--T:25--><br />
''Processor type:'' All nodes except bigmem3000 have Intel E5-2683 v4 CPUs, running at 2.1 GHz.<br />
<br />
<!--T:26--><br />
''GPU type:'' P100 12g<br />
<br />
<!--T:6--><br />
{| class="wikitable sortable"<br />
|-<br />
| base nodes || 864 nodes || 128 GB of memory, 16 cores/socket, 2 sockets/node. Intel "Broadwell" CPUs at 2.1 GHz, model E5-2683 v4. 960 GB SATA SSD.<br />
|-<br />
| large nodes (cloud configuration) || 56 nodes || 256 GB of memory, 16 cores/socket, 2 sockets/node. Intel "Broadwell" CPUs at 2.1 GHz, model E5-2683 v4. 960 GB SATA SSD.<br />
|-<br />
| GPU nodes || 160 nodes || 128 GB of memory, 16 cores/socket, 2 sockets/node, 2 NVIDIA P100 Pascal GPUs/node (12 GB HBM2 memory). Intel "Broadwell" CPUs at 2.1 GHz, model E5-2683 v4. 1.6 TB NVMe SSD.<br />
|-<br />
| bigmem500 nodes || 24 nodes || 0.5 TB (512 GB) of memory, 16 cores/socket, 2 sockets/node. Intel "Broadwell" CPUs at 2.1 GHz, model E5-2683 v4. 960 GB SATA SSD.<br />
|-<br />
| bigmem3000 nodes || 3 nodes || 3 TB of memory, 16 cores/socket, 4 sockets/node. Intel "Broadwell" CPUs at 2.1 GHz, model E7-4850 v4. 960 GB SATA SSD.<br />
<br />
<!--T:35--><br />
|}<br />
<br />
<!--T:7--><br />
Best practice for local on-node storage is to use the temporary directory generated by [[Running jobs|Slurm]], $SLURM_TMPDIR. Note that this directory (and its contents) will disappear upon job completion.<br />
<br />
<!--T:14--><br />
<noinclude><br />
</translate><br />
</noinclude></div>Kaizaadhttps://docs.alliancecan.ca/mediawiki/index.php?title=Graham&diff=35456Graham2017-08-04T20:00:20Z<p>Kaizaad: </p>
<hr />
<div><noinclude><br />
<languages /><br />
<br />
<translate><br />
<!--T:27--><br />
</noinclude><br />
{| class="wikitable"<br />
|-<br />
| Availability: In production. RAC 2017 allocations implemented June 30, 2017<br />
|-<br />
| Login node: '''graham.computecanada.ca'''<br />
|-<br />
| Globus endpoint: '''computecanada#graham-dtn'''<br />
|}<br />
<br />
<!--T:2--><br />
GRAHAM is a heterogeneous cluster, suitable for a variety of workloads, and located at the University of Waterloo. It is named after [https://en.wikipedia.org/wiki/Wes_Graham Wes Graham], the first director of the Computing Centre at Waterloo. It was previously known as "GP3" and is still identified as such in the [https://www.computecanada.ca/research-portal/accessing-resources/resource-allocation-competitions/ 2017 RAC] documentation.<br />
<br />
<!--T:4--><br />
The parallel filesystem and external persistent storage ([[National Data Cyberinfrastructure|NDC-Waterloo]]) are similar to [[Cedar|Cedar's]]. The interconnect is different and there is a slightly different mix of compute nodes.<br />
<br />
<!--T:28--><br />
The Graham system is sold and supported by Huawei Canada, Inc. It is entirely liquid cooled, using rear-door heat exchangers.<br />
<br />
<!--T:33--><br />
[https://docs.computecanada.ca/wiki/Getting_Started_with_the_new_National_Systems Getting started with Graham]<br />
<br />
[https://docs.computecanada.ca/wiki/Running_jobs How to run jobs]<br />
<br />
=Attached storage systems= <!--T:23--><br />
<br />
<!--T:24--><br />
{| class="wikitable sortable"<br />
|-<br />
| '''Home space''' ||<br />
* Standard home directory.<br />
* Small, standard quota. <br />
* Not allocated via [https://www.computecanada.ca/research-portal/accessing-resources/rapid-access-service/ RAS] or [https://www.computecanada.ca/research-portal/accessing-resources/resource-allocation-competitions/ RAC]. Larger requests go to Project space.<br />
|-<br />
| '''Scratch space'''<br />Parallel high-performance filesystem ||<br />
* For active or temporary (<code>/scratch</code>) storage.<br />
* Available to all nodes.<br />
* Not allocated.<br />
* Inactive data will be purged.<br />
* [http://e.huawei.com/en/products/cloud-computing-dc/storage Huawei OceanStor] storage system with approximately 3.6PB usable capacity and aggregate performance of approximately 30GB/s.<br />
|-<br />
|'''Project space'''<br />External persistent storage<br />
||<br />
* Part of the [[National Data Cyberinfrastructure]].<br />
* Allocated via [https://www.computecanada.ca/research-portal/accessing-resources/rapid-access-service/ RAS] or [https://www.computecanada.ca/research-portal/accessing-resources/resource-allocation-competitions/ RAC].<br />
* Available to all nodes.<br />
* Not designed for parallel I/O workloads. Use Scratch space instead.<br />
|}<br />
<br />
=High-performance interconnect= <!--T:19--><br />
<br />
<!--T:21--><br />
Mellanox FDR (56Gb/s) and EDR (100Gb/s) InfiniBand interconnect. FDR is used for GPU and cloud nodes, EDR for other node types. A central 324-port director switch aggregates connections from islands of 1024 cores each for CPU and GPU nodes. The 56 cloud nodes are a variation on CPU nodes, and are on a single larger island sharing 8 FDR uplinks to the director switch.<br />
<br />
<!--T:29--><br />
A low-latency, high-bandwidth InfiniBand fabric connects all nodes and scratch storage.<br />
<br />
<!--T:30--><br />
Nodes configurable for cloud provisioning also have a 10Gb/s Ethernet network, with 40Gb/s uplinks to scratch storage.<br />
<br />
<!--T:22--><br />
Graham is designed to support multiple simultaneous parallel jobs of up to 1024 cores each in a fully non-blocking manner. <br />
<br />
<!--T:31--><br />
For larger jobs the interconnect has an 8:1 blocking factor; i.e., even for jobs running on multiple islands, the Graham system provides a high-performance interconnect.<br />
<br />
<!--T:32--><br />
[https://docs.computecanada.ca/mediawiki/images/b/b3/Gp3-network-topo.png Graham high performance interconnect diagram]<br />
<br />
=Node types and characteristics= <!--T:5--><br />
A total of 33,472 cores and 320 GPU devices, spread across 1,043 nodes of different types.<br />
<br />
<!--T:25--><br />
''Processor type:'' All nodes except bigmem3000 have Intel E5-2683 v4 CPUs, running at 2.1 GHz.<br />
<br />
<!--T:26--><br />
''GPU type:'' P100 12g<br />
<br />
<!--T:6--><br />
{| class="wikitable sortable"<br />
|-<br />
| "Base" compute nodes || 800 nodes || 128 GB of memory, 16 cores/socket, 2 sockets/node. Intel "Broadwell" CPUs at 2.1Ghz, model E5-2683 v4. 960GB SATA SSD.<br />
|-<br />
| "Large" nodes (cloud configuration) || 56 nodes || 256 GB of memory, 16 cores/socket, 2 sockets/node. Intel "Broadwell" CPUs at 2.1Ghz, model E5-2683 v4. 960GB SATA SSD.<br />
|-<br />
| "Bigmem500" nodes|| 24 nodes || 0.5 TB (512 GB) of memory, 16 cores/socket, 2 sockets/node. Intel "Broadwell" CPUs at 2.1Ghz, model E5-2683 v4. 960GB SATA SSD.<br />
|-<br />
| "Bigmem3000" nodes || 3 nodes || 3 TB of memory, 16 cores/socket, 4 sockets/node. Intel "Broadwell" CPUs at 2.1Ghz, model E7-4850 v4. 960GB SATA SSD.<br />
|-<br />
| "GPU" nodes || 160 nodes || 128 GB of memory, 16 cores/socket, 2 sockets/node, 2 NVIDIA P100 Pascal GPUs/node (12GB HBM2 memory). Intel "Broadwell" CPUs at 2.1Ghz, model E5-2683 v4. 1.6TB NVMe SSD.<br />
|}<br />
<br />
<!--T:7--><br />
Local (on-node) storage in the above nodes will be available as /tmp.<br />
<br />
<!--T:14--><br />
<noinclude><br />
</translate><br />
</noinclude></div>Kaizaad