[[Category:Software]][[Category:BiomolecularSimulation]]
[http://www.ks.uiuc.edu/Research/namd/ NAMD] is a parallel, object-oriented molecular dynamics code designed for high-performance simulation of large biomolecular systems.  Simulation preparation and analysis are integrated into the [[VMD]] visualization package.

= Installation =
NAMD is installed by our software team and is available as a module.  If a new version is required or if for some reason you need to do your own installation, please contact [[Technical support]]. You can also ask for details of how our NAMD modules were compiled.


= Environment modules =

The latest version of NAMD, 2.14, is installed on all clusters.  We recommend that users run the newest version.
Older versions 2.13 and 2.12 are also available.

To run jobs that span nodes, use UCX.
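
To see which NAMD modules exist on the cluster you are using, you can query the module system, for example:

module spider namd
module load StdEnv/2020 namd-multicore/2.14

The exact module names and versions shown will vary by cluster and software environment.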


= Submission scripts =

Please refer to the [[Running jobs]] page for help on using the SLURM workload manager.

== Serial and threaded jobs ==
Below is a simple job script for a serial simulation (using only one core).  You can increase the value of <code>--cpus-per-task</code> to use more cores, up to the maximum number of cores available on a cluster node.
{{File
  |name=serial_namd_job.sh
  |lang="sh"
  |contents=
#!/bin/bash
#
#SBATCH --cpus-per-task=1
#SBATCH --mem 2048            # memory in MB, increase as needed
#SBATCH -o slurm.%N.%j.out    # STDOUT file
#SBATCH -t 0:05:00            # time (D-HH:MM), increase as needed
#SBATCH --account=def-specifyaccount


module load StdEnv/2020
module load namd-multicore/2.14
namd2 +p$SLURM_CPUS_PER_TASK +idlepoll apoa1.namd
}}
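Submit the script with <code>sbatch</code>:

sbatch serial_namd_job.sh

For a threaded run, simply request more cores, e.g. <code>--cpus-per-task=8</code>; the <code>+p$SLURM_CPUS_PER_TASK</code> option then starts NAMD with a matching number of threads.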

== Parallel CPU jobs ==

=== MPI jobs ===
'''NOTE''': MPI should not be used; use UCX instead.

=== Verbs jobs ===


'''NOTE''': For NAMD 2.14, use UCX.  The instructions below apply only to NAMD versions 2.13 and 2.12.

These provisional instructions will be refined further once this configuration can be fully tested on the new clusters.
This example uses 64 processes in total on 2 nodes, each node running 32 processes, thus fully utilizing its 32 cores.  This script assumes full nodes are used, thus <code>ntasks-per-node</code> should be 32 (on Graham).  For best performance, NAMD jobs should use full nodes.

'''NOTES''':
*Verbs versions will not run on Cedar because of its different interconnect; use the MPI version instead.
*Verbs versions will not run on Béluga either because of its incompatible InfiniBand kernel drivers; use the UCX version instead.
{{File
  |name=verbs_namd_job.sh
  |lang="sh"
  |contents=
#!/bin/bash
#
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=32
#SBATCH --mem=0               # memory per node, 0 means all memory
#SBATCH -o slurm.%N.%j.out    # STDOUT
#SBATCH -t 0:05:00            # time (D-HH:MM)
#SBATCH --account=def-specifyaccount

NODEFILE=nodefile.dat
slurm_hl2hl.py --format CHARM > $NODEFILE
P=$SLURM_NTASKS

module load namd-verbs/2.12
CHARMRUN=`which charmrun`
NAMD2=`which namd2`
$CHARMRUN ++p $P ++nodelist $NODEFILE $NAMD2 +idlepoll apoa1.namd
}}

=== UCX jobs ===
This example uses 80 processes in total on 2 nodes, each node running 40 processes, thus fully utilizing its 40 cores.  This script assumes full nodes are used, thus <code>ntasks-per-node</code> should be 40 (on Béluga).  For best performance, NAMD jobs should use full nodes.

'''NOTE''': UCX versions should work on all clusters.
{{File
  |name=ucx_namd_job.sh
  |lang="sh"
  |contents=
#!/bin/bash
#
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=40
#SBATCH --mem=0               # memory per node, 0 means all memory
#SBATCH -o slurm.%N.%j.out    # STDOUT
#SBATCH -t 0:05:00            # time (D-HH:MM)
#SBATCH --account=def-specifyaccount

module load StdEnv/2020 namd-ucx/2.14
srun --mpi=pmi2 namd2 apoa1.namd
}}

=== OFI jobs ===

'''NOTE''': OFI versions will run '''ONLY''' on Cedar because of its different interconnect.  There have been some issues with OFI, so it is better to use UCX.
{{File
  |name=ofi_namd_job.sh
  |lang="sh"
  |contents=
#!/bin/bash
#SBATCH --account=def-specifyaccount
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=32
#SBATCH -t 0:05:00            # time (D-HH:MM)
#SBATCH --mem=0            # memory per node, 0 means all memory
#SBATCH -o slurm.%N.%j.out    # STDOUT


module load StdEnv/2020 namd-ofi/2.14
srun --mpi=pmi2 namd2 stmv.namd
}}

== Single GPU jobs ==
This example uses 8 CPU cores and 1 P100 GPU on a single node.
{{File
  |name=multicore_gpu_namd_job.sh
  |lang="sh"
  |contents=
#!/bin/bash


#SBATCH --cpus-per-task=8
#SBATCH --mem=2048
#SBATCH --time=0:15:00
#SBATCH --gpus-per-node=p100:1
#SBATCH --account=def-specifyaccount

module load StdEnv/2020
module load cuda/11.0
module load namd-multicore/2.14
namd2 +p$SLURM_CPUS_PER_TASK +idlepoll apoa1.namd
}}


== Parallel GPU jobs ==

=== UCX GPU jobs ===
This example is for Béluga and it assumes that full nodes are used, which gives the best performance for NAMD jobs.  It uses 8 processes in total on 2 nodes, each process (task) using 10 threads and 1 GPU.  This fully utilizes Béluga GPU nodes, which have 40 cores and 4 GPUs per node.  Note that 1 core per task has to be reserved for a communications thread, so NAMD will report that only 72 cores are being used, but this is normal.

To use this script on other clusters, please look up the specifications of their available nodes and adjust the <code>--cpus-per-task</code> and <code>--gpus-per-node</code> options accordingly.

'''NOTE''': UCX can be used on all clusters.
{{File
  |name=ucx_gpu_namd_job.sh
  |lang="sh"
  |contents=
#!/bin/bash

#SBATCH --nodes=2
#SBATCH --ntasks-per-node=4
#SBATCH --cpus-per-task=10    # number of threads per task (process)
#SBATCH --gpus-per-node=v100:4
#SBATCH --mem=0               # memory per node, 0 means all memory
#SBATCH --time=0:15:00
#SBATCH --account=def-specifyaccount

module load StdEnv/2020 intel/2020.1.217 cuda/11.0 namd-ucx-smp/2.14
NUM_PES=$(expr $SLURM_CPUS_PER_TASK - 1)
srun --cpus-per-task=$SLURM_CPUS_PER_TASK --mpi=pmi2 namd2 ++ppn $NUM_PES apoa1.namd
}}
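
As an illustration only, on a cluster whose GPU nodes had 32 cores and 2 GPUs (an assumed node type, not a specific cluster), the same one-task-per-GPU layout would translate into a resource request along these lines, with the rest of the script unchanged:

#SBATCH --nodes=2
#SBATCH --ntasks-per-node=2      # one task per GPU on this assumed node type
#SBATCH --cpus-per-task=16       # 2 tasks x 16 threads = 32 cores per node
#SBATCH --gpus-per-node=2        # set the GPU type and count to match the node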


=== OFI GPU jobs ===

'''NOTE''': OFI versions will run '''ONLY''' on Cedar because of its different interconnect.  There have been some issues with OFI, so it is better to use UCX.
{{File
  |name=ofi_gpu_namd_job.sh
  |lang="sh"
  |contents=
#!/bin/bash
#SBATCH --account=def-specifyaccount
#SBATCH --ntasks 8            # number of tasks
#SBATCH --nodes=2
#SBATCH --cpus-per-task=6
#SBATCH --gpus-per-node=p100:4
#SBATCH -t 0:05:00            # time (D-HH:MM)
#SBATCH --mem=0               # memory per node, 0 means all memory

module load StdEnv/2020 cuda/11.0 namd-ofi-smp/2.14
NUM_PES=$(expr $SLURM_CPUS_PER_TASK - 1)
srun --cpus-per-task=$SLURM_CPUS_PER_TASK --mpi=pmi2 namd2 ++ppn $NUM_PES stmv.namd
}}


=== Verbs-GPU jobs ===

'''NOTE''': For NAMD 2.14, use UCX GPU on all clusters.  The instructions below apply only to NAMD versions 2.13 and 2.12.

These provisional instructions will be refined further once this configuration can be fully tested on the new clusters.
This example uses 64 processes in total on 2 nodes, each node running 32 processes, thus fully utilizing its 32 cores.  Each node uses 2 GPUs, so the job uses 4 GPUs in total.  This script assumes full nodes are used, thus <code>ntasks-per-node</code> should be 32 (on Graham).  For best performance, NAMD jobs should use full nodes.

'''NOTE''': Verbs versions will not run on Cedar because of its different interconnect.
{{File
  |name=verbsgpu_namd_job.sh
  |lang="sh"
  |contents=
#!/bin/bash
#
#SBATCH --ntasks 64           # number of tasks
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=32
#SBATCH --mem 0               # memory per node, 0 means all memory
#SBATCH --gpus-per-node=p100:2
#SBATCH -o slurm.%N.%j.out    # STDOUT
#SBATCH -t 0:05:00            # time (D-HH:MM)
#SBATCH --account=def-specifyaccount

slurm_hl2hl.py --format CHARM > nodefile.dat
NODEFILE=nodefile.dat
OMP_NUM_THREADS=32
P=$SLURM_NTASKS

module load cuda/8.0.44
module load namd-verbs-smp/2.12
CHARMRUN=`which charmrun`
NAMD2=`which namd2`
$CHARMRUN ++p $P ++ppn $OMP_NUM_THREADS ++nodelist $NODEFILE $NAMD2 +idlepoll apoa1.namd
}}

= Performance and benchmarking =

A team at [https://www.ace-net.ca/ ACENET] has created a [https://mdbench.ace-net.ca/mdbench/ Molecular Dynamics Performance Guide] for Alliance clusters.
It can help you determine optimal conditions for AMBER, GROMACS, NAMD, and OpenMM jobs.  The present section focuses on NAMD performance.

Here is an example of how to benchmark NAMD.  NAMD performance differs from system to system, depending especially on the number of atoms in the simulation.  Therefore, if you plan to spend a significant amount of time simulating a particular system, it is very useful to conduct the kind of benchmarking shown below.  Collecting and providing this kind of data is also very useful if you are applying for a RAC award.

For a good benchmark, vary the number of steps so that your system runs for a few minutes and timing information is collected at reasonable intervals of at least a few seconds.  If your run is too short, you might see fluctuations in your timing results.
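
NAMD writes its timing information to the job output file (<code>slurm.*.out</code> with the scripts on this page).  One simple way to collect it after a run, assuming those default output file names, is for example:

grep "Benchmark time" slurm.*.out        # early benchmark estimates of s/step and days/ns
grep "TIMING:" slurm.*.out | tail -5     # periodic timing lines, if timing output is enabled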

The numbers below were obtained for the standard NAMD apoa1 benchmark.  The benchmarking was conducted on the Graham cluster, which has CPU nodes with 32 cores and GPU nodes with 32 cores and 2 GPUs.  Benchmarks on other clusters will have to take the different structure of their nodes into account.

In the results shown in the first table below, we used NAMD 2.12 from the verbs module.  Efficiency is computed as (time with 1 core) / (N × (time with N cores)).
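For example, on 64 cores the efficiency is 0.8313 / (64 × 0.0133) ≈ 0.98, i.e. about 98%, matching the table below.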


{| class="wikitable sortable"
|-
! # cores !! Wall time (s) per step !! Efficiency
|-
| 1 ||  0.8313||100%
|-
| 2 ||  0.4151||100%
|-
| 4 ||  0.1945|| 107%
|-
| 8 ||  0.0987|| 105%
|-
| 16 ||  0.0501|| 104%
|-
| 32  ||    0.0257|| 101%
|-
| 64 ||  0.0133|| 98%
|-
| 128 || 0.0074|| 88%
|-
| 256 || 0.0036|| 90%
|-
| 512 || 0.0021|| 77%
|-
|}


These results show that for this system it is acceptable to use up to 256 cores.  Keep in mind that the more cores you ask for, the longer your jobs will wait in the queue, affecting your overall throughput.

Now we perform the benchmarking with GPUs.  The NAMD multicore module is used for simulations that fit within one node, and the NAMD verbs-smp module is used for runs spanning nodes.

{| class="wikitable sortable"
|-
! # cores !! #GPUs !! Wall time (s) per step !! Notes
|-
| 4 || 1  ||  0.0165 || 1 node, multicore
|-
| 8 || 1  || 0.0088 || 1 node, multicore
|-
| 16 || 1  || 0.0071 || 1 node, multicore
|-
| 32  || 2  ||  0.0045  || 1 node, multicore
|-
| 64 || 4 || 0.0058  || 2 nodes, verbs-smp
|-
| 128 || 8 ||  0.0051 || 2 nodes, verbs-smp
|-
|}
 
From this table it is clear that there is no point in using more than one node for this system, since performance actually becomes worse with two or more nodes.  Using one node, it is best to use 16 cores and 1 GPU, as that has the greatest efficiency, but it is also acceptable to use 32 cores and 2 GPUs if you need your results quickly.  Since on Graham GPU nodes your priority is charged the same for any job using up to 16 cores and 1 GPU, there is no benefit in requesting only 8 or 4 cores with that GPU.

Finally, you have to decide whether to run this simulation with or without GPUs.  From our numbers we can see that a job on a full GPU node of Graham (32 cores, 2 GPUs) runs faster than it would on 4 non-GPU nodes of Graham.  Since a GPU node on Graham costs about twice what a non-GPU node costs, in this case it is more cost-effective to run with GPUs.  You should run with GPUs if possible; however, given that there are fewer GPU nodes than CPU nodes, you may need to consider submitting non-GPU jobs if your waiting time for GPU jobs is too long.

= NAMD 3 =

NAMD 3 is now available as an ALPHA release.  It might offer better performance for certain system configurations.

To use it, download the binary from the NAMD website and modify it so it can run on Alliance systems, like this (change the alpha version as needed):
tar xvfz NAMD_3.0alpha11_Linux-x86_64-multicore-CUDA-SingleNode.tar.gz
cd NAMD_3.0alpha11_Linux-x86_64-multicore-CUDA
setrpaths.sh  --path .

After this, the <code>namd3</code> executable located in that directory will be linked to the correct libraries on our systems.
You can then submit a GPU job that uses that executable, for example with a script like the sketch below.
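
Below is a minimal single-node, single-GPU sketch modelled on the multicore script earlier on this page; the unpacking directory, resource values and input file are assumptions to adjust for your own case:
{{File
  |name=namd3_gpu_job.sh
  |lang="sh"
  |contents=
#!/bin/bash
#SBATCH --cpus-per-task=8
#SBATCH --mem=2048
#SBATCH --time=0:15:00
#SBATCH --gpus-per-node=1
#SBATCH --account=def-specifyaccount

# Directory where the NAMD 3 archive was unpacked and setrpaths.sh was run
# (assumed location; adjust to where you unpacked it)
NAMD3_DIR=$HOME/NAMD_3.0alpha11_Linux-x86_64-multicore-CUDA

$NAMD3_DIR/namd3 +p$SLURM_CPUS_PER_TASK apoa1.namd
}}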
 

For best performance, we recommend adding the following keyword to the configuration file:

CUDASOAintegrate on

Please see the [https://www.ks.uiuc.edu/Research/namd/alpha/3.0alpha/ NAMD 3.0 Alpha web page] for more on this parameter and related changes in NAMD 3.

= References =
* Downloads: http://www.ks.uiuc.edu/Development/Download/download.cgi?PackageName=NAMD (registration is required to download the software)
* [http://www.ks.uiuc.edu/Research/namd/2.12/ug/ NAMD User's Guide for version 2.12]
* [http://www.ks.uiuc.edu/Research/namd/2.12/notes.html NAMD version 2.12 release notes]
* Tutorials: http://www.ks.uiuc.edu/Training/Tutorials/
