NAMD
Revision as of 15:11, 30 July 2020
NAMD is a parallel, object-oriented molecular dynamics code designed for high-performance simulation of large biomolecular systems.
Simulation preparation and analysis is integrated into the VMD visualization package.
Installation
NAMD is installed by the Compute Canada software team and is available as a module. If a new version is required or if for some reason you need to do your own installation, please contact Technical support. You can also ask for details of how our NAMD modules were compiled.
Environment modules
The following modules are available:
- compiled without CUDA support:
  - namd-multicore/2.12
  - namd-verbs/2.12 (disabled on Cedar)
  - namd-mpi/2.12 (disabled on Graham)
- compiled with CUDA support:
  - namd-multicore/2.12
  - namd-verbs-smp/2.12 (disabled on Cedar)

To access these CUDA-enabled modules, first execute
module load cuda/8.0.44
Note: Using a verbs or UCX library is more efficient than using OpenMPI, hence only verbs or UCX versions are provided on systems where those are supported. Currently those versions do not work on Cedar as they are incompatible with the communications fabric, so use MPI versions instead.
The newest version, NAMD 2.13, is now also available. To load the GPU-enabled versions, first run
module load cuda/10.0.130
Submission scripts
Please refer to the Running jobs page for help on using the SLURM workload manager.
Serial jobs
Here is a simple job script for a serial simulation:
#!/bin/bash
#
#SBATCH --ntasks 1           # number of tasks
#SBATCH --mem 1024           # memory per node in MB
#SBATCH -o slurm.%N.%j.out   # STDOUT
#SBATCH -t 0:20:00           # time (HH:MM:SS)
#SBATCH --account=def-specifyaccount
module load namd-multicore/2.12
namd2 +p1 +idlepoll apoa1.namd
Verbs jobs
These provisional instructions will be refined further once this configuration can be fully tested on the new clusters.
This example uses 64 processes in total on 2 nodes, each node running 32 processes and thus fully utilizing its 32 cores. The script assumes full nodes are used, so ntasks-per-node should be 32 (on Graham). For best performance, NAMD jobs should use full nodes.
NOTES:
- Verbs versions will not run on Cedar because of its different interconnect; use the MPI version instead.
- Verbs versions will not run on Béluga either because of its incompatible infiniband kernel drivers; use the UCX version instead.
#!/bin/bash
#
#SBATCH --ntasks 64 # number of tasks
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=32
#SBATCH --mem=0 # memory per node, 0 means all memory
#SBATCH -o slurm.%N.%j.out # STDOUT
#SBATCH -t 0:05:00 # time (HH:MM:SS)
#SBATCH --account=def-specifyaccount
NODEFILE=nodefile.dat
slurm_hl2hl.py --format CHARM > $NODEFILE
P=$SLURM_NTASKS
module load namd-verbs/2.12
CHARMRUN=`which charmrun`
NAMD2=`which namd2`
$CHARMRUN ++p $P ++nodelist $NODEFILE $NAMD2 +idlepoll apoa1.namd
UCX jobs
This example uses 80 processes in total on 2 nodes, each node running 40 processes and thus fully utilizing its 40 cores. The script assumes full nodes are used, so ntasks-per-node should be 40 (on Béluga). For best performance, NAMD jobs should use full nodes.
NOTE: UCX versions will not run on Cedar because of its different interconnect. Use the MPI version instead.
#!/bin/bash
#
#SBATCH --ntasks 80 # number of tasks
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=40
#SBATCH --mem=0 # memory per node, 0 means all memory
#SBATCH -o slurm.%N.%j.out # STDOUT
#SBATCH -t 0:05:00 # time (HH:MM:SS)
#SBATCH --account=def-specifyaccount
module load namd-ucx/2.13
srun --mpi=pmi2 namd2 apoa1.namd
MPI jobs
NOTE: Use this only on Cedar, where verbs and UCX versions will not work.
#!/bin/bash
#
#SBATCH --ntasks 64 # number of tasks
#SBATCH --nodes=2
#SBATCH --mem 0 # memory per node, 0 means all memory
#SBATCH -o slurm.%N.%j.out # STDOUT
#SBATCH -t 0:05:00 # time (HH:MM:SS)
#SBATCH --account=def-specifyaccount
module load namd-mpi/2.12
NAMD2=`which namd2`
srun $NAMD2 apoa1.namd
GPU jobs
This example uses 8 CPU cores and 1 GPU on a single node.
#!/bin/bash
#
#SBATCH --cpus-per-task=8
#SBATCH --mem 2048
#SBATCH -o slurm.%N.%j.out # STDOUT
#SBATCH -t 0:05:00 # time (HH:MM:SS)
#SBATCH --gres=gpu:1
#SBATCH --account=def-specifyaccount
module load cuda/8.0.44
module load namd-multicore/2.12
namd2 +p$SLURM_CPUS_PER_TASK +idlepoll apoa1.namd
Verbs-GPU jobs
These provisional instructions will be refined further once this configuration can be fully tested on the new clusters.
This example uses 64 processes in total on 2 nodes, each node running 32 processes and thus fully utilizing its 32 cores. Each node uses 2 GPUs, so the job uses 4 GPUs in total. The script assumes full nodes are used, so ntasks-per-node should be 32 (on Graham). For best performance, NAMD jobs should use full nodes.
NOTE: Verbs versions will not run on Cedar because of its different interconnect.
#!/bin/bash
#
#SBATCH --ntasks 64 # number of tasks
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=32
#SBATCH --mem 0 # memory per node, 0 means all memory
#SBATCH --gres=gpu:2
#SBATCH -o slurm.%N.%j.out # STDOUT
#SBATCH -t 0:05:00 # time (HH:MM:SS)
#SBATCH --account=def-specifyaccount
slurm_hl2hl.py --format CHARM > nodefile.dat
NODEFILE=nodefile.dat
OMP_NUM_THREADS=32
P=$SLURM_NTASKS
module load cuda/8.0.44
module load namd-verbs-smp/2.12
CHARMRUN=`which charmrun`
NAMD2=`which namd2`
$CHARMRUN ++p $P ++ppn $OMP_NUM_THREADS ++nodelist $NODEFILE $NAMD2 +idlepoll apoa1.namd
Benchmarking NAMD
This section shows an example of how you should conduct benchmarking of NAMD. Performance of NAMD will be different for different systems you are simulating, depending especially on the number of atoms in the simulation. Therefore, if you plan to spend a significant amount of time simulating a particular system, it would be very useful to conduct the kind of benchmarking shown below. Collecting and providing this kind of data is also very useful if you are applying for a RAC award.
For a good benchmark, vary the number of steps so that your system runs for a few minutes and timing information is collected at reasonable intervals of at least a few seconds. If your run is too short, you might see fluctuations in your timing results.
The numbers below were obtained for the standard NAMD apoa1 benchmark. The benchmarking was conducted on the Graham cluster, which has CPU nodes with 32 cores and GPU nodes with 32 cores and 2 GPUs. Performing the benchmark on other clusters will have to take into account the different structure of their nodes.
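The step-count advice above can be sketched as a small calculation. This is a hypothetical helper, not part of the NAMD distribution; the per-step time is an example value you would measure in a short trial run, and `numsteps` and `outputTiming` are the NAMD configuration keywords that control the total step count and the interval between timing lines.

```shell
#!/bin/bash
# Sketch: pick a step count so the benchmark runs for a few minutes, and a
# timing-output interval of at least a few seconds, as recommended above.
time_per_step=0.0133   # measured seconds per step (example value from a trial run)
target_seconds=300     # aim for a roughly 5-minute benchmark
interval_seconds=5     # want timing lines at least this far apart

# Integer division via awk; printf "%d" truncates toward zero.
numsteps=$(awk -v t="$time_per_step" -v T="$target_seconds" \
           'BEGIN { printf "%d", T / t }')
outputtiming=$(awk -v t="$time_per_step" -v i="$interval_seconds" \
           'BEGIN { printf "%d", i / t }')

echo "numsteps     $numsteps"       # total steps for the benchmark run
echo "outputTiming $outputtiming"   # steps between timing outputs
```

With these example numbers the run would use 22556 steps and print timing information every 375 steps.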
In the results shown in the first table below, we used NAMD 2.12 from the verbs module. Efficiency is computed from (time with 1 core) / (N * (time with N cores) ).
| # cores | Wall time (s) per step | Efficiency |
|---------|------------------------|------------|
| 1       | 0.8313                 | 100%       |
| 2       | 0.4151                 | 100%       |
| 4       | 0.1945                 | 107%       |
| 8       | 0.0987                 | 105%       |
| 16      | 0.0501                 | 104%       |
| 32      | 0.0257                 | 101%       |
| 64      | 0.0133                 | 98%        |
| 128     | 0.0074                 | 88%        |
| 256     | 0.0036                 | 90%        |
| 512     | 0.0021                 | 77%        |
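The efficiency column can be recomputed directly from the wall-time data with the formula given above; here is a sketch using the 256-core row.

```shell
#!/bin/bash
# Recompute parallel efficiency from the wall-time data in the table:
# efficiency = (time with 1 core) / (N * (time with N cores)).
t1=0.8313   # wall time (s) per step on 1 core
tN=0.0036   # wall time (s) per step on N cores
N=256

eff=$(awk -v t1="$t1" -v tN="$tN" -v N="$N" \
      'BEGIN { printf "%.0f", 100 * t1 / (N * tN) }')
echo "${eff}%"
```

For the 256-core row this yields 90%, matching the table.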
These results show that for this system it is acceptable to use up to 256 cores. Keep in mind that if you ask for more cores, your jobs will wait in the queue for a longer time, affecting your overall throughput.
Now we perform benchmarking with GPUs. The NAMD multicore module is used for simulations that fit within one node, and the NAMD verbs-smp module is used for runs spanning multiple nodes.
| # cores | # GPUs | Wall time (s) per step | Notes              |
|---------|--------|------------------------|--------------------|
| 4       | 1      | 0.0165                 | 1 node, multicore  |
| 8       | 1      | 0.0088                 | 1 node, multicore  |
| 16      | 1      | 0.0071                 | 1 node, multicore  |
| 32      | 2      | 0.0045                 | 1 node, multicore  |
| 64      | 4      | 0.0058                 | 2 nodes, verbs-smp |
| 128     | 8      | 0.0051                 | 2 nodes, verbs-smp |
From this table it is clear that there is no point in using more than one node for this system, since performance actually gets worse with 2 or more nodes. Within a single node, 1 GPU with 16 cores gives the best efficiency, but 2 GPUs with 32 cores is also acceptable if you need your results quickly. Since on Graham GPU nodes your priority is charged the same for any job using up to 16 cores and 1 GPU, there is no benefit in running with 8 or 4 cores in this case.
Finally, you have to decide whether to run with or without GPUs for this simulation. Our numbers show that a job on a full GPU node of Graham (32 cores, 2 GPUs) runs faster than it would on 4 non-GPU nodes of Graham. Since a GPU node on Graham costs about twice what a non-GPU node costs, in this case it is more cost-effective to run with GPUs. You should therefore run with GPUs if possible; however, since there are fewer GPU than CPU nodes, you may need to consider submitting non-GPU jobs if your waiting time for GPU jobs is too long.
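The cost comparison above can be sketched numerically. The per-step times come from the two tables; the factor of 2 for the relative price of a GPU node is the approximate figure stated in the text and should be treated as an assumption.

```shell
#!/bin/bash
# Compare cost per simulated step: one full GPU node (32 cores, 2 GPUs)
# versus 4 CPU-only nodes (128 cores), using the benchmark tables above.
gpu_step=0.0045   # s/step on 1 GPU node (32 cores, 2 GPUs)
cpu_step=0.0074   # s/step on 128 cores, i.e. 4 CPU nodes
gpu_cost=2        # relative cost of a GPU node (assumption: ~2x a CPU node)
cpu_cost=4        # 4 CPU nodes at relative cost 1 each

# cost per step is proportional to (node cost) * (seconds per step)
ratio=$(awk -v gs="$gpu_step" -v cs="$cpu_step" -v gc="$gpu_cost" -v cc="$cpu_cost" \
        'BEGIN { printf "%.1f", (cc * cs) / (gc * gs) }')
echo "CPU-only run costs ${ratio}x more per step than the GPU run"
```

Under these assumptions the CPU-only run is roughly 3.3 times more expensive per step, which is why the GPU node is the more cost-effective choice here.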
References
- Downloads: http://www.ks.uiuc.edu/Development/Download/download.cgi?PackageName=NAMD -- Registration is required to download the software.
- NAMD User's Guide for version 2.12
- NAMD version 2.12 release notes
- Tutorials: http://www.ks.uiuc.edu/Training/Tutorials/