AlphaFold
AlphaFold is a machine-learning model for protein structure prediction.
This page discusses how to use AlphaFold v2.0, the version that was entered in CASP14 and published in Nature.
Source code and documentation for AlphaFold can be found at their GitHub page (https://github.com/deepmind/alphafold). Any publication that discloses findings arising from using this source code or the model parameters should cite the AlphaFold paper (https://doi.org/10.1038/s41586-021-03819-2).
Using Python wheels
Available wheels
You can list available wheels using the avail_wheels command.
[name@server ~]$ avail_wheels alphafold
name version python arch
--------- --------- -------- -------
alphafold 2.2.2 py3 generic
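If you need a release other than the one shown, you can also list every version of the wheel we provide. A minimal sketch, assuming the --all-versions option of avail_wheels is available on your cluster:
[name@server ~]$ avail_wheels alphafold --all-versions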
Installing AlphaFold in a Python virtual environment
1. Load AlphaFold dependencies.
[name@server ~]$ module load gcc/9.3.0 openmpi/4.0.3 cuda/11.4 cudnn/8.2.0 kalign/2.03 hmmer/3.2.1 openmm-alphafold/7.5.1 hh-suite/3.3.0 python/3.8
As of July 2022, only Python 3.7 and 3.8 are supported.
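To confirm which interpreter the loaded modules provide, you can check the version directly (an optional sanity check):
[name@server ~]$ python --version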
2. Create and activate a Python virtual environment.
[name@server ~]$ virtualenv --no-download ~/alphafold_env
[name@server ~]$ source ~/alphafold_env/bin/activate
3. Install a specific version of AlphaFold and its Python dependencies.
(alphafold_env) [name@server ~] pip install --no-index --upgrade pip
(alphafold_env) [name@server ~] pip install --no-index alphafold==2.2.2
4. Validate it.
(alphafold_env) [name@server ~] run_alphafold.py --help
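You can also check which version of the wheel ended up in the environment (optional):
(alphafold_env) [name@server ~] pip show alphafold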
Databases
Note that AlphaFold requires a set of datasets/databases to be downloaded into your $SCRATCH space.
Important: The databases must live in $SCRATCH.
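Once extracted, the databases occupy roughly 2.2 TB (see the listing further below), so it is worth checking how much scratch space you have first. A minimal sketch, assuming the diskusage_report utility is available on your cluster:
[name@server ~]$ diskusage_report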
1. From a login node, create the data folder.
(alphafold_env) [name@server ~] export DOWNLOAD_DIR=$SCRATCH/alphafold/data
(alphafold_env) [name@server ~] mkdir -p $DOWNLOAD_DIR
2. With your virtual environment activated, you can download the data.
(alphafold_env) [name@server ~] download_all_data.sh $DOWNLOAD_DIR
Note that this step cannot be done from a compute node but rather from a login node. Since the download might take a while, we suggest starting the download in a screen or Tmux session.
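For example, a minimal sketch of running the download inside a screen session (the session name is arbitrary):
[name@server ~]$ screen -S alphafold-download
[name@server ~]$ module load gcc/9.3.0 openmpi/4.0.3 cuda/11.4 cudnn/8.2.0 kalign/2.03 hmmer/3.2.1 openmm-alphafold/7.5.1 hh-suite/3.3.0 python/3.8
[name@server ~]$ source ~/alphafold_env/bin/activate
(alphafold_env) [name@server ~] export DOWNLOAD_DIR=$SCRATCH/alphafold/data
(alphafold_env) [name@server ~] download_all_data.sh $DOWNLOAD_DIR
Detach with Ctrl-a d and reattach later with screen -r alphafold-download.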
Afterwards, the structure of your data should be similar to the following:
(alphafold_env) [name@server ~] tree -d $DOWNLOAD_DIR
$DOWNLOAD_DIR/                            # Total: ~ 2.2 TB (download: 428 GB)
    bfd/                                  # ~ 1.8 TB (download: 271.6 GB)
        # 6 files.
    mgnify/                               # ~ 64 GB (download: 32.9 GB)
        mgy_clusters.fa
    params/                               # ~ 3.5 GB (download: 3.5 GB)
        # 5 CASP14 models,
        # 5 pTM models,
        # LICENSE,
        # = 11 files.
    pdb70/                                # ~ 56 GB (download: 19.5 GB)
        # 9 files.
    pdb_mmcif/                            # ~ 206 GB (download: 46 GB)
        mmcif_files/
            # About 180,000 .cif files.
        obsolete.dat
    uniclust30/                           # ~ 87 GB (download: 24.9 GB)
        uniclust30_2018_08/
            # 13 files.
    uniref90/                             # ~ 59 GB (download: 29.7 GB)
        uniref90.fasta
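Once the download completes, you can compare the on-disk sizes with the totals above (an optional check):
(alphafold_env) [name@server ~] du -sh $DOWNLOAD_DIR/*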
Running AlphaFold
AlphaFold is hardcoded to use at most 8 CPU cores, since it does not benefit from using more than 8.
Edit one of the following submission scripts (CPU-only or GPU) according to your needs.
CPU-only script:
#!/bin/bash
#SBATCH --job-name=alphafold_run
#SBATCH --account=def-someprof # adjust this to match the accounting group you are using to submit jobs
#SBATCH --time=08:00:00 # adjust this to match the walltime of your job
#SBATCH --cpus-per-task=8 # a MAXIMUM of 8 cores; AlphaFold gains nothing from using more
#SBATCH --mem=20G # adjust this according to the memory you need
# Load modules dependencies
module load gcc/9.3.0 openmpi/4.0.3 cuda/11.4 cudnn/8.2.0 kalign/2.03 hmmer/3.2.1 openmm-alphafold/7.5.1 hh-suite/3.3.0 python/3.8
DOWNLOAD_DIR=$SCRATCH/alphafold/data # set the appropriate path to your downloaded data
INPUT_DIR=$SCRATCH/alphafold/input # set the appropriate path to your input data (FASTA files)
OUTPUT_DIR=${SCRATCH}/alphafold/output # set the appropriate path for your results
# Generate your virtual environment in $SLURM_TMPDIR
virtualenv --no-download ${SLURM_TMPDIR}/env
source ${SLURM_TMPDIR}/env/bin/activate
# Install alphafold and its dependencies
pip install --no-index --upgrade pip
pip install --no-index alphafold==2.2.2
# Edit with the proper arguments, run your commands
# run_alphafold.py --help
run_alphafold.py \
--data_dir=${DOWNLOAD_DIR} \
--fasta_paths=${INPUT_DIR}/YourSequence.fasta,${INPUT_DIR}/AnotherSequence.fasta \
--bfd_database_path=${DOWNLOAD_DIR}/bfd/bfd_metaclust_clu_complete_id30_c90_final_seq.sorted_opt \
--pdb70_database_path=${DOWNLOAD_DIR}/pdb70/pdb70 \
--template_mmcif_dir=${DOWNLOAD_DIR}/pdb_mmcif/mmcif_files \
--uniclust30_database_path=${DOWNLOAD_DIR}/uniclust30/uniclust30_2018_08/uniclust30_2018_08 \
--uniref90_database_path=${DOWNLOAD_DIR}/uniref90/uniref90.fasta \
--hhblits_binary_path=${EBROOTHHMINSUITE}/bin/hhblits \
--hhsearch_binary_path=${EBROOTHHMINSUITE}/bin/hhsearch \
--jackhmmer_binary_path=${EBROOTHMMER}/bin/jackhmmer \
--kalign_binary_path=${EBROOTKALIGN}/bin/kalign \
--mgnify_database_path=${DOWNLOAD_DIR}/mgnify/mgy_clusters_2018_12.fa \
--output_dir=${OUTPUT_DIR} \
--obsolete_pdbs_path=${DOWNLOAD_DIR}/pdb_mmcif/obsolete.dat \
--max_template_date=2020-05-14 \
--model_preset=monomer_casp14 \
--use_gpu_relax=False
GPU script:
#!/bin/bash
#SBATCH --job-name=alphafold_run
#SBATCH --account=def-someprof # adjust this to match the accounting group you are using to submit jobs
#SBATCH --time=08:00:00 # adjust this to match the walltime of your job
#SBATCH --gres=gpu:1 # a GPU helps to accelerate the inference part only
#SBATCH --cpus-per-task=8 # a MAXIMUM of 8 cores; AlphaFold gains nothing from using more
#SBATCH --mem=20G # adjust this according to the memory you need
# Load modules dependencies
module load gcc/9.3.0 openmpi/4.0.3 cuda/11.4 cudnn/8.2.0 kalign/2.03 hmmer/3.2.1 openmm-alphafold/7.5.1 hh-suite/3.3.0 python/3.8
DOWNLOAD_DIR=$SCRATCH/alphafold/data # set the appropriate path to your downloaded data
INPUT_DIR=$SCRATCH/alphafold/input # set the appropriate path to your input data (FASTA files)
OUTPUT_DIR=${SCRATCH}/alphafold/output # set the appropriate path for your results
# Generate your virtual environment in $SLURM_TMPDIR
virtualenv --no-download ${SLURM_TMPDIR}/env
source ${SLURM_TMPDIR}/env/bin/activate
# Install alphafold and its dependencies
pip install --no-index --upgrade pip
pip install --no-index alphafold==2.2.2
# Edit with the proper arguments, run your commands
# run_alphafold.py --help
run_alphafold.py \
--data_dir=${DOWNLOAD_DIR} \
--fasta_paths=${INPUT_DIR}/YourSequence.fasta,${INPUT_DIR}/AnotherSequence.fasta \
--bfd_database_path=${DOWNLOAD_DIR}/bfd/bfd_metaclust_clu_complete_id30_c90_final_seq.sorted_opt \
--pdb70_database_path=${DOWNLOAD_DIR}/pdb70/pdb70 \
--template_mmcif_dir=${DOWNLOAD_DIR}/pdb_mmcif/mmcif_files \
--uniclust30_database_path=${DOWNLOAD_DIR}/uniclust30/uniclust30_2018_08/uniclust30_2018_08 \
--uniref90_database_path=${DOWNLOAD_DIR}/uniref90/uniref90.fasta \
--hhblits_binary_path=${EBROOTHHMINSUITE}/bin/hhblits \
--hhsearch_binary_path=${EBROOTHHMINSUITE}/bin/hhsearch \
--jackhmmer_binary_path=${EBROOTHMMER}/bin/jackhmmer \
--kalign_binary_path=${EBROOTKALIGN}/bin/kalign \
--mgnify_database_path=${DOWNLOAD_DIR}/mgnify/mgy_clusters_2018_12.fa \
--output_dir=${OUTPUT_DIR} \
--obsolete_pdbs_path=${DOWNLOAD_DIR}/pdb_mmcif/obsolete.dat \
--max_template_date=2020-05-14 \
--model_preset=monomer_casp14 \
--use_gpu_relax=True
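Before submitting, make sure the input directory referenced by the script exists and contains your FASTA file(s); YourSequence.fasta is a placeholder name:
(alphafold_env) [name@server ~] mkdir -p $SCRATCH/alphafold/input
(alphafold_env) [name@server ~] cp YourSequence.fasta $SCRATCH/alphafold/input/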
Then, submit the job to the scheduler.
(alphafold_env) [name@server ~] sbatch --job-name alphafold-X alphafold-gpu.sh
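You can then monitor the job with the usual Slurm commands, for example:
(alphafold_env) [name@server ~] squeue -u $USER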
Using Singularity
The AlphaFold documentation explains how to run the software using Docker. We do not provide Docker, but we do provide Singularity. Note, however, that we recommend using a virtual environment and the Python wheel from our wheelhouse, as described above.
First, read our Singularity documentation as there are particularities for each cluster that must be taken into account. Then, build a Singularity container.
[name@server ~]$ cd $SCRATCH
[name@server ~]$ module load singularity
[name@server ~]$ singularity build alphafold.sif docker://uvarc/alphafold:2.2.0
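The build can take some time. Afterwards, you can confirm that the image was created and look at its metadata (optional):
[name@server ~]$ ls -lh $SCRATCH/alphafold.sif
[name@server ~]$ singularity inspect $SCRATCH/alphafold.sif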
Running AlphaFold within Singularity
AlphaFold is hardcoded to use at most 8 CPU cores, since it does not benefit from using more than 8.
Create a directory alphafold_output to hold the output files.
[name@server ~]$ mkdir $SCRATCH/alphafold_output
Then, edit the job submission script.
#!/bin/bash
#SBATCH --job-name alphafold-run
#SBATCH --account=def-someprof # adjust this to match the accounting group you are using to submit jobs
#SBATCH --time=08:00:00 # adjust this to match the walltime of your job
#SBATCH --gres=gpu:1 # a GPU helps to accelerate the inference part only
#SBATCH --cpus-per-task=8 # a MAXIMUM of 8 cores; AlphaFold gains nothing from using more
#SBATCH --mem=20G # adjust this according to the memory you need
module load singularity
export PYTHONNOUSERSITE=True
ALPHAFOLD_DATA_PATH=/path/to/alphafold/databases # e.g. $SCRATCH/alphafold/data (see the Databases section)
ALPHAFOLD_MODELS=/path/to/alphafold/databases/params # the params/ subdirectory of the same download
# Run the command
singularity run --nv \
-B $ALPHAFOLD_DATA_PATH:/data \
-B $ALPHAFOLD_MODELS \
-B .:/etc \
--pwd /app/alphafold alphafold.sif \
--fasta_paths=/path/to/input.fasta \
--uniref90_database_path=/data/uniref90/uniref90.fasta \
--data_dir=/data \
--mgnify_database_path=/data/mgnify/mgy_clusters.fa \
--bfd_database_path=/data/bfd/bfd_metaclust_clu_complete_id30_c90_final_seq.sorted_opt \
--uniclust30_database_path=/data/uniclust30/uniclust30_2018_08/uniclust30_2018_08 \
--pdb70_database_path=/data/pdb70/pdb70 \
--template_mmcif_dir=/data/pdb_mmcif/mmcif_files \
--obsolete_pdbs_path=/data/pdb_mmcif/obsolete.dat \
--max_template_date=2020-05-14 \
--output_dir=alphafold_output \
--model_names='model_1' \
--preset=casp14 \
--use_gpu_relax=True
AlphaFold launches multithreaded analysis using up to 8 CPUs before running model inference on the GPU.
Memory requirements will vary with protein size.
The current working directory is bind-mounted to /etc inside the container so that the ld.so.cache file can be written there (-B .:/etc). The --nv flag enables GPU support. Submit this job script (alpharun_jobscript.sh) using the Slurm sbatch command.
[name@server ~]$ sbatch alpharun_jobscript.sh
On successful completion, the output directory should have the following files:
[name@server ~]$ tree alphafold_output/input
alphafold_output
└── input
    ├── features.pkl
    ├── msas
    │   ├── bfd_uniclust_hits.a3m
    │   ├── mgnify_hits.sto
    │   └── uniref90_hits.sto
    ├── ranked_0.pdb
    ├── ranking_debug.json
    ├── relaxed_model_1.pdb
    ├── result_model_1.pkl
    ├── timings.json
    └── unrelaxed_model_1.pdb

2 directories, 10 files