AlphaFold

AlphaFold is a machine-learning model for the prediction of protein folding.

This page discusses how to use AlphaFold v2.0, the version that was entered in CASP14 and published in Nature.

Source code and documentation for AlphaFold can be found at their GitHub page. Any publication that discloses findings arising from using this source code or the model parameters should cite the AlphaFold paper.

Using Python wheel

Available wheels

You can list available wheels using the avail_wheels command:

[name@server ~]$ avail_wheels alphafold
name       version    python    arch
---------  ---------  --------  -------
alphafold  2.2.2      py3       generic
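
If you want to see whether other releases are packaged before pinning a version, avail_wheels can also list them. This is a minimal sketch, assuming your copy of avail_wheels supports the --all-versions flag (check avail_wheels --help if it does not):

[name@server ~]$ avail_wheels alphafold --all-versions    # list every packaged release of the alphafold wheel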

Installing AlphaFold in a Python virtual environment

1. Load AlphaFold dependencies:

[name@server ~]$ module load gcc/9.3.0 openmpi/4.0.3 cuda/11.4 cudnn/8.2.0 kalign/2.03 hmmer/3.2.1 openmm-alphafold/7.5.1 hh-suite/3.3.0 python/3.8

Only python 3.7 and 3.8 are currently supported.
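
If you are unsure which Python versions are installed on your cluster, you can check before loading one. A quick sketch, assuming the standard Lmod module system:

[name@server ~]$ module spider python    # list the available python module versions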


2. Create a Python virtual environment and activate it:

[name@server ~]$ virtualenv --no-download ~/alphafold_env
[name@server ~]$ source ~/alphafold_env/bin/activate


3. Install a specific version of AlphaFold and its python dependencies:

(alphafold_env) [name@server ~] pip install --no-index --upgrade pip
(alphafold_env) [name@server ~] pip install --no-index alphafold==2.2.2


4. Validate the installation:

(alphafold_env) [name@server ~] run_alphafold.py --help
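
As a further sanity check, you can confirm which AlphaFold version was installed in the environment. A minimal sketch using pip:

(alphafold_env) [name@server ~] pip show alphafold | head -n 2    # prints the package name and version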

Databases

Note that AlphaFold requires a set of databases to be downloaded into your $SCRATCH space.

Important: The databases must live in $SCRATCH.

1. From a login node, create the data folder:

(alphafold_env) [name@server ~] export DOWNLOAD_DIR=$SCRATCH/alphafold/data
(alphafold_env) [name@server ~] mkdir -p $DOWNLOAD_DIR


2. With your virtual environment activated, you can download the data:

(alphafold_env) [name@server ~] download_all_data.sh $DOWNLOAD_DIR

Note that this step cannot be done from a compute node; run it from a login node. Since the download can take a while, we suggest starting it in a screen or tmux session.
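
For example, you could wrap the download in a named screen session on a login node so that it survives a dropped SSH connection. A minimal sketch (the session name alphafold_dl is arbitrary):

[name@server ~]$ screen -S alphafold_dl                  # start a named screen session
[name@server ~]$ source ~/alphafold_env/bin/activate     # re-activate the environment inside screen
(alphafold_env) [name@server ~] export DOWNLOAD_DIR=$SCRATCH/alphafold/data
(alphafold_env) [name@server ~] download_all_data.sh $DOWNLOAD_DIR

Detach with Ctrl-a d and reattach later with screen -r alphafold_dl.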

On Graham only, the databases are already available on disk, so there is no need to download them; instead, point DOWNLOAD_DIR at the shared copy:

(alphafold_env) [name@server ~] export DOWNLOAD_DIR=/datashare/alphafold

Afterwards, the structure of your data should be similar to:

(alphafold_env) [name@server ~] tree -d $DOWNLOAD_DIR
$DOWNLOAD_DIR/                             # Total: ~ 2.2 TB (download: 428 GB)
    bfd/                                   # ~ 1.8 TB (download: 271.6 GB)
        # 6 files.
    mgnify/                                # ~ 64 GB (download: 32.9 GB)
        mgy_clusters.fa
    params/                                # ~ 3.5 GB (download: 3.5 GB)
        # 5 CASP14 models,
        # 5 pTM models,
        # LICENSE,
        # = 11 files.
    pdb70/                                 # ~ 56 GB (download: 19.5 GB)
        # 9 files.
    pdb_mmcif/                             # ~ 206 GB (download: 46 GB)
        mmcif_files/
            # About 180,000 .cif files.
        obsolete.dat
    uniclust30/                            # ~ 87 GB (download: 24.9 GB)
        uniclust30_2018_08/
            # 13 files.
    uniref90/                              # ~ 59 GB (download: 29.7 GB)
        uniref90.fasta
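
Once the download finishes, it is worth checking that the on-disk sizes roughly match the totals listed above. A minimal sketch using du:

(alphafold_env) [name@server ~] du -sh $DOWNLOAD_DIR/*    # size of each database
(alphafold_env) [name@server ~] du -sh $DOWNLOAD_DIR      # total, on the order of 2.2 TB when complete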

Running AlphaFold

Performance

AlphaFold is hard-coded to use at most 8 CPUs, since it does not benefit from more.



Edit the following submission script to suit your needs:

File : alphafold-cpu.sh

#!/bin/bash

#SBATCH --job-name=alphafold_run
#SBATCH --account=def-someprof    # adjust this to match the accounting group you are using to submit jobs
#SBATCH --time=08:00:00           # adjust this to match the walltime of your job
#SBATCH --cpus-per-task=8         # a MAXIMUM of 8 cores; AlphaFold has no benefit from using more
#SBATCH --mem=20G                 # adjust this according to the memory you need

# Load module dependencies
module load gcc/9.3.0 openmpi/4.0.3 cuda/11.4 cudnn/8.2.0 kalign/2.03 hmmer/3.2.1 openmm-alphafold/7.5.1 hh-suite/3.3.0 python/3.8

DOWNLOAD_DIR=$SCRATCH/alphafold/data   # set the appropriate path to your downloaded data
INPUT_DIR=$SCRATCH/alphafold/input     # set the appropriate path to your input data
OUTPUT_DIR=${SCRATCH}/alphafold/output # set the appropriate path to your output data

# Generate your virtual environment in $SLURM_TMPDIR
virtualenv --no-download ${SLURM_TMPDIR}/env
source ${SLURM_TMPDIR}/env/bin/activate

# Install alphafold and its dependencies
pip install --no-index --upgrade pip
pip install --no-index alphafold==2.2.2

# Edit with the proper arguments, run your commands
# run_alphafold.py --help
run_alphafold.py \
   --data_dir=${DOWNLOAD_DIR} \
   --fasta_paths=${INPUT_DIR}/YourSequence.fasta,${INPUT_DIR}/AnotherSequence.fasta \
   --bfd_database_path=${DOWNLOAD_DIR}/bfd/bfd_metaclust_clu_complete_id30_c90_final_seq.sorted_opt \
   --pdb70_database_path=${DOWNLOAD_DIR}/pdb70/pdb70 \
   --template_mmcif_dir=${DOWNLOAD_DIR}/pdb_mmcif/mmcif_files \
   --uniclust30_database_path=${DOWNLOAD_DIR}/uniclust30/uniclust30_2018_08/uniclust30_2018_08  \
   --uniref90_database_path=${DOWNLOAD_DIR}/uniref90/uniref90.fasta  \
   --hhblits_binary_path=${EBROOTHHMINSUITE}/bin/hhblits \
   --hhsearch_binary_path=${EBROOTHHMINSUITE}/bin/hhsearch \
   --jackhmmer_binary_path=${EBROOTHMMER}/bin/jackhmmer \
   --kalign_binary_path=${EBROOTKALIGN}/bin/kalign \
   --mgnify_database_path=${DOWNLOAD_DIR}/mgnify/mgy_clusters_2018_12.fa \
   --output_dir=${OUTPUT_DIR} \
   --obsolete_pdbs_path=${DOWNLOAD_DIR}/pdb_mmcif/obsolete.dat \
   --max_template_date=2020-05-14 \
   --model_preset=monomer_casp14


File : alphafold-gpu.sh

#!/bin/bash

#SBATCH --job-name=alphafold_run
#SBATCH --account=def-someprof    # adjust this to match the accounting group you are using to submit jobs
#SBATCH --time=08:00:00           # adjust this to match the walltime of your job
#SBATCH --gres=gpu:1              # a GPU helps to accelerate the inference part only
#SBATCH --cpus-per-task=8         # a MAXIMUM of 8 cores; AlphaFold has no benefit from using more
#SBATCH --mem=20G                 # adjust this according to the memory you need

# Load module dependencies
module load gcc/9.3.0 openmpi/4.0.3 cuda/11.4 cudnn/8.2.0 kalign/2.03 hmmer/3.2.1 openmm-alphafold/7.5.1 hh-suite/3.3.0 python/3.8

DOWNLOAD_DIR=$SCRATCH/alphafold/data   # set the appropriate path to your downloaded data
INPUT_DIR=$SCRATCH/alphafold/input     # set the appropriate path to your input data
OUTPUT_DIR=${SCRATCH}/alphafold/output # set the appropriate path to your output data

# Generate your virtual environment in $SLURM_TMPDIR
virtualenv --no-download ${SLURM_TMPDIR}/env
source ${SLURM_TMPDIR}/env/bin/activate

# Install alphafold and its dependencies
pip install --no-index --upgrade pip
pip install --no-index alphafold==2.2.2

# Edit with the proper arguments, run your commands
# run_alphafold.py --help
run_alphafold.py \
   --data_dir=${DOWNLOAD_DIR} \
   --fasta_paths=${INPUT_DIR}/YourSequence.fasta,${INPUT_DIR}/AnotherSequence.fasta \
   --bfd_database_path=${DOWNLOAD_DIR}/bfd/bfd_metaclust_clu_complete_id30_c90_final_seq.sorted_opt \
   --pdb70_database_path=${DOWNLOAD_DIR}/pdb70/pdb70 \
   --template_mmcif_dir=${DOWNLOAD_DIR}/pdb_mmcif/mmcif_files \
   --uniclust30_database_path=${DOWNLOAD_DIR}/uniclust30/uniclust30_2018_08/uniclust30_2018_08  \
   --uniref90_database_path=${DOWNLOAD_DIR}/uniref90/uniref90.fasta  \
   --hhblits_binary_path=${EBROOTHHMINSUITE}/bin/hhblits \
   --hhsearch_binary_path=${EBROOTHHMINSUITE}/bin/hhsearch \
   --jackhmmer_binary_path=${EBROOTHMMER}/bin/jackhmmer \
   --kalign_binary_path=${EBROOTKALIGN}/bin/kalign \
   --mgnify_database_path=${DOWNLOAD_DIR}/mgnify/mgy_clusters_2018_12.fa \
   --output_dir=${OUTPUT_DIR} \
   --obsolete_pdbs_path=${DOWNLOAD_DIR}/pdb_mmcif/obsolete.dat \
   --max_template_date=2020-05-14 \
   --model_preset=monomer_casp14 \
   --use_gpu_relax=True
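
Before submitting, make sure the FASTA files named in --fasta_paths exist under ${INPUT_DIR}. As a hypothetical example, ${INPUT_DIR}/YourSequence.fasta could look like this (a header line starting with >, followed by the amino-acid sequence):

>my_target_protein
MNLEKINELTAQDMAGVNAAILEQLNSDVQLINQLGYYIVSGGGKRIRPMIAVLAARAVG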


Then submit the job to the scheduler:

(alphafold_env) [name@server ~] sbatch --job-name alphafold-X alphafold-gpu.sh
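
Once submitted, you can follow the job and check for results with the usual Slurm tools. A minimal sketch:

(alphafold_env) [name@server ~] squeue -u $USER                  # check the state of your jobs
(alphafold_env) [name@server ~] ls ${SCRATCH}/alphafold/output   # results appear here as the job runs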

Using Singularity

The AlphaFold documentation explains how to run the software using Docker. We do not provide Docker, but we do provide Singularity. However, we recommend the virtual-environment and Python-wheel approach described above rather than the container.

First read our Singularity documentation as there are particularities of each cluster that one must take into account. Then build a Singularity container:

[name@server ~]$ cd $SCRATCH
[name@server ~]$ module load singularity
[name@server ~]$ singularity build alphafold.sif docker://uvarc/alphafold:2.2.0
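
Building the image downloads and converts the Docker layers, which Singularity caches under your home directory by default. If home-space quota is a concern, you can point the cache at scratch first; a minimal sketch using Singularity's standard cache variable:

[name@server ~]$ export SINGULARITY_CACHEDIR=$SCRATCH/.singularity
[name@server ~]$ singularity build alphafold.sif docker://uvarc/alphafold:2.2.0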


Running AlphaFold within Singularity

Performance

AlphaFold is hard-coded to use at most 8 CPUs, since it does not benefit from more.



Create a directory alphafold_output to hold the output files:

[name@server ~]$ mkdir $SCRATCH/alphafold_output

Then edit the job submission script:

File : alphafold-singularity.sh

#!/bin/bash

#SBATCH --job-name alphafold-run
#SBATCH --account=def-someprof    # adjust this to match the accounting group you are using to submit jobs
#SBATCH --time=08:00:00           # adjust this to match the walltime of your job
#SBATCH --gres=gpu:1              # a GPU helps to accelerate the inference part only
#SBATCH --cpus-per-task=8         # a MAXIMUM of 8 cores; AlphaFold has no benefit from using more
#SBATCH --mem=20G                 # adjust this according to the memory you need

module load singularity

export PYTHONNOUSERSITE=True

ALPHAFOLD_DATA_PATH=/path/to/alphafold/databases
ALPHAFOLD_MODELS=/path/to/alphafold/databases/params

# Run the command
singularity run --nv \
    -B $ALPHAFOLD_DATA_PATH:/data \
    -B $ALPHAFOLD_MODELS \
    -B .:/etc \
    --pwd /app/alphafold alphafold.sif \
    --fasta_paths=/path/to/input.fasta  \
    --uniref90_database_path=/data/uniref90/uniref90.fasta  \
    --data_dir=/data \
    --mgnify_database_path=/data/mgnify/mgy_clusters.fa   \
    --bfd_database_path=/data/bfd/bfd_metaclust_clu_complete_id30_c90_final_seq.sorted_opt \
    --uniclust30_database_path=/data/uniclust30/uniclust30_2018_08/uniclust30_2018_08 \
    --pdb70_database_path=/data/pdb70/pdb70  \
    --template_mmcif_dir=/data/pdb_mmcif/mmcif_files  \
    --obsolete_pdbs_path=/data/pdb_mmcif/obsolete.dat \
    --max_template_date=2020-05-14   \
    --output_dir=alphafold_output  \
    --model_names='model_1' \
    --preset=casp14 \
    --use_gpu_relax=True


AlphaFold launches multithreaded analysis using up to 8 CPUs before running model inference on the GPU. Memory requirements will vary with different size proteins.
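
To calibrate the memory request for your own proteins, you can inspect what a completed job actually used. A minimal sketch with the Slurm seff utility (the job ID 1234567 is a placeholder):

[name@server ~]$ seff 1234567    # reports CPU efficiency and maximum memory used by the job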

The current working directory is bind-mounted to /etc inside the container for the cache file ld.so.cache (-B .:/etc). The --nv flag enables GPU support. Submit this job script (alphafold-singularity.sh) with the Slurm sbatch command:

[name@server ~]$ sbatch alphafold-singularity.sh

On successful completion, the output directory should have the following files:

[name@server ~]$ tree alphafold_output/input
alphafold_output
 └── input
    ├── features.pkl
    ├── msas
    │   ├── bfd_uniclust_hits.a3m
    │   ├── mgnify_hits.sto
    │   └── uniref90_hits.sto
    ├── ranked_0.pdb
    ├── ranking_debug.json
    ├── relaxed_model_1.pdb
    ├── result_model_1.pkl
    ├── timings.json
    └── unrelaxed_model_1.pdb
 2 directories, 10 files
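
Here, ranked_0.pdb is the final predicted structure ranked highest by model confidence, and ranking_debug.json records the per-model scores used for that ranking. A minimal sketch for a quick look at the results (paths follow the layout shown above):

[name@server ~]$ python -m json.tool alphafold_output/input/ranking_debug.json   # per-model confidence scores
[name@server ~]$ less alphafold_output/input/ranked_0.pdb                        # top-ranked predicted structure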