AlphaFold: Difference between revisions

From Alliance Doc
m (Pin Alphafold repo version)
m (move Category tag outside of translated area of the page)
 
(79 intermediate revisions by 9 users not shown)
Line 1: Line 1:
{{draft}}
<languages />
[[Category:Software]]
 
<translate>
<!--T:1-->
[https://deepmind.com/blog/article/alphafold-a-solution-to-a-50-year-old-grand-challenge-in-biology AlphaFold]
is a machine learning model for the prediction of protein folding.


<!--T:2-->
This page discusses how to use AlphaFold v2.0, the version that was entered in CASP14 and published in Nature.


<!--T:3-->
Source code and documentation for AlphaFold can be found at their [https://github.com/deepmind/alphafold GitHub page].
Any publication that discloses findings arising from use of this source code or the model parameters should [https://github.com/deepmind/alphafold#citing-this-work cite] the [https://doi.org/10.1038/s41586-021-03819-2 AlphaFold paper].


== Usage in Compute Canada systems ==
== Available versions == <!--T:5-->
AlphaFold is available on our clusters as prebuilt Python packages (wheels). You can list available versions with <code>avail_wheels</code>.
{{Command
|avail_wheels alphafold --all-versions
|result=
name      version    python    arch
---------  ---------  --------  -------
alphafold  2.3.1      py3      generic
alphafold  2.3.0      py3      generic
alphafold  2.2.4      py3      generic
alphafold  2.2.3      py3      generic
alphafold  2.2.2      py3      generic
alphafold  2.2.1      py3      generic
alphafold  2.1.1      py3      generic
alphafold  2.0.0      py3      generic
}}
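If you want a script to pick the newest wheel automatically, the listing can be parsed. The following is a minimal sketch, assuming the column layout shown above; the function name <code>latest_alphafold_version</code> and the sample listing are illustrative and not part of any cluster tool.

```shell
#!/bin/sh
# Hypothetical helper: read avail_wheels-style output on standard input and
# print the highest AlphaFold version (assumes "name version python arch" columns).
latest_alphafold_version() {
    awk '$1 == "alphafold" { print $2 }' | sort -V | tail -n 1
}

# Captured copy of part of the listing, used here so the sketch runs anywhere.
sample_listing='name       version    python    arch
---------  ---------  --------  -------
alphafold  2.3.1      py3       generic
alphafold  2.2.4      py3       generic
alphafold  2.0.0      py3       generic'

printf '%s\n' "$sample_listing" | latest_alphafold_version   # prints 2.3.1
```

On a cluster, you could pipe <code>avail_wheels alphafold --all-versions</code> into the function instead of the sample text. Note that <code>sort -V</code> (version sort) is a GNU coreutils option.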


AlphaFold documentation explains how to run the software using Docker.
== Installing AlphaFold in a Python virtual environment == <!--T:6-->
In Compute Canada we do not provide Docker, but instead provide [[Singularity]].
We will describe how to use AlphaFold with Singularity much further down this page,
but we recommend instead that you use a virtual environment and a Python wheel available from the Compute Canada "wheelhouse".


=== AlphaFold in Python environment ===
<!--T:7-->
1. Load AlphaFold dependencies.
{{Command|module load StdEnv/2020 gcc/9.3.0 openmpi/4.0.3 cuda/11.4 cudnn/8.2.0 kalign/2.03 hmmer/3.2.1 openmm-alphafold/7.5.1 hh-suite/3.3.0 python/3.8
}}
As of July 2022, only Python 3.7 and 3.8 are supported.


1. AlphaFold has a number of other dependencies that need to be loaded first.
These include CUDA, kalign, hmmer, and openmm, all of which are available in the Compute Canada software stack.
Load these modules as follows (on Narval, replace cuda/11.2.2 with cuda/11.4):


<pre>
<!--T:8-->
[name@cluster ~]$ module load gcc/9.3.0 openmpi/4.0.3 cuda/11.2.2 cudnn/8.2.0
2. Create and activate a Python virtual environment.
[name@cluster ~]$ module load kalign/2.03 hmmer/3.2.1 openmm-alphafold/7.5.1 hh-suite/3.3.0
{{Commands
</pre>
|virtualenv --no-download ~/alphafold_env
|source ~/alphafold_env/bin/activate
}}


2. Clone the AlphaFold repository in <tt>$SCRATCH</tt>:
<!--T:9-->
3. Install a specific version of AlphaFold and its Python dependencies.
{{Commands
|prompt=(alphafold_env) [name@server ~]
|pip install --no-index --upgrade pip
|pip install --no-index alphafold{{=}}{{=}}X.Y.Z
}}
where <code>X.Y.Z</code> is the exact desired version, for instance <code>2.2.4</code>.
You can omit the version to install the latest one available from the wheelhouse.
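The choice between a pinned and an unpinned install can be wrapped in a small function. This is only a sketch: the function name <code>alphafold_requirement</code> is hypothetical, and the check merely verifies that the argument looks like <code>X.Y.Z</code>; it does not check that the version exists in the wheelhouse.

```shell
#!/bin/sh
# Hypothetical helper: print the pip requirement for an optional X.Y.Z version.
alphafold_requirement() {
    case "${1:-}" in
        "")
            # No version given: let pip pick the latest wheel available.
            echo "alphafold" ;;
        *[!0-9.]* | .* | *. | *..* | *.*.*.*)
            # Non-numeric characters, stray dots, or too many components.
            echo "error: expected a version of the form X.Y.Z, got '$1'" >&2
            return 1 ;;
        *.*.*)
            echo "alphafold==$1" ;;
        *)
            echo "error: expected a version of the form X.Y.Z, got '$1'" >&2
            return 1 ;;
    esac
}

alphafold_requirement 2.2.4   # prints alphafold==2.2.4
alphafold_requirement         # prints alphafold
```

Inside the activated environment, it could be used as <code>pip install --no-index "$(alphafold_requirement 2.2.4)"</code>.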


<pre>
<!--T:10-->
[name@cluster ~]$ cd $SCRATCH
4. Validate it.
[name@cluster ~]$ git clone https://github.com/deepmind/alphafold.git -b v2.1.1
{{Command
[name@cluster ~]$ wget https://git.scicore.unibas.ch/schwede/openstructure/-/raw/7102c63615b64735c4941278d92b554ec94415f8/modules/mol/alg/src/stereo_chemical_props.txt -P alphafold/alphafold/common/
|prompt=(alphafold_env) [name@server ~]
</pre>
|run_alphafold.py --help
}}


3. Create a Python virtual environment and activate it:
<!--T:45-->
5. Freeze the environment and requirements set.
{{Command
|prompt=(alphafold_env) [name@server ~]
|pip freeze > ~/alphafold-requirements.txt
}}
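Before reusing the frozen requirements file in a job, it can be useful to confirm which AlphaFold version it pins. The sketch below assumes that <code>pip freeze</code> wrote a line of the form <code>alphafold==X.Y.Z</code>, possibly followed by a local version suffix; the function name is illustrative.

```shell
#!/bin/sh
# Hypothetical helper: print the AlphaFold version pinned in a requirements file.
# Anything after the numeric X.Y.Z part (such as a local version suffix) is dropped.
pinned_alphafold_version() {
    sed -n 's/^alphafold==\([0-9][0-9.]*\).*$/\1/p' "$1"
}

# Demonstration on a throwaway file standing in for ~/alphafold-requirements.txt.
req_file=$(mktemp)
printf '%s\n' 'absl-py==1.0.0' 'alphafold==2.3.1' 'numpy==1.21.2' > "$req_file"
pinned_alphafold_version "$req_file"   # prints 2.3.1
rm -f "$req_file"
```

An empty result means the file does not pin AlphaFold in the assumed form, which is worth checking before the job installs from it.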


<pre>
== Databases == <!--T:11-->
[name@cluster ~]$ virtualenv --no-download ~/my_env
Note that AlphaFold requires a set of databases.
[name@cluster ~]$ source ~/my_env/bin/activate
</pre>


4. Install AlphaFold and its dependencies by:
<!--T:65-->
The databases are available in
<code>/cvmfs/bio.data.computecanada.ca/content/databases/Core/alphafold2_dbs/</code>.


<pre>
<!--T:63-->
(my_env)[name@cluster ~]$ pip install --no-index pdbfixer==1.7 alphafold==2.1.1
AlphaFold databases on CVMFS undergo yearly updates. In January 2024, the database was updated and is accessible in folder <code>2024_01</code>.
</pre>
{{Command
|prompt=(alphafold_env) [name@server ~]
|export DOWNLOAD_DIR{{=}}/cvmfs/bio.data.computecanada.ca/content/databases/Core/alphafold2_dbs/2024_01/
}}


Now AlphaFold is ready to be used. Note that to use AlphaFold outside a container, you need to use the <code>run_alphafold.py</code> script that is provided in the repository.
<!--T:66-->
You can also choose to download the databases locally into your <code>$SCRATCH</code> directory.


==== Creating the virtual environment in the job script ====
<!--T:12-->
<b>Important:</b> The databases must live in the <code>$SCRATCH</code> directory.


As discussed on the [[Python#Creating_virtual_environments_inside_of_your_jobs|Python]] page,
<!--T:13-->
your job may run faster if you create the virtual environment on node-local storage during the job.
<tabs>
If you do so, your job script should look something like this:
<tab name="General">
1. From a DTN or login node, create the data folder.
{{Commands
|prompt=(alphafold_env) [name@server ~]
|export DOWNLOAD_DIR{{=}}$SCRATCH/alphafold/data
|mkdir -p $DOWNLOAD_DIR
}}


{{File
<!--T:14-->
|name=my_alphafoldjob.sh
2. With your modules loaded and virtual environment activated, you can download the data.
|lang="bash"
{{Command
|contents=
|prompt=(alphafold_env) [name@server ~]
#!/bin/bash
|download_all_data.sh $DOWNLOAD_DIR
#SBATCH --job-name=alphafold_run
}}
#SBATCH --account=def-someprof # adjust this to match the accounting group you are using to submit jobs
#SBATCH --time=0-12:00        # adjust this to match the walltime of your job
#SBATCH --nodes=1     
#SBATCH --ntasks=1
#SBATCH --gres=gpu:1          # You need to request one GPU to be able to run AlphaFold properly
#SBATCH --cpus-per-task=8      # adjust this if you are using parallel commands
#SBATCH --mem=4000            # adjust this according to the memory you need
 
# Load your modules as before
# ON NARVAL USE cuda/11.4!!!!
module load gcc/9.3.0 openmpi/4.0.3 cuda/11.2.2 cudnn/8.2.0
module load kalign/2.03 hmmer/3.2.1 openmm-alphafold/7.5.1 hh-suite/3.3.0
 
cd $SCRATCH


# Generate your virtual environment in $SLURM_TMPDIR
<!--T:15-->
virtualenv --no-download ${SLURM_TMPDIR}/my_env
Note that this step <b>cannot</b> be done from a compute node. It should be done on a data transfer node (DTN) on clusters that have them (see [[Transferring data]]). On clusters that have no DTN, use a login node instead. Since the download can take up to a full day, we suggest using a [[Prolonging_terminal_sessions#Terminal_multiplexers|terminal multiplexer]]. You may encounter a <code>Client_loop: send disconnect: Broken pipe</code> error message. See [[AlphaFold#Broken pipe error message|Troubleshooting]] below.
source ${SLURM_TMPDIR}/my_env/bin/activate


# Install alphafold and dependencies
<!--T:67-->
pip install --no-index pdbfixer==1.7 alphafold==2.1.1
</tab>


# Run your commands
<!--T:16-->
python $SCRATCH/alphafold/run_alphafold.py --help
<tab name="Graham only">
1. Set <code>DOWNLOAD_DIR</code>.
{{Command
|prompt=(alphafold_env) [name@server ~]
|export DOWNLOAD_DIR{{=}}/datashare/alphafold
}}
}}


== Databases ==
<!--T:62-->
Note that AlphaFold requires a set of datasets/databases that need to be downloaded into <tt>$SCRATCH</tt>. We also prefer that you avoid using <code>aria2c</code>. To do so:
</tab>
</tabs>


'''Important:''' The database must live in <tt>$SCRATCH</tt> <b>unless</b> you are working with the NFS mount on Graham (see below).


'''Special Note for GRAHAM ONLY:''' The database is available in an NFS mount. See more information at https://helpwiki.sharcnet.ca/wiki/Graham_Reference_Dataset_Repository#AlphaFold.
<!--T:47-->
 
Afterwards, the structure of your data should be similar to
1. Move to the AlphaFold repository and the scripts folder:
<tabs>
<pre>
<tab name=2.3>
[name@cluster ~]$ cd $SCRATCH/alphafold
{{Command
[name@cluster ~]$ mkdir data
|prompt=(alphafold_env) [name@server ~]
</pre>
|tree -d $DOWNLOAD_DIR
 
|result=
2. Modify all the files there with the following command:
$DOWNLOAD_DIR/                             # ~ 2.6 TB (total)
<pre>
    bfd/                                  # ~ 1.8 TB
[name@cluster scripts]$ sed -i -e 's/aria2c/wget/g' -e 's/--dir=/-P /g' -e 's/--preserve-permissions//g' scripts/*.sh
        # 6 files
</pre>
    mgnify/                               # ~ 120 GB
 
        mgy_clusters.fa
3. Use the scripts to download the data:
    params/                                # ~ 5.3 GB
<pre>
        # LICENSE
[name@cluster ~]$ bash scripts/download_all_data.sh $SCRATCH/alphafold/data
        # 15 models
</pre>
        # 16 files (total)
    pdb70/                                # ~ 56 GB
        # 9 files
    pdb_mmcif/                             # ~ 246 GB
        mmcif_files/
            # 202,764 files
        obsolete.dat
    pdb_seqres/                           # ~ 237 MB
        pdb_seqres.txt
    uniprot/                               # ~ 111 GB
        uniprot.fasta
    uniref30/                              # ~ 206 GB
        # 7 files
    uniref90/                              # ~ 73 GB
        uniref90.fasta
}}
</tab>


Note that this might take a while and '''SHOULD NOT BE DONE ON THE COMPUTE NODES'''. Instead, use the [https://docs.computecanada.ca/wiki/Transferring_data data transfer nodes] or the login nodes. Because the download is long, we recommend running it in a [https://linuxize.com/post/how-to-use-linux-screen/ screen] or [https://docs.computecanada.ca/wiki/Tmux Tmux] session. If the download path is stored in <code>$DOWNLOAD_DIR</code>, then the structure of your data should be:
<!--T:17-->
 
<tab name=2.2>
<pre>
{{Command
|prompt=(alphafold_env) [name@server ~]
|tree -d $DOWNLOAD_DIR
|result=
$DOWNLOAD_DIR/                            # Total: ~ 2.2 TB (download: 428 GB)
$DOWNLOAD_DIR/                            # Total: ~ 2.2 TB (download: 428 GB)
     bfd/                                  # ~ 1.8 TB (download: 271.6 GB)
     bfd/                                  # ~ 1.8 TB (download: 271.6 GB)
Line 134: Line 188:
     uniref90/                              # ~ 59 GB (download: 29.7 GB)
     uniref90/                              # ~ 59 GB (download: 29.7 GB)
         uniref90.fasta
         uniref90.fasta
</pre>
}}
</tab>
</tabs>
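A quick way to compare your download against the 2.3 layout is to test for each top-level folder. This is a minimal sketch: the function name <code>check_alphafold_dbs</code> is hypothetical, it only checks that the folders exist (not their sizes or contents), and for 2.2 the folder list would have to be adjusted.

```shell
#!/bin/sh
# Hypothetical helper: report which top-level database folders are missing
# under the given root (folder names taken from the 2.3 layout).
check_alphafold_dbs() {
    db_root=$1
    missing=0
    for sub in bfd mgnify params pdb70 pdb_mmcif pdb_seqres uniprot uniref30 uniref90; do
        if [ ! -d "$db_root/$sub" ]; then
            echo "missing: $db_root/$sub" >&2
            missing=$((missing + 1))
        fi
    done
    # Succeed only when every expected folder was found.
    [ "$missing" -eq 0 ]
}

# Demonstration on a throwaway directory tree standing in for $DOWNLOAD_DIR.
demo_root=$(mktemp -d)
for sub in bfd mgnify params pdb70 pdb_mmcif pdb_seqres uniprot uniref30 uniref90; do
    mkdir -p "$demo_root/$sub"
done
check_alphafold_dbs "$demo_root" && echo "database layout looks complete"
rm -rf "$demo_root"
```

In practice you would call <code>check_alphafold_dbs "$DOWNLOAD_DIR"</code> after the download finishes and before submitting a job.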
 
== Running AlphaFold == <!--T:18-->
{{Warning
|title=Performance
|content=Request at most 8 CPU cores when running AlphaFold: it is hardcoded to use no more than 8 and does not benefit from additional cores.
}}
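A job script can guard against an oversized request. The sketch below assumes the standard Slurm variable <code>SLURM_CPUS_PER_TASK</code> is set inside the job; the function name <code>check_cpu_request</code> is illustrative.

```shell
#!/bin/sh
# Hypothetical guard: fail when more than 8 cores were requested,
# since AlphaFold does not use the extra cores.
check_cpu_request() {
    requested=${1:-8}
    if [ "$requested" -gt 8 ]; then
        echo "warning: $requested cores requested, but AlphaFold uses at most 8" >&2
        return 1
    fi
    return 0
}

# Outside a Slurm job the variable is unset, so fall back to 8 for the demonstration.
check_cpu_request "${SLURM_CPUS_PER_TASK:-8}" && echo "CPU request is within the limit"
```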
 
<!--T:19-->
Edit one of the following submission scripts according to your needs.
<tabs>
<tab name="2.3 on CPU">
{{File
|name=alphafold-2.3-cpu.sh
|lang="bash"
|contents=
#!/bin/bash
 
<!--T:48-->
#SBATCH --job-name=alphafold_run
#SBATCH --account=def-someprof    # adjust this to match the accounting group you are using to submit jobs
#SBATCH --time=08:00:00          # adjust this to match the walltime of your job
#SBATCH --cpus-per-task=8        # a MAXIMUM of 8 cores; AlphaFold does not benefit from more
#SBATCH --mem=20G                # adjust this according to the memory you need
 
<!--T:49-->
# Load module dependencies.
module load StdEnv/2020 gcc/9.3.0 openmpi/4.0.3 cuda/11.4 cudnn/8.2.0 kalign/2.03 hmmer/3.2.1 openmm-alphafold/7.5.1 hh-suite/3.3.0 python/3.8


This is important when passing the commands to AlphaFold.
<!--T:50-->
DOWNLOAD_DIR=$SCRATCH/alphafold/data  # set the appropriate path to your downloaded data
INPUT_DIR=$SCRATCH/alphafold/input    # set the appropriate path to your input data
OUTPUT_DIR=${SCRATCH}/alphafold/output # set the appropriate path to your output data


== Running AlphaFold ==
<!--T:51-->
<div class="alert alert-danger">
# Generate your virtual environment in $SLURM_TMPDIR.
  <strong>AlphaFold2 has the number of CPUs hardcoded!</strong> Please do not request any number other than 8, as this is the number of CPUs it requires.
virtualenv --no-download ${SLURM_TMPDIR}/env
</div>
source ${SLURM_TMPDIR}/env/bin/activate


Once you have everything set up, you can launch a production run of AlphaFold as follows:
<!--T:52-->
# Install AlphaFold and its dependencies.
pip install --no-index --upgrade pip
pip install --no-index --requirement ~/alphafold-requirements.txt


<!--T:53-->
# Edit with the proper arguments and run your commands.
# run_alphafold.py --help
run_alphafold.py \
  --fasta_paths=${INPUT_DIR}/YourSequence.fasta,${INPUT_DIR}/AnotherSequence.fasta \
  --output_dir=${OUTPUT_DIR} \
  --data_dir=${DOWNLOAD_DIR} \
  --db_preset=full_dbs \
  --model_preset=multimer \
  --bfd_database_path=${DOWNLOAD_DIR}/bfd/bfd_metaclust_clu_complete_id30_c90_final_seq.sorted_opt \
  --mgnify_database_path=${DOWNLOAD_DIR}/mgnify/mgy_clusters_2022_05.fa \
  --pdb70_database_path=${DOWNLOAD_DIR}/pdb70/pdb70 \
  --template_mmcif_dir=${DOWNLOAD_DIR}/pdb_mmcif/mmcif_files \
  --obsolete_pdbs_path=${DOWNLOAD_DIR}/pdb_mmcif/obsolete.dat \
  --pdb_seqres_database_path=${DOWNLOAD_DIR}/pdb_seqres/pdb_seqres.txt \
  --uniprot_database_path=${DOWNLOAD_DIR}/uniprot/uniprot.fasta \
  --uniref30_database_path=${DOWNLOAD_DIR}/uniref30/UniRef30_2021_03 \
  --uniref90_database_path=${DOWNLOAD_DIR}/uniref90/uniref90.fasta \
  --hhblits_binary_path=${EBROOTHHMINSUITE}/bin/hhblits \
  --hhsearch_binary_path=${EBROOTHHMINSUITE}/bin/hhsearch \
  --jackhmmer_binary_path=${EBROOTHMMER}/bin/jackhmmer \
  --kalign_binary_path=${EBROOTKALIGN}/bin/kalign \
  --max_template_date=2022-01-01 \
  --use_gpu_relax=False
}}
</tab>
<!--T:54-->
<tab name="2.3 on GPU">
{{File
{{File
|name=my_alphafoldjob.sh
|name=alphafold-2.3-gpu.sh
|lang="bash"
|lang="bash"
|contents=
|contents=
#!/bin/bash
#!/bin/bash
<!--T:55-->
#SBATCH --job-name=alphafold_run
#SBATCH --job-name=alphafold_run
#SBATCH --account=def-someprof # adjust this to match the accounting group you are using to submit jobs
#SBATCH --account=def-someprof   # adjust this to match the accounting group you are using to submit jobs
#SBATCH --time=0-12:00:00     # adjust this to match the walltime of your job
#SBATCH --time=08:00:00           # adjust this to match the walltime of your job
#SBATCH --nodes=1     
#SBATCH --cpus-per-task=8        # a MAXIMUM of 8 cores; AlphaFold does not benefit from more
#SBATCH --ntasks=1
#SBATCH --gres=gpu:1             # a GPU helps to accelerate the inference part only
#SBATCH --gres=gpu:1           # You need to request one GPU to be able to run AlphaFold properly
#SBATCH --mem=20G                # adjust this according to the memory you need
#SBATCH --cpus-per-task=8      # DO NOT INCREASE THIS AS ALPHAFOLD CANNOT TAKE ADVANTAGE OF MORE
 
#SBATCH --mem=32G              # adjust this according to the memory requirement per node you need
<!--T:56-->
#SBATCH --mail-user=you@youruniversity.ca # adjust this to match your email address
# Load module dependencies.
#SBATCH --mail-type=ALL
module load StdEnv/2020 gcc/9.3.0 openmpi/4.0.3 cuda/11.4 cudnn/8.2.0 kalign/2.03 hmmer/3.2.1 openmm-alphafold/7.5.1 hh-suite/3.3.0 python/3.8
 
<!--T:57-->
DOWNLOAD_DIR=$SCRATCH/alphafold/data  # set the appropriate path to your downloaded data
INPUT_DIR=$SCRATCH/alphafold/input    # set the appropriate path to your input data
OUTPUT_DIR=${SCRATCH}/alphafold/output # set the appropriate path to your output data
 
<!--T:58-->
# Generate your virtual environment in $SLURM_TMPDIR.
virtualenv --no-download ${SLURM_TMPDIR}/env
source ${SLURM_TMPDIR}/env/bin/activate
 
<!--T:59-->
# Install AlphaFold and its dependencies.
pip install --no-index --upgrade pip
pip install --no-index --requirement ~/alphafold-requirements.txt


# Set the path to download dir
<!--T:60-->
DOWNLOAD_DIR=$SCRATCH/alphafold/data  # Set the appropriate path to your downloaded data
# Edit with the proper arguments and run your commands.
INPUT_DIR=$SCRATCH/alphafold/input    # Set the appropriate path to your supporting data
# run_alphafold.py --help
REPO_DIR=$SCRATCH/alphafold # Set the appropriate path to AlphaFold's cloned repo
run_alphafold.py \
  --fasta_paths=${INPUT_DIR}/YourSequence.fasta,${INPUT_DIR}/AnotherSequence.fasta \
  --output_dir=${OUTPUT_DIR} \
  --data_dir=${DOWNLOAD_DIR} \
  --db_preset=full_dbs \
  --model_preset=multimer \
  --bfd_database_path=${DOWNLOAD_DIR}/bfd/bfd_metaclust_clu_complete_id30_c90_final_seq.sorted_opt \
  --mgnify_database_path=${DOWNLOAD_DIR}/mgnify/mgy_clusters_2022_05.fa \
  --pdb70_database_path=${DOWNLOAD_DIR}/pdb70/pdb70 \
  --template_mmcif_dir=${DOWNLOAD_DIR}/pdb_mmcif/mmcif_files \
  --obsolete_pdbs_path=${DOWNLOAD_DIR}/pdb_mmcif/obsolete.dat \
  --pdb_seqres_database_path=${DOWNLOAD_DIR}/pdb_seqres/pdb_seqres.txt \
  --uniprot_database_path=${DOWNLOAD_DIR}/uniprot/uniprot.fasta \
  --uniref30_database_path=${DOWNLOAD_DIR}/uniref30/UniRef30_2021_03 \
  --uniref90_database_path=${DOWNLOAD_DIR}/uniref90/uniref90.fasta \
  --hhblits_binary_path=${EBROOTHHMINSUITE}/bin/hhblits \
  --hhsearch_binary_path=${EBROOTHHMINSUITE}/bin/hhsearch \
  --jackhmmer_binary_path=${EBROOTHMMER}/bin/jackhmmer \
  --kalign_binary_path=${EBROOTKALIGN}/bin/kalign \
  --max_template_date=2022-01-01 \
  --use_gpu_relax=True
}}
</tab>


# Load your modules as before
<!--T:61-->
# ON NARVAL USE cuda/11.4 !!!
<tab name="2.2 on CPU">
module load gcc/9.3.0 openmpi/4.0.3 cuda/11.2.2 cudnn/8.2.0
{{File
module load kalign/2.03 hmmer/3.2.1 openmm-alphafold/7.5.1 hh-suite/3.3.0
|name=alphafold-cpu.sh
|lang="bash"
|contents=
#!/bin/bash
 
<!--T:20-->
#SBATCH --job-name=alphafold_run
#SBATCH --account=def-someprof    # adjust this to match the accounting group you are using to submit jobs
#SBATCH --time=08:00:00          # adjust this to match the walltime of your job
#SBATCH --cpus-per-task=8        # a MAXIMUM of 8 cores; AlphaFold does not benefit from more
#SBATCH --mem=20G                # adjust this according to the memory you need


cd $SCRATCH # Set the appropriate folder where the repo is contained
<!--T:21-->
# Load module dependencies.
module load StdEnv/2020 gcc/9.3.0 openmpi/4.0.3 cuda/11.4 cudnn/8.2.0 kalign/2.03 hmmer/3.2.1 openmm-alphafold/7.5.1 hh-suite/3.3.0 python/3.8


# Generate your virtual environment in $SLURM_TMPDIR
<!--T:22-->
virtualenv --no-download ${SLURM_TMPDIR}/my_env
DOWNLOAD_DIR=$SCRATCH/alphafold/data  # set the appropriate path to your downloaded data
source ${SLURM_TMPDIR}/my_env/bin/activate
INPUT_DIR=$SCRATCH/alphafold/input    # set the appropriate path to your input data
OUTPUT_DIR=${SCRATCH}/alphafold/output # set the appropriate path to your output data


# Install alphafold and dependencies
<!--T:23-->
pip install --no-index pdbfixer==1.7 alphafold==2.1.1
# Generate your virtual environment in $SLURM_TMPDIR.
virtualenv --no-download ${SLURM_TMPDIR}/env
source ${SLURM_TMPDIR}/env/bin/activate


# Run your commands
<!--T:24-->
python ${REPO_DIR}/run_alphafold.py \
# Install AlphaFold and its dependencies.
pip install --no-index --upgrade pip
pip install --no-index --requirement ~/alphafold-requirements.txt
 
<!--T:25-->
# Edit with the proper arguments and run your commands.
# Note that the `--uniclust30_database_path` option below was renamed to
# `--uniref30_database_path` in 2.3.
# run_alphafold.py --help
run_alphafold.py \
  --fasta_paths=${INPUT_DIR}/YourSequence.fasta,${INPUT_DIR}/AnotherSequence.fasta \
  --output_dir=${OUTPUT_DIR} \
   --data_dir=${DOWNLOAD_DIR} \
   --data_dir=${DOWNLOAD_DIR} \
   --fasta_paths=${INPUT_DIR}/YourSequence.fasta,${INPUT_DIR}/AnotherSequence.fasta \
   --model_preset=monomer_casp14 \
   --bfd_database_path=${DOWNLOAD_DIR}/bfd/bfd_metaclust_clu_complete_id30_c90_final_seq.sorted_opt \
   --bfd_database_path=${DOWNLOAD_DIR}/bfd/bfd_metaclust_clu_complete_id30_c90_final_seq.sorted_opt \
  --mgnify_database_path=${DOWNLOAD_DIR}/mgnify/mgy_clusters_2018_12.fa \
   --pdb70_database_path=${DOWNLOAD_DIR}/pdb70/pdb70 \
   --pdb70_database_path=${DOWNLOAD_DIR}/pdb70/pdb70 \
   --template_mmcif_dir=${DOWNLOAD_DIR}/pdb_mmcif/mmcif_files \
   --template_mmcif_dir=${DOWNLOAD_DIR}/pdb_mmcif/mmcif_files \
  --obsolete_pdbs_path=${DOWNLOAD_DIR}/pdb_mmcif/obsolete.dat \
   --uniclust30_database_path=${DOWNLOAD_DIR}/uniclust30/uniclust30_2018_08/uniclust30_2018_08  \
   --uniclust30_database_path=${DOWNLOAD_DIR}/uniclust30/uniclust30_2018_08/uniclust30_2018_08  \
   --uniref90_database_path=${DOWNLOAD_DIR}/uniref90/uniref90.fasta  \
   --uniref90_database_path=${DOWNLOAD_DIR}/uniref90/uniref90.fasta  \
Line 193: Line 378:
   --jackhmmer_binary_path=${EBROOTHMMER}/bin/jackhmmer \
   --jackhmmer_binary_path=${EBROOTHMMER}/bin/jackhmmer \
   --kalign_binary_path=${EBROOTKALIGN}/bin/kalign \
   --kalign_binary_path=${EBROOTKALIGN}/bin/kalign \
  --max_template_date=2020-05-14 \
  --use_gpu_relax=False
}}
</tab>
<!--T:26-->
<tab name="2.2 on GPU">
{{File
|name=alphafold-gpu.sh
|lang="bash"
|contents=
#!/bin/bash
<!--T:27-->
#SBATCH --job-name=alphafold_run
#SBATCH --account=def-someprof    # adjust this to match the accounting group you are using to submit jobs
#SBATCH --time=08:00:00          # adjust this to match the walltime of your job
#SBATCH --gres=gpu:1              # a GPU helps to accelerate the inference part only
#SBATCH --cpus-per-task=8        # a MAXIMUM of 8 cores; AlphaFold does not benefit from more
#SBATCH --mem=20G                # adjust this according to the memory you need
<!--T:28-->
# Load module dependencies.
module load StdEnv/2020 gcc/9.3.0 openmpi/4.0.3 cuda/11.4 cudnn/8.2.0 kalign/2.03 hmmer/3.2.1 openmm-alphafold/7.5.1 hh-suite/3.3.0 python/3.8
<!--T:29-->
DOWNLOAD_DIR=$SCRATCH/alphafold/data  # set the appropriate path to your downloaded data
INPUT_DIR=$SCRATCH/alphafold/input    # set the appropriate path to your input data
OUTPUT_DIR=${SCRATCH}/alphafold/output # set the appropriate path to your output data
<!--T:30-->
# Generate your virtual environment in $SLURM_TMPDIR.
virtualenv --no-download ${SLURM_TMPDIR}/env
source ${SLURM_TMPDIR}/env/bin/activate
<!--T:31-->
# Install AlphaFold  and its dependencies.
pip install --no-index --upgrade pip
pip install --no-index --requirement ~/alphafold-requirements.txt
<!--T:32-->
# Edit with the proper arguments and run your commands.
# Note that the `--uniclust30_database_path` option below was renamed to
# `--uniref30_database_path` in 2.3.
# run_alphafold.py --help
run_alphafold.py \
  --fasta_paths=${INPUT_DIR}/YourSequence.fasta,${INPUT_DIR}/AnotherSequence.fasta \
  --output_dir=${OUTPUT_DIR} \
  --data_dir=${DOWNLOAD_DIR} \
  --model_preset=monomer_casp14 \
  --bfd_database_path=${DOWNLOAD_DIR}/bfd/bfd_metaclust_clu_complete_id30_c90_final_seq.sorted_opt \
   --mgnify_database_path=${DOWNLOAD_DIR}/mgnify/mgy_clusters_2018_12.fa \
   --mgnify_database_path=${DOWNLOAD_DIR}/mgnify/mgy_clusters_2018_12.fa \
   --output_dir=${SCRATCH}/alphafold_output \
   --pdb70_database_path=${DOWNLOAD_DIR}/pdb70/pdb70 \
  --template_mmcif_dir=${DOWNLOAD_DIR}/pdb_mmcif/mmcif_files \
   --obsolete_pdbs_path=${DOWNLOAD_DIR}/pdb_mmcif/obsolete.dat \
   --obsolete_pdbs_path=${DOWNLOAD_DIR}/pdb_mmcif/obsolete.dat \
  --uniclust30_database_path=${DOWNLOAD_DIR}/uniclust30/uniclust30_2018_08/uniclust30_2018_08  \
  --uniref90_database_path=${DOWNLOAD_DIR}/uniref90/uniref90.fasta  \
  --hhblits_binary_path=${EBROOTHHMINSUITE}/bin/hhblits \
  --hhsearch_binary_path=${EBROOTHHMINSUITE}/bin/hhsearch \
  --jackhmmer_binary_path=${EBROOTHMMER}/bin/jackhmmer \
  --kalign_binary_path=${EBROOTKALIGN}/bin/kalign \
   --max_template_date=2020-05-14 \
   --max_template_date=2020-05-14 \
  --model_preset=monomer_casp14 \
   --use_gpu_relax=True
   --use_gpu_relax=True
}}
</tab>
</tabs>
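The <code>--fasta_paths</code> option in the scripts above takes a comma-separated list of files. When the inputs all sit in one folder, the list can be built rather than typed. This is a sketch under that assumption; the function name <code>fasta_paths_from_dir</code> is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: print every *.fasta file in a directory as a single
# comma-separated list, suitable for --fasta_paths.
fasta_paths_from_dir() {
    paths=""
    for f in "$1"/*.fasta; do
        [ -e "$f" ] || continue          # skip the unexpanded pattern when nothing matches
        paths="${paths:+$paths,}$f"      # add a comma only between entries
    done
    printf '%s\n' "$paths"
}

# Demonstration on a throwaway directory standing in for $INPUT_DIR.
demo_dir=$(mktemp -d)
: > "$demo_dir/YourSequence.fasta"
: > "$demo_dir/AnotherSequence.fasta"
fasta_paths_from_dir "$demo_dir"
rm -rf "$demo_dir"
```

In a job script this could be used as <code>--fasta_paths="$(fasta_paths_from_dir "$INPUT_DIR")"</code>. Files are listed in the shell's glob order, so check the result if the order of sequences matters for your run.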


<!--T:33-->
Then, submit the job to the scheduler.
{{Command
|prompt=(alphafold_env) [name@server ~]
|sbatch --job-name alphafold-X alphafold-gpu.sh
}}
}}


== Using singularity ==
== Troubleshooting == <!--T:68-->
If you want to try the containerized version (NOT our preferred option), first read our [[Singularity]] documentation as there are particularities of each cluster that you must take into account.  Then you can [[Singularity#Creating_images_on_Compute_Canada_clusters| build a Singularity container]] like so:
=== Broken pipe error message ===
 
When downloading the database, you may encounter a <code>Client_loop: send disconnect: Broken pipe</code> error message. The exact cause of this error is hard to identify. It could be as simple as an unusually high number of users working on the login node, leaving less space for you to upload data.
<pre>
[name@cluster ~]$ module load singularity
[name@cluster ~]$ singularity build alphafold.sif docker://uvarc/alphafold:2.0.0
</pre>
 


=== Running AlphaFold within Singularity ===
<!--T:69-->
*One solution is to use a [[Prolonging_terminal_sessions#Terminal_multiplexers|terminal multiplexer]]. You could still encounter this error message, but it is less likely.


Here is an example of running the containerized version of AlphaFold2 on a given protein sequence. The protein sequence is saved in FASTA format as below:
<!--T:70-->
[name@cluster ~]$ cat input.fasta
*A second solution is to use the database that is already present on the cluster, in <code>/cvmfs/bio.data.computecanada.ca/content/databases/Core/alphafold2_dbs/2023_07/</code>.
>5ZE6_1
MNLEKINELTAQDMAGVNAAILEQLNSDVQLINQLGYYIVSGGGKRIRPMIAVLAARAVGYEGNAHVTIAALIEFIHTATLLHDDVVDESDMRRGKATANAA
FGNAASVLVGDFIYTRAFQMMTSLGSLKVLEVMSEAVNVIAEGEVLQLMNVNDPDITEENYMRVIYSKTARLFEAAAQCSGILAGCTPEEEKGLQDYGRYLG
TAFQLIDDLLDYNADGEQLGKNVGDDLNEGKPTLPLLHAMHHGTPEQAQMIRTAIEQGNGRHLLEPVLEAMNACGSLEWTRQRAEEEADKAIAALQVLPDTP
WREALIGLAHIAVQRDR
The reference databases and models were downloaded to predict the structure of the above protein sequence.
[name@cluster ~]$ tree databases/
databases/
├── bfd
│   ├── bfd_metaclust_clu_complete_id30_c90_final_seq.sorted_opt_a3m.ffdata
│   ├── bfd_metaclust_clu_complete_id30_c90_final_seq.sorted_opt_a3m.ffindex
│   ├── bfd_metaclust_clu_complete_id30_c90_final_seq.sorted_opt_cs219.ffdata
│   ├── bfd_metaclust_clu_complete_id30_c90_final_seq.sorted_opt_cs219.ffindex
│   ├── bfd_metaclust_clu_complete_id30_c90_final_seq.sorted_opt_hhm.ffdata
│   └── bfd_metaclust_clu_complete_id30_c90_final_seq.sorted_opt_hhm.ffindex
├── mgnify
│   └── mgy_clusters.fa
├── params
│   ├── LICENSE
│   ├── params_model_1.npz
│   ├── params_model_1_ptm.npz
│   ├── params_model_2.npz
│   ├── params_model_2_ptm.npz
│   ├── params_model_3.npz
│   ├── params_model_3_ptm.npz
│   ├── params_model_4.npz
│   ├── params_model_4_ptm.npz
│   ├── params_model_5.npz
│   └── params_model_5_ptm.npz
├── pdb70
│   ├── md5sum
│   ├── pdb70_a3m.ffdata
│   ├── pdb70_a3m.ffindex
│   ├── pdb70_clu.tsv
│   ├── pdb70_cs219.ffdata
│   ├── pdb70_cs219.ffindex
│   ├── pdb70_hhm.ffdata
│   ├── pdb70_hhm.ffindex
│   └── pdb_filter.dat
├── pdb_mmcif
│   ├── mmcif_files
│   │   ├── 100d.cif
│   │   ├── 101d.cif
│   │   ├── 101m.cif
│   │   ├── ...
│   │   ├── ...
│   │   ├── 9wga.cif
│   │   ├── 9xia.cif
│   │   └── 9xim.cif
│   └── obsolete.dat
├── uniclust30
│   └── uniclust30_2018_08
│      ├── uniclust30_2018_08_a3m_db -> uniclust30_2018_08_a3m.ffdata
│      ├── uniclust30_2018_08_a3m_db.index
│      ├── uniclust30_2018_08_a3m.ffdata
│      ├── uniclust30_2018_08_a3m.ffindex
│      ├── uniclust30_2018_08.cs219
│      ├── uniclust30_2018_08_cs219.ffdata
│      ├── uniclust30_2018_08_cs219.ffindex
│      ├── uniclust30_2018_08.cs219.sizes
│      ├── uniclust30_2018_08_hhm_db -> uniclust30_2018_08_hhm.ffdata
│      ├── uniclust30_2018_08_hhm_db.index
│      ├── uniclust30_2018_08_hhm.ffdata
│      ├── uniclust30_2018_08_hhm.ffindex
│      └── uniclust30_2018_08_md5sum
└── uniref90
    └── uniref90.fasta


As an example, let us suppose we want to run alphafold2 from the directory <code>scratch/run_alphafold2</code>.
<!--T:71-->
We create a sub-directory <code>alphafold_output</code> to hold the output files,
*Another option is to download the full database in sections. To see the available download scripts, load the modules and activate your virtual environment, then type <code>download_</code> in your terminal and press the <code>Tab</code> key twice to list them. You can then download individual sections of the database with the corresponding script, for instance <code>download_pdb.sh</code>.
and list the directory contents to ensure that the Singularity image file (<code>.sif</code>) is available:
[name@cluster ~]$ cd scratch/run_alphafold2
[name@cluster run_alphafold2]$ mkdir alphafold_output
[name@cluster run_alphafold2]$ ls
alphafold_output alphafold.sif input.fasta


Alphafold2 launches a couple of multithreaded analyses using up to 8 CPUs before running model inference on the GPU.
</translate>
Memory requirements will vary with different size proteins.
We created a batch input file for the above protein sequence as below.
#!/bin/bash
#SBATCH --job-name alphafold-run
#SBATCH --account=def-someuser
#SBATCH --time=08:00:00
#SBATCH --gres=gpu:1
#SBATCH --cpus-per-task=8
#SBATCH --mem=20G
''#set the environment PATH''
export PYTHONNOUSERSITE=True
module load singularity
ALPHAFOLD_DATA_PATH=/path/to/alphafold/databases
ALPHAFOLD_MODELS=/path/to/alphafold/databases/params
''#Run the command''
singularity run --nv \
  -B $ALPHAFOLD_DATA_PATH:/data \
  -B $ALPHAFOLD_MODELS \
  -B .:/etc \
  --pwd  /app/alphafold alphafold.sif \
  --fasta_paths=input.fasta  \
  --uniref90_database_path=/data/uniref90/uniref90.fasta  \
  --data_dir=/data \
  --mgnify_database_path=/data/mgnify/mgy_clusters.fa  \
  --bfd_database_path=/data/bfd/bfd_metaclust_clu_complete_id30_c90_final_seq.sorted_opt \
  --uniclust30_database_path=/data/uniclust30/uniclust30_2018_08/uniclust30_2018_08 \
  --pdb70_database_path=/data/pdb70/pdb70  \
  --template_mmcif_dir=/data/pdb_mmcif/mmcif_files  \
  --obsolete_pdbs_path=/data/pdb_mmcif/obsolete.dat \
  --max_template_date=2020-05-14  \
  --output_dir=alphafold_output  \
  --model_names='model_1' \
  --preset=casp14
Bind-mount the current working directory to /etc inside the container for the cache file ld.so.cache [-B .:/etc]. The --nv flag is used to enable the GPU support. Submit this job script ('alpharun_jobscript.sh') using the Slurm sbatch command.
[name@cluster run_alphafold2]$ sbatch alpharun_jobscript.sh
On successful completion, the output directory should have the following files:
[name@cluster run_alphafold2]$ tree alphafold_output
alphafold_output
└── input
    ├── features.pkl
    ├── msas
    │   ├── bfd_uniclust_hits.a3m
    │   ├── mgnify_hits.sto
    │   └── uniref90_hits.sto
    ├── ranked_0.pdb
    ├── ranking_debug.json
    ├── relaxed_model_1.pdb
    ├── result_model_1.pkl
    ├── timings.json
    └── unrelaxed_model_1.pdb
2 directories, 10 files

Latest revision as of 12:47, 1 May 2024


AlphaFold is a machine learning model for the prediction of protein folding.

This page discusses how to use AlphaFold v2.0, the version that was entered in CASP14 and published in Nature.

Source code and documentation for AlphaFold can be found at their GitHub page. Any publication that discloses findings arising from use of this source code or the model parameters should cite the AlphaFold paper.

Available versions

AlphaFold is available on our clusters as prebuilt Python packages (wheels). You can list available versions with avail_wheels.

[name@server ~]$ avail_wheels alphafold --all-versions
name       version    python    arch
---------  ---------  --------  -------
alphafold  2.3.1      py3       generic
alphafold  2.3.0      py3       generic
alphafold  2.2.4      py3       generic
alphafold  2.2.3      py3       generic
alphafold  2.2.2      py3       generic
alphafold  2.2.1      py3       generic
alphafold  2.1.1      py3       generic
alphafold  2.0.0      py3       generic

Installing AlphaFold in a Python virtual environment

1. Load AlphaFold dependencies.

[name@server ~]$ module load StdEnv/2020 gcc/9.3.0 openmpi/4.0.3 cuda/11.4 cudnn/8.2.0 kalign/2.03 hmmer/3.2.1 openmm-alphafold/7.5.1 hh-suite/3.3.0 python/3.8

As of July 2022, only Python 3.7 and 3.8 are supported.


2. Create and activate a Python virtual environment.

[name@server ~]$ virtualenv --no-download ~/alphafold_env
[name@server ~]$ source ~/alphafold_env/bin/activate


3. Install a specific version of AlphaFold and its Python dependencies.

(alphafold_env) [name@server ~]$ pip install --no-index --upgrade pip
(alphafold_env) [name@server ~]$ pip install --no-index alphafold==X.Y.Z

where X.Y.Z is the exact desired version, for instance 2.2.4. If you omit the version, the latest one available from the wheelhouse is installed.

4. Validate it.

(alphafold_env) [name@server ~]$ run_alphafold.py --help

5. Freeze the environment and requirements set.

(alphafold_env) [name@server ~]$ pip freeze > ~/alphafold-requirements.txt
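
Before reusing the requirements file in a job, it can help to confirm which AlphaFold version it pins. The following is a minimal sketch, assuming the file contains lines of the form name==version as written by pip freeze; the function name pinned_version is hypothetical and not part of AlphaFold or pip.

```shell
#!/bin/bash
# Hypothetical helper (not part of AlphaFold or pip): print the version
# pinned for a package in a pip requirements file.
# Assumes lines of the form "name==version".
pinned_version() {
    local package="$1" requirements="$2"
    # Match "package==..." case-insensitively, keep the first hit,
    # and print the text after the "==" separator.
    grep -i "^${package}==" "$requirements" | head -n 1 | cut -d= -f3
}

# Example:
# pinned_version alphafold ~/alphafold-requirements.txt
```

If the function prints nothing, the package is not pinned in that file and the environment should be frozen again.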

Databases

Note that AlphaFold requires a set of databases.

The databases are available in /cvmfs/bio.data.computecanada.ca/content/databases/Core/alphafold2_dbs/.

AlphaFold databases on CVMFS undergo yearly updates. In January 2024, the database was updated and is accessible in folder 2024_01.

(alphafold_env) [name@server ~]$ export DOWNLOAD_DIR=/cvmfs/bio.data.computecanada.ca/content/databases/Core/alphafold2_dbs/2024_01/

You can also choose to download the databases locally into your $SCRATCH directory.

Important: The databases must live in the $SCRATCH directory.

1. From a DTN or login node, create the data folder.

(alphafold_env) [name@server ~]$ export DOWNLOAD_DIR=$SCRATCH/alphafold/data
(alphafold_env) [name@server ~]$ mkdir -p $DOWNLOAD_DIR


2. With your modules loaded and virtual environment activated, you can download the data.

(alphafold_env) [name@server ~]$ download_all_data.sh $DOWNLOAD_DIR

Note that this step cannot be done from a compute node. It should be done on a data transfer node (DTN) on clusters that have them (see Transferring data). On clusters that have no DTN, use a login node instead. Since the download can take up to a full day, we suggest using a terminal multiplexer. You may encounter a Client_loop: send disconnect: Broken pipe error message. See Troubleshooting below.
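
Because the unpacked databases occupy a few terabytes (see the directory listings on this page), you may want to check the free space on the target filesystem before starting a day-long download. The following is a rough sketch; the function name has_free_space is hypothetical, and df reports free space on the whole filesystem, which is not necessarily the same as your remaining quota.

```shell
#!/bin/bash
# Hypothetical helper: succeed if the filesystem holding "dir" reports
# at least "required_kb" kilobytes available.
# Assumption: df's figure is a good enough proxy; it ignores user quotas.
has_free_space() {
    local dir="$1" required_kb="$2" available_kb
    # -P forces one POSIX-format line per filesystem; -k reports kilobytes.
    # Column 4 of the second line is the "Available" figure.
    available_kb=$(df -Pk "$dir" | awk 'NR == 2 { print $4 }')
    [ "$available_kb" -ge "$required_kb" ]
}

# Example: require about 3 TB (expressed in kilobytes) on $SCRATCH.
# has_free_space "$SCRATCH" $((3 * 1024 * 1024 * 1024)) || echo "not enough space"
```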

1. Set DOWNLOAD_DIR.

(alphafold_env) [name@server ~]$ export DOWNLOAD_DIR=/datashare/alphafold


Afterwards, the structure of your data should be similar to

(alphafold_env) [name@server ~]$ tree -d $DOWNLOAD_DIR
$DOWNLOAD_DIR/                             # ~ 2.6 TB (total)
    bfd/                                   # ~ 1.8 TB
        # 6 files
    mgnify/                                # ~ 120 GB
        mgy_clusters.fa
    params/                                # ~ 5.3 GB
        # LICENSE
        # 15 models
        # 16 files (total)
    pdb70/                                 # ~ 56 GB
        # 9 files
    pdb_mmcif/                             # ~ 246 GB
        mmcif_files/
            # 202,764 files
        obsolete.dat
    pdb_seqres/                            # ~ 237 MB
        pdb_seqres.txt
    uniprot/                               # ~ 111 GB
        uniprot.fasta
    uniref30/                              # ~ 206 GB
        # 7 files
    uniref90/                              # ~ 73 GB
        uniref90.fasta
(alphafold_env) [name@server ~]$ tree -d $DOWNLOAD_DIR
$DOWNLOAD_DIR/                             # Total: ~ 2.2 TB (download: 428 GB)
    bfd/                                   # ~ 1.8 TB (download: 271.6 GB)
        # 6 files.
    mgnify/                                # ~ 64 GB (download: 32.9 GB)
        mgy_clusters.fa
    params/                                # ~ 3.5 GB (download: 3.5 GB)
        # 5 CASP14 models,
        # 5 pTM models,
        # LICENSE,
        # = 11 files.
    pdb70/                                 # ~ 56 GB (download: 19.5 GB)
        # 9 files.
    pdb_mmcif/                             # ~ 206 GB (download: 46 GB)
        mmcif_files/
            # About 180,000 .cif files.
        obsolete.dat
    uniclust30/                            # ~ 87 GB (download: 24.9 GB)
        uniclust30_2018_08/
            # 13 files.
    uniref90/                              # ~ 59 GB (download: 29.7 GB)
        uniref90.fasta
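
To compare your download against one of the listings above without reading the tree output by eye, a small check like the following can be used. This is a sketch only: missing_databases is a hypothetical helper, and the directory names passed to it are taken from the first listing and must be adapted to the AlphaFold version you use.

```shell
#!/bin/bash
# Hypothetical helper: print the name of every expected sub-directory
# that is missing under "root". Returns non-zero if any is missing.
missing_databases() {
    local root="$1"; shift
    local name status=0
    for name in "$@"; do
        if [ ! -d "${root}/${name}" ]; then
            echo "$name"
            status=1
        fi
    done
    return "$status"
}

# Example, using the directory names of the first listing above:
# missing_databases "$DOWNLOAD_DIR" bfd mgnify params pdb70 pdb_mmcif \
#     pdb_seqres uniprot uniref30 uniref90
```

The check only tests that the directories exist; it does not verify file counts or sizes.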

Running AlphaFold

Performance

Request at most 8 CPU cores when running AlphaFold: it is hardcoded to use no more than 8 and does not benefit from additional cores.
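
As a sanity check before submitting, you can read the core count back out of a job script. The following is a minimal sketch; requested_cpus is a hypothetical helper that only recognizes the #SBATCH --cpus-per-task=N form used in the scripts on this page, not the other spellings that Slurm accepts.

```shell
#!/bin/bash
# Hypothetical helper: print the value of the first
# "#SBATCH --cpus-per-task=N" directive found in a job script.
# Prints nothing if the directive is absent.
requested_cpus() {
    sed -n 's/^#SBATCH[[:space:]]\{1,\}--cpus-per-task=\([0-9]\{1,\}\).*/\1/p' "$1" | head -n 1
}

# Example: warn when a script asks for more than 8 cores.
# cpus=$(requested_cpus alphafold-gpu.sh)
# [ "${cpus:-0}" -le 8 ] || echo "warning: ${cpus} cores requested, AlphaFold uses at most 8"
```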



Edit one of the following submission scripts according to your needs.

File : alphafold-2.3-cpu.sh

#!/bin/bash

#SBATCH --job-name=alphafold_run
#SBATCH --account=def-someprof    # adjust this to match the accounting group you are using to submit jobs
#SBATCH --time=08:00:00           # adjust this to match the walltime of your job
#SBATCH --cpus-per-task=8         # a MAXIMUM of 8 cores; AlphaFold does not benefit from more
#SBATCH --mem=20G                 # adjust this according to the memory you need

# Load module dependencies.
module load StdEnv/2020 gcc/9.3.0 openmpi/4.0.3 cuda/11.4 cudnn/8.2.0 kalign/2.03 hmmer/3.2.1 openmm-alphafold/7.5.1 hh-suite/3.3.0 python/3.8

DOWNLOAD_DIR=$SCRATCH/alphafold/data   # set the appropriate path to your downloaded data
INPUT_DIR=$SCRATCH/alphafold/input     # set the appropriate path to your input data
OUTPUT_DIR=${SCRATCH}/alphafold/output # set the appropriate path to your output data

# Generate your virtual environment in $SLURM_TMPDIR.
virtualenv --no-download ${SLURM_TMPDIR}/env
source ${SLURM_TMPDIR}/env/bin/activate

# Install AlphaFold and its dependencies.
pip install --no-index --upgrade pip
pip install --no-index --requirement ~/alphafold-requirements.txt

# Edit with the proper arguments and run your commands.
# run_alphafold.py --help
run_alphafold.py \
   --fasta_paths=${INPUT_DIR}/YourSequence.fasta,${INPUT_DIR}/AnotherSequence.fasta \
   --output_dir=${OUTPUT_DIR} \
   --data_dir=${DOWNLOAD_DIR} \
   --db_preset=full_dbs \
   --model_preset=multimer \
   --bfd_database_path=${DOWNLOAD_DIR}/bfd/bfd_metaclust_clu_complete_id30_c90_final_seq.sorted_opt \
   --mgnify_database_path=${DOWNLOAD_DIR}/mgnify/mgy_clusters_2022_05.fa \
   --pdb70_database_path=${DOWNLOAD_DIR}/pdb70/pdb70 \
   --template_mmcif_dir=${DOWNLOAD_DIR}/pdb_mmcif/mmcif_files \
   --obsolete_pdbs_path=${DOWNLOAD_DIR}/pdb_mmcif/obsolete.dat \
   --pdb_seqres_database_path=${DOWNLOAD_DIR}/pdb_seqres/pdb_seqres.txt \
   --uniprot_database_path=${DOWNLOAD_DIR}/uniprot/uniprot.fasta \
   --uniref30_database_path=${DOWNLOAD_DIR}/uniref30/UniRef30_2021_03 \
   --uniref90_database_path=${DOWNLOAD_DIR}/uniref90/uniref90.fasta \
   --hhblits_binary_path=${EBROOTHHMINSUITE}/bin/hhblits \
   --hhsearch_binary_path=${EBROOTHHMINSUITE}/bin/hhsearch \
   --jackhmmer_binary_path=${EBROOTHMMER}/bin/jackhmmer \
   --kalign_binary_path=${EBROOTKALIGN}/bin/kalign \
   --max_template_date=2022-01-01 \
   --use_gpu_relax=False


File : alphafold-2.3-gpu.sh

#!/bin/bash

#SBATCH --job-name=alphafold_run
#SBATCH --account=def-someprof    # adjust this to match the accounting group you are using to submit jobs
#SBATCH --time=08:00:00           # adjust this to match the walltime of your job
#SBATCH --cpus-per-task=8         # a MAXIMUM of 8 cores; AlphaFold does not benefit from more
#SBATCH --gres=gpu:1              # a GPU helps to accelerate the inference part only
#SBATCH --mem=20G                 # adjust this according to the memory you need

# Load module dependencies.
module load StdEnv/2020 gcc/9.3.0 openmpi/4.0.3 cuda/11.4 cudnn/8.2.0 kalign/2.03 hmmer/3.2.1 openmm-alphafold/7.5.1 hh-suite/3.3.0 python/3.8

DOWNLOAD_DIR=$SCRATCH/alphafold/data   # set the appropriate path to your downloaded data
INPUT_DIR=$SCRATCH/alphafold/input     # set the appropriate path to your input data
OUTPUT_DIR=${SCRATCH}/alphafold/output # set the appropriate path to your output data

# Generate your virtual environment in $SLURM_TMPDIR.
virtualenv --no-download ${SLURM_TMPDIR}/env
source ${SLURM_TMPDIR}/env/bin/activate

# Install AlphaFold and its dependencies.
pip install --no-index --upgrade pip
pip install --no-index --requirement ~/alphafold-requirements.txt

# Edit with the proper arguments and run your commands.
# run_alphafold.py --help
run_alphafold.py \
   --fasta_paths=${INPUT_DIR}/YourSequence.fasta,${INPUT_DIR}/AnotherSequence.fasta \
   --output_dir=${OUTPUT_DIR} \
   --data_dir=${DOWNLOAD_DIR} \
   --db_preset=full_dbs \
   --model_preset=multimer \
   --bfd_database_path=${DOWNLOAD_DIR}/bfd/bfd_metaclust_clu_complete_id30_c90_final_seq.sorted_opt \
   --mgnify_database_path=${DOWNLOAD_DIR}/mgnify/mgy_clusters_2022_05.fa \
   --pdb70_database_path=${DOWNLOAD_DIR}/pdb70/pdb70 \
   --template_mmcif_dir=${DOWNLOAD_DIR}/pdb_mmcif/mmcif_files \
   --obsolete_pdbs_path=${DOWNLOAD_DIR}/pdb_mmcif/obsolete.dat \
   --pdb_seqres_database_path=${DOWNLOAD_DIR}/pdb_seqres/pdb_seqres.txt \
   --uniprot_database_path=${DOWNLOAD_DIR}/uniprot/uniprot.fasta \
   --uniref30_database_path=${DOWNLOAD_DIR}/uniref30/UniRef30_2021_03 \
   --uniref90_database_path=${DOWNLOAD_DIR}/uniref90/uniref90.fasta \
   --hhblits_binary_path=${EBROOTHHMINSUITE}/bin/hhblits \
   --hhsearch_binary_path=${EBROOTHHMINSUITE}/bin/hhsearch \
   --jackhmmer_binary_path=${EBROOTHMMER}/bin/jackhmmer \
   --kalign_binary_path=${EBROOTKALIGN}/bin/kalign \
   --max_template_date=2022-01-01 \
   --use_gpu_relax=True


File : alphafold-cpu.sh

#!/bin/bash

#SBATCH --job-name=alphafold_run
#SBATCH --account=def-someprof    # adjust this to match the accounting group you are using to submit jobs
#SBATCH --time=08:00:00           # adjust this to match the walltime of your job
#SBATCH --cpus-per-task=8         # a MAXIMUM of 8 cores; AlphaFold does not benefit from more
#SBATCH --mem=20G                 # adjust this according to the memory you need

# Load module dependencies.
module load StdEnv/2020 gcc/9.3.0 openmpi/4.0.3 cuda/11.4 cudnn/8.2.0 kalign/2.03 hmmer/3.2.1 openmm-alphafold/7.5.1 hh-suite/3.3.0 python/3.8

DOWNLOAD_DIR=$SCRATCH/alphafold/data   # set the appropriate path to your downloaded data
INPUT_DIR=$SCRATCH/alphafold/input     # set the appropriate path to your input data
OUTPUT_DIR=${SCRATCH}/alphafold/output # set the appropriate path to your output data

# Generate your virtual environment in $SLURM_TMPDIR.
virtualenv --no-download ${SLURM_TMPDIR}/env
source ${SLURM_TMPDIR}/env/bin/activate

# Install AlphaFold and its dependencies.
pip install --no-index --upgrade pip
pip install --no-index --requirement ~/alphafold-requirements.txt

# Edit with the proper arguments and run your commands.
# Note that the `--uniclust30_database_path` option below was renamed to
# `--uniref30_database_path` in 2.3.
# run_alphafold.py --help
run_alphafold.py \
   --fasta_paths=${INPUT_DIR}/YourSequence.fasta,${INPUT_DIR}/AnotherSequence.fasta \
   --output_dir=${OUTPUT_DIR} \
   --data_dir=${DOWNLOAD_DIR} \
   --model_preset=monomer_casp14 \
   --bfd_database_path=${DOWNLOAD_DIR}/bfd/bfd_metaclust_clu_complete_id30_c90_final_seq.sorted_opt \
   --mgnify_database_path=${DOWNLOAD_DIR}/mgnify/mgy_clusters_2018_12.fa \
   --pdb70_database_path=${DOWNLOAD_DIR}/pdb70/pdb70 \
   --template_mmcif_dir=${DOWNLOAD_DIR}/pdb_mmcif/mmcif_files \
   --obsolete_pdbs_path=${DOWNLOAD_DIR}/pdb_mmcif/obsolete.dat \
   --uniclust30_database_path=${DOWNLOAD_DIR}/uniclust30/uniclust30_2018_08/uniclust30_2018_08  \
   --uniref90_database_path=${DOWNLOAD_DIR}/uniref90/uniref90.fasta  \
   --hhblits_binary_path=${EBROOTHHMINSUITE}/bin/hhblits \
   --hhsearch_binary_path=${EBROOTHHMINSUITE}/bin/hhsearch \
   --jackhmmer_binary_path=${EBROOTHMMER}/bin/jackhmmer \
   --kalign_binary_path=${EBROOTKALIGN}/bin/kalign \
   --max_template_date=2020-05-14 \
   --use_gpu_relax=False


File : alphafold-gpu.sh

#!/bin/bash

#SBATCH --job-name=alphafold_run
#SBATCH --account=def-someprof    # adjust this to match the accounting group you are using to submit jobs
#SBATCH --time=08:00:00           # adjust this to match the walltime of your job
#SBATCH --gres=gpu:1              # a GPU helps to accelerate the inference part only
#SBATCH --cpus-per-task=8         # a MAXIMUM of 8 cores; AlphaFold does not benefit from more
#SBATCH --mem=20G                 # adjust this according to the memory you need

# Load module dependencies.
module load StdEnv/2020 gcc/9.3.0 openmpi/4.0.3 cuda/11.4 cudnn/8.2.0 kalign/2.03 hmmer/3.2.1 openmm-alphafold/7.5.1 hh-suite/3.3.0 python/3.8

DOWNLOAD_DIR=$SCRATCH/alphafold/data   # set the appropriate path to your downloaded data
INPUT_DIR=$SCRATCH/alphafold/input     # set the appropriate path to your input data
OUTPUT_DIR=${SCRATCH}/alphafold/output # set the appropriate path to your output data

# Generate your virtual environment in $SLURM_TMPDIR.
virtualenv --no-download ${SLURM_TMPDIR}/env
source ${SLURM_TMPDIR}/env/bin/activate

# Install AlphaFold and its dependencies.
pip install --no-index --upgrade pip
pip install --no-index --requirement ~/alphafold-requirements.txt

# Edit with the proper arguments and run your commands.
# Note that the `--uniclust30_database_path` option below was renamed to
# `--uniref30_database_path` in 2.3.
# run_alphafold.py --help
run_alphafold.py \
   --fasta_paths=${INPUT_DIR}/YourSequence.fasta,${INPUT_DIR}/AnotherSequence.fasta \
   --output_dir=${OUTPUT_DIR} \
   --data_dir=${DOWNLOAD_DIR} \
   --model_preset=monomer_casp14 \
   --bfd_database_path=${DOWNLOAD_DIR}/bfd/bfd_metaclust_clu_complete_id30_c90_final_seq.sorted_opt \
   --mgnify_database_path=${DOWNLOAD_DIR}/mgnify/mgy_clusters_2018_12.fa \
   --pdb70_database_path=${DOWNLOAD_DIR}/pdb70/pdb70 \
   --template_mmcif_dir=${DOWNLOAD_DIR}/pdb_mmcif/mmcif_files \
   --obsolete_pdbs_path=${DOWNLOAD_DIR}/pdb_mmcif/obsolete.dat \
   --uniclust30_database_path=${DOWNLOAD_DIR}/uniclust30/uniclust30_2018_08/uniclust30_2018_08  \
   --uniref90_database_path=${DOWNLOAD_DIR}/uniref90/uniref90.fasta  \
   --hhblits_binary_path=${EBROOTHHMINSUITE}/bin/hhblits \
   --hhsearch_binary_path=${EBROOTHHMINSUITE}/bin/hhsearch \
   --jackhmmer_binary_path=${EBROOTHMMER}/bin/jackhmmer \
   --kalign_binary_path=${EBROOTKALIGN}/bin/kalign \
   --max_template_date=2020-05-14 \
   --use_gpu_relax=True
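
The --fasta_paths option in the scripts above takes a comma-separated list of files. When an input directory holds many sequences, the list can be built rather than typed. The following is a minimal sketch; join_fasta_paths is a hypothetical helper, and it assumes that every file ending in .fasta in the directory is a separate prediction target.

```shell
#!/bin/bash
# Hypothetical helper: print a comma-separated list of all *.fasta files
# in a directory, in the shell's glob (sorted) order.
join_fasta_paths() {
    local dir="$1" joined="" f
    for f in "$dir"/*.fasta; do
        [ -e "$f" ] || continue   # skip the literal pattern when nothing matches
        joined="${joined:+${joined},}${f}"
    done
    echo "$joined"
}

# Example, inside a submission script:
# run_alphafold.py --fasta_paths="$(join_fasta_paths "$INPUT_DIR")" ...
```

File names containing commas would break the list, so this sketch assumes there are none.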


Then, submit the job to the scheduler.

(alphafold_env) [name@server ~]$ sbatch --job-name alphafold-X alphafold-gpu.sh

Troubleshooting

Broken pipe error message

When downloading the database, you may encounter a Client_loop: send disconnect: Broken pipe error message. The exact cause is hard to pin down. It could be as simple as an unusually high number of users working on the login node, leaving fewer resources for your transfer.

  • One solution is to use a terminal multiplexer. You may still encounter this error message, but it is less likely.
  • A second solution is to use the databases already present on the cluster, for example in /cvmfs/bio.data.computecanada.ca/content/databases/Core/alphafold2_dbs/2023_07/ (the Databases section lists the most recent folder).
  • Another option is to download the full database in sections. After loading the modules and activating your virtual environment, type download_ in your terminal and press the Tab key twice to list the available download scripts. You can then download sections of the database manually with the corresponding script, for instance download_pdb.sh.
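
When downloading in sections, a small retry wrapper can save you from restarting a section by hand after a dropped connection. The following is a sketch only: retry is a hypothetical helper, and whether a re-run resumes or restarts a partial download depends on the individual download script.

```shell
#!/bin/bash
# Hypothetical helper: run a command, retrying up to "max_tries" times
# and sleeping "pause" seconds between attempts.
retry() {
    local max_tries="$1" pause="$2"; shift 2
    local attempt=1
    until "$@"; do
        if [ "$attempt" -ge "$max_tries" ]; then
            echo "giving up after ${attempt} attempts: $*" >&2
            return 1
        fi
        attempt=$((attempt + 1))
        sleep "$pause"
    done
}

# List every command on PATH whose name starts with "download_"
# (the non-interactive equivalent of pressing Tab twice in bash):
# compgen -c download_ | sort -u
#
# Download one section, trying up to 3 times, 60 seconds apart:
# retry 3 60 download_pdb.sh "$DOWNLOAD_DIR"
```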