AlphaFold




Revision as of 05:14, 10 August 2021


This article is a draft

This is not a complete article: This is a draft, a work in progress that is intended to be published into an article, which may or may not be ready for inclusion in the main wiki. It should not necessarily be considered factual or authoritative.



This package provides an implementation of the inference pipeline of AlphaFold v2.0. This is a completely new model that was entered in CASP14 and published in Nature. For simplicity, we refer to this model as AlphaFold throughout the rest of this document.

Any publication that discloses findings arising from using this source code or the model parameters should cite the AlphaFold paper.

The source code of this package, along with its documentation, is available in the AlphaFold GitHub repository (https://github.com/deepmind/alphafold).

Usage in Compute Canada systems

The upstream documentation explains how to run AlphaFold with Docker. Compute Canada clusters do not provide Docker; the supported container runtime is Singularity (see our documentation at https://docs.computecanada.ca/wiki/Singularity). As an alternative to containers, we have created a wheel so that AlphaFold can be installed in a Python virtual environment. AlphaFold requires Python 3.7 or 3.8; the installation examples below use 3.8.

AlphaFold in Python environment

1. AlphaFold has a number of non-Python dependencies that must be loaded ahead of time, for example CUDA, kalign, HMMER, and OpenMM. All of these dependencies are available through our software stack:

[name@cluster ~]$ module load gcc/9 openmpi/4.0.3 cuda/11.2.2 cudnn/8.2.0 kalign/2.03 hmmer/3.2.1 openmm-alphafold/7.5.1 hh-suite/3.3.0 python/3.8

2. Clone the AlphaFold repository in $SCRATCH:

[name@cluster ~]$ cd $SCRATCH
[name@cluster ~]$ git clone https://github.com/deepmind/alphafold.git

3. Create the Python virtual environment and activate it:

[name@cluster ~]$ virtualenv --no-download ~/my_env && source ~/my_env/bin/activate

4. Install AlphaFold and its dependencies:

(my_env)[name@cluster ~]$ pip install --no-index pdbfixer alphafold

Now AlphaFold is ready to be used. Note that to use AlphaFold outside a container, you need to use the run_alphafold.py script that is provided in the repository.
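
Before running predictions, it can help to confirm that the loaded modules and the virtual environment provide what AlphaFold expects. The following is a minimal sketch, not part of AlphaFold or of our software stack: the helper names check_tools and python_version_ok are hypothetical, and the tool names are taken from the binary flags used later on this page.

```shell
# Sketch: sanity-check the environment before running AlphaFold.
# check_tools and python_version_ok are hypothetical helper names.

# Report every required tool that is missing from PATH.
# Returns non-zero if at least one tool is missing.
check_tools() {
    local missing=0 tool
    for tool in "$@"; do
        if ! command -v "$tool" >/dev/null 2>&1; then
            echo "missing: $tool" >&2
            missing=1
        fi
    done
    return "$missing"
}

# Succeed only for the Python versions named on this page (3.7 or 3.8).
python_version_ok() {
    case "$1" in
        3.7|3.7.*|3.8|3.8.*) return 0 ;;
        *) return 1 ;;
    esac
}

# Example usage after "module load ..." and activating the environment:
#   check_tools jackhmmer hhblits hhsearch kalign
#   python_version_ok "$(python -c 'import platform; print(platform.python_version())')"
```

Both helpers only inspect the current shell environment, so they can be run on a login node before submitting a job.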

Creating the virtual environment in the job script

As described in Creating_and_using_a_virtual_environment, you can also create the virtual environment on node-local storage ($SLURM_TMPDIR) inside the job script:


File : my_alphafoldjob.sh

#!/bin/bash
#SBATCH --job-name=alphafold_run
#SBATCH --account=def-someprof # adjust this to match the accounting group you are using to submit jobs
#SBATCH --time=0-03:00         # adjust this to match the walltime of your job
#SBATCH --nodes=1      
#SBATCH --ntasks=1
#SBATCH --gres=gpu:1           # You need to request one GPU to be able to run AlphaFold properly
#SBATCH --cpus-per-task=1      # adjust this if you are using parallel commands
#SBATCH --mem=4000             # adjust this according to the memory requirement per node you need
#SBATCH --mail-user=you@youruniversity.ca # adjust this to match your email address
#SBATCH --mail-type=ALL

# Load your modules as before
module load gcc/9 openmpi/4.0.3 cuda/11.2.2 cudnn/8.2.0 kalign/2.03 hmmer/3.2.1 openmm-alphafold/7.5.1 hh-suite/3.3.0 python/3.8

cd $SCRATCH 

# Generate your virtual environment in $SLURM_TMPDIR
virtualenv --no-download ${SLURM_TMPDIR}/my_env && source ${SLURM_TMPDIR}/my_env/bin/activate

# Install alphafold and dependencies
pip install --no-index scipy==1.4.1 pdbfixer alphafold --upgrade

# Run your commands
python $SCRATCH/alphafold/run_alphafold.py --help
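
The job script above assumes that $SLURM_TMPDIR is set, which is only the case inside a Slurm job. If you want to try the same steps in an interactive shell, one option is to fall back to another temporary directory. The sketch below illustrates this; the helper name venv_dir and the fallback to $TMPDIR or /tmp are assumptions made for illustration.

```shell
# venv_dir: hypothetical helper that picks where the virtual environment goes.
venv_dir() {
    # Inside a Slurm job, SLURM_TMPDIR points at node-local storage.
    # The fallback to TMPDIR or /tmp is an assumption for use outside a job.
    local base="${SLURM_TMPDIR:-${TMPDIR:-/tmp}}"
    echo "${base}/my_env"
}

# Example usage:
#   virtualenv --no-download "$(venv_dir)" && source "$(venv_dir)/bin/activate"
```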


Databases

AlphaFold requires a set of databases that must be downloaded to $SCRATCH. The download scripts in the repository use `aria2c`; we prefer that you use `wget` instead, as described in the steps below.

Important: The databases must be stored in $SCRATCH.

1. Move to the AlphaFold repository and create a directory for the data:

[name@cluster ~]$ cd $SCRATCH/alphafold
[name@cluster ~]$ mkdir data

2. From the repository directory, replace aria2c with wget in all the download scripts with the following command:

[name@cluster alphafold]$ sed -i -e 's/aria2c/wget/g' -e 's/--dir=/-P /g' -e 's/--preserve-permissions//g' scripts/*.sh

3. Use the scripts to download the data:

[name@cluster ~]$ bash scripts/download_all_data.sh $SCRATCH/alphafold/data
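
To see what the sed command in step 2 does before modifying any file, the same substitutions can be applied to a single line. The sample line below is illustrative, written in the style of the download scripts rather than copied from the repository.

```shell
# Sample line in the style of the download scripts (illustrative only).
sample='aria2c "${SOURCE_URL}" --dir="${ROOT_DIR}"'

# Same substitutions as in step 2, applied to the sample instead of the files.
rewritten=$(printf '%s\n' "$sample" |
    sed -e 's/aria2c/wget/g' -e 's/--dir=/-P /g' -e 's/--preserve-permissions//g')

echo "$rewritten"
# prints: wget "${SOURCE_URL}" -P "${ROOT_DIR}"
```

The aria2c option --dir= becomes the wget option -P, which sets the directory where downloaded files are saved.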

Note that the download can take a long time and must not be run on compute nodes. Use the data transfer nodes or the login nodes instead, and run the download inside a screen or tmux session so that it survives a disconnection. If the download path is stored in $DOWNLOAD_DIR, the resulting directory structure should be:

$DOWNLOAD_DIR/                             # Total: ~ 2.2 TB (download: 428 GB)
    bfd/                                   # ~ 1.8 TB (download: 271.6 GB)
        # 6 files.
    mgnify/                                # ~ 64 GB (download: 32.9 GB)
        mgy_clusters.fa
    params/                                # ~ 3.5 GB (download: 3.5 GB)
        # 5 CASP14 models,
        # 5 pTM models,
        # LICENSE,
        # = 11 files.
    pdb70/                                 # ~ 56 GB (download: 19.5 GB)
        # 9 files.
    pdb_mmcif/                             # ~ 206 GB (download: 46 GB)
        mmcif_files/
            # About 180,000 .cif files.
        obsolete.dat
    uniclust30/                            # ~ 87 GB (download: 24.9 GB)
        uniclust30_2018_08/
            # 13 files.
    uniref90/                              # ~ 59 GB (download: 29.7 GB)
        uniref90.fasta

This layout matters because the database paths passed to AlphaFold must match it.
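
A quick way to catch an incomplete download is to check that every top-level directory of the tree above exists. The following is a minimal sketch; check_layout is a hypothetical helper name, and it only tests for the directories, not for the size or integrity of their contents.

```shell
# check_layout: hypothetical helper; reports which of the expected
# top-level database directories are absent under the given root.
check_layout() {
    local root="$1" missing=0 sub
    for sub in bfd mgnify params pdb70 pdb_mmcif uniclust30 uniref90; do
        if [ ! -d "${root}/${sub}" ]; then
            echo "missing directory: ${root}/${sub}" >&2
            missing=1
        fi
    done
    return "$missing"
}

# Example usage:
#   check_layout "$SCRATCH/alphafold/data" && echo "layout looks complete"
```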

Running AlphaFold

Once everything is set up, you can launch a production run of AlphaFold with:


File : my_alphafoldjob.sh

#!/bin/bash
#SBATCH --job-name=alphafold_run
#SBATCH --account=def-someprof # adjust this to match the accounting group you are using to submit jobs
#SBATCH --time=0-03:00         # adjust this to match the walltime of your job
#SBATCH --nodes=1      
#SBATCH --ntasks=1
#SBATCH --gres=gpu:1           # You need to request one GPU to be able to run AlphaFold properly
#SBATCH --cpus-per-task=8      # adjust this if you are using parallel commands
#SBATCH --mem=32G              # adjust this according to the memory requirement per node you need
#SBATCH --mail-user=you@youruniversity.ca # adjust this to match your email address
#SBATCH --mail-type=ALL

cd $SCRATCH 

# Set the path to download dir
DOWNLOAD_DIR=$SCRATCH/alphafold/data  # Set the appropriate path to your downloaded data
DATA_DIR=$SCRATCH/alphafold/input     # Set the appropriate path to your supporting data
REPO_DIR=$SCRATCH/alphafold # Set the appropriate path to AlphaFold's cloned repo

# Load your modules as before
module load gcc openmpi cuda/11.1 cudacore/.11.1.1 cudnn/8.2.0 kalign hmmer hh-suite openmm python/3.7

# Generate your virtual environment in $SLURM_TMPDIR
virtualenv --no-download ${SLURM_TMPDIR}/my_env && source ${SLURM_TMPDIR}/my_env/bin/activate

# Install alphafold and dependencies
pip install --no-index six==1.15 numpy==1.19.2 scipy==1.4.1 pdbfixer alphafold

# Run your commands
# Replace model_1,model_2 with the models you want to use.
# (A comment cannot follow a line-continuation backslash, so it is placed here.)
python ${REPO_DIR}/run_alphafold.py --bfd_database_path ${DOWNLOAD_DIR}/bfd --data_dir ${DATA_DIR} \
   --fasta_paths ${DATA_DIR}/fasta1.fasta,${DATA_DIR}/fasta2.fasta,${DATA_DIR}/fasta3.fasta \
   --hhblits_binary_path ${EBROOTHHMINSUITE}/bin/hhblits \
   --hhsearch_binary_path ${EBROOTHHMINSUITE}/bin/hhsearch \
   --jackhmmer_binary_path ${EBROOTHMMER}/bin/jackhmmer \
   --kalign_binary_path ${EBROOTKALIGN}/bin/kalign \
   --mgnify_database_path ${DOWNLOAD_DIR}/mgnify --model_names model_1,model_2 \
   --output_dir ~/scratch/alphafold_output --pdb70_database_path ${DOWNLOAD_DIR}/pdb70 \
   --template_mmcif_dir ${DATA_DIR}/Templates --uniclust30_database_path ${DOWNLOAD_DIR}/uniclust30 \
   --uniref90_database_path ${DOWNLOAD_DIR}/uniref90
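
The --fasta_paths flag in the script above takes a comma-separated list of files. When there are many inputs, the list can be built from a directory instead of being typed by hand. The following is a minimal sketch; join_fasta_paths is a hypothetical helper name, and it assumes that all inputs end in .fasta.

```shell
# join_fasta_paths: hypothetical helper that joins every *.fasta file in a
# directory into the comma-separated form used by --fasta_paths.
join_fasta_paths() {
    local dir="$1" joined="" f
    for f in "$dir"/*.fasta; do
        [ -e "$f" ] || continue          # skip the literal pattern when nothing matches
        joined="${joined:+${joined},}${f}"
    done
    echo "$joined"
}

# Example usage inside the job script:
#   python ${REPO_DIR}/run_alphafold.py --fasta_paths "$(join_fasta_paths "$DATA_DIR")" ...
```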


Using singularity

If you want to try the containerized version (NOT our preferred option), you can build a Singularity container:

[name@cluster ~]$ module load singularity
[name@cluster ~]$ singularity build alphafold.sif docker://uvarc/alphafold:2.0.0

Before building or running the container, check our Singularity documentation, since each cluster has particularities that need to be taken into account.

Running AlphaFold within Singularity

The following example runs the containerized version of AlphaFold 2 on a single protein sequence, saved in FASTA format:

$ cat input.fasta
>5ZE6_1
MNLEKINELTAQDMAGVNAAILEQLNSDVQLINQLGYYIVSGGGKRIRPMIAVLAARAVGYEGNAHVTIAALIEFIHTATLLHDDVVDESDMRRGKATANAA
FGNAASVLVGDFIYTRAFQMMTSLGSLKVLEVMSEAVNVIAEGEVLQLMNVNDPDITEENYMRVIYSKTARLFEAAAQCSGILAGCTPEEEKGLQDYGRYLG
TAFQLIDDLLDYNADGEQLGKNVGDDLNEGKPTLPLLHAMHHGTPEQAQMIRTAIEQGNGRHLLEPVLEAMNACGSLEWTRQRAEEEADKAIAALQVLPDTP
WREALIGLAHIAVQRDR
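
Since the sequence in a FASTA file may be wrapped over several lines, as it is here, a quick check of the record count and sequence length can catch a truncated or malformed input before a long job is submitted. The following is a minimal sketch; fasta_lengths is a hypothetical helper name.

```shell
# fasta_lengths: hypothetical helper; prints "<id><TAB><residue count>"
# for each record of a FASTA file, summing over wrapped sequence lines.
fasta_lengths() {
    awk '
        /^>/ { if (id != "") print id "\t" len    # flush the previous record
               id = substr($1, 2); len = 0; next }
             { gsub(/[ \t\r]/, ""); len += length($0) }
        END  { if (id != "") print id "\t" len }
    ' "$1"
}

# Example usage:
#   fasta_lengths input.fasta
```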

The following databases and models were downloaded for the prediction.

$ ls
bfd  mgnify  params  pdb70  pdb_mmcif  uniclust30  uniref90

Suppose we want to run AlphaFold 2 from the directory scratch/run_alphafold2:

$ cd scratch/run_alphafold2
$ mkdir output_dir # create directory for the output files
$ ls # list the directory contents to ensure the singularity image file (.sif) is available
output_dir alphaFold.sif

Example job script to submit a batch job

#!/bin/bash
#SBATCH -J alphafold-1
#SBATCH --account=def-plotkin 
#SBATCH --time=08:00:00
#SBATCH --gres=gpu:1
#SBATCH --cpus-per-task=8
#SBATCH --mem=30G

#set the environment PATH
export PYTHONNOUSERSITE=True
ALPHAFOLD_DATA_PATH=/path/to/alphafold/databases
ALPHAFOLD_MODELS=/path/to/alphafold/databases/params

#Run the command
singularity run --nv \
 -B $ALPHAFOLD_DATA_PATH:/data \
 -B $ALPHAFOLD_MODELS \
 -B .:/etc \
 --pwd  /app/alphafold alphaFold.sif \
 --fasta_paths=input.fasta  \
 --uniref90_database_path=/data/uniref90/uniref90.fasta  \
 --data_dir=/data \
 --mgnify_database_path=/data/mgnify/mgy_clusters_2018_12.fa   \
 --bfd_database_path=/data/bfd/bfd_metaclust_clu_complete_id30_c90_final_seq.sorted_opt \
 --uniclust30_database_path=/data/uniclust30/uniclust30_2018_08/uniclust30_2018_08 \
 --pdb70_database_path=/data/pdb70/pdb70  \
 --template_mmcif_dir=/data/pdb_mmcif/mmcif_files  \
 --obsolete_pdbs_path=/data/pdb_mmcif/obsolete.dat \
 --max_template_date=2020-05-14   \
 --output_dir=alphafold_output  \
 --model_names='model_1' \
 --preset=casp14

Submit the job script with the sbatch command:

$ sbatch jobscript.sh
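
On success, sbatch prints a line of the form "Submitted batch job <id>". Keeping the job ID makes it easy to follow the job afterwards. The following is a minimal sketch; job_id_from is a hypothetical helper name, and the usage lines are meant for a cluster login node.

```shell
# job_id_from: hypothetical helper that extracts the numeric job ID from
# the line sbatch prints on success ("Submitted batch job <id>").
job_id_from() {
    printf '%s\n' "$1" | awk '/^Submitted batch job/ { print $4 }'
}

# Example usage on a cluster:
#   jobid=$(job_id_from "$(sbatch jobscript.sh)")
#   squeue -j "$jobid"
```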

On successful completion, the output directory should contain the following files:

$ ls alphafold_output/input/
  features.pkl  ranked_0.pdb        relaxed_model_1.pdb  timings.json
  msas          ranking_debug.json  result_model_1.pkl   unrelaxed_model_1.pdb
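
The listing above can be turned into an automated check at the end of a job. The following is a minimal sketch; check_outputs is a hypothetical helper name, and the file names are exactly those shown in the listing for the model name passed as the second argument.

```shell
# check_outputs: hypothetical helper; verifies that the files listed above
# exist for one model name in a prediction's output directory.
check_outputs() {
    local dir="$1" model="$2" missing=0 f
    for f in features.pkl ranked_0.pdb ranking_debug.json timings.json \
             "relaxed_${model}.pdb" "unrelaxed_${model}.pdb" "result_${model}.pkl"; do
        if [ ! -f "${dir}/${f}" ]; then
            echo "missing: ${dir}/${f}" >&2
            missing=1
        fi
    done
    # msas is a directory, not a file.
    if [ ! -d "${dir}/msas" ]; then
        echo "missing: ${dir}/msas" >&2
        missing=1
    fi
    return "$missing"
}

# Example usage:
#   check_outputs alphafold_output/input model_1 && echo "all outputs present"
```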