AlphaFold: Difference between revisions

No edit summary
(Remove Singularity instructions since they are incomplete and downloading the DB does not work out of the box)
Line 12: Line 12:
Any publication that discloses findings arising from using this source code or the model parameters should [https://github.com/deepmind/alphafold#citing-this-work cite] the [https://doi.org/10.1038/s41586-021-03819-2 AlphaFold paper].


== Using Python wheels == <!--T:4-->
== Available versions == <!--T:5-->
 
AlphaFold is available on our clusters as pre-built Python packages (wheels). You can list available versions with <code>avail_wheels</code>.
=== Available wheels === <!--T:5-->
You can list available wheels using the <code>avail_wheels</code> command.
{{Command
{{Command
|avail_wheels alphafold --all-versions
|avail_wheels alphafold --all-versions
Line 29: Line 27:
}}
}}


=== Installing AlphaFold in a Python virtual environment === <!--T:6-->
== Installing AlphaFold in a Python virtual environment == <!--T:6-->


<!--T:7-->
<!--T:7-->
Line 69: Line 67:
}}
}}


=== Databases === <!--T:11-->
== Databases == <!--T:11-->
Note that AlphaFold requires a set of datasets/databases to be downloaded into the <code>$SCRATCH</code>.


Line 135: Line 133:
}}
}}


=== Running AlphaFold === <!--T:18-->
== Running AlphaFold == <!--T:18-->
{{Warning
{{Warning
|title=Performance
|title=Performance
Line 268: Line 266:
|prompt=(alphafold_env) [name@server ~]
|prompt=(alphafold_env) [name@server ~]
|sbatch --job-name alphafold-X alphafold-gpu.sh
|sbatch --job-name alphafold-X alphafold-gpu.sh
}}
== Using Singularity == <!--T:34-->
The AlphaFold documentation explains how to run the software using Docker. We do not provide Docker; we provide [[Singularity]] instead. However, we recommend using a virtual environment and a Python wheel available from our "wheelhouse".
<!--T:35-->
First, read our [[Singularity]] documentation as there are particularities for each cluster that must be taken into account. Then, [[Singularity#Creating_images_on_Compute_Canada_clusters| build a Singularity container]].
{{Commands
|cd $SCRATCH
|module load singularity
|singularity build alphafold.sif docker://uvarc/alphafold:2.2.0
}}
=== Running AlphaFold within Singularity === <!--T:36-->
{{Warning
|title=Performance
|content=Request at most 8 CPU cores when running AlphaFold: it is hardcoded to use no more than 8 and does not benefit from additional cores.
}}
Create a directory <code>alphafold_output</code> to hold the output files.
{{Command
|mkdir $SCRATCH/alphafold_output
}}
<!--T:37-->
Then, edit the job submission script.
{{File
|name=alphafold-singularity.sh
|lang="bash"
|contents=
#!/bin/bash
<!--T:38-->
#SBATCH --job-name alphafold-run
#SBATCH --account=def-someprof    # adjust this to match the accounting group you are using to submit jobs
#SBATCH --time=08:00:00          # adjust this to match the walltime of your job
#SBATCH --gres=gpu:1              # a GPU helps to accelerate the inference part only
#SBATCH --cpus-per-task=8        # a MAXIMUM of 8 cores; AlphaFold does not benefit from more
#SBATCH --mem=20G                # adjust this according to the memory you need
<!--T:39-->
module load singularity
<!--T:40-->
export PYTHONNOUSERSITE=True
<!--T:41-->
ALPHAFOLD_DATA_PATH=/path/to/alphafold/databases
ALPHAFOLD_MODELS=/path/to/alphafold/databases/params
<!--T:42-->
# v2.3.0 `uniclust30_database_path` argument was renamed to `uniref30_database_path`.
# Run the command
singularity run --nv \
    -B $ALPHAFOLD_DATA_PATH:/data \
    -B $ALPHAFOLD_MODELS \
    -B .:/etc \
    --pwd /app/alphafold alphafold.sif \
    --fasta_paths=/path/to/input.fasta  \
    --uniref90_database_path=/data/uniref90/uniref90.fasta  \
    --data_dir=/data \
    --mgnify_database_path=/data/mgnify/mgy_clusters.fa  \
    --bfd_database_path=/data/bfd/bfd_metaclust_clu_complete_id30_c90_final_seq.sorted_opt \
    --uniref30_database_path=/data/uniclust30/uniclust30_2018_08/uniclust30_2018_08 \
    --pdb70_database_path=/data/pdb70/pdb70  \
    --template_mmcif_dir=/data/pdb_mmcif/mmcif_files  \
    --obsolete_pdbs_path=/data/pdb_mmcif/obsolete.dat \
    --max_template_date=2020-05-14  \
    --output_dir=alphafold_output  \
    --model_names='model_1' \
    --preset=casp14 \
    --use_gpu_relax=True
}}
AlphaFold runs a multithreaded analysis on up to 8 CPU cores before running model inference on the GPU.
Memory requirements vary with the size of the protein.
<!--T:43-->
Bind-mount the current working directory to <code>/etc</code> inside the container (<code>-B .:/etc</code>) for the <code>ld.so.cache</code> cache file. The <code>--nv</code> flag enables GPU support.
Submit the job script (<code>alphafold-singularity.sh</code>) using the Slurm <code>sbatch</code> command.
{{Command
|sbatch alphafold-singularity.sh
}}
<!--T:44-->
On successful completion, the output directory should have the following files:
{{Command
|tree alphafold_output/input
|result=
alphafold_output
└── input
    ├── features.pkl
    ├── msas
    │   ├── bfd_uniclust_hits.a3m
    │   ├── mgnify_hits.sto
    │   └── uniref90_hits.sto
    ├── ranked_0.pdb
    ├── ranking_debug.json
    ├── relaxed_model_1.pdb
    ├── result_model_1.pkl
    ├── timings.json
    └── unrelaxed_model_1.pdb
2 directories, 10 files
}}
}}
</translate>
</translate>

Revision as of 20:27, 6 March 2023


AlphaFold is a machine-learning model for the prediction of protein folding.

This page discusses how to use AlphaFold v2.0, the version that was entered in CASP14 and published in Nature.

Source code and documentation for AlphaFold can be found on its GitHub page. Any publication that discloses findings arising from using this source code or the model parameters should cite the AlphaFold paper.

Available versions

AlphaFold is available on our clusters as pre-built Python packages (wheels). You can list available versions with avail_wheels.

[name@server ~]$ avail_wheels alphafold --all-versions
name       version    python    arch
---------  ---------  --------  -------
alphafold  2.2.4      py3       generic
alphafold  2.2.3      py3       generic
alphafold  2.2.2      py3       generic
alphafold  2.2.1      py3       generic
alphafold  2.1.1      py3       generic
alphafold  2.0.0      py3       generic

Installing AlphaFold in a Python virtual environment

1. Load AlphaFold dependencies.

[name@server ~]$ module load gcc/9.3.0 openmpi/4.0.3 cuda/11.4 cudnn/8.2.0 kalign/2.03 hmmer/3.2.1 openmm-alphafold/7.5.1 hh-suite/3.3.0 python/3.8

As of July 2022, only Python 3.7 and 3.8 are supported.


2. Create and activate a Python virtual environment.

[name@server ~]$ virtualenv --no-download ~/alphafold_env
[name@server ~]$ source ~/alphafold_env/bin/activate


3. Install a specific version of AlphaFold and its Python dependencies.

(alphafold_env) [name@server ~] pip install --no-index --upgrade pip
(alphafold_env) [name@server ~] pip install --no-index alphafold==X.Y.Z

where X.Y.Z is the exact desired version, for instance 2.2.4. To install the latest version available from the wheelhouse, omit the version specifier.

4. Validate it.

(alphafold_env) [name@server ~] run_alphafold.py --help

5. Freeze the environment and requirements set.

(alphafold_env) [name@server ~] pip freeze > ~/alphafold-requirements.txt
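
As a quick sanity check, the frozen requirements file can be inspected to confirm which AlphaFold version was pinned. The sketch below is illustrative only: the function name pinned_version is hypothetical, and it assumes the file contains the name==version lines that pip freeze normally writes.

```shell
#!/bin/bash
# Hypothetical helper (not part of AlphaFold or pip): print the version
# pinned for a package in a pip-freeze style requirements file.
# Assumes lines of the form name==version.
pinned_version() {
    local package="$1" requirements="$2"
    # Match the package name case-insensitively at the start of a line,
    # then keep the text after the "==" separator.
    grep -i "^${package}==" "$requirements" | head -n 1 | cut -d= -f3
}

# Example usage (assumes the file created in step 5 exists):
# pinned_version alphafold ~/alphafold-requirements.txt
```

If the function prints nothing, the package is not pinned in that file and the environment rebuilt inside a job may differ from the one you validated.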

Databases

AlphaFold requires a set of databases to be downloaded into your $SCRATCH space.

Important: The databases must reside in $SCRATCH.

1. From a login node, create the data folder.

(alphafold_env) [name@server ~] export DOWNLOAD_DIR=$SCRATCH/alphafold/data
(alphafold_env) [name@server ~] mkdir -p $DOWNLOAD_DIR


2. With your virtual environment activated, you can download the data.

(alphafold_env) [name@server ~] download_all_data.sh $DOWNLOAD_DIR

Note that this step must be done from a login node; it cannot be done from a compute node. Since the download might take a while, we suggest starting it in a screen or Tmux session.

1. Set DOWNLOAD_DIR.

(alphafold_env) [name@server ~] export DOWNLOAD_DIR=/datashare/alphafold

Afterwards, the structure of your data should be similar to

(alphafold_env) [name@server ~] tree -d $DOWNLOAD_DIR
$DOWNLOAD_DIR/                             # Total: ~ 2.2 TB (download: 428 GB)
    bfd/                                   # ~ 1.8 TB (download: 271.6 GB)
        # 6 files.
    mgnify/                                # ~ 64 GB (download: 32.9 GB)
        mgy_clusters.fa
    params/                                # ~ 3.5 GB (download: 3.5 GB)
        # 5 CASP14 models,
        # 5 pTM models,
        # LICENSE,
        # = 11 files.
    pdb70/                                 # ~ 56 GB (download: 19.5 GB)
        # 9 files.
    pdb_mmcif/                             # ~ 206 GB (download: 46 GB)
        mmcif_files/
            # About 180,000 .cif files.
        obsolete.dat
    uniclust30/                            # ~ 87 GB (download: 24.9 GB)
        uniclust30_2018_08/
            # 13 files.
    uniref90/                              # ~ 59 GB (download: 29.7 GB)
        uniref90.fasta
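
Before submitting a job, it can help to confirm that the download directory contains the top-level folders listed above. The following is a minimal sketch, not an official tool: the function name missing_databases is hypothetical, and the folder list is simply copied from the tree above, so adjust it if your AlphaFold version uses a different layout.

```shell
#!/bin/bash
# Hypothetical helper: print the name of every expected database folder
# that is missing under the given download directory.
# The folder list mirrors the directory tree shown above (an assumption
# for other AlphaFold versions).
missing_databases() {
    local download_dir="$1" name
    for name in bfd mgnify params pdb70 pdb_mmcif uniclust30 uniref90; do
        [ -d "${download_dir}/${name}" ] || echo "$name"
    done
}

# Example usage:
# missing=$(missing_databases "$DOWNLOAD_DIR")
# if [ -n "$missing" ]; then echo "Missing folders: $missing" >&2; fi
```

The check only looks at folder names; it does not verify that each download completed or that the files inside are intact.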

Running AlphaFold

Performance

Request at most 8 CPU cores when running AlphaFold: it is hardcoded to use no more than 8 and does not benefit from additional cores.
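
To make the limit explicit in a job script, a small guard can warn when more than 8 cores were requested. This is only a sketch: the function name effective_cpus is hypothetical, and the limit of 8 is taken from the note above rather than queried from AlphaFold.

```shell
#!/bin/bash
# Hypothetical helper: cap a requested core count at the 8 cores
# that AlphaFold can use according to the note above.
effective_cpus() {
    local requested="${1:-1}" limit=8
    if [ "$requested" -gt "$limit" ]; then
        echo "$limit"
    else
        echo "$requested"
    fi
}

# Example usage inside a Slurm job script:
# if [ "${SLURM_CPUS_PER_TASK:-1}" -gt 8 ]; then
#     echo "Warning: only $(effective_cpus "$SLURM_CPUS_PER_TASK") cores will be used" >&2
# fi
```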



Edit the following submission script according to your needs.

File : alphafold-cpu.sh

#!/bin/bash

#SBATCH --job-name=alphafold_run
#SBATCH --account=def-someprof    # adjust this to match the accounting group you are using to submit jobs
#SBATCH --time=08:00:00           # adjust this to match the walltime of your job
#SBATCH --cpus-per-task=8         # a MAXIMUM of 8 cores; AlphaFold does not benefit from more
#SBATCH --mem=20G                 # adjust this according to the memory you need

# Load modules dependencies
module load gcc/9.3.0 openmpi/4.0.3 cuda/11.4 cudnn/8.2.0 kalign/2.03 hmmer/3.2.1 openmm-alphafold/7.5.1 hh-suite/3.3.0 python/3.8

DOWNLOAD_DIR=$SCRATCH/alphafold/data   # set the appropriate path to your downloaded data
INPUT_DIR=$SCRATCH/alphafold/input     # set the appropriate path to your input data
OUTPUT_DIR=${SCRATCH}/alphafold/output # set the appropriate path to your output data

# Generate your virtual environment in $SLURM_TMPDIR
virtualenv --no-download ${SLURM_TMPDIR}/env
source ${SLURM_TMPDIR}/env/bin/activate

# Install alphafold and its dependencies
pip install --no-index --upgrade pip
pip install --no-index --requirement ~/alphafold-requirements.txt

# Edit with the proper arguments, run your commands
# Note: as of v2.3.0, the `uniclust30_database_path` argument is named `uniref30_database_path`;
# with earlier versions, pass --uniclust30_database_path instead.
# run_alphafold.py --help
run_alphafold.py \
   --data_dir=${DOWNLOAD_DIR} \
   --fasta_paths=${INPUT_DIR}/YourSequence.fasta,${INPUT_DIR}/AnotherSequence.fasta \
   --bfd_database_path=${DOWNLOAD_DIR}/bfd/bfd_metaclust_clu_complete_id30_c90_final_seq.sorted_opt \
   --pdb70_database_path=${DOWNLOAD_DIR}/pdb70/pdb70 \
   --template_mmcif_dir=${DOWNLOAD_DIR}/pdb_mmcif/mmcif_files \
   --uniref30_database_path=${DOWNLOAD_DIR}/uniclust30/uniclust30_2018_08/uniclust30_2018_08  \
   --uniref90_database_path=${DOWNLOAD_DIR}/uniref90/uniref90.fasta  \
   --hhblits_binary_path=${EBROOTHHMINSUITE}/bin/hhblits \
   --hhsearch_binary_path=${EBROOTHHMINSUITE}/bin/hhsearch \
   --jackhmmer_binary_path=${EBROOTHMMER}/bin/jackhmmer \
   --kalign_binary_path=${EBROOTKALIGN}/bin/kalign \
   --mgnify_database_path=${DOWNLOAD_DIR}/mgnify/mgy_clusters_2018_12.fa \
   --output_dir=${OUTPUT_DIR} \
   --obsolete_pdbs_path=${DOWNLOAD_DIR}/pdb_mmcif/obsolete.dat \
   --max_template_date=2020-05-14 \
   --model_preset=monomer_casp14 \
   --use_gpu_relax=False
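
Because the name of the UniRef30/Uniclust30 argument depends on the installed version (see the comment in the script), the flag can be chosen programmatically. The sketch below is an assumption-laden illustration: the function name uniref30_flag is hypothetical, the 2.3.0 threshold comes from the script comment, and the comparison relies on the -V (version sort) option of GNU sort.

```shell
#!/bin/bash
# Hypothetical helper: print the database flag name that matches an
# AlphaFold version, following the rename noted in the script comment.
uniref30_flag() {
    local version="$1" oldest
    # sort -V orders version strings; if 2.3.0 sorts first (or ties),
    # the given version is 2.3.0 or newer.
    oldest=$(printf '%s\n' "2.3.0" "$version" | sort -V | head -n 1)
    if [ "$oldest" = "2.3.0" ]; then
        echo "--uniref30_database_path"
    else
        echo "--uniclust30_database_path"
    fi
}

# Example usage in the job script (the version string is an example):
# run_alphafold.py ... \
#    "$(uniref30_flag 2.2.4)=${DOWNLOAD_DIR}/uniclust30/uniclust30_2018_08/uniclust30_2018_08" ...
```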


File : alphafold-gpu.sh

#!/bin/bash

#SBATCH --job-name=alphafold_run
#SBATCH --account=def-someprof    # adjust this to match the accounting group you are using to submit jobs
#SBATCH --time=08:00:00           # adjust this to match the walltime of your job
#SBATCH --gres=gpu:1              # a GPU helps to accelerate the inference part only
#SBATCH --cpus-per-task=8         # a MAXIMUM of 8 cores; AlphaFold does not benefit from more
#SBATCH --mem=20G                 # adjust this according to the memory you need

# Load modules dependencies
module load gcc/9.3.0 openmpi/4.0.3 cuda/11.4 cudnn/8.2.0 kalign/2.03 hmmer/3.2.1 openmm-alphafold/7.5.1 hh-suite/3.3.0 python/3.8

DOWNLOAD_DIR=$SCRATCH/alphafold/data   # set the appropriate path to your downloaded data
INPUT_DIR=$SCRATCH/alphafold/input     # set the appropriate path to your input data
OUTPUT_DIR=${SCRATCH}/alphafold/output # set the appropriate path to your output data

# Generate your virtual environment in $SLURM_TMPDIR
virtualenv --no-download ${SLURM_TMPDIR}/env
source ${SLURM_TMPDIR}/env/bin/activate

# Install alphafold and its dependencies
pip install --no-index --upgrade pip
pip install --no-index --requirement ~/alphafold-requirements.txt

# Edit with the proper arguments, run your commands
# Note: as of v2.3.0, the `uniclust30_database_path` argument is named `uniref30_database_path`;
# with earlier versions, pass --uniclust30_database_path instead.
# run_alphafold.py --help
run_alphafold.py \
   --data_dir=${DOWNLOAD_DIR} \
   --fasta_paths=${INPUT_DIR}/YourSequence.fasta,${INPUT_DIR}/AnotherSequence.fasta \
   --bfd_database_path=${DOWNLOAD_DIR}/bfd/bfd_metaclust_clu_complete_id30_c90_final_seq.sorted_opt \
   --pdb70_database_path=${DOWNLOAD_DIR}/pdb70/pdb70 \
   --template_mmcif_dir=${DOWNLOAD_DIR}/pdb_mmcif/mmcif_files \
   --uniref30_database_path=${DOWNLOAD_DIR}/uniclust30/uniclust30_2018_08/uniclust30_2018_08  \
   --uniref90_database_path=${DOWNLOAD_DIR}/uniref90/uniref90.fasta  \
   --hhblits_binary_path=${EBROOTHHMINSUITE}/bin/hhblits \
   --hhsearch_binary_path=${EBROOTHHMINSUITE}/bin/hhsearch \
   --jackhmmer_binary_path=${EBROOTHMMER}/bin/jackhmmer \
   --kalign_binary_path=${EBROOTKALIGN}/bin/kalign \
   --mgnify_database_path=${DOWNLOAD_DIR}/mgnify/mgy_clusters_2018_12.fa \
   --output_dir=${OUTPUT_DIR} \
   --obsolete_pdbs_path=${DOWNLOAD_DIR}/pdb_mmcif/obsolete.dat \
   --max_template_date=2020-05-14 \
   --model_preset=monomer_casp14 \
   --use_gpu_relax=True
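
The scripts above pass several sequences to --fasta_paths as a comma-separated list. When the input directory holds many files, that list can be built automatically. This is a minimal sketch under stated assumptions: the function name fasta_paths is hypothetical, and it assumes all inputs end in .fasta and sit directly in the given directory.

```shell
#!/bin/bash
# Hypothetical helper: join every .fasta file in a directory into the
# comma-separated list expected by --fasta_paths.
fasta_paths() {
    local input_dir="$1" file list=""
    for file in "$input_dir"/*.fasta; do
        # Skip the unexpanded pattern when no file matches.
        [ -e "$file" ] || continue
        # Prepend a comma only when the list already has an entry.
        list="${list:+${list},}${file}"
    done
    echo "$list"
}

# Example usage in the job script:
# run_alphafold.py --fasta_paths="$(fasta_paths "$INPUT_DIR")" ...
```

An empty result means no .fasta file was found, so it is worth checking the value before launching the run.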


Then, submit the job to the scheduler.

(alphafold_env) [name@server ~] sbatch --job-name alphafold-X alphafold-gpu.sh