{{Draft}}
<languages />
<translate>
<!--T:1-->
MetaPhlAn is a "computational tool for profiling the composition of microbial communities (Bacteria, Archaea and Eukaryotes) from metagenomic shotgun sequencing data (i.e. not 16S) with species-level resolution. With StrainPhlAn, it is possible to perform accurate strain-level microbial profiling", according to its [https://github.com/biobakery/MetaPhlAn GitHub repository]. While the software stack on our clusters does contain modules for a couple of older versions (2.2.0 and 2.8) of this software, we now expect users to install recent versions using a [[Python#Creating_and_using_a_virtual_environment | Python virtual environment]].
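If you want to experiment with MetaPhlAn interactively before writing a job script, you can build the same virtual environment on a login node. A minimal sketch, assuming the modules used in the job script below and an environment under <tt>$HOME/ENV</tt> (the path is just an example):
{{Commands
|module load gcc blast samtools bedtools bowtie2 python/3.10
|virtualenv --no-download $HOME/ENV
|source $HOME/ENV/bin/activate
|pip install --no-index --upgrade pip
|pip install --no-index metaphlan
}}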


<!--T:2-->
For more information on how to use MetaPhlAn, see their [https://github.com/biobakery/MetaPhlAn/wiki/MetaPhlAn-4 wiki].


= Available wheels = <!--T:3-->
You can list available wheels using the <tt>avail_wheels</tt> command:
{{Command
|avail_wheels metaphlan --all-versions
|result=
name      version    python    arch
---------  ---------  --------  -------
MetaPhlAn  4.0.3      py3       generic
MetaPhlAn  3.0.7      py3       generic
}}
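To install one of these versions rather than whatever is newest, pin it explicitly when running <tt>pip</tt> inside your virtual environment; for example, assuming you want version 4.0.3:
{{Command|prompt=(ENV) [name@server ~]|pip install --no-index metaphlan{{=}}{{=}}4.0.3}}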
 
= Downloading databases = <!--T:4-->
Note that MetaPhlAn requires a set of databases to be downloaded into <tt>$SCRATCH</tt>.
 
<!--T:5-->
'''Important:''' The databases must live in <tt>$SCRATCH</tt>.
 
<!--T:6-->
Databases can be downloaded from the [http://cmprod1.cibio.unitn.it/biobakery4/metaphlan_databases Segatalab FTP server].
 
<!--T:7-->
1. From a login node, create the data folder:
{{Commands
|export DB_DIR{{=}}$SCRATCH/metaphlan_databases
|mkdir -p $DB_DIR
|cd $DB_DIR
}}
 
<!--T:8-->
2. Download the data:
{{Command
|parallel wget ::: http://cmprod1.cibio.unitn.it/biobakery4/metaphlan_databases/mpa_vJan21_CHOCOPhlAnSGB_202103.tar http://cmprod1.cibio.unitn.it/biobakery4/metaphlan_databases/mpa_vJan21_CHOCOPhlAnSGB_202103_marker_info.txt.bz2 http://cmprod1.cibio.unitn.it/biobakery4/metaphlan_databases/mpa_vJan21_CHOCOPhlAnSGB_202103_species.txt.bz2
}}
Note that this step '''cannot''' be done from a compute node but must be done from a login node.
 
<!--T:9-->
3. Extract the downloaded data, for example using an interactive job:
{{Command
|salloc --account{{=}}<your account> --cpus-per-task{{=}}2 --mem{{=}}10G
}}
Untar and unzip the databases:
{{Commands
| tar -xf mpa_vJan21_CHOCOPhlAnSGB_202103.tar
| parallel bunzip2 ::: *.bz2
}}
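To confirm that the extraction completed, you can list the contents of the database directory; the exact file names will depend on the database version you downloaded:
{{Command|ls -lh $DB_DIR}}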
 
= Running MetaPhlAn = <!--T:10-->
Once the database files have been downloaded and extracted, you can submit a job. You may edit the following job submission script
according to your needs:
{{File
   |name=metaphlan-job.sh
   |lang="sh"
   |contents=
#!/bin/bash
<!--T:11-->
#SBATCH --account=def-someuser
#SBATCH --time=01:00:00
#SBATCH --cpus-per-task=4        # Number of cores
#SBATCH --mem=15G                # Requires at least 15 GB of memory

<!--T:12-->
# Load the required modules
module load gcc blast samtools bedtools bowtie2 python/3.10

<!--T:13-->
# Move to scratch space
cd $SCRATCH

<!--T:14-->
DB_DIR{{=}}$SCRATCH/metaphlan_databases

<!--T:15-->
# Generate your virtual environment in $SLURM_TMPDIR
virtualenv --no-download ${SLURM_TMPDIR}/env
source ${SLURM_TMPDIR}/env/bin/activate

<!--T:16-->
# Install MetaPhlAn and its dependencies
pip install --no-index --upgrade pip
pip install --no-index metaphlan==X.Y.Z  # EDIT: the required version here, e.g. 4.0.3

<!--T:17-->
# Reuse the number of cores allocated to the job by `--cpus-per-task=4`
# It is important to use --index and --bowtie2db so that MetaPhlAn can run inside the job
metaphlan metagenome.fastq --input_type fastq -o profiled_metagenome.txt --nproc $SLURM_CPUS_PER_TASK --index mpa_vJan21_CHOCOPhlAnSGB_202103 --bowtie2db $DB_DIR --bowtie2out metagenome.bowtie2.bz2
}}
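Because the script saves the intermediate alignment with <tt>--bowtie2out</tt>, you can re-profile the same sample later without redoing the bowtie2 mapping. A minimal sketch, reusing the file names and database flags from the script above:
{{Command|metaphlan metagenome.bowtie2.bz2 --input_type bowtie2out -o profiled_metagenome.txt --nproc 4 --index mpa_vJan21_CHOCOPhlAnSGB_202103 --bowtie2db $DB_DIR}}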
<!--T:18-->
Then submit the job to the scheduler:
{{Command
|sbatch metaphlan-job.sh
}}
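Once the job completes, the profile is a plain-text table that you can inspect directly. If you have profiled several samples, the <tt>merge_metaphlan_tables.py</tt> utility distributed with MetaPhlAn can combine them into one table; the file names here are only examples:
{{Commands
|head profiled_metagenome.txt
|merge_metaphlan_tables.py profiled_*.txt > merged_abundance_table.txt
}}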
</translate>
