MetaPhlAn
From Alliance Doc

{{Draft}}


MetaPhlAn is a "computational tool for profiling the composition of microbial communities (Bacteria, Archaea and Eukaryotes) from metagenomic shotgun sequencing data (i.e. not 16S) with species-level resolution. With StrainPhlAn, it is possible to perform accurate strain-level microbial profiling", according to its [https://github.com/biobakery/MetaPhlAn GitHub repository]. While the software stack on our clusters contains modules for a couple of older versions (2.2.0 and 2.8) of this software, we now expect users to install recent versions using a [[Python#Creating_and_using_a_virtual_environment | Python virtual environment]].
For more information on how to use MetaPhlAn, see their [https://github.com/biobakery/MetaPhlAn/wiki/MetaPhlAn-4 wiki].
= Available wheels =
You can list available wheels using the <tt>avail_wheels</tt> command:
{{Command
|avail_wheels metaphlan --all-versions
|result=
name       version    python    arch
---------  ---------  --------  -------
MetaPhlAn  4.0.3      py3       generic
MetaPhlAn  3.0.7      py3       generic
}}
 
= Downloading databases =
Note that MetaPhlAn requires a set of databases, which must be downloaded into your scratch space.

'''Important:''' The databases must live under <tt>$SCRATCH</tt>.
 
Databases can be downloaded from the [http://cmprod1.cibio.unitn.it/biobakery4/metaphlan_databases Segatalab FTP server].
 
1. From a login node, create the data folder:
{{Commands
|export DB_DIR{{=}}$SCRATCH/metaphlan_databases
|mkdir -p $DB_DIR
|cd $DB_DIR
}}
 
2. Download the data:
{{Command
|parallel wget ::: http://cmprod1.cibio.unitn.it/biobakery4/metaphlan_databases/mpa_vJan21_CHOCOPhlAnSGB_202103.tar http://cmprod1.cibio.unitn.it/biobakery4/metaphlan_databases/mpa_vJan21_CHOCOPhlAnSGB_202103_marker_info.txt.bz2 http://cmprod1.cibio.unitn.it/biobakery4/metaphlan_databases/mpa_vJan21_CHOCOPhlAnSGB_202103_species.txt.bz2
}}
 
Note that this step '''cannot''' be done from a compute node; it must be run from a login node, since compute nodes do not have Internet access on most clusters.
 
3. Unpack the data:
From an interactive job:
{{Command
|salloc --account{{=}}<your account> --cpus-per-task{{=}}2 --mem{{=}}10G
}}
Untar and unzip the databases:
{{Commands
| tar -xf mpa_vJan21_CHOCOPhlAnSGB_202103.tar
| parallel bunzip2 ::: *.bz2
}}
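Before submitting an analysis job, you may want to check that the expected database files were extracted and are not empty. The following is a minimal sketch; the <tt>check_db</tt> helper is illustrative, not part of MetaPhlAn:

```shell
# Illustrative helper: fail if any listed file is missing or empty.
check_db() {
    local dir=$1; shift
    local f
    for f in "$@"; do
        if [ ! -s "$dir/$f" ]; then
            echo "missing or empty: $f" >&2
            return 1
        fi
    done
    echo "database files look complete"
}

# Only run the check if the database directory actually exists.
if [ -d "${SCRATCH:-}/metaphlan_databases" ]; then
    check_db "${SCRATCH:-}/metaphlan_databases" \
        mpa_vJan21_CHOCOPhlAnSGB_202103_marker_info.txt \
        mpa_vJan21_CHOCOPhlAnSGB_202103_species.txt
fi
```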
 
= Running MetaPhlAn =
Once the databases are downloaded and unpacked, you can submit a job.
Edit the following submission script to your needs:
{{File
   |name=metaphlan-job.sh
   |lang="sh"
   |contents=
#!/bin/bash

#SBATCH --account=def-someuser
#SBATCH --time=01:00:00
#SBATCH --cpus-per-task=4        # Number of cores
#SBATCH --mem=15G                # requires at least 15 GB of memory

# Load the required modules
module load gcc blast samtools bedtools bowtie2 python/3.10

# Move to the scratch space
cd $SCRATCH

DB_DIR{{=}}$SCRATCH/metaphlan_databases

# Generate your virtual environment in $SLURM_TMPDIR
virtualenv --no-download ${SLURM_TMPDIR}/env
source ${SLURM_TMPDIR}/env/bin/activate

# Install metaphlan and its dependencies
pip install --no-index --upgrade pip
pip install --no-index metaphlan==X.Y.Z  # EDIT: the required version here, e.g. 4.0.3

# Reuse the number of cores allocated to the job via --cpus-per-task
# It is important to use --index and --bowtie2db so that MetaPhlAn can run inside the job
metaphlan metagenome.fastq --input_type fastq -o profiled_metagenome.txt -nproc $SLURM_CPUS_PER_TASK --index mpa_vJan21_CHOCOPhlAnSGB_202103 --bowtie2db $DB_DIR --bowtie2out metagenome.bowtie2.bz2
}}
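After submitting the script with <tt>sbatch</tt>, the finished job writes <tt>profiled_metagenome.txt</tt>, which contains one tab-separated row per detected clade, from kingdom down to species. A small illustrative helper for keeping only the species-level rows (the <tt>species_rows</tt> name is ours; the <tt>s__</tt>/<tt>t__</tt> prefixes are MetaPhlAn's taxonomy markers, with <tt>t__</tt> denoting sub-species rows):

```shell
# Illustrative helper: keep only species-level rows of a MetaPhlAn profile.
# Clade names use prefixes k__, p__, ..., s__ (species), t__ (sub-species).
species_rows() {
    grep "s__" "$1" | grep -v "t__"
}

# Example usage (guarded so the snippet is safe to run anywhere):
if [ -f profiled_metagenome.txt ]; then
    species_rows profiled_metagenome.txt
fi
```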

Revision as of 17:30, 9 November 2022