MetaPhlAn
MetaPhlAn is a "computational tool for profiling the composition of microbial communities (Bacteria, Archaea and Eukaryotes) from metagenomic shotgun sequencing data (i.e. not 16S) with species-level. With StrainPhlAn, it is possible to perform accurate strain-level microbial profiling", according to its GitHub repository. While the software stack on our clusters does contain modules for a couple of older versions (2.2.0 and 2.8) of this software, we now expect users to install recent versions using a Python virtual environment.
For more information on how to use MetaPhlAn, see its wiki.
Available wheels
You can list available wheels using the avail_wheels command:
[name@server ~]$ avail_wheels metaphlan --all-versions
name version python arch
--------- --------- -------- -------
MetaPhlAn 4.0.3 py3 generic
MetaPhlAn 3.0.7 py3 generic
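To check whether a specific release is available before pinning it in a job script, you can also filter on a version. The --version option shown here is an assumption about the avail_wheels interface; consult avail_wheels --help if it differs:
[name@server ~]$ avail_wheels metaphlan --version 4.0.3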
Downloading databases
Note that MetaPhlAn requires a set of databases, which must be downloaded into $SCRATCH.
Important: The databases must live in $SCRATCH.
Databases can be downloaded from the Segatalab FTP server.
1. From a login node, create the data folder:
[name@server ~]$ export DB_DIR=$SCRATCH/metaphlan_databases
[name@server ~]$ mkdir -p $DB_DIR
[name@server ~]$ cd $DB_DIR
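Recent database releases occupy tens of gigabytes once extracted, so it may be worth checking beforehand that your scratch space has enough room; on our clusters this can be done with:
[name@server ~]$ diskusage_report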
2. Download the data:
[name@server ~]$ parallel wget ::: http://cmprod1.cibio.unitn.it/biobakery4/metaphlan_databases/mpa_vJan21_CHOCOPhlAnSGB_202103.tar http://cmprod1.cibio.unitn.it/biobakery4/metaphlan_databases/mpa_vJan21_CHOCOPhlAnSGB_202103_marker_info.txt.bz2 http://cmprod1.cibio.unitn.it/biobakery4/metaphlan_databases/mpa_vJan21_CHOCOPhlAnSGB_202103_species.txt.bz2
Note that this step cannot be done from a compute node, but must be done from a login node, since compute nodes generally do not have access to the Internet.
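Large downloads are occasionally truncated. Before extracting, a quick listing confirms that all three files arrived and have plausible sizes (the tar archive alone is several gigabytes):
[name@server ~]$ ls -lh $DB_DIR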
3. Extract the downloaded data, for example using an interactive job:
[name@server ~]$ salloc --account=<your account> --cpus-per-task=2 --mem=10G
Untar and unzip the databases:
[name@server ~]$ tar -xf mpa_vJan21_CHOCOPhlAnSGB_202103.tar
[name@server ~]$ parallel bunzip2 ::: *.bz2
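After extraction, the database directory should contain, among other files, a pickled taxonomy file (*.pkl) and pre-built Bowtie2 index files (*.bt2l); exact file names vary between database releases:
[name@server ~]$ ls $DB_DIR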
Running MetaPhlAn
Once the database files have been downloaded and extracted, you can submit a job. You may edit the following job submission script according to your needs:
#!/bin/bash
#SBATCH --account=def-someuser
#SBATCH --time=01:00:00
#SBATCH --cpus-per-task=4 # Number of cores
#SBATCH --mem=15G # requires at least 15 GB of memory
# Load the required modules
module load gcc blast samtools bedtools bowtie2 python/3.10
# Move to the scratch space
cd $SCRATCH
DB_DIR=$SCRATCH/metaphlan_databases
# Generate your virtual environment in $SLURM_TMPDIR
virtualenv --no-download ${SLURM_TMPDIR}/env
source ${SLURM_TMPDIR}/env/bin/activate
# Install metaphlan and its dependencies
pip install --no-index --upgrade pip
pip install --no-index metaphlan==X.Y.Z  # EDIT: set the required version here, e.g. 4.0.3
# Reuse the number of cores allocated to our job via `--cpus-per-task=4`
# It is important to use --index and --bowtie2db so that MetaPhlAn can run inside the job
metaphlan metagenome.fastq --input_type fastq -o profiled_metagenome.txt --nproc $SLURM_CPUS_PER_TASK --index mpa_vJan21_CHOCOPhlAnSGB_202103 --bowtie2db $DB_DIR --bowtie2out metagenome.bowtie2.bz2
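The input file metagenome.fastq above is a placeholder for your own data. If you have several samples, the single metaphlan call can be replaced by a loop over your FASTQ files; the samples directory and naming scheme below are hypothetical:
# Hypothetical sketch: profile every FASTQ file in $SCRATCH/samples
for fq in $SCRATCH/samples/*.fastq; do
    sample=$(basename "$fq" .fastq)
    metaphlan "$fq" --input_type fastq \
              -o "profiled_${sample}.txt" \
              --nproc $SLURM_CPUS_PER_TASK \
              --index mpa_vJan21_CHOCOPhlAnSGB_202103 \
              --bowtie2db $DB_DIR \
              --bowtie2out "${sample}.bowtie2.bz2"
done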
Then submit the job to the scheduler:
[name@server ~]$ sbatch metaphlan-job.sh
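You can monitor the job with squeue -u $USER. Once it completes, profiled_metagenome.txt is a plain-text table listing the detected taxa and their relative abundances. If you profiled several samples, MetaPhlAn includes a merge_metaphlan_tables.py utility that combines per-sample profiles into a single table; run it with the same modules loaded and the virtual environment active so that the script is on your PATH:
[name@server ~]$ merge_metaphlan_tables.py profiled_*.txt > merged_abundance_table.txt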