BUSCO: Difference between revisions

From Alliance Doc
Jump to navigation Jump to search
(Rewrite of BUSCO to use python package and slurm tips.)
mNo edit summary
 
(45 intermediate revisions by 5 users not shown)
Line 5: Line 5:


<!--T:1-->
<!--T:1-->
BUSCO stands for "Benchmarking sets of Universal Single-Copy Orthologs". 
BUSCO (<i>Benchmarking sets of Universal Single-Copy Orthologs</i>) is an application for assessing genome assembly and annotation completeness.
It is an application for assessing genome assembly and annotation completeness.


<!--T:2-->
For more information, see the [https://busco.ezlab.org/busco_userguide.html user manual].
For more information, see the [https://busco.ezlab.org/busco_userguide.html user manual].


== Available versions == <!--T:10-->
== Available versions == <!--T:3-->
Recent versions are available as wheels. Older version is available as a module, please see the module section below.
Recent versions are available as wheels. Older versions are available as a module; please see the [[#Modules|Modules]] section below.


To see the latest available version, run:
<!--T:28-->
To see the latest available version, run
{{Command|avail_wheel busco}}
{{Command|avail_wheel busco}}
== Python Wheel ==
 
== Python wheel == <!--T:4-->
=== Installation ===
=== Installation ===
'''1.''' Load the necessary modules:
<b>1.</b> Load the necessary modules.
{{Command|module load StdEnv/2020 gcc python augustus hmmer blast+ metaeuk prodigal r}}
{{Commands
|module load StdEnv/2020 gcc/9.3.0 python/3.10 augustus/3.5.0 hmmer/3.3.2 blast+/2.13.0 metaeuk/6 prodigal/2.6.3 r/4.3.1 bbmap/38.86}}


'''2.''' Create the virtual environment:
<!--T:5-->
<b>2.</b> Create the virtual environment.
{{Commands
{{Commands
|virtualenv busco_env
|virtualenv ~/busco_env
|source busco_env/bin/activate
|source ~/busco_env/bin/activate
}}
 
<!--T:6-->
<b>3.</b> Install the wheel and its dependencies.
{{Command
|prompt=(busco_env) $
|pip install --no-index biopython{{=}}{{=}}1.81 pandas{{=}}{{=}}2.1.0 busco{{=}}{{=}}5.5.0
}}
 
<!--T:7-->
<b>4.</b> Validate the installation.
{{Command
|prompt=(busco_env) $
|busco --help
}}
}}


'''3.''' Install the wheel and its dependencies:
<!--T:22-->
<b>5.</b>  Freeze the environment and requirements set. To use the requirements text file, see the <i>bash</i> submission script shown at point 8.
{{Command
{{Command
|prompt=(busco_env) $
|prompt=(busco_env) $
|pip install biopython pandas busco --no-index
|pip freeze > ~/busco-requirements.txt
}}
}}


=== Usage ===
=== Usage === <!--T:23-->
==== Datasets ====
==== Datasets ====
You must pre-download any datasets from [https://busco-data.ezlab.org/v5/data/ busco data] before submitting your job.
<b>6.</b> You must pre-download any datasets from [https://busco-data.ezlab.org/v5/data/ BUSCO data] before submitting your job.
 
<!--T:29-->
You can access the available datasets in your terminal by typing <code>busco --list-datasets</code>.
 
<!--T:30-->
You have <b>two</b> options to download datasets:<br>
*use the <code>busco</code> command,
*use the <code>wget</code> command.


==== Test ====
===== <b>6.1</b>  Using the <code>busco</code> command ===== <!--T:31-->
'''4.''' Download test data:
This is the preferred option. Type this command in your working directory to download a particular dataset, for example
{{Command
|busco --download bacteria_odb10
}}
 
<!--T:34-->
It is also possible to do a bulk download by replacing the dataset name by the following arguments: <code>all</code>, <code>prokaryota</code>, <code>eukaryota</code>, or <code>virus</code>, for example
 
<!--T:35-->
{{Commands
|busco --download virus
}}
This will
::1. create a BUSCO directory hierarchy for the datasets,
::2. download the appropriate datasets,
::3. decompress the file(s),
::4. if you download multiple files, they will all be automatically added to the lineages directory.
 
<!--T:39-->
The hierarchy will look like this:
<blockquote>
* busco_downloads/
 
<!--T:40-->
::* information/
 
<!--T:41-->
::::lineages_list.2021-12-14.txt
 
<!--T:42-->
::* lineages/
 
<!--T:43-->
::::bacteria_odb10
 
<!--T:44-->
::::actinobacteria_class_odb10
 
<!--T:45-->
::::actinobacteria_phylum_odb10
 
<!--T:46-->
::* placement_files/
 
<!--T:47-->
::::list_of_reference_markers.archaea_odb10.2019-12-16.txt
</blockquote>
 
<!--T:48-->
Doing so, all your lineage files should be in <b>busco_downloads/lineages/</b>. When referring to <code>--download_path busco_downloads/</code> in the BUSCO command line, it will know where to find the lineage dataset argument <code>--lineage_dataset bacteria_odb10</code>. If the <i>busco_download </i> directory is not in your working directory, you will need to provide the full path.
 
=====<b>6.2</b> Using the <code>wget</code> command ===== <!--T:49-->
 
<!--T:50-->
All files must be decompressed with <code>tar -xvf file.tar.gz</code>.
{{Commands
|mkdir -p busco_downloads/lineages
|cd busco_downloads/lineages
|wget https://busco-data.ezlab.org/v5/data/lineages/bacteria_odb10.2020-03-06.tar.gz
|tar -xvf bacteria_odb10.2020-03-06.tar.gz
}}
 
==== Test ==== <!--T:8-->
<b>7.</b> Download a genome file.
 
<!--T:51-->
{{Commands
{{Commands
|wget https://gitlab.com/ezlab/busco/-/raw/master/test_data/bacteria/genome.fna
|wget https://gitlab.com/ezlab/busco/-/raw/master/test_data/bacteria/genome.fna
|wget https://busco-data.ezlab.org/v5/data/lineages/bacteria_odb10.2020-03-06.tar.gz
}}
}}


'''5.''' Run:
<!--T:9-->
{{Command|busco --offline --in genome.fna --out TEST --lineage_dataset bacteria_odb10 --mode genome --cpu ${SLURM_CPUS_PER_TASK-1} }}
<b>8.</b> Run.
 
<!--T:52-->
Command to run a single genome:
 
<!--T:53-->
{{Command|busco --offline --in genome.fna --out TEST --lineage_dataset bacteria_odb10 --mode genome --cpu ${SLURM_CPUS_PER_TASK:-1} --download_path busco_download/}}
 
<!--T:54-->
Command to run multiple genomes that would be saved in the genome directory (in this example, the <i>genome/</i> folder would need to be in the current directory; otherwise, you need to provide the full path):
 
<!--T:55-->
{{Command|busco --offline --in genome/ --out TEST --lineage_dataset bacteria_odb10 --mode genome --cpu ${SLURM_CPUS_PER_TASK:-1} --download_path busco_download/}}
 
<!--T:24-->
The single genome command should take less than 60 seconds to complete. Production runs which take longer must be submitted to the [[Running jobs|scheduler]].
 
===== BUSCO tips ===== <!--T:10-->
 
<!--T:56-->
Specify <code>--in genome.fna</code> for single file analysis.
 
<!--T:57-->
Specify <code>--in genome/</code> for multiple files analysis.
 
===== Slurm tips ===== <!--T:11-->
Specify <code>--offline</code> to avoid using the internet.
 
<!--T:12-->
Specify <code>--cpu</code> to <code>$SLURM_CPUS_PER_TASK</code> in your job submission script to use the number of CPUs allocated.
 
<!--T:13-->
Specify <code>--restart</code> to restart from a partial run.
 
====Job submission==== <!--T:58-->
 
<!--T:59-->
Here you have an example of a submission script. You can submit as so: <code>sbatch run_busco.sh</code>.
 
</translate>
{{File
  |name=run_busco.sh
  |lang="bash"
  |contents=
 
 
#!/bin/bash
 
#SBATCH --job-name=busco9_run
#SBATCH --account=def-someprof    # adjust this to match the accounting group you are using to submit jobs
#SBATCH --time=01:00:00          # adjust this to match the walltime of your job
#SBATCH --cpus-per-task=8        # adjust depending on the size of the genome(s)/protein(s)/transcriptome(s)
#SBATCH --mem=20G                # adjust this according to the memory you need
 
# Load modules dependencies.
module load StdEnv/2020 gcc/9.3.0 python/3.10 augustus/3.5.0 hmmer/3.3.2 blast+/2.13.0 metaeuk/6 prodigal/2.6.3 r/4.3.1 bbmap/38.86
 
# Generate your virtual environment in $SLURM_TMPDIR.
virtualenv --no-download ${SLURM_TMPDIR}/env
source ${SLURM_TMPDIR}/env/bin/activate
 
# Install busco and its dependencies.
pip install --no-index --upgrade pip
pip install --no-index --requirement ~/busco-requirements.txt
 
# Edit with the proper arguments, run your commands.
busco --offline --in genome.fna --out TEST --lineage_dataset bacteria_odb10 --mode genome --cpu ${SLURM_CPUS_PER_TASK:-1} --download_path busco_download/
 
}}
<translate>
 
====Augustus parameters==== <!--T:60-->
<b>9.</b> Advanced users may want to use Augustus parameters: <code>--augustus_parameters="--yourAugustusParameter"</code>.
 
<!--T:61-->
*Copy the Augustus <i>config</i> directory to a writable location.
{{Command|cp -r $EBROOTAUGUSTUS/config $HOME/augustus_config}}
 
<!--T:62-->
*Make sure to define the <code>AUGUSTUS_CONFIG_PATH</code> environment variable.
{{Command|export AUGUSTUS_CONFIG_PATH{{=}}$HOME/augustus_config}}
 
====SEPP parameters==== <!--T:63-->
<b>10.</b> To use SEPP parameters, you need to install SEPP locally in your virtual environment. This should be done from the login node.
 
<!--T:64-->
<b>10.1.</b> Activate your BUSCO virtual environment.
{{Commands
|source busco_env/bin/activate
}}
 
<!--T:65-->
<b>10.2.</b> Install DendroPy.
{{Commands
|pip install 'dendropy<4.6'
}}


The command should take less than 60 seconds to complete. Production runs which take longer must be submitted to the [[Running jobs|scheduler]].
<!--T:66-->
<b>10.3.</b> Install SEPP.
{{Commands
|git clone https://github.com/smirarab/sepp.git
|cd sepp
|python setup.py config
|python setup.py install
}}


=== Slurm tips ===
<!--T:67-->
Specify <tt>--offline</tt> to avoid using internet.
<b>10.4.</b> Validate the installation.
{{Commands
|cd
|run_sepp.py -h
}}


Specify <tt>--cpu</tt> to <tt>$SLURM_CPUS_PER_TASK</tt> in your job submission script to use the number of cpus allocated.
<!--T:68-->
<b>10.5.</b> Because SEPP is installed locally, you cannot create the virtual environment as described in the previous submission script. To activate your local virtual environment, simply add the following command immediately under the line to load the module:
{{Commands
|source ~/busco_env/bin/activate
}}


Specify <tt>--restart</tt> to restart from a partial run.
== Modules == <!--T:14-->  


== Module == <!--T:2-->
<!--T:69-->
{{Warning
|title=Deprecation
|content=This section is outdated and deprecated. You should use the wheels available.
}}


<!--T:15-->
<!--T:15-->
'''1.''' Load the necessary modules:
<b>1.</b> Load the necessary modules.
{{Command|module load StdEnv/2018.3 gcc/7.3.0 openmpi/3.1.4 busco/3.0.2 r/4.0.2}}
{{Command|module load StdEnv/2018.3 gcc/7.3.0 openmpi/3.1.4 busco/3.0.2 r/4.0.2}}
This will also load modules for <code>augustus, blast+, hmmer</code> and some other
This will also load modules for Augustus, BLAST+, HMMER and some other
software packages that BUSCO relies upon.
software packages that BUSCO relies upon.


<!--T:3-->
<!--T:16-->
'''2.''' Copy the configuration file:
<b>2.</b> Copy the configuration file.
{{Command|cp -v $EBROOTBUSCO/config/config.ini.default $HOME/busco_config.ini}}
{{Command|cp -v $EBROOTBUSCO/config/config.ini.default $HOME/busco_config.ini}}
or
or
{{Command|wget -O $HOME/busco_config.ini https://gitlab.com/ezlab/busco/raw/master/config/config.ini.default}}
{{Command|wget -O $HOME/busco_config.ini https://gitlab.com/ezlab/busco/raw/master/config/config.ini.default}}


<!--T:4-->
<!--T:17-->
'''3.''' Edit the configuration file. The locations of external tools are all specified in the last section, which is shown below:
<b>3.</b> Edit the configuration file. The locations of external tools are all specified in the last section, which is shown below:
</translate>
</translate>
{{File
{{File
Line 110: Line 315:
<translate>
<translate>


<!--T:8-->
<!--T:18-->
'''4.''' Copy the Augustus config directory to a writable location:
<b>4.</b> Copy the Augustus <code>config</code> directory to a writable location.
{{Command|cp -r $EBROOTAUGUSTUS/config $HOME/augustus_config}}
{{Command|cp -r $EBROOTAUGUSTUS/config $HOME/augustus_config}}


<!--T:5-->
<!--T:19-->
'''5.''' Check that it runs.
<b>5.</b> Check that it runs.


<!--T:7-->
<!--T:20-->
{{Commands
{{Commands
|export BUSCO_CONFIG_FILE{{=}}$HOME/busco_config.ini
|export BUSCO_CONFIG_FILE{{=}}$HOME/busco_config.ini
Line 124: Line 329:
}}
}}


<!--T:16-->
<!--T:21-->
The <code>run_BUSCO.py</code> command should take less than 60 seconds to complete.
The <code>run_BUSCO.py</code> command should take less than 60 seconds to complete.
Production runs which take longer should be submitted to the [[Running jobs|scheduler]].
Production runs which take longer should be submitted to the [[Running jobs|scheduler]].


= Troubleshooting = <!--T:9-->
= Troubleshooting = <!--T:25-->
== Cannot write to Augustus config path ==
== Cannot write to Augustus config path ==
Make sure you have copied the config directory to a writable location and exported the <tt>AUGUSTUS_CONFIG_PATH</tt> variable.
Make sure you have copied the <i>config</i> directory to a writable location and exported the <code>AUGUSTUS_CONFIG_PATH</code> variable.
</translate>
</translate>

Latest revision as of 23:06, 1 May 2024

Other languages:


BUSCO (Benchmarking sets of Universal Single-Copy Orthologs) is an application for assessing genome assembly and annotation completeness.

For more information, see the user manual.

Available versions

Recent versions are available as wheels. Older versions are available as a module; please see the Modules section below.

To see the latest available version, run

Question.png
[name@server ~]$ avail_wheel busco

Python wheel

Installation

1. Load the necessary modules.

[name@server ~]$ module load StdEnv/2020 gcc/9.3.0 python/3.10 augustus/3.5.0 hmmer/3.3.2 blast+/2.13.0 metaeuk/6 prodigal/2.6.3 r/4.3.1 bbmap/38.86


2. Create the virtual environment.

[name@server ~]$ virtualenv ~/busco_env
[name@server ~]$ source ~/busco_env/bin/activate


3. Install the wheel and its dependencies.

Question.png
(busco_env) $ pip install --no-index biopython==1.81 pandas==2.1.0 busco==5.5.0

4. Validate the installation.

Question.png
(busco_env) $ busco --help

5. Freeze the environment and requirements set. To use the requirements text file, see the bash submission script shown at point 8.

Question.png
(busco_env) $ pip freeze > ~/busco-requirements.txt

Usage

Datasets

6. You must pre-download any datasets from BUSCO data before submitting your job.

You can access the available datasets in your terminal by typing busco --list-datasets.

You have two options to download datasets:

  • use the busco command,
  • use the wget command.
6.1 Using the busco command

This is the preferred option. Type this command in your working directory to download a particular dataset, for example

Question.png
[name@server ~]$ busco --download bacteria_odb10

It is also possible to do a bulk download by replacing the dataset name by the following arguments: all, prokaryota, eukaryota, or virus, for example

[name@server ~]$ busco --download virus

This will

1. create a BUSCO directory hierarchy for the datasets,
2. download the appropriate datasets,
3. decompress the file(s),
4. if you download multiple files, they will all be automatically added to the lineages directory.

The hierarchy will look like this:

  • busco_downloads/
  • information/
lineages_list.2021-12-14.txt
  • lineages/
bacteria_odb10
actinobacteria_class_odb10
actinobacteria_phylum_odb10
  • placement_files/
list_of_reference_markers.archaea_odb10.2019-12-16.txt

Doing so, all your lineage files should be in busco_downloads/lineages/. When referring to --download_path busco_downloads/ in the BUSCO command line, it will know where to find the lineage dataset argument --lineage_dataset bacteria_odb10. If the busco_download directory is not in your working directory, you will need to provide the full path.

6.2 Using the wget command

All files must be decompressed with tar -xvf file.tar.gz.

[name@server ~]$ mkdir -p busco_downloads/lineages
[name@server ~]$ cd busco_downloads/lineages
[name@server ~]$ wget https://busco-data.ezlab.org/v5/data/lineages/bacteria_odb10.2020-03-06.tar.gz
[name@server ~]$ tar -xvf bacteria_odb10.2020-03-06.tar.gz


Test

7. Download a genome file.

[name@server ~]$ wget https://gitlab.com/ezlab/busco/-/raw/master/test_data/bacteria/genome.fna


8. Run.

Command to run a single genome:

Question.png
[name@server ~]$ busco --offline --in genome.fna --out TEST --lineage_dataset bacteria_odb10 --mode genome --cpu ${SLURM_CPUS_PER_TASK:-1} --download_path busco_download/

Command to run multiple genomes that would be saved in the genome directory (in this example, the genome/ folder would need to be in the current directory; otherwise, you need to provide the full path):

Question.png
[name@server ~]$ busco --offline --in genome/ --out TEST --lineage_dataset bacteria_odb10 --mode genome --cpu ${SLURM_CPUS_PER_TASK:-1} --download_path busco_download/

The single genome command should take less than 60 seconds to complete. Production runs which take longer must be submitted to the scheduler.

BUSCO tips

Specify --in genome.fna for single file analysis.

Specify --in genome/ for multiple files analysis.

Slurm tips

Specify --offline to avoid using the internet.

Specify --cpu to $SLURM_CPUS_PER_TASK in your job submission script to use the number of CPUs allocated.

Specify --restart to restart from a partial run.

Job submission

Here you have an example of a submission script. You can submit as so: sbatch run_busco.sh.


File : run_busco.sh

#!/bin/bash

#SBATCH --job-name=busco9_run
#SBATCH --account=def-someprof    # adjust this to match the accounting group you are using to submit jobs
#SBATCH --time=01:00:00           # adjust this to match the walltime of your job
#SBATCH --cpus-per-task=8         # adjust depending on the size of the genome(s)/protein(s)/transcriptome(s)
#SBATCH --mem=20G                 # adjust this according to the memory you need

# Load modules dependencies.
module load StdEnv/2020 gcc/9.3.0 python/3.10 augustus/3.5.0 hmmer/3.3.2 blast+/2.13.0 metaeuk/6 prodigal/2.6.3 r/4.3.1 bbmap/38.86

# Generate your virtual environment in $SLURM_TMPDIR.
virtualenv --no-download ${SLURM_TMPDIR}/env
source ${SLURM_TMPDIR}/env/bin/activate

# Install busco and its dependencies.
pip install --no-index --upgrade pip
pip install --no-index --requirement ~/busco-requirements.txt

# Edit with the proper arguments, run your commands.
busco --offline --in genome.fna --out TEST --lineage_dataset bacteria_odb10 --mode genome --cpu ${SLURM_CPUS_PER_TASK:-1} --download_path busco_download/


Augustus parameters

9. Advanced users may want to use Augustus parameters: --augustus_parameters="--yourAugustusParameter".

  • Copy the Augustus config directory to a writable location.
Question.png
[name@server ~]$ cp -r $EBROOTAUGUSTUS/config $HOME/augustus_config
  • Make sure to define the AUGUSTUS_CONFIG_PATH environment variable.
Question.png
[name@server ~]$ export AUGUSTUS_CONFIG_PATH=$HOME/augustus_config

SEPP parameters

10. To use SEPP parameters, you need to install SEPP locally in your virtual environment. This should be done from the login node.

10.1. Activate your BUSCO virtual environment.

[name@server ~]$ source busco_env/bin/activate


10.2. Install DendroPy.

[name@server ~]$ pip install 'dendropy<4.6'


10.3. Install SEPP.

[name@server ~]$ git clone https://github.com/smirarab/sepp.git
[name@server ~]$ cd sepp
[name@server ~]$ python setup.py config
[name@server ~]$ python setup.py install


10.4. Validate the installation.

[name@server ~]$ cd
[name@server ~]$ run_sepp.py -h


10.5. Because SEPP is installed locally, you cannot create the virtual environment as described in the previous submission script. To activate your local virtual environment, simply add the following command immediately under the line to load the module:

[name@server ~]$ source ~/busco_env/bin/activate


Modules

Deprecation

This section is outdated and deprecated. You should use the wheels available.



1. Load the necessary modules.

Question.png
[name@server ~]$ module load StdEnv/2018.3 gcc/7.3.0 openmpi/3.1.4 busco/3.0.2 r/4.0.2

This will also load modules for Augustus, BLAST+, HMMER and some other software packages that BUSCO relies upon.

2. Copy the configuration file.

Question.png
[name@server ~]$ cp -v $EBROOTBUSCO/config/config.ini.default $HOME/busco_config.ini

or

Question.png
[name@server ~]$ wget -O $HOME/busco_config.ini https://gitlab.com/ezlab/busco/raw/master/config/config.ini.default

3. Edit the configuration file. The locations of external tools are all specified in the last section, which is shown below:

File : partial_busco_config.ini

[tblastn]
# path to tblastn
path = /cvmfs/soft.computecanada.ca/easybuild/software/2017/avx512/Compiler/gcc7.3/blast+/2.7.1/bin/

[makeblastdb]
# path to makeblastdb
path = /cvmfs/soft.computecanada.ca/easybuild/software/2017/avx512/Compiler/gcc7.3/blast+/2.7.1/bin/

[augustus]
# path to augustus
path = /cvmfs/soft.computecanada.ca/easybuild/software/2017/avx512/Compiler/gcc7.3/augustus/3.3/bin/

[etraining]
# path to augustus etraining
path = /cvmfs/soft.computecanada.ca/easybuild/software/2017/avx512/Compiler/gcc7.3/augustus/3.3/bin/

# path to augustus perl scripts, redeclare it for each new script
[gff2gbSmallDNA.pl]
path = /cvmfs/soft.computecanada.ca/easybuild/software/2017/avx512/Compiler/gcc7.3/augustus/3.3/scripts/
[new_species.pl]
path = /cvmfs/soft.computecanada.ca/easybuild/software/2017/avx512/Compiler/gcc7.3/augustus/3.3/scripts/
[optimize_augustus.pl]
path = /cvmfs/soft.computecanada.ca/easybuild/software/2017/avx512/Compiler/gcc7.3/augustus/3.3/scripts/

[hmmsearch]
# path to HMMsearch executable
path = /cvmfs/soft.computecanada.ca/easybuild/software/2017/avx512/Compiler/gcc7.3/hmmer/3.1b2/bin/

[Rscript]
# path to Rscript, if you wish to use the plot tool
path = /cvmfs/soft.computecanada.ca/easybuild/software/2017/avx512/Compiler/gcc7.3/r/4.0.2/bin/


4. Copy the Augustus config directory to a writable location.

Question.png
[name@server ~]$ cp -r $EBROOTAUGUSTUS/config $HOME/augustus_config

5. Check that it runs.

[name@server ~]$ export BUSCO_CONFIG_FILE=$HOME/busco_config.ini
[name@server ~]$ export AUGUSTUS_CONFIG_PATH=$HOME/augustus_config
[name@server ~]$ run_BUSCO.py --in $EBROOTBUSCO/sample_data/target.fa --out TEST --lineage_path $EBROOTBUSCO/sample_data/example --mode genome


The run_BUSCO.py command should take less than 60 seconds to complete. Production runs which take longer should be submitted to the scheduler.

Troubleshooting

Cannot write to Augustus config path

Make sure you have copied the config directory to a writable location and exported the AUGUSTUS_CONFIG_PATH variable.