BUSCO: Difference between revisions

(Marked this version for translation)
(Rewrite of BUSCO to use python package and slurm tips.)
Line 7: Line 7:
BUSCO stands for "Benchmarking sets of Universal Single-Copy Orthologs".   
BUSCO stands for "Benchmarking sets of Universal Single-Copy Orthologs".   
It is an application for assessing genome assembly and annotation completeness.
It is an application for assessing genome assembly and annotation completeness.
For more information see the [https://gitlab.com/ezlab/busco/blob/master/BUSCO_v3_userguide.pdf user manual].
 
For more information, see the [https://busco.ezlab.org/busco_userguide.html user manual].


== Available versions == <!--T:10-->
== Available versions == <!--T:10-->
Recent versions are available as wheels. An older version is available as a module; see the Module section below.


<!--T:11-->
To see the latest available version, run:
Version 3.0.2 of BUSCO is installed as a module on CVMFS and accessible on all clusters. See below how to use it.
{{Command|avail_wheel busco}}
== Python Wheel ==
=== Installation ===
'''1.''' Load the necessary modules:
{{Command|module load StdEnv/2020 gcc python augustus hmmer blast+ metaeuk prodigal r}}


<!--T:14-->
'''2.''' Create the virtual environment:
For the [https://gitlab.com/ezlab/busco newer versions], you can install them in your own account using a [[Python#Creating_and_using_a_virtual_environment|virtual environment]] as follows:
{{Commands
|virtualenv busco_env
|source busco_env/bin/activate
}}


<!--T:12-->
'''3.''' Install the wheel and its dependencies:
{{Commands|
{{Command
~ $ module load python/3.7.4
|prompt=(busco_env) $
~ $ git clone https://gitlab.com/ezlab/busco.git
|pip install biopython pandas busco --no-index
~ $ virtualenv /home/$USER/busco_env
~ $ source /home/$USER/busco_env/bin/activate
(busco_env) [~]$ pip install Biopython
(busco_env) [~]$ cd ~/busco
(busco_env) [~]$ python setup.py install
(busco_env) [~]$ cp -r scripts test_data /home/$USER/busco_env/
}}
}}


<!--T:13-->
=== Usage ===
and add "home/$USER/busco_env/scripts" to your path.
==== Datasets ====
You must pre-download any datasets from [https://busco-data.ezlab.org/v5/data/ busco data] before submitting your job.
 
==== Test ====
'''4.''' Download test data:
{{Commands
|wget https://gitlab.com/ezlab/busco/-/raw/master/test_data/bacteria/genome.fna
|wget https://busco-data.ezlab.org/v5/data/lineages/bacteria_odb10.2020-03-06.tar.gz
}}
 
'''5.''' Run:
{{Command|busco --offline --in genome.fna --out TEST --lineage_dataset bacteria_odb10 --mode genome --cpu ${SLURM_CPUS_PER_TASK-1} }}
 
The command should take less than 60 seconds to complete. Production runs that take longer must be submitted to the [[Running jobs|scheduler]].
 
=== Slurm tips ===
Specify <tt>--offline</tt> so that BUSCO does not try to access the internet.
 
Set <tt>--cpu</tt> to <tt>$SLURM_CPUS_PER_TASK</tt> in your job submission script so that BUSCO uses the number of CPUs allocated to the job.
 
Specify <tt>--restart</tt> to resume a partially completed run.


== Using BUSCO from CVMFS == <!--T:2-->
== Module == <!--T:2-->


<!--T:15-->
<!--T:15-->

Revision as of 00:07, 30 March 2021


BUSCO stands for "Benchmarking sets of Universal Single-Copy Orthologs". It is an application for assessing genome assembly and annotation completeness.

For more information, see the user manual (https://busco.ezlab.org/busco_userguide.html).

Available versions

Recent versions are available as wheels. An older version is available as a module; see the Module section below.

To see the latest available version, run:

[name@server ~]$ avail_wheel busco

Python Wheel

Installation

1. Load the necessary modules:

[name@server ~]$ module load StdEnv/2020 gcc python augustus hmmer blast+ metaeuk prodigal r

2. Create the virtual environment:

[name@server ~]$ virtualenv busco_env
[name@server ~]$ source busco_env/bin/activate


3. Install the wheel and its dependencies:

(busco_env) $ pip install biopython pandas busco --no-index

Usage

Datasets

You must pre-download any datasets you need from the BUSCO data repository (https://busco-data.ezlab.org/v5/data/) before submitting your job.

Test

4. Download test data:

[name@server ~]$ wget https://gitlab.com/ezlab/busco/-/raw/master/test_data/bacteria/genome.fna
[name@server ~]$ wget https://busco-data.ezlab.org/v5/data/lineages/bacteria_odb10.2020-03-06.tar.gz
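The downloaded lineage archive has to be unpacked before an offline run can use it. A minimal sketch, assuming BUSCO v5 looks for lineages under `busco_downloads/lineages` in the working directory (its default download path; adjust the directory if you point BUSCO elsewhere):

```shell
# Assumption: BUSCO's default download directory, relative to where you run it.
lineage_dir=busco_downloads/lineages
archive=bacteria_odb10.2020-03-06.tar.gz

mkdir -p "$lineage_dir"

# Unpack only if the archive from the download step is present.
if [ -f "$archive" ]; then
    tar -xzf "$archive" -C "$lineage_dir"
else
    echo "Archive $archive not found; download it first." >&2
fi
```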


5. Run:

[name@server ~]$ busco --offline --in genome.fna --out TEST --lineage_dataset bacteria_odb10 --mode genome --cpu ${SLURM_CPUS_PER_TASK-1}

The command should take less than 60 seconds to complete. Production runs that take longer must be submitted to the scheduler.
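The `${SLURM_CPUS_PER_TASK-1}` argument in the run command is a shell default-value expansion: it yields the Slurm value when the variable is set and falls back to 1 otherwise, so the same command works inside and outside a job. A small illustration, where `busco_cpus` is a hypothetical helper used only for this demonstration:

```shell
# Print the CPU count BUSCO would receive: the Slurm value if set, otherwise 1.
busco_cpus() {
    echo "${SLURM_CPUS_PER_TASK-1}"
}

# Outside a job (variable unset), this prints 1.
( unset SLURM_CPUS_PER_TASK; busco_cpus )

# Inside a job that requested 8 CPUs per task, this prints 8.
( SLURM_CPUS_PER_TASK=8; busco_cpus )
```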

Slurm tips

Specify --offline so that BUSCO does not try to access the internet.

Set --cpu to $SLURM_CPUS_PER_TASK in your job submission script so that BUSCO uses the number of CPUs allocated to the job.

Specify --restart to resume a partially completed run.
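These tips can be combined into a job script. The sketch below writes one to `busco_job.sh`; the account name, walltime, CPU count and memory request are placeholders (assumptions, not recommended values), and the input file and lineage are those of the test above:

```shell
# Write a job script to busco_job.sh. The quoted 'EOF' keeps $VARIABLES literal,
# so they are expanded when the job runs, not when the file is written.
cat > busco_job.sh <<'EOF'
#!/bin/bash
#SBATCH --account=def-someuser   # hypothetical account name
#SBATCH --time=01:00:00          # placeholder walltime
#SBATCH --cpus-per-task=4        # placeholder CPU count
#SBATCH --mem=8G                 # placeholder memory request

module load StdEnv/2020 gcc python augustus hmmer blast+ metaeuk prodigal r
source busco_env/bin/activate

busco --offline --in genome.fna --out TEST \
      --lineage_dataset bacteria_odb10 --mode genome \
      --cpu "${SLURM_CPUS_PER_TASK-1}"
EOF
```

Submit the script with `sbatch busco_job.sh`. To resume a partially completed run, add `--restart` to the busco command and submit again.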

Module

1. Load the necessary modules:

[name@server ~]$ module load StdEnv/2018.3 gcc/7.3.0 openmpi/3.1.4 busco/3.0.2 r/4.0.2

This will also load modules for augustus, blast+, hmmer and some other software packages that BUSCO relies upon.

2. Copy the configuration file:

[name@server ~]$ cp -v $EBROOTBUSCO/config/config.ini.default $HOME/busco_config.ini

or

[name@server ~]$ wget -O $HOME/busco_config.ini https://gitlab.com/ezlab/busco/raw/master/config/config.ini.default

3. Edit the configuration file. The locations of external tools are all specified in the last section, which is shown below:

File : partial_busco_config.ini

[tblastn]
# path to tblastn
path = /cvmfs/soft.computecanada.ca/easybuild/software/2017/avx512/Compiler/gcc7.3/blast+/2.7.1/bin/

[makeblastdb]
# path to makeblastdb
path = /cvmfs/soft.computecanada.ca/easybuild/software/2017/avx512/Compiler/gcc7.3/blast+/2.7.1/bin/

[augustus]
# path to augustus
path = /cvmfs/soft.computecanada.ca/easybuild/software/2017/avx512/Compiler/gcc7.3/augustus/3.3/bin/

[etraining]
# path to augustus etraining
path = /cvmfs/soft.computecanada.ca/easybuild/software/2017/avx512/Compiler/gcc7.3/augustus/3.3/bin/

# path to augustus perl scripts, redeclare it for each new script
[gff2gbSmallDNA.pl]
path = /cvmfs/soft.computecanada.ca/easybuild/software/2017/avx512/Compiler/gcc7.3/augustus/3.3/scripts/
[new_species.pl]
path = /cvmfs/soft.computecanada.ca/easybuild/software/2017/avx512/Compiler/gcc7.3/augustus/3.3/scripts/
[optimize_augustus.pl]
path = /cvmfs/soft.computecanada.ca/easybuild/software/2017/avx512/Compiler/gcc7.3/augustus/3.3/scripts/

[hmmsearch]
# path to HMMsearch executable
path = /cvmfs/soft.computecanada.ca/easybuild/software/2017/avx512/Compiler/gcc7.3/hmmer/3.1b2/bin/

[Rscript]
# path to Rscript, if you wish to use the plot tool
path = /cvmfs/soft.computecanada.ca/easybuild/software/2017/avx512/Compiler/gcc7.3/r/4.0.2/bin/
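The paths above are tied to one software stack and CPU architecture (note the `avx512` component), so they may differ on your cluster. A small sketch for looking up the directory to enter for each tool once the modules are loaded; `tool_dir` is a hypothetical helper, not part of BUSCO, and it only finds executables that are on your PATH (the Augustus Perl scripts may not be):

```shell
# Print the directory containing an executable found on PATH, with a trailing
# slash to match the style used in the configuration file.
tool_dir() {
    tool_path=$(command -v "$1") || { echo "$1 not found on PATH" >&2; return 1; }
    printf '%s/\n' "$(dirname "$tool_path")"
}

# Look up the tools referenced in the configuration file; keep going if one
# of them is missing.
for tool in tblastn makeblastdb augustus etraining hmmsearch Rscript; do
    tool_dir "$tool" || true
done
```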


4. Copy the Augustus config directory to a writable location:

[name@server ~]$ cp -r $EBROOTAUGUSTUS/config $HOME/augustus_config

5. Check that it runs.

[name@server ~]$ export BUSCO_CONFIG_FILE=$HOME/busco_config.ini
[name@server ~]$ export AUGUSTUS_CONFIG_PATH=$HOME/augustus_config
[name@server ~]$ run_BUSCO.py --in $EBROOTBUSCO/sample_data/target.fa --out TEST --lineage_path $EBROOTBUSCO/sample_data/example --mode genome


The run_BUSCO.py command should take less than 60 seconds to complete. Production runs that take longer should be submitted to the scheduler.

Troubleshooting

Cannot write to Augustus config path

Make sure you have copied the config directory to a writable location and exported the AUGUSTUS_CONFIG_PATH variable.
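Both conditions can be checked from the shell before running BUSCO. A minimal sketch; `check_augustus_config` is a hypothetical helper written for this page, not a BUSCO or Augustus command:

```shell
# Return 0 if AUGUSTUS_CONFIG_PATH is set and points at a writable directory;
# otherwise explain what is wrong and return 1.
check_augustus_config() {
    if [ -z "${AUGUSTUS_CONFIG_PATH:-}" ]; then
        echo "AUGUSTUS_CONFIG_PATH is not set" >&2
        return 1
    fi
    if [ ! -d "$AUGUSTUS_CONFIG_PATH" ] || [ ! -w "$AUGUSTUS_CONFIG_PATH" ]; then
        echo "$AUGUSTUS_CONFIG_PATH is not a writable directory" >&2
        return 1
    fi
    echo "AUGUSTUS_CONFIG_PATH looks usable: $AUGUSTUS_CONFIG_PATH"
}

# Run the check in the current shell without aborting on failure.
check_augustus_config || true
```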