BUSCO/en: Difference between revisions
Revision as of 20:07, 21 November 2023
BUSCO stands for "Benchmarking sets of Universal Single-Copy Orthologs".
It is an application for assessing genome assembly and annotation completeness.
For more information, see the user manual.
Available versions
Recent versions are available as wheels. Older versions are available as a module; see the Modules section below.
To see the latest available version, run:
[name@server ~]$ avail_wheel busco
Python Wheel
Installation
1. Load the necessary modules:
[name@server ~]$ module load StdEnv/2020 gcc python/3.10 augustus hmmer blast+ metaeuk prodigal r bbmap
2. Create the virtual environment:
[name@server ~]$ virtualenv ~/busco_env
[name@server ~]$ source ~/busco_env/bin/activate
3. Install the wheel and its dependencies:
(busco_env) $ pip install biopython pandas busco --no-index
4. Validate it:
(busco_env) $ busco --help
5. Freeze the environment into a requirements file. The submission script in the Job submission section below installs from this file.
(busco_env) $ pip freeze > ~/busco-requirements.txt
Usage
Datasets
6. You must pre-download any datasets from busco data (https://busco-data.ezlab.org/v5/data/) before submitting your job.
You can list the available datasets in your terminal by typing busco --list-datasets.
You have two options for downloading datasets:
Busco download command
6.1 Use the busco download command (preferred method). For example, run this command in your working directory to download one particular dataset:
[name@server ~]$ busco --download bacteria_odb10
It is also possible to do a bulk download by using the following arguments in place of the dataset name: "all", "prokaryota", "eukaryota", or "virus".
[name@server ~]$ busco --download virus
This will:
- 1. Create the BUSCO directory hierarchy for datasets.
- 2. Download the appropriate datasets.
- 3. Decompress the file(s).
- 4. Place every downloaded dataset in the lineages directory, including when several are downloaded at once.
The directory hierarchy will look as follows:
- busco_downloads/
  - information/
    - lineages_list.2021-12-14.txt
  - lineages/
    - bacteria_odb10
    - actinobacteria_class_odb10
    - actinobacteria_phylum_odb10
  - placement_files/
    - list_of_reference_markers.archaea_odb10.2019-12-16.txt
After the download, all your lineage files should be in busco_downloads/lineages/. When you pass --download_path busco_downloads/ on your busco command line, BUSCO knows where to find the dataset named by --lineage_dataset bacteria_odb10. If the busco_downloads directory is not in your working directory, provide its full path.
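Because an offline run cannot fetch a missing dataset, it can help to confirm the layout before submitting a job. The following is a minimal sketch, not part of BUSCO: check_lineage is a hypothetical helper that only tests whether the lineage directory exists under the lineages/ subdirectory of the download path.

```shell
# Hypothetical helper (not a BUSCO command): confirm that a lineage
# directory exists under <download_path>/lineages/ before an offline run.
check_lineage() {
    local download_path="${1%/}"   # strip one trailing slash, if present
    local lineage="$2"
    if [ -d "${download_path}/lineages/${lineage}" ]; then
        echo "found: ${download_path}/lineages/${lineage}"
    else
        echo "missing: ${download_path}/lineages/${lineage}" >&2
        return 1
    fi
}

# Example call, using the names from this page:
# check_lineage busco_downloads bacteria_odb10
```

A non-zero return status suggests that either the dataset was not downloaded or the path given to --download_path does not point at the busco_downloads directory.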
Wget download command
6.2 Use the wget command. All downloaded files must be decompressed with tar -xvf file.tar.gz. For example:
[name@server ~]$ mkdir -p busco_downloads/lineages
[name@server ~]$ cd busco_downloads/lineages
[name@server ~]$ wget https://busco-data.ezlab.org/v5/data/lineages/bacteria_odb10.2020-03-06.tar.gz
[name@server ~]$ tar -xvf bacteria_odb10.2020-03-06.tar.gz
Test
7. Download a genome file.
[name@server ~]$ wget https://gitlab.com/ezlab/busco/-/raw/master/test_data/bacteria/genome.fna
8. Run:
Command to run a single genome:
[name@server ~]$ busco --offline --in genome.fna --out TEST --lineage_dataset bacteria_odb10 --mode genome --cpu ${SLURM_CPUS_PER_TASK:-1} --download_path busco_downloads/
Command to run multiple genomes saved in the genome/ directory (the genome/ directory must be in the current directory, or you must provide its full path):
[name@server ~]$ busco --offline --in genome/ --out TEST --lineage_dataset bacteria_odb10 --mode genome --cpu ${SLURM_CPUS_PER_TASK:-1} --download_path busco_downloads/
The single genome command should take less than 60 seconds to complete. Production runs which take longer must be submitted to the scheduler.
Busco tips
Specify --in genome.fna to analyze a single file.
Specify --in genome/ to analyze multiple files.
Slurm tips
Specify --offline to avoid using the internet.
Set --cpu to $SLURM_CPUS_PER_TASK in your job submission script to use the number of CPUs allocated.
Specify --restart to restart from a partial run.
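The commands on this page write the CPU count as ${SLURM_CPUS_PER_TASK:-1}. A small sketch of what that shell expansion does (resolve_cpus is a hypothetical name used only for illustration): it yields the value Slurm provides when the variable is set and non-empty, and falls back to 1 otherwise, for example on a login node.

```shell
# Resolve the thread count the same way the busco commands on this page do:
# use SLURM_CPUS_PER_TASK when it is set and non-empty, otherwise use 1.
resolve_cpus() {
    echo "${SLURM_CPUS_PER_TASK:-1}"
}

# Show the value that would be passed to --cpu in the current shell.
echo "BUSCO would run with $(resolve_cpus) CPU(s)"
```

This is why the same command line works both interactively and inside a job script without editing the --cpu value.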
Job submission
Here is an example of a submission script. Submit it with sbatch run_busco.sh.
#!/bin/bash
#SBATCH --job-name=busco9_run
#SBATCH --account=def-someprof # adjust this to match the accounting group you are using to submit jobs
#SBATCH --time=01:00:00 # adjust this to match the walltime of your job
#SBATCH --cpus-per-task=8 # adjust depending on the size of the genome(s)/protein(s)/transcriptome(s)
#SBATCH --mem=20G # adjust this according to the memory you need
# Load module dependencies (same Python version as the one used to create the requirements file).
module load StdEnv/2020 gcc python/3.10 augustus hmmer blast+ metaeuk prodigal r bbmap
# Generate your virtual environment in $SLURM_TMPDIR.
virtualenv --no-download ${SLURM_TMPDIR}/env
source ${SLURM_TMPDIR}/env/bin/activate
# Install busco and its dependencies.
pip install --no-index --upgrade pip
pip install --no-index --requirement ~/busco-requirements.txt
# Edit with the proper arguments, run your commands.
busco --offline --in genome.fna --out TEST --lineage_dataset bacteria_odb10 --mode genome --cpu ${SLURM_CPUS_PER_TASK:-1} --download_path busco_downloads/
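If a job stops before finishing (for example, at its walltime limit), the --restart option from the Slurm tips can be added when resubmitting. The sketch below is one possible way to do that automatically; restart_flag is a hypothetical helper, and it assumes the previous run left its results in a directory named after the --out value (TEST here) in the current directory.

```shell
# Hypothetical helper: print "--restart" when the output directory of a
# previous run already exists, and print nothing otherwise.
restart_flag() {
    local out_dir="$1"
    if [ -d "$out_dir" ]; then
        echo "--restart"
    fi
}

# Possible use inside run_busco.sh (left commented out in this sketch):
# busco --offline $(restart_flag TEST) --in genome.fna --out TEST --lineage_dataset bacteria_odb10 --mode genome --cpu ${SLURM_CPUS_PER_TASK:-1} --download_path busco_downloads/
```

The command substitution is deliberately unquoted so that it expands to nothing on a first run.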
Augustus parameters
9. Advanced users can pass Augustus parameters with --augustus_parameters="--yourAugustusParameter".
Copy the Augustus config directory to a writable location:
[name@server ~]$ cp -r $EBROOTAUGUSTUS/config $HOME/augustus_config
Make sure to define the AUGUSTUS_CONFIG_PATH environment variable:
[name@server ~]$ export AUGUSTUS_CONFIG_PATH=$HOME/augustus_config
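The two Augustus steps above can be combined so that they are safe to repeat, for instance at the top of a job script. This is a sketch under assumptions: setup_augustus_config is a hypothetical helper, and it skips the copy when the destination already exists rather than refreshing it.

```shell
# Hypothetical helper: copy the Augustus config directory once, then point
# AUGUSTUS_CONFIG_PATH at the writable copy.
setup_augustus_config() {
    local src="$1"    # e.g. $EBROOTAUGUSTUS/config
    local dest="$2"   # e.g. $HOME/augustus_config
    if [ ! -d "$src" ]; then
        echo "source config not found: $src (is the augustus module loaded?)" >&2
        return 1
    fi
    # Skip the copy when an earlier call already created the destination.
    if [ ! -d "$dest" ]; then
        cp -r "$src" "$dest"
    fi
    export AUGUSTUS_CONFIG_PATH="$dest"
}

# Example call after loading the augustus module:
# setup_augustus_config "$EBROOTAUGUSTUS/config" "$HOME/augustus_config"
```

Keeping an existing copy preserves any species parameters written there by earlier runs.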
SEPP parameters
10. To use SEPP parameters, you need to install SEPP locally in your virtual environment. This should be done on a login node.
10.1. Activate your BUSCO virtual environment:
[name@server ~]$ source ~/busco_env/bin/activate
10.2. Install dendropy:
[name@server ~]$ pip install 'dendropy<4.6'
10.3. Install SEPP:
[name@server ~]$ git clone https://github.com/smirarab/sepp.git
[name@server ~]$ cd sepp
[name@server ~]$ python setup.py config
[name@server ~]$ python setup.py install
10.4. Validate the installation:
[name@server ~]$ cd
[name@server ~]$ run_sepp.py -h
10.5. Because SEPP is installed in your local virtual environment, your job cannot create a fresh virtual environment in $SLURM_TMPDIR as the submission script above does. Instead, activate your local virtual environment right after the module load line:
[name@server ~]$ source ~/busco_env/bin/activate
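Put together, step 10.5 might look like the sketch below, which writes a variant of the submission script to a file. The file name run_busco_sepp.sh is an arbitrary choice for this illustration, and the #SBATCH values are the placeholders from the Job submission section; adjust both to your case.

```shell
# Write a variant of run_busco.sh that reuses the persistent ~/busco_env
# (where SEPP was installed) instead of building one in $SLURM_TMPDIR.
# The quoted 'EOF' keeps ${SLURM_CPUS_PER_TASK:-1} literal in the file.
cat > run_busco_sepp.sh <<'EOF'
#!/bin/bash
#SBATCH --account=def-someprof
#SBATCH --time=01:00:00
#SBATCH --cpus-per-task=8
#SBATCH --mem=20G

# Load the same modules used when the environment was created.
module load StdEnv/2020 gcc python/3.10 augustus hmmer blast+ metaeuk prodigal r bbmap

# Activate the existing environment that contains BUSCO and SEPP.
source ~/busco_env/bin/activate

# Edit with the proper arguments, then run your commands.
busco --offline --in genome.fna --out TEST --lineage_dataset bacteria_odb10 --mode genome --cpu ${SLURM_CPUS_PER_TASK:-1} --download_path busco_downloads/
EOF
```

The resulting file can then be submitted with sbatch run_busco_sepp.sh.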
Modules
This section is outdated. We are currently working on updating it.
1. Load the necessary modules:
[name@server ~]$ module load StdEnv/2018.3 gcc/7.3.0 openmpi/3.1.4 busco/3.0.2 r/4.0.2
This will also load modules for augustus, blast+, hmmer and some other software packages that BUSCO relies upon.
2. Copy the configuration file:
[name@server ~]$ cp -v $EBROOTBUSCO/config/config.ini.default $HOME/busco_config.ini
or
[name@server ~]$ wget -O $HOME/busco_config.ini https://gitlab.com/ezlab/busco/raw/master/config/config.ini.default
3. Edit the configuration file. The locations of external tools are all specified in the last section, which is shown below:
[tblastn]
# path to tblastn
path = /cvmfs/soft.computecanada.ca/easybuild/software/2017/avx512/Compiler/gcc7.3/blast+/2.7.1/bin/
[makeblastdb]
# path to makeblastdb
path = /cvmfs/soft.computecanada.ca/easybuild/software/2017/avx512/Compiler/gcc7.3/blast+/2.7.1/bin/
[augustus]
# path to augustus
path = /cvmfs/soft.computecanada.ca/easybuild/software/2017/avx512/Compiler/gcc7.3/augustus/3.3/bin/
[etraining]
# path to augustus etraining
path = /cvmfs/soft.computecanada.ca/easybuild/software/2017/avx512/Compiler/gcc7.3/augustus/3.3/bin/
# path to augustus perl scripts, redeclare it for each new script
[gff2gbSmallDNA.pl]
path = /cvmfs/soft.computecanada.ca/easybuild/software/2017/avx512/Compiler/gcc7.3/augustus/3.3/scripts/
[new_species.pl]
path = /cvmfs/soft.computecanada.ca/easybuild/software/2017/avx512/Compiler/gcc7.3/augustus/3.3/scripts/
[optimize_augustus.pl]
path = /cvmfs/soft.computecanada.ca/easybuild/software/2017/avx512/Compiler/gcc7.3/augustus/3.3/scripts/
[hmmsearch]
# path to HMMsearch executable
path = /cvmfs/soft.computecanada.ca/easybuild/software/2017/avx512/Compiler/gcc7.3/hmmer/3.1b2/bin/
[Rscript]
# path to Rscript, if you wish to use the plot tool
path = /cvmfs/soft.computecanada.ca/easybuild/software/2017/avx512/Compiler/gcc7.3/r/4.0.2/bin/
4. Copy the Augustus config directory to a writable location:
[name@server ~]$ cp -r $EBROOTAUGUSTUS/config $HOME/augustus_config
5. Check that it runs.
[name@server ~]$ export BUSCO_CONFIG_FILE=$HOME/busco_config.ini
[name@server ~]$ export AUGUSTUS_CONFIG_PATH=$HOME/augustus_config
[name@server ~]$ run_BUSCO.py --in $EBROOTBUSCO/sample_data/target.fa --out TEST --lineage_path $EBROOTBUSCO/sample_data/example --mode genome
The run_BUSCO.py command should take less than 60 seconds to complete. Production runs which take longer should be submitted to the scheduler.
Troubleshooting
Cannot write to Augustus config path
Make sure you have copied the config directory to a writable location and exported the AUGUSTUS_CONFIG_PATH variable.