BLAST
<translate>


<!--T:1-->
BLAST ("Basic Local Alignment Search Tool") finds regions of similarity between biological sequences. The program compares nucleotide or protein sequences to sequence databases and calculates the statistical significance.
BLAST ("Basic Local Alignment Search Tool") finds regions of similarity between biological sequences. The program compares nucleotide or protein sequences to sequence databases and calculates the statistical significance.


<!--T:2-->
== User manual ==
BLAST searches can be run over the Internet using the [https://blast.ncbi.nlm.nih.gov/Blast.cgi NCBI site], but you '''should not do this''' for production work on a Compute Canada cluster. Instead load the BLAST+ [[Utiliser des modules/en|module]] and a search database on the cluster. 
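For example, to load the version used in the job scripts below:
{{Command|module load gcc/7.3.0 blast+/2.7.1}}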
You can find more information on its arguments in the [https://www.ncbi.nlm.nih.gov/books/NBK279684/ user manual], or by running:
{{Command|blastn -help}}


<!--T:3-->
== Databases ==
Some frequently-used sequence databases are installed on Compute Canada clusters; see [[Genomics data]] for information on the available BLAST databases.


== Accelerating the search == <!--T:4-->
For the examples below, the file <tt>ref.fa</tt> will be used as the reference database in FASTA format, and <tt>seq.fa</tt> as the file of query sequences we are looking for.
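As an illustration, <tt>seq.fa</tt> could look like the following (the sequences are hypothetical), with each record on two lines, a header plus a one-line sequence, as required by the splitting step below:
{{File
  |name=seq.fa
  |lang="text"
  |contents=
>seq0
ATGCTGAAACCGTTAGCATTGGAACT
>seq1
CCTGATTCAGGAACCTGCATTGCAAG
}}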


<!--T:5-->
=== <tt>makeblastdb</tt> ===
Before running a search, we must build the database. This can be done in a preprocessing job, with the other jobs depending on the successful completion of the <tt>makeblastdb</tt> job.
Here is an example of a submission script:
{{File
  |name=makeblastdb.sh
  |lang="bash"
  |contents=
#!/bin/bash


<!--T:6-->
#SBATCH --account=def-<user>  # The account to use
#SBATCH --time=00:02:00      # The duration in HH:MM:SS format
#SBATCH --cpus-per-task=1    # The number of cores
#SBATCH --mem=512M            # Total memory for this task
 
module load gcc/7.3.0 blast+/2.7.1
 
# Create the nucleotide database based on `ref.fa`.
makeblastdb -in ref.fa -title reference -dbtype nucl -out ref.fa
}}
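Once the job has completed, you can optionally verify the result with <code>blastdbcmd</code>, which ships with BLAST+ (a quick sanity check, run after loading the same modules):
{{Command|blastdbcmd -db ref.fa -info}}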
 
=== Task array ===
A BLAST search can greatly benefit from data parallelism: by splitting the query file into multiple smaller query files, each can be searched against the database in a separate task.
 
==== Preprocess ====
In order to accelerate the search, the <tt>seq.fa</tt> file must be split into smaller chunks. The file chunks should be <tt>1 MB</tt> or greater; do not make them '''smaller''', as many small files can hurt the parallel file system.
 
'''Important''': to split a FASTA file correctly with the command below, each sequence must be on a single line, rather than wrapped in the multiline format.
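If your file is in the wrapped multiline format, one possible way to put each sequence on a single line is with <code>awk</code>; here <tt>multi.fa</tt> is a hypothetical wrapped input file:
{{Command|awk '/^>/ {if (seq) print seq; print; seq{{=}}""; next} {seq {{=}} seq $0} END {if (seq) print seq}' multi.fa > seq.fa}}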
 
Using the <tt>split</tt> utility:
{{Command|split -d -a 1 -l 2 seq.fa seq.fa.}}
will create 10 files named <tt>seq.fa.N</tt>, where <tt>N</tt> is in the range <tt>[0..9]</tt>, for a file containing 10 queries (sequences); with one-line sequences, <tt>-l 2</tt> puts exactly one record (header plus sequence) in each chunk.
 
==== Job submission ====
Once our queries are split, we can create a task for each <tt>seq.fa.N</tt> file using a job array. The task id from the array will map to the file name containing the query to run.
 
This solution allows the scheduler to fit the smaller tasks from the array wherever resources become available in the cluster.
{{File
  |name=blastn_array.sh
  |lang="bash"
  |contents=
#!/bin/bash
 
#SBATCH --account=def-<user>  # The account to use
#SBATCH --time=00:02:00      # The duration in HH:MM:SS format of each task in the array
#SBATCH --cpus-per-task=1    # The number of cores for each task in the array
#SBATCH --mem-per-cpu=512M    # The memory per core for each task in the array
#SBATCH --array=0-9          # The number of tasks: 10
 
module load gcc/7.3.0 blast+/2.7.1
 
# Using the index of the current task, given by `$SLURM_ARRAY_TASK_ID`, run the corresponding query and write the result
blastn -db ref.fa -query seq.fa.${SLURM_ARRAY_TASK_ID} > seq.ref.${SLURM_ARRAY_TASK_ID}
}}
 
With the above submission script, we can submit our BLAST search, and it will run only after the database has been created (<code>--parsable</code> makes the inner <code>sbatch</code> print just the job ID):
{{Command|sbatch --dependency{{=}}afterok:$(sbatch --parsable makeblastdb.sh) blastn_array.sh}}
 
Once all the tasks from the array are done, the results can be concatenated using:
{{Command|cat seq.ref.{0..9} > seq.ref}}
where the 10 result files are concatenated into the <tt>seq.ref</tt> file.
This could be done from the login node or as a dependent job upon completion of all the tasks from the array.
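For example, a minimal sketch of the dependent-job variant, where <tt><array job id></tt> stands for the job ID printed when the array was submitted:
{{Command|sbatch --dependency{{=}}afterok:<array job id> --account{{=}}def-<user> --time{{=}}00:01:00 --wrap{{=}}"cat seq.ref.{0..9} > seq.ref"}}
With <code>afterok</code> on an array job, the concatenation starts only once every task in the array has finished successfully.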
 
=== GNU Parallel ===
<tt>GNU parallel</tt> is a great tool for packing many small jobs into a single job and running them in parallel.
This solution alleviates the issue of too many small files on a parallel file system: instead of splitting the query file, fixed-size chunks are read from <tt>seq.fa</tt> and processed on multiple cores of a single node.
 
As an example, if your <tt>seq.fa</tt> file is <tt>100 MB</tt>, you could read blocks of <tt>10 MB</tt> and use 10 cores.
{{File
  |name=blastn_gnu.sh
  |lang="bash"
  |contents=
#!/bin/bash
 
#SBATCH --account=def-<user>  # The account to use
#SBATCH --time=00:02:00      # The duration in HH:MM:SS format
#SBATCH --cpus-per-task=4    # The number of cores
#SBATCH --mem-per-cpu=512M    # The memory per core
 
module load gcc/7.3.0 blast+/2.7.1
 
# Pass the whole file to GNU parallel, where
#   --jobs     number of chunks to process in parallel, here $SLURM_CPUS_PER_TASK (the number of cores requested)
#   -k         keep the output in the same order as the input
#   --block    read blocks of about 1 MB in size
#   --recstart match the start of a record, here the sequence identifier `>`
#   --pipe     split the input into chunks and pipe each chunk into the command
cat seq.fa {{!}} parallel --jobs $SLURM_CPUS_PER_TASK -k --block 1M --recstart '>' --pipe 'blastn -db ref.fa -query - ' > seq.ref
}}
 
==== Job submission ====
With the above submission script, we can submit our BLAST search, and it will run only after the database has been created:
{{Command|sbatch --dependency{{=}}afterok:$(sbatch --parsable makeblastdb.sh) blastn_gnu.sh}}
 
=== Additional tips ===
* If it fits into the node-local storage, copy your FASTA database to the local scratch space (<tt>$SLURM_TMPDIR</tt>).
* Lower the number of hits returned (<code>-max_target_seqs</code>, <code>-max_hsps</code> can help), if it is reasonable for your research.
* Limit your hit list to near-identical hits using an expect-value filter (<code>-evalue</code>), if it is reasonable for your research; a combined example is sketched below.
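For example, a hypothetical search combining these options (the thresholds shown are illustrative only; choose values appropriate for your research):
{{Command|blastn -db ref.fa -query seq.fa -evalue 1e-50 -max_target_seqs 5 -max_hsps 1 > seq.ref}}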


</translate>