<translate>
BLAST ("Basic Local Alignment Search Tool") finds regions of similarity between biological sequences. The program compares nucleotide or protein sequences to sequence databases and calculates the statistical significance.
== User manual ==
More information on its arguments can be found in the [https://www.ncbi.nlm.nih.gov/books/NBK279684/ user manual], or by running:
{{Command|blastn -help}}
== Databases ==
Some frequently-used sequence databases are installed on Compute Canada clusters.
Information on the available BLAST databases can be found on the [[Genomics data]] page.
== Accelerating the search ==
In the examples below, the file <tt>ref.fa</tt> is used as the reference database in FASTA format, and <tt>seq.fa</tt> contains the queries we are searching for.
=== <tt>makeblastdb</tt> ===
Before running a search, we must build the database. This can be done in a preprocessing job, with the search jobs made dependent on the completion of the <tt>makeblastdb</tt> job.
Here is an example of a submission script:
{{File
|name=makeblastdb.sh
|lang="bash"
|contents=
#!/bin/bash
#SBATCH --account=def-<user>  # The account to use
#SBATCH --time=00:02:00       # The duration in HH:MM:SS format
#SBATCH --cpus-per-task=1     # The number of cores
#SBATCH --mem=512M            # Total memory for this task

module load gcc/7.3.0 blast+/2.7.1

# Create the nucleotide database based on `ref.fa`.
makeblastdb -in ref.fa -title reference -dbtype nucl -out ref.fa
}}
=== Task array ===
A BLAST search can benefit greatly from data parallelism: the query file is split into multiple smaller query files, and each one is run against the database.
==== Preprocess ====
To accelerate the search, the <tt>seq.fa</tt> file must be split into smaller chunks. The chunks should be <tt>1MB</tt> or greater; creating many '''smaller''' files may hurt the parallel file system.
'''Important''': To be split correctly, the FASTA file must have each sequence on a single line; it must not be in multiline (wrapped) format.
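If your FASTA file is in multiline (wrapped) format, it can be linearized first. The following is a minimal sketch using <tt>awk</tt>; the file names and sequence contents are illustrative:

```shell
# Illustrative input: a wrapped FASTA file (hypothetical contents)
printf '>s1\nACGT\nACGT\n>s2\nTTTT\n' > seq_multiline.fa

# Join the wrapped sequence lines so each record takes exactly two lines:
# the `>` header line followed by the full sequence on a single line.
awk '/^>/ {if (seq) print seq; print; seq=""; next}
     {seq = seq $0}
     END {if (seq) print seq}' seq_multiline.fa > seq.fa

cat seq.fa
# >s1
# ACGTACGT
# >s2
# TTTT
```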
Using the <tt>split</tt> utility:
{{Command|split -d -a 1 -l 2 seq.fa seq.fa.}}
will create one file per query (sequence): for 10 queries, 10 files named <tt>seq.fa.N</tt>, where <tt>N</tt> is in the range <tt>[0..9]</tt>.
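As a small illustration with three single-line sequences (hypothetical contents), <tt>split</tt> produces one two-line file per query:

```shell
# Hypothetical three-query FASTA file, one sequence per line
printf '>s1\nAAAA\n>s2\nCCCC\n>s3\nGGGG\n' > seq.fa

# -d: numeric suffixes; -a 1: one suffix digit; -l 2: two lines (header + sequence) per chunk
split -d -a 1 -l 2 seq.fa seq.fa.

ls seq.fa.?   # lists seq.fa.0, seq.fa.1 and seq.fa.2
```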
==== Job submission ====
Once our queries are split, we can create a task for each <tt>seq.fa.N</tt> file using a job array. The task id from the array maps to the name of the file containing the query to run.
This approach lets the scheduler fit the smaller jobs from the array wherever resources are available in the cluster.
{{File
|name=blastn_array.sh
|lang="bash"
|contents=
#!/bin/bash
#SBATCH --account=def-<user>  # The account to use
#SBATCH --time=00:02:00       # The duration in HH:MM:SS format of each task in the array
#SBATCH --cpus-per-task=1     # The number of cores for each task in the array
#SBATCH --mem-per-cpu=512M    # The memory per core for each task in the array
#SBATCH --array=0-9           # The number of tasks: 10

module load gcc/7.3.0 blast+/2.7.1

# Using the index of the current task, given by `$SLURM_ARRAY_TASK_ID`,
# run the corresponding query and write the result.
blastn -db ref.fa -query seq.fa.${SLURM_ARRAY_TASK_ID} > seq.ref.${SLURM_ARRAY_TASK_ID}
}}
With the above submission script, we can submit our BLAST search; it will run after the database has been created:
{{Command|sbatch --dependency{{=}}afterok:$(sbatch --parsable makeblastdb.sh) blastn_array.sh}}
Note that <tt>sbatch --parsable</tt> prints only the job id, which is what the <tt>--dependency</tt> option expects.
Once all the tasks from the array are done, the results can be concatenated using:
{{Command|cat seq.ref.{0..9} > seq.ref}}
where the 10 result files are concatenated into the <tt>seq.ref</tt> file.
This can be done from the login node or as a job dependent on the completion of all the tasks in the array.
=== GNU Parallel ===
<tt>GNU parallel</tt> is a great tool to pack many small tasks into a single job and run them in parallel.
This approach also alleviates the issue of having too many small files on a parallel file system: fixed-size chunks are read from <tt>seq.fa</tt> and processed on one node using multiple cores.
For example, if your <tt>seq.fa</tt> file is <tt>100MB</tt>, you could read blocks of <tt>10MB</tt> and use 10 cores.
{{File
|name=blastn_gnu.sh
|lang="bash"
|contents=
#!/bin/bash
#SBATCH --account=def-<user>  # The account to use
#SBATCH --time=00:02:00       # The duration in HH:MM:SS format
#SBATCH --cpus-per-task=4     # The number of cores
#SBATCH --mem-per-cpu=512M    # The memory per core

module load gcc/7.3.0 blast+/2.7.1

# Pass the whole file to GNU parallel, where
#   --jobs     number of cores to use, equal to $SLURM_CPUS_PER_TASK (the number of cores requested)
#   -k         keep the same order as the input
#   --block    use blocks of 1MB in size
#   --recstart the record start marker, here the sequence identifier `>`
#   --pipe     pass the blocks to the command on stdin
cat seq.fa {{!}} parallel --jobs $SLURM_CPUS_PER_TASK -k --block 1M --recstart '>' --pipe 'blastn -db ref.fa -query -' > seq.ref
}}
==== Job submission ====
With the above submission script, we can submit our BLAST search; it will run after the database has been created:
{{Command|sbatch --dependency{{=}}afterok:$(sbatch --parsable makeblastdb.sh) blastn_gnu.sh}}
=== Additional tips ===
* If it fits into node-local storage, copy your FASTA database to the local scratch space (<tt>$SLURM_TMPDIR</tt>).
* Lower the number of hits returned (<code>-max_target_seqs</code>, <code>-max_hsps</code> can help), if it is reasonable for your research.
* Limit your hit list to near-identical hits with an e-value filter (<code>-evalue</code>), if it is reasonable for your research.
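The first tip can be sketched as follows. On a compute node, Slurm sets <tt>$SLURM_TMPDIR</tt> automatically; in this self-contained sketch it is simulated with a temporary directory, and the database file is hypothetical:

```shell
# On a compute node, Slurm sets SLURM_TMPDIR to fast node-local storage;
# for this self-contained sketch we fall back to a temporary directory.
export SLURM_TMPDIR="${SLURM_TMPDIR:-$(mktemp -d)}"

# Hypothetical database file named as in the makeblastdb example (-out ref.fa)
printf '>r1\nACGTACGT\n' > ref.fa

# Stage every database file (ref.fa plus the index files makeblastdb produced)
# locally, then point blastn at the local copy, e.g.:
#   blastn -db "$SLURM_TMPDIR/ref.fa" -query seq.fa > seq.ref
cp ref.fa* "$SLURM_TMPDIR/"
ls "$SLURM_TMPDIR/ref.fa"
```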
</translate>