

== User manual == <!--T:8-->
You can find more information on its arguments in the [https://www.ncbi.nlm.nih.gov/books/NBK279684/ user manual] or with
{{Command|blastn -help}}


== Databases == <!--T:9-->
Some frequently used sequence databases are installed on Compute Canada clusters.
You can find information on the BLAST databases available in [[Genomics data]].


== Accelerating the search == <!--T:10-->
For the examples below, the file <tt>ref.fa</tt> will be used as the reference database in FASTA format, and <tt>seq.fa</tt> as the queries.


=== <tt>makeblastdb</tt> === <!--T:11-->
Before running a search, we must build the database. This can be a preprocessing job, where the other jobs are dependent on the completion of the <tt>makeblastdb</tt> job.
Here is an example of a submission script:
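The sketch below is one possible version; the module name, requested resources, and the assumption that <tt>ref.fa</tt> holds nucleotide sequences should be adjusted for your cluster and data.
{{File
  |name=makeblastdb.sh
  |lang="bash"
  |contents=
#!/bin/bash
#SBATCH --time=0:15:00
#SBATCH --mem=1G

# Module name may differ on your cluster.
module load gcc blast+

# Build a nucleotide database from the reference.
# The database files are named after the input, so the searches below use -db ref.fa.
makeblastdb -in ref.fa -dbtype nucl
}}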


=== Task array === <!--T:15-->
A BLAST search can greatly benefit from data parallelism by splitting the query file into multiple smaller queries and running these queries against the database.


==== Preprocessing ==== <!--T:16-->
In order to accelerate the search, the <tt>seq.fa</tt> file must be split into smaller chunks. These chunks should be <tt>1MB</tt> or greater; '''smaller''' chunks may hurt the parallel filesystem.


<!--T:17-->
'''Important''': To split a FASTA file correctly, it must be in its original format and not in multiline format. In other words, each sequence must be on a single line.


<!--T:18-->
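With single-line sequences, each record occupies exactly two lines, so the file can be split with an even line count per chunk, for example (the line count here is illustrative; pick one that gives chunks of at least <tt>1MB</tt> and matches the number of tasks in your array):
{{Command|split -d -a 1 -l 20000 seq.fa seq.fa.}}
This produces chunks named <tt>seq.fa.0</tt>, <tt>seq.fa.1</tt>, and so on.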


<!--T:20-->
This solution allows the scheduler to fit the smaller jobs from the array wherever resources are available in the cluster.
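A sketch of what <tt>blastn_array.sh</tt> could look like, assuming 10 query chunks named <tt>seq.fa.0</tt> to <tt>seq.fa.9</tt>; the module name and requested resources are illustrative only.
{{File
  |name=blastn_array.sh
  |lang="bash"
  |contents=
#!/bin/bash
#SBATCH --array=0-9
#SBATCH --cpus-per-task=1
#SBATCH --mem=2G
#SBATCH --time=1:00:00

# Module name may differ on your cluster.
module load gcc blast+

# Each task searches its own chunk of the queries against the database.
blastn -db ref.fa -query seq.fa.${SLURM_ARRAY_TASK_ID} -out seq.ref.${SLURM_ARRAY_TASK_ID}
}}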


<!--T:24-->
With the above submission script, we can submit our search and it will run after the creation of the database:
{{Command|sbatch --dependency{{=}}afterok:$(sbatch --parsable makeblastdb.sh) blastn_array.sh}}


<!--T:25-->
Once all the tasks from the array are done, the results can be concatenated with
{{Command|cat seq.ref.{0..9} > seq.ref}}
where the 10 files are concatenated into the <tt>seq.ref</tt> file.
This can be done from the login node or as a dependent job that runs upon completion of all the tasks from the array.


=== GNU Parallel === <!--T:26-->
<tt>GNU Parallel</tt> is a great tool to pack many small jobs into a single job and parallelize them.
This solution also alleviates the issue of having too many small files in a parallel filesystem, since it reads fixed-size chunks from <tt>seq.fa</tt> and processes them on one node with multiple cores.


<!--T:27-->
As an example, if your <tt>seq.fa</tt> file is <tt>3MB</tt>, you could read blocks of <tt>1MB</tt> and GNU Parallel would create 3 jobs, thus using 3 cores. If we had requested 10 cores for our task, we would have wasted 7 of them. Therefore, '''the block size is important'''. We can also let GNU Parallel decide, as done below.


<!--T:36-->
See also [[GNU Parallel#Handling_large_files|Handling large files]] on the GNU Parallel page.


<!--T:34-->
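A sketch of what <tt>blastn_gnu.sh</tt> could look like; the module names, requested resources, and the choice of <tt>--pipepart</tt> with one block per core are assumptions to adapt to your cluster.
{{File
  |name=blastn_gnu.sh
  |lang="bash"
  |contents=
#!/bin/bash
#SBATCH --nodes=1
#SBATCH --cpus-per-task=4
#SBATCH --mem-per-cpu=2G
#SBATCH --time=1:00:00

# Module names may differ on your cluster.
module load gcc blast+ gnu-parallel

# Read seq.fa in chunks that start on a FASTA header (--recstart '>'),
# one chunk per core (--block -1), and run one blastn process per chunk.
parallel --jobs ${SLURM_CPUS_PER_TASK} --pipepart --block -1 --recstart '>' -a seq.fa \
         'blastn -db ref.fa -query - -out seq.ref.{#}'
}}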


<!--T:33-->
Note: The file must not be compressed.


==== Job submission ==== <!--T:31-->
With the above submission script, we can submit our search and it will run after the database is created:
{{Command|sbatch --dependency{{=}}afterok:$(sbatch --parsable makeblastdb.sh) blastn_gnu.sh}}


=== Additional tips === <!--T:32-->
* If it fits into the node's local storage, copy your FASTA database to the local scratch space (<tt>$SLURM_TMPDIR</tt>).
* Reduce the number of hits returned (<code>-max_target_seqs, -max_hsps</code> can help), if that is reasonable for your research.
* Limit your hit list to near-identical hits with an <code>-evalue</code> filter, if that is reasonable for your research, as in the example below.
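For example, a search combining these options could look like this (the cutoff values are purely illustrative):
{{Command|blastn -db ref.fa -query seq.fa -out seq.ref -max_target_seqs 5 -max_hsps 1 -evalue 1e-30}}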


</translate>