rsnt_translations
56,430
edits
m (Fixed Mb unit for MB) |
No edit summary |
||
Line 8: | Line 8: | ||
== User manual == <!--T:8--> | == User manual == <!--T:8--> | ||
You can find more information its arguments in the [https://www.ncbi.nlm.nih.gov/books/NBK279684/ user manual] | You can find more information on its arguments in the [https://www.ncbi.nlm.nih.gov/books/NBK279684/ user manual] | ||
or with | |||
{{Command|blastn -help}} | {{Command|blastn -help}} | ||
== Databases == <!--T:9--> | == Databases == <!--T:9--> | ||
Some frequently | Some frequently used sequence databases are installed on Compute Canada clusters. | ||
You can find information on the BLAST databases available | You can find information on the BLAST databases available in [[Genomics data]]. | ||
== Accelerating the search == <!--T:10--> | == Accelerating the search == <!--T:10--> | ||
For the examples below, the file <tt>ref.fa</tt> will be used as the reference database in FASTA format and <tt>seq.fa</tt> as the queries | For the examples below, the file <tt>ref.fa</tt> will be used as the reference database in FASTA format, and <tt>seq.fa</tt> as the queries. | ||
=== <tt>makeblastdb</tt> === <!--T:11--> | === <tt>makeblastdb</tt> === <!--T:11--> | ||
Before running a search, we must build the database. | Before running a search, we must build the database. This can be a preprocessing job, where the other jobs are dependent on the completion of the <tt>makeblastdb</tt> job. | ||
Here is an example of a submission script: | Here is an example of a submission script: | ||
{{File | {{File | ||
Line 43: | Line 43: | ||
=== Task array === <!--T:15--> | === Task array === <!--T:15--> | ||
BLAST search can greatly benefit from data parallelism by splitting the query file into multiple queries and running these queries against the database. | |||
==== | ==== Preprocessing ==== <!--T:16--> | ||
In order to accelerate the search, the <tt>seq.fa</tt> file must be split into smaller chunks. | In order to accelerate the search, the <tt>seq.fa</tt> file must be split into smaller chunks. Each chunk should be at least <tt>1MB</tt>; '''smaller''' chunks may hurt the parallel filesystem. | ||
<!--T:17--> | <!--T:17--> | ||
'''Important''': To correctly split a FASTA format file, it must be in its original format and not multiline format. In other words, the sequence must be on | '''Important''': To correctly split a FASTA format file, it must be in its original format and not in multiline format. In other words, the sequence must be on a single line. | ||
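If a FASTA file has wrapped sequences, it can be converted to one sequence per line before splitting. The following is a minimal sketch using <tt>awk</tt>; the function name <tt>fasta_unwrap</tt> and the output file name in the usage comment are hypothetical and are not part of BLAST or of the scripts on this page.

```shell
# Hypothetical helper: join wrapped sequence lines so that every record
# becomes one header line followed by exactly one sequence line.
fasta_unwrap() {
  awk '
    /^>/ {                      # header line starts a new record
      if (seq != "") print seq  # flush the previous sequence, if any
      print                     # print the header unchanged
      seq = ""
      next
    }
    { seq = seq $0 }            # append a wrapped sequence line
    END { if (seq != "") print seq }
  ' "$1"
}

# Possible usage (output name is an assumption):
#   fasta_unwrap seq.fa > seq.single.fa
```

The sketch only reformats the sequences; headers and record order are left as they are.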
<!--T:18--> | <!--T:18--> | ||
Line 60: | Line 60: | ||
<!--T:20--> | <!--T:20--> | ||
This solution allows the scheduler to fit the smaller jobs from the array where there | This solution allows the scheduler to fit the smaller jobs from the array where there are resources available in the cluster. | ||
{{File | {{File | ||
|name=blastn_array.sh | |name=blastn_array.sh | ||
Line 83: | Line 83: | ||
<!--T:24--> | <!--T:24--> | ||
With the above submission script, we can submit our | With the above submission script, we can submit our search and it will run after the creation of the database: | ||
{{Command|sbatch --dependency{{=}}afterok:$(sbatch makeblastdb.sh) blastn_array.sh}} | {{Command|sbatch --dependency{{=}}afterok:$(sbatch makeblastdb.sh) blastn_array.sh}} | ||
<!--T:25--> | <!--T:25--> | ||
Once all the tasks from the array are done, the results can be concatenated using | Once all the tasks from the array are done, the results can be concatenated using | ||
{{Command|cat seq.ref.{0..9} > seq.ref}} | {{Command|cat seq.ref.{0..9} > seq.ref}} | ||
where the 10 files will be | where the 10 files will be concatenated into the <tt>seq.ref</tt> file. | ||
This could be done from the login node or as a dependent job upon completion of all the tasks from the array. | This could be done from the login node or as a dependent job upon completion of all the tasks from the array. | ||
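If a task from the array failed, a plain <tt>cat</tt> would silently produce an incomplete result. The following is a minimal sketch of a more cautious concatenation; the function <tt>concat_parts</tt> is hypothetical, and it assumes the partial results are named <tt>PREFIX.0</tt> to <tt>PREFIX.LAST</tt> as in the command above.

```shell
# Hypothetical helper: concatenate PREFIX.0 .. PREFIX.LAST into OUT,
# and refuse to write OUT if any part is missing.
concat_parts() {
  prefix=$1
  last=$2
  out=$3
  i=0
  : > "$out.tmp"                       # start from an empty temporary file
  while [ "$i" -le "$last" ]; do
    part="$prefix.$i"
    if [ ! -f "$part" ]; then
      echo "missing part: $part" >&2
      rm -f "$out.tmp"
      return 1
    fi
    cat "$part" >> "$out.tmp"
    i=$((i + 1))
  done
  mv "$out.tmp" "$out"                 # publish only a complete result
}

# Possible usage for the 10 files above:
#   concat_parts seq.ref 9 seq.ref
```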
=== GNU Parallel === <!--T:26--> | === GNU Parallel === <!--T:26--> | ||
<tt>GNU parallel</tt> is a great tool to pack many small jobs into | <tt>GNU parallel</tt> is a great tool to pack many small jobs into a single job, which it can then parallelize. | ||
This solution helps alleviate the issue of too many small files | This solution helps alleviate the issue of too many small files in a parallel filesystem by querying fixed-size chunks from <tt>seq.fa</tt> and running on one node and multiple cores. | ||
<!--T:27--> | <!--T:27--> | ||
As an example, if your <tt>seq.fa</tt> file is <tt>3MB</tt>, you could read | As an example, if your <tt>seq.fa</tt> file is <tt>3MB</tt>, you could read blocks of <tt>1MB</tt> and GNU Parallel will create 3 jobs, thus using 3 cores. If we had requested 10 cores in our task, 7 of them would have been wasted. Therefore, '''the block size is important'''. We can also let GNU Parallel decide, as done below. | ||
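The arithmetic behind this example can be sketched as a ceiling division of the file size by the block size. The function <tt>blocks_needed</tt> is hypothetical, and the result is only an estimate: GNU Parallel cuts blocks at record boundaries, so the real number of jobs may differ slightly.

```shell
# Hypothetical helper: estimate how many blocks (and so how many jobs)
# a file of SIZE bytes yields when read in blocks of BLOCK bytes.
blocks_needed() {
  echo $(( ($1 + $2 - 1) / $2 ))   # integer ceiling of SIZE / BLOCK
}

# A 3MB file read in 1MB blocks gives 3 jobs, so requesting 10 cores
# would leave 7 of them idle.
blocks_needed 3145728 1048576
```

Comparing this estimate with the number of cores requested is a quick way to check that a chosen block size keeps all cores busy.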
<!--T:36--> | <!--T:36--> | ||
See also [[GNU Parallel#Handling_large_files| | See also [[GNU Parallel#Handling_large_files|Handling large files]] in the GNU Parallel page. | ||
<!--T:34--> | <!--T:34--> | ||
Line 135: | Line 135: | ||
<!--T:33--> | <!--T:33--> | ||
Note: | Note: The file must not be compressed. | ||
==== Job submission ==== <!--T:31--> | ==== Job submission ==== <!--T:31--> | ||
With the above submission script, we can submit our | With the above submission script, we can submit our search and it will run after the database is created. | ||
{{Command|sbatch --dependency{{=}}afterok:$(sbatch makeblastdb.sh) blastn_gnu.sh}} | {{Command|sbatch --dependency{{=}}afterok:$(sbatch makeblastdb.sh) blastn_gnu.sh}} | ||
=== Additional tips === <!--T:32--> | === Additional tips === <!--T:32--> | ||
* If it fits into the node | * If it fits into the node's local storage, copy your FASTA database to the local scratch space (<tt>$SLURM_TMPDIR</tt>). | ||
* | * Reduce the number of hits returned (<code>-max_target_seqs</code> and <code>-max_hsps</code> can help), if it is reasonable for your research. | ||
* Limit your hit list | * Limit your hit list to near-identical hits using <code>-evalue</code> filters, if it is reasonable for your research. | ||
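The first tip can be sketched as follows. This assumes that <tt>makeblastdb</tt> wrote its database files next to <tt>ref.fa</tt> with names starting with <tt>ref.fa.</tt>; the function <tt>stage_db</tt> is hypothetical, and the option values in the usage comment are placeholders to adapt to your research.

```shell
# Hypothetical helper: copy the database files DB.* into directory DEST
# and print the database path to pass to the -db option.
stage_db() {
  db=$1
  dest=$2
  cp "$db".* "$dest"/ || return 1    # fails if no database files match
  echo "$dest/$(basename "$db")"
}

# Possible usage inside a job (values are placeholders, not recommendations):
#   db=$(stage_db ref.fa "$SLURM_TMPDIR")
#   blastn -db "$db" -query seq.fa -evalue 1e-10 -max_target_seqs 5 -out seq.ref
```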
</translate> | </translate> |