

== User manual == <!--T:8-->
You can find more information on its arguments in the [https://www.ncbi.nlm.nih.gov/books/NBK279684/ user manual] or with
{{Command|blastn -help}}


== Databases == <!--T:9-->
Some frequently used sequence databases are installed on Compute Canada clusters.
You can find information on the BLAST databases available in [[Genomics data]].


== Accelerating the search == <!--T:10-->
For the examples below, the file <tt>ref.fa</tt> will be used as the reference database in FASTA format, and <tt>seq.fa</tt> as the queries.


=== <tt>makeblastdb</tt> === <!--T:11-->
Before running a search, we must build the database. This can be a preprocessing job, where the other jobs are dependent on the completion of the <tt>makeblastdb</tt> job.
Here is an example of a submission script:
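The sketch below is one possible version; the module name, requested resources, and the assumption that <tt>ref.fa</tt> holds nucleotide sequences should be adjusted for your cluster and data.
{{File
  |name=makeblastdb.sh
  |lang="bash"
  |contents=
#!/bin/bash
#SBATCH --time=0:15:00
#SBATCH --mem=1G

# Module name may differ on your cluster.
module load gcc blast+

# Build a nucleotide database from the reference.
# The database files are named after the input, so the searches below use -db ref.fa.
makeblastdb -in ref.fa -dbtype nucl
}}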


=== Task array === <!--T:15-->
A BLAST search can greatly benefit from data parallelism by splitting the query file into multiple smaller queries and running these queries against the database.


==== Preprocessing ==== <!--T:16-->
In order to accelerate the search, the <tt>seq.fa</tt> file must be split into smaller chunks. These chunks should be <tt>1MB</tt> or greater; '''smaller''' chunks may hurt the parallel filesystem.


<!--T:17-->
'''Important''': To split a FASTA file correctly, it must be in its original format and not in multiline format. In other words, each sequence must be on a single line.


<!--T:18-->
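With single-line sequences, each record occupies exactly two lines, so the file can be split with an even line count per chunk, for example (the line count here is illustrative; pick one that gives chunks of at least <tt>1MB</tt> and matches the number of tasks in your array):
{{Command|split -d -a 1 -l 20000 seq.fa seq.fa.}}
This produces chunks named <tt>seq.fa.0</tt>, <tt>seq.fa.1</tt>, and so on.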


<!--T:20-->
This solution allows the scheduler to fit the smaller jobs from the array wherever resources are available in the cluster.
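A sketch of what <tt>blastn_array.sh</tt> could look like, assuming 10 query chunks named <tt>seq.fa.0</tt> to <tt>seq.fa.9</tt>; the module name and requested resources are illustrative only.
{{File
  |name=blastn_array.sh
  |lang="bash"
  |contents=
#!/bin/bash
#SBATCH --array=0-9
#SBATCH --cpus-per-task=1
#SBATCH --mem=2G
#SBATCH --time=1:00:00

# Module name may differ on your cluster.
module load gcc blast+

# Each task searches its own chunk of the queries against the database.
blastn -db ref.fa -query seq.fa.${SLURM_ARRAY_TASK_ID} -out seq.ref.${SLURM_ARRAY_TASK_ID}
}}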


<!--T:24-->
With the above submission script, we can submit our search and it will run after the creation of the database:
{{Command|sbatch --dependency{{=}}afterok:$(sbatch --parsable makeblastdb.sh) blastn_array.sh}}


<!--T:25-->
Once all the tasks from the array are done, the results can be concatenated with
{{Command|cat seq.ref.{0..9} > seq.ref}}
where the 10 files are concatenated into the <tt>seq.ref</tt> file.
This can be done from the login node or as a dependent job that runs upon completion of all the tasks from the array.


=== GNU Parallel === <!--T:26-->
<tt>GNU Parallel</tt> is a great tool to pack many small jobs into a single job and parallelize them.
This solution also alleviates the issue of having too many small files in a parallel filesystem, since it reads fixed-size chunks from <tt>seq.fa</tt> and processes them on one node with multiple cores.


<!--T:27-->
As an example, if your <tt>seq.fa</tt> file is <tt>3MB</tt>, you could read blocks of <tt>1MB</tt> and GNU Parallel would create 3 jobs, thus using 3 cores. If we had requested 10 cores for our task, we would have wasted 7 of them. Therefore, '''the block size is important'''. We can also let GNU Parallel decide, as done below.


<!--T:36-->
See also [[GNU Parallel#Handling_large_files|Handling large files]] on the GNU Parallel page.


<!--T:34-->
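A sketch of what <tt>blastn_gnu.sh</tt> could look like; the module names, requested resources, and the choice of <tt>--pipepart</tt> with one block per core are assumptions to adapt to your cluster.
{{File
  |name=blastn_gnu.sh
  |lang="bash"
  |contents=
#!/bin/bash
#SBATCH --nodes=1
#SBATCH --cpus-per-task=4
#SBATCH --mem-per-cpu=2G
#SBATCH --time=1:00:00

# Module names may differ on your cluster.
module load gcc blast+ gnu-parallel

# Read seq.fa in chunks that start on a FASTA header (--recstart '>'),
# one chunk per core (--block -1), and run one blastn process per chunk.
parallel --jobs ${SLURM_CPUS_PER_TASK} --pipepart --block -1 --recstart '>' -a seq.fa \
         'blastn -db ref.fa -query - -out seq.ref.{#}'
}}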


<!--T:33-->
Note: The file must not be compressed.


==== Job submission ==== <!--T:31-->
With the above submission script, we can submit our search and it will run after the database is created:
{{Command|sbatch --dependency{{=}}afterok:$(sbatch --parsable makeblastdb.sh) blastn_gnu.sh}}


=== Additional tips === <!--T:32-->
* If it fits into the node's local storage, copy your FASTA database to the local scratch space (<tt>$SLURM_TMPDIR</tt>).
* Reduce the number of hits returned (<code>-max_target_seqs, -max_hsps</code> can help), if that is reasonable for your research.
* Limit your hit list to near-identical hits with an <code>-evalue</code> filter, if that is reasonable for your research, as in the example below.
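For example, a search combining these options could look like this (the cutoff values are purely illustrative):
{{Command|blastn -db ref.fa -query seq.fa -out seq.ref -max_target_seqs 5 -max_hsps 1 -evalue 1e-30}}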


</translate>