<translate>
BLAST ("Basic Local Alignment Search Tool") finds regions of similarity between biological sequences. The program compares nucleotide or protein sequences to sequence databases and calculates the statistical significance.
== User manual ==
More information on its arguments can be found in the [https://www.ncbi.nlm.nih.gov/books/NBK279684/ user manual], or by running:
{{Command|blastn -help}}
== Databases ==
Some frequently-used sequence databases are installed on Compute Canada clusters.
Information on the available BLAST databases can be found on the [[Genomics data]] page.
== Accelerating the search ==
In the examples below, the file <tt>ref.fa</tt> is used as the reference database in FASTA format, and <tt>seq.fa</tt> contains the queries we are searching for.
=== <tt>makeblastdb</tt> ===
Before running a search, we must build the database. This can be done in a preprocessing job, with the search jobs made dependent on the completion of the <tt>makeblastdb</tt> job.
Here is an example of a submission script:
{{File
|name=makeblastdb.sh
|lang="bash"
|contents=
#!/bin/bash
#SBATCH --account=def-<user>  # The account to use
#SBATCH --time=00:02:00       # The duration in HH:MM:SS format
#SBATCH --cpus-per-task=1     # The number of cores
#SBATCH --mem=512M            # Total memory for this task

module load gcc/7.3.0 blast+/2.7.1

# Create the nucleotide database based on `ref.fa`.
makeblastdb -in ref.fa -title reference -dbtype nucl -out ref.fa
}}
=== Task array ===
A BLAST search can benefit greatly from data parallelism: the query file is split into multiple smaller query files, and each one is run against the database.
==== Preprocess ====
To accelerate the search, the <tt>seq.fa</tt> file must be split into smaller chunks. The chunks should be <tt>1MB</tt> or greater; creating many '''smaller''' files may hurt the parallel file system.
'''Important''': To be split correctly, the FASTA file must have each sequence on a single line; it must not be in multiline (wrapped) format.
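If your FASTA file is in multiline (wrapped) format, it can be linearized first. The following is a minimal sketch using <tt>awk</tt>; the file names and sequence contents are illustrative:

```shell
# Illustrative input: a wrapped FASTA file (hypothetical contents)
printf '>s1\nACGT\nACGT\n>s2\nTTTT\n' > seq_multiline.fa

# Join the wrapped sequence lines so each record takes exactly two lines:
# the `>` header line followed by the full sequence on a single line.
awk '/^>/ {if (seq) print seq; print; seq=""; next}
     {seq = seq $0}
     END {if (seq) print seq}' seq_multiline.fa > seq.fa

cat seq.fa
# >s1
# ACGTACGT
# >s2
# TTTT
```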
Using the <tt>split</tt> utility:
{{Command|split -d -a 1 -l 2 seq.fa seq.fa.}}
will create one file per query (sequence): for 10 queries, 10 files named <tt>seq.fa.N</tt>, where <tt>N</tt> is in the range <tt>[0..9]</tt>.
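As a small illustration with three single-line sequences (hypothetical contents), <tt>split</tt> produces one two-line file per query:

```shell
# Hypothetical three-query FASTA file, one sequence per line
printf '>s1\nAAAA\n>s2\nCCCC\n>s3\nGGGG\n' > seq.fa

# -d: numeric suffixes; -a 1: one suffix digit; -l 2: two lines (header + sequence) per chunk
split -d -a 1 -l 2 seq.fa seq.fa.

ls seq.fa.?   # lists seq.fa.0, seq.fa.1 and seq.fa.2
```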
==== Job submission ====
Once our queries are split, we can create a task for each <tt>seq.fa.N</tt> file using a job array. The task id from the array maps to the name of the file containing the query to run.
This approach lets the scheduler fit the smaller jobs from the array wherever resources are available in the cluster.
{{File
|name=blastn_array.sh
|lang="bash"
|contents=
#!/bin/bash
#SBATCH --account=def-<user>  # The account to use
#SBATCH --time=00:02:00       # The duration in HH:MM:SS format of each task in the array
#SBATCH --cpus-per-task=1     # The number of cores for each task in the array
#SBATCH --mem-per-cpu=512M    # The memory per core for each task in the array
#SBATCH --array=0-9           # The number of tasks: 10

module load gcc/7.3.0 blast+/2.7.1

# Using the index of the current task, given by `$SLURM_ARRAY_TASK_ID`,
# run the corresponding query and write the result.
blastn -db ref.fa -query seq.fa.${SLURM_ARRAY_TASK_ID} > seq.ref.${SLURM_ARRAY_TASK_ID}
}}
With the above submission script, we can submit our BLAST search; it will run after the database has been created:
{{Command|sbatch --dependency{{=}}afterok:$(sbatch --parsable makeblastdb.sh) blastn_array.sh}}
Note that <tt>sbatch --parsable</tt> prints only the job id, which is what the <tt>--dependency</tt> option expects.
Once all the tasks from the array are done, the results can be concatenated using:
{{Command|cat seq.ref.{0..9} > seq.ref}}
where the 10 result files are concatenated into the <tt>seq.ref</tt> file.
This can be done from the login node or as a job dependent on the completion of all the tasks in the array.
=== GNU Parallel ===
<tt>GNU parallel</tt> is a great tool to pack many small tasks into a single job and run them in parallel.
This approach also alleviates the issue of having too many small files on a parallel file system: fixed-size chunks are read from <tt>seq.fa</tt> and processed on one node using multiple cores.
For example, if your <tt>seq.fa</tt> file is <tt>100MB</tt>, you could read blocks of <tt>10MB</tt> and use 10 cores.
{{File
|name=blastn_gnu.sh
|lang="bash"
|contents=
#!/bin/bash
#SBATCH --account=def-<user>  # The account to use
#SBATCH --time=00:02:00       # The duration in HH:MM:SS format
#SBATCH --cpus-per-task=4     # The number of cores
#SBATCH --mem-per-cpu=512M    # The memory per core

module load gcc/7.3.0 blast+/2.7.1

# Pass the whole file to GNU parallel, where
#   --jobs     number of cores to use, equal to $SLURM_CPUS_PER_TASK (the number of cores requested)
#   -k         keep the same order as the input
#   --block    use blocks of 1MB in size
#   --recstart the record start marker, here the sequence identifier `>`
#   --pipe     pass the blocks to the command on stdin
cat seq.fa {{!}} parallel --jobs $SLURM_CPUS_PER_TASK -k --block 1M --recstart '>' --pipe 'blastn -db ref.fa -query -' > seq.ref
}}
==== Job submission ====
With the above submission script, we can submit our BLAST search; it will run after the database has been created:
{{Command|sbatch --dependency{{=}}afterok:$(sbatch --parsable makeblastdb.sh) blastn_gnu.sh}}
=== Additional tips ===
* If it fits into node-local storage, copy your FASTA database to the local scratch space (<tt>$SLURM_TMPDIR</tt>).
* Lower the number of hits returned (<code>-max_target_seqs</code>, <code>-max_hsps</code> can help), if it is reasonable for your research.
* Limit your hit list to near-identical hits with an e-value filter (<code>-evalue</code>), if it is reasonable for your research.
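The first tip can be sketched as follows. On a compute node, Slurm sets <tt>$SLURM_TMPDIR</tt> automatically; in this self-contained sketch it is simulated with a temporary directory, and the database file is hypothetical:

```shell
# On a compute node, Slurm sets SLURM_TMPDIR to fast node-local storage;
# for this self-contained sketch we fall back to a temporary directory.
export SLURM_TMPDIR="${SLURM_TMPDIR:-$(mktemp -d)}"

# Hypothetical database file named as in the makeblastdb example (-out ref.fa)
printf '>r1\nACGTACGT\n' > ref.fa

# Stage every database file (ref.fa plus the index files makeblastdb produced)
# locally, then point blastn at the local copy, e.g.:
#   blastn -db "$SLURM_TMPDIR/ref.fa" -query seq.fa > seq.ref
cp ref.fa* "$SLURM_TMPDIR/"
ls "$SLURM_TMPDIR/ref.fa"
```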
</translate>