BLAST

<translate>


<!--T:7-->
BLAST ("Basic Local Alignment Search Tool") finds regions of similarity between biological sequences. The program compares nucleotide or protein sequences against sequence databases and calculates the statistical significance of the matches.


== User manual == <!--T:8-->
You can find more information on its arguments in the [https://www.ncbi.nlm.nih.gov/books/NBK279684/ user manual]  
or with
{{Command|blastn -help}}


== Databases == <!--T:9-->
Some frequently used sequence databases are installed on the clusters in <code>/cvmfs/bio.data.computecanada.ca/content/databases/Core/blast_dbs/2022_03_23/</code>.
Examine that directory and its subdirectories, e.g. with


<!--T:38-->
{{Command|ls /cvmfs/bio.data.computecanada.ca/content/databases/Core/blast_dbs/2022_03_23/}}
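
To search against one of these pre-built databases, point the <code>-db</code> option at the database path. A minimal sketch, assuming a nucleotide database named <tt>nt</tt> is present in that directory (verify with the <code>ls</code> above):
{{Command|blastn -db /cvmfs/bio.data.computecanada.ca/content/databases/Core/blast_dbs/2022_03_23/nt -query seq.fa}}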


== Accelerating the search == <!--T:10-->
For the examples below, the file <tt>ref.fa</tt> will be used as the reference database in FASTA format, and <tt>seq.fa</tt> as the queries.
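
Both files are plain FASTA text: each record is a header line starting with <tt>></tt> followed by sequence lines. The identifiers and bases below are made up, purely to illustrate the format:
{{File
  |name=seq.fa
  |lang="text"
  |contents=
>query1 hypothetical query sequence
ATGCGTACGTTAGCCTAGGCAATTCCGA
>query2 hypothetical query sequence
GGCATTACGGATCCTTAGCAATGCGTAC
}}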
=== <tt>makeblastdb</tt> === <!--T:11-->
Before running a search, we must build the database. This can be done as a preprocessing job, with the search jobs made dependent on its completion.
Here is an example of a submission script:
{{File
  |name=makeblastdb.sh
  |lang="bash"
  |contents=
#!/bin/bash
 
<!--T:12-->
#SBATCH --account=def-<user>  # The account to use
#SBATCH --time=00:02:00       # The duration in HH:MM:SS format
#SBATCH --cpus-per-task=1     # The number of cores
#SBATCH --mem=512M            # Total memory for this task
 
<!--T:13-->
module load gcc/7.3.0 blast+/2.9.0
 
<!--T:14-->
# Create the nucleotide database based on `ref.fa`.
makeblastdb -in ref.fa -title reference -dbtype nucl -out ref.fa
}}
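
To confirm the database files were written correctly, you can print the database metadata with <tt>blastdbcmd</tt>, which ships with BLAST+:
{{Command|blastdbcmd -db ref.fa -info}}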
 
=== Task array === <!--T:15-->
A BLAST search can benefit greatly from data parallelism: splitting the query file into several smaller query files and running each of them against the database.
 
==== Preprocessing ==== <!--T:16-->
In order to accelerate the search, the <tt>seq.fa</tt> file must be split into smaller chunks. Each chunk should be at least <tt>1 MB</tt>; avoid creating many smaller files, as this can hurt the performance of the parallel filesystem.
 
<!--T:18-->
Using the <tt>faSplit</tt> utility:
{{Command|module load kentutils/20180716}}
{{Command|faSplit sequence seq.fa 10 seq}}
will create 10 files named <tt>seqN.fa</tt>, where <tt>N</tt> is in the range <tt>[0..9]</tt>, each containing roughly one tenth of the sequences.
 
==== Job submission ==== <!--T:19-->
Once the queries are split, we can create a task for each <tt>seqN.fa</tt> file using a job array, where the task id from the array maps to the file containing the queries to run.
 
<!--T:20-->
This solution allows the scheduler to fit the smaller jobs from the array where there are resources available in the cluster.
{{File
  |name=blastn_array.sh
  |lang="bash"
  |contents=
#!/bin/bash
 
<!--T:21-->
#SBATCH --account=def-<user>  # The account to use
#SBATCH --time=00:02:00       # The duration in HH:MM:SS format of each task in the array
#SBATCH --cpus-per-task=1     # The number of cores for each task in the array
#SBATCH --mem-per-cpu=512M    # The memory per core for each task in the array
#SBATCH --array=0-9           # The number of tasks: 10
 
<!--T:22-->
module load gcc/7.3.0 blast+/2.9.0
 
<!--T:23-->
# Using the index of the current task, given by `$SLURM_ARRAY_TASK_ID`, run the corresponding query file and write the result
blastn -db ref.fa -query seq${SLURM_ARRAY_TASK_ID}.fa > seq.ref.${SLURM_ARRAY_TASK_ID}
}}
 
<!--T:24-->
With the above submission script, we can submit our search; it will start only after the database has been created. Note that <code>--parsable</code> makes the inner <code>sbatch</code> print only the job id, which the dependency requires.
{{Command|sbatch --dependency{{=}}afterok:$(sbatch --parsable makeblastdb.sh) blastn_array.sh}}
 
<!--T:25-->
Once all the tasks from the array are done, the results can be concatenated using
{{Command|cat seq.ref.{0..9} > seq.ref}}
where the 10 result files are concatenated into <tt>seq.ref</tt>.
This can be done from the login node, or as a job that depends on the completion of all the tasks from the array, as sketched below.
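
As a sketch of the dependent-job variant: capture the array's job id with <code>--parsable</code>, then submit the concatenation with an <code>afterok</code> dependency, which is satisfied only once every task of the array has finished. The <code>--wrap</code> one-liner is illustrative; a small script works just as well.
{{Command|jobid{{=}}$(sbatch --parsable blastn_array.sh)}}
{{Command|sbatch --dependency{{=}}afterok:$jobid --wrap{{=}}"cat seq.ref.{0..9} > seq.ref"}}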
 
=== GNU Parallel === <!--T:26-->
<tt>GNU Parallel</tt> is a great tool for packing many small jobs into a single job and running them in parallel.
This approach also alleviates the problem of creating too many small files on a parallel filesystem: fixed-size chunks are read from <tt>seq.fa</tt> and processed on one node using multiple cores.
 
<!--T:27-->
As an example, if your <tt>seq.fa</tt> file is <tt>3 MB</tt>, you could read blocks of <tt>1 MB</tt> and GNU Parallel will create 3 jobs, thus using 3 cores. If we had requested 10 cores for our task, 7 of them would have been wasted. The '''block size is therefore important'''. We can also let GNU Parallel choose the block size, as done below.
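
If you would rather compute an explicit value for <code>--block</code>, one sketch (assuming it is run inside the job, where <code>$SLURM_CPUS_PER_TASK</code> is set) divides the query file size by the number of cores, so that each core receives roughly one block:
{{Command|echo $(( $(stat -c%s seq.fa) / SLURM_CPUS_PER_TASK ))}}
The resulting byte count can then be passed to <code>--block</code> in place of <code>-1</code>.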
 
<!--T:36-->
See also [[GNU Parallel#Handling_large_files|Handling large files]] in the GNU Parallel page.
 
==== Running with multiple cores on one node==== <!--T:34-->
{{File
  |name=blastn_gnu.sh
  |lang="bash"
  |contents=
#!/bin/bash
 
<!--T:28-->
#SBATCH --account=def-<user>  # The account to use
#SBATCH --time=00:02:00       # The duration in HH:MM:SS format
#SBATCH --cpus-per-task=4     # The number of cores
#SBATCH --mem-per-cpu=512M    # The memory per core
 
<!--T:29-->
module load gcc/7.3.0 blast+/2.9.0
 
<!--T:35-->
cmd='blastn -db ref.fa -query - '
 
<!--T:30-->
# Using the `::::` notation, give the sequences file to GNU Parallel,
# where
#   --jobs        number of cores to use; set to $SLURM_CPUS_PER_TASK (the number of cores requested)
#   --keep-order  keep the output in the same order as the input
#   --block -1    let GNU Parallel choose the block size and adapt it
#   --recstart    record start; here the FASTA sequence identifier `>`
#   --pipepart    pipe parts of the file to the command.
#                 `--pipepart` is faster than `--pipe` (which is limited to about 500 MB/s), as `--pipepart` can easily reach 5 GB/s according to Ole Tange.
# Results are redirected into `seq.ref`.
parallel --jobs $SLURM_CPUS_PER_TASK --keep-order --block -1 --recstart '>' --pipepart $cmd :::: seq.fa > seq.ref
}}
 
<!--T:33-->
Note: <code>--pipepart</code> requires a regular (seekable) file, so <tt>seq.fa</tt> must not be compressed.
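
If your queries are stored compressed (the file name <tt>seq.fa.gz</tt> here is hypothetical), decompress them first, for example to node-local storage:
{{Command|zcat seq.fa.gz > $SLURM_TMPDIR/seq.fa}}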
 
===== Job submission ===== <!--T:31-->
With the above submission script, we can submit our search; as before, it will run after the database has been created.
{{Command|sbatch --dependency{{=}}afterok:$(sbatch --parsable makeblastdb.sh) blastn_gnu.sh}}
 
 
=== Additional tips === <!--T:32-->
* If it fits into the node's local storage, copy your FASTA database to the local scratch space (<tt>$SLURM_TMPDIR</tt>), as sketched after this list.
* Reduce the number of hits returned (<code>-max_target_seqs, -max_hsps</code> can help), if it is reasonable for your research.
* Limit your hit list to nearly identical hits using <code>-evalue</code> filters, if it is reasonable for your research.
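
A minimal job script sketching these tips together; the filter values are illustrative rather than recommendations, and the database is assumed to have been built with <tt>makeblastdb -out ref.fa</tt> as above:
{{File
  |name=blastn_local.sh
  |lang="bash"
  |contents=
#!/bin/bash
#SBATCH --account=def-<user>  # The account to use
#SBATCH --time=00:02:00       # The duration in HH:MM:SS format
#SBATCH --cpus-per-task=1     # The number of cores
#SBATCH --mem=512M            # Total memory for this task

module load gcc/7.3.0 blast+/2.9.0

# Copy the formatted database (ref.fa and its index files) to node-local storage
cp ref.fa* $SLURM_TMPDIR/

# Search against the local copy, keeping only near-identical hits and few targets
blastn -db $SLURM_TMPDIR/ref.fa -query seq.fa -evalue 1e-30 -max_target_seqs 5 -max_hsps 1 > seq.ref
}}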


</translate>
