Samtools

From Alliance Doc
Jump to navigation Jump to search
This site replaces the former Compute Canada documentation site, and is now being managed by the Digital Research Alliance of Canada.

Ce site remplace l'ancien site de documentation de Calcul Canada et est maintenant géré par l'Alliance de recherche numérique du Canada.


This article is a draft

This is not a complete article: This is a draft, a work in progress that is intended to be published into an article, which may or may not be ready for inclusion in the main wiki. It should not necessarily be considered factual or authoritative.



Description

Samtools is a suite of programs for interacting with high-throughput sequencing data. It consists of three separate repositories:

  • Samtools: Reading/writing/editing/indexing/viewing SAM/BAM/CRAM format
  • BCFtools: Reading/writing BCF2/VCF/gVCF files and calling/filtering/summarising SNP and short indel sequence variants
  • HTSlib: A C library for reading/writing high-throughput sequencing data

Samtools and BCFtools both use HTSlib internally, but these source packages contain their own copies of htslib so they can be built independently.

Note: This page does not cover all features of SAMtools. Please refer to Samtools for the complete list of all subtools.

Availability and Loading Module

To query all available version of SAMtools in our clusters:

Question.png
[name@server ~]$ module spider samtools

You can load the version of your choice using module load. For example, to load samtools 1.12 use command

Question.png
[name@server ~]$ module load samtools/1.12

To load the default version of samtools use command

[name@server ~]$ module load samtools
[name@server ~]$ samtools

Program: samtools (Tools for alignments in the SAM format)
Version: 1.12 (using htslib 1.12)

Usage:   samtools <command> [options]

General usage

SAMtools provides various (sub)tools for manipulating alignments in the SAM/BAM format. While working with high-throughput sequencing data, you will probably have to deal with SAM/BAM file formats. The most common task would be to convert your SAM (Sequence Alignment/Map) files to BAM (binary form of SAM) files. BAM files are compressed versions of SAM files and are much smaller in size. BAM files are also easy to manipulate and are ideal for storing large nucleotide sequence alignments. In recent years, an even more compressed CRAM version has been widely preferred for storage.

Converting a SAM file to a BAM file

Prior to converting, verify if your files carry a header section with character “@”. You can inspect the header section using the view command:

Question.png
[name@server ~]$ samtools view -H my_sample.sam

To convert a SAM file with a header section to a bam file:

Question.png
[name@server ~]$ samtools view -bo my_sample.bam my_sample.sam

Alternatively, samtools also supports the following format:

Question.png
[name@server ~]$ samtools view -b my_sample.sam -o my_sample.sam

If header files are absent you can use the reference fasta file used to map the reads

Question.png
[name@server ~]$ samtools view -bt ref_seq.fa -o my_sample.bam my_sample.sam

Sorting and indexing BAM files

You may also have to sort and index bam files for many downstream applications

[name@server ~]$ samtools sort my_sample.bam -o my_sample_sorted.bam
[name@server ~]$ samtools index my_sample_sorted.bam


You can also convert SAM files directly to a sorted BAM file using pipe {|} function

[name@server ~]$ samtools view -b my_sample.sam | samtools sort -o my_sample_sorted.bam

With sorted bam files and its respective index file with extension .bai you are all set for any downstream process such as variant calling, feature counts etc.

Processing multiple files with multithreading and/or GNU parallel

You likely have more than one file and a simple job submission script with a forward loop should automate the above command lines to process multiple files.


File : samtools.sh

#!/bin/bash
#SBATCH --account=def-prof_username              
#SBATCH --cpus-per-task 1
#SBATCH --mem-per-cpu=4G      
#SBATCH --time=0-3:00 
#SBATCH --output=%x-%j.out

module load samtools/1.12

for files in *.sam
do
time samtools view -b ${files} | samtools sort -o ${files%.*}_mt_sorted.bam
done


Samtools typically runs on a single core by default but it’s possible to use multithreading and GNU parallel to improve the overall efficiency of your pipeline.

You can multithread your task and improve CPU efficiency using -@ flag.


File : samtools_multithreading.sh

#!/bin/bash
#SBATCH –account=def-prof_username
#SBATCH --cpus-per-task 2
#SBATCH --mem-per-cpu=4G  	# memory; default unit is megabytes
#SBATCH --time=0-3:00       	# time (DD-HH:MM)
#SBATCH --output=%x-%j.out

module load samtools/1.12

for files in *.sam
do
time samtools view -@ 4 -b ${files} | samtools sort -o ${files%.*}_mt_sorted.bam
done


You can also implement GNU parallel to process multiple sam files concurrently. Please note that GNU parallel is available by default on Cedar, Graham, Narval and Beluga


File : samtools_gnuparallel.sh

#!/bin/bash
#SBATCH --account=cc-debug
#SBATCH --cpus-per-task 4
#SBATCH --mem-per-cpu=4G  	# memory; default unit is megabytes
#SBATCH --time=0-3:00       	# time (DD-HH:MM)
#SBATCH --output=%x-%j.out

module load samtools/1.12

find . -name "*.sam" | parallel -j 4 "time samtools view -bS {} | samtools sort -o {.}_mt_sorted.bam"


The above script will execute view and sort on four sam files concurrently. If you have more input files, modify the number of cores and -j. Please note that if you ignore the -j flag your job will run on all available cpu cores.