Samtools: Difference between revisions
No edit summary |
|||
(4 intermediate revisions by the same user not shown) | |||
Line 31: | Line 31: | ||
Usage: samtools <command> [options]}} | Usage: samtools <command> [options]}} | ||
== General usage == <!--T:3--> | |||
SAMtools provides various (sub)tools for manipulating alignments in the SAM/BAM format. While working with high-throughput sequencing data, you will probably have to deal with SAM/BAM file formats. The most common task would be to convert your SAM (Sequence Alignment/Map) files to BAM (binary form of SAM) files. BAM files are compressed versions of SAM files and are much smaller in size. BAM files are also easy to manipulate and are ideal for storing large nucleotide sequence alignments. In recent years, an even more compressed CRAM version has been widely preferred for storage. | |||
=== Converting a SAM file to a BAM file === | |||
Prior to converting, verify if your files carry a header section with character “@”. You can inspect the header section using the view command: | |||
{{Command|samtools view -H my_sample.sam}} | |||
To convert a SAM file with a header section to a bam file: | |||
{{Command|samtools view -bo my_sample.bam my_sample.sam}} | |||
Alternatively, samtools also supports the following format: | |||
{{Command|samtools view -b my_sample.sam -o my_sample.sam}} | |||
If header files are absent you can use the reference fasta file used to map the reads | |||
{{Command|samtools view -bt ref_seq.fa -o my_sample.bam my_sample.sam}} | |||
=== Sorting and indexing BAM files === | |||
You may also have to sort and index bam files for many downstream applications | |||
{{Commands | |||
|samtools sort my_sample.bam -o my_sample_sorted.bam | |||
|samtools index my_sample_sorted.bam}} | |||
You can also convert SAM files directly to a sorted BAM file using pipe {|} function | |||
<source> | |||
[name@server ~]$ samtools view -b my_sample.sam | samtools sort -o my_sample_sorted.bam | |||
</source> | |||
With sorted bam files and its respective index file with extension .bai you are all set for any downstream process such as variant calling, feature counts etc. | |||
=== Processing multiple files with multithreading and/or GNU parallel === | |||
You likely have more than one file and a simple job submission script with a forward loop should automate the above command lines to process multiple files. | |||
{{File | |||
|name=samtools.sh | |||
|lang="bash" | |||
|contents= | |||
#!/bin/bash | |||
#SBATCH --account=def-prof_username | |||
#SBATCH --cpus-per-task 1 | |||
#SBATCH --mem-per-cpu=4G | |||
#SBATCH --time=0-3:00 | |||
#SBATCH --output=%x-%j.out | |||
module load samtools/1.12 | |||
for files in *.sam | |||
do | |||
time samtools view -b ${files} {{!}} samtools sort -o ${files%.*}_mt_sorted.bam | |||
done | |||
}} | |||
Samtools typically runs on a single core by default but it’s possible to use multithreading and GNU parallel to improve the overall efficiency of your pipeline. | |||
You can multithread your task and improve CPU efficiency using -@ flag. | |||
{{File | |||
|name=samtools_multithreading.sh | |||
|lang="bash" | |||
|contents= | |||
#!/bin/bash | |||
#SBATCH –account=def-prof_username | |||
#SBATCH --cpus-per-task 2 | |||
#SBATCH --mem-per-cpu=4G # memory; default unit is megabytes | |||
#SBATCH --time=0-3:00 # time (DD-HH:MM) | |||
#SBATCH --output=%x-%j.out | |||
module load samtools/1.12 | |||
for files in *.sam | |||
do | |||
time samtools view -@ 4 -b ${files} {{!}} samtools sort -o ${files%.*}_mt_sorted.bam | |||
done | |||
}} | |||
You can also implement GNU parallel to process multiple sam files concurrently. Please note that GNU parallel is available by default on Cedar, Graham, Narval and Beluga | |||
{{File | |||
|name=samtools_gnuparallel.sh | |||
|lang="bash" | |||
|contents= | |||
#!/bin/bash | |||
#SBATCH --account=cc-debug | |||
#SBATCH --cpus-per-task 4 | |||
#SBATCH --mem-per-cpu=4G # memory; default unit is megabytes | |||
#SBATCH --time=0-3:00 # time (DD-HH:MM) | |||
#SBATCH --output=%x-%j.out | |||
module load samtools/1.12 | |||
find . -name "*.sam" {{!}} parallel -j 4 "time samtools view -bS {} {{!}} samtools sort -o {.}_mt_sorted.bam" | |||
}} | |||
The above script will execute <tt>view</tt> and <tt>sort</tt> on four sam files concurrently. If you have more input files, modify the number of cores and <tt>-j</tt>. Please note that if you ignore the -j flag your job will run on all available cpu cores. |
Latest revision as of 18:47, 15 November 2022
This is not a complete article: This is a draft, a work in progress that is intended to be published into an article, which may or may not be ready for inclusion in the main wiki. It should not necessarily be considered factual or authoritative.
Description
Samtools is a suite of programs for interacting with high-throughput sequencing data. It consists of three separate repositories:
- Samtools: Reading/writing/editing/indexing/viewing SAM/BAM/CRAM format
- BCFtools: Reading/writing BCF2/VCF/gVCF files and calling/filtering/summarising SNP and short indel sequence variants
- HTSlib: A C library for reading/writing high-throughput sequencing data
Samtools and BCFtools both use HTSlib internally, but these source packages contain their own copies of htslib so they can be built independently.
Note: This page does not cover all features of SAMtools. Please refer to Samtools for the complete list of all subtools.
Availability and Loading Module
To query all available version of SAMtools in our clusters:
[name@server ~]$ module spider samtools
You can load the version of your choice using module load. For example, to load samtools 1.12 use command
[name@server ~]$ module load samtools/1.12
To load the default version of samtools use command
[name@server ~]$ module load samtools
[name@server ~]$ samtools
Program: samtools (Tools for alignments in the SAM format)
Version: 1.12 (using htslib 1.12)
Usage: samtools <command> [options]
General usage
SAMtools provides various (sub)tools for manipulating alignments in the SAM/BAM format. While working with high-throughput sequencing data, you will probably have to deal with SAM/BAM file formats. The most common task would be to convert your SAM (Sequence Alignment/Map) files to BAM (binary form of SAM) files. BAM files are compressed versions of SAM files and are much smaller in size. BAM files are also easy to manipulate and are ideal for storing large nucleotide sequence alignments. In recent years, an even more compressed CRAM version has been widely preferred for storage.
Converting a SAM file to a BAM file
Prior to converting, verify if your files carry a header section with character “@”. You can inspect the header section using the view command:
[name@server ~]$ samtools view -H my_sample.sam
To convert a SAM file with a header section to a bam file:
[name@server ~]$ samtools view -bo my_sample.bam my_sample.sam
Alternatively, samtools also supports the following format:
[name@server ~]$ samtools view -b my_sample.sam -o my_sample.sam
If header files are absent you can use the reference fasta file used to map the reads
[name@server ~]$ samtools view -bt ref_seq.fa -o my_sample.bam my_sample.sam
Sorting and indexing BAM files
You may also have to sort and index bam files for many downstream applications
[name@server ~]$ samtools sort my_sample.bam -o my_sample_sorted.bam
[name@server ~]$ samtools index my_sample_sorted.bam
You can also convert SAM files directly to a sorted BAM file using pipe {|} function
[name@server ~]$ samtools view -b my_sample.sam | samtools sort -o my_sample_sorted.bam
With sorted bam files and its respective index file with extension .bai you are all set for any downstream process such as variant calling, feature counts etc.
Processing multiple files with multithreading and/or GNU parallel
You likely have more than one file and a simple job submission script with a forward loop should automate the above command lines to process multiple files.
#!/bin/bash
#SBATCH --account=def-prof_username
#SBATCH --cpus-per-task 1
#SBATCH --mem-per-cpu=4G
#SBATCH --time=0-3:00
#SBATCH --output=%x-%j.out
module load samtools/1.12
for files in *.sam
do
time samtools view -b ${files} | samtools sort -o ${files%.*}_mt_sorted.bam
done
Samtools typically runs on a single core by default but it’s possible to use multithreading and GNU parallel to improve the overall efficiency of your pipeline.
You can multithread your task and improve CPU efficiency using -@ flag.
#!/bin/bash
#SBATCH –account=def-prof_username
#SBATCH --cpus-per-task 2
#SBATCH --mem-per-cpu=4G # memory; default unit is megabytes
#SBATCH --time=0-3:00 # time (DD-HH:MM)
#SBATCH --output=%x-%j.out
module load samtools/1.12
for files in *.sam
do
time samtools view -@ 4 -b ${files} | samtools sort -o ${files%.*}_mt_sorted.bam
done
You can also implement GNU parallel to process multiple sam files concurrently. Please note that GNU parallel is available by default on Cedar, Graham, Narval and Beluga
#!/bin/bash
#SBATCH --account=cc-debug
#SBATCH --cpus-per-task 4
#SBATCH --mem-per-cpu=4G # memory; default unit is megabytes
#SBATCH --time=0-3:00 # time (DD-HH:MM)
#SBATCH --output=%x-%j.out
module load samtools/1.12
find . -name "*.sam" | parallel -j 4 "time samtools view -bS {} | samtools sort -o {.}_mt_sorted.bam"
The above script will execute view and sort on four sam files concurrently. If you have more input files, modify the number of cores and -j. Please note that if you ignore the -j flag your job will run on all available cpu cores.