GATK: Difference between revisions

Marked this version for translation
(remove draft tag, mark for translation)
(Marked this version for translation)
Line 3: Line 3:
<translate>
<translate>


<!--T:1-->
The '''Genome Analysis Toolkit (GATK)''' is a set of bioinformatic tools for
The '''Genome Analysis Toolkit (GATK)''' is a set of bioinformatic tools for
analyzing high-throughput sequencing (HTS) and variant call format (VCF)
analyzing high-throughput sequencing (HTS) and variant call format (VCF)
Line 10: Line 11:
for genomics research.
for genomics research.


==Availability and loading module ==
==Availability and loading module == <!--T:2-->
On all Compute Canada clusters (Graham, Cedar, Beluga), we provide
On all Compute Canada clusters (Graham, Cedar, Beluga), we provide
different versions of GATK. To access the version information you can use
different versions of GATK. To access the version information you can use
the [https://docs.computecanada.ca/wiki/Utiliser_des_modules/en module command]:
the [https://docs.computecanada.ca/wiki/Utiliser_des_modules/en module command]:


<!--T:3-->
{{Commands
{{Commands
|module spider gatk
|module spider gatk
}}
}}


<!--T:4-->
which will give you some information about GATK and the versions:
which will give you some information about GATK and the versions:
<pre>
<pre>
Line 30: Line 33:
</pre>
</pre>


<!--T:5-->
More specific information about any given version can be accessed with:
More specific information about any given version can be accessed with:


<!--T:6-->
{{Commands
{{Commands
|module spider gatk/4.1.2.0
|module spider gatk/4.1.2.0
}}
}}


<!--T:7-->
As you can see, this module has only the nixpkgs/16.09 module as a prerequisite,
As you can see, this module has only the nixpkgs/16.09 module as a prerequisite,
so it can be loaded with:
so it can be loaded with:


<!--T:8-->
{{Commands
{{Commands
|module load nixpkgs/16.09 gatk/4.1.2.0
|module load nixpkgs/16.09 gatk/4.1.2.0
}}
}}


<!--T:9-->
Or, given that nixpkgs/16.09 is loaded by default, simply:
Or, given that nixpkgs/16.09 is loaded by default, simply:


<!--T:10-->
{{Commands
{{Commands
|module load gatk/4.1.2.0
|module load gatk/4.1.2.0
}}
}}


==General usage ==
==General usage == <!--T:11-->
Recent versions of GATK (>=4.0.0.0) provide a wrapper over the Java executables (.jar). Loading a GATK module automatically sets most of the environment variables you need to run GATK successfully.
Recent versions of GATK (>=4.0.0.0) provide a wrapper over the Java executables (.jar). Loading a GATK module automatically sets most of the environment variables you need to run GATK successfully.


<!--T:12-->
The module spider command also shows usage information and examples for that wrapper:
The module spider command also shows usage information and examples for that wrapper:
<pre>
<pre>
Line 64: Line 74:
</pre>
</pre>


<!--T:13-->
As you may have noticed, some arguments are passed directly to Java through '''--java-options''', such as the maximum heap memory (<code>-Xmx8G</code> in the example, which allows the Java virtual machine to use up to 8 GB of heap memory). We recommend that you '''always''' use <code>-DGATK_STACKTRACE_ON_USER_EXCEPTION=true</code>, since it gives you more information if the program fails. This information can help you, or us if you need support, to solve the issue.
As you may have noticed, some arguments are passed directly to Java through '''--java-options''', such as the maximum heap memory (<code>-Xmx8G</code> in the example, which allows the Java virtual machine to use up to 8 GB of heap memory). We recommend that you '''always''' use <code>-DGATK_STACKTRACE_ON_USER_EXCEPTION=true</code>, since it gives you more information if the program fails. This information can help you, or us if you need support, to solve the issue.
Note that all options passed to <code>--java-options</code> must be enclosed in quotation marks.
Note that all options passed to <code>--java-options</code> must be enclosed in quotation marks.
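
As a minimal sketch of the quoting rule above, the following script assembles the JVM options into one string and prints the resulting command instead of running it. The helper name <code>build_gatk_cmd</code> and the file names are placeholders chosen for illustration; they are not part of GATK or of the module.

```shell
#!/bin/bash
# Hypothetical helper: print a gatk command line whose JVM options are
# wrapped in a single pair of quotation marks.
build_gatk_cmd() {
    local heap="$1"; shift
    local java_opts="-Xmx${heap} -DGATK_STACKTRACE_ON_USER_EXCEPTION=true"
    # Everything after the heap size is the tool name and its arguments.
    printf 'gatk --java-options "%s" %s\n' "$java_opts" "$*"
}

# Placeholder file names; the command is printed, not executed.
build_gatk_cmd 8G MarkDuplicates -I input.bam -O marked.bam -M metrics.txt
```

Printing the command first is a convenient way to check the quoting before submitting a job.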


===Earlier versions than GATK 4 ===
===Earlier versions than GATK 4 === <!--T:14-->
Earlier versions of GATK do not have the '''gatk''' command. Instead, one has to call the jar file:
Earlier versions of GATK do not have the '''gatk''' command. Instead, one has to call the jar file:


<!--T:15-->
<pre>
<pre>
java -jar GenomeAnalysisTK.jar PROGRAM OPTIONS
java -jar GenomeAnalysisTK.jar PROGRAM OPTIONS
</pre>
</pre>


<!--T:16-->
However, java must be given the full path to GenomeAnalysisTK.jar. On Compute Canada systems, the environment variables <code>$EBROOTPICARD</code> for Picard (included in GATK >= 4) and <code>$EBROOTGATK</code> for GATK point to the location of the jar file, so the appropriate way to call GATK <= 3 is:
However, java must be given the full path to GenomeAnalysisTK.jar. On Compute Canada systems, the environment variables <code>$EBROOTPICARD</code> for Picard (included in GATK >= 4) and <code>$EBROOTGATK</code> for GATK point to the location of the jar file, so the appropriate way to call GATK <= 3 is:


<!--T:17-->
<pre>
<pre>
module load gatk/3.8
module load gatk/3.8
Line 81: Line 95:
</pre>
</pre>


<!--T:18-->
You can find the specific usage of GATK <= 3 in the [https://gatkforums.broadinstitute.org/gatk/categories/gatk-guide GATK3 guide].
You can find the specific usage of GATK <= 3 in the [https://gatkforums.broadinstitute.org/gatk/categories/gatk-guide GATK3 guide].


===Multicore usage ===
===Multicore usage === <!--T:19-->
Most GATK (>=4) tools are not multicore by default. This means that you should request only one core when calling these kinds of tools. Some tools use threads in some of their computations (e.g. <code>Mutect2</code> has the <code>--native-pair-hmm-threads</code> option), so you can request more CPUs for them (most of them use up to 4 threads). GATK4 does, however, provide '''some''' [https://gatk.broadinstitute.org/hc/en-us/articles/360035890591-Spark Spark commands]:
Most GATK (>=4) tools are not multicore by default. This means that you should request only one core when calling these kinds of tools. Some tools use threads in some of their computations (e.g. <code>Mutect2</code> has the <code>--native-pair-hmm-threads</code> option), so you can request more CPUs for them (most of them use up to 4 threads). GATK4 does, however, provide '''some''' [https://gatk.broadinstitute.org/hc/en-us/articles/360035890591-Spark Spark commands]:
<pre>
<pre>
Line 89: Line 104:
Tools that can use Spark generally have a note to that effect in their respective Tool Doc.
Tools that can use Spark generally have a note to that effect in their respective Tool Doc.


<!--T:20-->
- Some GATK tools exist in distinct Spark-capable and non-Spark-capable versions
- Some GATK tools exist in distinct Spark-capable and non-Spark-capable versions
The "sparkified" versions have the suffix "Spark" at the end of their names. Many of these are still experimental; down the road we plan to consolidate them so that there will be only one version per tool.
The "sparkified" versions have the suffix "Spark" at the end of their names. Many of these are still experimental; down the road we plan to consolidate them so that there will be only one version per tool.


<!--T:21-->
- Some GATK tools only exist in a Spark-capable version
- Some GATK tools only exist in a Spark-capable version
Those tools don't have the "Spark" suffix.
Those tools don't have the "Spark" suffix.
</pre>
</pre>


<!--T:22-->
For the commands that do use Spark, you can request multiple CPUs. '''NOTE:''' please provide the exact number of CPUs to the Spark command: if you requested 10 CPUs, use <code>--spark-master local[10]</code> instead of <code>--spark-master local[*]</code>. If you want to scale the Spark calls across a multi-node Spark cluster, you first have to [https://docs.computecanada.ca/wiki/Apache_Spark/en deploy a Spark cluster] and then set the appropriate variables in the GATK4 Spark command.
For the commands that do use Spark, you can request multiple CPUs. '''NOTE:''' please provide the exact number of CPUs to the Spark command: if you requested 10 CPUs, use <code>--spark-master local[10]</code> instead of <code>--spark-master local[*]</code>. If you want to scale the Spark calls across a multi-node Spark cluster, you first have to [https://docs.computecanada.ca/wiki/Apache_Spark/en deploy a Spark cluster] and then set the appropriate variables in the GATK4 Spark command.
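
One way to keep the Spark master string in step with the CPU request is to derive it from the scheduler's environment. The sketch below assumes the job runs under SLURM with <code>--cpus-per-task</code>, so that <code>SLURM_CPUS_PER_TASK</code> is set; the helper name <code>spark_master</code>, the choice of <code>MarkDuplicatesSpark</code> and the file names are illustrative only. The command is printed rather than executed.

```shell
#!/bin/bash
# Hypothetical helper: build the local Spark master string for N CPUs.
spark_master() {
    local ncpus="${1:-1}"
    printf 'local[%s]' "$ncpus"
}

# SLURM sets SLURM_CPUS_PER_TASK when --cpus-per-task is requested
# (assumption: the job was submitted that way); fall back to 1 otherwise.
NCPUS="${SLURM_CPUS_PER_TASK:-1}"

# Placeholder file names; printed for inspection, not executed.
echo gatk MarkDuplicatesSpark -I input.bam -O marked.bam \
    --spark-master "$(spark_master "$NCPUS")"
```

When running the real command, keep the <code>local[N]</code> value quoted so that the shell does not treat the square brackets as a glob pattern.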




==Frequently asked questions ==
==Frequently asked questions == <!--T:23-->
===How do I add a read group (RG) tag in my bam file? ===
===How do I add a read group (RG) tag in my bam file? ===
Assuming that you want to add a read group called '''tag''' to the input BAM file, you can use the GATK/Picard command [https://gatk.broadinstitute.org/hc/en-us/articles/360037226472-AddOrReplaceReadGroups-Picard- AddOrReplaceReadGroups]:
Assuming that you want to add a read group called '''tag''' to the input BAM file, you can use the GATK/Picard command [https://gatk.broadinstitute.org/hc/en-us/articles/360037226472-AddOrReplaceReadGroups-Picard- AddOrReplaceReadGroups]:
Line 115: Line 133:
This assumes that your input file is sorted by coordinate, and it will generate an index along with the annotated output (<code>--CREATE_INDEX true</code>).
This assumes that your input file is sorted by coordinate, and it will generate an index along with the annotated output (<code>--CREATE_INDEX true</code>).


===How do I deal with <code>java.lang.OutOfMemoryError: Java heap space</code> ===
===How do I deal with <code>java.lang.OutOfMemoryError: Java heap space</code> === <!--T:24-->
Oftentimes the subprograms of GATK require more memory to process your files. If you were not using the <code>-Xmx</code> option, add it to the <code>--java-options</code>. For example, let's imagine that you run the following command:
Oftentimes the subprograms of GATK require more memory to process your files. If you were not using the <code>-Xmx</code> option, add it to the <code>--java-options</code>. For example, let's imagine that you run the following command:
<pre>
<pre>
Line 124: Line 142:
</pre>
</pre>


<!--T:25-->
But it gives you the <code>java.lang.OutOfMemoryError: Java heap space</code> error. Try:
But it gives you the <code>java.lang.OutOfMemoryError: Java heap space</code> error. Try:


<!--T:26-->
<pre>
<pre>
gatk MarkDuplicates \
gatk MarkDuplicates \
Line 134: Line 154:
</pre>
</pre>


<!--T:27-->
If it fails again, keep increasing the memory until you find the required memory for your particular data set. If you are using any of our systems, '''remember to request enough memory for this'''.
If it fails again, keep increasing the memory until you find the required memory for your particular data set. If you are using any of our systems, '''remember to request enough memory for this'''.
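
Since the JVM heap has to fit inside the memory requested for the job, it can help to derive <code>-Xmx</code> from the request instead of hard-coding it. The sketch below assumes a SLURM job submitted with <code>--mem</code>, so that <code>SLURM_MEM_PER_NODE</code> (in MB) is set; the helper name <code>heap_mb</code>, the 2048 MB headroom and the file names are assumptions for illustration, not recommended values.

```shell
#!/bin/bash
# Hypothetical helper: JVM heap size in MB, leaving headroom for the
# memory the JVM uses outside the heap.
heap_mb() {
    local job_mb="$1"
    local overhead_mb="${2:-2048}"   # assumed headroom; adjust as needed
    echo $(( job_mb - overhead_mb ))
}

# SLURM_MEM_PER_NODE is set by SLURM when --mem is used (assumption);
# the 16384 fallback only lets this sketch run outside a job.
JOB_MB="${SLURM_MEM_PER_NODE:-16384}"

# Placeholder file names; printed for inspection, not executed.
echo gatk --java-options "\"-Xmx$(heap_mb "$JOB_MB")m\"" \
    MarkDuplicates -I input.bam -O marked.bam -M metrics.txt
```

If the heap needs to grow, increase the memory request of the job and the heap follows automatically.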


<!--T:28-->
If you are interested in knowing more about java heap space, you can start [https://plumbr.io/outofmemoryerror/java-heap-space here].
If you are interested in knowing more about java heap space, you can start [https://plumbr.io/outofmemoryerror/java-heap-space here].


===Increasing the heap memory does not fix the <code>java.lang.OutOfMemoryError: Java heap space</code> ===
===Increasing the heap memory does not fix the <code>java.lang.OutOfMemoryError: Java heap space</code> === <!--T:29-->
There are cases in which the memory issue cannot be fixed by increasing the heap memory. This often happens with non-model organisms whose reference contains too many scaffolds. In this case it is recommended to remove small scaffolds and create subsets of your reference. This implies that you have to map multiple times and run the pipelines on each of the subsets. '''This approach does not work in all pipelines''', so review your results carefully. GATK is designed with the human genome in mind, and therefore other organisms will require adjustments to many parameters and pipelines.
There are cases in which the memory issue cannot be fixed by increasing the heap memory. This often happens with non-model organisms whose reference contains too many scaffolds. In this case it is recommended to remove small scaffolds and create subsets of your reference. This implies that you have to map multiple times and run the pipelines on each of the subsets. '''This approach does not work in all pipelines''', so review your results carefully. GATK is designed with the human genome in mind, and therefore other organisms will require adjustments to many parameters and pipelines.
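
Selecting which scaffolds to keep can be sketched from a FASTA index (<code>.fai</code>), whose first two columns are the sequence name and its length. The helper name <code>long_scaffolds</code>, the 10000 bp threshold and the example index content below are illustrative assumptions; choose a cutoff that suits your assembly.

```shell
#!/bin/bash
# Hypothetical helper: list the sequences in a .fai index that are at
# least min_len bases long (column 1 = name, column 2 = length).
long_scaffolds() {
    local fai="$1" min_len="$2"
    awk -v min="$min_len" '$2 >= min { print $1 }' "$fai"
}

# Demonstration on a made-up two-entry index.
tmp_fai="$(mktemp)"
printf 'chr1\t50000\t6\t60\t61\nscaffold_9\t800\t50900\t60\t61\n' > "$tmp_fai"
long_scaffolds "$tmp_fai" 10000   # prints: chr1
rm -f "$tmp_fai"
```

The resulting list of names can then be used to extract those sequences into a reduced reference with the tool of your choice.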


===Using more resources than asked for ===
===Using more resources than asked for === <!--T:30-->
Sometimes GATK/Java applications use more memory or CPUs/threads than requested. This is often caused by Java garbage collection. To limit this, you can add <code>-XX:ConcGCThreads=1</code> to the <code>--java-options</code> argument.
Sometimes GATK/Java applications use more memory or CPUs/threads than requested. This is often caused by Java garbage collection. To limit this, you can add <code>-XX:ConcGCThreads=1</code> to the <code>--java-options</code> argument.
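
As a small sketch, the garbage-collection limit can be appended to whatever JVM options are already in use. The helper name <code>with_gc_limit</code> and the file names are hypothetical; the command is printed rather than executed.

```shell
#!/bin/bash
# Hypothetical helper: append the GC thread limit to a JVM option string
# unless a ConcGCThreads setting is already present.
with_gc_limit() {
    local opts="$1"
    case "$opts" in
        *-XX:ConcGCThreads=*) printf '%s\n' "$opts" ;;
        *) printf '%s -XX:ConcGCThreads=1\n' "$opts" ;;
    esac
}

JAVA_OPTS="$(with_gc_limit "-Xmx8G")"
# Placeholder file names; printed for inspection, not executed.
echo gatk --java-options "\"${JAVA_OPTS}\"" \
    MarkDuplicates -I input.bam -O marked.bam -M metrics.txt
```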


===FAQ on GATK ===
===FAQ on GATK === <!--T:31-->
You can find the GATK FAQ on their [https://gatk.broadinstitute.org/hc/en-us/sections/360007226791-Troubleshooting-GATK4-Issues website].
You can find the GATK FAQ on their [https://gatk.broadinstitute.org/hc/en-us/sections/360007226791-Troubleshooting-GATK4-Issues website].


=References =
=References = <!--T:32-->
[https://gatk.broadinstitute.org/hc/en-us GATK Home]
[https://gatk.broadinstitute.org/hc/en-us GATK Home]


<!--T:33-->
[https://gatk.broadinstitute.org/hc/en-us/articles/360035532012-Parallelism-Multithreading-Scatter-Gather GATK SPARK]
[https://gatk.broadinstitute.org/hc/en-us/articles/360035532012-Parallelism-Multithreading-Scatter-Gather GATK SPARK]


<!--T:34-->
[https://gatk.broadinstitute.org/hc/en-us/articles/360035889611-How-can-I-make-GATK-tools-run-faster- Make GATK run faster]
[https://gatk.broadinstitute.org/hc/en-us/articles/360035889611-How-can-I-make-GATK-tools-run-faster- Make GATK run faster]


<!--T:35-->
[[Category:Bioinformatics]]
[[Category:Bioinformatics]]
[[Category:Software]]
[[Category:Software]]


</translate>
</translate>