Revision as of 18:30, 17 July 2023

The Genome Analysis Toolkit (GATK) is a set of bioinformatic tools for analyzing high-throughput sequencing (HTS) and variant call format (VCF) data. The toolkit is well established for germline short variant discovery from whole genome and exome sequencing data. It is a leading tool in variant discovery and best practices for genomics research.

Availability and loading module

On all Compute Canada clusters (Graham, Cedar, Beluga), we provide several versions of GATK. To list the available versions, use the module command:

[name@server ~]$ module spider gatk

which will give you some information about GATK and the versions:

gatk/3.7
gatk/3.8
gatk/4.0.0.0
gatk/4.0.8.1
gatk/4.0.12.0
gatk/4.1.0.0
gatk/4.1.2.0

More specific information about any given version can be accessed with:

[name@server ~]$ module spider gatk/4.1.2.0

As you can see, this module has only the nixpkgs/16.09 module as a prerequisite, so it can be loaded with:

[name@server ~]$ module load nixpkgs/16.09 gatk/4.1.2.0

Or, given that nixpkgs/16.09 is loaded by default, simply:

[name@server ~]$ module load gatk/4.1.2.0

General usage

Later versions of GATK (>=4.0.0.0) provide a wrapper over the Java executables (.jar). Loading a GATK module automatically sets most of the environment variables you need to run GATK successfully.

The module spider command also provides you with usage and examples of that wrapper:

      Usage
      =====
      gatk [--java-options "-Xmx4G"] ToolName [GATK args]
      
      
      Examples
      ========
      gatk --java-options "-Xmx8G" HaplotypeCaller -R reference.fasta -I input.bam -O output.vcf

As you have probably noticed, some arguments are passed directly to Java through --java-options, such as the maximum heap memory (-Xmx8G in the example, which lets the Java virtual machine use up to 8 GB of heap). We recommend that you always use -DGATK_STACKTRACE_ON_USER_EXCEPTION=true, since it gives you more information if the program fails. This information can help you, or us if you need support, to solve the issue. Note that all options passed to --java-options have to be within quotation marks.
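The heap size given to -Xmx has to fit inside the memory requested for the job. The following is a minimal sketch of one way to derive it, assuming the job was submitted with --mem so that Slurm exports SLURM_MEM_PER_NODE in megabytes. The helper name gatk_java_options, the 1024 MB headroom and the 8192 MB fallback are illustrative assumptions, not values recommended by GATK.

```shell
#!/bin/bash
# Hypothetical helper: build a --java-options string from a memory budget in MB.
# The 1024 MB of headroom left for non-heap JVM memory is an assumed value.
gatk_java_options() {
    mem_mb="$1"
    heap_mb=$((mem_mb - 1024))
    # Keep at least a 1 GB heap for very small requests
    if [ "${heap_mb}" -lt 1024 ]; then
        heap_mb=1024
    fi
    printf '%s' "-Xmx${heap_mb}m -DGATK_STACKTRACE_ON_USER_EXCEPTION=true"
}

# Inside a job submitted with --mem, SLURM_MEM_PER_NODE holds the request in MB.
# The 8192 MB fallback is only there so the sketch also runs outside a job.
JAVA_OPTS="$(gatk_java_options "${SLURM_MEM_PER_NODE:-8192}")"

# Print the resulting command; remove the echo and the escaped quotes to run it
echo "gatk --java-options \"${JAVA_OPTS}\" HaplotypeCaller -R reference.fasta -I input.bam -O output.vcf"
```

If the job was submitted with --mem-per-cpu instead, SLURM_MEM_PER_NODE may not be set, and the memory budget would have to be passed explicitly.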

Considerations in our systems

To use GATK on our systems, we recommend that you use the --tmp-dir option and set it to ${SLURM_TMPDIR} when running in an sbatch job, so that temporary files are written to local storage.
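As an illustration, a job script along the following lines redirects the temporary files. It is a sketch only: the resource requests, the file names and the /tmp fallback used outside of a job are assumptions, and the gatk module is assumed to have been loaded already.

```shell
#!/bin/bash
#SBATCH --cpus-per-task=1
#SBATCH --mem=10G
#SBATCH --time=03:00:00

# SLURM_TMPDIR is defined only inside a job; the /tmp fallback is an assumption
# so that the sketch can also be tried interactively
TMP_DIR="${SLURM_TMPDIR:-/tmp}"

# Assumes the module was loaded beforehand, e.g. module load gatk/4.1.2.0
if command -v gatk >/dev/null 2>&1; then
    gatk --java-options "-Xmx8G -DGATK_STACKTRACE_ON_USER_EXCEPTION=true" \
        HaplotypeCaller \
        -R reference.fasta \
        -I input.bam \
        -O output.vcf \
        --tmp-dir "${TMP_DIR}"
else
    echo "gatk is not on PATH; load a gatk module first" >&2
fi
```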

Also, when using GenomicsDBImport, make sure to enable the option --genomicsdb-shared-posixfs-optimizations, as it "Allow[s] for optimizations to improve the usability and performance for shared Posix Filesystems(e.g. NFS, Lustre)". If this is not possible, or if you are using GNU parallel to run multiple intervals at the same time, please copy your database to ${SLURM_TMPDIR} and run it from there, as your I/O operations might otherwise disrupt the shared filesystem. ${SLURM_TMPDIR} is local storage, so it is not only faster, but I/O operations on it do not affect other users.
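The two options described above could look roughly as follows. This is a sketch under several assumptions: the sample files, the interval chr1, the workspace name my_database and the helper stage_workspace are hypothetical, the loaded GATK release is assumed to provide --genomicsdb-shared-posixfs-optimizations, and GenotypeGVCFs is used only as an example of a tool that reads the database.

```shell
#!/bin/bash
# Hypothetical helper: copy a GenomicsDB workspace to node-local storage
# and print the path of the copy. Assumes the destination does not exist yet.
stage_workspace() {
    src="$1"
    dest="${SLURM_TMPDIR:-/tmp}/$(basename "${src}")"
    cp -r "${src}" "${dest}" && printf '%s\n' "${dest}"
}

if command -v gatk >/dev/null 2>&1; then
    # Option 1: build the workspace on the shared filesystem with the optimizations enabled
    gatk GenomicsDBImport \
        -V sample1.g.vcf.gz \
        -V sample2.g.vcf.gz \
        -L chr1 \
        --genomicsdb-workspace-path my_database \
        --genomicsdb-shared-posixfs-optimizations true \
        --tmp-dir "${SLURM_TMPDIR:-/tmp}"

    # Option 2: read an existing workspace from a node-local copy
    LOCAL_DB="$(stage_workspace my_database)"
    gatk GenotypeGVCFs \
        -R reference.fasta \
        -V "gendb://${LOCAL_DB}" \
        -O output.vcf.gz
else
    echo "gatk is not on PATH; load a gatk module first" >&2
fi
```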

Earlier versions than GATK 4

Earlier versions of GATK do not have the gatk command. Instead, one has to call the jar file:

java -jar GenomeAnalysisTK.jar PROGRAM OPTIONS

However, java does not search PATH for jar files, so the full path to GenomeAnalysisTK.jar must be given. On Compute Canada systems, the environment variables $EBROOTPICARD for Picard (included in GATK >= 4) and $EBROOTGATK for GATK contain the path to the directory holding the jar file, so the appropriate way to call GATK <= 3 is:

module load gatk/3.8
java -jar "${EBROOTGATK}"/GenomeAnalysisTK.jar PROGRAM OPTIONS

You can find the specific usage of GATK <= 3 in the GATK3 guide.

Multicore usage

Most GATK (>=4) tools are not multicore by default. This means that you should request only one core when calling these kinds of tools. Some tools use threads in some of their computations (e.g. Mutect2 has the --native-pair-hmm-threads option), so you can request more CPUs for them (most of them use up to 4 threads). GATK4 does, however, provide some Spark commands:

Not all GATK tools use Spark.
Tools that can use Spark generally have a note to that effect in their respective Tool Doc.
- Some GATK tools exist in distinct Spark-capable and non-Spark-capable versions. The "sparkified" versions have the suffix "Spark" at the end of their names. Many of these are still experimental; down the road we plan to consolidate them so that there will be only one version per tool.
- Some GATK tools only exist in a Spark-capable version. Those tools don't have the "Spark" suffix.

For the commands that do use Spark, you can request multiple CPUs. NOTE: Please provide the exact number of CPUs to the Spark command. For example, if you requested 10 CPUs, use --spark-master local[10] instead of --spark-master local[*]. If you want to use multiple nodes to scale the Spark cluster, you first have to deploy a Spark cluster and then set the appropriate variables in the GATK command.
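One way to keep the Spark core count in step with the job request is to read it from Slurm instead of typing it twice. The sketch below assumes the job was submitted with --cpus-per-task, so that SLURM_CPUS_PER_TASK is set. The resource values and file names are hypothetical, and MarkDuplicatesSpark is used only as an example of a Spark-capable tool.

```shell
#!/bin/bash
#SBATCH --cpus-per-task=10
#SBATCH --mem=32G
#SBATCH --time=06:00:00

# Slurm exports the number of cores granted per task; default to 1 outside of a job
NCPUS="${SLURM_CPUS_PER_TASK:-1}"
# Keep the value quoted so the shell does not treat the brackets as a glob pattern
SPARK_MASTER="local[${NCPUS}]"

if command -v gatk >/dev/null 2>&1; then
    gatk --java-options "-Xmx28G" MarkDuplicatesSpark \
        -I input.bam \
        -O marked_duplicates.bam \
        --spark-master "${SPARK_MASTER}"
else
    echo "gatk is not on PATH; would have used --spark-master ${SPARK_MASTER}" >&2
fi
```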

Running GATK via Apptainer

If you encounter errors like "IllegalArgumentException" while using the modules installed on our clusters, we recommend that you try running GATK through Apptainer instead.

A Docker image of GATK can be found here and other versions are available at this page. You will first need to build an Apptainer image from the Docker image.

For example, to get the latest version, you can run the following commands on the cluster:

module load apptainer
apptainer build gatk.sif docker://broadinstitute/gatk

or to get a particular version:

module load apptainer
apptainer build gatk_VERSION.sif docker://broadinstitute/gatk:VERSION

In your SBATCH script, you should use something like this:

module load apptainer
apptainer exec -B /home -B /project -B /scratch -B /localscratch \
    <path to the image>/gatk.sif gatk [--java-options "-Xmx4G"] ToolName [GATK args]

For more information about Apptainer, you can watch the recorded Apptainer webinar.

Frequently asked questions

How do I add a read group (RG) tag in my bam file?

Assuming that you want to add a read group called tag to the file called input.bam, you can use the GATK/PICARD command AddOrReplaceReadGroups:

gatk AddOrReplaceReadGroups \
    -I input.bam \
    -O output.bam \
    --RGLB tag \
    --RGPL ILLUMINA \
    --RGPU tag \
    --RGSM tag \
    --SORT_ORDER 'coordinate' \
    --CREATE_INDEX true

This assumes that your input file is sorted by coordinate, and it generates an index along with the annotated output (--CREATE_INDEX true).

How do I deal with `java.lang.OutOfMemoryError: Java heap space`

The subprograms of GATK often require more memory to process your files. If you were not using the -Xmx option, add it to the --java-options. For example, let's imagine that you run the following command:

gatk MarkDuplicates \
    -I input.bam \
    -O marked_duplicates.bam \
    -M marked_dup_metrics.txt

But it gives you the java.lang.OutOfMemoryError: Java heap space error. Try:

gatk --java-options "-Xmx8G -DGATK_STACKTRACE_ON_USER_EXCEPTION=true" \
    MarkDuplicates \
    -I input.bam \
    -O marked_duplicates.bam \
    -M marked_dup_metrics.txt

If it fails again, keep increasing the memory until you find the required memory for your particular data set. If you are using any of our systems, remember to request enough memory for this.

If you are interested in knowing more about java heap space, you can start here.

Increasing the heap memory does not fix the `java.lang.OutOfMemoryError: Java heap space`

There are cases in which the memory issue cannot be fixed by increasing the heap memory. This often happens with non-model organisms, when your reference contains too many scaffolds. In this case, it is recommended to remove small scaffolds and create subsets of your reference. This implies that you have to map multiple times and run the pipelines on each of the subsets. This approach does not work in all pipelines, so review your results carefully. GATK is designed with the human genome in mind, and therefore other organisms will require adjustments to many parameters and pipelines.

Using more resources than asked for

Sometimes GATK/Java applications use more memory or more CPUs/threads than were requested. This is often caused by Java garbage collection. To limit this, you can add -XX:ConcGCThreads=1 to the --java-options argument.

FAQ on GATK

You can find GATK's FAQ on their website.

References

GATK Home

GATK SPARK

Make GATK run faster

@@ Line 127: / Line 127: @@
 For the commands that do use Spark, you can request multiple cpus. '''NOTE:''' Please provide the exact number of cpus to the spark command.  For example if you requested 10 cpus, use <code>--spark-master local[10]</code> instead of <code>--spark-master local[*]</code>. If you want to use multiple nodes to scale the Spark cluster, you have to first [[Apache_Spark|deploy a SPARK cluster]] and then set the appropriate variables in the GATK command.
-==Running GATK via Singularity== <!--T:36-->
+==Running GATK via Apptainer== <!--T:36-->
 <!--T:37-->
-If you encounter errors like "[https://gatk.broadinstitute.org/hc/en-us/community/posts/360067054832-GATK-4-1-7-0-error-java-lang-IllegalArgumentException-malformed-input-off-17635906-length-1 IllegalArgumentException]" while using the installed modules on our clusters, we recommend you to try another workflow by using the program via [[Singularity]].
+If you encounter errors like "[https://gatk.broadinstitute.org/hc/en-us/community/posts/360067054832-GATK-4-1-7-0-error-java-lang-IllegalArgumentException-malformed-input-off-17635906-length-1 IllegalArgumentException]" while using the installed modules on our clusters, we recommend you to try another workflow by using the program via [[Apptainer]].
 <!--T:38-->
-A Docker image of GATK can be found [https://hub.docker.com/r/broadinstitute/gatk here] and other versions are available at this [https://hub.docker.com/r/broadinstitute/gatk/tags page]. You will need first to [[Singularity/en#Creating_an_image_using_Docker_Hub_and_Dockerfile|build a Singularity image from the Docker image]].
+A Docker image of GATK can be found [https://hub.docker.com/r/broadinstitute/gatk here] and other versions are available at this [https://hub.docker.com/r/broadinstitute/gatk/tags page]. You will need first to [[Apptainer/en#Creating_an_image_using_Docker_Hub_and_Dockerfile|build an Apptainer image from the Docker image]].
 <!--T:39-->
@@ Line 140: / Line 140: @@
 <!--T:40-->
 <pre>
-module load singularity
+module load apptainer
-singularity build gatk.sif docker://broadinstitute/gatk
+apptainer build gatk.sif docker://broadinstitute/gatk
 </pre>
@@ Line 149: / Line 149: @@
 <!--T:42-->
 <pre>
-module load singularity
+module load apptainer
-singularity build gatk_VERSION.sif docker://broadinstitute/gatk:VERSION
+apptainer build gatk_VERSION.sif docker://broadinstitute/gatk:VERSION
 </pre>
@@ Line 158: / Line 158: @@
 <!--T:44-->
 <pre>
-module load singularity
+module load apptainer
-singularity exec -B /home -B /project -B /scratch -B /localscratch \
+apptainer exec -B /home -B /project -B /scratch -B /localscratch \
      <path to the image>/gatk.sif gatk [--java-options "-Xmx4G"] ToolName [GATK args]</pre>
 <!--T:45-->
-For more information about Singularity, you can watch the recorded [https://www.youtube.com/watch?v=kYb0aXS5DEE Singularity webinar].
+For more information about Apptainer, you can watch the recorded [https://www.youtube.com/watch?v=bpmrfVqBowY Apptainer webinar].
 ==Frequently asked questions == <!--T:23-->

GATK: Difference between revisions
