GATK Pipeline for calling variants from one sample

Synopsis: We will outline the GATK pipeline to pre-process a single sample starting from a paired of unaligned paired-ends reads (R1,R2) to variant calls in a vcf file.

For demonstration, we will download reads for a CEPH sample (SRR062634)

This tutorial is based on GATK version 3.7. The next version of GATK (4.0; currently in beta) will not only introduce a host of new features but also be open source.

At this stage, it is assumed that the reference genome (genome.fasta) has been processed by bwa. It is also assumed that the genome fasta has been indexed (genome.fai) and that a dictionary file (genome.dict) has been created. Finally, at least one snp and and one indel reference vcf, along with indices, must be available.

Set program variables.

In [4]:
mysamtools="/opt/NGS/samtools/1.4.1/samtools-1.4/samtools"
myjava="/usr/lib/jvm/oracle-java8-jdk-amd64/bin/java"
mybwa="/opt/NGS/bwa.kit/0.7.15/bwa.kit/bwa"
mypicard="/opt/NGS/picard-tools/2.8.3/picard.jar"
mygatk="/opt/NGS/GATK/3.7/GATK/GenomeAnalysisTK.jar"
mytabix="/opt/NGS/samtools/1.4.1/htslib-1.4/tabix"
mybgzip="/opt/NGS/samtools/1.4.1/htslib-1.4/bgzip"
myfastqdump="/opt/NGS/sratoolkit/sratoolkit.2.8.2-1-ubuntu64/bin/fastq-dump"

Check program versions and availabilty

Check samtools installation

In [5]:
$mysamtools --version
bash: /opt/NGS/samtools/1.4.1/samtools-1.4/samtools: No such file or directory

Check tabix installation

In [ ]:
$mytabix --version

Check bgzip installation

In [ ]:
$mybgzip --version

Check java installation

In [ ]:
$myjava -version

Check md5sum of picard jar

In [ ]:
md5sum $mypicard

Check md5sum of GATK jar

In [ ]:
md5sum $mygatk

Set reference and annotation file names.

In [ ]:
myrefdir="/data1/workspace/tmp/HTS/Bundle/"
myfasta=$myrefdir"Homo_sapiens_assembly38.fasta"
mydict=$myrefdir"Homo_sapiens_assembly38.dict"
mysnp=$myrefdir"Homo_sapiens_assembly38.dbsnp138.vcf.gz"
myindel=$myrefdir"Homo_sapiens_assembly38.known_indels.vcf.gz"

Check md5sum of reference fasta

In [ ]:
md5sum $myfasta

Check md5sum of fasta dictionary

In [ ]:
md5sum $mydict

Check md5sum of snp vcf file

In [ ]:
md5sum $mysnp

Check md5sum of indel vcf file

In [ ]:
md5sum $myindel

Set location for java temp directory.

Be sure to have plenty of scratch space under this directory and that you have write access.

In [ ]:
mytmpjava="/data1/workspace/tmp/GATK/"

Check write access

In [ ]:
[ -w $mytmpjava ] && echo "writeable" || echo "write permission denied"

Get sample data

Here we get the first 200 spots for SRR062634 from the 1000G project

In [ ]:
### ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/phase3/data/HG00096/sequence_read/
$myfastqdump -X 200  --split-3 SRR062634

Set location for data input and output files for the project

Here we assume that all of the raw reads have been assembled under a root directory

In [ ]:
fastqroot="/data1/workspace/tmp/HTS/Data/Reads/"

The output file will be assembled under a directory

In [ ]:
gatkroot="/data1/workspace/tmp/HTS/GATK/"
In [ ]:
samplelabel="SRR062634"
In [ ]:
mkdir $gatkroot$samplelabel
In [ ]:
### ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/phase3/data/HG00096/sequence_read/
$myfastqdump -X 200  --split-3 $samplelabel
In [ ]:
myR1=$fastqroot$samplelabel"/"$samplelabel"_1.fastq"
myR2=$fastqroot$samplelabel"/"$samplelabel"_2.fastq"
mysamplebase=$gatkroot$samplelabel"/"$samplelabel

Set read groups

In [ ]:
### ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/phase3/data/HG00096/sequence_read/
$myfastqdump -X 200  --split-3 SRR062634
myrgID=SRR062634-LANE001 ### This needs to be a UNIQUE label!!
myrgSM=SRR062634 ## Sample. This will be used to name the output vcf
myrgLB=SRR062634 ## Library
myrgPL=illumina ## Platform/technology
myrgPU=SRR062634

Check to make sure that the two fastq files exist

In [ ]:
md5sum $myR1
md5sum $myR2

Put the sample through the GATK pipeline (R1,R2) to recalibrated GATK bam file

Alignment to reference sequence

Align (R1,R2) to reference sequence -> sam file

In [ ]:
time $mybwa mem \
     -aM \
     -t 8 \
     $myfasta \
     $myR1 $myR2 \
     > $mysamplebase.sam 2> $mysamplebase-bwasam.err

Add read groups, sort the reads by coordinate, save as a bam file and index the bam file

https://gatkforums.broadinstitute.org/gatk/discussion/6472/read-groups

In [ ]:
time $myjava -Djava.io.tmpdir=$mytmpjava -jar \
     $mypicard AddOrReplaceReadGroups \
     I=$mysamplebase.sam \
     O=$mysamplebase-sort-rg.bam \
     SORT_ORDER=coordinate \
     CREATE_INDEX=true \
     RGID=$myrgID RGSM=$myrgSM RGLB=$myrgLB RGPL=$myrgPL RGPU=$myrgPU\
     >  $mysamplebase-rgsort.out 2> $mysamplebase-rgsort.err

Mark Duplicate reads

https://gatkforums.broadinstitute.org/gatk/discussion/2799/howto-map-and-mark-duplicates

time \(myjava -Djava.io.tmpdir=\)mytmpjava -jar
\(mypicard MarkDuplicates \ I=\)mysamplebase-sort-rg.bam
O=\(mysamplebase-sort-rg-dedup.bam \ M=\)mysamplebase-dedup-metric.txt
ASSUME_SORT_ORDER=coordinate
CREATE_INDEX=true
> \(mysamplebase-dedup.out 2>\)mysamplebase-dedup.err

Indel realignment (create targets)

“In general, a large percent of regions requiring local realignment are due to the presence of an insertion or deletion (indels) in the individual’s genome with respect to the reference genome. Such alignment artifacts result in many bases mismatching the reference near the misalignment, which are easily mistaken as SNPs.”

https://software.broadinstitute.org/gatk/documentation/tooldocs/current/org_broadinstitute_gatk_tools_walkers_indels_RealignerTargetCreator.php

Note: According to GATK local indel realignment is not needed foir variant discovery when using the HaplotypeCaller or MuTect2

In [ ]:

time $myjava -Djava.io.tmpdir=$mytmpjava -jar \
     $mygatk -T RealignerTargetCreator \
     -R $myfasta \
     -I $mysamplebase"-sort-rg-dedup.bam" \
     -known $myindel \
     -o $mysamplebase"-realign-targets.intervals" \
     > $mysamplebase-realigntarget.out 2> $mysamplebase-realigntarget.err

Indel realignment (Perform local realignment )

https://software.broadinstitute.org/gatk/documentation/tooldocs/current/org_broadinstitute_gatk_tools_walkers_indels_IndelRealigner.php

In [ ]:

time $myjava -Djava.io.tmpdir=$mytmpjava -jar \
     $mygatk -T IndelRealigner \
     -R $myfasta \
     -known $myindel \
     -targetIntervals $mysamplebase"-realign-targets.intervals"  \
     -I $mysamplebase-sort-rg-dedup.bam \
     -o $mysamplebase-sort-rg-dedup-realigned.bam \
     --filter_bases_not_stored \
     >$mysamplebase-realignbam.out 2>$mysamplebase-realignbam.err

Create bam file index for the realigned bam

In [ ]:
time $myjava -Djava.io.tmpdir=$mytmpjava -jar \
     $mypicard BuildBamIndex \
     I=$mysamplebase"-sort-rg-dedup-realigned.bam" \
     >$mysamplebase-realignbamindex.out 2>$mysamplebase-realignbamindex.err


Base Quality Score Recalibration (Create table)

This steps detects systematic errors in base quality scores

https://software.broadinstitute.org/gatk/gatkdocs/current/org_broadinstitute_gatk_tools_walkers_bqsr_BaseRecalibrator.php

In [ ]:

time $myjava -Djava.io.tmpdir=$mytmpjava -jar \
     $mygatk -T BaseRecalibrator \
     -R $myfasta \
     -knownSites $mysnp \
     -knownSites $myindel \
     -I $mysamplebase"-sort-rg-dedup-realigned.bam" \
     -o $mysamplebase"-recal-table.txt" \
     >$mysamplebase-recalibrate.out 2>$mysamplebase-recalibrate.err

Base Quality Score Recalibration (apply)

Create recalibrated bam file. This file will be used for all subsequent analyses

https://software.broadinstitute.org/gatk/documentation/tooldocs/current/org_broadinstitute_gatk_tools_walkers_readutils_PrintReads.php

In [ ]:
time $myjava -Djava.io.tmpdir=$mytmpjava -jar \
     $mygatk -T PrintReads \
     -R $myfasta \
     -I $mysamplebase"-sort-rg-dedup-realigned.bam" \
     -BQSR $mysamplebase"-recal-table.txt" \
     -o  $mysamplebase"-GATK.bam" \
     > $mysamplebase-recalbam.out 2> $mysamplebase-recalbam.err

Create bam file index for the GATK bam file

In [ ]:
time $myjava -Djava.io.tmpdir=$mytmpjava -jar \
     $mypicard BuildBamIndex \
     I= $mysamplebase"-GATK.bam" \
     >$mysamplebase-GATKbamindex.out 2>$mysamplebase-GATKbamindex.err

Call Variants

Call variants using the UnifiedGenotyper walker

https://software.broadinstitute.org/gatk/documentation/tooldocs/current/org_broadinstitute_gatk_tools_walkers_genotyper_UnifiedGenotyper.php

In [ ]:
time $myjava -Djava.io.tmpdir=$mytmpjava -jar \
     $mygatk -T UnifiedGenotyper \
     -R $myfasta \
     -I  $mysamplebase"-GATK.bam" \
     --dbsnp  $mysnp \
     -o $mysamplebase"-GATK-UG.vcf" \
     --output_mode EMIT_ALL_SITES

Call variants using Haplotype Caller

https://software.broadinstitute.org/gatk/documentation/tooldocs/current/org_broadinstitute_gatk_tools_walkers_haplotypecaller_HaplotypeCaller.php

In [ ]:
time $myjava -Djava.io.tmpdir=$mytmpjava -jar \
     $mygatk -T HaplotypeCaller \
     -R $myfasta \
     -I  $mysamplebase"-GATK.bam" \
     --dbsnp  $mysnp \
     -o $mysamplebase"-GATK-HC.vcf"
In [ ]:
To use a capture file, use the -L switch.
In [ ]:
time $myjava -Djava.io.tmpdir=$mytmpjava -jar \
     $mygatk -T HaplotypeCaller \
     -R $myfasta \
     -I  $mysamplebase"-GATK.bam" \
     --dbsnp  $mysnp \
     -o $mysamplebase"-GATK-HC.vcf" \
     -L capture.bed

Optimizing and tuning the pipeline

Use the capture file

For targeted sequencing (e.g., exome sequencing) using the capture file (a file indicating which regions were sequenced) may only improve the quality of the analysis but also speed up the process (by avoiding processing off target regions).

A capture file is usually distributed with the “*.bed” extension and is a tab delimited file of the form chr1 100 110 chr1 100102 100503 chr2 50 112

In this dummy example, chr1 was “captured” in the intervals 100-101 and 100102-100503, while chr2 captured in the interval 50-112

Note that the contigs (chr1 and chr2 in this case) have to match the contigs in your *.dict file

For the GATK pipeline, one can introduce the capture file when creating the recalibration table using ‘-L mycapture.bed’

https://gatkforums.broadinstitute.org/gatk/discussion/4133/when-should-i-use-l-to-pass-in-a-list-of-intervals

In [ ]:
time $myjava -Djava.io.tmpdir=$mytmpjava -jar \
     $mygatk -T BaseRecalibrator \
     -R $myfasta \
     -knownSites $mysnp \
     -knownSites $myindel \
     -L mycapture.bed \
     -I $mysamplebase"-sort-rg-dedup-realigned.bam" \
     -o $mysamplebase"-recal-table.txt" \
     >$mysamplebase-recalibrate.out 2>$mysamplebase-recalibrate.err

One can also introduce the captue file at the time of variant calling. Specifically, one can

Manage the java heap memory

Both picard-tools and GATK are java programs. For large projects, they may consume large chunks of memory. You can control this by capping the maximum the heap memory can reach using the -Xmx{n} switch. For example, to cap at 8Gb add -Xmx8Gb as follows

In [ ]:
time $myjava -Xmx8g -Djava.io.tmpdir=$mytmpjava -jar \
     $mygatk -T HaplotypeCaller \
     -R $myfasta \
     -I  $mysamplebase"-GATK.bam" \
     --dbsnp  $mysnp \
     -o $mysamplebase"-GATK-HC.vcf"

Parallelization

Several of GATK walkers can take advantage of parallelism. See these pages for some details https://gatkforums.broadinstitute.org/gatk/discussion/1988/a-primer-on-parallelism-with-the-gatk

https://gatkforums.broadinstitute.org/gatk/discussion/1975/how-can-i-use-parallelism-to-make-gatk-tools-run-faster

The HaplotyeCaller walker can take advantage of parallelism by specifying the number of CPU threads per data thread (using the -nct switch)

In [ ]:
time $myjava -Xmx32g -Djava.io.tmpdir=$mytmpjava -jar \
     $mygatk -T HaplotypeCaller \
     -R $myfasta \
     -I  $mysamplebase"-GATK.bam" \
     --dbsnp  $mysnp \
     -o $mysamplebase"-GATK-HC.vcf" \
     -L capture.bed
     -nct 4

Somatic Mutation calling

GATK also enables somatic mutation calling based on match tumor/normal samples.

https://software.broadinstitute.org/gatk/documentation/tooldocs/current/org_broadinstitute_gatk_tools_walkers_cancer_m2_MuTect2.php

For this example, suppose that mytumor-GATK.bam and mynormal-GATK.bam are the bam files you have obtained by putting the the reads from the tumor and normal samples each through the GATK pipeline

In [ ]:
time $myjava -Xmx32g -Djava.io.tmpdir=$mytmpjava -jar \
     $mygatk -T MuTect2 \
     -R $myfasta \
     -I:tumor  mytumor-GATK.bam \
     -I:normal mynormal-GATK.bam \
     -o myMuTect2.vcf

You should consider providing reference information for reported germline variants and somatic mutations. For the former, you can use the dbSNP vcf from the GATK pipeline and the COSMIC vcf from http://cancer.sanger.ac.uk/cosmic

In [ ]:
time $myjava -Xmx32g -Djava.io.tmpdir=$mytmpjava -jar \
     $mygatk -T MuTect2 \
     -R $myfasta \
     -I:tumor  mytumor-GATK.bam \
     -I:normal mynormal-GATK.bam \
     --dbsnp $mysnp \
     --cosmic cosmic.vcf \
     -o myMuTect2.vcf

Finally, you can specify targets for mutation calling

In [ ]:
time $myjava -Xmx32g -Djava.io.tmpdir=$mytmpjava -jar \
     $mygatk -T MuTect2 \
     -R $myfasta \
     -I:tumor  mytumor-GATK.bam \
     -I:normal mynormal-GATK.bam \
     --dbsnp $mysnp \
     --cosmic cosmic.vcf \
     -L mytargets.bed
     -o myMuTect2.vcf