GATK Pipeline for calling variants from one sample¶
Synopsis: We will outline the GATK pipeline to pre-process a single sample starting from a paired of unaligned paired-ends reads (R1,R2) to variant calls in a vcf file.
For demonstration, we will download reads for a CEPH sample (SRR062634)
This tutorial is based on GATK version 3.7. The next version of GATK (4.0; currently in beta) will not only introduce a host of new features but also be open source.
At this stage, it is assumed that the reference genome (genome.fasta) has been processed by bwa. It is also assumed that the genome fasta has been indexed (genome.fai) and that a dictionary file (genome.dict) has been created. Finally, at least one snp and and one indel reference vcf, along with indices, must be available.
Set program variables.¶
In [4]:
mysamtools="/opt/NGS/samtools/1.4.1/samtools-1.4/samtools"
myjava="/usr/lib/jvm/oracle-java8-jdk-amd64/bin/java"
mybwa="/opt/NGS/bwa.kit/0.7.15/bwa.kit/bwa"
mypicard="/opt/NGS/picard-tools/2.8.3/picard.jar"
mygatk="/opt/NGS/GATK/3.7/GATK/GenomeAnalysisTK.jar"
mytabix="/opt/NGS/samtools/1.4.1/htslib-1.4/tabix"
mybgzip="/opt/NGS/samtools/1.4.1/htslib-1.4/bgzip"
myfastqdump="/opt/NGS/sratoolkit/sratoolkit.2.8.2-1-ubuntu64/bin/fastq-dump"
Check program versions and availabilty¶
Check samtools installation
In [5]:
$mysamtools --version
bash: /opt/NGS/samtools/1.4.1/samtools-1.4/samtools: No such file or directory
Check tabix installation
In [ ]:
$mytabix --version
Check bgzip installation
In [ ]:
$mybgzip --version
Check java installation
In [ ]:
$myjava -version
Check md5sum of picard jar
In [ ]:
md5sum $mypicard
Check md5sum of GATK jar
In [ ]:
md5sum $mygatk
Set reference and annotation file names.¶
In [ ]:
myrefdir="/data1/workspace/tmp/HTS/Bundle/"
myfasta=$myrefdir"Homo_sapiens_assembly38.fasta"
mydict=$myrefdir"Homo_sapiens_assembly38.dict"
mysnp=$myrefdir"Homo_sapiens_assembly38.dbsnp138.vcf.gz"
myindel=$myrefdir"Homo_sapiens_assembly38.known_indels.vcf.gz"
Check md5sum of reference fasta
In [ ]:
md5sum $myfasta
Check md5sum of fasta dictionary
In [ ]:
md5sum $mydict
Check md5sum of snp vcf file
In [ ]:
md5sum $mysnp
Check md5sum of indel vcf file
In [ ]:
md5sum $myindel
Set location for java temp directory.¶
Be sure to have plenty of scratch space under this directory and that you have write access.
In [ ]:
mytmpjava="/data1/workspace/tmp/GATK/"
Check write access
In [ ]:
[ -w $mytmpjava ] && echo "writeable" || echo "write permission denied"
Get sample data¶
Here we get the first 200 spots for SRR062634 from the 1000G project
In [ ]:
### ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/phase3/data/HG00096/sequence_read/
$myfastqdump -X 200 --split-3 SRR062634
Set location for data input and output files for the project¶
Here we assume that all of the raw reads have been assembled under a root directory
In [ ]:
fastqroot="/data1/workspace/tmp/HTS/Data/Reads/"
The output file will be assembled under a directory
In [ ]:
gatkroot="/data1/workspace/tmp/HTS/GATK/"
In [ ]:
samplelabel="SRR062634"
In [ ]:
mkdir $gatkroot$samplelabel
In [ ]:
### ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/phase3/data/HG00096/sequence_read/
$myfastqdump -X 200 --split-3 $samplelabel
In [ ]:
myR1=$fastqroot$samplelabel"/"$samplelabel"_1.fastq"
myR2=$fastqroot$samplelabel"/"$samplelabel"_2.fastq"
mysamplebase=$gatkroot$samplelabel"/"$samplelabel
Set read groups
In [ ]:
### ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/phase3/data/HG00096/sequence_read/
$myfastqdump -X 200 --split-3 SRR062634
myrgID=SRR062634-LANE001 ### This needs to be a UNIQUE label!!
myrgSM=SRR062634 ## Sample. This will be used to name the output vcf
myrgLB=SRR062634 ## Library
myrgPL=illumina ## Platform/technology
myrgPU=SRR062634
Check to make sure that the two fastq files exist
In [ ]:
md5sum $myR1
md5sum $myR2
Put the sample through the GATK pipeline (R1,R2) to recalibrated GATK bam file¶
Alignment to reference sequence¶
Align (R1,R2) to reference sequence -> sam file
In [ ]:
time $mybwa mem \
-aM \
-t 8 \
$myfasta \
$myR1 $myR2 \
> $mysamplebase.sam 2> $mysamplebase-bwasam.err
Add read groups, sort the reads by coordinate, save as a bam file and index the bam file¶
https://gatkforums.broadinstitute.org/gatk/discussion/6472/read-groups
In [ ]:
time $myjava -Djava.io.tmpdir=$mytmpjava -jar \
$mypicard AddOrReplaceReadGroups \
I=$mysamplebase.sam \
O=$mysamplebase-sort-rg.bam \
SORT_ORDER=coordinate \
CREATE_INDEX=true \
RGID=$myrgID RGSM=$myrgSM RGLB=$myrgLB RGPL=$myrgPL RGPU=$myrgPU\
> $mysamplebase-rgsort.out 2> $mysamplebase-rgsort.err
Mark Duplicate reads¶
https://gatkforums.broadinstitute.org/gatk/discussion/2799/howto-map-and-mark-duplicates
Indel realignment (create targets)¶
“In general, a large percent of regions requiring local realignment are due to the presence of an insertion or deletion (indels) in the individual’s genome with respect to the reference genome. Such alignment artifacts result in many bases mismatching the reference near the misalignment, which are easily mistaken as SNPs.”
Note: According to GATK local indel realignment is not needed foir variant discovery when using the HaplotypeCaller or MuTect2
In [ ]:
time $myjava -Djava.io.tmpdir=$mytmpjava -jar \
$mygatk -T RealignerTargetCreator \
-R $myfasta \
-I $mysamplebase"-sort-rg-dedup.bam" \
-known $myindel \
-o $mysamplebase"-realign-targets.intervals" \
> $mysamplebase-realigntarget.out 2> $mysamplebase-realigntarget.err
Indel realignment (Perform local realignment )¶
In [ ]:
time $myjava -Djava.io.tmpdir=$mytmpjava -jar \
$mygatk -T IndelRealigner \
-R $myfasta \
-known $myindel \
-targetIntervals $mysamplebase"-realign-targets.intervals" \
-I $mysamplebase-sort-rg-dedup.bam \
-o $mysamplebase-sort-rg-dedup-realigned.bam \
--filter_bases_not_stored \
>$mysamplebase-realignbam.out 2>$mysamplebase-realignbam.err
Create bam file index for the realigned bam
In [ ]:
time $myjava -Djava.io.tmpdir=$mytmpjava -jar \
$mypicard BuildBamIndex \
I=$mysamplebase"-sort-rg-dedup-realigned.bam" \
>$mysamplebase-realignbamindex.out 2>$mysamplebase-realignbamindex.err
Base Quality Score Recalibration (Create table)¶
This steps detects systematic errors in base quality scores
In [ ]:
time $myjava -Djava.io.tmpdir=$mytmpjava -jar \
$mygatk -T BaseRecalibrator \
-R $myfasta \
-knownSites $mysnp \
-knownSites $myindel \
-I $mysamplebase"-sort-rg-dedup-realigned.bam" \
-o $mysamplebase"-recal-table.txt" \
>$mysamplebase-recalibrate.out 2>$mysamplebase-recalibrate.err
Base Quality Score Recalibration (apply)¶
Create recalibrated bam file. This file will be used for all subsequent analyses
In [ ]:
time $myjava -Djava.io.tmpdir=$mytmpjava -jar \
$mygatk -T PrintReads \
-R $myfasta \
-I $mysamplebase"-sort-rg-dedup-realigned.bam" \
-BQSR $mysamplebase"-recal-table.txt" \
-o $mysamplebase"-GATK.bam" \
> $mysamplebase-recalbam.out 2> $mysamplebase-recalbam.err
Create bam file index for the GATK bam file
In [ ]:
time $myjava -Djava.io.tmpdir=$mytmpjava -jar \
$mypicard BuildBamIndex \
I= $mysamplebase"-GATK.bam" \
>$mysamplebase-GATKbamindex.out 2>$mysamplebase-GATKbamindex.err
Call Variants¶
Call variants using the UnifiedGenotyper walker¶
In [ ]:
time $myjava -Djava.io.tmpdir=$mytmpjava -jar \
$mygatk -T UnifiedGenotyper \
-R $myfasta \
-I $mysamplebase"-GATK.bam" \
--dbsnp $mysnp \
-o $mysamplebase"-GATK-UG.vcf" \
--output_mode EMIT_ALL_SITES
Call variants using Haplotype Caller¶
In [ ]:
time $myjava -Djava.io.tmpdir=$mytmpjava -jar \
$mygatk -T HaplotypeCaller \
-R $myfasta \
-I $mysamplebase"-GATK.bam" \
--dbsnp $mysnp \
-o $mysamplebase"-GATK-HC.vcf"
In [ ]:
To use a capture file, use the -L switch.
In [ ]:
time $myjava -Djava.io.tmpdir=$mytmpjava -jar \
$mygatk -T HaplotypeCaller \
-R $myfasta \
-I $mysamplebase"-GATK.bam" \
--dbsnp $mysnp \
-o $mysamplebase"-GATK-HC.vcf" \
-L capture.bed
Optimizing and tuning the pipeline¶
Use the capture file¶
For targeted sequencing (e.g., exome sequencing) using the capture file (a file indicating which regions were sequenced) may only improve the quality of the analysis but also speed up the process (by avoiding processing off target regions).
A capture file is usually distributed with the “*.bed” extension and is a tab delimited file of the form chr1 100 110 chr1 100102 100503 chr2 50 112
In this dummy example, chr1 was “captured” in the intervals 100-101 and 100102-100503, while chr2 captured in the interval 50-112
Note that the contigs (chr1 and chr2 in this case) have to match the contigs in your *.dict file
For the GATK pipeline, one can introduce the capture file when creating the recalibration table using ‘-L mycapture.bed’
In [ ]:
time $myjava -Djava.io.tmpdir=$mytmpjava -jar \
$mygatk -T BaseRecalibrator \
-R $myfasta \
-knownSites $mysnp \
-knownSites $myindel \
-L mycapture.bed \
-I $mysamplebase"-sort-rg-dedup-realigned.bam" \
-o $mysamplebase"-recal-table.txt" \
>$mysamplebase-recalibrate.out 2>$mysamplebase-recalibrate.err
One can also introduce the captue file at the time of variant calling. Specifically, one can
Manage the java heap memory¶
Both picard-tools and GATK are java programs. For large projects, they may consume large chunks of memory. You can control this by capping the maximum the heap memory can reach using the -Xmx{n} switch. For example, to cap at 8Gb add -Xmx8Gb as follows
In [ ]:
time $myjava -Xmx8g -Djava.io.tmpdir=$mytmpjava -jar \
$mygatk -T HaplotypeCaller \
-R $myfasta \
-I $mysamplebase"-GATK.bam" \
--dbsnp $mysnp \
-o $mysamplebase"-GATK-HC.vcf"
Parallelization¶
Several of GATK walkers can take advantage of parallelism. See these pages for some details https://gatkforums.broadinstitute.org/gatk/discussion/1988/a-primer-on-parallelism-with-the-gatk
The HaplotyeCaller walker can take advantage of parallelism by specifying the number of CPU threads per data thread (using the -nct switch)
In [ ]:
time $myjava -Xmx32g -Djava.io.tmpdir=$mytmpjava -jar \
$mygatk -T HaplotypeCaller \
-R $myfasta \
-I $mysamplebase"-GATK.bam" \
--dbsnp $mysnp \
-o $mysamplebase"-GATK-HC.vcf" \
-L capture.bed
-nct 4
Somatic Mutation calling¶
GATK also enables somatic mutation calling based on match tumor/normal samples.
For this example, suppose that mytumor-GATK.bam and mynormal-GATK.bam are the bam files you have obtained by putting the the reads from the tumor and normal samples each through the GATK pipeline
In [ ]:
time $myjava -Xmx32g -Djava.io.tmpdir=$mytmpjava -jar \
$mygatk -T MuTect2 \
-R $myfasta \
-I:tumor mytumor-GATK.bam \
-I:normal mynormal-GATK.bam \
-o myMuTect2.vcf
You should consider providing reference information for reported germline variants and somatic mutations. For the former, you can use the dbSNP vcf from the GATK pipeline and the COSMIC vcf from http://cancer.sanger.ac.uk/cosmic
In [ ]:
time $myjava -Xmx32g -Djava.io.tmpdir=$mytmpjava -jar \
$mygatk -T MuTect2 \
-R $myfasta \
-I:tumor mytumor-GATK.bam \
-I:normal mynormal-GATK.bam \
--dbsnp $mysnp \
--cosmic cosmic.vcf \
-o myMuTect2.vcf
Finally, you can specify targets for mutation calling
In [ ]:
time $myjava -Xmx32g -Djava.io.tmpdir=$mytmpjava -jar \
$mygatk -T MuTect2 \
-R $myfasta \
-I:tumor mytumor-GATK.bam \
-I:normal mynormal-GATK.bam \
--dbsnp $mysnp \
--cosmic cosmic.vcf \
-L mytargets.bed
-o myMuTect2.vcf