GATK: Prepare Reference Files¶

Synposis¶

This notebook willoutline the steps for preparing a reference genome for the GATK pipeline. Note that you will have to download a reference genome sequence file (fasta). You will also need to get reference file for SNPs and indels (in vcf format)

For the human genome, you can download the GATK bundle

https://software.broadinstitute.org/gatk/download/bundle

Otherwise, download a reference fasta file along with SNP/indel vcfs. Be sure that the SNP/indel files are compatible (with respect to contig and positions) with the reference fasta file.

Set program names¶

In [1]:

myjava="/usr/lib/jvm/oracle-java8-jdk-amd64/bin/java"
mybwa="/opt/NGS/bwa.kit/0.7.15/bwa.kit/bwa"
mypicard="/opt/NGS/picard-tools/2.8.3/picard.jar"
mygatk="/opt/NGS/GATK/3.7/GATK/GenomeAnalysisTK.jar"
mytabix="/opt/NGS/samtools/1.4.1/htslib-1.4/tabix"
mybgzip="/opt/NGS/samtools/1.4.1/htslib-1.4/bgzip"

Set reference and annotation files¶

In [2]:

myrefdir="/data1/workspace/tmp/HTS/Bundle/"
myfasta=$myrefdir"Homo_sapiens_assembly38.fasta"
mydict=$myrefdir"Homo_sapiens_assembly38.dict"
mysnp=$myrefdir"Homo_sapiens_assembly38.dbsnp138.vcf.gz"

Create bwa reference¶

In [3]:

time $mybwa index -a bwtsw $myfasta \
     >bwaindex.out 2>bwaindex.err


real    0m0.002s
user    0m0.000s
sys     0m0.001s

Create fasta index¶

In [4]:

time $mysamtools faidx $myfasta \
     >samtoolsindex.out 2>samtoolsindex.err


real    0m0.002s
user    0m0.000s
sys     0m0.001s

Create sequence dictionary¶

In [5]:

### Create dictionary
time $myjava -jar $mypicard  CreateSequenceDictionary \
     REFERENCE=$myfasta \
     OUTPUT=$mydict \
     >dict.out 2>dict.err


real    0m0.002s
user    0m0.000s
sys     0m0.001s

Zip vcf file and create an index¶

A vcf file for the GATK pipeline needs to be sorted and contain the reference dictionary. It also should be zipped and provided an index file. These step are only required if your reference vcf file has not been prepared (the vcf files from the GATK bundle are already prepared for the pipeline).

Note that the vcf needs to be compatible with the reference sequence file in terms of contigs and position.

In [6]:

vcforig="Homo_sapiens_assembly38.dbsnp138.vcf"
vcfgatk="Homo_sapiens_assembly38.dbsnp138-GATK.vcf"

Sort vcf and add dictionary¶

In [7]:

time $myjava -jar $mypicard SortVcf \
     I=$vcforig \
     O=$vcfgatk \
     SEQUENCE_DICTIONARY=$mydict \
     >  vcfsort.out 2> vcfsort.err


real    0m0.002s
user    0m0.000s
sys     0m0.001s

Zip vcf file¶

In [8]:

time $mybgzip -c  $vcfgatk \
     >  $vcfgatk.gz 2> bgzip.err


real    0m0.002s
user    0m0.000s
sys     0m0.001s

Create vcf index¶

In [9]:

time $mytabix -p vcf $vcfgatk.gz \
     >  tabix.out 2> tabix.err


real    0m0.002s
user    0m0.000s
sys     0m0.001s