GATK: Prepare Reference Files

Synposis

This notebook willoutline the steps for preparing a reference genome for the GATK pipeline. Note that you will have to download a reference genome sequence file (fasta). You will also need to get reference file for SNPs and indels (in vcf format)

For the human genome, you can download the GATK bundle

https://software.broadinstitute.org/gatk/download/bundle

Otherwise, download a reference fasta file along with SNP/indel vcfs. Be sure that the SNP/indel files are compatible (with respect to contig and positions) with the reference fasta file.

Set program names

In [1]:
myjava="/usr/lib/jvm/oracle-java8-jdk-amd64/bin/java"
mybwa="/opt/NGS/bwa.kit/0.7.15/bwa.kit/bwa"
mypicard="/opt/NGS/picard-tools/2.8.3/picard.jar"
mygatk="/opt/NGS/GATK/3.7/GATK/GenomeAnalysisTK.jar"
mytabix="/opt/NGS/samtools/1.4.1/htslib-1.4/tabix"
mybgzip="/opt/NGS/samtools/1.4.1/htslib-1.4/bgzip"

Set reference and annotation files

In [2]:
myrefdir="/data1/workspace/tmp/HTS/Bundle/"
myfasta=$myrefdir"Homo_sapiens_assembly38.fasta"
mydict=$myrefdir"Homo_sapiens_assembly38.dict"
mysnp=$myrefdir"Homo_sapiens_assembly38.dbsnp138.vcf.gz"

Create bwa reference

In [3]:
time $mybwa index -a bwtsw $myfasta \
     >bwaindex.out 2>bwaindex.err

real    0m0.002s
user    0m0.000s
sys     0m0.001s

Create fasta index

In [4]:
time $mysamtools faidx $myfasta \
     >samtoolsindex.out 2>samtoolsindex.err

real    0m0.002s
user    0m0.000s
sys     0m0.001s

Create sequence dictionary

In [5]:
### Create dictionary
time $myjava -jar $mypicard  CreateSequenceDictionary \
     REFERENCE=$myfasta \
     OUTPUT=$mydict \
     >dict.out 2>dict.err

real    0m0.002s
user    0m0.000s
sys     0m0.001s

Zip vcf file and create an index

A vcf file for the GATK pipeline needs to be sorted and contain the reference dictionary. It also should be zipped and provided an index file. These step are only required if your reference vcf file has not been prepared (the vcf files from the GATK bundle are already prepared for the pipeline).

Note that the vcf needs to be compatible with the reference sequence file in terms of contigs and position.

In [6]:
vcforig="Homo_sapiens_assembly38.dbsnp138.vcf"
vcfgatk="Homo_sapiens_assembly38.dbsnp138-GATK.vcf"

Sort vcf and add dictionary

In [7]:
time $myjava -jar $mypicard SortVcf \
     I=$vcforig \
     O=$vcfgatk \
     SEQUENCE_DICTIONARY=$mydict \
     >  vcfsort.out 2> vcfsort.err

real    0m0.002s
user    0m0.000s
sys     0m0.001s

Zip vcf file

In [8]:
time $mybgzip -c  $vcfgatk \
     >  $vcfgatk.gz 2> bgzip.err

real    0m0.002s
user    0m0.000s
sys     0m0.001s

Create vcf index

In [9]:
time $mytabix -p vcf $vcfgatk.gz \
     >  tabix.out 2> tabix.err

real    0m0.002s
user    0m0.000s
sys     0m0.001s