GATK: Prepare Reference Files¶
Synposis¶
This notebook willoutline the steps for preparing a reference genome for the GATK pipeline. Note that you will have to download a reference genome sequence file (fasta). You will also need to get reference file for SNPs and indels (in vcf format)
For the human genome, you can download the GATK bundle
https://software.broadinstitute.org/gatk/download/bundle
Otherwise, download a reference fasta file along with SNP/indel vcfs. Be sure that the SNP/indel files are compatible (with respect to contig and positions) with the reference fasta file.
Set program names¶
In [1]:
myjava="/usr/lib/jvm/oracle-java8-jdk-amd64/bin/java"
mybwa="/opt/NGS/bwa.kit/0.7.15/bwa.kit/bwa"
mypicard="/opt/NGS/picard-tools/2.8.3/picard.jar"
mygatk="/opt/NGS/GATK/3.7/GATK/GenomeAnalysisTK.jar"
mytabix="/opt/NGS/samtools/1.4.1/htslib-1.4/tabix"
mybgzip="/opt/NGS/samtools/1.4.1/htslib-1.4/bgzip"
Set reference and annotation files¶
In [2]:
myrefdir="/data1/workspace/tmp/HTS/Bundle/"
myfasta=$myrefdir"Homo_sapiens_assembly38.fasta"
mydict=$myrefdir"Homo_sapiens_assembly38.dict"
mysnp=$myrefdir"Homo_sapiens_assembly38.dbsnp138.vcf.gz"
Create bwa reference¶
In [3]:
time $mybwa index -a bwtsw $myfasta \
>bwaindex.out 2>bwaindex.err
real 0m0.002s
user 0m0.000s
sys 0m0.001s
Create fasta index¶
In [4]:
time $mysamtools faidx $myfasta \
>samtoolsindex.out 2>samtoolsindex.err
real 0m0.002s
user 0m0.000s
sys 0m0.001s
Create sequence dictionary¶
In [5]:
### Create dictionary
time $myjava -jar $mypicard CreateSequenceDictionary \
REFERENCE=$myfasta \
OUTPUT=$mydict \
>dict.out 2>dict.err
real 0m0.002s
user 0m0.000s
sys 0m0.001s
Zip vcf file and create an index¶
A vcf file for the GATK pipeline needs to be sorted and contain the reference dictionary. It also should be zipped and provided an index file. These step are only required if your reference vcf file has not been prepared (the vcf files from the GATK bundle are already prepared for the pipeline).
Note that the vcf needs to be compatible with the reference sequence file in terms of contigs and position.
In [6]:
vcforig="Homo_sapiens_assembly38.dbsnp138.vcf"
vcfgatk="Homo_sapiens_assembly38.dbsnp138-GATK.vcf"
Sort vcf and add dictionary¶
In [7]:
time $myjava -jar $mypicard SortVcf \
I=$vcforig \
O=$vcfgatk \
SEQUENCE_DICTIONARY=$mydict \
> vcfsort.out 2> vcfsort.err
real 0m0.002s
user 0m0.000s
sys 0m0.001s
Zip vcf file¶
In [8]:
time $mybgzip -c $vcfgatk \
> $vcfgatk.gz 2> bgzip.err
real 0m0.002s
user 0m0.000s
sys 0m0.001s
Create vcf index¶
In [9]:
time $mytabix -p vcf $vcfgatk.gz \
> tabix.out 2> tabix.err
real 0m0.002s
user 0m0.000s
sys 0m0.001s