Download and Index Genome¶
Shell Variables¶
In [1]:
# Source the config script
source bioinf_intro_config.sh
# Make the new directories
mkdir -p $GENOME_DIR $STAR_OUT
ls $CUROUT
genome star_out
STAR¶
- Publication: https://doi.org/10.1093/bioinformatics/bts635
- Github repo: https://github.com/alexdobin/STAR
- Manual: https://github.com/alexdobin/STAR/blob/master/doc/STARmanual.pdf
Reference Genome and Annotation¶
We have some preparation to do before we can map our data. First we need to download a reference genome and its annotation file. It is very important that the genome sequence and annotation are the same version, if they are not, things could go horribly wrong. The best way to ensure that your sequence and annotation are compatible is to download both from the same place, at the same time, and double check that they have the same version number. There are several good places to get genomes data:
- Illumina’s website http://support.illumina.com/sequencing/sequencing_software/igenome.html.
- NCBI http://www.ncbi.nlm.nih.gov
- Ensembl http://www.ensembl.org/info/about/species.html
Illumina does not have any Cryptococcus neoformans genomes, so we need to look elsewhere.
NCBI¶
NCBI has most published genomes, but it is a bit tricky to find exactly what we are looking for. A good place to start is the NCBI Genome Assembly page where we can search for “Cryptococcus neoformans H99”.
But the mapping software that we will be using, STAR, does not like the GFF format that NCBI uses for annotation. We could get the GFF from NCBI and convert it to a format that STAR likes, but it is easier to look elsewhere to see if we can find a GTF formatted file that STAR likes.
Ensembl¶
Downloading with wget¶
Now we can use the wget
command to actually download these files. We
will get the files from NCBI. Here is what we will want to tell wget:
- –directory-prefix : the directory to save the files in
- –no-verbose : don’t output a lot of information while downloading
- URL : what to download
We are going to make a “genome” directory for these files so that things don’t get too messy. Soon we will generate several files based on these that tophat needs. I generally like to keep the original file names, but we are changing the names to make typing easier later.
In [2]:
wget --no-verbose --directory-prefix ${GENOME_DIR} "ftp://ftp.ensemblgenomes.org/pub/release-39/fungi/gtf/fungi_basidiomycota1_collection/cryptococcus_neoformans_var_grubii_h99/Cryptococcus_neoformans_var_grubii_h99.CNA3.39.gtf.gz"
wget --no-verbose --directory-prefix ${GENOME_DIR} "ftp://ftp.ensemblgenomes.org/pub/release-39/fungi/fasta/fungi_basidiomycota1_collection/cryptococcus_neoformans_var_grubii_h99/dna/Cryptococcus_neoformans_var_grubii_h99.CNA3.dna.toplevel.fa.gz"
2018-07-19 13:25:17 URL: ftp://ftp.ensemblgenomes.org/pub/release-39/fungi/gtf/fungi_basidiomycota1_collection/cryptococcus_neoformans_var_grubii_h99/Cryptococcus_neoformans_var_grubii_h99.CNA3.39.gtf.gz [1796344] -> "/home/jovyan/work/scratch/bioinf_intro/genome/Cryptococcus_neoformans_var_grubii_h99.CNA3.39.gtf.gz" [1]
2018-07-19 13:25:27 URL: ftp://ftp.ensemblgenomes.org/pub/release-39/fungi/fasta/fungi_basidiomycota1_collection/cryptococcus_neoformans_var_grubii_h99/dna/Cryptococcus_neoformans_var_grubii_h99.CNA3.dna.toplevel.fa.gz [5922212] -> "/home/jovyan/work/scratch/bioinf_intro/genome/Cryptococcus_neoformans_var_grubii_h99.CNA3.dna.toplevel.fa.gz" [1]
In [3]:
ls ${GENOME_DIR}
Cryptococcus_neoformans_var_grubii_h99.CNA3.39.gtf.gz
Cryptococcus_neoformans_var_grubii_h99.CNA3.dna.toplevel.fa.gz
Decompressing the reference files¶
Next we need to uncompress the files using gunzip
In [4]:
gunzip ${GENOME_DIR}/Cryptococcus_neoformans_var_grubii_h99.CNA3.39.gtf.gz
In [5]:
gunzip ${GENOME_DIR}/Cryptococcus_neoformans_var_grubii_h99.CNA3.dna.toplevel.fa.gz
And now we have uncompressed versions of the files
In [6]:
ls ${GENOME_DIR}
Cryptococcus_neoformans_var_grubii_h99.CNA3.39.gtf
Cryptococcus_neoformans_var_grubii_h99.CNA3.dna.toplevel.fa
Let’s define variables for the FASTA and GTF files so we have more manageable names to work with
In [7]:
FASTA=${GENOME_DIR}/Cryptococcus_neoformans_var_grubii_h99.CNA3.dna.toplevel.fa
GTF=${GENOME_DIR}/Cryptococcus_neoformans_var_grubii_h99.CNA3.39.gtf
Let’s take a quick look at these files. The head
command shows us
the first few lines of a file (default is 10).
In [8]:
head $FASTA
>1 dna:chromosome chromosome:CNA3:1:1:2291499:1 REF
GAATTCTAAAACAGTTGCATTCTATAATTACAAAATAATTGAAACACTTCCATTCTCTTG
ACTAATATTATTTAATTAGACACCAACTCGACATTCTGTCTTCGACCTATCTTTCTCCTC
TCCCAGTCATCGCCCAGTAGAATTACCAGGCAATGAACCACGGCCTTTCATCCCAACGGC
ACAGCATATGGGTTCACTCCAACAGTGAACCATTCCAAAAGGCCTTGCCTGCGTGGCCAT
CTCCTCACAAACCCACCATCCTGCATCATCTCAGGTATCATACCTTCGATTCCTTCATCA
CCCCAAAGATCTTTCTGTTTGCCTTTGCGATTTGAGTGATACCAAACAAGAGAATACAGA
AAGTTAGGGACAAAGGGGGAGGGCTCTCTGAAGCGGCATGCTCCTCTTTCAACATGAAAA
TGGAAAGCCCCATCATCCATGTGGTATATGGGCAGGCAGCAAACCTTACAAACACCACTT
TTCAGATTTTCCAGAGCATCCAACCAGTCATCAATGTGATCGAGAGGAACCTTTTGACAC
In [9]:
head $GTF
#!genome-build CNA3
#!genome-version CNA3
#!genome-date 2015-11
#!genome-build-accession GCA_000149245.3
#!genebuild-last-updated 2015-11
1 ena gene 100 5645 . - . gene_id "CNAG_04548"; gene_source "ena"; gene_biotype "protein_coding";
1 ena transcript 100 5645 . - . gene_id "CNAG_04548"; transcript_id "AFR92135"; gene_source "ena"; gene_biotype "protein_coding"; transcript_source "ena"; transcript_biotype "protein_coding";
1 ena exon 5494 5645 . - . gene_id "CNAG_04548"; transcript_id "AFR92135"; exon_number "1"; gene_source "ena"; gene_biotype "protein_coding"; transcript_source "ena"; transcript_biotype "protein_coding"; exon_id "AFR92135-1";
1 ena CDS 5494 5645 . - 0 gene_id "CNAG_04548"; transcript_id "AFR92135"; exon_number "1"; gene_source "ena"; gene_biotype "protein_coding"; transcript_source "ena"; transcript_biotype "protein_coding"; protein_id "AFR92135"; protein_version "1";
1 ena start_codon 5643 5645 . - 0 gene_id "CNAG_04548"; transcript_id "AFR92135"; exon_number "1"; gene_source "ena"; gene_biotype "protein_coding"; transcript_source "ena"; transcript_biotype "protein_coding";
Unfortunately the GFF file has long lines, which are wrapping onto the
next line, making them hard to read. Another option is to use the
command
less -S ${GENOME_DIR}/Cryptococcus_neoformans_var_grubii_h99.CNA3.39.gtf
in the terminal (after sourcing the configuration script). The “-S”
tells less
to truncate lines instead of wrapping them.
GTF, a brief aside
You can find one description of the GTF format here. Unfortunately it is not entirely standard.
Indexing the Genome¶
Before we can map reads to the reference genome using STAR, we need to index it. This will generate a transformed version of the genome that allows STAR to efficiently map sequences to it. We run STAR in “genomeGenerate” mode to do this.
So here is how we run STAR for genome indexing:
In [10]:
STAR \
--runMode genomeGenerate \
--genomeDir $GENOME_DIR \
--genomeFastaFiles ${FASTA} \
--sjdbGTFfile ${GTF} \
--outFileNamePrefix ${STAR_OUT}/genome_ \
--genomeSAindexNbases 11
# --genomeSAindexNbases 6
Jul 19 13:25:29 ..... started STAR run
Jul 19 13:25:29 ... starting to generate Genome files
Jul 19 13:25:30 ... starting to sort Suffix Array. This may take a long time...
Jul 19 13:25:30 ... sorting Suffix Array chunks and saving them to disk...
Jul 19 13:25:50 ... loading chunks from disk, packing SA...
Jul 19 13:25:51 ... finished generating suffix array
Jul 19 13:25:51 ... generating Suffix Array index
Jul 19 13:25:53 ... completed Suffix Array index
Jul 19 13:25:53 ..... processing annotations GTF
Jul 19 13:25:53 ..... inserting junctions into the genome indices
Jul 19 13:26:32 ... writing Genome to disk ...
Jul 19 13:26:32 ... writing Suffix Array to disk ...
Jul 19 13:26:32 ... writing SAindex to disk
Jul 19 13:26:32 ..... finished successfully
STAR has a lot of command line options! So here is what the above command is doing: * –runMode genomeGenerate: index the genome * –genomeDir : output genome index files to this directory * –genomeFastaFiles : genome sequence file (in FASTA format) * –sjdbGTFfile : annotation file (in GTF format) * –outFileNamePrefix : prefix all output files with this string * –genomeSAindexNbases : selects string length for index, needs to be adjusted based on genome size, can also be made smaller to reduce memory usage at cost of speed
Now let’s check to be sure that worked:
In [11]:
ls $GENOME_DIR
chrLength.txt
chrNameLength.txt
chrName.txt
chrStart.txt
Cryptococcus_neoformans_var_grubii_h99.CNA3.39.gtf
Cryptococcus_neoformans_var_grubii_h99.CNA3.dna.toplevel.fa
exonGeTrInfo.tab
exonInfo.tab
geneInfo.tab
Genome
genomeParameters.txt
SA
SAindex
sjdbInfo.txt
sjdbList.fromGTF.out.tab
sjdbList.out.tab
transcriptInfo.tab