Download and Index Genome

Shell Variables

In [1]:
# Source the config script
source bioinf_intro_config.sh

# Make the new directories
mkdir -p $GENOME_DIR $STAR_OUT

ls $CUROUT
genome  star_out

Reference Genome and Annotation

We have some preparation to do before we can map our data. First we need to download a reference genome and its annotation file. It is very important that the genome sequence and annotation are the same version, if they are not, things could go horribly wrong. The best way to ensure that your sequence and annotation are compatible is to download both from the same place, at the same time, and double check that they have the same version number. There are several good places to get genomes data:

Illumina does not have any Cryptococcus neoformans genomes, so we need to look elsewhere.

NCBI

NCBI has most published genomes, but it is a bit tricky to find exactly what we are looking for. A good place to start is the NCBI Genome Assembly page where we can search for “Cryptococcus neoformans H99”.

But the mapping software that we will be using, STAR, does not like the GFF format that NCBI uses for annotation. We could get the GFF from NCBI and convert it to a format that STAR likes, but it is easier to look elsewhere to see if we can find a GTF formatted file that STAR likes.

Ensembl

Our next stop is Ensembl, which is also difficult to navigate.
1. Start at the Ensembl Fungi page 2. Click on View full list of all Ensembl Fungi species 3. Find “Cryptococcus neoformans var. grubii H99” (it is helpful to type “H99” in the filter box) 4. There is a link to the different FASTA files, we want : ftp://ftp.ensemblgenomes.org/pub/release-39/fungi/fasta/fungi_basidiomycota1_collection/cryptococcus_neoformans_var_grubii_h99/dna/Cryptococcus_neoformans_var_grubii_h99.CNA3.dna.toplevel.fa.gz 5. There is also a link to GFF annotation, but we want GTF and the trail has gone cold. h ere is no obvious sign that there is a GTF file, you just have to know where to find it: ftp://ftp.ensemblgenomes.org/pub/release-39/fungi/gtf/fungi_basidiomycota1_collection/cryptococcus_neoformans_var_grubii_h99/Cryptococcus_neoformans_var_grubii_h99.CNA3.39.gtf.gz

Downloading with wget

Now we can use the wget command to actually download these files. We will get the files from NCBI. Here is what we will want to tell wget:

  • –directory-prefix : the directory to save the files in
  • –no-verbose : don’t output a lot of information while downloading
  • URL : what to download

We are going to make a “genome” directory for these files so that things don’t get too messy. Soon we will generate several files based on these that tophat needs. I generally like to keep the original file names, but we are changing the names to make typing easier later.

In [2]:
wget --no-verbose --directory-prefix ${GENOME_DIR} "ftp://ftp.ensemblgenomes.org/pub/release-39/fungi/gtf/fungi_basidiomycota1_collection/cryptococcus_neoformans_var_grubii_h99/Cryptococcus_neoformans_var_grubii_h99.CNA3.39.gtf.gz"

wget --no-verbose --directory-prefix ${GENOME_DIR} "ftp://ftp.ensemblgenomes.org/pub/release-39/fungi/fasta/fungi_basidiomycota1_collection/cryptococcus_neoformans_var_grubii_h99/dna/Cryptococcus_neoformans_var_grubii_h99.CNA3.dna.toplevel.fa.gz"
2018-07-19 13:25:17 URL: ftp://ftp.ensemblgenomes.org/pub/release-39/fungi/gtf/fungi_basidiomycota1_collection/cryptococcus_neoformans_var_grubii_h99/Cryptococcus_neoformans_var_grubii_h99.CNA3.39.gtf.gz [1796344] -> "/home/jovyan/work/scratch/bioinf_intro/genome/Cryptococcus_neoformans_var_grubii_h99.CNA3.39.gtf.gz" [1]
2018-07-19 13:25:27 URL: ftp://ftp.ensemblgenomes.org/pub/release-39/fungi/fasta/fungi_basidiomycota1_collection/cryptococcus_neoformans_var_grubii_h99/dna/Cryptococcus_neoformans_var_grubii_h99.CNA3.dna.toplevel.fa.gz [5922212] -> "/home/jovyan/work/scratch/bioinf_intro/genome/Cryptococcus_neoformans_var_grubii_h99.CNA3.dna.toplevel.fa.gz" [1]
In [3]:
ls ${GENOME_DIR}
Cryptococcus_neoformans_var_grubii_h99.CNA3.39.gtf.gz
Cryptococcus_neoformans_var_grubii_h99.CNA3.dna.toplevel.fa.gz

Decompressing the reference files

Next we need to uncompress the files using gunzip

In [4]:
gunzip ${GENOME_DIR}/Cryptococcus_neoformans_var_grubii_h99.CNA3.39.gtf.gz
In [5]:
gunzip ${GENOME_DIR}/Cryptococcus_neoformans_var_grubii_h99.CNA3.dna.toplevel.fa.gz

And now we have uncompressed versions of the files

In [6]:
ls ${GENOME_DIR}
Cryptococcus_neoformans_var_grubii_h99.CNA3.39.gtf
Cryptococcus_neoformans_var_grubii_h99.CNA3.dna.toplevel.fa

Let’s define variables for the FASTA and GTF files so we have more manageable names to work with

In [7]:
FASTA=${GENOME_DIR}/Cryptococcus_neoformans_var_grubii_h99.CNA3.dna.toplevel.fa
GTF=${GENOME_DIR}/Cryptococcus_neoformans_var_grubii_h99.CNA3.39.gtf

Let’s take a quick look at these files. The head command shows us the first few lines of a file (default is 10).

In [8]:
head $FASTA
>1 dna:chromosome chromosome:CNA3:1:1:2291499:1 REF
GAATTCTAAAACAGTTGCATTCTATAATTACAAAATAATTGAAACACTTCCATTCTCTTG
ACTAATATTATTTAATTAGACACCAACTCGACATTCTGTCTTCGACCTATCTTTCTCCTC
TCCCAGTCATCGCCCAGTAGAATTACCAGGCAATGAACCACGGCCTTTCATCCCAACGGC
ACAGCATATGGGTTCACTCCAACAGTGAACCATTCCAAAAGGCCTTGCCTGCGTGGCCAT
CTCCTCACAAACCCACCATCCTGCATCATCTCAGGTATCATACCTTCGATTCCTTCATCA
CCCCAAAGATCTTTCTGTTTGCCTTTGCGATTTGAGTGATACCAAACAAGAGAATACAGA
AAGTTAGGGACAAAGGGGGAGGGCTCTCTGAAGCGGCATGCTCCTCTTTCAACATGAAAA
TGGAAAGCCCCATCATCCATGTGGTATATGGGCAGGCAGCAAACCTTACAAACACCACTT
TTCAGATTTTCCAGAGCATCCAACCAGTCATCAATGTGATCGAGAGGAACCTTTTGACAC
In [9]:
head $GTF
#!genome-build CNA3
#!genome-version CNA3
#!genome-date 2015-11
#!genome-build-accession GCA_000149245.3
#!genebuild-last-updated 2015-11
1       ena     gene    100     5645    .       -       .       gene_id "CNAG_04548"; gene_source "ena"; gene_biotype "protein_coding";
1       ena     transcript      100     5645    .       -       .       gene_id "CNAG_04548"; transcript_id "AFR92135"; gene_source "ena"; gene_biotype "protein_coding"; transcript_source "ena"; transcript_biotype "protein_coding";
1       ena     exon    5494    5645    .       -       .       gene_id "CNAG_04548"; transcript_id "AFR92135"; exon_number "1"; gene_source "ena"; gene_biotype "protein_coding"; transcript_source "ena"; transcript_biotype "protein_coding"; exon_id "AFR92135-1";
1       ena     CDS     5494    5645    .       -       0       gene_id "CNAG_04548"; transcript_id "AFR92135"; exon_number "1"; gene_source "ena"; gene_biotype "protein_coding"; transcript_source "ena"; transcript_biotype "protein_coding"; protein_id "AFR92135"; protein_version "1";
1       ena     start_codon     5643    5645    .       -       0       gene_id "CNAG_04548"; transcript_id "AFR92135"; exon_number "1"; gene_source "ena"; gene_biotype "protein_coding"; transcript_source "ena"; transcript_biotype "protein_coding";

Unfortunately the GFF file has long lines, which are wrapping onto the next line, making them hard to read. Another option is to use the command less -S ${GENOME_DIR}/Cryptococcus_neoformans_var_grubii_h99.CNA3.39.gtf in the terminal (after sourcing the configuration script). The “-S” tells less to truncate lines instead of wrapping them.

GTF, a brief aside

You can find one description of the GTF format here. Unfortunately it is not entirely standard.

Indexing the Genome

Before we can map reads to the reference genome using STAR, we need to index it. This will generate a transformed version of the genome that allows STAR to efficiently map sequences to it. We run STAR in “genomeGenerate” mode to do this.

So here is how we run STAR for genome indexing:

In [10]:
STAR \
    --runMode genomeGenerate \
    --genomeDir $GENOME_DIR \
    --genomeFastaFiles ${FASTA} \
    --sjdbGTFfile ${GTF} \
    --outFileNamePrefix ${STAR_OUT}/genome_ \
    --genomeSAindexNbases 11
# --genomeSAindexNbases 6
Jul 19 13:25:29 ..... started STAR run
Jul 19 13:25:29 ... starting to generate Genome files
Jul 19 13:25:30 ... starting to sort Suffix Array. This may take a long time...
Jul 19 13:25:30 ... sorting Suffix Array chunks and saving them to disk...
Jul 19 13:25:50 ... loading chunks from disk, packing SA...
Jul 19 13:25:51 ... finished generating suffix array
Jul 19 13:25:51 ... generating Suffix Array index
Jul 19 13:25:53 ... completed Suffix Array index
Jul 19 13:25:53 ..... processing annotations GTF
Jul 19 13:25:53 ..... inserting junctions into the genome indices
Jul 19 13:26:32 ... writing Genome to disk ...
Jul 19 13:26:32 ... writing Suffix Array to disk ...
Jul 19 13:26:32 ... writing SAindex to disk
Jul 19 13:26:32 ..... finished successfully

STAR has a lot of command line options! So here is what the above command is doing: * –runMode genomeGenerate: index the genome * –genomeDir : output genome index files to this directory * –genomeFastaFiles : genome sequence file (in FASTA format) * –sjdbGTFfile : annotation file (in GTF format) * –outFileNamePrefix : prefix all output files with this string * –genomeSAindexNbases : selects string length for index, needs to be adjusted based on genome size, can also be made smaller to reduce memory usage at cost of speed

Now let’s check to be sure that worked:

In [11]:
ls $GENOME_DIR
chrLength.txt
chrNameLength.txt
chrName.txt
chrStart.txt
Cryptococcus_neoformans_var_grubii_h99.CNA3.39.gtf
Cryptococcus_neoformans_var_grubii_h99.CNA3.dna.toplevel.fa
exonGeTrInfo.tab
exonInfo.tab
geneInfo.tab
Genome
genomeParameters.txt
SA
SAindex
sjdbInfo.txt
sjdbList.fromGTF.out.tab
sjdbList.out.tab
transcriptInfo.tab