Mapping Reads to a Reference Genome

Shell Variables

Assign the variables in this notebook.

In [1]:
# We used these in the last notebook
CUROUT=$HOME/work/scratch/2015_output
TRIMMED=$CUROUT/trimmed_fastqs

# The following are new for this notebook
GENOME_DIR=$CUROUT/genome
TH_DIR=$CUROUT/th_dir

Making New Directories

Make the directories that are new in this notebook

In [2]:
mkdir -p $GENOME_DIR $TH_DIR

Now let’s check to be sure that worked. We will run ls and check that these directories now exist in the $CUROUT directory.

In [3]:
ls $CUROUT
counts                  qc_output               trimmed_fastqs
demux_fastqs            stuff_for_igv.tgz
genome                  th_dir

Mapping with Tophat

Reference Genome and Annotation

We have some preparation to do before we can map our data. First we need to download a reference genome and its annotation file. It is very important that the genome sequence and annotation are the same version, if they are not, things could go horribly wrong. The best way to ensure that your sequence and annotation are compatible is to download both from the same place, at the same time, and double check that they have the same version number. There are several good places to get genomes data:

Illumina has several Escherichia coli genomes, but not the one we are working with, which is E. coli K-12 strain W3110.

NCBI

NCBI has most published genomes, but it is a bit tricky to find exactly what we are looking for. Let’s start at the NCBI Genome Assembly page and search for “Escherichia coli W3110”. This gets us to the Genome Assembly for W3110, ASM1024v1. On the right side of the ASM1024v1 page is a link for “GenBank FTP site”: . This is exactly what we want (note that some browsers can’t handle an FTP link like this)! From here we want two files:

  • The genome sequence file, which ends in “.fna.gz” (.fna is short for FASTA nucleic acid, .gz means its gzipped)
  • The genome annotation which ends in “.gff.gz” (.gff is Generic Feature Format)

Ensembl

Ensembl is similarly difficult to navigate. We will start at the Ensembl Bacteria page, and again search for “Escherichia coli W3110” in the Search for a genome search box. This will get us to the Escherichia coli str. K-12 substr. W3110 page. On the right side, are links for FASTA and GFF3. There are several options for the genome sequence, but the one we want is the complete unmasked assembled sequence.

Downloading with wget

Now we can use the wget command to actually download these files. We will get the files from NCBI. Here is what we will want to tell wget:

  • –output-document : the name to use when saving the file
  • –no-verbose : don’t output a lot of information while downloading
  • URL : what to download

We are going to make a “genome” directory for these files so that things don’t get too messy. Soon we will generate several files based on these that tophat needs. I generally like to keep the original file names, but we are changing the names to make typing easier later. We are changing the FASTA file ending from ”.fna” to ”.fa”, because tophat wants a file names ”.fa”, its picky that way.

In [4]:
ACCESSION="GCA_000010245.1_ASM1024v1"
PREFIX=${ACCESSION}_genomic
GFF=${PREFIX}.gff
FNA=${PREFIX}.fna
FA=${PREFIX}.fa
ln -s ${GENOME_DIR}/${FNA} ${GENOME_DIR}/${FA}
In [5]:
FIRST_PART="ftp.ncbi.nlm.nih.gov/genomes/genbank/bacteria/Escherichia_coli/latest_assembly_versions"
for CUR in $GFF $FNA ; do
    rsync rsync://${FIRST_PART}/${ACCESSION}/${CUR}.gz ${GENOME_DIR}
done

Warning Notice!

You are accessing a U.S. Government information system which includes this
computer, network, and all attached devices. This system is for
Government-authorized use only. Unauthorized use of this system may result in
disciplinary action and civil and criminal penalties. System users have no
expectation of privacy regarding any communications or data processed by this
system. At any time, the government may monitor, record, or seize any
communication or data transiting or stored on this information system.

-------------------------------------------------------------------------------

Welcome to the NCBI rsync server.



Warning Notice!

You are accessing a U.S. Government information system which includes this
computer, network, and all attached devices. This system is for
Government-authorized use only. Unauthorized use of this system may result in
disciplinary action and civil and criminal penalties. System users have no
expectation of privacy regarding any communications or data processed by this
system. At any time, the government may monitor, record, or seize any
communication or data transiting or stored on this information system.

-------------------------------------------------------------------------------

Welcome to the NCBI rsync server.


In [6]:
ls $GENOME_DIR
GCA_000010245.1_ASM1024v1_genomic.fa
GCA_000010245.1_ASM1024v1_genomic.fna.gz
GCA_000010245.1_ASM1024v1_genomic.gff.gz

Decompressing the reference files

Now we need to decompress the FASTA and GFF files. We will use gunzip

In [7]:
gunzip ${GENOME_DIR}/${GFF}.gz | cat
gunzip ${GENOME_DIR}/${FNA}.gz | cat
In [8]:
ls $GENOME_DIR
GCA_000010245.1_ASM1024v1_genomic.fa    GCA_000010245.1_ASM1024v1_genomic.gff
GCA_000010245.1_ASM1024v1_genomic.fna

Let’s take a quick look at these files. The head command shows us the first few lines of a file (default is 10).

In [9]:
head ${GENOME_DIR}/$GFF
##gff-version 3
#!gff-spec-version 1.21
#!processor NCBI annotwriter
#!genome-build ASM1024v1
#!genome-build-accession NCBI_Assembly:GCA_000010245.1
##sequence-region AP009048.1 1 4646332
##species https://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi?id=316407
AP009048.1      DDBJ    region  1       4646332 .       +       .       ID=id0;Dbxref=taxon:316407;Is_circular=true;gbkey=Src;mol_type=genomic DNA;strain=K-12;substrain=W3110
AP009048.1      DDBJ    gene    190     255     .       +       .       ID=gene0;Name=thrL;gbkey=Gene;gene=thrL;gene_biotype=protein_coding
AP009048.1      DDBJ    CDS     190     255     .       +       0       ID=cds0;Parent=gene0;Dbxref=NCBI_GP:BAE76026.1;Name=BAE76026.1;Note=ECK0001:JW4367:b0001;gbkey=CDS;gene=thrL;product=thr operon leader peptide;protein_id=BAE76026.1;transl_table=11
In [10]:
head ${GENOME_DIR}/$FA
>AP009048.1 Escherichia coli str. K-12 substr. W3110 DNA, complete genome
AGCTTTTCATTCTGACTGCAACGGGCAATATGTCTCTGTGTGGATTAAAAAAAGAGTGTCTGATAGCAGCTTCTGAACTG
GTTACCTGCCGTGAGTAAATTAAAATTTTATTGACTTAGGTCACTAAATACTTTAACCAATATAGGCATAGCGCACAGAC
AGATAAAAATTACAGAGTACACAACATCCATGAAACGCATTAGCACCACCATTACCACCACCATCACCATTACCACAGGT
AACGGTGCGGGCTGACGCGTACAGGAAACACAGAAAAAAGCCCGCACCTGACAGTGCGGGCTTTTTTTTTCGACCAAAGG
TAACGAGGTAACAACCATGCGAGTGTTGAAGTTCGGCGGTACATCAGTGGCAAATGCAGAACGTTTTCTGCGTGTTGCCG
ATATTCTGGAAAGCAATGCCAGGCAGGGGCAGGTGGCCACCGTCCTCTCTGCCCCCGCCAAAATCACCAACCACCTGGTG
GCGATGATTGAAAAAACCATTAGCGGCCAGGATGCTTTACCCAATATCAGCGATGCCGAACGTATTTTTGCCGAACTTTT
GACGGGACTCGCCGCCGCCCAGCCGGGGTTCCCGCTGGCGCAATTGAAAACTTTCGTCGATCAGGAATTTGCCCAAATAA
AACATGTCCTGCATGGCATTAGTTTGTTGGGGCAGTGCCCGGATAGCATCAACGCTGCGCTGATTTGCCGTGGCGAGAAA

Unfortunately the GFF file has long lines, which are wrapping onto the next line, making them hard to read. Another option is to use the command less -S $GENOME_DIR/ecoli_w3110.gff in the terminal (after pasting in our shell variables from the top of the notebook. The “-S” tells less to truncate lines instead of wrapping them.

GFF, a brief aside

You can find one description of the GFF format here http://www.sequenceontology.org/gff3.shtml. Unfortunately it is not entirely standard, there are several different version numbers (1, 2, 2.5, 3), and some variations within these version numbers. By getting our annotation from NCBI, we have a good chance that the GFF format will be compatible with most software.

Indexing the Genome

Before we can map reads to the reference genome using Bowtie or Tophat, we need to index it. This will generate a transformed version of the genome that allows Bowtie to efficiently map sequences to it. We use bowtie2-build (part of the Bowtie package) to do this. The command for bowtie2-build is bowtie2-build REF_GENOME INDEX_PATH.

  • REF_GENOME : the file containing the reference sequence, this must be in FASTA format.
  • INDEX_PATH : The names of the index files generated by bowtie2-build will all start with INDEX_PATH, and when actually run Bowtie, we will also supply it with INDEX_PATH, and it will find all the files. Note: INDEX_PATH can be anything we want, it does not need to be at related to the name of the REF_GENOME file, but it does make things less confusing if we are consistent.

So here is how we run bowtie2-build:

In [11]:
# the "| cat" is a hack that prevents problems with jupyter
bowtie2-build -h | cat
Bowtie 2 version 2.3.2 by Ben Langmead (langmea@cs.jhu.edu, www.cs.jhu.edu/~langmea)
Usage: bowtie2-build [options]* <reference_in> <bt2_index_base>
    reference_in            comma-separated list of files with ref sequences
    bt2_index_base          write bt2 data to files with this dir/basename
*** Bowtie 2 indexes work only with v2 (not v1).  Likewise for v1 indexes. ***
Options:
    -f                      reference files are Fasta (default)
    -c                      reference sequences given on cmd line (as
                            <reference_in>)
    --large-index           force generated index to be 'large', even if ref
                            has fewer than 4 billion nucleotides
    -a/--noauto             disable automatic -p/--bmax/--dcv memory-fitting
    -p/--packed             use packed strings internally; slower, less memory
    --bmax <int>            max bucket sz for blockwise suffix-array builder
    --bmaxdivn <int>        max bucket sz as divisor of ref len (default: 4)
    --dcv <int>             diff-cover period for blockwise (default: 1024)
    --nodc                  disable diff-cover (algorithm becomes quadratic)
    -r/--noref              don't build .3/.4 index files
    -3/--justref            just build .3/.4 index files
    -o/--offrate <int>      SA is sampled every 2^<int> BWT chars (default: 5)
    -t/--ftabchars <int>    # of chars consumed in initial lookup (default: 10)
    --threads <int>         # of threads
    --seed <int>            seed for random number generator
    -q/--quiet              verbose output (for debugging)
    -h/--help               print detailed description of tool and its options
    --usage                 print this usage message
    --version               print version information and quit
In [12]:
bowtie2-build $GENOME_DIR/$FA $GENOME_DIR/$PREFIX
Settings:
  Output files: "/Users/cliburn/work/scratch/2015_output/genome/GCA_000010245.1_ASM1024v1_genomic.*.bt2"
  Line rate: 6 (line is 64 bytes)
  Lines per side: 1 (side is 64 bytes)
  Offset rate: 4 (one in 16)
  FTable chars: 10
  Strings: unpacked
  Max bucket size: default
  Max bucket size, sqrt multiplier: default
  Max bucket size, len divisor: 4
  Difference-cover sample period: 1024
  Endianness: little
  Actual local endianness: little
  Sanity checking: disabled
  Assertions: disabled
  Random seed: 0
  Sizeofs: void*:8, int:4, long:8, size_t:8
Input files DNA, FASTA:
  /Users/cliburn/work/scratch/2015_output/genome/GCA_000010245.1_ASM1024v1_genomic.fa
Building a SMALL index
Reading reference sizes
  Time reading reference sizes: 00:00:00
Calculating joined length
Writing header
Reserving space for joined string
Joining reference sequences
  Time to join reference sequences: 00:00:00
bmax according to bmaxDivN setting: 1161583
Using parameters --bmax 871188 --dcv 1024
  Doing ahead-of-time memory usage test
  Passed!  Constructing with these parameters: --bmax 871188 --dcv 1024
Constructing suffix-array element generator
Building DifferenceCoverSample
  Building sPrime
  Building sPrimeOrder
  V-Sorting samples
  V-Sorting samples time: 00:00:00
  Allocating rank array
  Ranking v-sort output
  Ranking v-sort output time: 00:00:00
  Invoking Larsson-Sadakane on ranks
  Invoking Larsson-Sadakane on ranks time: 00:00:00
  Sanity-checking and returning
Building samples
Reserving space for 12 sample suffixes
Generating random suffixes
QSorting 12 sample offsets, eliminating duplicates
QSorting sample offsets, eliminating duplicates time: 00:00:00
Multikey QSorting 12 samples
  (Using difference cover)
  Multikey QSorting samples time: 00:00:00
Calculating bucket sizes
Splitting and merging
  Splitting and merging time: 00:00:00
Avg bucket size: 4.64633e+06 (target: 871187)
Converting suffix-array elements to index image
Allocating ftab, absorbFtab
Entering Ebwt loop
Getting block 1 of 1
  No samples; assembling all-inclusive block
  Sorting block of length 4646332 for bucket 1
  (Using difference cover)
  Sorting block time: 00:00:01
Returning block of 4646333 for bucket 1
Exited Ebwt loop
fchr[A]: 0
fchr[C]: 1142182
fchr[G]: 2321261
fchr[T]: 3502499
fchr[$]: 4646332
Exiting Ebwt::buildToDisk()
Returning from initFromVector
Wrote 5743338 bytes to primary EBWT file: /Users/cliburn/work/scratch/2015_output/genome/GCA_000010245.1_ASM1024v1_genomic.1.bt2
Wrote 1161588 bytes to secondary EBWT file: /Users/cliburn/work/scratch/2015_output/genome/GCA_000010245.1_ASM1024v1_genomic.2.bt2
Re-opening _in1 and _in2 as input streams
Returning from Ebwt constructor
Headers:
    len: 4646332
    bwtLen: 4646333
    sz: 1161583
    bwtSz: 1161584
    lineRate: 6
    offRate: 4
    offMask: 0xfffffff0
    ftabChars: 10
    eftabLen: 20
    eftabSz: 80
    ftabLen: 1048577
    ftabSz: 4194308
    offsLen: 290396
    offsSz: 1161584
    lineSz: 64
    sideSz: 64
    sideBwtSz: 48
    sideBwtLen: 192
    numSides: 24200
    numLines: 24200
    ebwtTotLen: 1548800
    ebwtTotSz: 1548800
    color: 0
    reverse: 0
Total time for call to driver() for forward index: 00:00:01
Reading reference sizes
  Time reading reference sizes: 00:00:00
Calculating joined length
Writing header
Reserving space for joined string
Joining reference sequences
  Time to join reference sequences: 00:00:00
  Time to reverse reference sequence: 00:00:00
bmax according to bmaxDivN setting: 1161583
Using parameters --bmax 871188 --dcv 1024
  Doing ahead-of-time memory usage test
  Passed!  Constructing with these parameters: --bmax 871188 --dcv 1024
Constructing suffix-array element generator
Building DifferenceCoverSample
  Building sPrime
  Building sPrimeOrder
  V-Sorting samples
  V-Sorting samples time: 00:00:00
  Allocating rank array
  Ranking v-sort output
  Ranking v-sort output time: 00:00:00
  Invoking Larsson-Sadakane on ranks
  Invoking Larsson-Sadakane on ranks time: 00:00:00
  Sanity-checking and returning
Building samples
Reserving space for 12 sample suffixes
Generating random suffixes
QSorting 12 sample offsets, eliminating duplicates
QSorting sample offsets, eliminating duplicates time: 00:00:00
Multikey QSorting 12 samples
  (Using difference cover)
  Multikey QSorting samples time: 00:00:00
Calculating bucket sizes
Splitting and merging
  Splitting and merging time: 00:00:00
Avg bucket size: 4.64633e+06 (target: 871187)
Converting suffix-array elements to index image
Allocating ftab, absorbFtab
Entering Ebwt loop
Getting block 1 of 1
  No samples; assembling all-inclusive block
  Sorting block of length 4646332 for bucket 1
  (Using difference cover)
  Sorting block time: 00:00:01
Returning block of 4646333 for bucket 1
Exited Ebwt loop
fchr[A]: 0
fchr[C]: 1142182
fchr[G]: 2321261
fchr[T]: 3502499
fchr[$]: 4646332
Exiting Ebwt::buildToDisk()
Returning from initFromVector
Wrote 5743338 bytes to primary EBWT file: /Users/cliburn/work/scratch/2015_output/genome/GCA_000010245.1_ASM1024v1_genomic.rev.1.bt2
Wrote 1161588 bytes to secondary EBWT file: /Users/cliburn/work/scratch/2015_output/genome/GCA_000010245.1_ASM1024v1_genomic.rev.2.bt2
Re-opening _in1 and _in2 as input streams
Returning from Ebwt constructor
Headers:
    len: 4646332
    bwtLen: 4646333
    sz: 1161583
    bwtSz: 1161584
    lineRate: 6
    offRate: 4
    offMask: 0xfffffff0
    ftabChars: 10
    eftabLen: 20
    eftabSz: 80
    ftabLen: 1048577
    ftabSz: 4194308
    offsLen: 290396
    offsSz: 1161584
    lineSz: 64
    sideSz: 64
    sideBwtSz: 48
    sideBwtLen: 192
    numSides: 24200
    numLines: 24200
    ebwtTotLen: 1548800
    ebwtTotSz: 1548800
    color: 0
    reverse: 1
Total time for backward call to driver() for mirror index: 00:00:01

Now let’s check to be sure that worked:

In [13]:
ls $GENOME_DIR
GCA_000010245.1_ASM1024v1_genomic.1.bt2
GCA_000010245.1_ASM1024v1_genomic.2.bt2
GCA_000010245.1_ASM1024v1_genomic.3.bt2
GCA_000010245.1_ASM1024v1_genomic.4.bt2
GCA_000010245.1_ASM1024v1_genomic.fa
GCA_000010245.1_ASM1024v1_genomic.fna
GCA_000010245.1_ASM1024v1_genomic.gff
GCA_000010245.1_ASM1024v1_genomic.rev.1.bt2
GCA_000010245.1_ASM1024v1_genomic.rev.2.bt2

Mapping with Tophat

We will use Tophat to map the RNA-Seq reads. Tophat runs bowtie, but with a twist - it uses the GTF file and the genome sequence to generate a virtual transcriptome. It tries to map each read to the transcriptome, then to the genome, then it tries to identify novel splice sites that could have resulted in the read. This is explained in this flowchart from the Tophat2 publication

Running Tophat

The minimum tophat command is tophat2 INDEX_PATH FASTQ. This is the same INDEX_PATH that we used for bowtie2-build. Although that command is enough to run Tophat, it is going to be very useful to give it some more options:

  • -G GTF_FILE : Tophat uses the genome annotation to figure out where exons are (i.e. where it expects to see mRNA).
  • –library-type LIBRARY_TYPE : This tells tophat whether the data is stranded, and if so, which strand was sequenced.
  • –output-dir DIRECTORY : where to put the results
  • –max-intron-length NUM_BP : Tophat uses intron size defaults that are optimized for human genomes, these values are unreasonable for a microbial genome (of course we don’t expect any introns in E. coli).
  • –min-intron-length NUM_BP : also optimized for human genomes, and unreasonable for a microbial genome.
  • –transcriptome-max-hits NUM_HITS : Maximum number of mappings allowed for a read, when aligned to the transcriptome (any reads found with more then this number of mappings will be discarded).
  • –max-multihits NUM_HITS : number of alignments to report if there is more than one.
  • –no-coverage-search : turns off searching for novel splice junctions based on read depth
  • –no-novel-juncs : do not try to find novel splice junctions
  • –num-threads NUM : number of cpus to use

We will start with these parameters, but there is an extensive list of command line options detailed in the Tophat Manual, it is a good idea to read through and try to understand all of them. We will discuss some more later.

In [14]:
# the "| cat" is a hack that prevents problems with jupyter
tophat2 -h | cat
readlink: illegal option -- f
usage: readlink [-n] [file ...]
tophat:
TopHat maps short sequences from spliced transcripts to whole genomes.

Usage:
    tophat [options] <bowtie_index> <reads1[,reads2,...]> [reads1[,reads2,...]] \
                                    [quals1,[quals2,...]] [quals1[,quals2,...]]

Options:
    -v/--version
    -o/--output-dir                <string>    [ default: ./tophat_out         ]
    --bowtie1                                  [ default: bowtie2              ]
    -N/--read-mismatches           <int>       [ default: 2                    ]
    --read-gap-length              <int>       [ default: 2                    ]
    --read-edit-dist               <int>       [ default: 2                    ]
    --read-realign-edit-dist       <int>       [ default: "read-edit-dist" + 1 ]
    -a/--min-anchor                <int>       [ default: 8                    ]
    -m/--splice-mismatches         <0-2>       [ default: 0                    ]
    -i/--min-intron-length         <int>       [ default: 50                   ]
    -I/--max-intron-length         <int>       [ default: 500000               ]
    -g/--max-multihits             <int>       [ default: 20                   ]
    --suppress-hits
    -x/--transcriptome-max-hits    <int>       [ default: 60                   ]
    -M/--prefilter-multihits                   ( for -G/--GTF option, enable
                                                 an initial bowtie search
                                                 against the genome )
    --max-insertion-length         <int>       [ default: 3                    ]
    --max-deletion-length          <int>       [ default: 3                    ]
    --solexa-quals
    --solexa1.3-quals                          (same as phred64-quals)
    --phred64-quals                            (same as solexa1.3-quals)
    -Q/--quals
    --integer-quals
    -C/--color                                 (Solid - color space)
    --color-out
    --library-type                 <string>    (fr-unstranded, fr-firststrand,
                                                fr-secondstrand)
    -p/--num-threads               <int>       [ default: 1                   ]
    -R/--resume                    <out_dir>   ( try to resume execution )
    -G/--GTF                       <filename>  (GTF/GFF with known transcripts)
    --transcriptome-index          <bwtidx>    (transcriptome bowtie index)
    -T/--transcriptome-only                    (map only to the transcriptome)
    -j/--raw-juncs                 <filename>
    --insertions                   <filename>
    --deletions                    <filename>
    -r/--mate-inner-dist           <int>       [ default: 50                  ]
    --mate-std-dev                 <int>       [ default: 20                  ]
    --no-novel-juncs
    --no-novel-indels
    --no-gtf-juncs
    --no-coverage-search
    --coverage-search
    --microexon-search
    --keep-tmp
    --tmp-dir                      <dirname>   [ default: <output_dir>/tmp ]
    -z/--zpacker                   <program>   [ default: gzip             ]
    -X/--unmapped-fifo                         [use mkfifo to compress more temporary
                                                 files for color space reads]

Advanced Options:
    --report-secondary-alignments
    --no-discordant
    --no-mixed

    --segment-mismatches           <int>       [ default: 2                ]
    --segment-length               <int>       [ default: 25               ]

    --bowtie-n                                 [ default: bowtie -v        ]
    --min-coverage-intron          <int>       [ default: 50               ]
    --max-coverage-intron          <int>       [ default: 20000            ]
    --min-segment-intron           <int>       [ default: 50               ]
    --max-segment-intron           <int>       [ default: 500000           ]
    --no-sort-bam                              (Output BAM is not coordinate-sorted)
    --no-convert-bam                           (Do not output bam format.
                                                Output is <output_dir>/accepted_hits.sam)
    --keep-fasta-order
    --allow-partial-mapping

Bowtie2 related options:
  Preset options in --end-to-end mode (local alignment is not used in TopHat2)
    --b2-very-fast
    --b2-fast
    --b2-sensitive
    --b2-very-sensitive

  Alignment options
    --b2-N                         <int>       [ default: 0                ]
    --b2-L                         <int>       [ default: 20               ]
    --b2-i                         <func>      [ default: S,1,1.25         ]
    --b2-n-ceil                    <func>      [ default: L,0,0.15         ]
    --b2-gbar                      <int>       [ default: 4                ]

  Scoring options
    --b2-mp                        <int>,<int> [ default: 6,2              ]
    --b2-np                        <int>       [ default: 1                ]
    --b2-rdg                       <int>,<int> [ default: 5,3              ]
    --b2-rfg                       <int>,<int> [ default: 5,3              ]
    --b2-score-min                 <func>      [ default: L,-0.6,-0.6      ]

  Effort options
    --b2-D                         <int>       [ default: 15               ]
    --b2-R                         <int>       [ default: 2                ]

Fusion related options:
    --fusion-search
    --fusion-anchor-length         <int>       [ default: 20               ]
    --fusion-min-dist              <int>       [ default: 10000000         ]
    --fusion-read-mismatches       <int>       [ default: 2                ]
    --fusion-multireads            <int>       [ default: 2                ]
    --fusion-multipairs            <int>       [ default: 2                ]
    --fusion-ignore-chromosomes    <list>      [ e.g, <chrM,chrX>          ]

    --fusion-do-not-resolve-conflicts          [this is for test purposes  ]

SAM Header Options (for embedding sequencing run metadata in output):
    --rg-id                        <string>    (read group ID)
    --rg-sample                    <string>    (sample ID)
    --rg-library                   <string>    (library ID)
    --rg-description               <string>    (descriptive string, no tabs allowed)
    --rg-platform-unit             <string>    (e.g Illumina lane ID)
    --rg-center                    <string>    (sequencing center name)
    --rg-date                      <string>    (ISO 8601 date of the sequencing run)
    --rg-platform                  <string>    (Sequencing platform descriptor)

    for detailed help see http://ccb.jhu.edu/software/tophat/manual.shtml

Running Tophat for one sample

Tophat uses the same output names whenever it is run. In some ways this is useful, but it means that we need to create a separate directory for each sample, and tell tophat to put the results for each sample in the correct directory. So let’s make a directory for the sample we have been working with: r1.8A_pilot.

In [15]:
CUR_DIR=$TH_DIR/r1_8A_pilot
CUR_FASTQ=$TRIMMED/r1.8A_pilot.trim.fastq.gz

mkdir -p $CUR_DIR
ls $TH_DIR
7A_pilot        7C_pilot        8B_pilot        r1_8A_pilot
7B_pilot        8A_pilot        8C_pilot

Now, we will run tophat on the trimmed FASTQ of r1.8A_pilot, and tell tophat to put its output in that directory.

In [16]:
tophat2 -G $GENOME_DIR/$GFF \
    --library-type fr-firststrand \
    --output-dir $CUR_DIR \
    --max-intron-length 5 \
    --min-intron-length 4 \
    --transcriptome-max-hits 1 \
    --max-multihits 1 \
    --no-coverage-search \
    --no-novel-juncs \
    --no-sort-bam \
    $GENOME_DIR/$PREFIX \
    $CUR_FASTQ
readlink: illegal option -- f
usage: readlink [-n] [file ...]

[2017-08-08 09:10:05] Beginning TopHat run (v2.1.1)
-----------------------------------------------
[2017-08-08 09:10:05] Checking for Bowtie
                  Bowtie version:        2.3.2.0
[2017-08-08 09:10:05] Checking for Bowtie index files (genome)..
[2017-08-08 09:10:05] Checking for reference FASTA file
[2017-08-08 09:10:05] Generating SAM header for /Users/cliburn/work/scratch/2015_output/genome/GCA_000010245.1_ASM1024v1_genomic
Error: could not open pipe gzip -cd < /Users/cliburn/work/scratch/2015_output/trimmed_fastqs/r1.8A_pilot.trim.fastq.gz

Tophat Output

So what happened? Let’s take a look . . .

In [17]:
ls $CUR_DIR
logs    tmp

Tophat generates a lot of files. Let’s focus on these:

accepted_hits.bam : this is the mapped reads
align_summary.txt : summary of mapping results
unmapped.bam : the unmapped reads
In [18]:
head $CUR_DIR/align_summary.txt
head: /Users/cliburn/work/scratch/2015_output/th_dir/r1_8A_pilot/align_summary.txt: No such file or directory

Sanity Check: Number of Reads

Are the number of input reads what we expect? Let’s look at how many reads are in the input FASTQ

In [19]:
zcat $CUR_FASTQ | awk '{s++}END{print s/4}'
zcat: can't stat: /Users/cliburn/work/scratch/2015_output/trimmed_fastqs/r1.8A_pilot.trim.fastq.gz (/Users/cliburn/work/scratch/2015_output/trimmed_fastqs/r1.8A_pilot.trim.fastq.gz.Z): No such file or directory
0

Sanity Check: Reads Mapped

There is no right answer for the correct number of reads mapped, but 96% is respectable.

accepted_hits.bam is a BAM file (i.e. binary format), which means we can’t use standard text file tools like head and cat on it directly. However we can use samtools to convert it to SAM (i.e. text) format, and then pipe that to our regular tools. Because the lines are long, this is probably better to do in the terminal, but we will take a crack at it here.

In [20]:
samtools view -h $CUR_DIR/accepted_hits.bam | head
[E::hts_open_format] fail to open file '/Users/cliburn/work/scratch/2015_output/th_dir/r1_8A_pilot/accepted_hits.bam'
samtools view: failed to open "/Users/cliburn/work/scratch/2015_output/th_dir/r1_8A_pilot/accepted_hits.bam" for reading: No such file or directory

Also, try this command in the terminal. You can use the arrow keys to navigate up, down, left, and right.

samtools view -h ~/work/scratch/2015_output/th_dir/r1_8A_pilot/accepted_hits.bam | less -S

Sanity Check: Unmapped Reads

It is always a good idea to examine a sample of unmapped reads to figure out what they are. The easiest way to do this is with BLAST. In the past I have discovered that an experiment was contaminated with a different species by BLASTing unmapped reads. In that case there were a large number of unmapped reads, which raised my suspicions.

Even with a high rate of mapped reads, it is worth spending a few minutes to check them out. The simplest thing to do is use samtools to generate a FASTA from the unmapped.bam, grab a few of these sequences, and then BLAST them against the nr database

In [21]:
samtools fasta $CUR_DIR/unmapped.bam | head -n20
[E::hts_open_format] fail to open file '/Users/cliburn/work/scratch/2015_output/th_dir/r1_8A_pilot/unmapped.bam'
samtools bam2fq: Cannot read file "/Users/cliburn/work/scratch/2015_output/th_dir/r1_8A_pilot/unmapped.bam": No such file or directory