Mapping Reads to a Reference Genome¶
Shell Variables¶
Assign the variables in this notebook.
In [1]:
# We used these in the last notebook
CUROUT=$HOME/work/scratch/2015_output
TRIMMED=$CUROUT/trimmed_fastqs
# The following are new for this notebook
GENOME_DIR=$CUROUT/genome
TH_DIR=$CUROUT/th_dir
Making New Directories¶
Make the directories that are new in this notebook
In [2]:
mkdir -p $GENOME_DIR $TH_DIR
Now let’s check to be sure that worked. We will run ls
and check
that these directories now exist in the $CUROUT
directory.
In [3]:
ls $CUROUT
counts qc_output trimmed_fastqs
demux_fastqs stuff_for_igv.tgz
genome th_dir
Mapping with Tophat¶
Tophat Flowchart¶
Reference Genome and Annotation¶
We have some preparation to do before we can map our data. First we need to download a reference genome and its annotation file. It is very important that the genome sequence and annotation are the same version, if they are not, things could go horribly wrong. The best way to ensure that your sequence and annotation are compatible is to download both from the same place, at the same time, and double check that they have the same version number. There are several good places to get genomes data:
- Illumina’s website http://support.illumina.com/sequencing/sequencing_software/igenome.html.
- NCBI http://www.ncbi.nlm.nih.gov
- Ensembl http://www.ensembl.org/info/about/species.html
Illumina has several Escherichia coli genomes, but not the one we are working with, which is E. coli K-12 strain W3110.
NCBI¶
NCBI has most published genomes, but it is a bit tricky to find exactly what we are looking for. Let’s start at the NCBI Genome Assembly page and search for “Escherichia coli W3110”. This gets us to the Genome Assembly for W3110, ASM1024v1. On the right side of the ASM1024v1 page is a link for “GenBank FTP site”: . This is exactly what we want (note that some browsers can’t handle an FTP link like this)! From here we want two files:
Ensembl¶
Ensembl is similarly difficult to navigate. We will start at the Ensembl Bacteria page, and again search for “Escherichia coli W3110” in the Search for a genome search box. This will get us to the Escherichia coli str. K-12 substr. W3110 page. On the right side, are links for FASTA and GFF3. There are several options for the genome sequence, but the one we want is the complete unmasked assembled sequence.
Downloading with wget¶
Now we can use the wget
command to actually download these files. We
will get the files from NCBI. Here is what we will want to tell wget:
- –output-document : the name to use when saving the file
- –no-verbose : don’t output a lot of information while downloading
- URL : what to download
We are going to make a “genome” directory for these files so that things don’t get too messy. Soon we will generate several files based on these that tophat needs. I generally like to keep the original file names, but we are changing the names to make typing easier later. We are changing the FASTA file ending from ”.fna” to ”.fa”, because tophat wants a file names ”.fa”, its picky that way.
In [4]:
ACCESSION="GCA_000010245.1_ASM1024v1"
PREFIX=${ACCESSION}_genomic
GFF=${PREFIX}.gff
FNA=${PREFIX}.fna
FA=${PREFIX}.fa
ln -s ${GENOME_DIR}/${FNA} ${GENOME_DIR}/${FA}
In [5]:
FIRST_PART="ftp.ncbi.nlm.nih.gov/genomes/genbank/bacteria/Escherichia_coli/latest_assembly_versions"
for CUR in $GFF $FNA ; do
rsync rsync://${FIRST_PART}/${ACCESSION}/${CUR}.gz ${GENOME_DIR}
done
Warning Notice!
You are accessing a U.S. Government information system which includes this
computer, network, and all attached devices. This system is for
Government-authorized use only. Unauthorized use of this system may result in
disciplinary action and civil and criminal penalties. System users have no
expectation of privacy regarding any communications or data processed by this
system. At any time, the government may monitor, record, or seize any
communication or data transiting or stored on this information system.
-------------------------------------------------------------------------------
Welcome to the NCBI rsync server.
Warning Notice!
You are accessing a U.S. Government information system which includes this
computer, network, and all attached devices. This system is for
Government-authorized use only. Unauthorized use of this system may result in
disciplinary action and civil and criminal penalties. System users have no
expectation of privacy regarding any communications or data processed by this
system. At any time, the government may monitor, record, or seize any
communication or data transiting or stored on this information system.
-------------------------------------------------------------------------------
Welcome to the NCBI rsync server.
In [6]:
ls $GENOME_DIR
GCA_000010245.1_ASM1024v1_genomic.fa
GCA_000010245.1_ASM1024v1_genomic.fna.gz
GCA_000010245.1_ASM1024v1_genomic.gff.gz
Decompressing the reference files¶
Now we need to decompress the FASTA and GFF files. We will use gunzip
In [7]:
gunzip ${GENOME_DIR}/${GFF}.gz | cat
gunzip ${GENOME_DIR}/${FNA}.gz | cat
In [8]:
ls $GENOME_DIR
GCA_000010245.1_ASM1024v1_genomic.fa GCA_000010245.1_ASM1024v1_genomic.gff
GCA_000010245.1_ASM1024v1_genomic.fna
Let’s take a quick look at these files. The head
command shows us
the first few lines of a file (default is 10).
In [9]:
head ${GENOME_DIR}/$GFF
##gff-version 3
#!gff-spec-version 1.21
#!processor NCBI annotwriter
#!genome-build ASM1024v1
#!genome-build-accession NCBI_Assembly:GCA_000010245.1
##sequence-region AP009048.1 1 4646332
##species https://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi?id=316407
AP009048.1 DDBJ region 1 4646332 . + . ID=id0;Dbxref=taxon:316407;Is_circular=true;gbkey=Src;mol_type=genomic DNA;strain=K-12;substrain=W3110
AP009048.1 DDBJ gene 190 255 . + . ID=gene0;Name=thrL;gbkey=Gene;gene=thrL;gene_biotype=protein_coding
AP009048.1 DDBJ CDS 190 255 . + 0 ID=cds0;Parent=gene0;Dbxref=NCBI_GP:BAE76026.1;Name=BAE76026.1;Note=ECK0001:JW4367:b0001;gbkey=CDS;gene=thrL;product=thr operon leader peptide;protein_id=BAE76026.1;transl_table=11
In [10]:
head ${GENOME_DIR}/$FA
>AP009048.1 Escherichia coli str. K-12 substr. W3110 DNA, complete genome
AGCTTTTCATTCTGACTGCAACGGGCAATATGTCTCTGTGTGGATTAAAAAAAGAGTGTCTGATAGCAGCTTCTGAACTG
GTTACCTGCCGTGAGTAAATTAAAATTTTATTGACTTAGGTCACTAAATACTTTAACCAATATAGGCATAGCGCACAGAC
AGATAAAAATTACAGAGTACACAACATCCATGAAACGCATTAGCACCACCATTACCACCACCATCACCATTACCACAGGT
AACGGTGCGGGCTGACGCGTACAGGAAACACAGAAAAAAGCCCGCACCTGACAGTGCGGGCTTTTTTTTTCGACCAAAGG
TAACGAGGTAACAACCATGCGAGTGTTGAAGTTCGGCGGTACATCAGTGGCAAATGCAGAACGTTTTCTGCGTGTTGCCG
ATATTCTGGAAAGCAATGCCAGGCAGGGGCAGGTGGCCACCGTCCTCTCTGCCCCCGCCAAAATCACCAACCACCTGGTG
GCGATGATTGAAAAAACCATTAGCGGCCAGGATGCTTTACCCAATATCAGCGATGCCGAACGTATTTTTGCCGAACTTTT
GACGGGACTCGCCGCCGCCCAGCCGGGGTTCCCGCTGGCGCAATTGAAAACTTTCGTCGATCAGGAATTTGCCCAAATAA
AACATGTCCTGCATGGCATTAGTTTGTTGGGGCAGTGCCCGGATAGCATCAACGCTGCGCTGATTTGCCGTGGCGAGAAA
Unfortunately the GFF file has long lines, which are wrapping onto the
next line, making them hard to read. Another option is to use the
command less -S $GENOME_DIR/ecoli_w3110.gff
in the terminal (after
pasting in our shell variables from the top of the notebook. The “-S”
tells less
to truncate lines instead of wrapping them.
GFF, a brief aside
You can find one description of the GFF format here http://www.sequenceontology.org/gff3.shtml. Unfortunately it is not entirely standard, there are several different version numbers (1, 2, 2.5, 3), and some variations within these version numbers. By getting our annotation from NCBI, we have a good chance that the GFF format will be compatible with most software.
Indexing the Genome¶
Before we can map reads to the reference genome using Bowtie or Tophat,
we need to index it. This will generate a transformed version of the
genome that allows Bowtie to efficiently map sequences to it. We use
bowtie2-build (part of the Bowtie package) to do this. The command for
bowtie2-build is bowtie2-build REF_GENOME INDEX_PATH
.
- REF_GENOME : the file containing the reference sequence, this must be in FASTA format.
- INDEX_PATH : The names of the index files generated by bowtie2-build will all start with INDEX_PATH, and when actually run Bowtie, we will also supply it with INDEX_PATH, and it will find all the files. Note: INDEX_PATH can be anything we want, it does not need to be at related to the name of the REF_GENOME file, but it does make things less confusing if we are consistent.
So here is how we run bowtie2-build:
In [11]:
# the "| cat" is a hack that prevents problems with jupyter
bowtie2-build -h | cat
Bowtie 2 version 2.3.2 by Ben Langmead (langmea@cs.jhu.edu, www.cs.jhu.edu/~langmea)
Usage: bowtie2-build [options]* <reference_in> <bt2_index_base>
reference_in comma-separated list of files with ref sequences
bt2_index_base write bt2 data to files with this dir/basename
*** Bowtie 2 indexes work only with v2 (not v1). Likewise for v1 indexes. ***
Options:
-f reference files are Fasta (default)
-c reference sequences given on cmd line (as
<reference_in>)
--large-index force generated index to be 'large', even if ref
has fewer than 4 billion nucleotides
-a/--noauto disable automatic -p/--bmax/--dcv memory-fitting
-p/--packed use packed strings internally; slower, less memory
--bmax <int> max bucket sz for blockwise suffix-array builder
--bmaxdivn <int> max bucket sz as divisor of ref len (default: 4)
--dcv <int> diff-cover period for blockwise (default: 1024)
--nodc disable diff-cover (algorithm becomes quadratic)
-r/--noref don't build .3/.4 index files
-3/--justref just build .3/.4 index files
-o/--offrate <int> SA is sampled every 2^<int> BWT chars (default: 5)
-t/--ftabchars <int> # of chars consumed in initial lookup (default: 10)
--threads <int> # of threads
--seed <int> seed for random number generator
-q/--quiet verbose output (for debugging)
-h/--help print detailed description of tool and its options
--usage print this usage message
--version print version information and quit
In [12]:
bowtie2-build $GENOME_DIR/$FA $GENOME_DIR/$PREFIX
Settings:
Output files: "/Users/cliburn/work/scratch/2015_output/genome/GCA_000010245.1_ASM1024v1_genomic.*.bt2"
Line rate: 6 (line is 64 bytes)
Lines per side: 1 (side is 64 bytes)
Offset rate: 4 (one in 16)
FTable chars: 10
Strings: unpacked
Max bucket size: default
Max bucket size, sqrt multiplier: default
Max bucket size, len divisor: 4
Difference-cover sample period: 1024
Endianness: little
Actual local endianness: little
Sanity checking: disabled
Assertions: disabled
Random seed: 0
Sizeofs: void*:8, int:4, long:8, size_t:8
Input files DNA, FASTA:
/Users/cliburn/work/scratch/2015_output/genome/GCA_000010245.1_ASM1024v1_genomic.fa
Building a SMALL index
Reading reference sizes
Time reading reference sizes: 00:00:00
Calculating joined length
Writing header
Reserving space for joined string
Joining reference sequences
Time to join reference sequences: 00:00:00
bmax according to bmaxDivN setting: 1161583
Using parameters --bmax 871188 --dcv 1024
Doing ahead-of-time memory usage test
Passed! Constructing with these parameters: --bmax 871188 --dcv 1024
Constructing suffix-array element generator
Building DifferenceCoverSample
Building sPrime
Building sPrimeOrder
V-Sorting samples
V-Sorting samples time: 00:00:00
Allocating rank array
Ranking v-sort output
Ranking v-sort output time: 00:00:00
Invoking Larsson-Sadakane on ranks
Invoking Larsson-Sadakane on ranks time: 00:00:00
Sanity-checking and returning
Building samples
Reserving space for 12 sample suffixes
Generating random suffixes
QSorting 12 sample offsets, eliminating duplicates
QSorting sample offsets, eliminating duplicates time: 00:00:00
Multikey QSorting 12 samples
(Using difference cover)
Multikey QSorting samples time: 00:00:00
Calculating bucket sizes
Splitting and merging
Splitting and merging time: 00:00:00
Avg bucket size: 4.64633e+06 (target: 871187)
Converting suffix-array elements to index image
Allocating ftab, absorbFtab
Entering Ebwt loop
Getting block 1 of 1
No samples; assembling all-inclusive block
Sorting block of length 4646332 for bucket 1
(Using difference cover)
Sorting block time: 00:00:01
Returning block of 4646333 for bucket 1
Exited Ebwt loop
fchr[A]: 0
fchr[C]: 1142182
fchr[G]: 2321261
fchr[T]: 3502499
fchr[$]: 4646332
Exiting Ebwt::buildToDisk()
Returning from initFromVector
Wrote 5743338 bytes to primary EBWT file: /Users/cliburn/work/scratch/2015_output/genome/GCA_000010245.1_ASM1024v1_genomic.1.bt2
Wrote 1161588 bytes to secondary EBWT file: /Users/cliburn/work/scratch/2015_output/genome/GCA_000010245.1_ASM1024v1_genomic.2.bt2
Re-opening _in1 and _in2 as input streams
Returning from Ebwt constructor
Headers:
len: 4646332
bwtLen: 4646333
sz: 1161583
bwtSz: 1161584
lineRate: 6
offRate: 4
offMask: 0xfffffff0
ftabChars: 10
eftabLen: 20
eftabSz: 80
ftabLen: 1048577
ftabSz: 4194308
offsLen: 290396
offsSz: 1161584
lineSz: 64
sideSz: 64
sideBwtSz: 48
sideBwtLen: 192
numSides: 24200
numLines: 24200
ebwtTotLen: 1548800
ebwtTotSz: 1548800
color: 0
reverse: 0
Total time for call to driver() for forward index: 00:00:01
Reading reference sizes
Time reading reference sizes: 00:00:00
Calculating joined length
Writing header
Reserving space for joined string
Joining reference sequences
Time to join reference sequences: 00:00:00
Time to reverse reference sequence: 00:00:00
bmax according to bmaxDivN setting: 1161583
Using parameters --bmax 871188 --dcv 1024
Doing ahead-of-time memory usage test
Passed! Constructing with these parameters: --bmax 871188 --dcv 1024
Constructing suffix-array element generator
Building DifferenceCoverSample
Building sPrime
Building sPrimeOrder
V-Sorting samples
V-Sorting samples time: 00:00:00
Allocating rank array
Ranking v-sort output
Ranking v-sort output time: 00:00:00
Invoking Larsson-Sadakane on ranks
Invoking Larsson-Sadakane on ranks time: 00:00:00
Sanity-checking and returning
Building samples
Reserving space for 12 sample suffixes
Generating random suffixes
QSorting 12 sample offsets, eliminating duplicates
QSorting sample offsets, eliminating duplicates time: 00:00:00
Multikey QSorting 12 samples
(Using difference cover)
Multikey QSorting samples time: 00:00:00
Calculating bucket sizes
Splitting and merging
Splitting and merging time: 00:00:00
Avg bucket size: 4.64633e+06 (target: 871187)
Converting suffix-array elements to index image
Allocating ftab, absorbFtab
Entering Ebwt loop
Getting block 1 of 1
No samples; assembling all-inclusive block
Sorting block of length 4646332 for bucket 1
(Using difference cover)
Sorting block time: 00:00:01
Returning block of 4646333 for bucket 1
Exited Ebwt loop
fchr[A]: 0
fchr[C]: 1142182
fchr[G]: 2321261
fchr[T]: 3502499
fchr[$]: 4646332
Exiting Ebwt::buildToDisk()
Returning from initFromVector
Wrote 5743338 bytes to primary EBWT file: /Users/cliburn/work/scratch/2015_output/genome/GCA_000010245.1_ASM1024v1_genomic.rev.1.bt2
Wrote 1161588 bytes to secondary EBWT file: /Users/cliburn/work/scratch/2015_output/genome/GCA_000010245.1_ASM1024v1_genomic.rev.2.bt2
Re-opening _in1 and _in2 as input streams
Returning from Ebwt constructor
Headers:
len: 4646332
bwtLen: 4646333
sz: 1161583
bwtSz: 1161584
lineRate: 6
offRate: 4
offMask: 0xfffffff0
ftabChars: 10
eftabLen: 20
eftabSz: 80
ftabLen: 1048577
ftabSz: 4194308
offsLen: 290396
offsSz: 1161584
lineSz: 64
sideSz: 64
sideBwtSz: 48
sideBwtLen: 192
numSides: 24200
numLines: 24200
ebwtTotLen: 1548800
ebwtTotSz: 1548800
color: 0
reverse: 1
Total time for backward call to driver() for mirror index: 00:00:01
Now let’s check to be sure that worked:
In [13]:
ls $GENOME_DIR
GCA_000010245.1_ASM1024v1_genomic.1.bt2
GCA_000010245.1_ASM1024v1_genomic.2.bt2
GCA_000010245.1_ASM1024v1_genomic.3.bt2
GCA_000010245.1_ASM1024v1_genomic.4.bt2
GCA_000010245.1_ASM1024v1_genomic.fa
GCA_000010245.1_ASM1024v1_genomic.fna
GCA_000010245.1_ASM1024v1_genomic.gff
GCA_000010245.1_ASM1024v1_genomic.rev.1.bt2
GCA_000010245.1_ASM1024v1_genomic.rev.2.bt2
Mapping with Tophat¶
We will use Tophat to map the RNA-Seq reads. Tophat runs bowtie, but with a twist - it uses the GTF file and the genome sequence to generate a virtual transcriptome. It tries to map each read to the transcriptome, then to the genome, then it tries to identify novel splice sites that could have resulted in the read. This is explained in this flowchart from the Tophat2 publication
Running Tophat¶
The minimum tophat command is tophat2 INDEX_PATH FASTQ
. This is the
same INDEX_PATH
that we used for bowtie2-build. Although that
command is enough to run Tophat, it is going to be very useful to give
it some more options:
- -G GTF_FILE : Tophat uses the genome annotation to figure out where exons are (i.e. where it expects to see mRNA).
- –library-type LIBRARY_TYPE : This tells tophat whether the data is stranded, and if so, which strand was sequenced.
- –output-dir DIRECTORY : where to put the results
- –max-intron-length NUM_BP : Tophat uses intron size defaults that are optimized for human genomes, these values are unreasonable for a microbial genome (of course we don’t expect any introns in E. coli).
- –min-intron-length NUM_BP : also optimized for human genomes, and unreasonable for a microbial genome.
- –transcriptome-max-hits NUM_HITS : Maximum number of mappings allowed for a read, when aligned to the transcriptome (any reads found with more then this number of mappings will be discarded).
- –max-multihits NUM_HITS : number of alignments to report if there is more than one.
- –no-coverage-search : turns off searching for novel splice junctions based on read depth
- –no-novel-juncs : do not try to find novel splice junctions
- –num-threads NUM : number of cpus to use
We will start with these parameters, but there is an extensive list of command line options detailed in the Tophat Manual, it is a good idea to read through and try to understand all of them. We will discuss some more later.
In [14]:
# the "| cat" is a hack that prevents problems with jupyter
tophat2 -h | cat
readlink: illegal option -- f
usage: readlink [-n] [file ...]
tophat:
TopHat maps short sequences from spliced transcripts to whole genomes.
Usage:
tophat [options] <bowtie_index> <reads1[,reads2,...]> [reads1[,reads2,...]] \
[quals1,[quals2,...]] [quals1[,quals2,...]]
Options:
-v/--version
-o/--output-dir <string> [ default: ./tophat_out ]
--bowtie1 [ default: bowtie2 ]
-N/--read-mismatches <int> [ default: 2 ]
--read-gap-length <int> [ default: 2 ]
--read-edit-dist <int> [ default: 2 ]
--read-realign-edit-dist <int> [ default: "read-edit-dist" + 1 ]
-a/--min-anchor <int> [ default: 8 ]
-m/--splice-mismatches <0-2> [ default: 0 ]
-i/--min-intron-length <int> [ default: 50 ]
-I/--max-intron-length <int> [ default: 500000 ]
-g/--max-multihits <int> [ default: 20 ]
--suppress-hits
-x/--transcriptome-max-hits <int> [ default: 60 ]
-M/--prefilter-multihits ( for -G/--GTF option, enable
an initial bowtie search
against the genome )
--max-insertion-length <int> [ default: 3 ]
--max-deletion-length <int> [ default: 3 ]
--solexa-quals
--solexa1.3-quals (same as phred64-quals)
--phred64-quals (same as solexa1.3-quals)
-Q/--quals
--integer-quals
-C/--color (Solid - color space)
--color-out
--library-type <string> (fr-unstranded, fr-firststrand,
fr-secondstrand)
-p/--num-threads <int> [ default: 1 ]
-R/--resume <out_dir> ( try to resume execution )
-G/--GTF <filename> (GTF/GFF with known transcripts)
--transcriptome-index <bwtidx> (transcriptome bowtie index)
-T/--transcriptome-only (map only to the transcriptome)
-j/--raw-juncs <filename>
--insertions <filename>
--deletions <filename>
-r/--mate-inner-dist <int> [ default: 50 ]
--mate-std-dev <int> [ default: 20 ]
--no-novel-juncs
--no-novel-indels
--no-gtf-juncs
--no-coverage-search
--coverage-search
--microexon-search
--keep-tmp
--tmp-dir <dirname> [ default: <output_dir>/tmp ]
-z/--zpacker <program> [ default: gzip ]
-X/--unmapped-fifo [use mkfifo to compress more temporary
files for color space reads]
Advanced Options:
--report-secondary-alignments
--no-discordant
--no-mixed
--segment-mismatches <int> [ default: 2 ]
--segment-length <int> [ default: 25 ]
--bowtie-n [ default: bowtie -v ]
--min-coverage-intron <int> [ default: 50 ]
--max-coverage-intron <int> [ default: 20000 ]
--min-segment-intron <int> [ default: 50 ]
--max-segment-intron <int> [ default: 500000 ]
--no-sort-bam (Output BAM is not coordinate-sorted)
--no-convert-bam (Do not output bam format.
Output is <output_dir>/accepted_hits.sam)
--keep-fasta-order
--allow-partial-mapping
Bowtie2 related options:
Preset options in --end-to-end mode (local alignment is not used in TopHat2)
--b2-very-fast
--b2-fast
--b2-sensitive
--b2-very-sensitive
Alignment options
--b2-N <int> [ default: 0 ]
--b2-L <int> [ default: 20 ]
--b2-i <func> [ default: S,1,1.25 ]
--b2-n-ceil <func> [ default: L,0,0.15 ]
--b2-gbar <int> [ default: 4 ]
Scoring options
--b2-mp <int>,<int> [ default: 6,2 ]
--b2-np <int> [ default: 1 ]
--b2-rdg <int>,<int> [ default: 5,3 ]
--b2-rfg <int>,<int> [ default: 5,3 ]
--b2-score-min <func> [ default: L,-0.6,-0.6 ]
Effort options
--b2-D <int> [ default: 15 ]
--b2-R <int> [ default: 2 ]
Fusion related options:
--fusion-search
--fusion-anchor-length <int> [ default: 20 ]
--fusion-min-dist <int> [ default: 10000000 ]
--fusion-read-mismatches <int> [ default: 2 ]
--fusion-multireads <int> [ default: 2 ]
--fusion-multipairs <int> [ default: 2 ]
--fusion-ignore-chromosomes <list> [ e.g, <chrM,chrX> ]
--fusion-do-not-resolve-conflicts [this is for test purposes ]
SAM Header Options (for embedding sequencing run metadata in output):
--rg-id <string> (read group ID)
--rg-sample <string> (sample ID)
--rg-library <string> (library ID)
--rg-description <string> (descriptive string, no tabs allowed)
--rg-platform-unit <string> (e.g Illumina lane ID)
--rg-center <string> (sequencing center name)
--rg-date <string> (ISO 8601 date of the sequencing run)
--rg-platform <string> (Sequencing platform descriptor)
for detailed help see http://ccb.jhu.edu/software/tophat/manual.shtml
Running Tophat for one sample¶
Tophat uses the same output names whenever it is run. In some ways this is useful, but it means that we need to create a separate directory for each sample, and tell tophat to put the results for each sample in the correct directory. So let’s make a directory for the sample we have been working with: r1.8A_pilot.
In [15]:
CUR_DIR=$TH_DIR/r1_8A_pilot
CUR_FASTQ=$TRIMMED/r1.8A_pilot.trim.fastq.gz
mkdir -p $CUR_DIR
ls $TH_DIR
7A_pilot 7C_pilot 8B_pilot r1_8A_pilot
7B_pilot 8A_pilot 8C_pilot
Now, we will run tophat on the trimmed FASTQ of r1.8A_pilot, and tell tophat to put its output in that directory.
In [16]:
tophat2 -G $GENOME_DIR/$GFF \
--library-type fr-firststrand \
--output-dir $CUR_DIR \
--max-intron-length 5 \
--min-intron-length 4 \
--transcriptome-max-hits 1 \
--max-multihits 1 \
--no-coverage-search \
--no-novel-juncs \
--no-sort-bam \
$GENOME_DIR/$PREFIX \
$CUR_FASTQ
readlink: illegal option -- f
usage: readlink [-n] [file ...]
[2017-08-08 09:10:05] Beginning TopHat run (v2.1.1)
-----------------------------------------------
[2017-08-08 09:10:05] Checking for Bowtie
Bowtie version: 2.3.2.0
[2017-08-08 09:10:05] Checking for Bowtie index files (genome)..
[2017-08-08 09:10:05] Checking for reference FASTA file
[2017-08-08 09:10:05] Generating SAM header for /Users/cliburn/work/scratch/2015_output/genome/GCA_000010245.1_ASM1024v1_genomic
Error: could not open pipe gzip -cd < /Users/cliburn/work/scratch/2015_output/trimmed_fastqs/r1.8A_pilot.trim.fastq.gz
Tophat Output¶
So what happened? Let’s take a look . . .
In [17]:
ls $CUR_DIR
logs tmp
Tophat generates a lot of files. Let’s focus on these:
accepted_hits.bam : this is the mapped reads
align_summary.txt : summary of mapping results
unmapped.bam : the unmapped reads
In [18]:
head $CUR_DIR/align_summary.txt
head: /Users/cliburn/work/scratch/2015_output/th_dir/r1_8A_pilot/align_summary.txt: No such file or directory
Sanity Check: Number of Reads¶
Are the number of input reads what we expect? Let’s look at how many reads are in the input FASTQ
In [19]:
zcat $CUR_FASTQ | awk '{s++}END{print s/4}'
zcat: can't stat: /Users/cliburn/work/scratch/2015_output/trimmed_fastqs/r1.8A_pilot.trim.fastq.gz (/Users/cliburn/work/scratch/2015_output/trimmed_fastqs/r1.8A_pilot.trim.fastq.gz.Z): No such file or directory
0
Sanity Check: Reads Mapped¶
There is no right answer for the correct number of reads mapped, but 96% is respectable.
accepted_hits.bam is a BAM file (i.e. binary format), which means we
can’t use standard text file tools like head
and cat
on it
directly. However we can use samtools
to convert it to SAM (i.e.
text) format, and then pipe that to our regular tools. Because the lines
are long, this is probably better to do in the terminal, but we will
take a crack at it here.
In [20]:
samtools view -h $CUR_DIR/accepted_hits.bam | head
[E::hts_open_format] fail to open file '/Users/cliburn/work/scratch/2015_output/th_dir/r1_8A_pilot/accepted_hits.bam'
samtools view: failed to open "/Users/cliburn/work/scratch/2015_output/th_dir/r1_8A_pilot/accepted_hits.bam" for reading: No such file or directory
Also, try this command in the terminal. You can use the arrow keys to navigate up, down, left, and right.
samtools view -h ~/work/scratch/2015_output/th_dir/r1_8A_pilot/accepted_hits.bam | less -S
Sanity Check: Unmapped Reads¶
It is always a good idea to examine a sample of unmapped reads to figure out what they are. The easiest way to do this is with BLAST. In the past I have discovered that an experiment was contaminated with a different species by BLASTing unmapped reads. In that case there were a large number of unmapped reads, which raised my suspicions.
Even with a high rate of mapped reads, it is worth spending a few
minutes to check them out. The simplest thing to do is use samtools
to generate a FASTA from the unmapped.bam, grab a few of these
sequences, and then BLAST them against the nr
database
In [21]:
samtools fasta $CUR_DIR/unmapped.bam | head -n20
[E::hts_open_format] fail to open file '/Users/cliburn/work/scratch/2015_output/th_dir/r1_8A_pilot/unmapped.bam'
samtools bam2fq: Cannot read file "/Users/cliburn/work/scratch/2015_output/th_dir/r1_8A_pilot/unmapped.bam": No such file or directory