{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Mapping Reads to a Reference Genome\n", "\n", "\n", "## Shell Variables\n", "Assign the variables in this notebook." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "# We used these in the last notebook\n", "CUROUT=$HOME/work/scratch/2015_output\n", "TRIMMED=$CUROUT/trimmed_fastqs\n", "\n", "# The following are new for this notebook\n", "GENOME_DIR=$CUROUT/genome\n", "TH_DIR=$CUROUT/th_dir" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Making New Directories\n", "Make the directories that are new in this notebook" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "mkdir -p $GENOME_DIR $TH_DIR" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now let's check to be sure that worked. We will run `ls` and check that these directories now exist in the `$CUROUT` directory." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "ls $CUROUT" ] }, { "cell_type": "markdown", "metadata": { "collapsed": false }, "source": [ "# Mapping with Tophat\n", "\n", "## Tophat Flowchart\n", "http://www.genomebiology.com/2013/14/4/R36/figure/F6\n", "\n", "## Reference Genome and Annotation\n", "We have some preparation to do before we can map our data. First we need to download a reference genome and its annotation file. It is very important that **the genome sequence and annotation are the same version**, if they are not, things could go horribly wrong. The best way to ensure that your sequence and annotation are compatible is to download both from the same place, at the same time, and double check that they have the same version number. 
There are several good places to get genome data:\n", "\n", "* Illumina's website http://support.illumina.com/sequencing/sequencing_software/igenome.html.\n", "* NCBI http://www.ncbi.nlm.nih.gov\n", "* Ensembl http://www.ensembl.org/info/about/species.html\n", "\n", "Illumina has several *Escherichia coli* genomes, but not the one we are working with, which is *E. coli* K-12 strain W3110.\n", "\n", "### NCBI\n", "NCBI has most published genomes, but it is a bit tricky to find exactly what we are looking for. Let's start at the [NCBI Genome Assembly page](http://www.ncbi.nlm.nih.gov/assembly/) and search for \"Escherichia coli W3110\". This gets us to the Genome Assembly for W3110, [ASM1024v1](http://www.ncbi.nlm.nih.gov/assembly/GCF_000010245.2/). On the right side of the ASM1024v1 page is a link for \"[GenBank FTP site](ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCA_000010245.1_ASM1024v1)\". This is exactly what we want (note that some browsers can't handle an FTP link like this)! From here we want two files:\n", "\n", "* The genome sequence file, which ends in \"[.fna.gz](http://ftp.ncbi.nlm.nih.gov/genomes/all/GCA_000010245.1_ASM1024v1/GCA_000010245.1_ASM1024v1_genomic.fna.gz)\" (.fna is short for FASTA nucleic acid; .gz means it's gzipped)\n", "* The genome annotation file, which ends in \"[.gff.gz](ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCA_000010245.1_ASM1024v1/GCA_000010245.1_ASM1024v1_genomic.gff.gz)\" (.gff is Generic Feature Format)\n", "\n", "\n", "### Ensembl\n", "Ensembl is similarly difficult to navigate. We will start at the [Ensembl Bacteria](http://bacteria.ensembl.org/index.html) page, and again search for \"Escherichia coli W3110\" in the *Search for a genome* search box. This will get us to the [Escherichia coli str. K-12 substr. W3110](http://bacteria.ensembl.org/escherichia_coli_str_k_12_substr_w3110/Info/Index) page. 
On the right side are links for [FASTA](ftp://ftp.ensemblgenomes.org/pub/bacteria/release-31/fasta/bacteria_23_collection/escherichia_coli_str_k_12_substr_w3110/) and [GFF3](ftp://ftp.ensemblgenomes.org/pub/bacteria/release-31/gff3/bacteria_23_collection/escherichia_coli_str_k_12_substr_w3110). There are several options for the genome sequence, but the one we want is the [complete unmasked assembled sequence](ftp://ftp.ensemblgenomes.org/pub/bacteria/release-28/fasta/bacteria_23_collection/escherichia_coli_str_k_12_substr_w3110/dna/Escherichia_coli_str_k_12_substr_w3110.GCA_000010245.1.28.dna.genome.fa.gz).\n", "\n", "### Downloading with wget\n", "Now we can actually download these files; we will get them from NCBI. One standard tool for this is `wget` (the code cells below use `rsync` to fetch the same files from the same FTP paths). Here is what we would want to tell wget:\n", "\n", "* --output-document : the name to use when saving the file\n", "* --no-verbose : don't output a lot of information while downloading\n", "* URL : what to download\n", "\n", "We are going to make a \"genome\" directory for these files so that things don't get too messy. Soon we will generate several files based on these that tophat needs. I generally like to keep the original file names, but we are changing the names to make typing easier later. 
We are changing the FASTA file ending from \".fna\" to \".fa\", because tophat wants a file named \".fa\"; it's picky that way.\n", "\n", "" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "ACCESSION=\"GCA_000010245.1_ASM1024v1\"\n", "PREFIX=${ACCESSION}_genomic\n", "GFF=${PREFIX}.gff\n", "FNA=${PREFIX}.fna\n", "FA=${PREFIX}.fa\n", "\n", "# Link the .fa name to the .fna file; the link dangles until we download and decompress the .fna below\n", "ln -s ${GENOME_DIR}/${FNA} ${GENOME_DIR}/${FA}" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "FIRST_PART=\"ftp.ncbi.nlm.nih.gov/genomes/genbank/bacteria/Escherichia_coli/latest_assembly_versions\"\n", "for CUR in $GFF $FNA ; do\n", " rsync rsync://${FIRST_PART}/${ACCESSION}/${CUR}.gz ${GENOME_DIR}\n", "done" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "ls $GENOME_DIR" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Decompressing the reference files\n", "Now we need to decompress the FASTA and GFF files. We will use `gunzip`." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "# the \"| cat\" is a hack that prevents problems with jupyter\n", "gunzip ${GENOME_DIR}/${GFF}.gz | cat\n", "gunzip ${GENOME_DIR}/${FNA}.gz | cat" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "ls $GENOME_DIR" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's take a quick look at these files. The `head` command shows us the first few lines of a file (default is 10)."
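, "\n", "\n", "`head` gives a first glance; another quick summary is to tally the feature types in the GFF's third column. Here is a sketch of the idea on a tiny made-up GFF (`/tmp/toy.gff` is hypothetical; substituting `${GENOME_DIR}/${GFF}` runs it on the real annotation):\n", "\n", "```\n", "# Toy three-line GFF (nine tab-separated columns; column 3 is the feature type)\n", "printf 'chr1\\tsrc\\tgene\\t1\\t90\\t.\\t+\\t.\\tID=g1\\nchr1\\tsrc\\tCDS\\t1\\t90\\t.\\t+\\t.\\tID=c1\\nchr1\\tsrc\\tgene\\t100\\t200\\t.\\t+\\t.\\tID=g2\\n' > /tmp/toy.gff\n", "# Skip any \"#\" header lines, take column 3, and count each feature type\n", "grep -v \"^#\" /tmp/toy.gff | cut -f 3 | sort | uniq -c\n", "```"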
] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "head ${GENOME_DIR}/$GFF" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "head ${GENOME_DIR}/$FA" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Unfortunately the GFF file has long lines, which are wrapping onto the next line, making them hard to read. Another option is to use the command `less -S $GENOME_DIR/ecoli_w3110.gff` in the terminal (after pasting in our shell variables from the top of the notebook. The \"-S\" tells `less` to truncate lines instead of wrapping them.\n", "\n", "> #### GFF, a brief aside\n", "> You can find one description of the GFF format here http://www.sequenceontology.org/gff3.shtml. Unfortunately it is not entirely standard, there are several different version numbers (1, 2, 2.5, 3), and some variations within these version numbers. By getting our annotation from NCBI, we have a good chance that the GFF format will be compatible with most software.\n", "\n", "### Indexing the Genome\n", "Before we can map reads to the reference genome using Bowtie or Tophat, we need to index it. This will generate a transformed version of the genome that allows Bowtie to efficiently map sequences to it. We use bowtie2-build (part of the Bowtie package) to do this. The command for bowtie2-build is `bowtie2-build REF_GENOME INDEX_PATH`. \n", "\n", "* REF_GENOME : the file containing the reference sequence, this must be in FASTA format.\n", "* INDEX_PATH : The names of the index files generated by bowtie2-build will all start with INDEX_PATH, and when actually run Bowtie, we will also supply it with INDEX_PATH, and it will find all the files. 
Note: INDEX_PATH can be anything we want; it does not need to be related to the name of the REF_GENOME file, but it does make things less confusing if we are consistent.\n", "\n", "So here is how we run bowtie2-build: " ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "# the \"| cat\" is a hack that prevents problems with jupyter\n", "bowtie2-build -h | cat" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "bowtie2-build $GENOME_DIR/$FA $GENOME_DIR/$PREFIX" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now let's check to be sure that worked:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "ls $GENOME_DIR" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Mapping with Tophat\n", "\n", "We will use Tophat to map the RNA-Seq reads. Tophat runs bowtie, but with a twist - it uses the GTF file and the genome sequence to generate a virtual transcriptome. It tries to map each read to the transcriptome, then to the genome, then it tries to identify novel splice sites that could have resulted in the read. This is explained in this [flowchart](http://genomebiology.com/2013/14/4/R36/figure/F6) from the [Tophat2 publication](http://dx.doi.org/10.1186/gb-2013-14-4-r36).\n", "\n", "### Running Tophat" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The minimum tophat command is `tophat2 INDEX_PATH FASTQ`. This is the same `INDEX_PATH` that we used for bowtie2-build. Although that command is enough to run Tophat, it is going to be very useful to give it some more options: \n", "\n", "* -G GTF_FILE : Tophat uses the genome annotation to figure out where exons are (i.e. where it expects to see mRNA). 
\n", "* –library-type LIBRARY_TYPE : This tells tophat whether the data is stranded, and if so, which strand was sequenced.\n", "* –output-dir DIRECTORY : where to put the results\n", "* –max-intron-length NUM_BP : Tophat uses intron size defaults that are optimized for human genomes, these values are unreasonable for a microbial genome (of course we don't expect any introns in *E. coli*).\n", "* –min-intron-length NUM_BP : also optimized for human genomes, and unreasonable for a microbial genome.\n", "* --transcriptome-max-hits NUM_HITS : Maximum number of mappings allowed for a read, when aligned to the transcriptome (any reads found with more then this number of mappings will be discarded). \n", "* --max-multihits NUM_HITS : number of alignments to report if there is more than one.\n", "* --no-coverage-search : turns off searching for novel splice junctions based on read depth\n", "* --no-novel-juncs : do not try to find novel splice junctions\n", "* --num-threads NUM : number of cpus to use\n", "\n", "We will start with these parameters, but there is an extensive list of command line options detailed in the [Tophat Manual](https://ccb.jhu.edu/software/tophat/manual.shtml), it is a good idea to read through and try to understand all of them. We will discuss some more later." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "# the \"| cat\" is a hack that prevents problems with jupyter\n", "tophat2 -h | cat" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Running Tophat for one sample\n", "Tophat uses the same output names whenever it is run. In some ways this is useful, but it means that we need to create a separate directory for each sample, and tell tophat to put the results for each sample in the correct directory. So let's make a directory for the sample we have been working with: r1.8A_pilot." 
] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "CUR_DIR=$TH_DIR/r1_8A_pilot\n", "CUR_FASTQ=$TRIMMED/r1.8A_pilot.trim.fastq.gz\n", "\n", "mkdir -p $CUR_DIR\n", "ls $TH_DIR" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now, we will run tophat on the trimmed FASTQ of r1.8A_pilot, and tell tophat to put its output in that directory." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "tophat2 -G $GENOME_DIR/$GFF \\\n", " --library-type fr-firststrand \\\n", " --output-dir $CUR_DIR \\\n", " --max-intron-length 5 \\\n", " --min-intron-length 4 \\\n", " --transcriptome-max-hits 1 \\\n", " --max-multihits 1 \\\n", " --no-coverage-search \\\n", " --no-novel-juncs \\\n", " --no-sort-bam \\\n", " $GENOME_DIR/$PREFIX \\\n", " $CUR_FASTQ" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Tophat Output\n", "So what happened? Let's take a look . . ." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "ls $CUR_DIR" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Tophat generates a lot of files. Let's focus on these:\n", "\n", " accepted_hits.bam : this is the mapped reads\n", " align_summary.txt : summary of mapping results\n", " unmapped.bam : the unmapped reads" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "head $CUR_DIR/align_summary.txt" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Sanity Check: Number of Reads\n", "Are the number of input reads what we expect? 
Let's look at how many reads are in the input FASTQ." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "zcat $CUR_FASTQ | awk '{s++}END{print s/4}' " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Sanity Check: Reads Mapped\n", "There is no right answer for the correct number of reads mapped, but 96% is respectable.\n", "\n", "accepted_hits.bam is a BAM file (i.e. binary format), which means we can't use standard text file tools like `head` and `cat` on it directly. However, we can use `samtools` to convert it to SAM (i.e. text) format, and then pipe that to our regular tools. Because the lines are long, this is probably better to do in the terminal, but we will take a crack at it here." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "samtools view -h $CUR_DIR/accepted_hits.bam | head" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Also, try this command in the terminal. You can use the arrow keys to navigate up, down, left, and right.\n", "\n", "`samtools view -h ~/work/scratch/2015_output/th_dir/r1_8A_pilot/accepted_hits.bam | less -S` " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Sanity Check: Unmapped Reads\n", "It is always a good idea to examine a sample of unmapped reads to figure out what they are. The easiest way to do this is with BLAST. In the past I have discovered that an experiment was contaminated with a different species by BLASTing unmapped reads. In that case there were a large number of unmapped reads, which raised my suspicions.\n", "\n", "Even with a high rate of mapped reads, it is worth spending a few minutes to check them out. 
The simplest thing to do is use `samtools` to generate a FASTA from the unmapped.bam, grab a few of these sequences, and then [BLAST them against the nr database](https://blast.ncbi.nlm.nih.gov/Blast.cgi?PROGRAM=blastn&PAGE_TYPE=BlastSearch&LINK_LOC=blasthome)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "samtools fasta $CUR_DIR/unmapped.bam | head -n20" ] } ], "metadata": { "kernelspec": { "display_name": "Bash", "language": "bash", "name": "bash" }, "language_info": { "codemirror_mode": "shell", "file_extension": ".sh", "mimetype": "text/x-sh", "name": "bash" } }, "nbformat": 4, "nbformat_minor": 0 }