{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Mapping Reads to a Reference Genome\n",
    "<!---\n",
    "** Add Intro Text***\n",
    "-->\n",
    "\n",
    "## Shell Variables"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Source the config script\n",
    "source bioinf_intro_config.sh\n",
    "\n",
    "ls $CUROUT"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Mapping with STAR\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "STAR \\\n",
    "    --runMode alignReads \\\n",
    "    --twopassMode None \\\n",
    "    --genomeDir $GENOME_DIR \\\n",
    "    --readFilesIn $TRIMMED/27_MA_P_S38_L002_R1_001.trim.fastq.gz \\\n",
    "    --readFilesCommand gunzip -c \\\n",
    "    --outFileNamePrefix ${STAR_OUT}/27_MA_P_S38_L002_R1_ \\\n",
    "    --quantMode GeneCounts \\\n",
    "    --outSAMtype BAM Unsorted \\\n",
    "    --outSAMunmapped Within"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We will start with these parameters, but there is an extensive list of command line options detailed in the [STAR Manual](https://github.com/alexdobin/STAR/blob/master/doc/STARmanual.pdf), it is a good idea to read through and try to understand all of them.  We will discuss some more later.\n",
    "\n",
    "* --runMode alignReads : map reads \n",
    "* --twopassMode : Run one pass or two? If two-pass mode is on, STAR tries to discover novel junctions, then reruns mapping with these added to the annotation\n",
    "* --genomeDir : directory containing the genome index\n",
    "* --readFilesIn : input FASTQ\n",
    "* --readFilesCommand gunzip -c : use \"gunzip -c\" to uncompress FASTQ on-the-fly, since it is gzipped\n",
    "* --outFileNamePrefix : prefix (and path) to use for all output files\n",
    "* --quantMode GeneCounts : output a table of read counts per gene\n",
    "* --outSAMtype BAM Unsorted : output an unsort BAM file\n",
    "* --outSAMunmapped Within : included unmapped reads in the BAM file"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### STAR Output\n",
    "So what happened? Let's take a look . . ."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "ls ${STAR_OUT}"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "head ${STAR_OUT}/27_MA_P_S38_L002_R1*"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "STAR generates several files for each FASTQ:\n",
    "* Log.out : lots of details of the run, including all parameters used\n",
    "* Log.final.out : Important summary statistics\n",
    "* ReadsPerGene.out.tab : Count table, the main thing we are interested in\n",
    "* SJ.out.tab : All splice junctions, including ones from the GTF and novel junctions discovered by STAR\n",
    "* Log.progress.out: run statistics updated during run, not so interesting at the end\n",
    "\n",
    "Let's take a closer look at Log.final.out"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "cat ${STAR_OUT}/27_MA_P_S38_L002_R1_Log.final.out"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Sanity Check: Number of Reads\n",
    "Are the number of input reads what we expect? Let's look at how many reads are in the input FASTQ"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "zcat $TRIMMED/27_MA_P_S38_L002_R1_001.trim.fastq.gz | awk '{s++}END{print s/4}' "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Sanity Check: Unmapped Reads\n",
    "It is always a good idea to examine a sample of unmapped reads to figure out what they are. The easiest way to do this is with BLAST.  In the past I have discovered that an experiment was contaminated with a different species by BLASTing unmapped reads.  In that case there were a large number of unmapped reads, which raised my suspicions.\n",
    "\n",
    "Even with a high rate of mapped reads, it is worth spending a few minutes to check them out.  The simplest thing to do is use `samtools` to generate a FASTA from the unmapped.bam, grab a few of these sequences, and then [BLAST them against the nr database](https://blast.ncbi.nlm.nih.gov/Blast.cgi?PROGRAM=blastn&PAGE_TYPE=BlastSearch&LINK_LOC=blasthome)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "samtools fasta -f 4 ${STAR_OUT}/27_MA_P_S38_L002_R1_Aligned.out.bam | head -n20"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Bash",
   "language": "bash",
   "name": "bash"
  },
  "language_info": {
   "codemirror_mode": "shell",
   "file_extension": ".sh",
   "mimetype": "text/x-sh",
   "name": "bash"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 1
}