{
 "cells": [
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "# time this notebook\n",
    "SECONDS=0"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Working with Multiple Samples\n",
    "Let's kick it up another notch - we have six samples, let's run our analysis on all of them!  \n",
    "\n",
    "## Shell Variables\n",
    "Assign the variables in this notebook."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "CUROUT=$HOME/work/scratch/2015_output_all_data\n",
    "DEMUX=$CUROUT/demux_fastqs\n",
    "CURDATA=/data/HTS_2015_data/runs_combined\n",
    "INFO=$HOME/work/myinfo\n",
    "TRIMMED=$CUROUT/trimmed_fastqs\n",
    "GENOME_DIR=$CUROUT/genome\n",
    "TH_DIR=$CUROUT/th_dir\n",
    "COUNT_DIR=$CUROUT/counts\n",
    "QC=$CUROUT/qc_output\n",
    "ADAPTERS=$INFO/neb_adapters.fasta\n",
    "\n",
    "ACCESSION=\"GCA_000010245.1_ASM1024v1\"\n",
    "PREFIX=${ACCESSION}_genomic\n",
    "GFF=${PREFIX}.gff\n",
    "FNA=${PREFIX}.fna\n",
    "FA=${PREFIX}.fa"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Making New Directories\n",
    "No new directories are necessary, we are using directories that were already created for the single-end pipeline."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "mkdir -p $QC $TRIMMED $GENOME_DIR $TH_DIR $COUNT_DIR"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Quality Control\n",
    "First let's run fastqc on everything.  This is very easy, we can just give it all the FASTQ files on the command line, and it runs on all of them.  We can use the wildcard `*` to do this simply."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "# fastqc --threads 4 --quiet --outdir $QC $DEMUX/*.fq.gz"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Prepare Reference Genome and Annotation"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "FIRST_PART=\"ftp.ncbi.nlm.nih.gov/genomes/genbank/bacteria/Escherichia_coli/latest_assembly_versions\"\n",
    "for CUR in $GFF $FNA ; do\n",
    "    rsync rsync://${FIRST_PART}/${ACCESSION}/${CUR}.gz ${GENOME_DIR}\n",
    "done"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "gunzip ${GENOME_DIR}/${GFF}.gz | cat \n",
    "gunzip ${GENOME_DIR}/${FNA}.gz | cat\n",
    "mv ${GENOME_DIR}/${FNA} ${GENOME_DIR}/${FA} # tophat wants fasta file to be named *.fa"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "bowtie2-build $GENOME_DIR/$FA $GENOME_DIR/$PREFIX"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Looping over all the samples\n",
    "Now we can put all of the previous commands into one big loop.  This is probably a good time for copying and pasting.  But we will make a small changes - We need to generate a list of samples that the loop will iterate over\n",
    "\n",
    "### Generating a list of sample names\n",
    "Since we have \"standardized\" file names (they have a regular structure) we can use some bash tools to extract the sample names from the names of the demultiplexed files"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "# ls -1 $DEMUX/i1.[78][ABC]_*.fq.gz | wc\n",
    "for FASTQ in $DEMUX/i1.[78][ABC]_*.fq.gz\n",
    "    do\n",
    "        # echo $FASTQ\n",
    "        FASTQ=${FASTQ##*/} # strip directory from file path\n",
    "        # echo $FASTQ\n",
    "        FASTQ=\"${FASTQ%.fq.gz}\" # strip .fq.gz file extension\n",
    "        # echo $FASTQ\n",
    "        SAMPLE=${FASTQ##i1.} # strip \"i1.\" prefix\n",
    "        echo $SAMPLE\n",
    "\n",
    "        # filename=\"${filename%.*}\"\n",
    "    done"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Running the full pipeline on all the samples"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "# for SAMPLE in 7A_pilot 7B_pilot 7C_pilot 8A_pilot 8B_pilot 8C_pilot\n",
    "#     do\n",
    "for FASTQ in $DEMUX/i1.[78][ABC]_*.fq.gz\n",
    "    do\n",
    "        # echo $FASTQ\n",
    "        FASTQ=${FASTQ##*/} # strip directory from file path\n",
    "        # echo $FASTQ\n",
    "        FASTQ=\"${FASTQ%.fq.gz}\" # strip .fq.gz file extension\n",
    "        # echo $FASTQ\n",
    "        SAMPLE=${FASTQ##i1.} # strip \"i1.\" prefix\n",
    "        echo $SAMPLE\n",
    "        fastq-mcf $ADAPTERS \\\n",
    "            $DEMUX/r1.${SAMPLE}.fq.gz \\\n",
    "            $DEMUX/r2.${SAMPLE}.fq.gz \\\n",
    "            -q 20 -x 0.5 \\\n",
    "            -o $TRIMMED/r1.${SAMPLE}.trim.fq.gz \\\n",
    "            -o $TRIMMED/r2.${SAMPLE}.trim.fq.gz\n",
    "            \n",
    "        mkdir -p $TH_DIR/${SAMPLE}\n",
    "        tophat2 -G $GENOME_DIR/$GFF \\\n",
    "            --library-type fr-firststrand \\\n",
    "            --output-dir $TH_DIR/${SAMPLE} \\\n",
    "            --max-intron-length 5 \\\n",
    "            --min-intron-length 4 \\\n",
    "            --transcriptome-max-hits 1 \\\n",
    "            --max-multihits 1 \\\n",
    "            --no-coverage-search \\\n",
    "            --no-novel-juncs \\\n",
    "            --no-sort-bam \\\n",
    "            $GENOME_DIR/$PREFIX \\\n",
    "            $TRIMMED/r1.${SAMPLE}.trim.fq.gz \\\n",
    "            $TRIMMED/r2.${SAMPLE}.trim.fq.gz\n",
    "            \n",
    "        # ln $TH_DIR/${SAMPLE}/accepted_hits.bam $TH_DIR/${SAMPLE}.bam\n",
    "        # samtools index $TH_DIR/${SAMPLE}.bam\n",
    "        \n",
    "        samtools sort -n $TH_DIR/${SAMPLE}/accepted_hits.bam \\\n",
    "            -o $TH_DIR/${SAMPLE}/accepted_hits.name.bam\n",
    "\n",
    "        htseq-count --quiet --order=name --format=bam --stranded=reverse --type=gene \\\n",
    "            --idattr=Name  $TH_DIR/${SAMPLE}/accepted_hits.name.bam \\\n",
    "            $GENOME_DIR/$GFF > $COUNT_DIR/${SAMPLE}.csv\n",
    "    done"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "ls $TH_DIR"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "ls -ltr $COUNT_DIR"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "echo \"The time it took to run this notebooks was:\"\n",
    "echo \"$(($SECONDS/3600)) Hours, $(($SECONDS%3600/60)) Minutes, $(($SECONDS%60)) Seconds\""
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Bash",
   "language": "bash",
   "name": "bash"
  },
  "language_info": {
   "codemirror_mode": "shell",
   "file_extension": ".sh",
   "mimetype": "text/x-sh",
   "name": "bash"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 0
}