{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Computing capstone exercise\n",
    "\n",
    "**Note**: This exercise is HARD. Do not be discouraged if you find it difficult. Google as much as you like."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Data\n",
    "\n",
    "The data set consists of simulated (and highly contrived) gene expression levels for 100 subjects. 50 of the subjects are cases, and 50 are controls.\n",
    "\n",
    "- The expression level of 20,000 genes for each subject is found in a file `expr-XXX.txt` where `XXX` is the subject ID. Missing values are indicated by the string `nan`.\n",
    "- The file `cases.txt` contains the IDs of subjects who are in the cases group.\n",
    "- The file `controls.txt` contains the IDs of subjects who are in the controls group.\n",
    "- The file `outcomes.txt` contains the subject ID and blood sugar level for all subjects."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Exercise"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Unix shell/command line\n",
    "\n",
    "For this part, click on the `Kernel` menu item and select `Change Kernel | Bash`."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "- Download the data from https://www.dropbox.com/s/vivut71p4bkurhw/data.tar.gz\n",
    "- You will need to quote the URL"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "- Regenerate the original data folder from `data.tar.gz`"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "- Check whether any files have been corrupted using the MD5SUM checksum file, and note the `FILENAME` of the corrupted file"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "- Replace the corrupted file with a correct copy from https://www.dropbox.com/s/vf8qcoj07mcq7wn/FILENAME\n",
    "- You will need to replace FILENAME with the correct filename"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "- Check that there are no `md5sum` errors"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Data munging\n",
    "\n",
    "For this part, click on the `Kernel` menu item and select `Change Kernel | R`."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "- Create a `data.frame` called `expr` where each row represents a subject, and each column represents a variable. The variables should include all genes, named `gene1`, `gene2`, etc., as well as a column `PID` containing the subject ID and a column `Group` that is either `case` or `ctrl`. You will have to combine all the `expr-XXX.txt` files with `cases.txt` and `controls.txt` appropriately to create this `data.frame`. Make sure you also handle missing data correctly.\n",
    "\n",
    "**Hints**: \n",
    "\n",
    "- First create a list of filenames\n",
    "- Then read in one file at a time using `read.table`, adding to a list of data.frames\n",
    "- You can use the `lapply` function to avoid writing a for loop if you prefer\n",
    "- Create the `expr` data.frame by using the `bind_cols` function\n",
    "- Add row and column names in the usual way"
   ]
  },
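  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The hints above can be sketched on toy data. This is only an illustration, not the intended solution: the three small files, the one-value-per-line layout, and the names `tmp`, `files`, `cols`, `pids` and `demo` are assumptions made for the example, and base `cbind` stands in for `bind_cols`.\n",
    "\n",
    "```r\n",
    "# Toy stand-in for the real data: three tiny files in a temporary folder\n",
    "# (assumed layout: one expression value per line, no header)\n",
    "tmp <- tempfile('expr-demo-')\n",
    "dir.create(tmp)\n",
    "writeLines(c('1.5', 'nan', '0.2'), file.path(tmp, 'expr-001.txt'))\n",
    "writeLines(c('0.7', '2.1', 'nan'), file.path(tmp, 'expr-002.txt'))\n",
    "writeLines(c('0.0', '1.1', '3.3'), file.path(tmp, 'expr-003.txt'))\n",
    "\n",
    "# Step 1: collect the filenames\n",
    "files <- list.files(tmp, pattern = '^expr-.*[.]txt$', full.names = TRUE)\n",
    "\n",
    "# Step 2: read one file at a time, treating the string 'nan' as missing\n",
    "cols <- lapply(files, read.table, na.strings = 'nan')\n",
    "\n",
    "# Step 3: bind the per-subject columns side by side\n",
    "demo <- do.call(cbind, cols)\n",
    "\n",
    "# Step 4: subject IDs as column names, gene names as row names\n",
    "pids <- sub('expr-', '', tools::file_path_sans_ext(basename(files)), fixed = TRUE)\n",
    "colnames(demo) <- pids\n",
    "rownames(demo) <- paste0('gene', seq_len(nrow(demo)))\n",
    "demo\n",
    "```\n",
    "\n",
    "The `na.strings = 'nan'` argument is what turns the `nan` strings into `NA`, so later steps can detect them with the usual missing-value tools."
   ]
  },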
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "- Remove any gene (row) whose values are all zero. How many genes were dropped?\n",
    "\n",
    "**Hints**\n",
    "\n",
    "- You should use the `tidyverse` library\n",
    "- One way to do this is to find rows where the sum of the absolute values is 0"
   ]
  },
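  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "A minimal sketch of the sum-of-absolute-values idea, using base R on a made-up three-gene table (the values and the names `m`, `all_zero` and `kept` are hypothetical). A `tidyverse` version would follow the same logic.\n",
    "\n",
    "```r\n",
    "# Toy data: genes in rows, subjects in columns\n",
    "m <- data.frame(s1 = c(0, 1.2, 0), s2 = c(0, -0.4, 3.1),\n",
    "                row.names = c('gene1', 'gene2', 'gene3'))\n",
    "\n",
    "# A row is all zero exactly when the sum of its absolute values is 0;\n",
    "# na.rm = TRUE keeps a missing value from turning the row sum into NA\n",
    "all_zero <- rowSums(abs(m), na.rm = TRUE) == 0\n",
    "\n",
    "kept <- m[!all_zero, ]\n",
    "sum(all_zero)   # number of rows dropped: 1 here (gene1)\n",
    "```\n",
    "\n",
    "Taking absolute values first matters: a row such as `-1, 1` sums to zero without being all zero."
   ]
  },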
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "- Remove any gene (row) with missing data. How many genes were dropped?"
   ]
  },
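  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "One possible approach, shown on hypothetical toy values, is base R's `complete.cases`, which flags the rows that contain no `NA`. The names `m`, `complete` and `kept` are made up for the example.\n",
    "\n",
    "```r\n",
    "# Toy data: gene2 and gene3 each have one missing value\n",
    "m <- data.frame(s1 = c(0.3, NA, 1.8), s2 = c(2.2, 0.5, NA), s3 = c(1.1, 0.9, 0.4),\n",
    "                row.names = c('gene1', 'gene2', 'gene3'))\n",
    "\n",
    "complete <- complete.cases(m)   # TRUE for rows without any NA\n",
    "kept <- m[complete, ]\n",
    "sum(!complete)                  # number of rows dropped: 2 here\n",
    "```"
   ]
  },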
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "- Scale all genes to have zero mean and unit standard deviation\n",
    "\n",
    "**Hints**\n",
    "\n",
    "- Remember `scale` works on columns and not rows\n",
    "- When checking, use `head` to limit the output - otherwise it takes a long time"
   ]
  },
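  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "To illustrate the first hint: when genes are stored in rows, one common idiom is to transpose, scale, and transpose back. The matrix below is random toy data and the names `m` and `row_scaled` are assumptions for the example.\n",
    "\n",
    "```r\n",
    "set.seed(1)\n",
    "# Toy matrix: 3 genes (rows) measured on 4 subjects (columns)\n",
    "m <- matrix(rnorm(12, mean = 5, sd = 2), nrow = 3,\n",
    "            dimnames = list(paste0('gene', 1:3), paste0('s', 1:4)))\n",
    "\n",
    "# scale() standardises columns, so transpose to put genes in columns,\n",
    "# scale, then transpose back to the original orientation\n",
    "row_scaled <- t(scale(t(m)))\n",
    "\n",
    "# Check: every gene (row) now has mean 0 and standard deviation 1\n",
    "round(rowMeans(row_scaled), 10)\n",
    "apply(row_scaled, 1, sd)\n",
    "```\n",
    "\n",
    "If subjects are already in rows and genes in columns, `scale` can be applied directly without the transposes."
   ]
  },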
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Unsupervised Learning"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "- Use classic MDS (multi-dimensional scaling) to embed the data in 2D and make a scatter plot with `ggplot2`"
   ]
  },
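  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As a rough sketch on simulated data, classical MDS is available in base R as `cmdscale`, which takes a distance matrix. The toy matrix `x`, the `group` labels and the column names `MDS1` and `MDS2` are assumptions for the example.\n",
    "\n",
    "```r\n",
    "library(ggplot2)\n",
    "\n",
    "set.seed(42)\n",
    "# Toy data: 20 subjects (rows) x 5 variables, two shifted groups\n",
    "x <- rbind(matrix(rnorm(50, mean = 0), nrow = 10),\n",
    "           matrix(rnorm(50, mean = 3), nrow = 10))\n",
    "group <- rep(c('case', 'ctrl'), each = 10)\n",
    "\n",
    "# Classical MDS on Euclidean distances between subjects, keeping 2 dimensions\n",
    "coords <- cmdscale(dist(x), k = 2)\n",
    "mds <- data.frame(MDS1 = coords[, 1], MDS2 = coords[, 2], Group = group)\n",
    "\n",
    "p <- ggplot(mds, aes(x = MDS1, y = MDS2, colour = Group)) + geom_point()\n",
    "p\n",
    "```\n",
    "\n",
    "Note that `dist` computes distances between rows, so the subjects need to be in rows for a plot with one point per subject."
   ]
  },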
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "- Use a t-test to find all genes differentially expressed between the cases and controls groups with a False Discovery Rate (FDR) of 0.01 using the Benjamini–Hochberg (BH) procedure. Save the filtered genes in a `data.frame` called `hits`.\n",
    "\n",
    "**Hints**\n",
    "\n",
    "- The `rowttests` function comes from the `genefilter` library\n",
    "- You can adjust the p-values using the `p.adjust` function"
   ]
  },
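  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The BH adjustment step can be sketched on simulated data. In this illustration base `t.test` stands in for `rowttests`, and the toy matrix `x`, the planted shift in the first five genes, and the names `pvals`, `padj` and `hits_demo` are all assumptions.\n",
    "\n",
    "```r\n",
    "set.seed(1)\n",
    "group <- factor(rep(c('case', 'ctrl'), each = 10))\n",
    "\n",
    "# Toy data: 200 genes (rows) x 20 subjects; the first 5 genes get a real shift\n",
    "x <- matrix(rnorm(200 * 20), nrow = 200,\n",
    "            dimnames = list(paste0('gene', 1:200), NULL))\n",
    "x[1:5, group == 'case'] <- x[1:5, group == 'case'] + 3\n",
    "\n",
    "# One two-sample t-test per gene\n",
    "pvals <- apply(x, 1, function(g) {\n",
    "  t.test(g[group == 'case'], g[group == 'ctrl'])$p.value\n",
    "})\n",
    "\n",
    "# Benjamini-Hochberg adjustment, then keep genes below the FDR threshold\n",
    "padj <- p.adjust(pvals, method = 'BH')\n",
    "hits_demo <- x[padj < 0.01, , drop = FALSE]\n",
    "nrow(hits_demo)\n",
    "```\n",
    "\n",
    "The threshold is applied to the adjusted p-values, not the raw ones. Adjusted values are never smaller than the raw p-values, so the filter can only become stricter."
   ]
  },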
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "- Plot a heatmap of the genes that meet the FDR filter using agglomerative hierarchical clustering with `single` linkage.\n",
    "\n",
    "**Hints**\n",
    "\n",
    "- Use the `pheatmap` library to plot the heatmap\n",
    "- Check what arguments you can give to the `pheatmap` function"
   ]
  },
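  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "For orientation, a small sketch on random toy data: `pheatmap` takes a `clustering_method` argument that is passed on to hierarchical clustering. The matrix `x` is made up for the example.\n",
    "\n",
    "```r\n",
    "library(pheatmap)\n",
    "\n",
    "set.seed(1)\n",
    "# Toy matrix: 6 genes (rows) x 10 subjects (columns)\n",
    "x <- matrix(rnorm(60), nrow = 6,\n",
    "            dimnames = list(paste0('gene', 1:6), paste0('s', 1:10)))\n",
    "\n",
    "# Cluster both rows and columns with single linkage\n",
    "pheatmap(x, clustering_method = 'single')\n",
    "```"
   ]
  },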
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Supervised Learning"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "- Perform logistic regression using LOOCV and the genes selected by FDR to generate class predictions (case or control) for all subjects\n",
    "\n",
    "**Hints**\n",
    "\n",
    "- Use `type = \"response\"` in the call to `predict` to get probability values"
   ]
  },
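  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "A minimal LOOCV sketch on simulated data, assuming a 0/1 response `y` and two made-up predictors `g1` and `g2`. The 0.5 cut-off for turning probabilities into classes is also an assumption of the example.\n",
    "\n",
    "```r\n",
    "set.seed(1)\n",
    "n <- 40\n",
    "\n",
    "# Toy data: the predictors are shifted upwards when y is 1\n",
    "d <- data.frame(y = rep(c(0, 1), each = n / 2))\n",
    "d$g1 <- rnorm(n, mean = d$y)\n",
    "d$g2 <- rnorm(n, mean = 0.5 * d$y)\n",
    "\n",
    "# Leave one subject out, fit on the rest, predict the held-out subject\n",
    "prob <- numeric(n)\n",
    "for (i in seq_len(n)) {\n",
    "  fit <- glm(y ~ g1 + g2, data = d[-i, ], family = binomial)\n",
    "  prob[i] <- predict(fit, newdata = d[i, ], type = 'response')\n",
    "}\n",
    "\n",
    "# Turn probabilities into class predictions\n",
    "pred <- ifelse(prob > 0.5, 1, 0)\n",
    "table(predicted = pred, observed = d$y)\n",
    "```\n",
    "\n",
    "Without `type = 'response'`, `predict` returns values on the log-odds scale instead of probabilities."
   ]
  },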
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "- Evaluate the accuracy, sensitivity, specificity, PPV, NPV of the LOOCV logistic regression\n",
    "\n",
    "**Hints**\n",
    "\n",
    "- The `confusionMatrix` function lives in the `caret` library"
   ]
  },
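  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "To make the five measures concrete, here they are computed by hand on ten hypothetical predictions, treating `case` as the positive class. `confusionMatrix` reports the same quantities. The vectors `obs` and `pred` are invented for the example.\n",
    "\n",
    "```r\n",
    "# Toy labels: 4 true cases followed by 6 true controls\n",
    "obs  <- rep(c('case', 'ctrl'), times = c(4, 6))\n",
    "pred <- c('case', 'case', 'case', 'ctrl',                   # the 4 cases\n",
    "          'ctrl', 'ctrl', 'ctrl', 'ctrl', 'case', 'case')   # the 6 controls\n",
    "\n",
    "tp <- sum(pred == 'case' & obs == 'case')   # true positives:  3\n",
    "fn <- sum(pred == 'ctrl' & obs == 'case')   # false negatives: 1\n",
    "tn <- sum(pred == 'ctrl' & obs == 'ctrl')   # true negatives:  4\n",
    "fp <- sum(pred == 'case' & obs == 'ctrl')   # false positives: 2\n",
    "\n",
    "accuracy    <- (tp + tn) / length(obs)   # 7/10\n",
    "sensitivity <- tp / (tp + fn)            # 3/4\n",
    "specificity <- tn / (tn + fp)            # 4/6\n",
    "ppv         <- tp / (tp + fp)            # 3/5\n",
    "npv         <- tn / (tn + fn)            # 4/5\n",
    "```"
   ]
  },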
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "- Plot an ROC curve for the LOOCV and resubstitution predictions\n",
    "\n",
    "**Hints**\n",
    "\n",
    "- Use the `ROCR` library for plotting\n",
    "- Remember to add in the diagonal"
   ]
  },
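  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "A sketch of overlaying two ROC curves with `ROCR`, using simulated scores. The two score vectors merely stand in for two sets of predictions; they and the names `labels`, `perf_a` and `perf_b` are assumptions for the example.\n",
    "\n",
    "```r\n",
    "library(ROCR)\n",
    "\n",
    "set.seed(1)\n",
    "labels  <- rep(c(0, 1), each = 25)\n",
    "score_a <- rnorm(50, mean = labels)       # weaker toy scores\n",
    "score_b <- rnorm(50, mean = 2 * labels)   # stronger toy scores\n",
    "\n",
    "# True positive rate against false positive rate for each set of scores\n",
    "perf_a <- performance(prediction(score_a, labels), 'tpr', 'fpr')\n",
    "perf_b <- performance(prediction(score_b, labels), 'tpr', 'fpr')\n",
    "\n",
    "plot(perf_a, col = 'red')\n",
    "plot(perf_b, col = 'blue', add = TRUE)   # second curve on the same axes\n",
    "abline(0, 1, lty = 2)                    # diagonal: performance of random guessing\n",
    "legend('bottomright', legend = c('scores A', 'scores B'),\n",
    "       col = c('red', 'blue'), lty = 1)\n",
    "```"
   ]
  },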
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "- Read the `outcomes.txt` data into a `data.frame` called `outcomes` with two columns `PID` and `outcome`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "- Merge `outcomes` and `expr` by joining on the `PID` column.\n",
    "\n",
    "**Hints**\n",
    "\n",
    "- Use `inner_join` since we don't want any NAs"
   ]
  },
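  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "To show why the choice of join matters, here is a toy comparison in which one subject is missing from each table. The two small data frames are hypothetical.\n",
    "\n",
    "```r\n",
    "library(dplyr)\n",
    "\n",
    "# Toy tables: subject 004 has no expression data, subject 003 has no outcome\n",
    "outcomes_demo <- data.frame(PID = c('001', '002', '004'),\n",
    "                            outcome = c(5.1, 6.3, 4.8),\n",
    "                            stringsAsFactors = FALSE)\n",
    "expr_demo <- data.frame(PID = c('001', '002', '003'),\n",
    "                        gene1 = c(0.2, -1.1, 0.7),\n",
    "                        stringsAsFactors = FALSE)\n",
    "\n",
    "# inner_join keeps only the subjects present in both tables\n",
    "merged <- inner_join(outcomes_demo, expr_demo, by = 'PID')\n",
    "merged\n",
    "\n",
    "# For contrast, left_join keeps subject 004 and fills gene1 with NA\n",
    "left <- left_join(outcomes_demo, expr_demo, by = 'PID')\n",
    "left\n",
    "```"
   ]
  },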
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "- Perform LOOCV linear regression using the 5 genes most correlated with outcome to get outcome predictions for each subject\n",
    "\n",
    "**Hints**\n",
    "\n",
    "- Use `cor` to get correlations and get the indexes for descending order of the absolute value"
   ]
  },
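  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The ranking step in the hint can be sketched on simulated data. The toy matrix `genes`, the planted relationship in `gene3`, and the names `r` and `top` are assumptions for the example.\n",
    "\n",
    "```r\n",
    "set.seed(1)\n",
    "outcome <- rnorm(30)\n",
    "\n",
    "# Toy data: 30 subjects (rows) x 8 genes (columns); gene3 is built to track outcome\n",
    "genes <- matrix(rnorm(30 * 8), nrow = 30,\n",
    "                dimnames = list(NULL, paste0('gene', 1:8)))\n",
    "genes[, 3] <- genes[, 3] + 2 * outcome\n",
    "\n",
    "# Correlation of every gene with the outcome, as a named vector\n",
    "r <- cor(genes, outcome)[, 1]\n",
    "\n",
    "# Indexes of the 5 largest correlations by absolute value\n",
    "top <- order(abs(r), decreasing = TRUE)[1:5]\n",
    "names(r)[top]\n",
    "```\n",
    "\n",
    "Ordering on `abs(r)` keeps strongly negative correlations in the running, which a plain descending sort on `r` would push to the bottom."
   ]
  },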
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "- Plot a scatter plot with a linear regression curve for predicted (y) versus observed (x) values using `ggplot2`"
   ]
  },
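  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "A small sketch of this kind of plot on simulated values. The data frame `d` and its columns `observed` and `predicted` are made up for the example.\n",
    "\n",
    "```r\n",
    "library(ggplot2)\n",
    "\n",
    "set.seed(1)\n",
    "# Toy data: predictions equal to the observed values plus noise\n",
    "d <- data.frame(observed = rnorm(40, mean = 100, sd = 15))\n",
    "d$predicted <- d$observed + rnorm(40, sd = 8)\n",
    "\n",
    "# Points plus a fitted straight line from a linear model\n",
    "p <- ggplot(d, aes(x = observed, y = predicted)) +\n",
    "  geom_point() +\n",
    "  geom_smooth(method = 'lm')\n",
    "p\n",
    "```"
   ]
  },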
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "R",
   "language": "R",
   "name": "ir"
  },
  "language_info": {
   "codemirror_mode": "r",
   "file_extension": ".r",
   "mimetype": "text/x-r-source",
   "name": "R",
   "pygments_lexer": "r",
   "version": "3.4.0"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}