{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Computing capstone exercise\n", "\n", "**Note**: This exercise is HARD. Do not be discouraged if you find it difficult. Google as much as you like." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Data\n", "\n", "The data set consists of simulated (and highly contrived) gene expression levels for 100 subjects. 50 of the subjects are cases, and 50 are controls.\n", "\n", "- The expression levels of 20,000 genes for each subject are found in a file `expr-XXX.txt` where `XXX` is the subject ID. Missing values are indicated by the string `nan`.\n", "- The file `cases.txt` contains the IDs of subjects who are in the cases group.\n", "- The file `controls.txt` contains the IDs of subjects who are in the controls group.\n", "- The file `outcomes.txt` contains the subject ID and blood sugar level for all subjects." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Exercise" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Unix shell/command line\n", "\n", "For this part - click on the `Kernel` menu item and select `Change Kernel | Bash`" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "- Download the data from https://www.dropbox.com/s/vivut71p4bkurhw/data.tar.gz\n", "- You will need to quote the URL" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "- Regenerate the original data folder from `data.tar.gz`" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "- Check if any files have been corrupted using the MD5SUM checksum file and note its name" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, 
"source": [ "- Replace the corrupted file with a correct copy from https://www.dropbox.com/s/vf8qcoj07mcq7wn/FILENAME\n", "- You will need to replace FILENAME with the correct filename" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "- Check that there are no `md5sum` errors" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Data munging\n", "\n", "For this part - click on the `Kernel` menu item and select `Change Kernel | R`" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "- Create a `data.frame` called `expr` where each row represents a subject, and each column represents a variable. The variables should include all genes with names `gene1`, `gene2`, etc., as well as a column `PID` containing the subject ID and a column `Group` that is either `case` or `ctrl`. You will have to combine all the versions of `expr-XXX.txt`, `cases.txt`, `controls.txt` appropriately to create this `data.frame`. Make sure you also handle missing data correctly.\n", "\n", "**Hints**: \n", "\n", "- First create a list of filenames\n", "- Then read in one file at a time using `read.table`, adding to a list of data.frames\n", "- You can use the `lapply` function to avoid writing a for loop if you prefer\n", "- Create the `expr` data.frame by using the `bind_cols` function\n", "- Add row and column names in the usual way" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "- Remove any gene (row) whose values are all zero. 
How many genes were dropped?\n", "\n", "**Hints**\n", "\n", "- You should use the `tidyverse` library\n", "- One way to do this is to find rows where the sum of the absolute values is 0" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "- Remove any gene (row) with missing data. How many genes were dropped?" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "- Scale all genes to have zero mean and unit standard deviation\n", "\n", "**Hints**\n", "\n", "- Remember `scale` works on columns and not rows\n", "- When checking, use `head` to limit the output - otherwise it takes a long time" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Unsupervised Learning" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "- Use classic MDS (multi-dimensional scaling) to embed the data in 2D and make a scatter plot with `ggplot2`" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "- Use a t-test to find all genes differentially expressed between the cases and controls groups with a False Discovery Rate (FDR) of 0.01 using the Benjamini–Hochberg (BH) procedure. 
Save the filtered genes in a `data.frame` called `hits`.\n", "\n", "**Hints**\n", "\n", "- The `rowttests` function comes from the `genefilter` library\n", "- You can adjust the p-values using the `p.adjust` function" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "- Plot a heatmap of the genes that meet the FDR filter using agglomerative hierarchical clustering with `single` linkage. \n", "\n", "**Hints**\n", "\n", "- Use the `pheatmap` library to plot the heatmap\n", "- Check what arguments you can give to the `pheatmap` function" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Supervised Learning" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "- Perform logistic regression using LOOCV and the genes selected by FDR to generate class predictions (case or control) for all subjects\n", "\n", "**Hints**\n", "\n", "- Use `type = \"response\"` in the call to `predict` to get probability values" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "- Evaluate the accuracy, sensitivity, specificity, PPV, NPV of the LOOCV logistic regression\n", "\n", "**Hints**\n", "\n", "- The `confusionMatrix` function lives in the `caret` library" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "- Plot an ROC curve for the LOOCV and resubstitution predictions\n", "\n", "**Hints**\n", "\n", "- Use the `ROCR` library for plotting\n", "- Remember to add in the diagonal" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], 
"source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "- Read the `outcomes.txt` data into a `data.frame` called `outcomes` with two columns `PID` and `outcome`." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "- Merge the `outcomes` and `expr` by joining on the `PID` column.\n", "\n", "**Hints**\n", "\n", "- Use `inner_join` since we don't want any NAs" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "- Perform LOOCV linear regression using the 5 genes most correlated with outcome to get outcome predictions for each subject\n", "\n", "**Hints**\n", "\n", "- Use `cor` to get correlations and get the indexes for descending order of the absolute value" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "- Plot a scatter plot with a linear regression curve for predicted (y) versus observed (x) values using `ggplot2`" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "R", "language": "R", "name": "ir" }, "language_info": { "codemirror_mode": "r", "file_extension": ".r", "mimetype": "text/x-r-source", "name": "R", "pygments_lexer": "r", "version": "3.4.0" } }, "nbformat": 4, "nbformat_minor": 2 }