Computing capstone exercise

Note: This exercise is HARD. Do not be discouraged if you find it difficult. Google as much as you like.

Data

The data set consists of a simulated (and highly contrived) gene expression levels for 100 subjects. 50 of the subjects are cases, and 50 are controls.

  • The expression level of 20,000 genes for each subject is found in a file expr-XXX.txt where XXX is the subject ID. Missing values are indicated by the string nan.
  • The file cases.txt contain the IDs of subjects who are in the cases group.
  • The file controls.txt contains the IDs of subjects who are in the controls group.
  • The file outcomes.txt contains the subject ID and blood sugar level for all subjects.

Exercise

Unix shell/command line

For this part - click on the Kernel menu item and select Change Kernel | Bash

In [ ]:

  • Regenerate the original data folder from data.tar.gz
In [ ]:

  • Check if any files have been corrupted using the MDFSUM checksum file and note its
In [ ]:

In [ ]:

  • Check that there are no md5sum errors
In [ ]:

Data munging

For this part - click on the Kernel menu item and select Change Kernel | R

  • Create a data.frame called expr where each row represents a subject, and each column represents a variable. The variables should include all genes with names s gene1, gene2, etc, as well as a column PID containing the subject ID and a column Group that is either case or ctrl. You will have to combine all the versions of expr-XXX.txt, cases.txt, controls.txt appropriately to create this data.frame. Make sure you also handle missing data correctly.

Hints:

  • First create a list of filenames
  • Then read in one file at a time using read.table, adding to a list of data.frames
  • You can use the lapply function to avoid writing a for loop if you prefer
  • Create the expr data.frame by using the bind_cols function
  • Add row and column names in the usual way
In [ ]:

  • Remove any gene(row) whose values are all zero. How many genes were dropped?

Hints

  • You should use the tidyverse library
  • One way to do this is to find rows where the sum of the absolute values is 0
In [ ]:

  • Remove any genes (row) with missing data. How many genes were dropped?
In [ ]:

  • Scale all genes to have zero mean and unit standard deviation

Hints

  • Remember scale works on columns and not rows
  • When checking, use head to limit the output - otherwise it takes a long time
In [ ]:

Unsupervised Learning

  • Use classic MDS (multi-dimensional scaling) to embed the data in 2D and make a scatter plot with ggplot2
In [ ]:

  • Use a t-test to find all genes differentially expressed in the cases and controls group with a False Discovery Rate (FDR) of 0.01 using the Benjamini–Hochberg (BH) procedure. Save the filtered genes in a data.frame called hits.

Hint

  • The rowttests function comes from the genefilter library
  • You can adjust the p-values using the p.adjust function
In [ ]:

  • Plot a heatmap of the genes that meet the FDR filter using agglomerative hierarchical clustering with single linkage.

Hints

  • Use the pheatmap library to plot the heatmap
  • Check what arguments you can give to the pheatmap function
In [ ]:

Supervised Learning

  • Perform logistic regression using LOOCV and the genes selected by FDR to generate class predictions (case or control) for all subjects

Hints

  • Use type = "response" in the call to predict to get probability values
In [ ]:

  • Evaluate the accuracy, sensitivity, specificity, PPV, NPV of the LOOCV logistic regression

Hints

  • The confusionMatrix function lives in the caret library
In [ ]:

  • Plot an ROC curve for the LOOCV and resubstitution predictions

Hints

  • Use the ROCR library for plotting
  • Remember to add in the diagonal
In [ ]:

  • Read the outcomes.txt data into a data.frame called outcomes with two columns PID and outcome.
In [ ]:

  • Merge the outcomes and expr by joining on the PID column.

**Hints

  • Use inner_join since we don’t want any NAs
In [ ]:

  • Perform LOOCV linear regression using the 5 genes most correlated with outcome to get outcome predictions for each subject

Hints

  • Use cor to get correlations and get the indexes for descending order of the absolute value
In [ ]:

  • Plot a scatter plot with a linear regression curve for predicted (y) versus observed (x) values using ggplot2
In [ ]: