Computing capstone exercise¶
Note: This exercise is HARD. Do not be discouraged if you find it difficult. Google as much as you like.
Data¶
The data set consists of a simulated (and highly contrived) gene expression levels for 100 subjects. 50 of the subjects are cases, and 50 are controls.
- The expression level of 20,000 genes for each subject is found in a
file
expr-XXX.txt
whereXXX
is the subject ID. Missing values are indicated by the stringnan
. - The file
cases.txt
contain the IDs of subjects who are in the cases group. - The file
controls.txt
contains the IDs of subjects who are in the controls group. - The file
outcomes.txt
contains the subject ID and blood sugar level for all subjects.
Exercise¶
Unix shell/command line¶
For this part - click on the Kernel
menu item and select
Change Kernel | Bash
- Download the data from https://www.dropbox.com/s/vivut71p4bkurhw/data.tar.gz
- You will need to quote the URL
In [ ]:
- Regenerate the original data folder from
data.tar.gz
In [ ]:
- Check if any files have been corrupted using the MDFSUM checksum file and note its
In [ ]:
- Replace the corrupted file with a correct copy from https://www.dropbox.com/s/vf8qcoj07mcq7wn/FILENAME
- You will need to replace FILENAME with the correct filename
In [ ]:
- Check that there are no
md5sum
errors
In [ ]:
Data munging¶
For this part - click on the Kernel
menu item and select
Change Kernel | R
- Create a
data.frame
calledexpr
where each row represents a subject, and each column represents a variable. The variables should include all genes with names sgene1
,gene2
, etc, as well as a columnPID
containing the subject ID and a columnGroup
that is eithercase
orctrl
. You will have to combine all the versions ofexpr-XXX.txt
,cases.txt
,controls.txt
appropriately to create thisdata.frame
. Make sure you also handle missing data correctly.
Hints:
- First create a list of filenames
- Then read in one file at a time using
read.table
, adding to a list of data.frames - You can use the
lapply
function to avoid writing a for loop if you prefer - Create the
expr
data.frame by using thebind_cols
function - Add row and column names in the usual way
In [ ]:
- Remove any gene(row) whose values are all zero. How many genes were dropped?
Hints
- You should use the
tidyverse
library - One way to do this is to find rows where the sum of the absolute values is 0
In [ ]:
- Remove any genes (row) with missing data. How many genes were dropped?
In [ ]:
- Scale all genes to have zero mean and unit standard deviation
Hints
- Remember
scale
works on columns and not rows - When checking, use
head
to limit the output - otherwise it takes a long time
In [ ]:
Unsupervised Learning¶
- Use classic MDS (multi-dimensional scaling) to embed the data in 2D
and make a scatter plot with
ggplot2
In [ ]:
- Use a t-test to find all genes differentially expressed in the cases
and controls group with a False Discovery Rate (FDR) of 0.01 using
the Benjamini–Hochberg (BH) procedure. Save the filtered genes in a
data.frame
calledhits
.
Hint
- The
rowttests
function comes from thegenefilter
library - You can adjust the p-values using the
p.adjust
function
In [ ]:
- Plot a heatmap of the genes that meet the FDR filter using
agglomerative hierarchical clustering with
single
linkage.
Hints
- Use the
pheatmap
library to plot the heatmap - Check what arguments you can give to the
pheatmap
function
In [ ]:
Supervised Learning¶
- Perform logistic regression using LOOCV and the genes selected by FDR to generate class predictions (case or control) for all subjects
Hints
- Use
type = "response"
in the call topredict
to get probability values
In [ ]:
- Evaluate the accuracy, sensitivity, specificity, PPV, NPV of the LOOCV logistic regression
Hints
- The
confusionMatrix
function lives in thecaret
library
In [ ]:
- Plot an ROC curve for the LOOCV and resubstitution predictions
Hints
- Use the
ROCR
library for plotting - Remember to add in the diagonal
In [ ]:
- Read the
outcomes.txt
data into adata.frame
calledoutcomes
with two columnsPID
andoutcome
.
In [ ]:
- Merge the
outcomes
andexpr
by joining on thePID
column.
**Hints
- Use
inner_join
since we don’t want any NAs
In [ ]:
- Perform LOOCV linear regression using the 5 genes most correlated with outcome to get outcome predictions for each subject
Hints
- Use
cor
to get correlations and get the indexes for descending order of the absolute value
In [ ]:
- Plot a scatter plot with a linear regression curve for predicted (y)
versus observed (x) values using
ggplot2
In [ ]: