Computing capstone exercise¶

Note: This exercise is HARD. Do not be discouraged if you find it difficult. Google as much as you like.

Data¶

The data set consists of a simulated (and highly contrived) gene expression levels for 100 subjects. 50 of the subjects are cases, and 50 are controls.

The expression level of 20,000 genes for each subject is found in a file expr-XXX.txt where XXX is the subject ID. Missing values are indicated by the string nan.
The file cases.txt contain the IDs of subjects who are in the cases group.
The file controls.txt contains the IDs of subjects who are in the controls group.
The file outcomes.txt contains the subject ID and blood sugar level for all subjects.

Exercise¶

Unix shell/command line¶

For this part - click on the Kernel menu item and select Change Kernel | Bash

Download the data from https://www.dropbox.com/s/vivut71p4bkurhw/data.tar.gz
You will need to quote the URL

In [ ]:

Regenerate the original data folder from data.tar.gz

In [ ]:

Check if any files have been corrupted using the MDFSUM checksum file and note its

In [ ]:

Replace the corrupted file with a correct copy from https://www.dropbox.com/s/vf8qcoj07mcq7wn/FILENAME
You will need to replace FILENAME with the correct filename

In [ ]:

Check that there are no md5sum errors

In [ ]:

Data munging¶

For this part - click on the Kernel menu item and select Change Kernel | R

Create a data.frame called expr where each row represents a subject, and each column represents a variable. The variables should include all genes with names s gene1, gene2, etc, as well as a column PID containing the subject ID and a column Group that is either case or ctrl. You will have to combine all the versions of expr-XXX.txt, cases.txt, controls.txt appropriately to create this data.frame. Make sure you also handle missing data correctly.

Hints:

First create a list of filenames
Then read in one file at a time using read.table, adding to a list of data.frames
You can use the lapply function to avoid writing a for loop if you prefer
Create the expr data.frame by using the bind_cols function
Add row and column names in the usual way

In [ ]:

Remove any gene(row) whose values are all zero. How many genes were dropped?

Hints

You should use the tidyverse library
One way to do this is to find rows where the sum of the absolute values is 0

In [ ]:

Remove any genes (row) with missing data. How many genes were dropped?

In [ ]:

Scale all genes to have zero mean and unit standard deviation

Hints

Remember scale works on columns and not rows
When checking, use head to limit the output - otherwise it takes a long time

In [ ]:

Unsupervised Learning¶

Use classic MDS (multi-dimensional scaling) to embed the data in 2D and make a scatter plot with ggplot2

In [ ]:

Use a t-test to find all genes differentially expressed in the cases and controls group with a False Discovery Rate (FDR) of 0.01 using the Benjamini–Hochberg (BH) procedure. Save the filtered genes in a data.frame called hits.

Hint

The rowttests function comes from the genefilter library
You can adjust the p-values using the p.adjust function

In [ ]:

Plot a heatmap of the genes that meet the FDR filter using agglomerative hierarchical clustering with single linkage.

Hints

Use the pheatmap library to plot the heatmap
Check what arguments you can give to the pheatmap function

In [ ]:

Supervised Learning¶

Perform logistic regression using LOOCV and the genes selected by FDR to generate class predictions (case or control) for all subjects

Hints

Use type = "response" in the call to predict to get probability values

In [ ]:

Evaluate the accuracy, sensitivity, specificity, PPV, NPV of the LOOCV logistic regression

Hints

The confusionMatrix function lives in the caret library

In [ ]:

Plot an ROC curve for the LOOCV and resubstitution predictions

Hints

Use the ROCR library for plotting
Remember to add in the diagonal

In [ ]:

Read the outcomes.txt data into a data.frame called outcomes with two columns PID and outcome.

In [ ]:

Merge the outcomes and expr by joining on the PID column.

**Hints

Use inner_join since we don’t want any NAs

In [ ]:

Perform LOOCV linear regression using the 5 genes most correlated with outcome to get outcome predictions for each subject

Hints

Use cor to get correlations and get the indexes for descending order of the absolute value

In [ ]:

Plot a scatter plot with a linear regression curve for predicted (y) versus observed (x) values using ggplot2

In [ ]:

Computing capstone exercise¶

Data¶

Exercise¶

Unix shell/command line¶

Data munging¶

Unsupervised Learning¶

Supervised Learning¶

Page contents

Previous page

Next page

This Page