Computing capstone exercise¶
Note: This exercise is HARD. Do not be discouraged if you find it difficult. Google as much as you like.
Data¶
The data set consists of a simulated (and highly contrived) gene expression levels for 100 subjects. 50 of the subjects are cases, and 50 are controls.
- The expression level of 20,000 genes for each subject is found in a
file
expr-XXX.txtwhereXXXis the subject ID. Missing values are indicated by the stringnan. - The file
cases.txtcontain the IDs of subjects who are in the cases group. - The file
controls.txtcontains the IDs of subjects who are in the controls group. - The file
outcomes.txtcontains the subject ID and blood sugar level for all subjects.
Exercise¶
Unix shell/command line¶
For this part - click on the Kernel menu item and select
Change Kernel | Bash
- Download the data from https://www.dropbox.com/s/vivut71p4bkurhw/data.tar.gz
- You will need to quote the URL
In [ ]:
- Regenerate the original data folder from
data.tar.gz
In [ ]:
- Check if any files have been corrupted using the MDFSUM checksum file and note its
In [ ]:
- Replace the corrupted file with a correct copy from https://www.dropbox.com/s/vf8qcoj07mcq7wn/FILENAME
- You will need to replace FILENAME with the correct filename
In [ ]:
- Check that there are no
md5sumerrors
In [ ]:
Data munging¶
For this part - click on the Kernel menu item and select
Change Kernel | R
- Create a
data.framecalledexprwhere each row represents a subject, and each column represents a variable. The variables should include all genes with names sgene1,gene2, etc, as well as a columnPIDcontaining the subject ID and a columnGroupthat is eithercaseorctrl. You will have to combine all the versions ofexpr-XXX.txt,cases.txt,controls.txtappropriately to create thisdata.frame. Make sure you also handle missing data correctly.
Hints:
- First create a list of filenames
- Then read in one file at a time using
read.table, adding to a list of data.frames - You can use the
lapplyfunction to avoid writing a for loop if you prefer - Create the
exprdata.frame by using thebind_colsfunction - Add row and column names in the usual way
In [ ]:
- Remove any gene(row) whose values are all zero. How many genes were dropped?
Hints
- You should use the
tidyverselibrary - One way to do this is to find rows where the sum of the absolute values is 0
In [ ]:
- Remove any genes (row) with missing data. How many genes were dropped?
In [ ]:
- Scale all genes to have zero mean and unit standard deviation
Hints
- Remember
scaleworks on columns and not rows - When checking, use
headto limit the output - otherwise it takes a long time
In [ ]:
Unsupervised Learning¶
- Use classic MDS (multi-dimensional scaling) to embed the data in 2D
and make a scatter plot with
ggplot2
In [ ]:
- Use a t-test to find all genes differentially expressed in the cases
and controls group with a False Discovery Rate (FDR) of 0.01 using
the Benjamini–Hochberg (BH) procedure. Save the filtered genes in a
data.framecalledhits.
Hint
- The
rowttestsfunction comes from thegenefilterlibrary - You can adjust the p-values using the
p.adjustfunction
In [ ]:
- Plot a heatmap of the genes that meet the FDR filter using
agglomerative hierarchical clustering with
singlelinkage.
Hints
- Use the
pheatmaplibrary to plot the heatmap - Check what arguments you can give to the
pheatmapfunction
In [ ]:
Supervised Learning¶
- Perform logistic regression using LOOCV and the genes selected by FDR to generate class predictions (case or control) for all subjects
Hints
- Use
type = "response"in the call topredictto get probability values
In [ ]:
- Evaluate the accuracy, sensitivity, specificity, PPV, NPV of the LOOCV logistic regression
Hints
- The
confusionMatrixfunction lives in thecaretlibrary
In [ ]:
- Plot an ROC curve for the LOOCV and resubstitution predictions
Hints
- Use the
ROCRlibrary for plotting - Remember to add in the diagonal
In [ ]:
- Read the
outcomes.txtdata into adata.framecalledoutcomeswith two columnsPIDandoutcome.
In [ ]:
- Merge the
outcomesandexprby joining on thePIDcolumn.
**Hints
- Use
inner_joinsince we don’t want any NAs
In [ ]:
- Perform LOOCV linear regression using the 5 genes most correlated with outcome to get outcome predictions for each subject
Hints
- Use
corto get correlations and get the indexes for descending order of the absolute value
In [ ]:
- Plot a scatter plot with a linear regression curve for predicted (y)
versus observed (x) values using
ggplot2
In [ ]: