Computing capstone exercise: Solutions

Data

The data set consists of a simulated (and highly contrived) gene expression levels for 100 subjects. 50 of the subjects are cases, and 50 are controls.

  • The expression level of 20,000 genes for each subject is found in a file expr-XXX.txt where XXX is the subject ID. Missing values are indicated by the string nan.
  • The file cases.txt contain the IDs of subjects who are in the cases group.
  • The file controls.txt contains the IDs of subjects who are in the controls group.
  • The file outcomes.txt contains the subject ID and blood sugar level for all subjects.

Exercise

Unix shell/command line

For this part - click on the Kernel menu item and select Change Kernel | Bash

In [1]:
wget "https://www.dropbox.com/s/vivut71p4bkurhw/data.tar.gz"
--2017-07-27 10:06:58--  https://www.dropbox.com/s/vivut71p4bkurhw/data.tar.gz
Resolving www.dropbox.com... 162.125.6.1, 2620:100:601c:1::a27d:601
Connecting to www.dropbox.com|162.125.6.1|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://dl.dropboxusercontent.com/content_link/zTZFO6FdHTj3YjweRBYauZZ5YuV5Ej0gsr2rtOUPSdbCs0NQ40ADNWTJMcOm1wGd/file [following]
--2017-07-27 10:06:58--  https://dl.dropboxusercontent.com/content_link/zTZFO6FdHTj3YjweRBYauZZ5YuV5Ej0gsr2rtOUPSdbCs0NQ40ADNWTJMcOm1wGd/file
Resolving dl.dropboxusercontent.com... 162.125.6.6
Connecting to dl.dropboxusercontent.com|162.125.6.6|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 3973292 (3.8M) [application/octet-stream]
Saving to: ‘data.tar.gz.1’

data.tar.gz.1       100%[===================>]   3.79M  7.28MB/s    in 0.5s

2017-07-27 10:07:00 (7.28 MB/s) - ‘data.tar.gz.1’ saved [3973292/3973292]

  • Regenerate the original data folder from data.tar.gz
In [2]:
tar -xzf data.tar.gz
  • Check if any files have been corrupted using the MDFSUM checksum file and note its
In [3]:
cd data
md5sum -c MD5SUM
cases.txt: OK
controls.txt: OK
expr-1001.txt: OK
expr-1002.txt: OK
expr-1003.txt: OK
expr-1004.txt: OK
expr-1005.txt: OK
expr-1006.txt: OK
expr-1007.txt: OK
expr-1008.txt: OK
expr-1009.txt: OK
expr-1010.txt: OK
expr-1011.txt: OK
expr-1012.txt: OK
expr-1013.txt: OK
expr-1014.txt: OK
expr-1015.txt: OK
expr-1016.txt: OK
expr-1017.txt: OK
expr-1018.txt: OK
expr-1019.txt: OK
expr-1020.txt: OK
expr-1021.txt: OK
expr-1022.txt: OK
expr-1023.txt: OK
expr-1024.txt: OK
expr-1025.txt: OK
expr-1026.txt: OK
expr-1027.txt: OK
expr-1028.txt: OK
expr-1029.txt: OK
expr-1030.txt: OK
expr-1031.txt: OK
expr-1032.txt: OK
expr-1033.txt: OK
expr-1034.txt: OK
expr-1035.txt: OK
expr-1036.txt: OK
expr-1037.txt: OK
expr-1038.txt: OK
expr-1039.txt: OK
expr-1040.txt: OK
expr-1041.txt: OK
expr-1042.txt: OK
expr-1043.txt: OK
expr-1044.txt: OK
expr-1045.txt: OK
expr-1046.txt: OK
expr-1047.txt: OK
expr-1048.txt: OK
expr-1049.txt: OK
expr-1050.txt: OK
expr-1051.txt: OK
expr-1052.txt: OK
expr-1053.txt: OK
expr-1054.txt: OK
expr-1055.txt: OK
expr-1056.txt: OK
expr-1057.txt: OK
expr-1058.txt: OK
expr-1059.txt: OK
expr-1060.txt: OK
expr-1061.txt: OK
expr-1062.txt: OK
expr-1063.txt: OK
expr-1064.txt: OK
expr-1065.txt: OK
expr-1066.txt: OK
expr-1067.txt: OK
expr-1068.txt: OK
expr-1069.txt: OK
expr-1070.txt: OK
expr-1071.txt: OK
expr-1072.txt: OK
expr-1073.txt: OK
expr-1074.txt: FAILED
expr-1075.txt: OK
expr-1076.txt: OK
expr-1077.txt: OK
expr-1078.txt: OK
expr-1079.txt: OK
expr-1080.txt: OK
expr-1081.txt: OK
expr-1082.txt: OK
expr-1083.txt: OK
expr-1084.txt: OK
expr-1085.txt: OK
expr-1086.txt: OK
expr-1087.txt: OK
expr-1088.txt: OK
expr-1089.txt: OK
expr-1090.txt: OK
expr-1091.txt: OK
expr-1092.txt: OK
expr-1093.txt: OK
expr-1094.txt: OK
expr-1095.txt: OK
expr-1096.txt: OK
expr-1097.txt: OK
expr-1098.txt: OK
expr-1099.txt: OK
expr-1100.txt: OK
outcomes.txt: OK
md5sum: WARNING: 1 of 103 computed checksums did NOT match

In [4]:
rm expr-1074.txt
wget "https://www.dropbox.com/s/vf8qcoj07mcq7wn/expr-1074.txt"
--2017-07-27 10:07:04--  https://www.dropbox.com/s/vf8qcoj07mcq7wn/expr-1074.txt
Resolving www.dropbox.com... 162.125.6.1, 2620:100:601c:1::a27d:601
Connecting to www.dropbox.com|162.125.6.1|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://dl.dropboxusercontent.com/content_link/GMBFcFrWZQOBq8z3MtOT0haOIcLYErRBgLumqB7oFbhwdCbQbJtKU3kLjf2lArVp/file [following]
--2017-07-27 10:07:05--  https://dl.dropboxusercontent.com/content_link/GMBFcFrWZQOBq8z3MtOT0haOIcLYErRBgLumqB7oFbhwdCbQbJtKU3kLjf2lArVp/file
Resolving dl.dropboxusercontent.com... 162.125.6.6
Connecting to dl.dropboxusercontent.com|162.125.6.6|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 111905 (109K) [text/plain]
Saving to: ‘expr-1074.txt’

expr-1074.txt       100%[===================>] 109.28K  --.-KB/s    in 0.03s

2017-07-27 10:07:05 (3.32 MB/s) - ‘expr-1074.txt’ saved [111905/111905]

  • Check that there are no md5sum errors
In [5]:
md5sum -c MD5SUM
cases.txt: OK
controls.txt: OK
expr-1001.txt: OK
expr-1002.txt: OK
expr-1003.txt: OK
expr-1004.txt: OK
expr-1005.txt: OK
expr-1006.txt: OK
expr-1007.txt: OK
expr-1008.txt: OK
expr-1009.txt: OK
expr-1010.txt: OK
expr-1011.txt: OK
expr-1012.txt: OK
expr-1013.txt: OK
expr-1014.txt: OK
expr-1015.txt: OK
expr-1016.txt: OK
expr-1017.txt: OK
expr-1018.txt: OK
expr-1019.txt: OK
expr-1020.txt: OK
expr-1021.txt: OK
expr-1022.txt: OK
expr-1023.txt: OK
expr-1024.txt: OK
expr-1025.txt: OK
expr-1026.txt: OK
expr-1027.txt: OK
expr-1028.txt: OK
expr-1029.txt: OK
expr-1030.txt: OK
expr-1031.txt: OK
expr-1032.txt: OK
expr-1033.txt: OK
expr-1034.txt: OK
expr-1035.txt: OK
expr-1036.txt: OK
expr-1037.txt: OK
expr-1038.txt: OK
expr-1039.txt: OK
expr-1040.txt: OK
expr-1041.txt: OK
expr-1042.txt: OK
expr-1043.txt: OK
expr-1044.txt: OK
expr-1045.txt: OK
expr-1046.txt: OK
expr-1047.txt: OK
expr-1048.txt: OK
expr-1049.txt: OK
expr-1050.txt: OK
expr-1051.txt: OK
expr-1052.txt: OK
expr-1053.txt: OK
expr-1054.txt: OK
expr-1055.txt: OK
expr-1056.txt: OK
expr-1057.txt: OK
expr-1058.txt: OK
expr-1059.txt: OK
expr-1060.txt: OK
expr-1061.txt: OK
expr-1062.txt: OK
expr-1063.txt: OK
expr-1064.txt: OK
expr-1065.txt: OK
expr-1066.txt: OK
expr-1067.txt: OK
expr-1068.txt: OK
expr-1069.txt: OK
expr-1070.txt: OK
expr-1071.txt: OK
expr-1072.txt: OK
expr-1073.txt: OK
expr-1074.txt: OK
expr-1075.txt: OK
expr-1076.txt: OK
expr-1077.txt: OK
expr-1078.txt: OK
expr-1079.txt: OK
expr-1080.txt: OK
expr-1081.txt: OK
expr-1082.txt: OK
expr-1083.txt: OK
expr-1084.txt: OK
expr-1085.txt: OK
expr-1086.txt: OK
expr-1087.txt: OK
expr-1088.txt: OK
expr-1089.txt: OK
expr-1090.txt: OK
expr-1091.txt: OK
expr-1092.txt: OK
expr-1093.txt: OK
expr-1094.txt: OK
expr-1095.txt: OK
expr-1096.txt: OK
expr-1097.txt: OK
expr-1098.txt: OK
expr-1099.txt: OK
expr-1100.txt: OK
outcomes.txt: OK

Data munging

For this part - click on the Kernel menu item and select Change Kernel | R

  • Create a data.frame called expr where each row represents a gene, and each column represents a subject. Give meaningful row names of the form geneX (where X goes from 1:20000) and ptX (where X goes from 1001:1100, using the PID in the filename).
  • When loading the files, make sure you also handle missing data indicated by the string ‘nan’ correctly.
In [186]:
suppressPackageStartupMessages(library(tidyverse))
In [187]:
files <- paste('data/', 'expr-', 1001:1100, '.txt', sep='')
In [188]:
data <- lapply(files, read.table)
In [189]:
expr <- bind_cols(data)
In [190]:
dim(expr)
  1. 20000
  2. 100
In [191]:
rownames(expr) = paste('gene', 1:nrow(expr), sep='')
In [192]:
colnames(expr) = paste('pt', 1001:1100, sep='')
In [193]:
head(expr)
pt1001pt1002pt1003pt1004pt1005pt1006pt1007pt1008pt1009pt1010pt1091pt1092pt1093pt1094pt1095pt1096pt1097pt1098pt1099pt1100
gene1 8.39 3.60 12.55 -2.15 -8.04-10.42-4.39 -1.82 7.38 5.70 -6.29 -3.52 -4.81 0.37-2.22 -4.14 7.40 -1.34 -3.34 5.17
gene2 0.63 6.18 8.35 9.87 -4.97 0.86-3.28 -4.73 -14.21-7.32 4.05 2.45 -1.42 2.29-3.84 -6.12 -2.89 -5.94 15.15 12.76
gene3 1.40 9.47 2.42 -1.58 -8.86 -2.05-1.31 -1.64 9.68-3.50 -4.38 0.96 5.34 7.45-6.63 4.84 -3.58 -8.33 3.76 -7.13
gene4 1.44 -2.81 4.60 1.22 4.14 10.04-0.99 12.57 -7.22-7.75 -5.44 3.74 -6.08 1.96-2.47 0.66 7.34 -7.44 1.48 6.83
gene5-0.38 0.46 0.41 -12.49-11.06 3.19-4.40 3.55 -9.26 1.63 1.80 4.17 -7.01 0.80 2.02 0.30 3.85 -7.74 9.59 6.05
gene6-8.52 5.84 1.56 -1.52 4.73 5.90-5.81 -0.16 -0.62 0.91 9.14 -9.82 7.51 -11.27-2.70 -0.78 3.66 10.58 7.58 14.13
  • Remove any gene(row) whose values are all zero. How many genes were dropped?
In [194]:
idx <- rowSums(abs(expr)) == 0
expr <- expr[!idx,]
In [195]:
dim(expr)
  1. 19998
  2. 100
  • Remove any genes (row) with missing data. How many genes were dropped?
In [196]:
expr <- expr %>% drop_na()
In [197]:
dim(expr)
  1. 19993
  2. 100
  • Scale all genes to have zero mean and unit standard deviation
In [198]:
expr <- t(scale(t(expr)))
In [199]:
head(round(apply(expr, 1, sd), 2), 3)
gene1
1
gene2
1
gene3
1
In [200]:
head(round(apply(expr, 1, mean), 2), 3)
gene1
0
gene2
0
gene3
0
  • Create a data frame group with two columns PID and Group that indicates whether each patient is a case or control. Use the information from the cases.txt and controls.txt files.
In [201]:
cases <- read.table('data/cases.txt')
ctrls <- read.table('data/controls.txt')
In [202]:
n1 <- nrow(cases)
n2 <- nrow(ctrls)
In [203]:
grp <- c(rep('case', n1), rep('ctrl', n2))
In [204]:
group <- data.frame(PID=bind_rows(cases, ctrls), grp)
In [205]:
colnames(group) <- c('PID', 'Group')
In [206]:
head(group, 3)
PIDGroup
1088case
1022case
1064case
In [207]:
tail(group, 3)
PIDGroup
981096ctrl
991072ctrl
1001038ctrl

Unsupervised Learning

  • Use classic MDS (multi-dimensional scaling) to embed the data in 2D and make a scatter plot with ggplot2
In [208]:
mds <- as.data.frame(cmdscale(dist(t(expr)), k=2))
In [209]:
str(mds)
'data.frame':   100 obs. of  2 variables:
 $ V1: num  -19.5 21.9 19.5 -22 -21.1 ...
 $ V2: num  -7.44 13.636 -36.314 5.243 -0.752 ...
In [210]:
options(repr.plot.width=4, repr.plot.height=3)
In [211]:
ggplot(mds, aes(x=V1, y=V2)) + geom_point()
Data type cannot be displayed:
../../_images/Computation_Wk4_Day4_PM_Computing_Capstone_Solutions_47_1.png
  • Use a t-test to find all genes differentially expressed in the cases and controls group with a False Discovery Rate (FDR) of 0.05 using the Benjamini–Hochberg (BH) procedure. Save the filtered genes in a data.frame called hits.
In [212]:
suppressPackageStartupMessages(library(genefilter))
In [213]:
head(group, 3)
PIDGroup
1088case
1022case
1064case
In [214]:
group <- group %>% arrange(PID)
head(group)
PIDGroup
1001case
1002ctrl
1003ctrl
1004case
1005case
1006case
In [215]:
dim(expr)
  1. 19993
  2. 100
In [216]:
p.values <- rowttests(expr, fac = group$Group)$p.value
In [217]:
head(p.values)
  1. 0.810340264500016
  2. 0.0204475853560852
  3. 0.233159743584127
  4. 0.412635280852636
  5. 0.165387577174332
  6. 0.933613549615968
In [218]:
p.bh <- p.adjust(p.values, method='BH')
In [219]:
idx <- p.bh < 0.01
In [220]:
hits <- expr[idx,]
rownames(hits)
  1. 'gene160'
  2. 'gene393'
  3. 'gene412'
  4. 'gene439'
  5. 'gene452'
  6. 'gene490'
  7. 'gene506'
  8. 'gene601'
  9. 'gene683'
  10. 'gene748'
  11. 'gene766'
  12. 'gene771'
  13. 'gene808'
  14. 'gene830'
  15. 'gene834'
  16. 'gene862'
  17. 'gene888'
  18. 'gene914'
  19. 'gene1013'
  20. 'gene1035'
  21. 'gene1057'
  22. 'gene1082'
  23. 'gene1138'
  24. 'gene1172'
  25. 'gene1194'
  26. 'gene1322'
  27. 'gene1338'
  28. 'gene1462'
  29. 'gene1563'
  30. 'gene1582'
  31. 'gene1594'
  32. 'gene1775'
  33. 'gene1844'
  34. 'gene1857'
  35. 'gene1868'
  36. 'gene1910'
  37. 'gene1950'
  38. 'gene1957'
  39. 'gene2094'
  40. 'gene2106'
  41. 'gene2119'
  42. 'gene2148'
  43. 'gene2233'
  44. 'gene2253'
  45. 'gene2300'
  46. 'gene2358'
  47. 'gene2407'
  48. 'gene2450'
  49. 'gene2454'
  50. 'gene2461'
  51. 'gene2508'
  52. 'gene2548'
  53. 'gene2560'
  54. 'gene2608'
  55. 'gene2630'
  56. 'gene2642'
  57. 'gene2671'
  58. 'gene2736'
  59. 'gene2772'
  60. 'gene2818'
  61. 'gene2822'
  62. 'gene2941'
  63. 'gene2971'
  64. 'gene3012'
  65. 'gene3072'
  66. 'gene3116'
  67. 'gene3215'
  68. 'gene3262'
  69. 'gene3269'
  70. 'gene3306'
  71. 'gene3403'
  72. 'gene3455'
  73. 'gene3490'
  74. 'gene3492'
  75. 'gene3546'
  76. 'gene3574'
  77. 'gene3647'
  78. 'gene3693'
  79. 'gene3731'
  80. 'gene3748'
  81. 'gene3769'
  82. 'gene3802'
  83. 'gene3964'
  84. 'gene3971'
  85. 'gene4007'
  86. 'gene4028'
  87. 'gene4059'
  88. 'gene4097'
  89. 'gene4117'
  90. 'gene4233'
  91. 'gene4293'
  92. 'gene4427'
  93. 'gene4438'
  94. 'gene4501'
  95. 'gene4527'
  96. 'gene4634'
  97. 'gene4652'
  98. 'gene4723'
  99. 'gene4846'
  100. 'gene4890'
  101. 'gene5212'
  102. 'gene5233'
  103. 'gene5272'
  104. 'gene5375'
  105. 'gene5478'
  106. 'gene5520'
  107. 'gene5664'
  108. 'gene5770'
  109. 'gene5828'
  110. 'gene5936'
  111. 'gene5975'
  112. 'gene5987'
  113. 'gene6025'
  114. 'gene6029'
  115. 'gene6096'
  116. 'gene6097'
  117. 'gene6102'
  118. 'gene6104'
  119. 'gene6210'
  120. 'gene6412'
  121. 'gene6478'
  122. 'gene6534'
  123. 'gene6650'
  124. 'gene6672'
  125. 'gene6678'
  126. 'gene6756'
  127. 'gene6779'
  128. 'gene6908'
  129. 'gene7116'
  130. 'gene7123'
  131. 'gene7215'
  132. 'gene7220'
  133. 'gene7250'
  134. 'gene7252'
  135. 'gene7363'
  136. 'gene7374'
  137. 'gene7496'
  138. 'gene7568'
  139. 'gene7578'
  140. 'gene7648'
  141. 'gene7692'
  142. 'gene7704'
  143. 'gene7713'
  144. 'gene7766'
  145. 'gene7783'
  146. 'gene7873'
  147. 'gene8102'
  148. 'gene8163'
  149. 'gene8304'
  150. 'gene8559'
  151. 'gene8758'
  152. 'gene8843'
  153. 'gene8906'
  154. 'gene8924'
  155. 'gene8993'
  156. 'gene8995'
  157. 'gene9104'
  158. 'gene9192'
  159. 'gene9203'
  160. 'gene9227'
  161. 'gene9326'
  162. 'gene9403'
  163. 'gene9506'
  164. 'gene9516'
  165. 'gene9652'
  166. 'gene9653'
  167. 'gene9679'
  168. 'gene9725'
  169. 'gene9741'
  170. 'gene9817'
  171. 'gene9834'
  172. 'gene9848'
  173. 'gene9857'
  174. 'gene9906'
  175. 'gene10003'
  176. 'gene10057'
  177. 'gene10105'
  178. 'gene10119'
  179. 'gene10190'
  180. 'gene10210'
  181. 'gene10211'
  182. 'gene10283'
  183. 'gene10326'
  184. 'gene10437'
  185. 'gene10446'
  186. 'gene10529'
  187. 'gene10587'
  188. 'gene10639'
  189. 'gene10657'
  190. 'gene10667'
  191. 'gene10674'
  192. 'gene10682'
  193. 'gene10696'
  194. 'gene10706'
  195. 'gene10737'
  196. 'gene10778'
  197. 'gene10812'
  198. 'gene10876'
  199. 'gene10918'
  200. 'gene10949'
  201. 'gene10956'
  202. 'gene11212'
  203. 'gene11218'
  204. 'gene11295'
  205. 'gene11357'
  206. 'gene11360'
  207. 'gene11366'
  208. 'gene11384'
  209. 'gene11429'
  210. 'gene11433'
  211. 'gene11478'
  212. 'gene11527'
  213. 'gene11553'
  214. 'gene11564'
  215. 'gene11631'
  216. 'gene11824'
  217. 'gene11931'
  218. 'gene12102'
  219. 'gene12211'
  220. 'gene12253'
  221. 'gene12354'
  222. 'gene12370'
  223. 'gene12377'
  224. 'gene12425'
  225. 'gene12508'
  226. 'gene12547'
  227. 'gene12559'
  228. 'gene12567'
  229. 'gene12674'
  230. 'gene12675'
  231. 'gene12694'
  232. 'gene12704'
  233. 'gene12723'
  234. 'gene12769'
  235. 'gene12777'
  236. 'gene12809'
  237. 'gene12819'
  238. 'gene12825'
  239. 'gene12864'
  240. 'gene12902'
  241. 'gene12927'
  242. 'gene12947'
  243. 'gene13148'
  244. 'gene13170'
  245. 'gene13179'
  246. 'gene13233'
  247. 'gene13381'
  248. 'gene13461'
  249. 'gene13567'
  250. 'gene13607'
  251. 'gene13649'
  252. 'gene13711'
  253. 'gene13723'
  254. 'gene13742'
  255. 'gene13780'
  256. 'gene13823'
  257. 'gene13832'
  258. 'gene13872'
  259. 'gene14038'
  260. 'gene14154'
  261. 'gene14161'
  262. 'gene14178'
  263. 'gene14434'
  264. 'gene14443'
  265. 'gene14508'
  266. 'gene14539'
  267. 'gene14560'
  268. 'gene14590'
  269. 'gene14604'
  270. 'gene14623'
  271. 'gene14656'
  272. 'gene14660'
  273. 'gene14782'
  274. 'gene14784'
  275. 'gene14808'
  276. 'gene14889'
  277. 'gene14924'
  278. 'gene15062'
  279. 'gene15069'
  280. 'gene15083'
  281. 'gene15087'
  282. 'gene15101'
  283. 'gene15128'
  284. 'gene15173'
  285. 'gene15177'
  286. 'gene15180'
  287. 'gene15232'
  288. 'gene15263'
  289. 'gene15361'
  290. 'gene15415'
  291. 'gene15454'
  292. 'gene15523'
  293. 'gene15530'
  294. 'gene15592'
  295. 'gene15596'
  296. 'gene15619'
  297. 'gene15709'
  298. 'gene15714'
  299. 'gene15725'
  300. 'gene15743'
  301. 'gene15892'
  302. 'gene16056'
  303. 'gene16088'
  304. 'gene16095'
  305. 'gene16122'
  306. 'gene16164'
  307. 'gene16197'
  308. 'gene16213'
  309. 'gene16240'
  310. 'gene16256'
  311. 'gene16482'
  312. 'gene16647'
  313. 'gene16701'
  314. 'gene16705'
  315. 'gene16788'
  316. 'gene16810'
  317. 'gene16830'
  318. 'gene16870'
  319. 'gene16967'
  320. 'gene16978'
  321. 'gene16987'
  322. 'gene17113'
  323. 'gene17129'
  324. 'gene17146'
  325. 'gene17224'
  326. 'gene17233'
  327. 'gene17396'
  328. 'gene17432'
  329. 'gene17562'
  330. 'gene17582'
  331. 'gene17605'
  332. 'gene17625'
  333. 'gene17862'
  334. 'gene17910'
  335. 'gene18009'
  336. 'gene18055'
  337. 'gene18169'
  338. 'gene18206'
  339. 'gene18207'
  340. 'gene18316'
  341. 'gene18329'
  342. 'gene18341'
  343. 'gene18375'
  344. 'gene18410'
  345. 'gene18453'
  346. 'gene18487'
  347. 'gene18509'
  348. 'gene18530'
  349. 'gene18722'
  350. 'gene18811'
  351. 'gene18847'
  352. 'gene18905'
  353. 'gene18957'
  354. 'gene19008'
  355. 'gene19066'
  356. 'gene19080'
  357. 'gene19086'
  358. 'gene19137'
  359. 'gene19216'
  360. 'gene19251'
  361. 'gene19287'
  362. 'gene19564'
  363. 'gene19666'
  364. 'gene19701'
  365. 'gene19757'
  366. 'gene19771'
  367. 'gene19795'
  368. 'gene19930'
  369. 'gene19978'
  • Plot a heatmap of the genes that meet the FDR filter using agglomerative hierarchical clustering with complete linkage for the dendrograms. Arrange the subjects so all controls are on the left and all cases are on the right in the heatmap.
In [221]:
clusters <- hclust(dist(hits), method = 'complete')
In [222]:
library(pheatmap)
In [223]:
annot <- data.frame(grp=group$Group, row.names=colnames(hits))
In [224]:
options(repr.plot.width=8, repr.plot.height=8)
In [303]:
pheatmap(hits,
         clustering_method = 'single',
         annotation_col = annot,
         annotation_colors = list(grp = c(case = "blue", ctrl = "yellow")),
         show_rownames = FALSE, show_colnames = FALSE,
        )
../../_images/Computation_Wk4_Day4_PM_Computing_Capstone_Solutions_63_0.png

Supervised Learning

  • Perform logistic regression using LOOCV and the top 10 genes to generate class predictions (case or control) for all subjects
In [226]:
n <- nrow(df)
pred <- numeric(n)
for (i in 1:n) {
    p.values <- rowttests(expr[,-i], fac = group$Group[-i])$p.value
    idx <- order(p.values)
    hits <- expr[idx[1:10],]

    # transpose so geens are variables then add group informaiton
    df <- t(hits)
    df <- data.frame(y=as.factor(group$Group), df)

    model <- glm(y ~ ., data=df[-i,], family='binomial')
    pred[i] <- predict(model, df[i,], type='response')
}
yhat <- ifelse(pred < 0.5, 1, 2)
  • Evaluate the accuracy, sensitivity, specificity, PPV, NPV of the LOOCV logistic regression
In [227]:
suppressPackageStartupMessages(library(caret))
In [228]:
confusionMatrix(yhat, as.integer(group$Group))
Confusion Matrix and Statistics

          Reference
Prediction  1  2
         1 50  0
         2  0 50

               Accuracy : 1
                 95% CI : (0.9638, 1)
    No Information Rate : 0.5
    P-Value [Acc > NIR] : < 2.2e-16

                  Kappa : 1
 Mcnemar's Test P-Value : NA

            Sensitivity : 1.0
            Specificity : 1.0
         Pos Pred Value : 1.0
         Neg Pred Value : 1.0
             Prevalence : 0.5
         Detection Rate : 0.5
   Detection Prevalence : 0.5
      Balanced Accuracy : 1.0

       'Positive' Class : 1
  • Plot an ROC curve for the LOOCV predictions
In [229]:
suppressPackageStartupMessages(library(ROCR))
In [230]:
ROC <- function(preddat, mlab = "probhat", ylab = "y") {
    m <- preddat[[mlab]]
    y <- preddat[[ylab]]
    performance(prediction(m, y), measure = "tpr", x.measure = "fpr")
}
In [231]:
preddat <- data.frame(probhat=pred, yhat=yhat, y=as.integer(group$Group))
my.roc <- ROC(preddat)
In [232]:
options(repr.plot.width=4, repr.plot.height=4)
In [233]:
plot(my.roc, col = "blue")
abline(0, 1)
../../_images/Computation_Wk4_Day4_PM_Computing_Capstone_Solutions_75_0.png
  • Read the outcomes.txt data into a data.frame called outcomes with two columns PID and outcome.
In [234]:
outcomes <- read.table('data/outcomes.txt')
In [235]:
head(outcomes)
V1V2
1035 491.65
1091 512.43
1095 513.15
1019 571.56
1081 517.13
1077 535.44
In [236]:
colnames(outcomes) = c('PID', 'outcome')
In [243]:
outcomes$PID = as.factor(outcomes$PID)
In [244]:
head(outcomes)
PIDoutcome
1035 491.65
1091 512.43
1095 513.15
1019 571.56
1081 517.13
1077 535.44
  • Merge the outcomes and expr by joining on the PID column.
In [239]:
head(data)
PIDgene1gene2gene3gene4gene5gene6gene7gene8gene9gene19991gene19992gene19993gene19994gene19995gene19996gene19997gene19998gene19999gene20000
pt1001 1.5619416 0.1274296 0.3818537 0.1754658 -0.119502825-1.40226660 -0.8833635 -0.94309346 -0.4272967 -0.62969491 0.2422296 0.8998351 0.6209216 -0.4764382 -0.08677132 -0.8747234 0.5187669 0.2592983 -0.2476950
pt1002 0.7334951 0.8649770 1.8528351 -0.5721328 0.012561868 0.54493040 0.5795712 0.06164405 -0.5987428 0.37135403 0.8149647 0.9313236 -0.1974330 -1.5460305 0.43927981 0.1554110 -0.4076643 -1.3342893 -0.1228203
pt1003 2.2814275 1.1533513 0.5677770 0.7313274 0.004700874-0.03543193 -0.9456161 -1.74386169 1.6602020 0.93993227 -0.6044223 0.1441122 1.1243218 -0.2946378 -1.49409366 -0.1392395 1.0223122 0.7461696 1.3233091
pt1004 -0.2609866 1.3553463 -0.1613340 0.1367666 -2.023435482-0.45307586 -0.4475958 0.96817412 -1.0207642 0.63478797 -1.6004834 0.2385775 0.4706923 -2.1459718 1.00600499 1.1101697 -1.0333510 -0.9573566 0.8587216
pt1005 -1.2796818 -0.6167624 -1.4883161 0.6504108 -1.798611064 0.39441587 -0.3542170 -0.33118565 -0.2219380 0.06840501 0.5348225 -0.3387108 0.2914713 0.9537248 1.61069262 -0.7970636 1.7090697 -0.3343707 -1.0479672
pt1006 -1.6913107 0.1579946 -0.2470046 1.6882536 0.441772120 0.55306632 1.2643491 1.61029959 0.9706492 -0.83824677 -1.2207351 -0.2862301 1.4972595 -0.8763991 -2.01472158 0.2444913 -0.3154821 -0.8359005 0.8976610
In [241]:
head(outcomes)
PIDoutcome
1035 491.65
1091 512.43
1095 513.15
1019 571.56
1081 517.13
1077 535.44
In [246]:
data <- data.frame(t(expr)) %>% rownames_to_column('PID') %>% mutate(PID=as.factor(substr(PID, 3,6)))
df <- inner_join(data, outcomes, by='PID')
  • Perform LOOCV linear regression using the 5 genes most correlated with outcome to get outcome predictions for each subject
In [260]:
head(df, 2)
gene1gene2gene3gene4gene5gene6gene7gene8gene9gene10gene19992gene19993gene19994gene19995gene19996gene19997gene19998gene19999gene20000outcome
10011.5619416 0.1274296 0.3818537 0.1754658 -0.11950283-1.4022666 -0.8833635 -0.94309346-0.4272967 0.3923721 0.2422296 0.8998351 0.6209216 -0.4764382 -0.08677132-0.8747234 0.5187669 0.2592983 -0.2476950 629.10
10020.7334951 0.8649770 1.8528351 -0.5721328 0.01256187 0.5449304 0.5795712 0.06164405-0.5987428 0.3893356 0.8149647 0.9313236 -0.1974330 -1.5460305 0.43927981 0.1554110 -0.4076643 -1.3342893 -0.1228203 558.71
In [259]:
df <- df %>% column_to_rownames(var='PID')
In [282]:
df %>% select(outcome) %>% head
outcome
1001629.10
1002558.71
1003583.89
1004482.18
1005403.61
1006512.62
In [292]:
n <- nrow(df)
pred <- numeric(n)
for (i in 1:n) {
    y <- df$outcome
    x <- df[, -ncol(df)]

    r <- abs(cor(y[-i], x[-i,]))
    idx <- order(desc(r))[1:5]

    data <- data.frame(y=y, x=x[, idx])
    model <- lm(y ~ ., data=data[-i,])
    pred[i] <- predict(model, data[i,])
}
  • Plot a scatter plot with a linear regression curve for predicted (y) versus observed (x) values using ggplot2
In [293]:
df.1 <- data.frame(observed=df$outcome, predicted=pred)
In [298]:
ggplot(df.1, aes(x=observed, y=predicted)) +
geom_point() +
geom_smooth(method='lm')
Data type cannot be displayed:
../../_images/Computation_Wk4_Day4_PM_Computing_Capstone_Solutions_93_1.png
In [ ]: