Supervised Learning

In [1]:
suppressPackageStartupMessages(library(tidyverse))
Conflicts with tidy packages ---------------------------------------------------
In [2]:
options(repr.plot.width=4, repr.plot.height=4)

Regression

In [3]:
set.seed(10)

x <- 1:10
y <- x + rnorm(10, 0.5, 1)
In [4]:
plot(x, y, xlim = c(0, 10), ylim = c(0, 10), pch = 19)
[Figure: scatter plot of x against y]

Detour: Image formatting in base graphics

Point characters

The pch argument to plot selects the point symbol; pch = 19 (used above) draws a solid filled circle. [Chart of the available point characters]
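
A quick way to preview the 25 standard symbols (a sketch; this cell is not part of the original notebook):

plot(1:25, rep(1, 25), pch = 1:25, ylim = c(0.5, 1.5),
     xlab = "pch", ylab = "", yaxt = "n")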

Linear Regression

In [9]:
model.lm <- lm(y ~ x)
In [10]:
summary(model.lm)

Call:
lm(formula = y ~ x)

Residuals:
    Min      1Q  Median      3Q     Max
-1.0211 -0.5231  0.1832  0.4320  0.9085

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  0.31825    0.49193   0.647    0.536
x            0.94384    0.07928  11.905 2.28e-06 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.7201 on 8 degrees of freedom
Multiple R-squared:  0.9466,        Adjusted R-squared:  0.9399
F-statistic: 141.7 on 1 and 8 DF,  p-value: 2.278e-06
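
Note that the fitted slope (0.944) is close to the true slope of 1, and the intercept estimate (0.318) is within one standard error of 0.5, the mean of the added noise.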

Predicting from a fitted model

In [11]:
predict(model.lm)
               1                2                3                4                5
1.26208403232277 2.20591939228898 3.14975475225518 4.09359011222138 5.03742547218758
               6                7                8                9               10
5.98126083215378 6.92509619211998 7.86893155208618 8.81276691205238 9.75660227201858

Predicting for new data

In [12]:
predict(model.lm, data.frame(x = c(1.5, 3.5, 5.5)))
               1                2                3
1.73400171230588 3.62167243223828 5.50934315217068
In [13]:
plot(x, y, xlim = c(0, 10), ylim = c(0, 10), pch = 19)
lines(x, predict(model.lm), col = 3)
[Figure: scatter plot with the fitted regression line]

Alternative

Note that abline draws the fitted line across the full x-axis range of the plot, whereas lines stops at the range of the data.

In [9]:
plot(x, y, xlim = c(0, 10), ylim = c(0, 10), pch = 19)
abline(model.lm, col="purple")
[Figure: scatter plot with the abline fit]

Detour: Colors in base graphics

Colors

The col argument accepts either an integer index into the current palette (e.g. col = 3 above) or a named color. [Chart of the base colors]

You can also use named colors; run colors() to see all the available color names.

In [14]:
colors()
  1. 'white'
  2. 'aliceblue'
  3. 'antiquewhite'
  4. 'antiquewhite1'
  5. 'antiquewhite2'
  6. 'antiquewhite3'
  7. 'antiquewhite4'
  8. 'aquamarine'
  9. 'aquamarine1'
  10. 'aquamarine2'
  11. 'aquamarine3'
  12. 'aquamarine4'
  13. 'azure'
  14. 'azure1'
  15. 'azure2'
  16. 'azure3'
  17. 'azure4'
  18. 'beige'
  19. 'bisque'
  20. 'bisque1'
  21. 'bisque2'
  22. 'bisque3'
  23. 'bisque4'
  24. 'black'
  25. 'blanchedalmond'
  26. 'blue'
  27. 'blue1'
  28. 'blue2'
  29. 'blue3'
  30. 'blue4'
  31. 'blueviolet'
  32. 'brown'
  33. 'brown1'
  34. 'brown2'
  35. 'brown3'
  36. 'brown4'
  37. 'burlywood'
  38. 'burlywood1'
  39. 'burlywood2'
  40. 'burlywood3'
  41. 'burlywood4'
  42. 'cadetblue'
  43. 'cadetblue1'
  44. 'cadetblue2'
  45. 'cadetblue3'
  46. 'cadetblue4'
  47. 'chartreuse'
  48. 'chartreuse1'
  49. 'chartreuse2'
  50. 'chartreuse3'
  51. 'chartreuse4'
  52. 'chocolate'
  53. 'chocolate1'
  54. 'chocolate2'
  55. 'chocolate3'
  56. 'chocolate4'
  57. 'coral'
  58. 'coral1'
  59. 'coral2'
  60. 'coral3'
  61. 'coral4'
  62. 'cornflowerblue'
  63. 'cornsilk'
  64. 'cornsilk1'
  65. 'cornsilk2'
  66. 'cornsilk3'
  67. 'cornsilk4'
  68. 'cyan'
  69. 'cyan1'
  70. 'cyan2'
  71. 'cyan3'
  72. 'cyan4'
  73. 'darkblue'
  74. 'darkcyan'
  75. 'darkgoldenrod'
  76. 'darkgoldenrod1'
  77. 'darkgoldenrod2'
  78. 'darkgoldenrod3'
  79. 'darkgoldenrod4'
  80. 'darkgray'
  81. 'darkgreen'
  82. 'darkgrey'
  83. 'darkkhaki'
  84. 'darkmagenta'
  85. 'darkolivegreen'
  86. 'darkolivegreen1'
  87. 'darkolivegreen2'
  88. 'darkolivegreen3'
  89. 'darkolivegreen4'
  90. 'darkorange'
  91. 'darkorange1'
  92. 'darkorange2'
  93. 'darkorange3'
  94. 'darkorange4'
  95. 'darkorchid'
  96. 'darkorchid1'
  97. 'darkorchid2'
  98. 'darkorchid3'
  99. 'darkorchid4'
  100. 'darkred'
  101. 'darksalmon'
  102. 'darkseagreen'
  103. 'darkseagreen1'
  104. 'darkseagreen2'
  105. 'darkseagreen3'
  106. 'darkseagreen4'
  107. 'darkslateblue'
  108. 'darkslategray'
  109. 'darkslategray1'
  110. 'darkslategray2'
  111. 'darkslategray3'
  112. 'darkslategray4'
  113. 'darkslategrey'
  114. 'darkturquoise'
  115. 'darkviolet'
  116. 'deeppink'
  117. 'deeppink1'
  118. 'deeppink2'
  119. 'deeppink3'
  120. 'deeppink4'
  121. 'deepskyblue'
  122. 'deepskyblue1'
  123. 'deepskyblue2'
  124. 'deepskyblue3'
  125. 'deepskyblue4'
  126. 'dimgray'
  127. 'dimgrey'
  128. 'dodgerblue'
  129. 'dodgerblue1'
  130. 'dodgerblue2'
  131. 'dodgerblue3'
  132. 'dodgerblue4'
  133. 'firebrick'
  134. 'firebrick1'
  135. 'firebrick2'
  136. 'firebrick3'
  137. 'firebrick4'
  138. 'floralwhite'
  139. 'forestgreen'
  140. 'gainsboro'
  141. 'ghostwhite'
  142. 'gold'
  143. 'gold1'
  144. 'gold2'
  145. 'gold3'
  146. 'gold4'
  147. 'goldenrod'
  148. 'goldenrod1'
  149. 'goldenrod2'
  150. 'goldenrod3'
  151. 'goldenrod4'
  152. 'gray'
  153. 'gray0'
  154. 'gray1'
  155. 'gray2'
  156. 'gray3'
  157. 'gray4'
  158. 'gray5'
  159. 'gray6'
  160. 'gray7'
  161. 'gray8'
  162. 'gray9'
  163. 'gray10'
  164. 'gray11'
  165. 'gray12'
  166. 'gray13'
  167. 'gray14'
  168. 'gray15'
  169. 'gray16'
  170. 'gray17'
  171. 'gray18'
  172. 'gray19'
  173. 'gray20'
  174. 'gray21'
  175. 'gray22'
  176. 'gray23'
  177. 'gray24'
  178. 'gray25'
  179. 'gray26'
  180. 'gray27'
  181. 'gray28'
  182. 'gray29'
  183. 'gray30'
  184. 'gray31'
  185. 'gray32'
  186. 'gray33'
  187. 'gray34'
  188. 'gray35'
  189. 'gray36'
  190. 'gray37'
  191. 'gray38'
  192. 'gray39'
  193. 'gray40'
  194. 'gray41'
  195. 'gray42'
  196. 'gray43'
  197. 'gray44'
  198. 'gray45'
  199. 'gray46'
  200. 'gray47'
  201. 'gray48'
  202. 'gray49'
  203. 'gray50'
  204. 'gray51'
  205. 'gray52'
  206. 'gray53'
  207. 'gray54'
  208. 'gray55'
  209. 'gray56'
  210. 'gray57'
  211. 'gray58'
  212. 'gray59'
  213. 'gray60'
  214. 'gray61'
  215. 'gray62'
  216. 'gray63'
  217. 'gray64'
  218. 'gray65'
  219. 'gray66'
  220. 'gray67'
  221. 'gray68'
  222. 'gray69'
  223. 'gray70'
  224. 'gray71'
  225. 'gray72'
  226. 'gray73'
  227. 'gray74'
  228. 'gray75'
  229. 'gray76'
  230. 'gray77'
  231. 'gray78'
  232. 'gray79'
  233. 'gray80'
  234. 'gray81'
  235. 'gray82'
  236. 'gray83'
  237. 'gray84'
  238. 'gray85'
  239. 'gray86'
  240. 'gray87'
  241. 'gray88'
  242. 'gray89'
  243. 'gray90'
  244. 'gray91'
  245. 'gray92'
  246. 'gray93'
  247. 'gray94'
  248. 'gray95'
  249. 'gray96'
  250. 'gray97'
  251. 'gray98'
  252. 'gray99'
  253. 'gray100'
  254. 'green'
  255. 'green1'
  256. 'green2'
  257. 'green3'
  258. 'green4'
  259. 'greenyellow'
  260. 'grey'
  261. 'grey0'
  262. 'grey1'
  263. 'grey2'
  264. 'grey3'
  265. 'grey4'
  266. 'grey5'
  267. 'grey6'
  268. 'grey7'
  269. 'grey8'
  270. 'grey9'
  271. 'grey10'
  272. 'grey11'
  273. 'grey12'
  274. 'grey13'
  275. 'grey14'
  276. 'grey15'
  277. 'grey16'
  278. 'grey17'
  279. 'grey18'
  280. 'grey19'
  281. 'grey20'
  282. 'grey21'
  283. 'grey22'
  284. 'grey23'
  285. 'grey24'
  286. 'grey25'
  287. 'grey26'
  288. 'grey27'
  289. 'grey28'
  290. 'grey29'
  291. 'grey30'
  292. 'grey31'
  293. 'grey32'
  294. 'grey33'
  295. 'grey34'
  296. 'grey35'
  297. 'grey36'
  298. 'grey37'
  299. 'grey38'
  300. 'grey39'
  301. 'grey40'
  302. 'grey41'
  303. 'grey42'
  304. 'grey43'
  305. 'grey44'
  306. 'grey45'
  307. 'grey46'
  308. 'grey47'
  309. 'grey48'
  310. 'grey49'
  311. 'grey50'
  312. 'grey51'
  313. 'grey52'
  314. 'grey53'
  315. 'grey54'
  316. 'grey55'
  317. 'grey56'
  318. 'grey57'
  319. 'grey58'
  320. 'grey59'
  321. 'grey60'
  322. 'grey61'
  323. 'grey62'
  324. 'grey63'
  325. 'grey64'
  326. 'grey65'
  327. 'grey66'
  328. 'grey67'
  329. 'grey68'
  330. 'grey69'
  331. 'grey70'
  332. 'grey71'
  333. 'grey72'
  334. 'grey73'
  335. 'grey74'
  336. 'grey75'
  337. 'grey76'
  338. 'grey77'
  339. 'grey78'
  340. 'grey79'
  341. 'grey80'
  342. 'grey81'
  343. 'grey82'
  344. 'grey83'
  345. 'grey84'
  346. 'grey85'
  347. 'grey86'
  348. 'grey87'
  349. 'grey88'
  350. 'grey89'
  351. 'grey90'
  352. 'grey91'
  353. 'grey92'
  354. 'grey93'
  355. 'grey94'
  356. 'grey95'
  357. 'grey96'
  358. 'grey97'
  359. 'grey98'
  360. 'grey99'
  361. 'grey100'
  362. 'honeydew'
  363. 'honeydew1'
  364. 'honeydew2'
  365. 'honeydew3'
  366. 'honeydew4'
  367. 'hotpink'
  368. 'hotpink1'
  369. 'hotpink2'
  370. 'hotpink3'
  371. 'hotpink4'
  372. 'indianred'
  373. 'indianred1'
  374. 'indianred2'
  375. 'indianred3'
  376. 'indianred4'
  377. 'ivory'
  378. 'ivory1'
  379. 'ivory2'
  380. 'ivory3'
  381. 'ivory4'
  382. 'khaki'
  383. 'khaki1'
  384. 'khaki2'
  385. 'khaki3'
  386. 'khaki4'
  387. 'lavender'
  388. 'lavenderblush'
  389. 'lavenderblush1'
  390. 'lavenderblush2'
  391. 'lavenderblush3'
  392. 'lavenderblush4'
  393. 'lawngreen'
  394. 'lemonchiffon'
  395. 'lemonchiffon1'
  396. 'lemonchiffon2'
  397. 'lemonchiffon3'
  398. 'lemonchiffon4'
  399. 'lightblue'
  400. 'lightblue1'
  401. 'lightblue2'
  402. 'lightblue3'
  403. 'lightblue4'
  404. 'lightcoral'
  405. 'lightcyan'
  406. 'lightcyan1'
  407. 'lightcyan2'
  408. 'lightcyan3'
  409. 'lightcyan4'
  410. 'lightgoldenrod'
  411. 'lightgoldenrod1'
  412. 'lightgoldenrod2'
  413. 'lightgoldenrod3'
  414. 'lightgoldenrod4'
  415. 'lightgoldenrodyellow'
  416. 'lightgray'
  417. 'lightgreen'
  418. 'lightgrey'
  419. 'lightpink'
  420. 'lightpink1'
  421. 'lightpink2'
  422. 'lightpink3'
  423. 'lightpink4'
  424. 'lightsalmon'
  425. 'lightsalmon1'
  426. 'lightsalmon2'
  427. 'lightsalmon3'
  428. 'lightsalmon4'
  429. 'lightseagreen'
  430. 'lightskyblue'
  431. 'lightskyblue1'
  432. 'lightskyblue2'
  433. 'lightskyblue3'
  434. 'lightskyblue4'
  435. 'lightslateblue'
  436. 'lightslategray'
  437. 'lightslategrey'
  438. 'lightsteelblue'
  439. 'lightsteelblue1'
  440. 'lightsteelblue2'
  441. 'lightsteelblue3'
  442. 'lightsteelblue4'
  443. 'lightyellow'
  444. 'lightyellow1'
  445. 'lightyellow2'
  446. 'lightyellow3'
  447. 'lightyellow4'
  448. 'limegreen'
  449. 'linen'
  450. 'magenta'
  451. 'magenta1'
  452. 'magenta2'
  453. 'magenta3'
  454. 'magenta4'
  455. 'maroon'
  456. 'maroon1'
  457. 'maroon2'
  458. 'maroon3'
  459. 'maroon4'
  460. 'mediumaquamarine'
  461. 'mediumblue'
  462. 'mediumorchid'
  463. 'mediumorchid1'
  464. 'mediumorchid2'
  465. 'mediumorchid3'
  466. 'mediumorchid4'
  467. 'mediumpurple'
  468. 'mediumpurple1'
  469. 'mediumpurple2'
  470. 'mediumpurple3'
  471. 'mediumpurple4'
  472. 'mediumseagreen'
  473. 'mediumslateblue'
  474. 'mediumspringgreen'
  475. 'mediumturquoise'
  476. 'mediumvioletred'
  477. 'midnightblue'
  478. 'mintcream'
  479. 'mistyrose'
  480. 'mistyrose1'
  481. 'mistyrose2'
  482. 'mistyrose3'
  483. 'mistyrose4'
  484. 'moccasin'
  485. 'navajowhite'
  486. 'navajowhite1'
  487. 'navajowhite2'
  488. 'navajowhite3'
  489. 'navajowhite4'
  490. 'navy'
  491. 'navyblue'
  492. 'oldlace'
  493. 'olivedrab'
  494. 'olivedrab1'
  495. 'olivedrab2'
  496. 'olivedrab3'
  497. 'olivedrab4'
  498. 'orange'
  499. 'orange1'
  500. 'orange2'
  501. 'orange3'
  502. 'orange4'
  503. 'orangered'
  504. 'orangered1'
  505. 'orangered2'
  506. 'orangered3'
  507. 'orangered4'
  508. 'orchid'
  509. 'orchid1'
  510. 'orchid2'
  511. 'orchid3'
  512. 'orchid4'
  513. 'palegoldenrod'
  514. 'palegreen'
  515. 'palegreen1'
  516. 'palegreen2'
  517. 'palegreen3'
  518. 'palegreen4'
  519. 'paleturquoise'
  520. 'paleturquoise1'
  521. 'paleturquoise2'
  522. 'paleturquoise3'
  523. 'paleturquoise4'
  524. 'palevioletred'
  525. 'palevioletred1'
  526. 'palevioletred2'
  527. 'palevioletred3'
  528. 'palevioletred4'
  529. 'papayawhip'
  530. 'peachpuff'
  531. 'peachpuff1'
  532. 'peachpuff2'
  533. 'peachpuff3'
  534. 'peachpuff4'
  535. 'peru'
  536. 'pink'
  537. 'pink1'
  538. 'pink2'
  539. 'pink3'
  540. 'pink4'
  541. 'plum'
  542. 'plum1'
  543. 'plum2'
  544. 'plum3'
  545. 'plum4'
  546. 'powderblue'
  547. 'purple'
  548. 'purple1'
  549. 'purple2'
  550. 'purple3'
  551. 'purple4'
  552. 'red'
  553. 'red1'
  554. 'red2'
  555. 'red3'
  556. 'red4'
  557. 'rosybrown'
  558. 'rosybrown1'
  559. 'rosybrown2'
  560. 'rosybrown3'
  561. 'rosybrown4'
  562. 'royalblue'
  563. 'royalblue1'
  564. 'royalblue2'
  565. 'royalblue3'
  566. 'royalblue4'
  567. 'saddlebrown'
  568. 'salmon'
  569. 'salmon1'
  570. 'salmon2'
  571. 'salmon3'
  572. 'salmon4'
  573. 'sandybrown'
  574. 'seagreen'
  575. 'seagreen1'
  576. 'seagreen2'
  577. 'seagreen3'
  578. 'seagreen4'
  579. 'seashell'
  580. 'seashell1'
  581. 'seashell2'
  582. 'seashell3'
  583. 'seashell4'
  584. 'sienna'
  585. 'sienna1'
  586. 'sienna2'
  587. 'sienna3'
  588. 'sienna4'
  589. 'skyblue'
  590. 'skyblue1'
  591. 'skyblue2'
  592. 'skyblue3'
  593. 'skyblue4'
  594. 'slateblue'
  595. 'slateblue1'
  596. 'slateblue2'
  597. 'slateblue3'
  598. 'slateblue4'
  599. 'slategray'
  600. 'slategray1'
  601. 'slategray2'
  602. 'slategray3'
  603. 'slategray4'
  604. 'slategrey'
  605. 'snow'
  606. 'snow1'
  607. 'snow2'
  608. 'snow3'
  609. 'snow4'
  610. 'springgreen'
  611. 'springgreen1'
  612. 'springgreen2'
  613. 'springgreen3'
  614. 'springgreen4'
  615. 'steelblue'
  616. 'steelblue1'
  617. 'steelblue2'
  618. 'steelblue3'
  619. 'steelblue4'
  620. 'tan'
  621. 'tan1'
  622. 'tan2'
  623. 'tan3'
  624. 'tan4'
  625. 'thistle'
  626. 'thistle1'
  627. 'thistle2'
  628. 'thistle3'
  629. 'thistle4'
  630. 'tomato'
  631. 'tomato1'
  632. 'tomato2'
  633. 'tomato3'
  634. 'tomato4'
  635. 'turquoise'
  636. 'turquoise1'
  637. 'turquoise2'
  638. 'turquoise3'
  639. 'turquoise4'
  640. 'violet'
  641. 'violetred'
  642. 'violetred1'
  643. 'violetred2'
  644. 'violetred3'
  645. 'violetred4'
  646. 'wheat'
  647. 'wheat1'
  648. 'wheat2'
  649. 'wheat3'
  650. 'wheat4'
  651. 'whitesmoke'
  652. 'yellow'
  653. 'yellow1'
  654. 'yellow2'
  655. 'yellow3'
  656. 'yellow4'
  657. 'yellowgreen'
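
A quick sketch contrasting the two styles (not part of the original notebook; the col values are arbitrary):

# Integers index the current palette; strings select named colors
plot(1:8, rep(1, 8), pch = 19, cex = 2, col = 1:8,
     ylim = c(0, 2), xlab = "palette index", ylab = "", yaxt = "n")
points(1:8, rep(0.5, 8), pch = 19, cex = 2, col = "cornflowerblue")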

Polynomial Regression

In [15]:
model.poly5 <- lm(y ~ poly(x, 5))
In [16]:
plot(x, y, xlim = c(0, 10), ylim = c(0, 10), pch = 19)
lines(x, predict(model.poly5), col = "orange")
[Figure: scatter plot with the degree-5 polynomial fit]

Spline Regression

Splines are essentially piecewise polynomial fits. There are two parameters:

  • degree determines the type of piecewise polynomial used (e.g. degree=3 uses cubic polynomials)
  • knots are the locations where the pieces meet (and so determine the number of pieces)
In [17]:
library(splines)
In [18]:
model.spl <- lm(y ~ bs(x, degree=3, knots=4))
In [19]:
plot(x, y, xlim = c(0, 10), ylim = c(0, 10), pch = 19)
lines(x, predict(model.spl), col = 3)
[Figure: scatter plot with the cubic spline fit]
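
Adding knots gives the fit more pieces and hence more flexibility. A sketch reusing the x and y from above (the knot positions are chosen arbitrarily):

# Three interior knots instead of one
model.spl2 <- lm(y ~ bs(x, degree = 3, knots = c(3, 5, 7)))
plot(x, y, xlim = c(0, 10), ylim = c(0, 10), pch = 19)
lines(x, predict(model.spl2), col = "blue")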

Connect the dots

In [20]:
plot(x, y, xlim = c(0, 10), ylim = c(0, 10), pch = 19)
lines(x, y, col="red")
[Figure: scatter plot with the points connected by line segments]

Classification

In [21]:
levels(iris$Species)
  1. 'setosa'
  2. 'versicolor'
  3. 'virginica'

Can we separate two species of iris?

In [22]:
df <- iris %>% filter(Species != "setosa")
df$Species <- droplevels(df$Species)
In [23]:
options(repr.plot.width=8, repr.plot.height=4)
par(mfrow=c(1,2))
plot(df[,1], df[,2], col=as.integer(df$Species))
plot(df[,3], df[,4], col=as.integer(df$Species))
[Figure: two panels, sepal and petal measurements, colored by species]
In [24]:
sample.int(10, 5)
  1. 10
  2. 5
  3. 6
  4. 9
  5. 1

Split into 50% training and 50% test sets

In [25]:
library(class)
In [26]:
set.seed(10)
sample <- sample.int(n = nrow(df), size = floor(0.5 * nrow(df)), replace = FALSE)
train <- df[sample, 1:4]
test  <- df[-sample, 1:4]
train.cls <- df$Species[sample]
test.cls  <- df$Species[-sample]
In [27]:
head(train)
   Sepal.Length Sepal.Width Petal.Length Petal.Width
51          6.3         3.3          6.0         2.5
31          5.5         2.4          3.8         1.1
42          6.1         3.0          4.6         1.4
68          7.7         3.8          6.7         2.2
9           6.6         2.9          4.6         1.3
22          6.1         2.8          4.0         1.3
In [28]:
head(train.cls)
  1. virginica
  2. versicolor
  3. versicolor
  4. virginica
  5. versicolor
  6. versicolor
In [29]:
test.pred <- knn(train, test, train.cls, k = 3)
In [31]:
head(test.pred)
  1. versicolor
  2. versicolor
  3. versicolor
  4. versicolor
  5. versicolor
  6. versicolor
In [32]:
options(repr.plot.width=8, repr.plot.height=4)
par(mfrow=c(1,2))
plot(test[,1], test[,2], col=as.integer(test.cls), main="Truth")
plot(test[,1], test[,2], col=as.integer(test.pred), main="Predicted")
[Figure: test set, truth (left) vs. kNN predictions (right)]
In [33]:
table(test.pred, test.cls)
            test.cls
test.pred    versicolor virginica
  versicolor         25         0
  virginica           3        22
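
From the table, 47 of the 50 test flowers are classified correctly. The test error rate can be computed directly:

# Proportion of misclassified test cases (here 3/50 = 0.06)
mean(test.pred != test.cls)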

Logistic Regression

In [34]:
head(train)
   Sepal.Length Sepal.Width Petal.Length Petal.Width
51          6.3         3.3          6.0         2.5
31          5.5         2.4          3.8         1.1
42          6.1         3.0          4.6         1.4
68          7.7         3.8          6.7         2.2
9           6.6         2.9          4.6         1.3
22          6.1         2.8          4.0         1.3

The warnings below arise because vanilla logistic regression does not cope with perfectly separated data sets. The usual remedy is to add a penalization factor (see the sketch after the model summary).

In [35]:
model.logistic <- glm(train.cls ~ .,
                      family=binomial(link='logit'), data=train)
Warning message:
“glm.fit: algorithm did not converge”
Warning message:
“glm.fit: fitted probabilities numerically 0 or 1 occurred”
In [36]:
summary(model.logistic)

Call:
glm(formula = train.cls ~ ., family = binomial(link = "logit"),
    data = train)

Deviance Residuals:
       Min          1Q      Median          3Q         Max
-2.347e-05  -2.110e-08   2.110e-08   2.110e-08   2.792e-05

Coefficients:
               Estimate Std. Error z value Pr(>|z|)
(Intercept)   1.658e+00  1.263e+06   0.000    1.000
Sepal.Length -6.152e+01  9.061e+04  -0.001    0.999
Sepal.Width  -8.214e+01  4.829e+05   0.000    1.000
Petal.Length  7.759e+01  1.673e+05   0.000    1.000
Petal.Width   1.453e+02  2.148e+05   0.001    0.999

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 6.8593e+01  on 49  degrees of freedom
Residual deviance: 2.0984e-09  on 45  degrees of freedom
AIC: 10

Number of Fisher Scoring iterations: 25
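
As noted above, one remedy for the perfect separation is to penalize the coefficients. A minimal sketch using the glmnet package (an assumption: glmnet is not loaded anywhere in this notebook and must be installed separately):

# Ridge-penalized logistic regression; the penalty keeps the coefficients
# finite even when the classes are perfectly separable
library(glmnet)
model.ridge <- glmnet(as.matrix(train), train.cls, family = "binomial", alpha = 0)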

In [37]:
test.pred <- ifelse(predict(model.logistic, test) < 0, 1, 2)
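
By default predict on a glm object returns values on the link (log-odds) scale, so thresholding at 0 is the same as thresholding the class probability at 0.5:

# Equivalent, on the probability scale
test.prob <- predict(model.logistic, test, type = "response")
test.pred <- ifelse(test.prob < 0.5, 1, 2)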
In [38]:
options(repr.plot.width=8, repr.plot.height=4)
par(mfrow=c(1,2))
plot(test[,1], test[,2], col=as.integer(test.cls), main="Truth")
plot(test[,1], test[,2], col=test.pred, main="Predicted")
[Figure: test set, truth (left) vs. logistic regression predictions (right)]
In [39]:
table(test.pred, test.cls)
          test.cls
test.pred versicolor virginica
        1         24         3
        2          4        19

Exercise

Perform a kNN with 5 neighbors, using only the first 2 features of the iris data set, to separate the versicolor and virginica species.

In [41]:
head(iris, 3)
  Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1          5.1         3.5          1.4         0.2  setosa
2          4.9         3.0          1.4         0.2  setosa
3          4.7         3.2          1.3         0.2  setosa
In [44]:
df <- iris %>% filter(Species != "setosa")
In [45]:
df$Species <- droplevels(df$Species)
In [46]:
levels(df$Species)
  1. 'versicolor'
  2. 'virginica'
In [47]:
dim(df)
  1. 100
  2. 5
In [48]:
data <- df[,1:2]
In [49]:
head(data,3)
  Sepal.Length Sepal.Width
1          7.0         3.2
2          6.4         3.2
3          6.9         3.1
In [50]:
cls <- df[,5]
In [51]:
head(cls)
  1. versicolor
  2. versicolor
  3. versicolor
  4. versicolor
  5. versicolor
  6. versicolor
In [52]:
idx <- sample.int(nrow(data), 50)
In [53]:
idx
  1. 53
  2. 4
  3. 54
  4. 37
  5. 93
  6. 25
  7. 20
  8. 81
  9. 43
  10. 21
  11. 57
  12. 19
  13. 2
  14. 70
  15. 24
  16. 15
  17. 85
  18. 41
  19. 59
  20. 48
  21. 42
  22. 31
  23. 66
  24. 94
  25. 80
  26. 56
  27. 3
  28. 26
  29. 5
  30. 27
  31. 29
  32. 64
  33. 61
  34. 78
  35. 32
  36. 46
  37. 28
  38. 50
  39. 100
  40. 12
  41. 67
  42. 76
  43. 62
  44. 84
  45. 45
  46. 34
  47. 38
  48. 6
  49. 18
  50. 17
In [54]:
train <- data[idx,]
train.cls <- cls[idx]
test <- data[-idx,]
test.cls <- cls[-idx]
In [55]:
dim(train)
  1. 50
  2. 2
In [56]:
dim(test)
  1. 50
  2. 2
In [57]:
length(test.cls)
50
In [58]:
length(train.cls)
50
In [59]:
train[1:3,]
   Sepal.Length Sepal.Width
53          7.1         3.0
4           5.5         2.3
54          6.3         2.9
In [60]:
test[1:3,]
  Sepal.Length Sepal.Width
1          7.0         3.2
7          6.3         3.3
8          4.9         2.4
In [61]:
pred <- knn(train, test, train.cls, k=5)
In [62]:
pred[1:5]
  1. virginica
  2. virginica
  3. versicolor
  4. versicolor
  5. versicolor
In [63]:
table(pred, test.cls)
             test.cls
pred         versicolor virginica
  versicolor         16        16
  virginica           4        14
In [65]:
options(repr.plot.width=8, repr.plot.height=4)
In [69]:
par(mfrow = c(1,2))
plot(test[,1], test[,2], col=as.integer(test.cls), main="True")
plot(test[,1], test[,2], col=as.integer(pred), main="Predicted")
[Figure: test set, truth (left) vs. kNN predictions (right)]

Overfitting

In [70]:
set.seed(10)

x <- 1:10
y <- 0.1*x^2 + 0.2*x + rnorm(10, 0.5, 1)
In [71]:
m1 <- lm(y ~ x)
m2 <- lm(y ~ poly(x, 2))
m5 <- lm(y ~ poly(x, 5))
m9 <- lm(y ~ poly(x, 9))
In [72]:
options(repr.plot.width=8, repr.plot.height=8)

xp <- seq(0, 10, length.out = 100)
df <- data.frame(x = xp)

par(mfrow=c(2,2))
plot(x, y, xlim = c(0, 10), pch = 19)
lines(xp, predict(m1, df), col = "red")
plot(x, y, xlim = c(0, 10), pch = 19)
lines(xp, predict(m2, df), col = "red")
plot(x, y, xlim = c(0, 10), pch = 19)
lines(xp, predict(m5, df), col = "red")
plot(x, y, xlim = c(0, 10), pch = 19)
lines(xp, predict(m9, df), col = "red")
[Figure: 2x2 grid of fits for polynomial degrees 1, 2, 5 and 9]

Evaluate model RSS using in-sample testing (Error resubstitution)

In [73]:
# In-sample residual sum of squares (RSS) for each fitted model
for (model in list(m1, m2, m5, m9)) {
    print(round(sum((predict(model, data.frame(x = x)) - y)^2), 2))
}
[1] 9.18
[1] 4.15
[1] 2.09
[1] 0
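
The degree-9 polynomial has 10 coefficients, enough to interpolate all 10 data points exactly, which is why its in-sample RSS is zero.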

Evaluate model RSS using out-of-sample testing

In [74]:
test.x <- runif(10, 0, 10)
test.y <- 0.1*test.x^2 + 0.2*test.x + rnorm(10, 0.5, 3)
In [75]:
# Out-of-sample RSS on freshly simulated test data
for (model in list(m1, m2, m5, m9)) {
    print(round(sum((predict(model, data.frame(x = test.x)) - test.y)^2), 2))
}
[1] 112.25
[1] 103.79
[1] 99.8
[1] 111.98

Cross-validation

In [76]:
suppressPackageStartupMessages(library(R.utils))
Error in library(R.utils): there is no package called ‘R.utils’
Traceback:

1. suppressPackageStartupMessages(library(R.utils))
2. withCallingHandlers(expr, packageStartupMessage = function(c) invokeRestart("muffleMessage"))
3. library(R.utils)
4. stop(txt, domain = NA)
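
(The failed load is harmless here: the cross-validation loop below uses only base R.)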
In [77]:
# Leave-one-out cross-validation: for each polynomial degree k, refit on
# 9 points and accumulate the squared error on the single held-out point
for (k in 1:5) {
    rss <- 0
    for (i in 1:10) {
        xmo <- x[-i]   # training data with observation i left out
        ymo <- y[-i]
        model <- lm(ymo ~ poly(xmo, k))
        res <- (predict(model, data.frame(xmo = x[i])) - y[i])^2
        rss <- rss + res
    }
    print(k)
    print(rss)
}
[1] 1
       1
17.61035
[1] 2
      1
8.68587
[1] 3
       1
19.42014
[1] 4
       1
20.58078
[1] 5
       1
50.26568
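
The cross-validated RSS is smallest for k = 2, which matches the quadratic form of the data-generating process.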

Feature Selection

In [79]:
suppressPackageStartupMessages(library(genefilter))
Error in library(genefilter): there is no package called ‘genefilter’
Traceback:

1. suppressPackageStartupMessages(library(genefilter))
2. withCallingHandlers(expr, packageStartupMessage = function(c) invokeRestart("muffleMessage"))
3. library(genefilter)
4. stop(txt, domain = NA)
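
(genefilter is a Bioconductor package that provides rowttests, used below; if it is missing it can be installed with BiocManager::install("genefilter").)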
In [41]:
set.seed(123)
n = 20
m = 1000
EXPRS = matrix(rnorm(2 * n * m), 2 * n, m)
rownames(EXPRS) = paste("pt", 1:(2 * n), sep = "")
colnames(EXPRS) = paste("g", 1:m, sep = "")
grp = as.factor(rep(1:2, c(n, n)))
In [42]:
head(EXPRS, 3)
              g1          g2            g3          g4         g5          g6          g7          g8          g9         g10
pt1   -0.5604756  -0.6947070   0.005764186   0.1176466   1.052711   2.1988103  -0.7886220  -1.6674751   0.2374303  -0.2052993
pt2   -0.2301775  -0.2079173   0.385280401  -0.9474746  -1.049177   1.3124130  -0.5021987   0.7364960   1.2181086   0.6511933
pt3    1.5587083  -1.2653964  -0.370660032  -0.4905574  -1.260155  -0.2651451   1.4960607   0.3860266  -1.3387743   0.2737665

     ...       g991        g992        g993        g994        g995        g996        g997        g998        g999       g1000
pt1  ...  0.3780725    1.974814  -0.4535280  -0.4552866  -2.3004639  -0.3804398   0.2870161  -0.2018602  -1.6727583   1.1379048
pt2  ...  0.5981352   -1.021826  -2.0371182   1.5636880  -0.9501855   0.6671640  -0.6702249   1.1181721  -0.5414325   1.2684239
pt3  ...  0.5774870    0.853561  -0.3030158  -0.1434414  -0.8478627   0.2413405  -0.5417177   0.1625707   0.1995339   0.0427062
In [43]:
stats = abs(rowttests(t(EXPRS), grp)$statistic)
In [44]:
head(stats,3)
  1. 0.674624258566226
  2. 0.741717473727548
  3. 3.02542265710511
In [45]:
ii <- order(-stats)
In [46]:
TOPEXPRS <- EXPRS[, ii[1:10]]

Error Resubstitution (In-sample error)

In [47]:
mod0 = knn(train = TOPEXPRS, test = TOPEXPRS, cl = grp, k = 3)
table(mod0, grp)
    grp
mod0  1  2
   1 17  0
   2  3 20

Optimistic Cross-validated predictions

Note: feature selection was done on the full data set, outside the CV loop, so the results are OVER-OPTIMISTIC

In [48]:
mode1 = knn.cv(TOPEXPRS, grp, k = 3)
table(mode1, grp)
     grp
mode1  1  2
    1 16  0
    2  4 20
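
Even though the expression matrix is pure noise, the naive procedure reports 36 of 40 samples (90%) correctly classified, an artifact of selecting the features on the full data set before cross-validating.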

Cross-validation done right (fancy version)

In [49]:
suppressPackageStartupMessages(library(multtest))
In [50]:
# For each left-out sample i, rank the features on the remaining samples
# and return the names of the top fsnum features (one column per sample)
top.features <- function(EXP, resp, test, fsnum) {
    top.features.i <- function(i, EXP, resp, test, fsnum) {
        stats <- abs(mt.teststat(EXP[, -i], resp[-i], test = test))
        ii <- order(-stats)[1:fsnum]
        rownames(EXP)[ii]
    }
    sapply(1:ncol(EXP), top.features.i,
           EXP = EXP, resp = resp, test = test, fsnum = fsnum)
}
In [51]:
# Leave-one-out CV for knn in which the feature selection is redone
# inside every fold, using only the n - 1 training samples
knn.loocv <- function(EXP, resp, test, k, fsnum, tabulate = FALSE, permute = FALSE) {
    if (permute) {
        resp <- sample(resp)
    }
    topfeat <- top.features(EXP, resp, test, fsnum)  # one column of feature names per fold
    pids <- rownames(EXP)                            # probe (feature) IDs
    EXP <- t(EXP)                                    # now samples x features
    colnames(EXP) <- as.character(pids)
    knn.loocv.i <- function(i, EXP, resp, k, topfeat) {
        ii <- topfeat[, i]  # features selected without sample i
        knn(train = EXP[-i, ii], test = EXP[i, ii], cl = resp[-i], k = k)[1]
    }
    out <- sapply(1:nrow(EXP), knn.loocv.i,
                  EXP = EXP, resp = resp, k = k, topfeat = topfeat)
    if (tabulate)
        out <- ftable(pred = out, obs = resp)
    return(out)
}

Reminder of what the data look like

In [52]:
EXPRS[1:5, 1:5]
              g1          g2            g3          g4          g5
pt1  -0.56047565  -0.6947070   0.005764186   0.1176466   1.0527115
pt2  -0.23017749  -0.2079173   0.385280401  -0.9474746  -1.0491770
pt3   1.55870831  -1.2653964  -0.370660032  -0.4905574  -1.2601552
pt4   0.07050839   2.1689560   0.644376549  -0.2560922   3.2410399
pt5   0.12928774   1.2079620  -0.220486562   1.8438620  -0.4168576
In [53]:
levels(grp)
  1. '1'
  2. '2'
In [54]:
knn.loocv(t(EXPRS), grp, "t.equalvar", 3, 10, TRUE)
     obs  1  2
pred
1         7  7
2        13 13
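
The accuracy is now 20 of 40, i.e. chance level, which is the honest answer for a data set with no real signal.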

Cross-validation done right (simple version)

In [55]:
# Same idea without the helper functions: redo the t-tests and the
# top-10 feature selection inside every leave-one-out fold
pred <- numeric(2*n)
for (i in 1:(2*n)) {
    stats <- abs(rowttests(t(EXPRS[-i, ]), grp[-i])$statistic)
    ii <- order(-stats)
    TOPEXPRS <- EXPRS[-i, ii[1:10]]
    pred[i] <- knn(TOPEXPRS, EXPRS[i, ii[1:10]], grp[-i], k = 3)
}
In [56]:
table(pred, grp)
    grp
pred  1  2
   1  7  7
   2 13 13

Exercise

  • Repeat the last experiment with a noisy quantitative outcome
  • First simulate a data matrix of dimension n = 50 (patients) by m = 50000 (genes)
  • Next draw the outcome for the n = 50 patients from a standard normal distribution, independent of the data matrix
  • There is no relationship between the expressions and the outcome (by design)
  • Compare naive LOOCV using the top 10 features with LOOCV done properly (a sketch of the set-up follows)
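
A sketch of the set-up only (the variable names are hypothetical; completing the comparison is the exercise):

# Pure-noise expression matrix and an independent quantitative outcome
set.seed(1)
n <- 50
m <- 50000
X <- matrix(rnorm(n * m), n, m)   # n patients x m genes
yq <- rnorm(n)                    # outcome, independent of X by construction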