Unsupervised Learning
Preprocessing
When calculating distances, we usually want the features to be measured on the same scale. A popular way to do this is to transform each feature so that it has a mean of zero (centering) and a standard deviation of one (scaling).
In [1]:
x <- matrix(rnorm(20), nrow=5)
colnames(x) <- paste("Feature", 1:4)
rownames(x) <- paste("PDD", 1:5)
x <- matrix(rep(c(1,2,10,100), 5), nrow=5, byrow = TRUE) * x
x
Out[1]:
| | Feature 1 | Feature 2 | Feature 3 | Feature 4 |
|---|---|---|---|---|
| PDD 1 | 0.7074344 | -2.8638060 | 9.8519705 | -69.1688351 |
| PDD 2 | 1.001159e+00 | 4.536595e-05 | 8.293787e+00 | -1.301068e+02 |
| PDD 3 | -1.7044244 | 0.4592508 | 15.7773198 | 110.7800991 |
| PDD 4 | 0.7771078 | 1.7948562 | 3.6190092 | 4.7258575 |
| PDD 5 | -0.1943930 | 0.9271623 | 9.3818563 | -22.1818030 |
Pairwise distances
In [2]:
dist(x, method = "euclidean", upper = FALSE)
Out[2]:
PDD 1 PDD 2 PDD 3 PDD 4
PDD 2 61.02584
PDD 3 180.09328 241.01875
PDD 4 74.30332 134.92581 106.78609
PDD 5 47.15068 107.94110 133.12501 27.54867
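As a quick sanity check (using a fresh, seeded random matrix, since the values above came from an unseeded `rnorm` and are not reproducible), the Euclidean distance returned by `dist()` is just the square root of the summed squared feature differences:

```r
# Euclidean distance between two rows is sqrt(sum of squared differences);
# verify dist() against a manual computation for rows 1 and 2
set.seed(1)
x <- matrix(rnorm(20), nrow = 5)
d.manual <- sqrt(sum((x[1, ] - x[2, ])^2))
d.dist <- as.matrix(dist(x, method = "euclidean"))[1, 2]
all.equal(d.manual, d.dist)
```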
Scaling
In [3]:
y <- scale(x, center = TRUE, scale = TRUE)
y
Out[3]:
| | Feature 1 | Feature 2 | Feature 3 | Feature 4 |
|---|---|---|---|---|
| PDD 1 | 0.5287880 | -1.6577085 | 0.1075210 | -0.5343037 |
| PDD 2 | 0.79201355 | -0.03593478 | -0.25109174 | -1.21292760 |
| PDD 3 | -1.6326317 | 0.2241092 | 1.4712276 | 1.4696627 |
| PDD 4 | 0.5912268 | 0.9804507 | -1.3269820 | 0.2886102 |
| PDD 5 | -0.2793967098 | 0.4890833450 | -0.0006748504 | -0.0110416953 |
In [4]:
apply(y, MARGIN = 2, FUN = mean)
Out[4]:
- Feature 1
- 1.11022302462516e-17
- Feature 2
- 1.38777878078145e-17
- Feature 3
- -2.0323152671048e-16
- Feature 4
- 1.31600120883112e-17
In [5]:
apply(y, 2, sd)
Out[5]:
- Feature 1
- 1
- Feature 2
- 1
- Feature 3
- 1
- Feature 4
- 1
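`scale()` with both arguments `TRUE` is equivalent to subtracting each column mean and then dividing by each column standard deviation; a sketch with `sweep()` (again on a fresh seeded matrix) makes the arithmetic explicit:

```r
set.seed(1)
x <- matrix(rnorm(20), nrow = 5)
y1 <- scale(x, center = TRUE, scale = TRUE)
# centering: subtract column means; scaling: divide by column sds (n - 1 denominator)
y2 <- sweep(x, 2, colMeans(x), "-")
y2 <- sweep(y2, 2, apply(x, 2, sd), "/")
all.equal(as.vector(y1), as.vector(y2))
```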
Pairwise distances
In [6]:
dist(y)
Out[6]:
PDD 1 PDD 2 PDD 3 PDD 4
PDD 2 1.813442
PDD 3 3.753472 4.013627
PDD 4 3.114285 2.117902 3.839591
PDD 5 2.355289 1.711959 2.502087 1.687693
Dimension reduction
In [7]:
head(iris, 3)
Out[7]:
Sepal.Length | Sepal.Width | Petal.Length | Petal.Width | Species | |
---|---|---|---|---|---|
1 | 5.1 | 3.5 | 1.4 | 0.2 | setosa |
2 | 4.9 | 3 | 1.4 | 0.2 | setosa |
3 | 4.7 | 3.2 | 1.3 | 0.2 | setosa |
In [8]:
ir <- iris[,1:4]
In [9]:
ir.pca <- prcomp(ir,
center = TRUE,
scale. = TRUE)
In [10]:
summary(ir.pca)
Out[10]:
Importance of components:
PC1 PC2 PC3 PC4
Standard deviation 1.7084 0.9560 0.38309 0.14393
Proportion of Variance 0.7296 0.2285 0.03669 0.00518
Cumulative Proportion 0.7296 0.9581 0.99482 1.00000
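The "Proportion of Variance" row can be reproduced directly from the component standard deviations; with scaled inputs, the total variance equals the number of variables (here 4):

```r
ir.pca <- prcomp(iris[, 1:4], center = TRUE, scale. = TRUE)
# variance explained by each PC, as a share of the total variance
prop <- ir.pca$sdev^2 / sum(ir.pca$sdev^2)
round(prop, 4)
```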
In [11]:
plot(ir.pca, type="l")
In [12]:
options(warn=-1)
suppressMessages(install.packages("devtools",repos = "http://cran.r-project.org"))
library(devtools)
suppressMessages(install_github("vqv/ggbiplot"))
library(ggbiplot)
options(warn=0)
Loading required package: ggplot2
Loading required package: plyr
Loading required package: scales
Loading required package: grid
In [13]:
ggbiplot(ir.pca, groups = iris$Species, var.scale=1, obs.scale=1, ellipse = TRUE)
In [14]:
ir.mds <- cmdscale(dist(ir), k = 2)
In [15]:
par(mfrow=c(1,2), pty="s")
plot(iris[,1], iris[,2], type = "p", asp = 1, col=iris$Species, pch=16, main="Sepal", xlab="Length", ylab="Width")
plot(ir.mds[,1], ir.mds[,2], type = "p", asp = 1, col=iris$Species, pch=16, main="MDS", xlab="", ylab="")
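For Euclidean distances, classical MDS is equivalent to PCA on the centred (but unscaled) data: the two give the same coordinates up to an arbitrary sign flip per axis. This is easy to verify:

```r
ir <- iris[, 1:4]
ir.mds <- cmdscale(dist(ir), k = 2)
ir.pc <- prcomp(ir)$x[, 1:2]   # centred, unscaled PCA scores
# coordinates agree up to a per-axis sign flip
max(abs(abs(ir.mds) - abs(ir.pc)))
```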
Clustering
In [16]:
ir.kmeans <- kmeans(ir, centers=3)
In [17]:
par(mfrow=c(1,2), pty="s")
plot(iris[,1], iris[,2], type = "p", asp = 1, col=iris$Species, pch=16, main="True", xlab="Length", ylab="Width")
plot(iris[,1], iris[,2], type = "p", asp = 1, col=ir.kmeans$cluster, pch=16,
main="K-means", xlab="Length", ylab="Width")
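Because k-means starts from random centres, results vary between runs unless a seed is set. A cross-tabulation against the species labels (seed and `nstart` below are illustrative choices, not from the original run) shows how well the clusters line up:

```r
set.seed(42)                               # illustrative seed for reproducibility
ir.kmeans <- kmeans(iris[, 1:4], centers = 3, nstart = 25)
# rows = cluster ids (numbering is arbitrary), columns = true species
table(ir.kmeans$cluster, iris$Species)
```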
In [18]:
ir.kmeans <- kmeans(ir, centers=6)
In [19]:
par(mfrow=c(1,2), pty="s")
plot(iris[,1], iris[,2], type = "p", asp = 1, col=iris$Species, pch=16, main="True", xlab="Length", ylab="Width")
plot(iris[,1], iris[,2], type = "p", asp = 1, col=ir.kmeans$cluster, pch=16,
main="K-means", xlab="Length", ylab="Width")
In [20]:
ir.ahc <- hclust(dist(ir), method = "complete")
In [21]:
plot(ir.ahc)
rect.hclust(ir.ahc, k=3, border = "red")
In [22]:
groups <- cutree(ir.ahc, k=3)
In [23]:
par(mfrow=c(1,2), pty="s")
plot(iris[,1], iris[,2], type = "p", asp = 1, col=iris$Species, pch=16, main="True", xlab="Length", ylab="Width")
plot(iris[,1], iris[,2], type = "p", asp = 1, col=groups, pch=16,
main="Hierarchical Clustering", xlab="Length", ylab="Width")
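As with k-means, the agreement between the three hierarchical clusters and the species labels can be checked with a contingency table:

```r
# complete-linkage clustering of the iris measurements, cut into 3 groups
ir.ahc <- hclust(dist(iris[, 1:4]), method = "complete")
groups <- cutree(ir.ahc, k = 3)
# rows = cluster ids, columns = true species
table(groups, iris$Species)
```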
Heatmaps
Heatmaps are a graphical means of displaying a matrix of values (e.g. gene expression) together with the results of agglomerative hierarchical clustering of its rows and columns.
In [24]:
library(pheatmap)
In [25]:
pheatmap(mtcars, scale="column")
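pheatmap is a CRAN package; if it is not installed, base R's `heatmap()` produces a similar display. In both functions `scale = "column"` standardises each column before mapping values to colours, so features on very different scales remain comparable:

```r
# base-R equivalent of the pheatmap call above: rows and columns are
# reordered by hierarchical clustering, and each column is standardised
h <- heatmap(as.matrix(mtcars), scale = "column")
str(h$rowInd)   # row ordering produced by the clustering
```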