Unsupervised Learning

Preprocessing

When calculating distances, we usually want the features to be measured on the same scale. One popular way of doing this is to transform each feature so that it has a mean of zero (centering) and a standard deviation of one (scaling).
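The same standardization can be done by hand; a quick sketch in base R (using its own seeded example matrix, since no seed is set below) checking the manual version against `scale()`:

```r
set.seed(1)  # reproducible example data (the notebook itself sets no seed)
x <- matrix(rnorm(20), nrow = 5)
centered <- sweep(x, 2, colMeans(x))           # centering: subtract column means
z <- sweep(centered, 2, apply(x, 2, sd), "/")  # scaling: divide by column sds
all.equal(z, scale(x), check.attributes = FALSE)  # TRUE
```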

In [1]:
x <- matrix(rnorm(20), nrow=5)
colnames(x) <- paste("Feature", 1:4)
rownames(x) <- paste("PDD", 1:5)
x <- matrix(rep(c(1,2,10,100), 5), nrow=5, byrow = TRUE) * x
x
Out[1]:
          Feature 1      Feature 2      Feature 3       Feature 4
PDD 1    0.7074344     -2.8638060      9.8519705     -69.1688351
PDD 2    1.001159e+00   4.536595e-05   8.293787e+00   -1.301068e+02
PDD 3   -1.7044244      0.4592508     15.7773198     110.7800991
PDD 4    0.7771078      1.7948562      3.6190092       4.7258575
PDD 5   -0.1943930      0.9271623      9.3818563     -22.1818030

Pairwise distances

In [2]:
dist(x, method = "euclidean", upper = FALSE)
Out[2]:
PDD 1     PDD 2     PDD 3     PDD 4
PDD 2  61.02584
PDD 3 180.09328 241.01875
PDD 4  74.30332 134.92581 106.78609
PDD 5  47.15068 107.94110 133.12501  27.54867
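Each entry above is the ordinary Euclidean distance between two rows; it can be reproduced by hand (with a fresh seeded matrix, since the notebook set no seed):

```r
set.seed(42)  # fresh example matrix; the values above are not reproducible
x <- matrix(rnorm(20), nrow = 5)
# Euclidean distance between rows 1 and 2, computed from the definition
d12 <- sqrt(sum((x[1, ] - x[2, ])^2))
all.equal(d12, as.matrix(dist(x))[1, 2])  # TRUE
```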

Scaling

In [3]:
y <- scale(x, center = TRUE, scale = TRUE)
y
Out[3]:
          Feature 1       Feature 2       Feature 3       Feature 4
PDD 1    0.5287880      -1.6577085       0.1075210      -0.5343037
PDD 2    0.79201355     -0.03593478     -0.25109174     -1.21292760
PDD 3   -1.6326317       0.2241092       1.4712276       1.4696627
PDD 4    0.5912268       0.9804507      -1.3269820       0.2886102
PDD 5   -0.2793967098    0.4890833450   -0.0006748504   -0.0110416953
In [4]:
apply(y, MARGIN = 2, FUN = mean)
Out[4]:
Feature 1
1.11022302462516e-17
Feature 2
1.38777878078145e-17
Feature 3
-2.0323152671048e-16
Feature 4
1.31600120883112e-17
In [5]:
apply(y, 2, sd)
Out[5]:
Feature 1
1
Feature 2
1
Feature 3
1
Feature 4
1

Pairwise distances

In [6]:
dist(y)
Out[6]:
PDD 1    PDD 2    PDD 3    PDD 4
PDD 2 1.813442
PDD 3 3.753472 4.013627
PDD 4 3.114285 2.117902 3.839591
PDD 5 2.355289 1.711959 2.502087 1.687693

Dimension reduction

In [7]:
head(iris, 3)
Out[7]:
  Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1          5.1         3.5          1.4         0.2  setosa
2          4.9         3.0          1.4         0.2  setosa
3          4.7         3.2          1.3         0.2  setosa
In [8]:
ir <- iris[,1:4]
In [9]:
ir.pca <- prcomp(ir,
                 center = TRUE,
                 scale. = TRUE)
In [10]:
summary(ir.pca)
Out[10]:
Importance of components:
                          PC1    PC2     PC3     PC4
Standard deviation     1.7084 0.9560 0.38309 0.14393
Proportion of Variance 0.7296 0.2285 0.03669 0.00518
Cumulative Proportion  0.7296 0.9581 0.99482 1.00000
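The proportions in this summary come straight from the component standard deviations stored in the fitted object; a quick check:

```r
ir.pca <- prcomp(iris[, 1:4], center = TRUE, scale. = TRUE)
# variance explained = squared sdev of each PC over the total variance
prop_var <- ir.pca$sdev^2 / sum(ir.pca$sdev^2)
round(prop_var, 4)          # matches the "Proportion of Variance" row
round(cumsum(prop_var), 4)  # matches the "Cumulative Proportion" row
```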
In [11]:
plot(ir.pca, type="l")
In [12]:
options(warn=-1)
suppressMessages(install.packages("devtools",repos = "http://cran.r-project.org"))
library(devtools)
suppressMessages(install_github("vqv/ggbiplot"))
library(ggbiplot)
options(warn=0)

The downloaded binary packages are in
    /var/folders/xf/rzdg30ps11g93j3w0h589q780000gn/T//Rtmph45cuG/downloaded_packages
Loading required package: ggplot2
Loading required package: plyr
Loading required package: scales
Loading required package: grid
In [13]:
ggbiplot(ir.pca, groups = iris$Species, var.scale=1, obs.scale=1, ellipse = TRUE)
In [14]:
ir.mds <- cmdscale(dist(ir), k = 2)
In [15]:
par(mfrow=c(1,2), pty="s")
plot(iris[,1], iris[,2], type = "p", asp = 1, col=iris$Species, pch=16, main="Sepal", xlab="Length", ylab="Width")
plot(ir.mds[,1], ir.mds[,2], type = "p", asp = 1, col=iris$Species, pch=16, main="MDS", xlab="", ylab="")
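For Euclidean distances, classical MDS is equivalent to (unscaled) PCA: `cmdscale()` returns the PCA scores up to a sign flip per axis. A quick numerical check of that claim:

```r
ir <- iris[, 1:4]
mds <- cmdscale(dist(ir), k = 2)
pca <- prcomp(ir)$x[, 1:2]  # centered, unscaled PCA scores
# equal up to a per-axis sign flip
all.equal(abs(mds), abs(pca), check.attributes = FALSE, tolerance = 1e-6)
```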

Clustering

In [16]:
ir.kmeans <- kmeans(ir, centers=3)
In [17]:
par(mfrow=c(1,2), pty="s")
plot(iris[,1], iris[,2], type = "p", asp = 1, col=iris$Species, pch=16, main="True", xlab="Length", ylab="Width")
plot(iris[,1], iris[,2], type = "p", asp = 1, col=ir.kmeans$cluster, pch=16,
     main="K-means", xlab="Length", ylab="Width")
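Note that `kmeans()` starts from random centers, so two runs can give different partitions. A common safeguard (not used above) is a fixed seed plus multiple restarts via `nstart`, which keeps the solution with the lowest total within-cluster sum of squares:

```r
set.seed(1)
ir <- iris[, 1:4]
# nstart = 25 runs k-means from 25 random starts and keeps the best fit
fit <- kmeans(ir, centers = 3, nstart = 25)
fit$tot.withinss            # total within-cluster sum of squares
table(fit$cluster, iris$Species)  # clusters vs. true species
```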
In [18]:
ir.kmeans <- kmeans(ir, centers=6)
In [19]:
par(mfrow=c(1,2), pty="s")
plot(iris[,1], iris[,2], type = "p", asp = 1, col=iris$Species, pch=16, main="True", xlab="Length", ylab="Width")
plot(iris[,1], iris[,2], type = "p", asp = 1, col=ir.kmeans$cluster, pch=16,
     main="K-means", xlab="Length", ylab="Width")
In [20]:
ir.ahc <- hclust(dist(ir), method = "complete")
In [21]:
plot(ir.ahc)
rect.hclust(ir.ahc, k=3, border = "red")
In [22]:
groups <- cutree(ir.ahc, k=3)
In [23]:
par(mfrow=c(1,2), pty="s")
plot(iris[,1], iris[,2], type = "p", asp = 1, col=iris$Species, pch=16, main="True", xlab="Length", ylab="Width")
plot(iris[,1], iris[,2], type = "p", asp = 1, col=groups, pch=16,
     main="Hierarchical Clustering", xlab="Length", ylab="Width")
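How well the three hierarchical clusters recover the species can be checked with a cross-tabulation:

```r
ir.ahc <- hclust(dist(iris[, 1:4]), method = "complete")
groups <- cutree(ir.ahc, k = 3)
# rows: recovered clusters; columns: true species
table(groups, iris$Species)
```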

Heatmaps

Heatmaps are a graphical means of displaying a matrix of values (e.g. gene expression) together with the results of agglomerative hierarchical clustering of its rows and columns.

In [24]:
library(pheatmap)
In [25]:
pheatmap(mtcars, scale="column")
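`pheatmap` performs the row and column clustering internally. Assuming its documented defaults (Euclidean distance, complete linkage), the row dendrogram it draws can be reproduced with base R alone:

```r
# scale = "column" in pheatmap z-scores each column; scale() does the same
mt <- scale(mtcars)
# assumed pheatmap defaults: Euclidean distance, complete linkage
row_hc <- hclust(dist(mt), method = "complete")
plot(row_hc)
```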