Unsupervised Learning

Preprocessing

When calculating distances, we usually want the features to be measured on the same scale. One popular way of doing this is to transform each feature so that it has a mean of zero (centering) and a standard deviation of one (scaling).
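The same standardization can be done by hand; a quick sketch in base R (using its own seeded example matrix, since no seed is set below) checking the manual version against `scale()`:

```r
set.seed(1)  # reproducible example data (the notebook itself sets no seed)
x <- matrix(rnorm(20), nrow = 5)
centered <- sweep(x, 2, colMeans(x))           # centering: subtract column means
z <- sweep(centered, 2, apply(x, 2, sd), "/")  # scaling: divide by column sds
all.equal(z, scale(x), check.attributes = FALSE)  # TRUE
```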

In [1]:
x <- matrix(rnorm(20), nrow=5)
colnames(x) <- paste("Feature", 1:4)
rownames(x) <- paste("PDD", 1:5)
x <- matrix(rep(c(1,2,10,100), 5), nrow=5, byrow = TRUE) * x
x
Out[1]:
          Feature 1      Feature 2      Feature 3       Feature 4
PDD 1    0.7074344     -2.8638060      9.8519705     -69.1688351
PDD 2    1.001159e+00   4.536595e-05   8.293787e+00   -1.301068e+02
PDD 3   -1.7044244      0.4592508     15.7773198     110.7800991
PDD 4    0.7771078      1.7948562      3.6190092       4.7258575
PDD 5   -0.1943930      0.9271623      9.3818563     -22.1818030

Pairwise distances

In [2]:
dist(x, method = "euclidean", upper = FALSE)
Out[2]:
PDD 1     PDD 2     PDD 3     PDD 4
PDD 2  61.02584
PDD 3 180.09328 241.01875
PDD 4  74.30332 134.92581 106.78609
PDD 5  47.15068 107.94110 133.12501  27.54867
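Each entry above is the ordinary Euclidean distance between two rows; it can be reproduced by hand (with a fresh seeded matrix, since the notebook set no seed):

```r
set.seed(42)  # fresh example matrix; the values above are not reproducible
x <- matrix(rnorm(20), nrow = 5)
# Euclidean distance between rows 1 and 2, computed from the definition
d12 <- sqrt(sum((x[1, ] - x[2, ])^2))
all.equal(d12, as.matrix(dist(x))[1, 2])  # TRUE
```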

Scaling

In [3]:
y <- scale(x, center = TRUE, scale = TRUE)
y
Out[3]:
          Feature 1       Feature 2       Feature 3       Feature 4
PDD 1    0.5287880      -1.6577085       0.1075210      -0.5343037
PDD 2    0.79201355     -0.03593478     -0.25109174     -1.21292760
PDD 3   -1.6326317       0.2241092       1.4712276       1.4696627
PDD 4    0.5912268       0.9804507      -1.3269820       0.2886102
PDD 5   -0.2793967098    0.4890833450   -0.0006748504   -0.0110416953
In [4]:
apply(y, MARGIN = 2, FUN = mean)
Out[4]:
Feature 1
1.11022302462516e-17
Feature 2
1.38777878078145e-17
Feature 3
-2.0323152671048e-16
Feature 4
1.31600120883112e-17
In [5]:
apply(y, 2, sd)
Out[5]:
Feature 1
1
Feature 2
1
Feature 3
1
Feature 4
1

Pairwise distances

In [6]:
dist(y)
Out[6]:
PDD 1    PDD 2    PDD 3    PDD 4
PDD 2 1.813442
PDD 3 3.753472 4.013627
PDD 4 3.114285 2.117902 3.839591
PDD 5 2.355289 1.711959 2.502087 1.687693

Dimension reduction

In [7]:
head(iris, 3)
Out[7]:
  Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1          5.1         3.5          1.4         0.2  setosa
2          4.9         3.0          1.4         0.2  setosa
3          4.7         3.2          1.3         0.2  setosa
In [8]:
ir <- iris[,1:4]
In [9]:
ir.pca <- prcomp(ir,
                 center = TRUE,
                 scale. = TRUE)
In [10]:
summary(ir.pca)
Out[10]:
Importance of components:
                          PC1    PC2     PC3     PC4
Standard deviation     1.7084 0.9560 0.38309 0.14393
Proportion of Variance 0.7296 0.2285 0.03669 0.00518
Cumulative Proportion  0.7296 0.9581 0.99482 1.00000
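The proportions in this summary come straight from the component standard deviations stored in the fitted object; a quick check:

```r
ir.pca <- prcomp(iris[, 1:4], center = TRUE, scale. = TRUE)
# variance explained = squared sdev of each PC over the total variance
prop_var <- ir.pca$sdev^2 / sum(ir.pca$sdev^2)
round(prop_var, 4)          # matches the "Proportion of Variance" row
round(cumsum(prop_var), 4)  # matches the "Cumulative Proportion" row
```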
In [11]:
plot(ir.pca, type="l")
In [12]:
options(warn=-1)
suppressMessages(install.packages("devtools",repos = "http://cran.r-project.org"))
library(devtools)
suppressMessages(install_github("vqv/ggbiplot"))
library(ggbiplot)
options(warn=0)

The downloaded binary packages are in
    /var/folders/xf/rzdg30ps11g93j3w0h589q780000gn/T//Rtmph45cuG/downloaded_packages
Loading required package: ggplot2
Loading required package: plyr
Loading required package: scales
Loading required package: grid
In [13]:
ggbiplot(ir.pca, groups = iris$Species, var.scale=1, obs.scale=1, ellipse = TRUE)
In [14]:
ir.mds <- cmdscale(dist(ir), k = 2)
In [15]:
par(mfrow=c(1,2), pty="s")
plot(iris[,1], iris[,2], type = "p", asp = 1, col=iris$Species, pch=16, main="Sepal", xlab="Length", ylab="Width")
plot(ir.mds[,1], ir.mds[,2], type = "p", asp = 1, col=iris$Species, pch=16, main="MDS", xlab="", ylab="")
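For Euclidean distances, classical MDS is equivalent to (unscaled) PCA: `cmdscale()` returns the PCA scores up to a sign flip per axis. A quick numerical check of that claim:

```r
ir <- iris[, 1:4]
mds <- cmdscale(dist(ir), k = 2)
pca <- prcomp(ir)$x[, 1:2]  # centered, unscaled PCA scores
# equal up to a per-axis sign flip
all.equal(abs(mds), abs(pca), check.attributes = FALSE, tolerance = 1e-6)
```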

Clustering

In [16]:
ir.kmeans <- kmeans(ir, centers=3)
In [17]:
par(mfrow=c(1,2), pty="s")
plot(iris[,1], iris[,2], type = "p", asp = 1, col=iris$Species, pch=16, main="True", xlab="Length", ylab="Width")
plot(iris[,1], iris[,2], type = "p", asp = 1, col=ir.kmeans$cluster, pch=16,
     main="K-means", xlab="Length", ylab="Width")
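Note that `kmeans()` starts from random centers, so two runs can give different partitions. A common safeguard (not used above) is a fixed seed plus multiple restarts via `nstart`, which keeps the solution with the lowest total within-cluster sum of squares:

```r
set.seed(1)
ir <- iris[, 1:4]
# nstart = 25 runs k-means from 25 random starts and keeps the best fit
fit <- kmeans(ir, centers = 3, nstart = 25)
fit$tot.withinss            # total within-cluster sum of squares
table(fit$cluster, iris$Species)  # clusters vs. true species
```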
In [18]:
ir.kmeans <- kmeans(ir, centers=6)
In [19]:
par(mfrow=c(1,2), pty="s")
plot(iris[,1], iris[,2], type = "p", asp = 1, col=iris$Species, pch=16, main="True", xlab="Length", ylab="Width")
plot(iris[,1], iris[,2], type = "p", asp = 1, col=ir.kmeans$cluster, pch=16,
     main="K-means", xlab="Length", ylab="Width")
In [20]:
ir.ahc <- hclust(dist(ir), method = "complete")
In [21]:
plot(ir.ahc)
rect.hclust(ir.ahc, k=3, border = "red")
In [22]:
groups <- cutree(ir.ahc, k=3)
In [23]:
par(mfrow=c(1,2), pty="s")
plot(iris[,1], iris[,2], type = "p", asp = 1, col=iris$Species, pch=16, main="True", xlab="Length", ylab="Width")
plot(iris[,1], iris[,2], type = "p", asp = 1, col=groups, pch=16,
     main="Hierarchical Clustering", xlab="Length", ylab="Width")
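How well the three hierarchical clusters recover the species can be checked with a cross-tabulation:

```r
ir.ahc <- hclust(dist(iris[, 1:4]), method = "complete")
groups <- cutree(ir.ahc, k = 3)
# rows: recovered clusters; columns: true species
table(groups, iris$Species)
```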

Heatmaps

Heatmaps are a graphical means of displaying a matrix of values (e.g. gene expression) together with the results of agglomerative hierarchical clustering of its rows and columns.

In [24]:
library(pheatmap)
In [25]:
pheatmap(mtcars, scale="column")
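`pheatmap` performs the row and column clustering internally. Assuming its documented defaults (Euclidean distance, complete linkage), the row dendrogram it draws can be reproduced with base R alone:

```r
# scale = "column" in pheatmap z-scores each column; scale() does the same
mt <- scale(mtcars)
# assumed pheatmap defaults: Euclidean distance, complete linkage
row_hc <- hclust(dist(mt), method = "complete")
plot(row_hc)
```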