Unsupervised Learning

Golub data set

In [1]:
suppressPackageStartupMessages(library(multtest))
suppressPackageStartupMessages(library(golubEsets))
suppressPackageStartupMessages(library(tidyverse))
In [2]:
data(Golub_Merge)
dim(Golub_Merge)
Features  Samples
    7129       72

Extract gene expression values

In [3]:
golub <- exprs(Golub_Merge)

There are 72 patients and 7129 probe sets.

In [4]:
dim(golub)
  1. 7129
  2. 72
In [5]:
head(golub)
                  39   40   42   47   48   49   41   43   44   45   35   36   37   38   28   29   30   31   32   33
AFFX-BioB-5_at  -342  -87   22 -243 -130 -256  -62   86 -146 -187    7 -213  -25  -72   -4   15 -318  -32 -124 -135
AFFX-BioB-M_at  -200 -248 -153 -218 -177 -249  -23  -36  -74 -187 -100 -252  -20 -139 -116 -114 -192  -49  -79 -186
AFFX-BioB-3_at    41  262   17 -163  -28 -410   -7 -141  170  312  -57  136  124   -1 -125    2  -95   49  -37  -70
AFFX-BioC-5_at   328  295  276  182  266   24  142  252  174  142  132  318  325  392  241  193  312  230  330  337
AFFX-BioC-3_at  -224 -226 -211 -289 -170 -535 -233 -201  -32  114 -377 -209 -396 -324 -191  -51 -139 -367 -188 -407
AFFX-BioDn-5_at -427 -493 -250 -268 -326 -810 -284 -384 -318 -148 -478 -557 -464 -510 -411 -155 -344 -508 -423 -566

(first 20 of the 72 samples shown)

For this exercise, we consider the probe values to be variables and the patients to be observations, so it is convenient to work with the matrix transpose.

In [6]:
golub <- t(golub)
In [7]:
dim(golub)
  1. 72
  2. 7129
In [8]:
golub[1:3, ]
   AFFX-BioB-5_at AFFX-BioB-M_at AFFX-BioB-3_at AFFX-BioC-5_at AFFX-BioC-3_at AFFX-BioDn-5_at AFFX-BioDn-3_at AFFX-CreX-5_at AFFX-CreX-3_at AFFX-BioB-5_st
39           -342           -200             41            328           -224            -427            -656           -292            137           -144
40            -87           -248            262            295           -226            -493             367           -452            194            162
42             22           -153             17            276           -211            -250              55           -141              0            500
   ... U48730_at U58516_at U73738_at X06956_at X16699_at X83863_at Z17240_at L49218_f_at M71243_f_at Z78285_f_at
39 ...       277      1023        67       214      -135      1074       475          48         168         -70
40 ...        83       529      -295       352       -67        67       263         -33         -33         -21
42 ...       413       399       165        58        24       893       297           6        1971         -42

(the display shows the first 10 and last 10 of the 7129 probe sets; the middle columns are elided)

Distances

Pairwise distances between the first 3 patients

In [9]:
dist(golub[1:3,])
          39        40
40 101530.75
42  94405.04  89502.29
In [10]:
dist(golub[1:3,], diag = TRUE)
          39        40        42
39      0.00
40 101530.75      0.00
42  94405.04  89502.29      0.00
In [11]:
dist(golub[1:3,], diag = TRUE, upper=TRUE)
          39        40        42
39      0.00 101530.75  94405.04
40 101530.75      0.00  89502.29
42  94405.04  89502.29      0.00
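
dist computes Euclidean distances by default; other metrics are available through its method argument. For example (a sketch, output omitted):

dist(golub[1:3, ], method = "manhattan")  # city-block distance between the same patients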

Manual calculation

Euclidean distance is just the n-dimensional application of Pythagoras' theorem.

If we have points \(x = (0,0)\) and \(y = (3, 4)\), then the distance between \(x\) and \(y\) is

\[\sqrt{(3-0)^2 + (4-0)^2} = 5\]

We write a vectorized calculation of the above.

In [12]:
distance <- function(x, y) { sqrt(sum((x - y)^2))}
In [13]:
x <- golub[1,]
y <- golub[2,]
round(distance(x, y), 2)
101530.75
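
As a sanity check, the hand-rolled function agrees with dist; this sketch should reproduce the 101530.75 above:

all.equal(distance(x, y), as.numeric(dist(golub[1:2, ])))  # TRUE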

Ordination

MDS

In [14]:
mds <- as.data.frame(cmdscale(dist(golub), k = 2))
In [15]:
dim(mds)
  1. 72
  2. 2
In [16]:
phenotype <- pData(Golub_Merge)$ALL.AML
In [17]:
mds <- mds %>% mutate(phenotype=phenotype)
In [18]:
head(mds)
           V1         V2 phenotype
 23228.24839 -10207.142       ALL
  7327.10741  -8114.544       ALL
 -5088.36719  -2548.717       ALL
 46738.68452  -8792.756       ALL
-39906.71218 -23179.919       ALL
   -37.91955  -9771.907       ALL
In [19]:
plot(mds$V1, mds$V2, type="n")
text(mds$V1, mds$V2, labels = mds$phenotype, col=as.integer(mds$phenotype))
../_images/cliburn_Unsupervised_Learning_28_0.png
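
Since the tidyverse is already loaded, the same plot can be drawn with ggplot2; a minimal sketch:

ggplot(mds, aes(V1, V2, color = phenotype, label = phenotype)) +
    geom_text(size = 3)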

PCA

In [20]:
pca <- as.data.frame(prcomp(golub, center = TRUE, scale. = TRUE, rank. = 2)$x)
In [21]:
dim(pca)
  1. 72
  2. 2
In [22]:
pca <- pca %>% mutate(phenotype=phenotype)
In [23]:
head(pca)
        PC1        PC2 phenotype
  -5.829016  18.985914       ALL
  -8.364993  22.877479       ALL
  12.928473  -3.996856       ALL
  32.254056  -5.525024       ALL
  -4.218081 -56.824360       ALL
 -65.275849  22.817599       ALL
In [24]:
plot(pca$PC1, pca$PC2, type="n")
text(pca$PC1, pca$PC2, labels = pca$phenotype, col=as.integer(pca$phenotype))
../_images/cliburn_Unsupervised_Learning_34_0.png
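
Before trusting a 2-D picture, it is worth asking how much of the total variance the first two components actually capture. A minimal sketch using summary on the full decomposition:

pca.full <- prcomp(golub, center = TRUE, scale. = TRUE)
summary(pca.full)$importance[, 1:2]  # standard deviation, proportion and cumulative proportion of variance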

Preserving the distances

Scale to have zero mean and unit standard deviation

In [25]:
scexpdat <- scale(golub)
In [26]:
dim(scexpdat)
  1. 72
  2. 7129

Check

In [27]:
apply(scexpdat[, 1:4], 2, mean)
AFFX-BioB-5_at  -7.84141692068388e-17
AFFX-BioB-M_at  -4.46028727069731e-18
AFFX-BioB-3_at   1.49183207261304e-17
AFFX-BioC-5_at  -5.05117745470191e-17

These means are all zero to within floating-point rounding error.
In [28]:
apply(scexpdat[, 1:4], 2, sd)
AFFX-BioB-5_at  1
AFFX-BioB-M_at  1
AFFX-BioB-3_at  1
AFFX-BioC-5_at  1

Using dplyr

In [29]:
as.data.frame(scexpdat) %>%
    select(1:4) %>%
    summarise_all(mean) %>%
    round
AFFX-BioB-5_at AFFX-BioB-M_at AFFX-BioB-3_at AFFX-BioC-5_at
             0              0              0              0
In [30]:
as.data.frame(scexpdat) %>%
    select(1:4) %>%
    summarise_all(sd) %>%
    round
AFFX-BioB-5_at AFFX-BioB-M_at AFFX-BioB-3_at AFFX-BioC-5_at
             1              1              1              1
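
Note that scaling changes the distance structure: after standardization every probe contributes on the same scale, so pairwise distances are no longer dominated by high-variance probes. A quick sketch comparing the first three patients before and after scaling:

round(dist(golub[1:3, ]), 2)     # distances on raw expression values
round(dist(scexpdat[1:3, ]), 2)  # distances on standardized values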

Clustering

Agglomerative hierarchical clustering (AHC)

In [31]:
names <- c("ATL", "BOS", "ORD", "DCA")
airports <- c(0, 934, 585, 542, 934, 0, 853, 392,
              585, 853, 0, 598, 542, 392, 598, 0)
airports <- matrix(airports, ncol=4, byrow=FALSE, dimnames = list(names, names))
In [32]:
airports
    ATL BOS ORD DCA
ATL   0 934 585 542
BOS 934   0 853 392
ORD 585 853   0 598
DCA 542 392 598   0
In [33]:
as.dist(airports)
    ATL BOS ORD
BOS 934
ORD 585 853
DCA 542 392 598
In [34]:
tree <- hclust(as.dist(airports), method="single")
plot(tree)
../_images/cliburn_Unsupervised_Learning_50_0.png
In [35]:
tree <- hclust(as.dist(airports), method="complete")
plot(tree)
../_images/cliburn_Unsupervised_Learning_51_0.png
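
Single and complete linkage are two extremes; the same method argument accepts other rules, for instance average linkage (UPGMA). A sketch, output omitted:

plot(hclust(as.dist(airports), method = "average"))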

Road-trip USA

In [36]:
plot(hclust(UScitiesD, method="complete"))
../_images/cliburn_Unsupervised_Learning_53_0.png

A trip to Europe

In [37]:
plot(hclust(eurodist, method="complete"))
../_images/cliburn_Unsupervised_Learning_55_0.png
In [38]:
eurotree <- hclust(eurodist, method="complete")

Find clusters by height

In [39]:
groups <- cutree(tree = eurotree, h = 1500)
data.frame(groups) %>%
    rownames_to_column("city") %>%
    arrange(groups)
           city groups
         Athens      1
           Rome      1
      Barcelona      2
         Geneva      2
          Lyons      2
     Marseilles      2
          Milan      2
       Brussels      3
         Calais      3
      Cherbourg      3
        Cologne      3
Hook of Holland      3
          Paris      3
     Copenhagen      4
        Hamburg      4
      Stockholm      4
      Gibraltar      5
         Lisbon      5
         Madrid      5
         Munich      6
         Vienna      6
In [40]:
plot(eurotree)
rect.hclust(eurotree, h=1500, border = "red")
../_images/cliburn_Unsupervised_Learning_59_0.png

Find clusters by number

In [41]:
groups <- cutree(tree = eurotree, k = 8)
data.frame(groups) %>%
    rownames_to_column("city") %>%
    arrange(groups)
           city groups
         Athens      1
           Rome      1
      Barcelona      2
     Marseilles      2
       Brussels      3
         Calais      3
      Cherbourg      3
        Cologne      3
Hook of Holland      3
          Paris      3
     Copenhagen      4
        Hamburg      4
         Geneva      5
          Lyons      5
          Milan      5
      Gibraltar      6
         Lisbon      6
         Madrid      6
         Munich      7
         Vienna      7
      Stockholm      8
In [42]:
plot(eurotree)
rect.hclust(eurotree, k=8, border = "red")
../_images/cliburn_Unsupervised_Learning_62_0.png

k-means clustering

In [43]:
kmeans.golub <- kmeans(golub, centers=4)
In [44]:
plot(mds$V1, mds$V2, type="n")
text(mds$V1, mds$V2, labels = mds$phenotype, col=as.integer(kmeans.golub$cluster))
../_images/cliburn_Unsupervised_Learning_65_0.png
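
To judge how the unsupervised clusters line up with the known labels, cross-tabulate them. A sketch; the exact counts vary from run to run because kmeans starts from random centers, so nstart > 1 gives stabler solutions:

table(cluster = kmeans.golub$cluster, phenotype = phenotype)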

Grouped by data source

In [45]:
plot(mds$V1, mds$V2, type="n")
text(mds$V1, mds$V2, labels = mds$phenotype,
     col=as.integer(pData(Golub_Merge)$Source))
../_images/cliburn_Unsupervised_Learning_67_0.png

Semi-supervised learning (Noise discovery)

In [46]:
suppressPackageStartupMessages(library(genefilter))
suppressPackageStartupMessages(library(pheatmap))

Simulate noise data set

Note that EVERY expression value is drawn from a standard normal distribution. Hence there should not be any meaningful distinction between the “groups”.

In [47]:
m <- 20000 # number of genes
n <- 20 # number of subjects per group
alpha <- 0.005

grp <- factor(rep(c('N', 'Y'), c(n, n)))
genes <- paste("Gene", 1:m, sep="")
subjects <- paste("PID", 1:(2*n), sep="")
expr <- matrix(rnorm(2 * n * m), m, 2 * n)
rownames(expr) <- genes
colnames(expr) <- subjects

Find genes that differ across groups at a specified significance level

In [48]:
pvals <- rowttests(expr, grp)$p.value
In [49]:
df <- data.frame(expr, pvals)
In [50]:
top.genes <- df %>%
    filter(pvals < alpha) %>%
    select(-pvals)
In [51]:
dim(top.genes)
  1. 108
  2. 40

With m = 20000 genes tested at alpha = 0.005, we expect about 20000 * 0.005 = 100 "significant" genes by chance alone, so finding 108 is entirely consistent with pure noise.

Show heatmap and AHC clustering for top genes

In [52]:
annot <- data.frame(grp=grp, row.names=colnames(top.genes))
In [53]:
head(annot)
     grp
PID1   N
PID2   N
PID3   N
PID4   N
PID5   N
PID6   N

Simple version of heatmap

In [54]:
pheatmap(top.genes)
../_images/cliburn_Unsupervised_Learning_81_0.png

Fancy version of heatmap

In [55]:
pheatmap(top.genes,
         annotation_col = annot,
         color = colorRampPalette(c("red3", "black", "green3"))(50),
         annotation_colors = list(grp = c(Y = "blue", N = "yellow")),
         show_rownames = FALSE, show_colnames = FALSE)
../_images/cliburn_Unsupervised_Learning_83_0.png

MDS of top genes

In [56]:
mds <- cmdscale(dist(t(top.genes)))
plot(mds, col=as.integer(grp))
../_images/cliburn_Unsupervised_Learning_85_0.png

The heatmap and MDS plot appear to separate the two "groups" cleanly even though the data are pure noise: selecting the top-scoring genes with the same labels and data that we then visualize guarantees apparent structure. This selection bias is the "noise discovery" of the section title.

Exercise 1

Load the iris data set. Each row has 4 features and a Species label.

  • Reduce the dimensionality of the features to 2 using each of the methods described above (PCA, MDS).
  • Plot scatter plots for each method, coloring by Species.
  • Are the Species separate in these dimensionality-reduced plots?

Exercise 2

Load the iris data set. Each row has 4 features and a Species label.

  • Scale the data to have zero mean and unit standard deviation
  • Calculate a pairwise distance matrix (explore different distance measures)
  • Perform hierarchical clustering (explore different linkage measures)
  • Plot a dendrogram for the hierarchical clustering, showing 3 clusters (see the rect.hclust function)
  • Create a scatter plot of the first two features colored by the cluster label (see the cutree function)

Exercise 3

Load the iris data set. Each row has 4 features and a Species label.

  • Scale the data to have zero mean and unit standard deviation
  • Perform k-means clustering using 2, 3, 4, and 10 clusters
  • Create a scatter plot of the first two features colored by the cluster label for each cluster number
  • How could you assess how many clusters is appropriate? (see the hint below)
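
Hint for the last question: one common heuristic (a sketch of one possible answer, not the only one) is the "elbow" plot of total within-cluster sum of squares against k, choosing the k where the improvement levels off:

wss <- sapply(2:10, function(k)
    kmeans(scale(iris[, 1:4]), centers = k, nstart = 25)$tot.withinss)
plot(2:10, wss, type = "b",
     xlab = "number of clusters k", ylab = "total within-cluster SS")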