[1]:
%matplotlib inline
import matplotlib as mpl
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
[2]:
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_mutual_info_score
[3]:
from sklearn.datasets import make_blobs
[4]:
plt.style.use('dark_background')

Clustering

Toy example to illustrate concepts

[5]:
npts = 1000
nc = 6
X, y = make_blobs(n_samples=npts, centers=nc, random_state=0)
[6]:
plt.scatter(X[:, 0], X[:, 1], s=5, c=y,
            cmap=plt.cm.get_cmap('Accent', nc))
plt.axis('square')
pass
../_images/notebooks_B04_Clustering_7_0.png

Sometimes the structure is clearer with contours

[7]:
import seaborn as sns
[8]:
sns.kdeplot(X[:, 0], X[:, 1])
plt.axis('square');
../_images/notebooks_B04_Clustering_10_0.png

How to cluster

  1. Choose a clustering algorithm

  2. Choose the number of clusters

Using a library function

Normally, we would just use a library function, as shown here. The hand-coded versions later in the notebook are meant to illustrate how the k-means algorithm works.

[9]:
kmeans = KMeans(nc)
[10]:
z = kmeans.fit_predict(X)
centers = kmeans.cluster_centers_
[11]:
plt.scatter(X[:, 0], X[:, 1], s=5, c=z,
            cmap=plt.cm.get_cmap('Accent', nc))
plt.scatter(centers[:, 0], centers[:, 1], marker='x',
            linewidth=3, s=100, c='red')
plt.axis('square')
pass
../_images/notebooks_B04_Clustering_16_0.png

Coding from the ground up

There are many different clustering methods, but K-means is fast, scales well, and can be interpreted as a probabilistic model. We will write 3 versions of the K-means algorithm to illustrate the main concepts. The algorithm is very simple:

  1. Start with \(k\) centers with labels \(0, 1, \ldots, k-1\)

  2. Find the distance of each data point to each center

  3. Assign each data point the label of its nearest center

  4. Use the mean of the points assigned to a center as the new center

  5. Repeat for a fixed number of iterations or until the centers stop changing

Note: If you want a probabilistic interpretation, just treat the final solution as a mixture of (multivariate) normal distributions. K-means is often used to initialize the fitting of full statistical mixture models, which are computationally more demanding and hence slower.
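
For example, a K-means fit can be used to seed a Gaussian mixture model. This is a rough sketch assuming scikit-learn's GaussianMixture, whose means_init argument accepts the K-means centers:

from sklearn.cluster import KMeans
from sklearn.mixture import GaussianMixture

km = KMeans(n_clusters=nc, n_init=10).fit(X)
gmm = GaussianMixture(n_components=nc, means_init=km.cluster_centers_)
gmm.fit(X)   # EM starts from the K-means centers and estimates full covariances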

Distance between sets of vectors

[12]:
from scipy.spatial.distance import cdist
[13]:
pts = np.arange(6).reshape(-1,2)
pts
[13]:
array([[0, 1],
       [2, 3],
       [4, 5]])
[14]:
mus = np.arange(4).reshape(-1, 2)
mus
[14]:
array([[0, 1],
       [2, 3]])
[15]:
cdist(pts, mus)
[15]:
array([[0.        , 2.82842712],
       [2.82842712, 0.        ],
       [5.65685425, 2.82842712]])
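
The same pairwise Euclidean distances can be obtained with plain NumPy broadcasting, which is a useful sanity check on what cdist computes (a sketch using the pts and mus arrays above):

# pts[:, None, :] has shape (3, 1, 2) and mus[None, :, :] has shape (1, 2, 2);
# the difference broadcasts to (3, 2, 2) and we reduce over the coordinate axis.
d = np.sqrt(((pts[:, None, :] - mus[None, :, :])**2).sum(axis=-1))
np.allclose(d, cdist(pts, mus))   # True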

Version 1

[16]:
def my_kmeans_1(X, k, iters=10):
    """K-means with a fixed number of iterations."""

    r, c = X.shape
    centers = X[np.random.choice(r, k, replace=False)]  # random initial centers
    for i in range(iters):
        m = cdist(X, centers)              # distance of each point to each center
        z = np.argmin(m, axis=1)           # assign each point to its nearest center
        centers = np.array([np.mean(X[z==j], axis=0) for j in range(k)])
    return (z, centers)
[17]:
np.random.seed(2018)
z, centers = my_kmeans_1(X, nc)

Note that K-means can get stuck in a local optimum

[18]:
plt.scatter(X[:, 0], X[:, 1], s=5, c=z,
            cmap=plt.cm.get_cmap('Accent', nc))
plt.scatter(centers[:, 0], centers[:, 1], marker='x',
            linewidth=3, s=100, c='red')
plt.axis('square')
pass
../_images/notebooks_B04_Clustering_28_0.png

Version 2

[19]:
def my_kmeans_2(X, k, tol=1e-6):
    """K-means iterated to a convergence tolerance."""

    r, c = X.shape
    centers = X[np.random.choice(r, k, replace=False)]
    delta = np.inf
    while delta > tol:
        m = cdist(X, centers)
        z = np.argmin(m, axis=1)
        new_centers = np.array([np.mean(X[z==j], axis=0) for j in range(k)])
        delta = np.sum((new_centers - centers)**2)  # how much the centers moved
        centers = new_centers
    return (z, centers)
[20]:
np.random.seed(2018)
z, centers = my_kmeans_2(X, nc)

Still stuck in a local optimum

[21]:
plt.scatter(X[:, 0], X[:, 1], s=5, c=z,
            cmap=plt.cm.get_cmap('Accent', nc))
plt.scatter(centers[:, 0], centers[:, 1], marker='x',
            linewidth=3, s=100, c='red')
plt.axis('square')
pass
../_images/notebooks_B04_Clustering_33_0.png

Version 3

We use a score to evaluate the goodness of fit. The simplest choice is the sum of distances from each point to its nearest center.
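
For example, using the labels and centers from the previous run, this score can be computed directly (a minimal sketch; scikit-learn's KMeans reports a related quantity, inertia_, which uses squared distances):

# Sum of distances from each point to its assigned center.
score = cdist(X, centers)[np.arange(len(X)), z].sum()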

[22]:
def my_kmeans_3(X, k, tol=1e-6, n_starts=10):
    """K-means with a convergence tolerance and random restarts."""

    r, c = X.shape
    best_score = np.inf

    for i in range(n_starts):
        centers = X[np.random.choice(r, k, replace=False)]
        delta = np.inf
        while delta > tol:
            m = cdist(X, centers)
            z = np.argmin(m, axis=1)
            new_centers = np.array([np.mean(X[z==j], axis=0) for j in range(k)])
            delta = np.sum((new_centers - centers)**2)
            centers = new_centers
        score = m[np.arange(r), z].sum()  # sum of distances to assigned centers
        if score < best_score:
            best_score = score
            best_z = z
            best_centers = centers
    return (best_z, best_centers)
[23]:
np.random.seed(2018)
z, centers = my_kmeans_3(X, nc)
[24]:
plt.scatter(X[:, 0], X[:, 1], s=5, c=z,
            cmap=plt.cm.get_cmap('Accent', nc))
plt.scatter(centers[:, 0], centers[:, 1], marker='x',
            linewidth=3, s=100, c='red')
plt.axis('square')
pass
../_images/notebooks_B04_Clustering_37_0.png

Model selection

We use the adjusted mutual information score (AMIS) for model selection. You can also use likelihood-based methods by interpreting the solution as a mixture of normals, but we don't cover those approaches here. There are also other ad hoc methods - see the scikit-learn docs if you are interested.

The mutual information is defined as

\[I(X; Y) = \sum_{y \in Y} \sum_{x \in X} p(x,y) \log \left( \frac{p(x, y)}{p(x)p(y)} \right)\]

It measures how much knowing \(X\) reduces the uncertainty about \(Y\) (and vice versa).

  • If \(X\) is independent of \(Y\), then \(I(X; Y) = 0\)

  • If \(X\) is a deterministic function of \(Y\), then \(I(X; Y) = H(X)\), and the normalized mutual information is 1

It is equal to the Kullback-Leibler divergence between the joint distribution and the product of the marginals

\[\text{KL}\left(p(x,y) \mid\mid p(x)p(y)\right)\]

We use the adjusted mutual information score (AMIS) here, which corrects the mutual information for chance agreement, so that the score is not inflated simply by using more clusters.

From the scikit-learn documentation:

AMI(U, V) = [MI(U, V) - E(MI(U, V))] / [max(H(U), H(V)) - E(MI(U, V))]
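
As a quick illustration (a sketch using the adjusted_mutual_info_score imported above and made-up labels), AMI is near 0 for unrelated labelings and exactly 1 when two labelings differ only by a permutation of the label names:

rng = np.random.RandomState(0)
a = rng.randint(0, 3, 100)          # arbitrary labels
b = rng.randint(0, 3, 100)          # unrelated labels -> AMI close to 0
perm = np.array([2, 0, 1])          # relabeling of a -> AMI exactly 1
adjusted_mutual_info_score(a, b), adjusted_mutual_info_score(a, perm[a])
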
[25]:
ncols = 4
nrows = 2
plt.figure(figsize=(ncols*3, nrows*3))
for i, nc in enumerate(range(3, 11)):
    kmeans = KMeans(nc, n_init=10)
    clusters = kmeans.fit_predict(X)
    centers = kmeans.cluster_centers_
    amis = adjusted_mutual_info_score(y, clusters)
    plt.subplot(nrows, ncols, i+1)
    plt.scatter(X[:, 0], X[:, 1], s=5, c=y,
                cmap=plt.cm.get_cmap('Accent', nc))
    plt.scatter(centers[:, 0], centers[:, 1], marker='x',
                linewidth=3, s=100, c='red')
    plt.text(0.15, 0.9, 'nc = %d AMIS = %.2f' % (nc, amis),
             fontsize=16, color='yellow',
             transform=plt.gca().transAxes)
    plt.xticks([])
    plt.yticks([])
    plt.axis('square')
plt.tight_layout()
../_images/notebooks_B04_Clustering_40_0.png

Comparing across samples

[26]:
X1 = X + np.random.normal(0, 1, X.shape)
[27]:
plt.figure(figsize=(8, 4))
for i, xs in enumerate([X, X1]):
    plt.subplot(1,2,i+1)
    plt.scatter(xs[:, 0], xs[:, 1], s=5, c=y,
                cmap=plt.cm.get_cmap('Accent', nc))
    plt.xticks([])
    plt.yticks([])
    plt.axis('square')
plt.tight_layout()
pass
../_images/notebooks_B04_Clustering_43_0.png
[28]:
nc = 6
kmeans = KMeans(nc, n_init=10)
kmeans.fit(X)
z1 = kmeans.predict(X)
z2 = kmeans.predict(X1)
zs = [z1, z2]
[29]:
plt.figure(figsize=(8, 4))
for i, xs in enumerate([X, X1]):
    plt.subplot(1,2,i+1)
    plt.scatter(xs[:, 0], xs[:, 1], s=5, c=zs[i],
                cmap=plt.cm.get_cmap('Accent', nc))
    plt.xticks([])
    plt.yticks([])
    plt.axis('square')
plt.tight_layout()
pass
../_images/notebooks_B04_Clustering_46_0.png
[30]:
Y = np.r_[X, X1]
kmeans = KMeans(nc, n_init=10)
kmeans.fit(Y)
zs = np.split(kmeans.predict(Y), 2)
[31]:
plt.figure(figsize=(8, 4))
for i, xs in enumerate([X, X1]):
    plt.subplot(1,2,i+1)
    plt.scatter(xs[:, 0], xs[:, 1], s=5, c=zs[i],
                cmap=plt.cm.get_cmap('Accent', nc))
    plt.xticks([])
    plt.yticks([])
    plt.axis('square')
plt.tight_layout()
pass
../_images/notebooks_B04_Clustering_49_0.png
[32]:
from scipy.optimize import linear_sum_assignment
[33]:
np.random.seed(2018)
kmeans = KMeans(nc, n_init=10)
c1 = kmeans.fit(X).cluster_centers_
z1 = kmeans.predict(X)
c2 = kmeans.fit(X1).cluster_centers_
z2 = kmeans.predict(X1)
zs = [z1, z2]
cs = [c1, c2]
[34]:
plt.figure(figsize=(8, 4))
for i, xs in enumerate([X, X1]):
    plt.subplot(1,2,i+1)
    z = zs[i]
    c = cs[i]

    plt.scatter(xs[:, 0], xs[:, 1], s=5, c=z,
                cmap=plt.cm.get_cmap('Accent', nc))
    plt.xticks([])
    plt.yticks([])
    plt.axis('square')
plt.tight_layout()
pass
../_images/notebooks_B04_Clustering_53_0.png

Define cost function

We use a simple cost: the pairwise distances between the two sets of cluster centers. More complex dissimilarity measures could be used.

[35]:
cost = cdist(c1, c2)

Run the Hungarian (Munkres) algorithm for bipartite matching

[36]:
row_ind, col_ind = linear_sum_assignment(cost)
[37]:
row_ind, col_ind
[37]:
(array([0, 1, 2, 3, 4, 5]), array([0, 1, 5, 3, 4, 2]))

Swap cluster indices to align the two data sets

[38]:
z1_aligned = col_ind[z1]
zs = [z1_aligned, z2]
[39]:
plt.figure(figsize=(8, 4))
for i, xs in enumerate([X, X1]):
    plt.subplot(1,2,i+1)
    plt.scatter(xs[:, 0], xs[:, 1], s=5, c=zs[i],
                cmap=plt.cm.get_cmap('Accent', nc))
    plt.xticks([])
    plt.yticks([])
    plt.axis('square')
plt.tight_layout()
pass
../_images/notebooks_B04_Clustering_61_0.png
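
Since X1 is just X plus noise, the rows of the two samples correspond, so we can check that the relabeling improves point-by-point agreement between the two fits (a quick sanity check, not part of the alignment itself):

# Fraction of matching labels before and after mapping c1's labels onto c2's.
np.mean(z1 == z2), np.mean(z1_aligned == z2)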

Probabilistic alternatives to k-means

There are several advantages to fitting a probabilistic model:

  • There is a probabilistic interpretation

  • It is generative

  • You have a log likelihood to work with - e.g. evaluate model fit with AIC and BIC
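
To make these points concrete, here is a minimal sketch using scikit-learn's GaussianMixture (also imported in the next cell), which exposes soft assignments, per-point log likelihoods, and sampling:

from sklearn.mixture import GaussianMixture

gmm = GaussianMixture(n_components=nc).fit(X)
resp = gmm.predict_proba(X)            # soft assignments: P(component | point)
ll = gmm.score_samples(X)              # per-point log likelihood under the fitted mixture
new_pts, new_labels = gmm.sample(10)   # the model is generative: draw synthetic data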

[40]:
from sklearn.mixture import GaussianMixture

Model selection

We can use so-called information criteria that approximate the leave-one-out error. Two commonly used criteria are

  • AIC - Akaike Information Criterion

  • BIC - Bayesian Information Criterion

Basically, both add a penalty term to the negative log likelihood to account for model complexity. BIC is generally more conservative than AIC (i.e. BIC prefers less complex models).
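
Concretely, with \(\hat{L}\) the maximized likelihood of the fitted model, \(k\) the number of free parameters, and \(n\) the number of observations,

\[\text{AIC} = 2k - 2\log\hat{L}, \qquad \text{BIC} = k\log n - 2\log\hat{L}\]

The \(k \log n\) penalty in BIC grows with the sample size, which is why BIC tends to select fewer components than AIC.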

[41]:
aics = []
bics = []
for i in range(4, 10):
    gmm = GaussianMixture(n_components=i)
    gmm.fit(X)
    aics.append(gmm.aic(X))
    bics.append(gmm.bic(X))
[42]:
min(list(zip(range(4,10), aics)), key = lambda x: x[1])
[42]:
(6, 8939.46780097899)
[43]:
plt.plot(range(4, 10), aics, '-o');
../_images/notebooks_B04_Clustering_67_0.png
[44]:
min(list(zip(range(4,10), bics)), key = lambda x: x[1])
[44]:
(5, 9095.225715313982)
[45]:
plt.plot(range(4, 10), bics, '-o');
../_images/notebooks_B04_Clustering_69_0.png

Generative model

You can draw synthetic samples from the fitted distribution.

[46]:
gmm = GaussianMixture(n_components=nc)
gmm.fit(X)
xs, zs = gmm.sample(1000)
[47]:
plt.scatter(xs[:, 0], xs[:, 1], s=5, c=zs,
            cmap=plt.cm.get_cmap('Accent', nc))
plt.axis('square')
pass
../_images/notebooks_B04_Clustering_72_0.png