[1]:
%matplotlib inline
import matplotlib as mpl
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
[2]:
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_mutual_info_score
[3]:
from sklearn.datasets import make_blobs
[4]:
plt.style.use('dark_background')

Clustering

Toy example to illustrate concepts

[5]:
npts = 1000
nc = 6
X, y = make_blobs(n_samples=npts, centers=nc, random_state=0)
[6]:
plt.scatter(X[:, 0], X[:, 1], s=5, c=y,
            cmap=plt.cm.get_cmap('Accent', nc))
plt.axis('square')
pass
../_images/notebooks_B04_Clustering_7_0.png

Sometimes the structure is clearer with contours

[7]:
import seaborn as sns
[8]:
sns.kdeplot(X[:, 0], X[:, 1])
plt.axis('square');
../_images/notebooks_B04_Clustering_10_0.png

How to cluster

  1. Choose a clustering algorithm

  2. Choose the number of clusters

Using a library function

Normally, we would just use a library function, as shown here. The hand-coded versions later in the notebook are meant to illustrate how the k-means algorithm works.

[9]:
kmeans = KMeans(nc)
[10]:
z = kmeans.fit_predict(X)
centers = kmeans.cluster_centers_
[11]:
plt.scatter(X[:, 0], X[:, 1], s=5, c=z,
            cmap=plt.cm.get_cmap('Accent', nc))
plt.scatter(centers[:, 0], centers[:, 1], marker='x',
            linewidth=3, s=100, c='red')
plt.axis('square')
pass
../_images/notebooks_B04_Clustering_16_0.png

Coding from the ground up

There are many different clustering methods, but K-means is fast, scales well, and can be interpreted as a probabilistic model. We will write 3 versions of the K-means algorithm to illustrate the main concepts. The algorithm is very simple:

  1. Start with \(k\) centers with labels \(0, 1, \ldots, k-1\)

  2. Find the distance of each data point to each center

  3. Assign each data point the label of its nearest center

  4. Use the mean of the points assigned to a center as the new center

  5. Repeat for a fixed number of iterations or until the centers stop changing

Note: If you want a probabilistic interpretation, just treat the final solution as a mixture of (multivariate) normal distributions. K-means is often used to initialize the fitting of full statistical mixture models, which are computationally more demanding and hence slower.
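
For example, a K-means fit can be used to seed a Gaussian mixture model. This is a rough sketch assuming scikit-learn's GaussianMixture, whose means_init argument accepts the K-means centers:

from sklearn.cluster import KMeans
from sklearn.mixture import GaussianMixture

km = KMeans(n_clusters=nc, n_init=10).fit(X)
gmm = GaussianMixture(n_components=nc, means_init=km.cluster_centers_)
gmm.fit(X)   # EM starts from the K-means centers and estimates full covariances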

Distance between sets of vectors

[12]:
from scipy.spatial.distance import cdist
[13]:
pts = np.arange(6).reshape(-1,2)
pts
[13]:
array([[0, 1],
       [2, 3],
       [4, 5]])
[14]:
mus = np.arange(4).reshape(-1, 2)
mus
[14]:
array([[0, 1],
       [2, 3]])
[15]:
cdist(pts, mus)
[15]:
array([[0.        , 2.82842712],
       [2.82842712, 0.        ],
       [5.65685425, 2.82842712]])
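
The same pairwise Euclidean distances can be obtained with plain NumPy broadcasting, which is a useful sanity check on what cdist computes (a sketch using the pts and mus arrays above):

# pts[:, None, :] has shape (3, 1, 2) and mus[None, :, :] has shape (1, 2, 2);
# the difference broadcasts to (3, 2, 2) and we reduce over the coordinate axis.
d = np.sqrt(((pts[:, None, :] - mus[None, :, :])**2).sum(axis=-1))
np.allclose(d, cdist(pts, mus))   # True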

Version 1

[16]:
def my_kmeans_1(X, k, iters=10):
    """K-means with a fixed number of iterations."""

    r, c = X.shape
    centers = X[np.random.choice(r, k, replace=False)]  # random initial centers
    for i in range(iters):
        m = cdist(X, centers)              # distance of each point to each center
        z = np.argmin(m, axis=1)           # assign each point to its nearest center
        centers = np.array([np.mean(X[z==j], axis=0) for j in range(k)])
    return (z, centers)
[17]:
np.random.seed(2018)
z, centers = my_kmeans_1(X, nc)

Note that K-means can get stuck in a local optimum

[18]:
plt.scatter(X[:, 0], X[:, 1], s=5, c=z,
            cmap=plt.cm.get_cmap('Accent', nc))
plt.scatter(centers[:, 0], centers[:, 1], marker='x',
            linewidth=3, s=100, c='red')
plt.axis('square')
pass
../_images/notebooks_B04_Clustering_28_0.png

Version 2

[19]:
def my_kmeans_2(X, k, tol=1e-6):
    """K-means iterated to a convergence tolerance."""

    r, c = X.shape
    centers = X[np.random.choice(r, k, replace=False)]
    delta = np.inf
    while delta > tol:
        m = cdist(X, centers)
        z = np.argmin(m, axis=1)
        new_centers = np.array([np.mean(X[z==j], axis=0) for j in range(k)])
        delta = np.sum((new_centers - centers)**2)  # how much the centers moved
        centers = new_centers
    return (z, centers)
[20]:
np.random.seed(2018)
z, centers = my_kmeans_2(X, nc)

Still stuck in a local optimum

[21]:
plt.scatter(X[:, 0], X[:, 1], s=5, c=z,
            cmap=plt.cm.get_cmap('Accent', nc))
plt.scatter(centers[:, 0], centers[:, 1], marker='x',
            linewidth=3, s=100, c='red')
plt.axis('square')
pass
../_images/notebooks_B04_Clustering_33_0.png

Version 3

We use a score to evaluate the goodness of fit. The simplest choice is the sum of distances from each point to its nearest center.
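
For example, using the labels and centers from the previous run, this score can be computed directly (a minimal sketch; scikit-learn's KMeans reports a related quantity, inertia_, which uses squared distances):

# Sum of distances from each point to its assigned center.
score = cdist(X, centers)[np.arange(len(X)), z].sum()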

[22]:
def my_kmeans_3(X, k, tol=1e-6, n_starts=10):
    """K-means with a convergence tolerance and random restarts."""

    r, c = X.shape
    best_score = np.inf

    for i in range(n_starts):
        centers = X[np.random.choice(r, k, replace=False)]
        delta = np.inf
        while delta > tol:
            m = cdist(X, centers)
            z = np.argmin(m, axis=1)
            new_centers = np.array([np.mean(X[z==j], axis=0) for j in range(k)])
            delta = np.sum((new_centers - centers)**2)
            centers = new_centers
        score = m[np.arange(r), z].sum()  # sum of distances to assigned centers
        if score < best_score:
            best_score = score
            best_z = z
            best_centers = centers
    return (best_z, best_centers)
[23]:
np.random.seed(2018)
z, centers = my_kmeans_3(X, nc)
[24]:
plt.scatter(X[:, 0], X[:, 1], s=5, c=z,
            cmap=plt.cm.get_cmap('Accent', nc))
plt.scatter(centers[:, 0], centers[:, 1], marker='x',
            linewidth=3, s=100, c='red')
plt.axis('square')
pass
../_images/notebooks_B04_Clustering_37_0.png

Model selection

We use the adjusted mutual information score (AMIS) for model selection. You can also use likelihood-based methods by interpreting the solution as a mixture of normals, but we don't cover those approaches here. There are also other ad hoc methods - see the scikit-learn docs if you are interested.

The mutual information is defined as

\[I(X; Y) = \sum_{y \in Y} \sum_{x \in X} p(x,y) \log \left( \frac{p(x, y)}{p(x)p(y)} \right)\]

It measures how much knowing \(X\) reduces the uncertainty about \(Y\) (and vice versa).

  • If \(X\) is independent of \(Y\), then \(I(X; Y) = 0\)

  • If \(X\) is a deterministic function of \(Y\), then \(I(X; Y) = H(X)\), and the normalized mutual information is 1

It is equal to the Kullback-Leibler divergence between the joint distribution and the product of the marginals

\[\text{KL}\left(p(x,y) \mid\mid p(x)p(y)\right)\]

We use the adjusted mutual information score (AMIS) here, which corrects the mutual information for chance agreement, so that the score is not inflated simply by using more clusters.

From the scikit-learn documentation:

AMI(U, V) = [MI(U, V) - E(MI(U, V))] / [max(H(U), H(V)) - E(MI(U, V))]
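
As a quick illustration (a sketch using the adjusted_mutual_info_score imported above and made-up labels), AMI is near 0 for unrelated labelings and exactly 1 when two labelings differ only by a permutation of the label names:

rng = np.random.RandomState(0)
a = rng.randint(0, 3, 100)          # arbitrary labels
b = rng.randint(0, 3, 100)          # unrelated labels -> AMI close to 0
perm = np.array([2, 0, 1])          # relabeling of a -> AMI exactly 1
adjusted_mutual_info_score(a, b), adjusted_mutual_info_score(a, perm[a])
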
[25]:
ncols = 4
nrows = 2
plt.figure(figsize=(ncols*3, nrows*3))
for i, nc in enumerate(range(3, 11)):
    kmeans = KMeans(nc, n_init=10)
    clusters = kmeans.fit_predict(X)
    centers = kmeans.cluster_centers_
    amis = adjusted_mutual_info_score(y, clusters)
    plt.subplot(nrows, ncols, i+1)
    plt.scatter(X[:, 0], X[:, 1], s=5, c=y,
                cmap=plt.cm.get_cmap('Accent', nc))
    plt.scatter(centers[:, 0], centers[:, 1], marker='x',
                linewidth=3, s=100, c='red')
    plt.text(0.15, 0.9, 'nc = %d AMIS = %.2f' % (nc, amis),
             fontsize=16, color='yellow',
             transform=plt.gca().transAxes)
    plt.xticks([])
    plt.yticks([])
    plt.axis('square')
plt.tight_layout()
../_images/notebooks_B04_Clustering_40_0.png

Comparing across samples

[26]:
X1 = X + np.random.normal(0, 1, X.shape)
[27]:
plt.figure(figsize=(8, 4))
for i, xs in enumerate([X, X1]):
    plt.subplot(1,2,i+1)
    plt.scatter(xs[:, 0], xs[:, 1], s=5, c=y,
                cmap=plt.cm.get_cmap('Accent', nc))
    plt.xticks([])
    plt.yticks([])
    plt.axis('square')
plt.tight_layout()
pass
../_images/notebooks_B04_Clustering_43_0.png
[28]:
nc = 6
kmeans = KMeans(nc, n_init=10)
kmeans.fit(X)
z1 = kmeans.predict(X)
z2 = kmeans.predict(X1)
zs = [z1, z2]
[29]:
plt.figure(figsize=(8, 4))
for i, xs in enumerate([X, X1]):
    plt.subplot(1,2,i+1)
    plt.scatter(xs[:, 0], xs[:, 1], s=5, c=zs[i],
                cmap=plt.cm.get_cmap('Accent', nc))
    plt.xticks([])
    plt.yticks([])
    plt.axis('square')
plt.tight_layout()
pass
../_images/notebooks_B04_Clustering_46_0.png
[30]:
Y = np.r_[X, X1]
kmeans = KMeans(nc, n_init=10)
kmeans.fit(Y)
zs = np.split(kmeans.predict(Y), 2)
[31]:
plt.figure(figsize=(8, 4))
for i, xs in enumerate([X, X1]):
    plt.subplot(1,2,i+1)
    plt.scatter(xs[:, 0], xs[:, 1], s=5, c=zs[i],
                cmap=plt.cm.get_cmap('Accent', nc))
    plt.xticks([])
    plt.yticks([])
    plt.axis('square')
plt.tight_layout()
pass
../_images/notebooks_B04_Clustering_49_0.png
[32]:
from scipy.optimize import linear_sum_assignment
[33]:
np.random.seed(2018)
kmeans = KMeans(nc, n_init=10)
c1 = kmeans.fit(X).cluster_centers_
z1 = kmeans.predict(X)
c2 = kmeans.fit(X1).cluster_centers_
z2 = kmeans.predict(X1)
zs = [z1, z2]
cs = [c1, c2]
[34]:
plt.figure(figsize=(8, 4))
for i, xs in enumerate([X, X1]):
    plt.subplot(1,2,i+1)
    z = zs[i]
    c = cs[i]

    plt.scatter(xs[:, 0], xs[:, 1], s=5, c=z,
                cmap=plt.cm.get_cmap('Accent', nc))
    plt.xticks([])
    plt.yticks([])
    plt.axis('square')
plt.tight_layout()
pass
../_images/notebooks_B04_Clustering_53_0.png

Define cost function

We use a simple cost: the pairwise distances between the two sets of cluster centers. More complex dissimilarity measures could be used.

[35]:
cost = cdist(c1, c2)

Run the Hungarian (Munkres) algorithm for bipartite matching

[36]:
row_ind, col_ind = linear_sum_assignment(cost)
[37]:
row_ind, col_ind
[37]:
(array([0, 1, 2, 3, 4, 5]), array([0, 1, 5, 3, 4, 2]))

Swap cluster indices to align the two data sets

[38]:
z1_aligned = col_ind[z1]
zs = [z1_aligned, z2]
[39]:
plt.figure(figsize=(8, 4))
for i, xs in enumerate([X, X1]):
    plt.subplot(1,2,i+1)
    plt.scatter(xs[:, 0], xs[:, 1], s=5, c=zs[i],
                cmap=plt.cm.get_cmap('Accent', nc))
    plt.xticks([])
    plt.yticks([])
    plt.axis('square')
plt.tight_layout()
pass
../_images/notebooks_B04_Clustering_61_0.png
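
Since X1 is just X plus noise, the rows of the two samples correspond, so we can check that the relabeling improves point-by-point agreement between the two fits (a quick sanity check, not part of the alignment itself):

# Fraction of matching labels before and after mapping c1's labels onto c2's.
np.mean(z1 == z2), np.mean(z1_aligned == z2)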

Probabilistic alternatives to k-means

There are several advantages to fitting a probabilistic model:

  • There is a probabilistic interpretation

  • It is generative

  • You have a log likelihood to work with - e.g. evaluate model fit with AIC and BIC
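
To make these points concrete, here is a minimal sketch using scikit-learn's GaussianMixture (also imported in the next cell), which exposes soft assignments, per-point log likelihoods, and sampling:

from sklearn.mixture import GaussianMixture

gmm = GaussianMixture(n_components=nc).fit(X)
resp = gmm.predict_proba(X)            # soft assignments: P(component | point)
ll = gmm.score_samples(X)              # per-point log likelihood under the fitted mixture
new_pts, new_labels = gmm.sample(10)   # the model is generative: draw synthetic data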

[40]:
from sklearn.mixture import GaussianMixture

Model selection

We can use so-called information criteria that approximate the leave-one-out error. Two commonly used criteria are

  • AIC - Akaike Information Criterion

  • BIC - Bayesian Information Criterion

Basically, both add a penalty term to the negative log likelihood to account for model complexity. BIC is generally more conservative than AIC (i.e. BIC prefers less complex models).
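
Concretely, with \(\hat{L}\) the maximized likelihood of the fitted model, \(k\) the number of free parameters, and \(n\) the number of observations,

\[\text{AIC} = 2k - 2\log\hat{L}, \qquad \text{BIC} = k\log n - 2\log\hat{L}\]

The \(k \log n\) penalty in BIC grows with the sample size, which is why BIC tends to select fewer components than AIC.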

[41]:
aics = []
bics = []
for i in range(4, 10):
    gmm = GaussianMixture(n_components=i)
    gmm.fit(X)
    aics.append(gmm.aic(X))
    bics.append(gmm.bic(X))
[42]:
min(list(zip(range(4,10), aics)), key = lambda x: x[1])
[42]:
(6, 8939.46780097899)
[43]:
plt.plot(range(4, 10), aics, '-o');
../_images/notebooks_B04_Clustering_67_0.png
[44]:
min(list(zip(range(4,10), bics)), key = lambda x: x[1])
[44]:
(5, 9095.225715313982)
[45]:
plt.plot(range(4, 10), bics, '-o');
../_images/notebooks_B04_Clustering_69_0.png

Generative model

You can draw synthetic samples from the fitted distribution.

[46]:
gmm = GaussianMixture(n_components=nc)
gmm.fit(X)
xs, zs = gmm.sample(1000)
[47]:
plt.scatter(xs[:, 0], xs[:, 1], s=5, c=zs,
            cmap=plt.cm.get_cmap('Accent', nc))
plt.axis('square')
pass
../_images/notebooks_B04_Clustering_72_0.png