In [2]:
%matplotlib inline
import matplotlib as mpl
import matplotlib.pyplot as plt
In [3]:
import numpy as np
import numpy.linalg as la
\[\newcommand{\norm}[1]{\left\lVert#1\right\rVert}\]

Nonlinear Dimension Reduction

In many cases of dimension reduction for clustering, we want to preserve the cluster structure in the lower dimensional projection.

In [4]:
np.random.seed(123)
np.set_printoptions(3)

Limitations of PCA

We will project a 2-d data set onto 1-d to see one limitation of PCA. This provides motivation for learning non-linear methods of dimension reduction.

In [5]:
x1 = np.random.multivariate_normal([-3,3], np.eye(2), 100)
x2 = np.random.multivariate_normal([3,3], np.eye(2), 100)
x3 = np.random.multivariate_normal([0,-10], np.eye(2), 100)
xs = np.r_[x1, x2, x3]
xs = (xs - xs.mean(0))/xs.std()
zs = np.r_[np.zeros(100), np.ones(100), 2*np.ones(100)]
In [6]:
plt.scatter(xs[:, 0], xs[:, 1], c=zs)
plt.axis('equal')
pass
[Figure: scatter plot of the three simulated 2D clusters, colored by cluster label]

PCA does not preserve locality

In [7]:
from sklearn.decomposition import PCA
In [8]:
pca = PCA(n_components=1)
ys = pca.fit_transform(xs)
In [9]:
plt.scatter(ys[:, 0], np.random.uniform(-1, 1, len(ys)), c=zs)
plt.axhline(0, c='red')
pass
[Figure: 1D PCA projection with random vertical jitter, colored by cluster label]

MDS preserves some locality

In [10]:
from sklearn.manifold import MDS
In [12]:
mds = MDS(n_components=1)
In [13]:
ms = mds.fit_transform(xs)
In [15]:
plt.scatter(ms[:, 0], np.random.uniform(-1, 1, len(ms)), c=zs)
plt.axhline(0, c='red')
pass
[Figure: 1D MDS projection with random vertical jitter, colored by cluster label]

t-SNE preserves cluster structure

In [9]:
from sklearn.manifold import TSNE
In [10]:
tsne = TSNE(n_components=1)
ts = tsne.fit_transform(xs)
In [11]:
plt.scatter(ts[:, 0], np.random.uniform(-1, 1, len(ts)), c=zs)
plt.axhline(0, c='red')
pass
[Figure: 1D t-SNE projection with random vertical jitter, colored by cluster label]

Multi-dimensional scaling (MDS)

MDS starts with a dissimilarity matrix. When the dissimilarities satisfy a metric, this is equivalent to a distance matrix. Recall that a metric or distance function has the following properties

\[\begin{split}d(x, y) \ge 0 \\ d(x, y) = 0 \iff x = y \\ d(x, y) = d(y, x) \\ d(x, z) \le d(x, y) + d(y, z)\end{split}\]
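
As a quick illustration (a sketch I am adding, not part of the original notebook), we can check these properties numerically for the Euclidean distance on a few random points:

In [ ]:
# Sketch: verify the metric axioms numerically for the Euclidean distance.
import numpy as np
import numpy.linalg as la

x, y, z = np.random.RandomState(0).normal(size=(3, 2))
d = lambda a, b: la.norm(a - b)

assert d(x, y) >= 0                         # non-negativity
assert np.isclose(d(x, x), 0)               # d(x, x) = 0
assert np.isclose(d(x, y), d(y, x))         # symmetry
assert d(x, z) <= d(x, y) + d(y, z)         # triangle inequality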

Intuitive explanation of MDS

MDS tries to map points \(y_i\) in the original space \(\mathbb{R}^n\) to matching points \(x_i\) in a map space \(\mathbb{R}^k\) such that the sum of squared differences between pairwise distances is minimized. Note that we do not need the original coordinates of the points in \(\mathbb{R}^n\) - all we need is the distance matrix \(D\). Conceptually MDS tries to minimize a quantity similar to

\[\sum_{i < j} \left( D_{ij} - \norm{x_i - x_j} \right)^2\]

Since translation does not affect pairwise distances, a further constraint is that \(\sum x_i = 0\).
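
To make this concrete, here is a minimal sketch (my illustration, not part of the original notebook; the points and distance matrix are made up) that recovers a 1D configuration from a distance matrix alone by direct numerical minimization, then centers the solution so that \(\sum x_i = 0\):

In [ ]:
# Sketch: recover a 1D configuration from a distance matrix alone by
# minimizing the sum of squared differences between the given distances
# and the distances of the candidate configuration.
import numpy as np
from scipy.optimize import minimize
from scipy.spatial.distance import pdist, squareform

y = np.array([0.0, 1.0, 3.0, 7.0])       # hypothetical original 1D points
D = squareform(pdist(y[:, None]))        # the only input MDS needs

def objective(x):
    d = squareform(pdist(x[:, None]))
    return np.sum((D - d) ** 2)

res = minimize(objective, np.random.randn(len(y)))
x = res.x - res.x.mean()                 # center so that sum(x_i) = 0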

PCA and MDS

Consider the following distances between 3 points \(A, B, C\)

\[\begin{split}d(A, B) = 1 \\ d(B, C) = 1 \\ d(A, C) = 2 \\\end{split}\]

There is a 1D map that perfectly captures the distances between the 3 points: for example, place \(A, B, C\) at 0, 1 and 2 on a line.

Now consider instead

\[\begin{split}d(A, B) = 1 \\ d(B, C) = 1 \\ d(A, C) = 1\end{split}\]

There is no 1D map that perfectly captures the distances between the 3 points, but this is easily done in 2D (an equilateral triangle). This shows that it is not always possible to find a configuration in the map space \(\mathbb{R}^k\) that preserves the original distances in \(\mathbb{R}^n\).
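
A quick numerical check of both cases (a sketch I am adding, not from the original notebook):

In [ ]:
# Case 1: a 1D map exists -- A, B, C at 0, 1, 2 reproduce the distances.
# Case 2: no 1D map exists, but an equilateral triangle in 2D works.
import numpy as np
from scipy.spatial.distance import pdist

print(pdist(np.array([[0.], [1.], [2.]])))                     # AB, AC, BC = 1, 2, 1
print(pdist(np.array([[0, 0], [1, 0], [0.5, np.sqrt(3)/2]])))  # all distances = 1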

Stress

Let \(D_{ij}\) be the distance between \(y_i\) and \(y_j\) in the original space, and \(d_{ij}\) be the distance between \(x_i\) and \(x_j\) in the map space. Then the cost function for MDS is

\(\text{Stress} = \left( \frac{\sum_{i < j} (D_{ij} - f(d_{ij}))^2}{\sum_{i < j} d_{ij}^{2}} \right)^{\frac{1}{2}}\)

for some monotonic function \(f\) (usually taken as the identity for quantitative variables). This is solved by iterative numerical optimization.
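
As an illustration (a sketch I am adding; it reuses the variables `xs`, `ys` and `ms` from the toy example above), the stress of a given embedding can be computed directly from the formula with \(f\) as the identity:

In [ ]:
# Sketch: Kruskal stress with f taken as the identity.
from scipy.spatial.distance import pdist

def stress(X_orig, X_map):
    D = pdist(X_orig)                    # distances in the original space
    d = pdist(X_map)                     # distances in the map space
    return np.sqrt(np.sum((D - d) ** 2) / np.sum(d ** 2))

print(stress(xs, ys))                    # 1D PCA projection from above
print(stress(xs, ms))                    # MDS embedding from above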

Strain

There is a variant of MDS known as classical MDS that uses a different optimization criterion

\(\text{Strain} = \left( \frac{\sum_{i < j} (B_{ij} - \langle x_i, x_j \rangle)^2}{\sum_{i < j} B_{ij}^{2}} \right)^{\frac{1}{2}}\)

where \(B = X^TX\). Finding \(X\) reduces to a PCA of \(B\).

Note that classical MDS does not give the same solution as metric MDS. Note also that classical MDS assumes that the distances are Euclidean.
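
The reduction to an eigendecomposition is short enough to sketch directly (my illustration, not the notebook's code): recover \(B\) from the squared distances by double centering, then scale the top eigenvectors by the square roots of their eigenvalues.

In [ ]:
# Sketch of classical MDS: double-center the squared distance matrix to get B,
# then take the top-k eigenvectors scaled by the square roots of the eigenvalues.
import numpy as np
from scipy.spatial.distance import pdist, squareform

def classical_mds(D, k=2):
    n = D.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n      # centering matrix
    B = -0.5 * J @ (D ** 2) @ J              # double-centered Gram matrix
    evals, evecs = np.linalg.eigh(B)         # ascending eigenvalues
    idx = np.argsort(evals)[::-1][:k]        # indices of the top k
    return evecs[:, idx] * np.sqrt(np.maximum(evals[idx], 0))

Y = np.random.RandomState(0).normal(size=(20, 5))
X = classical_mds(squareform(pdist(Y)), k=2)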

Example

In [12]:
from sklearn.manifold import MDS
In [13]:
import pandas as pd
In [14]:
data = np.loadtxt('data/simu_SAGMB.txt')

Subsample

In [15]:
n = 1000
idx = np.random.choice(range(len(data)), n, replace=False)
data = data[idx]
In [16]:
plt.prism()
In [17]:
plt.figure(figsize=(12,12))
for i in range(8):
    for j in range(i+1, 8):
        plt.subplot(8, 8, i*8+j+1)
        x, y = data[:, i], data[:, j]
        plt.scatter(x, y, s=1)
        plt.xticks([])
        plt.yticks([])
plt.tight_layout()
[Figure: grid of pairwise scatter plots for the 8 variables in the subsampled data]

PCA

In [18]:
pca2 = PCA(n_components=2)
In [19]:
%%time

x_pca2 = pca2.fit_transform(data)
CPU times: user 12 ms, sys: 0 ns, total: 12 ms
Wall time: 2.33 ms
In [20]:
plt.scatter(x_pca2[:, 0], x_pca2[:, 1], s=5)
plt.axis('square')
pass
[Figure: 2D PCA projection of the data]

MDS from data

In [16]:
mds = MDS(n_components=2)
In [22]:
%%time

x_mds = mds.fit_transform(data)
CPU times: user 49.9 s, sys: 868 ms, total: 50.8 s
Wall time: 21.2 s
In [23]:
%matplotlib inline
import matplotlib as mpl
import matplotlib.pyplot as plt
In [24]:
plt.scatter(x_mds[:, 0], x_mds[:, 1], s=5)
plt.axis('square')
pass
[Figure: 2D MDS embedding of the data]

MDS from dissimilarity matrix

Sometimes you start with a dissimilarity matrix (e.g. from observer ranking) rather than a set of vectors. You can still use MDS.

In [25]:
from scipy.spatial.distance import pdist, squareform
In [26]:
import numpy as np
In [27]:
d = pdist(data, metric='euclidean')
In [28]:
np.set_printoptions(precision=2)
squareform(d)
Out[28]:
array([[0.  , 2.97, 2.55, ..., 3.12, 7.35, 3.94],
       [2.97, 0.  , 3.46, ..., 3.59, 7.12, 4.52],
       [2.55, 3.46, 0.  , ..., 0.94, 5.71, 1.79],
       ...,
       [3.12, 3.59, 0.94, ..., 0.  , 5.57, 1.75],
       [7.35, 7.12, 5.71, ..., 5.57, 0.  , 5.67],
       [3.94, 4.52, 1.79, ..., 1.75, 5.67, 0.  ]])
In [29]:
mds2 = MDS(dissimilarity='precomputed')
In [30]:
%%time

x_mds2 = mds2.fit_transform(squareform(d))
CPU times: user 46.8 s, sys: 804 ms, total: 47.6 s
Wall time: 20.3 s
In [31]:
plt.scatter(x_mds2[:, 0], x_mds2[:, 1], s=5)
plt.axis('square')
pass
[Figure: 2D MDS embedding computed from the precomputed dissimilarity matrix]

t-distributed Stochastic Neighbor Embedding (t-SNE)

The t-SNE algorithm was designed to preserve local distances between points in the original space, as we saw in the example above. This means that t-SNE is particularly effective at preserving clusters in the original space. The full t-SNE algorithm is quite complex, so we just sketch the ideas here.

For more details, see the original series of papers and this Python tutorial. The algorithm is also clearly laid out in this fairly comprehensive tutorial.

Outline of t-SNE

t-SNE is similar in outline to MDS, with two main differences: “distances” are based on probabilistic concepts, and they depend on the local neighborhood of each point.

Original space

  • Find the conditional similarity between points in the original space based on a Gaussian kernel

\[p_{j \mid i} = \frac{f(\vert y_i - y_j \vert)}{\sum_{k \ne i} {f(\vert y_i - y_k \vert)}}\]

where

\[f(z) = e^{-\frac{z^2}{2\sigma_i^2}}\]
  • Symmetrize the conditional similarity (this is necessary because each kernel has its own variance)

\[p_{ij} = \frac{p_{i \mid j} + p_{j \mid i}}{2}\]
  • This gives a similarity matrix \(p_{ij}\) that is fixed
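
To make the construction above concrete, here is a minimal sketch (my illustration; it uses a single fixed \(\sigma\) for all points and small random data, whereas real t-SNE tunes \(\sigma_i\) per point as described in the notes below):

In [ ]:
# Sketch: conditional similarities with a single fixed sigma, then symmetrized.
# Real t-SNE uses a different sigma_i for each point (see the notes below).
import numpy as np
from scipy.spatial.distance import pdist, squareform

Y = np.random.RandomState(0).normal(size=(10, 3))
D2 = squareform(pdist(Y)) ** 2               # squared pairwise distances
sigma = 1.0

logits = -D2 / (2 * sigma ** 2)
np.fill_diagonal(logits, -np.inf)            # exclude k = i from the sum
P_cond = np.exp(logits)
P_cond /= P_cond.sum(axis=1, keepdims=True)  # rows are p_{j|i}
P = (P_cond + P_cond.T) / 2                  # symmetrized p_{ij}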

Notes

  • In t-SNE, the variance of the Gaussian kernel depends on the point \(y_i\). Intuitively, we want the variance to be small if \(y_i\) is in a locally dense region, and to be large if \(y_i\) is in a locally sparse region. This is done by an iterative algorithm that depends on a user-defined variable called perplexity. Roughly, perplexity determines the number of meaningful neighbors each point should have; a sketch of this search is given below.
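
The sketch below (my illustration under simplified assumptions, not the reference implementation) shows the usual approach: the perplexity of the conditional distribution \(p_{\cdot \mid i}\) is \(2^{H(p_{\cdot \mid i})}\), and \(\sigma_i\) is adjusted by binary search until this matches the user-specified value.

In [ ]:
# Sketch: binary search for sigma_i so that the perplexity of p_{.|i}
# matches a user-specified target.
import numpy as np

def conditional_p(d2_i, sigma):
    """p_{j|i} for one point, given squared distances d2_i to the other points."""
    p = np.exp(-d2_i / (2 * sigma ** 2))
    return p / p.sum()

def find_sigma(d2_i, target_perplexity, lo=1e-10, hi=1e10, n_iter=50):
    for _ in range(n_iter):
        sigma = (lo + hi) / 2
        p = conditional_p(d2_i, sigma)
        H = -np.sum(p * np.log2(p + 1e-12))   # Shannon entropy in bits
        if 2 ** H > target_perplexity:        # too spread out: shrink sigma
            hi = sigma
        else:
            lo = sigma
    return sigma

d2 = np.random.RandomState(1).gamma(2.0, size=99)   # hypothetical squared distances
sigma_i = find_sigma(d2, target_perplexity=30)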

Map space

  • Find the conditional similarity between points in the map space based on a Cauchy kernel

\[q_{ij} = \frac{g(\vert x_i - x_j \vert)}{\sum_{k \ne i} {g(\vert x_i - x_k \vert)}}\]

where

\[g(z) = \frac{1}{1+z^2}\]
  • This gives a similarity matrix \(q_{ij}\) that depends on the points in the map space that we can vary

Optimization

  • Minimize the Kullback-Leibler divergence between \(p_{ij}\) and \(q_{ij}\)

\[\text{KL}(P \mid\mid Q) = \sum p_{ij} \log{\frac{p_{ij}}{q_{ij}}}\]
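
Putting the map-space similarities and the objective together, here is a minimal self-contained sketch (my illustration, reusing the fixed-\(\sigma\) construction of \(P\) from above on small random data):

In [ ]:
# Sketch: Cauchy-kernel similarities in the map space and the KL objective.
import numpy as np
from scipy.spatial.distance import pdist, squareform

rng = np.random.RandomState(0)
Y = rng.normal(size=(10, 3))                 # points in the original space
X = rng.normal(size=(10, 1))                 # a candidate 1D map

def gaussian_P(Y, sigma=1.0):
    logits = -squareform(pdist(Y)) ** 2 / (2 * sigma ** 2)
    np.fill_diagonal(logits, -np.inf)
    P = np.exp(logits)
    P /= P.sum(axis=1, keepdims=True)
    return (P + P.T) / 2                     # fixed similarity matrix p_{ij}

def cauchy_Q(X):
    Q = 1 / (1 + squareform(pdist(X)) ** 2)  # g(z) = 1 / (1 + z^2)
    np.fill_diagonal(Q, 0)                   # exclude k = i
    return Q / Q.sum(axis=1, keepdims=True)  # similarity matrix q_{ij}

def kl(P, Q, eps=1e-12):
    mask = P > 0
    return np.sum(P[mask] * np.log(P[mask] / (Q[mask] + eps)))

print(kl(gaussian_P(Y), cauchy_Q(X)))        # t-SNE adjusts X to drive this down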

Normal and Cauchy distributions

The Cauchy distribution has much fatter tails than the normal distribution. This means that two points that are widely separated in the original space will be pushed much further apart in the map space.

In [32]:
from scipy.stats import norm, cauchy
In [33]:
d1 = norm()
d2 = cauchy()
In [34]:
x = np.linspace(-10, 10, 100)
plt.plot(x, d1.pdf(x), c='blue')
plt.plot(x, d2.pdf(x), c='red')
plt.legend(['Gaussian', 'Cauchy'])
plt.tight_layout()
pass
[Figure: Gaussian and Cauchy densities on the same axes]

Points close together in the original space stay close together

In [35]:
x = np.linspace(-10, 0, 100)
In [36]:
from scipy.optimize import fsolve
In [37]:
p1 = fsolve(lambda x: d1.pdf(x) - 0.1, -2)
p2 = fsolve(lambda x: d2.pdf(x) - 0.1, -2)
In [38]:
plt.plot(x, d1.pdf(x), c='blue')
plt.plot(x, d2.pdf(x), c='red')
plt.axhline(0.1, linestyle='dashed')
plt.legend(['Gaussian', 'Cauchy'])
plt.title('Close up of PDF')
plt.scatter([p1, p2], [0.1, 0.1], c=['blue', 'red'])
xlim = plt.xlim()
pass
[Figure: close-up of the two densities where they cross the 0.1 level]

Points far apart in the original space are pushed even further apart

In [39]:
x = np.linspace(-10, -2, 100)
In [40]:
p1 = fsolve(lambda x: d1.pdf(x) - 0.01, -2)
p2 = fsolve(lambda x: d2.pdf(x) - 0.01, -2)
In [41]:
plt.plot(x, d1.pdf(x), c='blue')
plt.plot(x, d2.pdf(x), c='red')
plt.xlim(xlim)
plt.axhline(0.01, linestyle='dashed')
plt.scatter([p1, p2], [0.01, 0.01], c=['blue', 'red'])
plt.legend(['Gaussian', 'Cauchy'])
pass
[Figure: the two densities in the tail, where they cross the 0.01 level]
In [42]:
np.abs([d1.ppf(0.04), d2.ppf(0.04)])
Out[42]:
array([1.75, 7.92])

Example

In [43]:
tsne2 = TSNE(n_components=2)
In [44]:
%%time

x_tsne2 = tsne2.fit_transform(data)
CPU times: user 18 s, sys: 3.62 s, total: 21.6 s
Wall time: 21.6 s
In [45]:
plt.scatter(x_tsne2[:, 0], x_tsne2[:, 1], s=5)
plt.axis('square')
pass
[Figure: 2D t-SNE embedding of the data]
In [ ]: