In [2]:
%matplotlib inline
import matplotlib as mpl
import matplotlib.pyplot as plt
In [3]:
import numpy as np
import numpy.linalg as la
\[\newcommand{\norm}[1]{\left\lVert#1\right\rVert}\]

Nonlinear Dimension Reduction

In many cases of dimension reduction for clustering, we want to preserve the cluster structure in the lower dimensional projection.

In [4]:
np.random.seed(123)
np.set_printoptions(3)

Limitations of PCA

We will project a 2-d data set onto 1-d to see one limitation of PCA. This provides motivation for learning non-linear methods of dimension reduction.

In [5]:
x1 = np.random.multivariate_normal([-3,3], np.eye(2), 100)
x2 = np.random.multivariate_normal([3,3], np.eye(2), 100)
x3 = np.random.multivariate_normal([0,-10], np.eye(2), 100)
xs = np.r_[x1, x2, x3]
xs = (xs - xs.mean(0))/xs.std()
zs = np.r_[np.zeros(100), np.ones(100), 2*np.ones(100)]
In [6]:
plt.scatter(xs[:, 0], xs[:, 1], c=zs)
plt.axis('equal')
pass
[Figure: scatter plot of the three simulated 2D clusters, colored by cluster label]

PCA does not preserve locality

In [7]:
from sklearn.decomposition import PCA
In [8]:
pca = PCA(n_components=1)
ys = pca.fit_transform(xs)
In [9]:
plt.scatter(ys[:, 0], np.random.uniform(-1, 1, len(ys)), c=zs)
plt.axhline(0, c='red')
pass
[Figure: 1D PCA projection with random vertical jitter, colored by cluster label]

MDS preserves some locality

In [10]:
from sklearn.manifold import MDS
In [12]:
mds = MDS(n_components=1)
In [13]:
ms = mds.fit_transform(xs)
In [15]:
plt.scatter(ms[:, 0], np.random.uniform(-1, 1, len(ms)), c=zs)
plt.axhline(0, c='red')
pass
[Figure: 1D MDS projection with random vertical jitter, colored by cluster label]

t-SNE preserves cluster structure

In [9]:
from sklearn.manifold import TSNE
In [10]:
tsne = TSNE(n_components=1)
ts = tsne.fit_transform(xs)
In [11]:
plt.scatter(ts[:, 0], np.random.uniform(-1, 1, len(ts)), c=zs)
plt.axhline(0, c='red')
pass
[Figure: 1D t-SNE projection with random vertical jitter, colored by cluster label]

Multi-dimensional scaling (MDS)

MDS starts with a dissimilarity matrix. When the dissimilarities satisfy a metric, this is equivalent to a distance matrix. Recall that a metric or distance function has the following properties

\[\begin{split}d(x, y) \ge 0 \\ d(x, y) = 0 \iff x = y \\ d(x, y) = d(y, x) \\ d(x, z) \le d(x, y) + d(y, z)\end{split}\]
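
As a quick illustration (a sketch I am adding, not part of the original notebook), we can check these properties numerically for the Euclidean distance on a few random points:

In [ ]:
# Sketch: verify the metric axioms numerically for the Euclidean distance.
import numpy as np
import numpy.linalg as la

x, y, z = np.random.RandomState(0).normal(size=(3, 2))
d = lambda a, b: la.norm(a - b)

assert d(x, y) >= 0                         # non-negativity
assert np.isclose(d(x, x), 0)               # d(x, x) = 0
assert np.isclose(d(x, y), d(y, x))         # symmetry
assert d(x, z) <= d(x, y) + d(y, z)         # triangle inequality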

Intuitive explanation of MDS

MDS tries to map points \(y_i\) in the original space \(\mathbb{R}^n\) to matching points \(x_i\) in a map space \(\mathbb{R}^k\) such that the sum of squared differences between pairwise distances is minimized. Note that we do not need the original coordinates of the points in \(\mathbb{R}^n\) - all we need is the distance matrix \(D\). Conceptually MDS tries to minimize a quantity similar to

\[\sum_{i < j} \left( D_{ij} - \norm{x_i - x_j} \right)^2\]

Since translation does not affect pairwise distances, a further constraint is that \(\sum x_i = 0\).
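
To make this concrete, here is a minimal sketch (my illustration, not part of the original notebook; the points and distance matrix are made up) that recovers a 1D configuration from a distance matrix alone by direct numerical minimization, then centers the solution so that \(\sum x_i = 0\):

In [ ]:
# Sketch: recover a 1D configuration from a distance matrix alone by
# minimizing the sum of squared differences between the given distances
# and the distances of the candidate configuration.
import numpy as np
from scipy.optimize import minimize
from scipy.spatial.distance import pdist, squareform

y = np.array([0.0, 1.0, 3.0, 7.0])       # hypothetical original 1D points
D = squareform(pdist(y[:, None]))        # the only input MDS needs

def objective(x):
    d = squareform(pdist(x[:, None]))
    return np.sum((D - d) ** 2)

res = minimize(objective, np.random.randn(len(y)))
x = res.x - res.x.mean()                 # center so that sum(x_i) = 0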

PCA and MDS

Consider the following distances between 3 points \(A, B, C\)

\[\begin{split}d(A, B) = 1 \\ d(B, C) = 1 \\ d(A, C) = 2 \\\end{split}\]

There is a 1D map that perfectly captures the distances between the 3 points: for example, place \(A, B, C\) at 0, 1 and 2 on a line.

Now consider instead

\[\begin{split}d(A, B) = 1 \\ d(B, C) = 1 \\ d(A, C) = 1\end{split}\]

There is no 1D map that perfectly captures the distances between the 3 points, but this is easily done in 2D (an equilateral triangle). This shows that it is not always possible to find a configuration in the map space \(\mathbb{R}^k\) that preserves the original distances in \(\mathbb{R}^n\).
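
A quick numerical check of both cases (a sketch I am adding, not from the original notebook):

In [ ]:
# Case 1: a 1D map exists -- A, B, C at 0, 1, 2 reproduce the distances.
# Case 2: no 1D map exists, but an equilateral triangle in 2D works.
import numpy as np
from scipy.spatial.distance import pdist

print(pdist(np.array([[0.], [1.], [2.]])))                     # AB, AC, BC = 1, 2, 1
print(pdist(np.array([[0, 0], [1, 0], [0.5, np.sqrt(3)/2]])))  # all distances = 1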

Stress

Let \(D_{ij}\) be the distance between \(y_i\) and \(y_j\) in the original space, and \(d_{ij}\) be the distance between \(x_i\) and \(x_j\) in the map space. Then the cost function for MDS is

\(\text{Stress} = \left( \frac{\sum_{i < j} (D_{ij} - f(d_{ij}))^2}{\sum_{i < j} d_{ij}^{2}} \right)^{\frac{1}{2}}\)

for some monotonic function \(f\) (usually taken as the identity for quantitative variables). This is solved by iterative numerical optimization.
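
As an illustration (a sketch I am adding; it reuses the variables `xs`, `ys` and `ms` from the toy example above), the stress of a given embedding can be computed directly from the formula with \(f\) as the identity:

In [ ]:
# Sketch: Kruskal stress with f taken as the identity.
from scipy.spatial.distance import pdist

def stress(X_orig, X_map):
    D = pdist(X_orig)                    # distances in the original space
    d = pdist(X_map)                     # distances in the map space
    return np.sqrt(np.sum((D - d) ** 2) / np.sum(d ** 2))

print(stress(xs, ys))                    # 1D PCA projection from above
print(stress(xs, ms))                    # MDS embedding from above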

Strain

There is a variant of MDS known as classical MDS that uses a different optimization criterion

\(\text{Strain} = \left( \frac{\sum_{i < j} (B_{ij} - \langle x_i, x_j \rangle)^2}{\sum_{i < j} B_{ij}^{2}} \right)^{\frac{1}{2}}\)

where \(B = X^TX\). Finding \(X\) reduces to a PCA of \(B\).

Note that classical MDS does not give the same solution as metric MDS. Note also that classical MDS assumes that the distances are Euclidean.
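
The reduction to an eigendecomposition is short enough to sketch directly (my illustration, not the notebook's code): recover \(B\) from the squared distances by double centering, then scale the top eigenvectors by the square roots of their eigenvalues.

In [ ]:
# Sketch of classical MDS: double-center the squared distance matrix to get B,
# then take the top-k eigenvectors scaled by the square roots of the eigenvalues.
import numpy as np
from scipy.spatial.distance import pdist, squareform

def classical_mds(D, k=2):
    n = D.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n      # centering matrix
    B = -0.5 * J @ (D ** 2) @ J              # double-centered Gram matrix
    evals, evecs = np.linalg.eigh(B)         # ascending eigenvalues
    idx = np.argsort(evals)[::-1][:k]        # indices of the top k
    return evecs[:, idx] * np.sqrt(np.maximum(evals[idx], 0))

Y = np.random.RandomState(0).normal(size=(20, 5))
X = classical_mds(squareform(pdist(Y)), k=2)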

Example

In [12]:
from sklearn.manifold import MDS
In [13]:
import pandas as pd
In [14]:
data = np.loadtxt('data/simu_SAGMB.txt')

Subsample

In [15]:
n = 1000
idx = np.random.choice(range(len(data)), n, replace=False)
data = data[idx]
In [16]:
plt.prism()
In [17]:
plt.figure(figsize=(12,12))
for i in range(8):
    for j in range(i+1, 8):
        plt.subplot(8, 8, i*8+j+1)
        x, y = data[:, i], data[:, j]
        plt.scatter(x, y, s=1)
        plt.xticks([])
        plt.yticks([])
plt.tight_layout()
[Figure: grid of pairwise scatter plots for the 8 variables in the subsampled data]

PCA

In [18]:
pca2 = PCA(n_components=2)
In [19]:
%%time

x_pca2 = pca2.fit_transform(data)
CPU times: user 12 ms, sys: 0 ns, total: 12 ms
Wall time: 2.33 ms
In [20]:
plt.scatter(x_pca2[:, 0], x_pca2[:, 1], s=5)
plt.axis('square')
pass
[Figure: 2D PCA projection of the data]

MDS from data

In [16]:
mds = MDS(n_components=2)
In [22]:
%%time

x_mds = mds.fit_transform(data)
CPU times: user 49.9 s, sys: 868 ms, total: 50.8 s
Wall time: 21.2 s
In [23]:
%matplotlib inline
import matplotlib as mpl
import matplotlib.pyplot as plt
In [24]:
plt.scatter(x_mds[:, 0], x_mds[:, 1], s=5)
plt.axis('square')
pass
[Figure: 2D MDS embedding of the data]

MDS from dissimilarity matrix

Sometimes you start with a dissimilarity matrix (e.g. from observer ranking) rather than a set of vectors. You can still use MDS.

In [25]:
from scipy.spatial.distance import pdist, squareform
In [26]:
import numpy as np
In [27]:
d = pdist(data, metric='euclidean')
In [28]:
np.set_printoptions(precision=2)
squareform(d)
Out[28]:
array([[0.  , 2.97, 2.55, ..., 3.12, 7.35, 3.94],
       [2.97, 0.  , 3.46, ..., 3.59, 7.12, 4.52],
       [2.55, 3.46, 0.  , ..., 0.94, 5.71, 1.79],
       ...,
       [3.12, 3.59, 0.94, ..., 0.  , 5.57, 1.75],
       [7.35, 7.12, 5.71, ..., 5.57, 0.  , 5.67],
       [3.94, 4.52, 1.79, ..., 1.75, 5.67, 0.  ]])
In [29]:
mds2 = MDS(dissimilarity='precomputed')
In [30]:
%%time

x_mds2 = mds2.fit_transform(squareform(d))
CPU times: user 46.8 s, sys: 804 ms, total: 47.6 s
Wall time: 20.3 s
In [31]:
plt.scatter(x_mds2[:, 0], x_mds2[:, 1], s=5)
plt.axis('square')
pass
[Figure: 2D MDS embedding computed from the precomputed dissimilarity matrix]

t-distributed Stochastic Neighbor Embedding (t-SNE)

The t-SNE algorithm was designed to preserve local distances between points in the original space, as we saw in the example above. This means that t-SNE is particularly effective at preserving clusters in the original space. The full t-SNE algorithm is quite complex, so we just sketch the ideas here.

For more details, see the original series of papers and this Python tutorial. The algorithm is also clearly laid out in this fairly comprehensive tutorial.

Outline of t-SNE

t-SNE is similar in outline to MDS, with two main differences: “distances” are based on probabilistic concepts, and they depend on the local neighborhood of each point.

Original space

  • Find the conditional similarity between points in the original space based on a Gaussian kernel

\[p_{j \mid i} = \frac{f(\vert y_i - y_j \vert)}{\sum_{k \ne i} {f(\vert y_i - y_k \vert)}}\]

where

\[f(z) = e^{-\frac{z^2}{2\sigma_i^2}}\]
  • Symmetrize the conditional similarity (this is necessary because each kernel has its own variance)

\[p_{ij} = \frac{p_{i \mid j} + p_{j \mid i}}{2}\]
  • This gives a similarity matrix \(p_{ij}\) that is fixed
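
To make the construction above concrete, here is a minimal sketch (my illustration; it uses a single fixed \(\sigma\) for all points and small random data, whereas real t-SNE tunes \(\sigma_i\) per point as described in the notes below):

In [ ]:
# Sketch: conditional similarities with a single fixed sigma, then symmetrized.
# Real t-SNE uses a different sigma_i for each point (see the notes below).
import numpy as np
from scipy.spatial.distance import pdist, squareform

Y = np.random.RandomState(0).normal(size=(10, 3))
D2 = squareform(pdist(Y)) ** 2               # squared pairwise distances
sigma = 1.0

logits = -D2 / (2 * sigma ** 2)
np.fill_diagonal(logits, -np.inf)            # exclude k = i from the sum
P_cond = np.exp(logits)
P_cond /= P_cond.sum(axis=1, keepdims=True)  # rows are p_{j|i}
P = (P_cond + P_cond.T) / 2                  # symmetrized p_{ij}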

Notes

  • In t-SNE, the variance of the Gaussian kernel depends on the point \(y_i\). Intuitively, we want the variance to be small if \(y_i\) is in a locally dense region, and to be large if \(y_i\) is in a locally sparse region. This is done by an iterative algorithm that depends on a user-defined variable called perplexity. Roughly, perplexity determines the number of meaningful neighbors each point should have; a sketch of this search is given below.
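
The sketch below (my illustration under simplified assumptions, not the reference implementation) shows the usual approach: the perplexity of the conditional distribution \(p_{\cdot \mid i}\) is \(2^{H(p_{\cdot \mid i})}\), and \(\sigma_i\) is adjusted by binary search until this matches the user-specified value.

In [ ]:
# Sketch: binary search for sigma_i so that the perplexity of p_{.|i}
# matches a user-specified target.
import numpy as np

def conditional_p(d2_i, sigma):
    """p_{j|i} for one point, given squared distances d2_i to the other points."""
    p = np.exp(-d2_i / (2 * sigma ** 2))
    return p / p.sum()

def find_sigma(d2_i, target_perplexity, lo=1e-10, hi=1e10, n_iter=50):
    for _ in range(n_iter):
        sigma = (lo + hi) / 2
        p = conditional_p(d2_i, sigma)
        H = -np.sum(p * np.log2(p + 1e-12))   # Shannon entropy in bits
        if 2 ** H > target_perplexity:        # too spread out: shrink sigma
            hi = sigma
        else:
            lo = sigma
    return sigma

d2 = np.random.RandomState(1).gamma(2.0, size=99)   # hypothetical squared distances
sigma_i = find_sigma(d2, target_perplexity=30)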

Map space

  • Find the conditional similarity between points in the map space based on a Cauchy kernel

\[q_{ij} = \frac{g(\vert x_i - x_j \vert)}{\sum_{k \ne i} {g(\vert x_i - x_k \vert)}}\]

where

\[g(z) = \frac{1}{1+z^2}\]
  • This gives a similarity matrix \(q_{ij}\) that depends on the points in the map space that we can vary

Optimization

  • Minimize the Kullback-Leibler divergence between \(p_{ij}\) and \(q_{ij}\)

\[\text{KL}(P \mid\mid Q) = \sum p_{ij} \log{\frac{p_{ij}}{q_{ij}}}\]
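
Putting the map-space similarities and the objective together, here is a minimal self-contained sketch (my illustration, reusing the fixed-\(\sigma\) construction of \(P\) from above on small random data):

In [ ]:
# Sketch: Cauchy-kernel similarities in the map space and the KL objective.
import numpy as np
from scipy.spatial.distance import pdist, squareform

rng = np.random.RandomState(0)
Y = rng.normal(size=(10, 3))                 # points in the original space
X = rng.normal(size=(10, 1))                 # a candidate 1D map

def gaussian_P(Y, sigma=1.0):
    logits = -squareform(pdist(Y)) ** 2 / (2 * sigma ** 2)
    np.fill_diagonal(logits, -np.inf)
    P = np.exp(logits)
    P /= P.sum(axis=1, keepdims=True)
    return (P + P.T) / 2                     # fixed similarity matrix p_{ij}

def cauchy_Q(X):
    Q = 1 / (1 + squareform(pdist(X)) ** 2)  # g(z) = 1 / (1 + z^2)
    np.fill_diagonal(Q, 0)                   # exclude k = i
    return Q / Q.sum(axis=1, keepdims=True)  # similarity matrix q_{ij}

def kl(P, Q, eps=1e-12):
    mask = P > 0
    return np.sum(P[mask] * np.log(P[mask] / (Q[mask] + eps)))

print(kl(gaussian_P(Y), cauchy_Q(X)))        # t-SNE adjusts X to drive this down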

Normal and Cauchy distributions

The Cauchy distribution has much fatter tails than the normal distribution. This means that two points that are widely separated in the original space will be pushed much further apart in the map space.

In [32]:
from scipy.stats import norm, cauchy
In [33]:
d1 = norm()
d2 = cauchy()
In [34]:
x = np.linspace(-10, 10, 100)
plt.plot(x, d1.pdf(x), c='blue')
plt.plot(x, d2.pdf(x), c='red')
plt.legend(['Gaussian', 'Cauchy'])
plt.tight_layout()
pass
[Figure: Gaussian and Cauchy densities on the same axes]

Points close together in the original space stay close together

In [35]:
x = np.linspace(-10, 0, 100)
In [36]:
from scipy.optimize import fsolve
In [37]:
p1 = fsolve(lambda x: d1.pdf(x) - 0.1, -2)
p2 = fsolve(lambda x: d2.pdf(x) - 0.1, -2)
In [38]:
plt.plot(x, d1.pdf(x), c='blue')
plt.plot(x, d2.pdf(x), c='red')
plt.axhline(0.1, linestyle='dashed')
plt.legend(['Gaussian', 'Cauchy'])
plt.title('Close up of PDF')
plt.scatter([p1, p2], [0.1, 0.1], c=['blue', 'red'])
xlim = plt.xlim()
pass
[Figure: close-up of the two densities where they cross the 0.1 level]

Points far apart in the original space are pushed even further apart

In [39]:
x = np.linspace(-10, -2, 100)
In [40]:
p1 = fsolve(lambda x: d1.pdf(x) - 0.01, -2)
p2 = fsolve(lambda x: d2.pdf(x) - 0.01, -2)
In [41]:
plt.plot(x, d1.pdf(x), c='blue')
plt.plot(x, d2.pdf(x), c='red')
plt.xlim(xlim)
plt.axhline(0.01, linestyle='dashed')
plt.scatter([p1, p2], [0.01, 0.01], c=['blue', 'red'])
plt.legend(['Gaussian', 'Cauchy'])
pass
[Figure: the two densities in the tail, where they cross the 0.01 level]
In [42]:
np.abs([d1.ppf(0.04), d2.ppf(0.04)])
Out[42]:
array([1.75, 7.92])

Example

In [43]:
tsne2 = TSNE(n_components=2)
In [44]:
%%time

x_tsne2 = tsne2.fit_transform(data)
CPU times: user 18 s, sys: 3.62 s, total: 21.6 s
Wall time: 21.6 s
In [45]:
plt.scatter(x_tsne2[:, 0], x_tsne2[:, 1], s=5)
plt.axis('square')
pass
[Figure: 2D t-SNE embedding of the data]
In [ ]: