In [2]:
%matplotlib inline
import matplotlib as mpl
import matplotlib.pyplot as plt
In [3]:
import numpy as np
import numpy.linalg as la
Nonlinear Dimension Reduction¶
In many cases of dimension reduction for clustering, we want to preserve the cluster structure in the lower dimensional projection.
In [4]:
np.random.seed(123)
np.set_printoptions(3)
Limitations of PCA¶
We will project a 2-d data set onto 1-d to see one limitation of PCA. This provides motivation for learning non-linear methods of dimension reduction.
In [5]:
x1 = np.random.multivariate_normal([-3,3], np.eye(2), 100)
x2 = np.random.multivariate_normal([3,3], np.eye(2), 100)
x3 = np.random.multivariate_normal([0,-10], np.eye(2), 100)
xs = np.r_[x1, x2, x3]
xs = (xs - xs.mean(0))/xs.std()
zs = np.r_[np.zeros(100), np.ones(100), 2*np.ones(100)]
In [6]:
plt.scatter(xs[:, 0], xs[:, 1], c=zs)
plt.axis('equal')
pass

PCA does not preserve locality¶
In [7]:
from sklearn.decomposition import PCA
In [8]:
pca = PCA(n_components=1)
ys = pca.fit_transform(xs)
In [9]:
plt.scatter(ys[:, 0], np.random.uniform(-1, 1, len(ys)), c=zs)
plt.axhline(0, c='red')
pass

MDS preserves some locality¶
In [10]:
from sklearn.manifold import MDS
In [12]:
mds = MDS(n_components=1)
In [13]:
ms = mds.fit_transform(xs)
In [15]:
plt.scatter(ms[:, 0], np.random.uniform(-1, 1, len(ms)), c=zs)
plt.axhline(0, c='red')
pass

t-SNE preserves cluster structure¶
In [9]:
from sklearn.manifold import TSNE
In [10]:
tsne = TSNE(n_components=1)
ts = tsne.fit_transform(xs)
In [11]:
plt.scatter(ts[:, 0], np.random.uniform(-1, 1, len(ts)), c=zs)
plt.axhline(0, c='red')
pass

Multi-dimensional scaling (MDS)¶
MDS starts with a dissimilarity matrix. When the dissimilarities satisfy the axioms of a metric, this is equivalent to a distance matrix. Recall that a metric or distance function \(d\) has the following properties:

- \(d(x, y) \ge 0\) (non-negativity)
- \(d(x, y) = 0\) if and only if \(x = y\)
- \(d(x, y) = d(y, x)\) (symmetry)
- \(d(x, z) \le d(x, y) + d(y, z)\) (triangle inequality)
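As a quick numerical sanity check (a minimal sketch, not part of MDS itself), we can verify these properties for Euclidean distances on a few of the points generated above:

In [ ]:
# Sketch: check the metric properties numerically for Euclidean distances
# on a small subset of the points generated above (illustrative only).
from scipy.spatial.distance import pdist, squareform

D = squareform(pdist(xs[:10]))
assert np.all(D >= 0)                                    # non-negativity
assert np.allclose(np.diag(D), 0)                        # d(x, x) = 0
assert np.allclose(D, D.T)                               # symmetry
# triangle inequality: d(i, k) <= d(i, j) + d(j, k) for all i, j, k
assert np.all(D[:, None, :] <= D[:, :, None] + D[None, :, :] + 1e-12)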
Intuitive explanation of MDS¶
MDS tries to map points \(y_i\) in the original space \(\mathbb{R}^n\) to matching points \(x_i\) in a map space \(\mathbb{R}^k\) such that the sum of squared differences between pairwise distances is minimized. Note that we do not need the original coordinates of the points in \(\mathbb{R}^n\) - all we need is the distance matrix \(D\). Conceptually, MDS tries to minimize a quantity similar to

\(\sum_{i < j} \left( \lVert y_i - y_j \rVert - \lVert x_i - x_j \rVert \right)^2\)

Since translation does not change pairwise distances, the solution is only determined up to translation, so a further constraint \(\sum_i x_i = 0\) is imposed.
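As a rough illustration of this objective (a sketch only, not how sklearn's MDS solver works), we can compare the original pairwise distances with those of a candidate map, here simply the first coordinate of xs:

In [ ]:
# Sketch of the conceptual MDS objective: sum of squared differences between
# original and map-space pairwise distances. The helper mds_objective is
# illustrative only.
from scipy.spatial.distance import pdist

def mds_objective(X_map, D_orig):
    """D_orig: condensed original distances; X_map: candidate map configuration."""
    return np.sum((D_orig - pdist(X_map))**2)

D_orig = pdist(xs)                       # distances in the original 2-d space
mds_objective(xs[:, :1], D_orig)         # cost of naively dropping one coordinate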

PCA and MDS¶
Consider the following distances between 3 points \(A, B, C\): for example, \(d(A,B) = d(B,C) = 1\) and \(d(A,C) = 2\). There is a 1D map that perfectly captures the distances between the 3 points (place the points at \(0, 1, 2\) on a line).

Now consider instead \(d(A,B) = d(B,C) = d(A,C) = 1\), so that the points form an equilateral triangle. There is no 1D map that perfectly captures the distances between the 3 points, but this is easily done in 2D. This shows that it is not always possible to find a configuration in the MDS map space (\(\mathbb{R}^k\)) that preserves the original distances in \(\mathbb{R}^n\).
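We can check this claim with sklearn's MDS using a precomputed distance matrix for the equilateral triangle above (a sketch; random_state is fixed only for reproducibility):

In [ ]:
# Sketch: fit MDS in 1D and 2D to the equilateral-triangle distances above.
# The 1D fit is left with non-zero stress; the 2D fit can drive it to (nearly) zero.
from sklearn.manifold import MDS

D3 = np.array([[0., 1., 1.],
               [1., 0., 1.],
               [1., 1., 0.]])
for k in (1, 2):
    m = MDS(n_components=k, dissimilarity='precomputed', random_state=0)
    m.fit(D3)
    print(k, m.stress_)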
Stress¶
Let \(D_{ij}\) be the distance between \(y_i\) and \(y_j\) in the original space, and \(d_{ij}\) be the distance between \(x_i\) and \(x_j\) in the map space. Then the cost function for MDS is
\(\text{Stress} = \left( \frac{\sum_{i < j} (D_{ij} - f(d_{ij}))^2}{\sum_{i < j} d_{ij}^{2}} \right)^{\frac{1}{2}}\)
for some monotonic function \(f\) (usually taken as the identity for quantitative variables). This is solved by iterative numerical optimization.
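With \(f\) taken as the identity, the stress of a candidate configuration can be computed directly; here is a minimal sketch applied to the 1-d PCA projection ys of xs from earlier:

In [ ]:
# Sketch: Kruskal stress with f = identity, applied to the 1-d PCA projection
# ys of xs from earlier (illustrative only).
from scipy.spatial.distance import pdist

def stress(D_orig, X_map):
    """D_orig: condensed original distances; X_map: candidate map configuration."""
    d_map = pdist(X_map)
    return np.sqrt(np.sum((D_orig - d_map)**2) / np.sum(d_map**2))

stress(pdist(xs), ys)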
Strain¶
There is a variant of MDS known as classical MDS that uses a different optimization criterion
\(\text{Strain} = \left( \frac{\sum_{i < j} (B_{ij} - \langle x_i, x_j \rangle)^2}{\sum_{i < j} B_{ij}^{2}} \right)^{\frac{1}{2}}\)
where \(B = X^TX\). Finding \(X\) reduces to a PCA of \(B\).
Note that classical MDS does not give the same solution as metric MDS. Note also that classical MDS assumes that the distances are Euclidean.
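To make the connection to PCA explicit, here is a minimal sketch of classical MDS (principal coordinates analysis): double-center the squared distance matrix to recover \(B\), then embed with its top eigenvectors. The helper classical_mds below is illustrative, not what sklearn uses.

In [ ]:
# Sketch of classical MDS: recover B from squared Euclidean distances by
# double centering, then embed using the top eigenvectors of B.
from scipy.spatial.distance import pdist, squareform

def classical_mds(D, k=2):
    """D: (n, n) matrix of Euclidean distances; returns an (n, k) embedding."""
    n = D.shape[0]
    J = np.eye(n) - np.ones((n, n))/n          # centering matrix
    B = -0.5 * J @ (D**2) @ J                  # double-centered Gram matrix
    evals, evecs = la.eigh(B)                  # eigenvalues in ascending order
    idx = np.argsort(evals)[::-1][:k]          # keep the k largest
    return evecs[:, idx] * np.sqrt(np.maximum(evals[idx], 0))

classical_mds(squareform(pdist(xs)), k=2)[:5]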
Example¶
In [12]:
from sklearn.manifold import MDS
In [13]:
import pandas as pd
In [14]:
data = np.loadtxt('data/simu_SAGMB.txt')
Subsample¶
In [15]:
n = 1000
idx = np.random.choice(range(len(data)), n, replace=False)
data = data[idx]
In [16]:
plt.prism()
<Figure size 432x288 with 0 Axes>
In [17]:
plt.figure(figsize=(12,12))
for i in range(8):
for j in range(i+1, 8):
plt.subplot(8, 8, i*8+j+1)
x, y = data[:, i], data[:, j]
plt.scatter(x, y, s=1)
plt.xticks([])
plt.yticks([])
plt.tight_layout()

PCA¶
In [18]:
pca2 = PCA(n_components=2)
In [19]:
%%time
x_pca2 = pca2.fit_transform(data)
CPU times: user 12 ms, sys: 0 ns, total: 12 ms
Wall time: 2.33 ms
In [20]:
plt.scatter(x_pca2[:, 0], x_pca2[:, 1], s=5)
plt.axis('square')
pass

MDS from data¶
In [16]:
mds = MDS(n_components=2)
In [22]:
%%time
x_mds = mds.fit_transform(data)
CPU times: user 49.9 s, sys: 868 ms, total: 50.8 s
Wall time: 21.2 s
In [23]:
%matplotlib inline
import matplotlib as mpl
import matplotlib.pyplot as plt
In [24]:
plt.scatter(x_mds[:, 0], x_mds[:, 1], s=5)
plt.axis('square')
pass

MDS from dissimilarity matrix¶
Sometimes you start with a dissimilarity matrix (e.g. from observer ranking) rather than a set of vectors. You can still use MDS.
In [25]:
from scipy.spatial.distance import pdist, squareform
In [26]:
import numpy as np
In [27]:
d = pdist(data, metric='euclidean')
In [28]:
np.set_printoptions(precision=2)
squareform(d)
Out[28]:
array([[0. , 2.97, 2.55, ..., 3.12, 7.35, 3.94],
[2.97, 0. , 3.46, ..., 3.59, 7.12, 4.52],
[2.55, 3.46, 0. , ..., 0.94, 5.71, 1.79],
...,
[3.12, 3.59, 0.94, ..., 0. , 5.57, 1.75],
[7.35, 7.12, 5.71, ..., 5.57, 0. , 5.67],
[3.94, 4.52, 1.79, ..., 1.75, 5.67, 0. ]])
In [29]:
mds2 = MDS(dissimilarity='precomputed')
In [30]:
%%time
x_mds2 = mds2.fit_transform(squareform(d))
CPU times: user 46.8 s, sys: 804 ms, total: 47.6 s
Wall time: 20.3 s
In [31]:
plt.scatter(x_mds2[:, 0], x_mds2[:, 1], s=5)
plt.axis('square')
pass

t-distributed Stochastic Neighbor Embedding (t-SNE)¶
The t-SNE algorithm was designed to preserve local distances between points in the original space, as we saw in the example above. This means that t-SNE is particularly effective at preserving clusters in the original space. The full t-SNE algorithm is quite complex, so we only sketch the ideas here.
For more details, see the original series of papers and this Python tutorial; the algorithm is also laid out clearly in that fairly comprehensive tutorial.
Outline of t-SNE¶
t-SNE is similar in outline to MDS, with two main differences: the “distances” are based on probabilistic concepts, and they depend on the local neighborhood of each point.
Original space¶
Find the conditional similarity between points in the original space based on a Gaussian kernel

\(p_{j \mid i} = \frac{f(\lVert x_i - x_j \rVert)}{\sum_{k \neq i} f(\lVert x_i - x_k \rVert)}\)

where

\(f(z) = \exp\left(-\frac{z^2}{2\sigma_i^2}\right)\)

Symmetrize the conditional similarity (this is necessary because each kernel has its own variance)

\(p_{ij} = \frac{p_{j \mid i} + p_{i \mid j}}{2N}\)

This gives a similarity matrix \(p_{ij}\) that is fixed.
Notes
In t-SNE, the variance of the Gaussian kernel depends on the point \(x_i\). Intuitively, we want the variance to be small if \(x_i\) is in a locally dense region, and to be large if \(x_i\) is in a locally sparse region. This is done by an iterative algorithm that depends on a user-defined variable called perplexity. Roughly, the perplexity determines the number of meaningful neighbors each point should have.
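Here is a minimal sketch of the original-space similarities, using a single fixed \(\sigma\) for every point instead of the per-point perplexity search that t-SNE actually performs; the helper p_matrix is purely illustrative.

In [ ]:
# Sketch: conditional similarities p_{j|i} with a Gaussian kernel and a single
# fixed sigma (real t-SNE tunes sigma_i per point to match the perplexity),
# then symmetrize to get p_{ij}.
from scipy.spatial.distance import pdist, squareform

def p_matrix(X, sigma=1.0):
    D2 = squareform(pdist(X, 'sqeuclidean'))
    P = np.exp(-D2 / (2 * sigma**2))
    np.fill_diagonal(P, 0)                     # p_{i|i} = 0
    P = P / P.sum(axis=1, keepdims=True)       # conditional p_{j|i}
    return (P + P.T) / (2 * len(X))            # symmetrized p_{ij}

P = p_matrix(xs)
P.sum()                                        # sums to 1 over all pairs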
Map space¶
Find the similarity between points in the map space based on a Cauchy kernel

\(q_{ij} = \frac{g(\lVert y_i - y_j \rVert)}{\sum_{k \neq l} g(\lVert y_k - y_l \rVert)}\)

where

\(g(z) = \frac{1}{1 + z^2}\)

This gives a similarity matrix \(q_{ij}\) that depends on the points in the map space, which we can vary.
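A matching sketch for the map-space similarities with the Cauchy (Student-t with one degree of freedom) kernel; the helper q_matrix is again illustrative, applied here to the 1-d t-SNE map ts computed earlier.

In [ ]:
# Sketch: map-space similarities q_{ij} using the Cauchy kernel,
# normalized over all pairs; applied to the 1-d t-SNE map ts from above.
from scipy.spatial.distance import pdist, squareform

def q_matrix(Y):
    D2 = squareform(pdist(Y, 'sqeuclidean'))
    Q = 1 / (1 + D2)
    np.fill_diagonal(Q, 0)                     # q_{ii} = 0
    return Q / Q.sum()                         # normalize over all pairs

Q = q_matrix(ts)
Q.sum()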
Optimization¶
Minimize the Kullback-Leibler divergence between \(p_{ij}\) and \(q_{ij}\) with respect to the points in the map space

\(KL(P \| Q) = \sum_{i \neq j} p_{ij} \log \frac{p_{ij}}{q_{ij}}\)
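Using the illustrative P and Q from the sketches above, the objective is just the KL divergence (t-SNE minimizes this by gradient descent on the map points):

In [ ]:
# Sketch: the KL divergence KL(P || Q) that t-SNE minimizes with respect to
# the map points, using the illustrative P and Q from the cells above.
eps = 1e-12                                    # guard against log(0)
np.sum(P * np.log(np.maximum(P, eps) / np.maximum(Q, eps)))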
Normal and Cauchy distributions¶
The Cauchy distribution has much fatter tails than the normal distribution. This means that two points that are widely separated in the original space would be pushed much further apart in the map space.
In [32]:
from scipy.stats import norm, cauchy
In [33]:
d1 = norm()
d2 = cauchy()
In [34]:
x = np.linspace(-10, 10, 100)
plt.plot(x, d1.pdf(x), c='blue')
plt.plot(x, d2.pdf(x), c='red')
plt.legend(['Gaussian', 'Cauchy'])
plt.tight_layout()
pass

Points close together in the original space stay close together¶
In [35]:
x = np.linspace(-10, 0, 100)
In [36]:
from scipy.optimize import fsolve
In [37]:
p1 = fsolve(lambda x: d1.pdf(x) - 0.1, -2)
p2 = fsolve(lambda x: d2.pdf(x) - 0.1, -2)
In [38]:
plt.plot(x, d1.pdf(x), c='blue')
plt.plot(x, d2.pdf(x), c='red')
plt.axhline(0.1, linestyle='dashed')
plt.legend(['Gaussian', 'Cauchy'])
plt.title('Close up of PDF')
plt.scatter([p1, p2], [0.1, 0.1], c=['blue', 'red'])
xlim = plt.xlim()
pass

Points far apart in the original space are pushed even further apart¶
In [39]:
x = np.linspace(-10, -2, 100)
In [40]:
p1 = fsolve(lambda x: d1.pdf(x) - 0.01, -2)
p2 = fsolve(lambda x: d2.pdf(x) - 0.01, -2)
In [41]:
plt.plot(x, d1.pdf(x), c='blue')
plt.plot(x, d2.pdf(x), c='red')
plt.xlim(xlim)
plt.axhline(0.01, linestyle='dashed')
plt.scatter([p1, p2], [0.01, 0.01], c=['blue', 'red'])
plt.legend(['Gaussian', 'Cauchy'])
pass

In [42]:
np.abs([d1.ppf(0.04), d2.ppf(0.04)])
Out[42]:
array([1.75, 7.92])
Example¶
In [43]:
tsne2 = TSNE(n_components=2)
In [44]:
%%time
x_tsne2 = tsne2.fit_transform(data)
CPU times: user 18 s, sys: 3.62 s, total: 21.6 s
Wall time: 21.6 s
In [45]:
plt.scatter(x_tsne2[:, 0], x_tsne2[:, 1], s=5)
plt.axis('square')
pass

In [ ]: