Assignment 5: Unsupervised Learning

1. 15 points

First download the MNIST training and test sets from https://pjreddie.com/media/files/mnist_train.csv and https://pjreddie.com/media/files/mnist_test.csv and place them in the data sub-directory.

  • Load the training and test MNIST digits data sets from data/mnist_train.csv and data/mnist_test.csv, and split into labels (column 0) and features (all other columns).
  • Each row is a vector of length 784 with grayscale values between 0 (black) and 255 (white).
  • Display the 3rd vector in the training set as a \(28 \times 28\) image using matplotlib. Write a helper function that, given a row number, the feature matrix, and the label vector, plots the image with its corresponding label in the title.
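A minimal sketch of such a helper is below. The function name `show_digit` and the synthetic stand-in data are my own choices, not part of the assignment; for the actual submission, load the CSVs from the paths above and pass the real feature matrix and label vector.

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs headless
import matplotlib.pyplot as plt

def show_digit(row, X, y):
    """Plot row `row` of feature matrix X as a 28x28 grayscale image,
    with the corresponding label from y in the title."""
    img = X[row].reshape(28, 28)
    plt.imshow(img, cmap="gray", vmin=0, vmax=255)
    plt.title(f"Label: {y[row]}")
    plt.axis("off")

# Loading sketch (column 0 is the label, the remaining 784 are pixels):
# train = np.loadtxt("data/mnist_train.csv", delimiter=",")
# y_train, X_train = train[:, 0].astype(int), train[:, 1:]

# Synthetic stand-in so this sketch is self-contained:
rng = np.random.default_rng(0)
X_demo = rng.integers(0, 256, size=(5, 784))
y_demo = np.arange(5)
show_digit(2, X_demo, y_demo)  # the 3rd vector is row index 2
```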
In [1]:




2. 20 points

  • Use PCA to reduce the number of dimensions of the training data set so that it includes just above 90% of the total variance. Remember to scale the data before doing PCA.
  • How many components are used?
  • Reconstruct the training set from the dimension-reduced one. Do this without using the inverse_transform method (you may use it to check your solution).
  • Show the image of the reconstructed data set for the vector in the third row.
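One way to sketch the PCA step and the manual reconstruction, on a small synthetic matrix rather than MNIST (the variable names and the low-rank test data are my own, not the assignment's):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

# Synthetic low-rank data standing in for the scaled training set.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 8)) @ rng.normal(size=(8, 20)) \
    + 0.1 * rng.normal(size=(200, 20))

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# A float n_components keeps the smallest number of components whose
# cumulative explained variance exceeds that fraction.
pca = PCA(n_components=0.90)
Z = pca.fit_transform(X_scaled)
print("components used:", pca.n_components_)

# Manual reconstruction: project back with the component matrix and add
# back the mean, then undo the scaling.
X_rec_scaled = Z @ pca.components_ + pca.mean_
X_rec = scaler.inverse_transform(X_rec_scaled)
```

For the assignment, `X_rec` is what you would reshape and plot for row 2; `pca.inverse_transform(Z)` should agree with the manual `X_rec_scaled`.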
In [1]:




3. 15 points

  • Using the test dataset, first use PCA to reduce the dimensionality from 784 to 50. Remember to scale the data before doing PCA.
  • Now use TSNE to further reduce the 50 dimensional data set to 2.
  • Plot a scatter plot of the data, coloring each point by its label.
  • Create a legend for the plot showing which color corresponds to which label.

(Note: The TSNE transform will take a few minutes - go have coffee.)
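A sketch of the PCA-then-TSNE pipeline, using sklearn's small built-in 8x8 digits instead of MNIST so it runs in seconds (that substitution, and the 300-sample cap, are mine; swap in the scaled MNIST test set for the actual answer):

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # headless backend for this sketch
import matplotlib.pyplot as plt
from sklearn.datasets import load_digits
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

# Stand-in data: 8x8 digits (64 features) rather than MNIST's 784.
X, y = load_digits(return_X_y=True)
X, y = X[:300], y[:300]

X_scaled = StandardScaler().fit_transform(X)
X_50 = PCA(n_components=50).fit_transform(X_scaled)  # PCA first speeds up TSNE
X_2 = TSNE(n_components=2, random_state=0).fit_transform(X_50)

# One scatter call per label lets plt.legend build a color -> label key.
for digit in np.unique(y):
    mask = y == digit
    plt.scatter(X_2[mask, 0], X_2[mask, 1], s=10, label=str(digit))
plt.legend(title="digit")
```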

In [1]:




4. 50 points

  • Implement the k-means++ algorithm from the description given at https://en.wikipedia.org/wiki/K-means%2B%2B (summarized below)
  • Use k-means++ to initialize a k-means clustering of the TSNE 2-dimensional data, using your own code (i.e. do not use scikit-learn or similar libraries for this)
  • Align the true labels and the k-means labels and show the two TSNE plots side-by-side with coloring of points by label values

K-means++ algorithm to initialize centers

  1. Choose one center uniformly at random from among the data points.
  2. For each data point \(x\), compute \(D(x)\), the distance between \(x\) and the nearest center that has already been chosen.
  3. Choose one new data point at random as a new center, using a weighted probability distribution where a point \(x\) is chosen with probability proportional to \(D(x)^2\).
  4. Repeat Steps 2 and 3 until \(k\) centers have been chosen.
  5. Now that the initial centers have been chosen, proceed using standard k-means clustering.
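The steps above can be sketched in NumPy as follows. This is one possible from-scratch implementation, not a reference solution: the function names, the fixed seeds, and the three-blob demo data are all mine, and the demo stands in for the TSNE output.

```python
import numpy as np

def kmeans_pp_init(X, k, rng):
    """k-means++ seeding: first center uniform, the rest D(x)^2-weighted."""
    n = X.shape[0]
    centers = [X[rng.integers(n)]]
    for _ in range(k - 1):
        # Squared distance from each point to its nearest chosen center.
        diffs = X[:, None, :] - np.array(centers)[None, :, :]
        d2 = np.min((diffs ** 2).sum(-1), axis=1)
        centers.append(X[rng.choice(n, p=d2 / d2.sum())])
    return np.array(centers)

def kmeans(X, k, n_iter=100, seed=0):
    """Standard Lloyd iterations starting from k-means++ centers."""
    rng = np.random.default_rng(seed)
    centers = kmeans_pp_init(X, k, rng)
    for _ in range(n_iter):
        # Assignment step: nearest center for each point.
        diffs = X[:, None, :] - centers[None, :, :]
        labels = np.argmin((diffs ** 2).sum(-1), axis=1)
        # Update step: mean of each cluster (keep old center if one empties).
        new_centers = np.array([X[labels == j].mean(axis=0)
                                if np.any(labels == j) else centers[j]
                                for j in range(k)])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels, centers

# Demo on three well-separated 2-D blobs standing in for the TSNE output.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(loc, 0.3, size=(50, 2))
               for loc in ([0, 0], [5, 5], [0, 5])])
labels, centers = kmeans(X, k=3)
```

For the label-alignment step, one common approach is to match each cluster to the majority true label among its members before plotting the two TSNE scatter plots side by side.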
In [1]: