Assignment 5: Unsupervised Learning
1. 15 points
The MNIST data set must first be downloaded from https://pjreddie.com/media/files/mnist_train.csv and https://pjreddie.com/media/files/mnist_test.csv and placed in the `data` sub-directory.
- Load the training and test MNIST digits data sets from `data/mnist_train.csv` and `data/mnist_test.csv`, and split each into labels (column 0) and features (all other columns).
- Each row is a vector of length 784 with values between 0 (black) and 255 (white) on the gray color scale.
- Display the 3rd vector in the training set as a \(28 \times 28\) image using `matplotlib`. Write a helper function that, given a row number, the feature matrix, and the label vector, plots the image with its corresponding label in the title.
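One possible starting point (a sketch, not the required solution; it assumes the linked CSV files have no header row, and the helper name `show_digit` is illustrative):

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Load the CSVs; the files linked above have no header row.
train = pd.read_csv("data/mnist_train.csv", header=None)
test = pd.read_csv("data/mnist_test.csv", header=None)

# Column 0 holds the label; the remaining 784 columns are pixel values.
y_train, X_train = train.iloc[:, 0].to_numpy(), train.iloc[:, 1:].to_numpy()
y_test, X_test = test.iloc[:, 0].to_numpy(), test.iloc[:, 1:].to_numpy()

def show_digit(row, X, y):
    """Plot row `row` of feature matrix X as a 28x28 image, titled with its label from y."""
    plt.imshow(X[row].reshape(28, 28), cmap="gray")
    plt.title(f"Label: {y[row]}")
    plt.axis("off")
    plt.show()

show_digit(2, X_train, y_train)  # the 3rd vector (0-based index 2)
```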
2. 20 points
- Use PCA to reduce the number of dimensions of the training data set so that the retained components capture just over 90% of the total variance. Remember to scale the data before doing PCA.
- How many components are used?
- Reconstruct the training set from the dimension-reduced one. Do this without using the `inverse_transform` method (you can use it to check your solution).
- Show the image of the vector in the third row of the reconstructed data set.
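A sketch of one way to approach this, reusing `X_train`, `y_train`, and `show_digit` from the previous part. In scikit-learn, passing a float to `PCA(n_components=...)` selects the smallest number of components whose cumulative explained variance exceeds that fraction, which matches "just over 90%":

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Scale, then keep the smallest number of components explaining >90% of the variance.
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X_train)

pca = PCA(n_components=0.90)
Z = pca.fit_transform(X_scaled)
print(f"Components used: {pca.n_components_}")

# Manual reconstruction: project back with the component matrix and undo
# the centering/scaling by hand (no inverse_transform).
X_rec_scaled = Z @ pca.components_ + pca.mean_
X_rec = X_rec_scaled * scaler.scale_ + scaler.mean_

show_digit(2, X_rec, y_train)  # reconstructed 3rd vector
```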
3. 15 points
- Using the test dataset, first use PCA to reduce the dimensionality from 784 to 50. Remember to scale the data before doing PCA.
- Now use TSNE to further reduce the 50-dimensional data set to 2 dimensions.
- Plot a scatter plot of the data, coloring each point by its label.
- Create a legend for the plot showing which color corresponds to which label.
(Note: The TSNE transform will take a few minutes - go have coffee.)
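A sketch under the assumption that `X_test` and `y_test` are already loaded as in part 1; the t-SNE fit is the slow step:

```python
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE
from sklearn.preprocessing import StandardScaler

# Scale the test set, reduce 784 -> 50 with PCA, then 50 -> 2 with t-SNE.
X_scaled = StandardScaler().fit_transform(X_test)
X_50 = PCA(n_components=50).fit_transform(X_scaled)
X_2 = TSNE(n_components=2, random_state=0).fit_transform(X_50)  # takes a few minutes

# Scatter plot with one color per digit and a matching legend.
fig, ax = plt.subplots(figsize=(8, 8))
for digit in range(10):
    mask = y_test == digit
    ax.scatter(X_2[mask, 0], X_2[mask, 1], s=5, label=str(digit))
ax.legend(title="Digit")
plt.show()
```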
4. 50 points
- Implement the k-means++ algorithm from the description given at https://en.wikipedia.org/wiki/K-means%2B%2B (summarized below).
- Use k-means++ to initialize a k-means clustering of the TSNE 2-dimensional data, using your own code (i.e. do not use `scikit-learn` or similar libraries for this).
- Align the true labels and the k-means labels, and show the two TSNE plots side-by-side, coloring points by label values.
K-means++ algorithm to initialize centers
1. Choose one center uniformly at random from among the data points.
2. For each data point \(x\), compute \(D(x)\), the distance between \(x\) and the nearest center that has already been chosen.
3. Choose one new data point at random as a new center, using a weighted probability distribution where a point \(x\) is chosen with probability proportional to \(D(x)^2\).
4. Repeat Steps 2 and 3 until \(k\) centers have been chosen.
5. Now that the initial centers have been chosen, proceed using standard k-means clustering.
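A direct NumPy transcription of Steps 1 through 4 might look like the following (a sketch; `rng` is assumed to be a `numpy.random.Generator`):

```python
import numpy as np

def kmeans_pp_init(X, k, rng):
    """Choose k initial centers from X via the k-means++ weighting."""
    # Step 1: first center uniformly at random.
    centers = [X[rng.integers(len(X))]]
    for _ in range(k - 1):
        # Step 2: squared distance from each point to its nearest chosen center.
        C = np.array(centers)
        d2 = np.min(((X[:, None, :] - C[None, :, :]) ** 2).sum(-1), axis=1)
        # Step 3: sample the next center with probability proportional to D(x)^2.
        centers.append(X[rng.choice(len(X), p=d2 / d2.sum())])
    return np.array(centers)
```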
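Building on `kmeans_pp_init` above, a sketch of the remaining pieces: plain Lloyd's iterations in NumPy (no scikit-learn, per the assignment), label alignment via scipy's `linear_sum_assignment` (one common way to match cluster ids to digits; the assignment does not prescribe a method), and the side-by-side plots. `X_2` and `y_test` are assumed from part 3:

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.optimize import linear_sum_assignment

def kmeans(X, k, rng, n_iter=100):
    """Standard k-means (Lloyd's algorithm) with k-means++ initialization."""
    centers = kmeans_pp_init(X, k, rng)
    for _ in range(n_iter):
        # Assignment step: nearest center for each point.
        labels = np.argmin(((X[:, None, :] - centers[None, :, :]) ** 2).sum(-1), axis=1)
        # Update step: move each center to the mean of its assigned points.
        new_centers = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                                else centers[j] for j in range(k)])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels, centers

rng = np.random.default_rng(0)
cluster_labels, centers = kmeans(X_2, k=10, rng=rng)

# Align cluster ids to true digits by maximizing overlap (Hungarian method).
overlap = np.array([[np.sum((cluster_labels == c) & (y_test == d))
                     for d in range(10)] for c in range(10)])
rows, cols = linear_sum_assignment(-overlap)
aligned = cols[cluster_labels]  # map each cluster id to its matched digit

# Side-by-side TSNE plots colored by true and by aligned k-means labels.
fig, axes = plt.subplots(1, 2, figsize=(14, 7))
for ax, lab, title in [(axes[0], y_test, "True labels"),
                       (axes[1], aligned, "k-means labels (aligned)")]:
    for d in range(10):
        m = lab == d
        ax.scatter(X_2[m, 0], X_2[m, 1], s=5, label=str(d))
    ax.set_title(title)
axes[0].legend(title="Digit")
plt.show()
```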