Assignment 5: Unsupervised Learning

1. 15 points

First download the MNIST training and test sets from https://pjreddie.com/media/files/mnist_train.csv and https://pjreddie.com/media/files/mnist_test.csv and place them in the data sub-directory.

  • Load the training and test MNIST digits data sets from data/mnist_train.csv and data/mnist_test.csv, and split into labels (column 0) and features (all other columns).
  • Each row is a vector of length 784 with grayscale values between 0 (black) and 255 (white).
  • Display the 3rd vector in the training set as a \(28 \times 28\) image using matplotlib. Write a helper function that, given a row number, the feature matrix, and the label vector, plots the image with its corresponding label in the title.
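A minimal sketch of such a helper is below. The function name `show_digit` and the synthetic stand-in data are my own choices, not part of the assignment; for the actual submission, load the CSVs from the paths above and pass the real feature matrix and label vector.

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs headless
import matplotlib.pyplot as plt

def show_digit(row, X, y):
    """Plot row `row` of feature matrix X as a 28x28 grayscale image,
    with the corresponding label from y in the title."""
    img = X[row].reshape(28, 28)
    plt.imshow(img, cmap="gray", vmin=0, vmax=255)
    plt.title(f"Label: {y[row]}")
    plt.axis("off")

# Loading sketch (column 0 is the label, the remaining 784 are pixels):
# train = np.loadtxt("data/mnist_train.csv", delimiter=",")
# y_train, X_train = train[:, 0].astype(int), train[:, 1:]

# Synthetic stand-in so this sketch is self-contained:
rng = np.random.default_rng(0)
X_demo = rng.integers(0, 256, size=(5, 784))
y_demo = np.arange(5)
show_digit(2, X_demo, y_demo)  # the 3rd vector is row index 2
```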
In [1]:




2. 20 points

  • Use PCA to reduce the number of dimensions of the training data set so that it includes just above 90% of the total variance. Remember to scale the data before doing PCA.
  • How many components are used?
  • Reconstruct the training set from the dimension-reduced one. Do this without using the inverse_transform method (you may use it to check your solution).
  • Show the image of the reconstructed data set for the vector in the third row.
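One way to sketch the PCA step and the manual reconstruction, on a small synthetic matrix rather than MNIST (the variable names and the low-rank test data are my own, not the assignment's):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

# Synthetic low-rank data standing in for the scaled training set.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 8)) @ rng.normal(size=(8, 20)) \
    + 0.1 * rng.normal(size=(200, 20))

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# A float n_components keeps the smallest number of components whose
# cumulative explained variance exceeds that fraction.
pca = PCA(n_components=0.90)
Z = pca.fit_transform(X_scaled)
print("components used:", pca.n_components_)

# Manual reconstruction: project back with the component matrix and add
# back the mean, then undo the scaling.
X_rec_scaled = Z @ pca.components_ + pca.mean_
X_rec = scaler.inverse_transform(X_rec_scaled)
```

For the assignment, `X_rec` is what you would reshape and plot for row 2; `pca.inverse_transform(Z)` should agree with the manual `X_rec_scaled`.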
In [1]:




3. 15 points

  • Using the test dataset, first use PCA to reduce the dimensionality from 784 to 50. Remember to scale the data before doing PCA.
  • Now use TSNE to further reduce the 50 dimensional data set to 2.
  • Plot a scatter plot of the data, coloring each point by its label.
  • Create a legend for the plot showing which color corresponds to which label.

(Note: The TSNE transform will take a few minutes - go have coffee.)
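A sketch of the PCA-then-TSNE pipeline, using sklearn's small built-in 8x8 digits instead of MNIST so it runs in seconds (that substitution, and the 300-sample cap, are mine; swap in the scaled MNIST test set for the actual answer):

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # headless backend for this sketch
import matplotlib.pyplot as plt
from sklearn.datasets import load_digits
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

# Stand-in data: 8x8 digits (64 features) rather than MNIST's 784.
X, y = load_digits(return_X_y=True)
X, y = X[:300], y[:300]

X_scaled = StandardScaler().fit_transform(X)
X_50 = PCA(n_components=50).fit_transform(X_scaled)  # PCA first speeds up TSNE
X_2 = TSNE(n_components=2, random_state=0).fit_transform(X_50)

# One scatter call per label lets plt.legend build a color -> label key.
for digit in np.unique(y):
    mask = y == digit
    plt.scatter(X_2[mask, 0], X_2[mask, 1], s=10, label=str(digit))
plt.legend(title="digit")
```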

In [1]:




4. 50 points

  • Implement the k-means++ algorithm from the description given at https://en.wikipedia.org/wiki/K-means%2B%2B (summarized below)
  • Use k-means++ to initialize a k-means clustering of the TSNE 2-dimensional data, using your own code (i.e. do not use scikit-learn or similar libraries for this)
  • Align the true labels and the k-means labels and show the two TSNE plots side-by-side with coloring of points by label values

K-means++ algorithm to initialize centers

  1. Choose one center uniformly at random from among the data points.
  2. For each data point \(x\), compute \(D(x)\), the distance between \(x\) and the nearest center that has already been chosen.
  3. Choose one new data point at random as a new center, using a weighted probability distribution where a point \(x\) is chosen with probability proportional to \(D(x)^2\).
  4. Repeat Steps 2 and 3 until \(k\) centers have been chosen.
  5. Now that the initial centers have been chosen, proceed using standard k-means clustering.
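The steps above can be sketched in NumPy as follows. This is one possible from-scratch implementation, not a reference solution: the function names, the fixed seeds, and the three-blob demo data are all mine, and the demo stands in for the TSNE output.

```python
import numpy as np

def kmeans_pp_init(X, k, rng):
    """k-means++ seeding: first center uniform, the rest D(x)^2-weighted."""
    n = X.shape[0]
    centers = [X[rng.integers(n)]]
    for _ in range(k - 1):
        # Squared distance from each point to its nearest chosen center.
        diffs = X[:, None, :] - np.array(centers)[None, :, :]
        d2 = np.min((diffs ** 2).sum(-1), axis=1)
        centers.append(X[rng.choice(n, p=d2 / d2.sum())])
    return np.array(centers)

def kmeans(X, k, n_iter=100, seed=0):
    """Standard Lloyd iterations starting from k-means++ centers."""
    rng = np.random.default_rng(seed)
    centers = kmeans_pp_init(X, k, rng)
    for _ in range(n_iter):
        # Assignment step: nearest center for each point.
        diffs = X[:, None, :] - centers[None, :, :]
        labels = np.argmin((diffs ** 2).sum(-1), axis=1)
        # Update step: mean of each cluster (keep old center if one empties).
        new_centers = np.array([X[labels == j].mean(axis=0)
                                if np.any(labels == j) else centers[j]
                                for j in range(k)])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels, centers

# Demo on three well-separated 2-D blobs standing in for the TSNE output.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(loc, 0.3, size=(50, 2))
               for loc in ([0, 0], [5, 5], [0, 5])])
labels, centers = kmeans(X, k=3)
```

For the label-alignment step, one common approach is to match each cluster to the majority true label among its members before plotting the two TSNE scatter plots side by side.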
In [1]: