{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Assignment 5: Unsupervised Learning" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**1**. 15 points\n", "\n", "First download the MNIST data set from https://pjreddie.com/media/files/mnist_train.csv and https://pjreddie.com/media/files/mnist_test.csv and place both files in the `data` sub-directory.\n", "\n", "- Load the training and test MNIST digits data sets from `data/mnist_train.csv` and `data/mnist_test.csv`, and split each into labels (column 0) and features (all other columns).\n", "- Each row is a vector of length 784 with values between 0 (black) and 255 (white) on the gray color scale.\n", "- Display the 3rd vector in the training set as a $28 \\times 28$ image using `matplotlib`. Write a helper function that, given a row number, the feature matrix and the label vector, plots the image with its corresponding label in the title." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "\n", "\n", "\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**2**. 20 points\n", "\n", "- Use PCA to reduce the number of dimensions of the training data set so that it includes just above 90% of the total variance. Remember to scale the data before doing PCA.\n", "- How many components are used?\n", "- Reconstruct the training set from the dimension-reduced one. Do this without using the `inverse_transform` method (you may use it to check your solution).\n", "- Show the image of the reconstructed data set for the vector in the third row." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "\n", "\n", "\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**3**. 15 points\n", "\n", "- Using the test dataset, first use PCA to reduce the dimensionality from 784 to 50. Remember to scale the data before doing PCA.\n", "- Now use TSNE to further reduce the 50-dimensional data set to 2. 
\n", "- Make a scatter plot of the data, coloring each point by its label.\n", "- Create a legend for the plot showing which color corresponds to which label.\n", "\n", "(Note: The TSNE transform will take a few minutes - go have coffee.)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "\n", "\n", "\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**4**. 50 points\n", "\n", "- Implement the k-means++ algorithm from the description given at https://en.wikipedia.org/wiki/K-means%2B%2B (summarized below).\n", "- Use k-means++ to initialize a k-means clustering of the 2-dimensional TSNE data, using your own code (i.e. do not use `scikit-learn` or similar libraries for this).\n", "- Align the true labels and the k-means labels, and show the two TSNE plots side-by-side with points colored by label.\n", "\n", "K-means++ algorithm to initialize centers\n", "\n", "- Choose one center uniformly at random from among the data points.\n", "- For each data point $x$, compute $D(x)$, the distance between $x$ and the nearest center that has already been chosen.\n", "- Choose one new data point at random as a new center, using a weighted probability distribution where a point $x$ is chosen with probability proportional to $D(x)^2$.\n", "- Repeat the previous two steps until $k$ centers have been chosen.\n", "- Now that the initial centers have been chosen, proceed using standard k-means clustering." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "\n", "\n", "\n" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.5" } }, "nbformat": 4, "nbformat_minor": 2 }