{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Assignment 5: Unsupervised Learning"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**1**. 15 points\n",
    "\n",
    "The MNIST data set needs to be downloaded from https://pjreddie.com/media/files/mnist_train.csv and https://pjreddie.com/media/files/mnist_test.csv and put in the data sub-directory first.\n",
    "\n",
    "- Load the training and test MNIST digits data sets from `data/mnist_train.csv` and `data/mnist_test.csv`, and split into labels (column 0) and features (all other columns). \n",
    "- Each row is a vector of length 784 with values between 0 (black) and 255 (white) on the gray color scale. \n",
    "- Display the 3rd vector in the training set as a $28 \\times 28$ image using `matplotlib`, using a helper function that plots an image and its corresponding label in the title given its row number, the feature matrix and the label vector."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "\n",
    "\n",
    "\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**2**. 20 points\n",
    "\n",
    "- Use PCA to reduce the number of dimensions of the training data set so that it includes just above 90% of the total variance. Remember to scale the data before doing PCA.\n",
    "- How many components are used?\n",
    "- Reconstruct the training set from the dimension-reduced one. Do this without using the `inverse_transform` method (you can use this to check your solution)\n",
    "- Show the image of the reconstructed data set for the vector in the third row."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "\n",
    "\n",
    "\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**3**. 15 points\n",
    " \n",
    "- Using the test dataset, first use PCA to reduce the dimensionality from 784 to 50. Remember to scale the data before doing PCA.\n",
    "- Now use TSNE to further reduce the 50 dimensional data set to 2. \n",
    "- Plot a scatter plot of the data, coloring each point by its label. \n",
    "- Create a legend for the plot showing what color points go with what label\n",
    "\n",
    "(Note: The TSNE transform will take a few minutes - go have coffee.)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "\n",
    "\n",
    "\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**4**. 50 points\n",
    "\n",
    "- Implement the k-means++ algorithm from the description given at https://en.wikipedia.org/wiki/K-means%2B%2B (summarized below)\n",
    "- Use k-means++ to initialize a k-means clustering of the TsNE 2-dimensional data, using your own code (i.e. do not use `scikit-learn` or similar libraries for this)\n",
    "- Align the true labels and the k-means labels and show the two TSNE plots side-by-side with coloring of points by label values\n",
    "\n",
    "K-means++ algorithm to initialize centers\n",
    "\n",
    "- Choose one center uniformly at random from among the data points.\n",
    "- For each data point x, compute D(x), the distance between x and the nearest center that has already been chosen.\n",
    "- Choose one new data point at random as a new center, using a weighted probability distribution where a point x is chosen with probability proportional to D(x)2.\n",
    "- Repeat Steps 2 and 3 until k centers have been chosen.\n",
    "- Now that the initial centers have been chosen, proceed using standard k-means clustering."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "\n",
    "\n",
    "\n"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.6.5"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}