{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Assignment 8: Supervised Learning\n", "\n", "This assignment should be straightforward; it is here to provide a concrete example of supervised learning." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Load the Titanic data set from `seaborn`. We will try to predict survival from the other variables." ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "import seaborn as sns\n", "titanic = sns.load_dataset('titanic')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Ex 1**. (10 points)\n", "\n", "Is the data set balanced or imbalanced? If it is badly imbalanced (say, the minority class is under 20% of the total), use down-sampling of the majority class to generate a balanced data set. Drop columns with any missing values." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The data set is not too imbalanced, so no action is taken." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Ex 2**. (10 points)\n", "\n", "Convert the categorical variables into dummy-encoded variables, dropping the first level of each to avoid collinearity." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Ex 3**. (10 points)\n", "\n", "Split the data into 70% training and 30% test data sets using stratified sampling on sex." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Ex 4**. (20 points)\n", "\n", "Construct an `sklearn` `Pipeline` with the components `StandardScaler`, `RidgeClassifier`, and `GridSearchCV`. 
Train the `Pipeline` classifier, choosing $\\lambda$ from $\\{0, 0.1, 1, 10\\}$ using grid search with 5-fold cross-validation. Note that the $\\lambda$ parameter used in the lecture is named `alpha` in `RidgeClassifier`." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Ex 5**. (10 points)\n", "\n", "Using the trained classifier, construct a confusion matrix comparing the true test labels with the predicted labels." ] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Ex 6**. (10 points)\n", "\n", "Using the confusion matrix, calculate accuracy, sensitivity, specificity, PPV, NPV, and the F1 score." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Routine" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Ex 7**. (10 points)\n", "\n", "Plot an ROC curve for the classifier." ] }, { "cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Ex 8**. (20 points)\n", "\n", "- Fit polynomial curves of degree 0, 1, 2, 3, 4, and 5 to the values of $X$ and $y$ given below.\n", "- Using leave-one-out cross-validation (LOOCV), what is the degree of the best-fitting polynomial model? If this is not the true degree, explain why." 
] }, { "cell_type": "code", "execution_count": 28, "metadata": {}, "outputs": [], "source": [ "import numpy as np\n", "\n", "np.random.seed(23)\n", "\n", "n = 10  # number of observations\n", "k = 3   # true polynomial degree\n", "x = np.random.normal(0,1,n)\n", "X = np.c_[np.ones(n), x, x**2, x**3]  # cubic design matrix\n", "beta = np.random.normal(0, 1, (k+1,1))\n", "s = 0.5  # noise standard deviation\n", "y = X@beta\n", "y += np.random.normal(0, s, y.shape)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.5" } }, "nbformat": 4, "nbformat_minor": 2 }