Assignment 8: Supervised Learning

This assignment should be straightforward; it is here to provide a concrete example of supervised learning.

Load the titanic data set from seaborn. We will try to predict survival from the other variables.

In [1]:
import seaborn as sns
titanic = sns.load_dataset('titanic')

Ex 1. (10 points)

Is the data set balanced or imbalanced? If it is badly imbalanced (say, the minority class is under 20% of the total), down-sample the majority class to generate a balanced data set. Drop columns with any missing values.

In [ ]:
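
A minimal sketch, assuming the titanic frame loaded above (the code below reuses and overwrites that variable):

# Check the class balance of the target.
print(titanic['survived'].value_counts(normalize=True))

# Drop every column that contains any missing value (e.g. age and deck).
titanic = titanic.dropna(axis=1)
print(titanic.columns.tolist())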

The data set is not badly imbalanced (the minority class, survived, is roughly 38% of the total), so no down-sampling is needed.

Ex 2. (10 points)

Convert the categorical variables into dummy-encoded variables, dropping the first level to avoid collinearity.

In [ ]:
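
One possible sketch using pandas.get_dummies; the name titanic_enc for the encoded frame is an assumption:

import pandas as pd

# Encode all object/category columns as 0/1 indicators; drop_first=True
# removes one level per variable to avoid perfect collinearity.
titanic_enc = pd.get_dummies(titanic, drop_first=True)
titanic_enc.head()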

Ex 3. (10 points)

Split the data into 70% training and 30% test data sets using stratified sampling on sex.

In [ ]:
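
A sketch with train_test_split, assuming the dummy column from Ex 2 is named sex_male and using an arbitrary random_state for reproducibility:

from sklearn.model_selection import train_test_split

X = titanic_enc.drop(columns=['survived'])
y = titanic_enc['survived']

# 70/30 split, stratified on the dummy-encoded sex column.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=X['sex_male'], random_state=0)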

Ex 4. (20 points)

Construct an sklearn Pipeline with the components StandardScaler and RidgeClassifier, and tune it with GridSearchCV. Train the pipeline classifier, choosing the value of \(\lambda\) from \(\lambda \in \{0, 0.1, 1, 10\}\) using grid search with 5-fold cross-validation. Note that the \(\lambda\) parameter we use in the lecture is named alpha in RidgeClassifier.

In [ ]:
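
A sketch of one way to wire this up: a Pipeline of StandardScaler and RidgeClassifier, wrapped in GridSearchCV over the pipeline's ridge__alpha parameter (the step names here are arbitrary):

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import RidgeClassifier
from sklearn.model_selection import GridSearchCV

pipe = Pipeline([('scaler', StandardScaler()),
                 ('ridge', RidgeClassifier())])

# alpha plays the role of the lecture's lambda.
grid = GridSearchCV(pipe,
                    param_grid={'ridge__alpha': [0, 0.1, 1, 10]},
                    cv=5)
grid.fit(X_train, y_train)
print(grid.best_params_)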

Ex 5. (10 points)

Using the trained classifier, construct a confusion matrix from the true test labels and the predicted values.

In [ ]:
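
A minimal sketch, reusing the fitted grid object from Ex 4:

from sklearn.metrics import confusion_matrix

y_pred = grid.predict(X_test)
# Rows are true classes, columns are predictions: [[TN, FP], [FN, TP]].
cm = confusion_matrix(y_test, y_pred)
print(cm)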

Ex 6. (10 points)

Using the confusion matrix, calculate the accuracy, sensitivity, specificity, PPV, NPV, and F1 score.

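A sketch deriving the metrics directly from the confusion matrix of Ex 5:

# confusion_matrix for labels {0, 1} unravels as TN, FP, FN, TP.
tn, fp, fn, tp = cm.ravel()

acc  = (tp + tn) / (tp + tn + fp + fn)
sens = tp / (tp + fn)          # sensitivity = recall = TPR
spec = tn / (tn + fp)          # specificity = TNR
ppv  = tp / (tp + fp)          # positive predictive value = precision
npv  = tn / (tn + fn)          # negative predictive value
f1   = 2 * ppv * sens / (ppv + sens)
print(acc, sens, spec, ppv, npv, f1)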

Ex 7. (10 points)

Plot an ROC curve for the classifier.

In [ ]:
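
A sketch assuming a recent scikit-learn with RocCurveDisplay; RidgeClassifier exposes decision_function, which supplies the scores for the curve:

import matplotlib.pyplot as plt
from sklearn.metrics import RocCurveDisplay

RocCurveDisplay.from_estimator(grid, X_test, y_test)
plt.show()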

Ex 8. (20 points)

  • Fit polynomial curves of degree 0, 1, 2, 3, 4, and 5 to the values of \(x\), \(y\) given below.
  • Using LOOCV, what is the degree of the best-fitting polynomial model? If this is not the true degree, explain why. (See the sketch after the empty cell below.)
In [28]:
import numpy as np

np.random.seed(23)

n = 10                                # number of observations
k = 3                                 # true polynomial degree
x = np.random.normal(0, 1, n)
X = np.c_[np.ones(n), x, x**2, x**3]  # design matrix: 1, x, x^2, x^3
beta = np.random.normal(0, 1, (k+1, 1))
s = 0.5                               # noise standard deviation
y = X @ beta
y += np.random.normal(0, s, y.shape)
In [ ]:
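
A sketch using PolynomialFeatures and LeaveOneOut; with only n = 10 noisy points, the LOOCV minimum need not land on the true degree k = 3:

from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import LeaveOneOut, cross_val_score

loo = LeaveOneOut()
for d in range(6):
    model = make_pipeline(PolynomialFeatures(degree=d), LinearRegression())
    # Mean squared leave-one-out error for a degree-d polynomial fit.
    mse = -cross_val_score(model, x.reshape(-1, 1), y.ravel(),
                           cv=loo, scoring='neg_mean_squared_error').mean()
    print(f'degree {d}: LOOCV MSE = {mse:.3f}')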