Assignment 8: Supervised Learning
This should be a straightforward assignment and is here just to provide a concrete example of supervised learning.
Load the titanic data set from seaborn. We will try to predict survival from the other variables.
In [1]:
import seaborn as sns
titanic = sns.load_dataset('titanic')
Ex 1. (10 points)
Is the data set balanced or imbalanced? If it is badly imbalanced (say minority class under 20% of total), use down-sampling of the majority class to generate a balanced data set. Drop columns with any missing values.
In [ ]:
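One possible sketch, using the titanic frame loaded above (the name df for the reduced frame is my own choice):

# Check the class balance of the target
titanic['survived'].value_counts(normalize=True)  # roughly 62% / 38%, minority well above 20%

# Drop every column that contains missing values (age, deck, embarked, embark_town)
df = titanic.dropna(axis=1)
df.head()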
The data set is not too imbalanced, so no action taken.
Ex 2. (10 points)
Convert the categorical values into dummy-encoded variables, dropping the first value to avoid collinearity.
In [ ]:
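A sketch with pandas, assuming the df frame from Ex 1. In the seaborn titanic data the alive column duplicates survived, so it is dropped here to avoid leaking the target into the features:

import pandas as pd

# drop_first=True removes one level per categorical to avoid collinearity among the dummies
df_enc = pd.get_dummies(df.drop(columns='alive'), drop_first=True)
df_enc.head()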
Ex 3. (10 points)
Split the data into 70% training and 30% test data sets using stratified sampling on the sex.
In [ ]:
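A sketch assuming df_enc from Ex 2; after dummy encoding, sex survives as the sex_male column, which is used for the stratification:

from sklearn.model_selection import train_test_split

X = df_enc.drop(columns='survived')
y = df_enc['survived']
# 70% train / 30% test, stratified on sex
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=X['sex_male'], random_state=0)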
Ex 4. (20 points)
Construct an sklearn Pipeline with the components StandardScaler, RidgeClassifier, and GridSearchCV. Train the Pipeline classifier, choosing a value for \(\lambda\) from \(\{0, 0.1, 1, 10\}\) using grid search with 5-fold cross-validation. Note that the \(\lambda\) parameter we use in the lecture is named alpha in RidgeClassifier.
In [ ]:
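A sketch of one way to set this up, assuming the train/test split from Ex 3 (the step and variable names are my own):

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import RidgeClassifier
from sklearn.model_selection import GridSearchCV

# Scale, then fit a ridge classifier; GridSearchCV tunes the classifier's
# alpha (the lecture's lambda) over {0, 0.1, 1, 10} with 5-fold CV.
pipe = Pipeline([('scaler', StandardScaler()),
                 ('ridge', RidgeClassifier())])
clf = GridSearchCV(pipe, param_grid={'ridge__alpha': [0, 0.1, 1, 10]}, cv=5)
clf.fit(X_train, y_train)
clf.best_params_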
Ex 5. (10 points)
Using the trained classifier, construct a confusion matrix for the test and predicted values.
In [14]:
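A sketch, assuming the fitted clf and the test split from above:

from sklearn.metrics import confusion_matrix

y_pred = clf.predict(X_test)
cm = confusion_matrix(y_test, y_pred)  # rows = true class, columns = predicted class
cm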
Ex 6. (10 points)
Using the confusion matrix, calculate accuracy, sensitivity, specificity, PPV, NPV, and the F1 score.
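One way to compute these by hand from the matrix above (for binary labels, sklearn orders the entries [[TN, FP], [FN, TP]]):

tn, fp, fn, tp = cm.ravel()
acc  = (tp + tn) / (tp + tn + fp + fn)   # accuracy
sens = tp / (tp + fn)                    # sensitivity (recall, TPR)
spec = tn / (tn + fp)                    # specificity (TNR)
ppv  = tp / (tp + fp)                    # positive predictive value (precision)
npv  = tn / (tn + fn)                    # negative predictive value
f1   = 2 * ppv * sens / (ppv + sens)     # harmonic mean of PPV and sensitivity
print(acc, sens, spec, ppv, npv, f1)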
Ex 7. (10 points)
Plot an ROC curve for the classifier.
In [16]:
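A sketch: RidgeClassifier has no predict_proba, but its decision_function scores are accepted by roc_curve:

import matplotlib.pyplot as plt
from sklearn.metrics import roc_curve, auc

scores = clf.decision_function(X_test)
fpr, tpr, _ = roc_curve(y_test, scores)
plt.plot(fpr, tpr, label='AUC = %.2f' % auc(fpr, tpr))
plt.plot([0, 1], [0, 1], linestyle='--')  # chance line
plt.xlabel('False positive rate')
plt.ylabel('True positive rate')
plt.legend()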
Ex 8. (20 points)
- Fit polynomial curves of orders 0, 1, 2, 3, 4, and 5 to the values of \(X\), \(y\) given below
- Using LOOCV, what is the degree of the best-fitting polynomial model? If this is not the true degree, explain why.
In [28]:
import numpy as np

np.random.seed(23)
n = 10
k = 3                                    # true polynomial degree
x = np.random.normal(0, 1, n)
X = np.c_[np.ones(n), x, x**2, x**3]     # design matrix for a cubic
beta = np.random.normal(0, 1, (k+1, 1))  # true coefficients
s = 0.5                                  # noise standard deviation
y = X @ beta
y += np.random.normal(0, s, y.shape)     # add Gaussian noise
In [ ]:
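A sketch of hand-rolled LOOCV with np.polyfit, assuming x, y, and n from the cell above: for each degree, each observation is held out once, the polynomial is fit on the remaining points, and the squared prediction errors are averaged.

for deg in range(6):
    errs = []
    for i in range(n):
        mask = np.arange(n) != i                      # leave observation i out
        coef = np.polyfit(x[mask], y[mask].ravel(), deg)
        pred = np.polyval(coef, x[i])                 # predict the held-out point
        errs.append((y[i, 0] - pred) ** 2)
    print(deg, np.mean(errs))                         # LOOCV mean squared error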