Machine Learning for Data Scientists¶
I expect that all of you will have taken, or are concurrently taking, an ML class that focuses on the mathematics and algorithms. Hence this class focuses mostly on practical aspects of ML that are often glossed over in more academic courses but are useful for a practicing data scientist.
A. Understanding ML¶
A1. What is ML?¶
Algorithms that get better at performing a task by learning from data
Contrast with explicit instruction or expert-constructed rules
A2. Data¶
A2.1. Labeled and unlabeled¶
Labeled \(\to\) supervised learning
Unlabeled \(\to\) unsupervised learning (or possibly self-supervised learning)
Future reward \(\to\) reinforcement learning
A2.2. Structured and unstructured¶
Structured \(\to\) tabular
Unstructured just means non-tabular - includes free text, images, video, audio, and sequences
In the past, unstructured data was first converted to structured data by feature engineering; this has been upended by deep learning methods
A2.3. Size¶
Number of observations
Number of features (dimensionality)
B. ML stages¶
B1. Data processing¶
We typically need to process the data for it to work with a broad class of ML models. For example, many ML algorithms will have problems in fitting or interpretation when there are categorical or free text columns, missing data, large variation in measurement scales or collinear features.
Note: To avoid data leakage, any preprocessing that has a fit stage should estimate parameters on training data only.
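For example, a scaler's mean and standard deviation should be estimated on the training split only and then applied unchanged to the test split. A minimal sketch, assuming scikit-learn is available and using illustrative toy data:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Toy data; values are illustrative only
X = np.arange(20, dtype=float).reshape(10, 2)
X_train, X_test = train_test_split(X, test_size=0.3, random_state=0)

# Fit the scaler on training data only, then apply to both splits
scaler = StandardScaler().fit(X_train)
X_train_s = scaler.transform(X_train)
X_test_s = scaler.transform(X_test)
```

Calling `fit` on the full dataset instead would let test-set statistics leak into the preprocessing step.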
B1.1. Category encoding¶
Encoding without labels
Encoding with labels
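One-hot encoding is the standard label-free approach: each category becomes an indicator column (encodings that use labels, such as target encoding, instead replace a category with a statistic of the label). A sketch, assuming scikit-learn:

```python
from sklearn.preprocessing import OneHotEncoder

# Toy categorical column; values are illustrative only
X = [["red"], ["green"], ["red"], ["blue"]]

# Label-free encoding: one indicator column per observed category
enc = OneHotEncoder(handle_unknown="ignore")
onehot = enc.fit_transform(X).toarray()
```

`handle_unknown="ignore"` keeps the encoder from failing on categories first seen at prediction time.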
B1.2. Missing data imputation¶
Types of missing data
MCAR
MAR
MNAR
Simple imputation
Fancy imputation
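The simplest imputation strategies replace each missing entry with a column statistic; "fancy" methods model the missing values from the other features. A mean-imputation sketch, assuming scikit-learn:

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Toy matrix with missing entries; values are illustrative only
X = np.array([[1.0, 2.0],
              [np.nan, 4.0],
              [7.0, np.nan]])

# Replace each NaN with the mean of its column
imp = SimpleImputer(strategy="mean")
X_filled = imp.fit_transform(X)
```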
B1.3. Feature selection¶
Uninformative variables
Collinear or multi-collinear variables
Dependent features
Recursive feature elimination
LASSO
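LASSO performs feature selection as a side effect of fitting: the L1 penalty drives coefficients of uninformative features to exactly zero. A sketch on synthetic data where only the first two features carry signal (assuming scikit-learn; the alpha value is illustrative):

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
# Only features 0 and 1 influence the target
y = 3 * X[:, 0] - 2 * X[:, 1] + rng.normal(scale=0.1, size=200)

# The L1 penalty zeroes out coefficients of irrelevant features
lasso = Lasso(alpha=0.1).fit(X, y)
selected = np.flatnonzero(lasso.coef_)
```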
B1.4. Shuffling¶
Shuffling breaks up any order to the observations
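When shuffling, features and labels must be permuted together so rows stay aligned. A sketch, assuming scikit-learn:

```python
from sklearn.utils import shuffle

# Toy data; values are illustrative only
X = [[1], [2], [3], [4]]
y = [1, 2, 3, 4]

# Shuffle X and y with the same permutation
X_s, y_s = shuffle(X, y, random_state=0)
```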
B1.5. Standardization¶
Distance-based models may be sensitive to scale
Convert features to have zero mean and unit standard deviation
B2. Model training¶
B2.1. Memorization and generalization¶
The entire point of any form of learning is generalization, not memorization
Model capacity is the amount of information a model can store
If model capacity \(\gg\) data complexity, the model will perform best by just memorizing the data \(\to\) over-fitting
If model capacity \(\ll\) data complexity, the model will not be very good \(\to\) under-fitting
Bias and variance
Bias-variance trade-off
B2.1.1. Tracking training and validation measures¶
Alternatively, track how training and validation measures diverge as the effective complexity of a high-capacity model grows over the course of training
B2.1.2. Remedies for over-fitting¶
Collect more data
Synthetic data and data augmentation
Pre-training
Early stopping
L1 and L2 regularization
Model-specific parameters for controlling model complexity
Dropout
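The effect of L2 regularization can be seen directly by comparing coefficient norms: ridge regression shrinks the coefficient vector toward zero relative to ordinary least squares. A sketch, assuming scikit-learn (the alpha value is illustrative):

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 10))  # few observations, many features
y = X[:, 0] + rng.normal(scale=0.5, size=20)

ols = LinearRegression().fit(X, y)
ridge = Ridge(alpha=10.0).fit(X, y)

# The L2 penalty shrinks the fitted coefficients toward zero
norm_ols = np.linalg.norm(ols.coef_)
norm_ridge = np.linalg.norm(ridge.coef_)
```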
B2.1.3. Data leakage¶
Data leakage occurs when your model has access to information it would not have when making predictions in practice
This results in over-optimistic model evaluations
B2.1.4. In-sample and out-of-sample prediction¶
Train-test split
Importance of out-of-sample prediction
Do not train your model on data that your deployed model will not have access to!
Model fitting must be done on training data only - this includes data preprocessing estimators
Importance of using pipelines
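A pipeline bundles preprocessing and model into a single estimator, so every `fit` automatically estimates preprocessing parameters on training data only. A sketch, assuming scikit-learn:

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Scaler and classifier are fit together; no leakage from the test split
pipe = Pipeline([("scale", StandardScaler()),
                 ("clf", LogisticRegression(max_iter=1000))])
pipe.fit(X_train, y_train)
acc = pipe.score(X_test, y_test)
```

The same `pipe` object can then be serialized and deployed as one unit.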
B2.2. Imbalanced data¶
Choice of evaluation metrics (e.g. Kappa)
Weighting samples
Majority under-sampling
Minority over-sampling
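Sample weighting is often the least invasive remedy: with the `"balanced"` heuristic, each class is weighted inversely to its frequency. A sketch, assuming scikit-learn:

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Toy labels with a 9:1 imbalance; values are illustrative only
y = np.array([0] * 90 + [1] * 10)

# 'balanced' weight for class c is n_samples / (n_classes * count(c))
weights = compute_class_weight(class_weight="balanced",
                               classes=np.array([0, 1]), y=y)
```

Many estimators accept the same heuristic directly via `class_weight="balanced"`.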
B2.3. Hyper-parameter tuning¶
Role of cross-validation
Grid search
Auto-tuning
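Grid search combines the two ideas above: every candidate setting is scored by cross-validation, and the best is refit on the full training data. A sketch, assuming scikit-learn (the grid values are illustrative):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

# 5-fold cross-validation over a small grid of candidate C values
grid = GridSearchCV(SVC(), param_grid={"C": [0.1, 1, 10]}, cv=5)
grid.fit(X, y)
best_C = grid.best_params_["C"]
```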
B3. Model evaluation¶
B3.1. Unsupervised learning metrics¶
B3.1.1. Clustering metrics¶
Rand index
Within and between cluster variance
Similarity of nearby clusters
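The (adjusted) Rand index compares two clusterings by agreement on pairs of points, so it is invariant to how cluster labels are named. A sketch, assuming scikit-learn:

```python
from sklearn.metrics import adjusted_rand_score

# Two labelings of the same partition, with the labels swapped
labels_true = [0, 0, 1, 1]
labels_pred = [1, 1, 0, 0]

# Identical partitions score 1.0 regardless of label names
ari = adjusted_rand_score(labels_true, labels_pred)
```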
B3.1.2. Information criteria for probabilistic models¶
Negative log likelihood and deviance
AIC
BIC
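Both criteria trade off fit (the maximized log likelihood \(\ln \hat L\)) against model size \(k\): \(\mathrm{AIC} = 2k - 2\ln \hat L\) and \(\mathrm{BIC} = k \ln n - 2\ln \hat L\), lower being better. A sketch with hypothetical log-likelihood values:

```python
import math

def aic(log_likelihood, k):
    # AIC = 2k - 2 ln L-hat
    return 2 * k - 2 * log_likelihood

def bic(log_likelihood, k, n):
    # BIC = k ln n - 2 ln L-hat; penalizes size more heavily for large n
    return k * math.log(n) - 2 * log_likelihood

# Hypothetical fits: the richer model fits slightly better but pays for it
aic_small = aic(-120.0, k=3)
aic_big = aic(-119.0, k=10)
```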
B3.2. Supervised learning metrics¶
B3.2.1. Predictions¶
predict
predict_proba
predict_log_proba
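These three methods return, respectively, hard class labels, class probabilities, and the log of those probabilities. A sketch, assuming scikit-learn:

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)
clf = LogisticRegression(max_iter=1000).fit(X, y)

labels = clf.predict(X[:3])               # hard class labels
proba = clf.predict_proba(X[:3])          # rows sum to 1
log_proba = clf.predict_log_proba(X[:3])  # log of the same probabilities
```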
B3.2.2. The base model¶
Dummy model (sanity check)
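A dummy model that ignores the features gives the floor any real model must beat. A sketch, assuming scikit-learn:

```python
from sklearn.datasets import load_iris
from sklearn.dummy import DummyClassifier

X, y = load_iris(return_X_y=True)

# Always predicts the most frequent class; a sanity-check baseline
dummy = DummyClassifier(strategy="most_frequent").fit(X, y)
baseline = dummy.score(X, y)
```

On iris, with three equally frequent classes, the baseline accuracy is 1/3.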
B3.2.3. Classification metrics¶
Confusion matrix and binary scores
ROC curve
PR curve
Cumulative gains curve
Discrimination threshold
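The confusion matrix and its derived scores work on hard predictions, while the ROC curve works on scores ranked across thresholds. A sketch with illustrative toy labels, assuming scikit-learn:

```python
from sklearn.metrics import (confusion_matrix, precision_score,
                             recall_score, roc_auc_score)

y_true = [0, 0, 0, 1, 1, 1]
y_pred = [0, 0, 1, 1, 1, 0]           # hard predictions
y_score = [0.1, 0.2, 0.6, 0.9, 0.8, 0.4]  # predicted probabilities

cm = confusion_matrix(y_true, y_pred)  # rows: true class, cols: predicted
prec = precision_score(y_true, y_pred)
rec = recall_score(y_true, y_pred)
auc = roc_auc_score(y_true, y_score)   # threshold-free, uses the scores
```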
B3.2.4. Regression metrics¶
Metrics
Residual plot
Prediction error plot
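The standard regression metrics summarize the residuals in different ways. A sketch with illustrative toy values, assuming scikit-learn:

```python
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

y_true = [3.0, 5.0, 7.0, 9.0]
y_pred = [2.5, 5.0, 7.5, 9.0]

mae = mean_absolute_error(y_true, y_pred)  # average absolute residual
mse = mean_squared_error(y_true, y_pred)   # penalizes large errors more
r2 = r2_score(y_true, y_pred)              # fraction of variance explained
```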
B4. Model interpretation¶
B4.1. Coefficients¶
Generally only available for the linear model family
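For linear models, the fitted coefficients give both the magnitude and the direction of each feature's effect. A sketch on synthetic data with known effects, assuming scikit-learn:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
# True effects: +2 for feature 0, -3 for feature 1
y = 2.0 * X[:, 0] - 3.0 * X[:, 1] + rng.normal(scale=0.01, size=100)

# The coefficients recover the signed effects
model = LinearRegression().fit(X, y)
coefs = model.coef_
```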
B4.2. Feature importances¶
Several ways of calculating - widely available for tree-based methods
No direction
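For tree ensembles, impurity-based importances come for free after fitting; note that they are non-negative and carry no sign. A sketch, assuming scikit-learn:

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

X, y = load_iris(return_X_y=True)

# Impurity-based importances: non-negative, sum to 1, no direction
forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)
importances = forest.feature_importances_
```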
B4.3. Partial dependence plots¶
Adds direction to feature importances
B4.4. Surrogate models¶
Use an explainable model to simulate a black box model
B4.5. Shapley values¶
SHAP
For one instance
Over all instances
Interactions / dependencies
Summary plot
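To build intuition for what SHAP approximates, exact Shapley values can be computed by brute force for a tiny model: each feature's value is its average marginal contribution over all coalitions. A sketch with a hypothetical additive toy model (not the SHAP library's API):

```python
from itertools import combinations
from math import factorial

def shapley_values(value, n):
    """Exact Shapley values by enumerating all coalitions of n players.

    value(S) maps a frozenset of feature indices to the model output
    when only those features are 'present'.
    """
    phi = [0.0] * n
    for i in range(n):
        others = [j for j in range(n) if j != i]
        for size in range(len(others) + 1):
            for S in combinations(others, size):
                S = frozenset(S)
                # Shapley weight for a coalition of this size
                w = factorial(len(S)) * factorial(n - len(S) - 1) / factorial(n)
                phi[i] += w * (value(S | {i}) - value(S))
    return phi

# Toy additive 'model': each present feature adds a fixed effect
effects = {0: 2.0, 1: -1.0, 2: 0.5}
phi = shapley_values(lambda S: sum(effects[j] for j in S), n=3)
```

For an additive model the Shapley value of each feature is exactly its own effect; SHAP's contribution is making this tractable for real models.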
C. Pipelines¶
Using sklearn mixins to create custom ML classes
Using a pipeline for consistency and deployability
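Subclassing `BaseEstimator` and `TransformerMixin` is enough to make a custom step that composes with everything else in a pipeline. A sketch with a hypothetical outlier-clipping transformer, assuming scikit-learn:

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

class ClipOutliers(BaseEstimator, TransformerMixin):
    """Hypothetical transformer: clip each column to percentile bounds."""

    def __init__(self, lower=1.0, upper=99.0):
        self.lower = lower
        self.upper = upper

    def fit(self, X, y=None):
        # Estimate clipping bounds on (training) data only
        self.lo_ = np.percentile(X, self.lower, axis=0)
        self.hi_ = np.percentile(X, self.upper, axis=0)
        return self

    def transform(self, X):
        return np.clip(X, self.lo_, self.hi_)

# The custom step composes with built-in steps in one pipeline
pipe = Pipeline([("clip", ClipOutliers()), ("scale", StandardScaler())])
X = np.vstack([np.random.default_rng(0).normal(size=(99, 2)),
               [[100.0, 100.0]]])  # one gross outlier
Xt = pipe.fit_transform(X)
```

`TransformerMixin` supplies `fit_transform`, and `BaseEstimator` supplies `get_params`/`set_params` so the step also works inside grid search.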