Machine Learning for Data Scientists¶
I expect that all of you will have taken, or are concurrently taking, an ML class that focuses on the mathematics and algorithms. Hence this class focuses mostly on practical aspects of ML that are often glossed over in more academic courses but are useful for a practicing data scientist.
A. Understanding ML¶
A1. What is ML?¶
Algorithms that get better at performing a task by learning from data
Contrast with explicit instruction or expert-constructed rules
A2. Data¶
A2.1. Labeled and unlabeled¶
Labeled \(\to\) supervised learning
Unlabeled \(\to\) unsupervised learning (or possibly self-supervised learning)
Future reward \(\to\) reinforcement learning
A2.2. Structured and unstructured¶
Structured \(\to\) tabular
Unstructured just means non-tabular - includes free text, images, video, audio, and sequences
In the past, unstructured data was first converted to structured data by feature engineering; this has been upended by deep learning methods
A2.3. Size¶
Number of observations
Number of features (dimensionality)
B. ML stages¶
B1. Data processing¶
We typically need to process the data for it to work with a broad class of ML models. For example, many ML algorithms will have problems in fitting or interpretation when there are categorical or free text columns, missing data, large variation in measurement scales or collinear features.
Note: To avoid data leakage, any preprocessing that has a fit stage should estimate parameters on training data only.
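For example, a scaler's mean and standard deviation should be estimated on the training split only and then applied unchanged to the test split. A minimal sketch, assuming scikit-learn is available and using illustrative toy data:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Toy data; values are illustrative only
X = np.arange(20, dtype=float).reshape(10, 2)
X_train, X_test = train_test_split(X, test_size=0.3, random_state=0)

# Fit the scaler on training data only, then apply to both splits
scaler = StandardScaler().fit(X_train)
X_train_s = scaler.transform(X_train)
X_test_s = scaler.transform(X_test)
```

Calling `fit` on the full dataset instead would let test-set statistics leak into the preprocessing step.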
B1.1. Category encoding¶
Encoding without labels
Encoding with labels
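One-hot encoding is the standard label-free approach: each category becomes an indicator column (encodings that use labels, such as target encoding, instead replace a category with a statistic of the label). A sketch, assuming scikit-learn:

```python
from sklearn.preprocessing import OneHotEncoder

# Toy categorical column; values are illustrative only
X = [["red"], ["green"], ["red"], ["blue"]]

# Label-free encoding: one indicator column per observed category
enc = OneHotEncoder(handle_unknown="ignore")
onehot = enc.fit_transform(X).toarray()
```

`handle_unknown="ignore"` keeps the encoder from failing on categories first seen at prediction time.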
B1.2. Missing data imputation¶
Types of missing data
MCAR
MAR
MNAR
Simple imputation
Fancy imputation
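The simplest imputation strategies replace each missing entry with a column statistic; "fancy" methods model the missing values from the other features. A mean-imputation sketch, assuming scikit-learn:

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Toy matrix with missing entries; values are illustrative only
X = np.array([[1.0, 2.0],
              [np.nan, 4.0],
              [7.0, np.nan]])

# Replace each NaN with the mean of its column
imp = SimpleImputer(strategy="mean")
X_filled = imp.fit_transform(X)
```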
B1.3. Feature selection¶
Uninformative variables
Collinear or multi-collinear variables
Dependent features
Recursive feature elimination
LASSO
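LASSO performs feature selection as a side effect of fitting: the L1 penalty drives coefficients of uninformative features to exactly zero. A sketch on synthetic data where only the first two features carry signal (assuming scikit-learn; the alpha value is illustrative):

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
# Only features 0 and 1 influence the target
y = 3 * X[:, 0] - 2 * X[:, 1] + rng.normal(scale=0.1, size=200)

# The L1 penalty zeroes out coefficients of irrelevant features
lasso = Lasso(alpha=0.1).fit(X, y)
selected = np.flatnonzero(lasso.coef_)
```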
B1.4. Shuffling¶
Shuffling breaks up any order to the observations
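When shuffling, features and labels must be permuted together so rows stay aligned. A sketch, assuming scikit-learn:

```python
from sklearn.utils import shuffle

# Toy data; values are illustrative only
X = [[1], [2], [3], [4]]
y = [1, 2, 3, 4]

# Shuffle X and y with the same permutation
X_s, y_s = shuffle(X, y, random_state=0)
```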
B1.5. Standardization¶
Distance-based models may be sensitive to scale
Convert features to have zero mean and unit standard deviation
B2. Model training¶
B2.1. Memorization and generalization¶
The entire point of any form of learning is generalization, not memorization
Model capacity is the amount of information a model can store
If model capacity \(\gg\) data complexity, the model will perform best by just memorizing the data \(\to\) over-fitting
If model capacity \(\ll\) data complexity, the model will not be very good \(\to\) under-fitting
Bias and variance
Bias-variance trade-off
B2.1.1. Tracking training and validation measures¶
Alternatively, track how training and validation measures diverge as the effective complexity of a high-capacity model grows over the course of training
B2.1.2. Remedies for over-fitting¶
Collect more data
Synthetic data and data augmentation
Pre-training
Early stopping
L1 and L2 regularization
Model-specific parameters for controlling model complexity
Dropout
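The effect of L2 regularization can be seen directly by comparing coefficient norms: ridge regression shrinks the coefficient vector toward zero relative to ordinary least squares. A sketch, assuming scikit-learn (the alpha value is illustrative):

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 10))  # few observations, many features
y = X[:, 0] + rng.normal(scale=0.5, size=20)

ols = LinearRegression().fit(X, y)
ridge = Ridge(alpha=10.0).fit(X, y)

# The L2 penalty shrinks the fitted coefficients toward zero
norm_ols = np.linalg.norm(ols.coef_)
norm_ridge = np.linalg.norm(ridge.coef_)
```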
B2.1.3. Data leakage¶
Data leakage occurs when your model has access to information it would not have when making predictions in practice
This results in over-optimistic model evaluations
B2.1.4. In-sample and out-of-sample prediction¶
Train-test split
Importance of out-of-sample prediction
Do not train your model on data that your deployed model will not have access to!
Model fitting must be done on training data only - this includes data preprocessing estimators
Importance of using pipelines
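A pipeline bundles preprocessing and model into a single estimator, so every `fit` automatically estimates preprocessing parameters on training data only. A sketch, assuming scikit-learn:

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Scaler and classifier are fit together; no leakage from the test split
pipe = Pipeline([("scale", StandardScaler()),
                 ("clf", LogisticRegression(max_iter=1000))])
pipe.fit(X_train, y_train)
acc = pipe.score(X_test, y_test)
```

The same `pipe` object can then be serialized and deployed as one unit.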
B2.2. Imbalanced data¶
Choice of evaluation metrics (e.g. Kappa)
Weighting samples
Majority under-sampling
Minority over-sampling
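Sample weighting is often the least invasive remedy: with the `"balanced"` heuristic, each class is weighted inversely to its frequency. A sketch, assuming scikit-learn:

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Toy labels with a 9:1 imbalance; values are illustrative only
y = np.array([0] * 90 + [1] * 10)

# 'balanced' weight for class c is n_samples / (n_classes * count(c))
weights = compute_class_weight(class_weight="balanced",
                               classes=np.array([0, 1]), y=y)
```

Many estimators accept the same heuristic directly via `class_weight="balanced"`.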
B2.3. Hyper-parameter tuning¶
Role of cross-validation
Grid search
Auto-tuning
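Grid search combines the two ideas above: every candidate setting is scored by cross-validation, and the best is refit on the full training data. A sketch, assuming scikit-learn (the grid values are illustrative):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

# 5-fold cross-validation over a small grid of candidate C values
grid = GridSearchCV(SVC(), param_grid={"C": [0.1, 1, 10]}, cv=5)
grid.fit(X, y)
best_C = grid.best_params_["C"]
```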
B3. Model evaluation¶
B3.1. Unsupervised learning metrics¶
B3.1.1. Clustering metrics¶
Rand index
Within and between cluster variance
Similarity of nearby clusters
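The (adjusted) Rand index compares two clusterings by agreement on pairs of points, so it is invariant to how cluster labels are named. A sketch, assuming scikit-learn:

```python
from sklearn.metrics import adjusted_rand_score

# Two labelings of the same partition, with the labels swapped
labels_true = [0, 0, 1, 1]
labels_pred = [1, 1, 0, 0]

# Identical partitions score 1.0 regardless of label names
ari = adjusted_rand_score(labels_true, labels_pred)
```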
B3.1.2. Information criteria for probabilistic models¶
Negative log likelihood and deviance
AIC
BIC
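Both criteria trade off fit (the maximized log likelihood \(\ln \hat L\)) against model size \(k\): \(\mathrm{AIC} = 2k - 2\ln \hat L\) and \(\mathrm{BIC} = k \ln n - 2\ln \hat L\), lower being better. A sketch with hypothetical log-likelihood values:

```python
import math

def aic(log_likelihood, k):
    # AIC = 2k - 2 ln L-hat
    return 2 * k - 2 * log_likelihood

def bic(log_likelihood, k, n):
    # BIC = k ln n - 2 ln L-hat; penalizes size more heavily for large n
    return k * math.log(n) - 2 * log_likelihood

# Hypothetical fits: the richer model fits slightly better but pays for it
aic_small = aic(-120.0, k=3)
aic_big = aic(-119.0, k=10)
```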
B3.2. Supervised learning metrics¶
B3.2.1. Predictions¶
predict
predict_proba
predict_log_proba
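These three methods return, respectively, hard class labels, class probabilities, and the log of those probabilities. A sketch, assuming scikit-learn:

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)
clf = LogisticRegression(max_iter=1000).fit(X, y)

labels = clf.predict(X[:3])               # hard class labels
proba = clf.predict_proba(X[:3])          # rows sum to 1
log_proba = clf.predict_log_proba(X[:3])  # log of the same probabilities
```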
B3.2.2. The base model¶
Dummy model (sanity check)
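A dummy model that ignores the features gives the floor any real model must beat. A sketch, assuming scikit-learn:

```python
from sklearn.datasets import load_iris
from sklearn.dummy import DummyClassifier

X, y = load_iris(return_X_y=True)

# Always predicts the most frequent class; a sanity-check baseline
dummy = DummyClassifier(strategy="most_frequent").fit(X, y)
baseline = dummy.score(X, y)
```

On iris, with three equally frequent classes, the baseline accuracy is 1/3.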
B3.2.3. Classification metrics¶
Confusion matrix and binary scores
ROC curve
PR curve
Cumulative gains curve
Discrimination threshold
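The confusion matrix and its derived scores work on hard predictions, while the ROC curve works on scores ranked across thresholds. A sketch with illustrative toy labels, assuming scikit-learn:

```python
from sklearn.metrics import (confusion_matrix, precision_score,
                             recall_score, roc_auc_score)

y_true = [0, 0, 0, 1, 1, 1]
y_pred = [0, 0, 1, 1, 1, 0]           # hard predictions
y_score = [0.1, 0.2, 0.6, 0.9, 0.8, 0.4]  # predicted probabilities

cm = confusion_matrix(y_true, y_pred)  # rows: true class, cols: predicted
prec = precision_score(y_true, y_pred)
rec = recall_score(y_true, y_pred)
auc = roc_auc_score(y_true, y_score)   # threshold-free, uses the scores
```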
B3.2.4. Regression metrics¶
Metrics
Residual plot
Prediction error plot
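The standard regression metrics summarize the residuals in different ways. A sketch with illustrative toy values, assuming scikit-learn:

```python
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

y_true = [3.0, 5.0, 7.0, 9.0]
y_pred = [2.5, 5.0, 7.5, 9.0]

mae = mean_absolute_error(y_true, y_pred)  # average absolute residual
mse = mean_squared_error(y_true, y_pred)   # penalizes large errors more
r2 = r2_score(y_true, y_pred)              # fraction of variance explained
```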
B4. Model interpretation¶
B4.1. Coefficients¶
Generally only available for the linear model family
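For linear models, the fitted coefficients give both the magnitude and the direction of each feature's effect. A sketch on synthetic data with known effects, assuming scikit-learn:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
# True effects: +2 for feature 0, -3 for feature 1
y = 2.0 * X[:, 0] - 3.0 * X[:, 1] + rng.normal(scale=0.01, size=100)

# The coefficients recover the signed effects
model = LinearRegression().fit(X, y)
coefs = model.coef_
```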
B4.2. Feature importances¶
Several ways of calculating - widely available for tree-based methods
No direction
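For tree ensembles, impurity-based importances come for free after fitting; note that they are non-negative and carry no sign. A sketch, assuming scikit-learn:

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

X, y = load_iris(return_X_y=True)

# Impurity-based importances: non-negative, sum to 1, no direction
forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)
importances = forest.feature_importances_
```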
B4.3. Partial dependence plots¶
Adds direction to feature importances
B4.4. Surrogate models¶
Use an explainable model to simulate a black box model
B4.5. Shapley values¶
SHAP
For one instance
Over all instances
Interactions / dependencies
Summary plot
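To build intuition for what SHAP approximates, exact Shapley values can be computed by brute force for a tiny model: each feature's value is its average marginal contribution over all coalitions. A sketch with a hypothetical additive toy model (not the SHAP library's API):

```python
from itertools import combinations
from math import factorial

def shapley_values(value, n):
    """Exact Shapley values by enumerating all coalitions of n players.

    value(S) maps a frozenset of feature indices to the model output
    when only those features are 'present'.
    """
    phi = [0.0] * n
    for i in range(n):
        others = [j for j in range(n) if j != i]
        for size in range(len(others) + 1):
            for S in combinations(others, size):
                S = frozenset(S)
                # Shapley weight for a coalition of this size
                w = factorial(len(S)) * factorial(n - len(S) - 1) / factorial(n)
                phi[i] += w * (value(S | {i}) - value(S))
    return phi

# Toy additive 'model': each present feature adds a fixed effect
effects = {0: 2.0, 1: -1.0, 2: 0.5}
phi = shapley_values(lambda S: sum(effects[j] for j in S), n=3)
```

For an additive model the Shapley value of each feature is exactly its own effect; SHAP's contribution is making this tractable for real models.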
C. Pipelines¶
Using sklearn mixins to create custom ML classes
Using a pipeline for consistency and deployability
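Subclassing `BaseEstimator` and `TransformerMixin` is enough to make a custom step that composes with everything else in a pipeline. A sketch with a hypothetical outlier-clipping transformer, assuming scikit-learn:

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

class ClipOutliers(BaseEstimator, TransformerMixin):
    """Hypothetical transformer: clip each column to percentile bounds."""

    def __init__(self, lower=1.0, upper=99.0):
        self.lower = lower
        self.upper = upper

    def fit(self, X, y=None):
        # Estimate clipping bounds on (training) data only
        self.lo_ = np.percentile(X, self.lower, axis=0)
        self.hi_ = np.percentile(X, self.upper, axis=0)
        return self

    def transform(self, X):
        return np.clip(X, self.lo_, self.hi_)

# The custom step composes with built-in steps in one pipeline
pipe = Pipeline([("clip", ClipOutliers()), ("scale", StandardScaler())])
X = np.vstack([np.random.default_rng(0).normal(size=(99, 2)),
               [[100.0, 100.0]]])  # one gross outlier
Xt = pipe.fit_transform(X)
```

`TransformerMixin` supplies `fit_transform`, and `BaseEstimator` supplies `get_params`/`set_params` so the step also works inside grid search.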