Machine Learning for Data Scientists

I expect that all of you will have taken or are concurrently taking an ML class that focuses on the mathematics and algorithms. Hence this class focuses mostly on practical aspects of ML that are often glossed over in more academic courses but are useful to a practicing data scientist.

A. Understanding ML

A1. What is ML?

  • Algorithms that get better at performing a task by learning from data

  • Contrast with explicit instruction or expert-constructed rules

A2. Data

A2.1. Labeled and unlabeled

  • Labeled \(\to\) supervised learning

  • Unlabeled \(\to\) unsupervised learning (or possibly self-supervised learning)

  • Future reward \(\to\) reinforcement learning

A2.2. Structured and unstructured

  • Structured \(\to\) tabular

  • Unstructured just means non-tabular - it includes free text, images, video, audio, and sequences

  • In the past, unstructured data was first converted to structured data by feature engineering; this has been upended by deep learning methods

A2.3. Size

  • Number of observations

  • Number of features (dimensionality)

A3. ML model examples

A3.1 Dimension reduction

  • PCA

  • MDS

  • t-SNE

  • UMAP

  • PHATE
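
A minimal sketch of two of these methods on a toy dataset (UMAP and PHATE live in the third-party umap-learn and phate packages, so only the scikit-learn methods are shown):

```python
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

X, y = load_digits(return_X_y=True)

# Linear projection onto the two directions of maximal variance
X_pca = PCA(n_components=2).fit_transform(X)

# Nonlinear embedding that preserves local neighborhood structure
X_tsne = TSNE(n_components=2, random_state=0).fit_transform(X)

print(X_pca.shape, X_tsne.shape)  # both (1797, 2)
```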

A3.2 Clustering

  • K-means

  • Agglomerative hierarchical clustering

  • Mixture models
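
A minimal sketch of the three families above on synthetic blob data:

```python
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans, AgglomerativeClustering
from sklearn.mixture import GaussianMixture

X, _ = make_blobs(n_samples=300, centers=3, random_state=0)

km_labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)
ac_labels = AgglomerativeClustering(n_clusters=3).fit_predict(X)
# Mixture models are probabilistic: fit, then assign each point to a component
gm_labels = GaussianMixture(n_components=3, random_state=0).fit(X).predict(X)
```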

A3.3 Supervised learning

  • Nearest neighbor

  • Linear models

  • Support vector machines

  • Trees

  • Neural networks
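
A minimal sketch fitting one scikit-learn model from each family on the same synthetic data (parameters are illustrative, not tuned):

```python
from sklearn.datasets import make_classification
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.neural_network import MLPClassifier

X, y = make_classification(n_samples=500, random_state=0)

models = {
    "nearest neighbor": KNeighborsClassifier(),
    "linear model": LogisticRegression(max_iter=1000),
    "support vector machine": SVC(),
    "tree": DecisionTreeClassifier(random_state=0),
    "neural network": MLPClassifier(max_iter=1000, random_state=0),
}
for name, model in models.items():
    print(name, model.fit(X, y).score(X, y))  # in-sample accuracy only
```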

B. ML stages

B1. Data processing

We typically need to process the data for it to work with a broad class of ML models. For example, many ML algorithms will have problems in fitting or interpretation when there are categorical or free-text columns, missing data, large variation in measurement scales, or collinear features.

Note: To avoid data leakage, any preprocessing that has a fit stage should estimate parameters on training data only.

B1.1. Category encoding

  • Encoding without labels

  • Encoding with labels
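
A minimal sketch of both flavors, assuming a recent scikit-learn (TargetEncoder, the label-aware encoder shown here, arrived in version 1.3):

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder, TargetEncoder

X = pd.DataFrame({"city": ["NY", "SF", "LA"] * 10})
y = [1, 0, 1, 0, 1, 0] * 5

# Without labels: one indicator column per category
onehot = OneHotEncoder(sparse_output=False).fit_transform(X)

# With labels: encode each category by a (smoothed, cross-fitted) mean of y
target = TargetEncoder().fit_transform(X, y)
```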

B1.2. Missing data imputation

  • Types of missing data

    • MCAR

    • MAR

    • MNAR

  • Simple imputation

  • Fancy imputation
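
A minimal sketch of a simple and a fancier imputer; IterativeImputer is still experimental in scikit-learn, hence the explicit enabling import:

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

X = np.array([[1.0, 2.0], [np.nan, 4.0], [5.0, np.nan], [7.0, 8.0]])

X_mean = SimpleImputer(strategy="mean").fit_transform(X)    # fill with column mean
X_iter = IterativeImputer(random_state=0).fit_transform(X)  # model each feature from the others
```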

B1.3. Feature selection

  • Uninformative variables

  • Collinear or multi-collinear variables

  • Dependent features

  • Recursive feature elimination

  • LASSO
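
A minimal sketch of the last two approaches, with illustrative synthetic data:

```python
from sklearn.datasets import make_regression
from sklearn.feature_selection import RFE
from sklearn.linear_model import LinearRegression, LassoCV

X, y = make_regression(n_samples=200, n_features=10, n_informative=3, random_state=0)

# Recursive feature elimination: repeatedly drop the weakest feature
rfe = RFE(LinearRegression(), n_features_to_select=3).fit(X, y)
print(rfe.support_)  # boolean mask of the surviving features

# LASSO: the L1 penalty zeroes out uninformative coefficients
lasso = LassoCV(cv=5).fit(X, y)
print(lasso.coef_)   # near-zero coefficients flag candidate drops
```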

B1.4. Shuffling

  • Shuffling breaks up any order in the observations (e.g., data sorted by collection date or by class) so that splits and batches are representative
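
A minimal sketch using scikit-learn's shuffle helper:

```python
import numpy as np
from sklearn.utils import shuffle

X = np.arange(10).reshape(5, 2)
y = np.arange(5)

# Rows stay paired with their labels under a single permutation
X_shuf, y_shuf = shuffle(X, y, random_state=0)
```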

B1.5 Standardization

  • Distance-based models may be sensitive to scale

  • Convert features to have zero mean and unit standard deviation
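
A minimal sketch; note the scaler is fit on the training data only, per the leakage note above:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

X_train = np.array([[1.0, 100.0], [2.0, 200.0], [3.0, 300.0]])
X_test = np.array([[2.0, 250.0]])

scaler = StandardScaler().fit(X_train)  # mean and std estimated on train only
X_train_std = scaler.transform(X_train)
X_test_std = scaler.transform(X_test)   # test transformed with train statistics
```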

B2. Model training

B2.1. Memorization and generalization

  • The entire point of any form of learning is generalization, not memorization

  • Model capacity is the amount of information a model can store

  • If model capacity \(\gg\) data complexity, the model will perform best by just memorizing the data \(\to\) over-fitting

  • If model capacity \(\ll\) data complexity, the model will not be very good \(\to\) under-fitting

  • Bias and variance

  • Bias-variance trade-off

B2.1.1. Tracking training and validation measures

[Figure: training and validation measures tracked against model complexity]

[Figure: alternative view showing what happens as model complexity grows over time with training for a high-capacity model]
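
A minimal sketch of producing such curves with validation_curve, using tree depth as an illustrative capacity knob:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import validation_curve
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, random_state=0)
depths = np.arange(1, 15)

train_scores, val_scores = validation_curve(
    DecisionTreeClassifier(random_state=0), X, y,
    param_name="max_depth", param_range=depths, cv=5,
)
# Training score keeps climbing with depth; validation score peaks and then
# falls off as the tree starts memorizing the training set.
print(train_scores.mean(axis=1))
print(val_scores.mean(axis=1))
```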

B2.1.2. Remedies for over-fitting

  • Collect more data

  • Synthetic data and data augmentation

  • Pre-training

  • Early stopping

  • L1 and L2 regularization

  • Model-specific parameters for controlling model complexity

  • Dropout
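
A minimal sketch of two of these remedies (early stopping and L2 regularization) using SGDClassifier; parameter values are illustrative:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier

X, y = make_classification(n_samples=1000, random_state=0)

# Early stopping: hold out a validation fraction and stop when it stops improving
clf_es = SGDClassifier(early_stopping=True, validation_fraction=0.1,
                       n_iter_no_change=5, random_state=0).fit(X, y)

# L2 regularization: larger alpha shrinks weights, limiting effective capacity
clf_l2 = SGDClassifier(penalty="l2", alpha=0.01, random_state=0).fit(X, y)
```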

B2.1.3. Data leakage

  • Data leakage occurs when your model has access to information it would not have when making predictions in practice

    • This results in over-optimistic model evaluations

B2.1.4. In-sample and out-of-sample prediction

  • Train-test split

  • Importance of out-of-sample prediction

  • Do not train your model on data that your deployed model will not have access to!

    • Model fitting must be done on training data only - this includes data preprocessing estimators

    • Importance of using pipelines
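
A minimal sketch of an out-of-sample evaluation in which all fitting, the scaler's included, happens inside a pipeline on training data only:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=500, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

pipe = make_pipeline(StandardScaler(), LogisticRegression())
pipe.fit(X_train, y_train)         # scaler and model both fit on train only
print(pipe.score(X_test, y_test))  # out-of-sample accuracy
```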

B2.2 Imbalanced data

  • Choice of evaluation metrics (e.g. Kappa)

  • Weighting samples

  • Majority under-sampling

  • Minority over-sampling
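
A minimal sketch of two of the options above: a chance-corrected metric and sample weighting via class_weight (resampling helpers such as SMOTE live in the third-party imbalanced-learn package and are not shown):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import cohen_kappa_score
from sklearn.model_selection import train_test_split

# A 95/5 imbalanced binary problem
X, y = make_classification(n_samples=1000, weights=[0.95, 0.05], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=0)

# class_weight="balanced" up-weights minority-class samples in the loss
clf = LogisticRegression(class_weight="balanced").fit(X_train, y_train)
print(cohen_kappa_score(y_test, clf.predict(X_test)))  # 0 = chance, 1 = perfect
```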

B2.3. Hyper-parameter tuning

  • Role of cross-validation

  • Grid search

  • Auto-tuning
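
A minimal sketch of grid search with cross-validation; the grid values are illustrative:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = make_classification(n_samples=300, random_state=0)

# Each parameter combination is scored by 5-fold cross-validation
grid = GridSearchCV(
    SVC(),
    param_grid={"C": [0.1, 1, 10], "gamma": ["scale", 0.01, 0.1]},
    cv=5,
)
grid.fit(X, y)
print(grid.best_params_, grid.best_score_)
```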

B2.4. Ensemble models

  • Stacking

  • Bagging

  • Boosting

[Figures: ensemble classifier and regressor examples]
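
A minimal sketch of the three strategies on the same synthetic data:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import (BaggingClassifier, GradientBoostingClassifier,
                              StackingClassifier)
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, random_state=0)

# Stacking: a meta-learner combines the base models' predictions
stack = StackingClassifier(
    estimators=[("lr", LogisticRegression()), ("dt", DecisionTreeClassifier())],
    final_estimator=LogisticRegression(),
).fit(X, y)

# Bagging: many trees on bootstrap resamples, averaged
bag = BaggingClassifier(DecisionTreeClassifier(), n_estimators=50,
                        random_state=0).fit(X, y)

# Boosting: trees fit sequentially to the remaining errors
boost = GradientBoostingClassifier(random_state=0).fit(X, y)
# The regression analog is GradientBoostingRegressor
```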

B3. Model evaluation

B3.1. Unsupervised learning metrics

B3.1.1 Dimension reduction
  • Interpreting PCA

    • Explained variance

    • Scree plot

    • Loadings

    • Biplots

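A minimal sketch of these diagnostics on the iris data:

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

X, _ = load_iris(return_X_y=True)
pca = PCA().fit(X)

print(pca.explained_variance_ratio_)             # variance explained per component
print(np.cumsum(pca.explained_variance_ratio_))  # basis for a scree/elbow decision
print(pca.components_)                           # loadings: feature weights per component
```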

B3.1.2. Clustering metrics
  • Rand index

  • Within and between cluster variance

  • Similarity of nearby clusters
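
A minimal sketch of an external metric (needs true labels) and an internal one (compares within- to between-cluster distances):

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import adjusted_rand_score, silhouette_score

X, y_true = make_blobs(n_samples=300, centers=3, random_state=0)
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

print(adjusted_rand_score(y_true, labels))  # 1.0 = perfect agreement with truth
print(silhouette_score(X, labels))          # near 1 = tight, well-separated clusters
```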

B3.1.3. Information criteria for probabilistic models
  • Negative log likelihood and deviance

  • AIC

  • BIC
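
A minimal sketch of using AIC/BIC to choose the number of mixture components:

```python
from sklearn.datasets import make_blobs
from sklearn.mixture import GaussianMixture

X, _ = make_blobs(n_samples=300, centers=3, random_state=0)

for k in range(1, 6):
    gm = GaussianMixture(n_components=k, random_state=0).fit(X)
    # Both criteria penalize the negative log likelihood by model size; lower is better
    print(k, gm.aic(X), gm.bic(X))
```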

B3.2 Supervised learning metrics

B3.2.1. Predictions
  • predict, predict_proba, and predict_log_proba

B3.2.2. The base model
  • Dummy model (sanity check)
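
A minimal sketch covering the prediction interfaces above plus a dummy baseline sanity check:

```python
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf = LogisticRegression().fit(X_train, y_train)
print(clf.predict(X_test)[:3])            # hard class labels
print(clf.predict_proba(X_test)[:3])      # class probabilities
print(clf.predict_log_proba(X_test)[:3])  # log probabilities

# Any real model should beat this no-skill baseline
dummy = DummyClassifier(strategy="most_frequent").fit(X_train, y_train)
print(dummy.score(X_test, y_test), clf.score(X_test, y_test))
```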

B3.2.3 Classification metrics
  • Confusion matrix and binary scores

  • ROC curve

  • PR curve

  • Cumulative gains curve

  • Discrimination threshold
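
A minimal sketch computing the raw ingredients behind these diagnostics (the curves themselves are usually plotted, e.g. with scikit-learn's Display classes):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (confusion_matrix, precision_recall_curve,
                             roc_auc_score, roc_curve)
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf = LogisticRegression().fit(X_train, y_train)
proba = clf.predict_proba(X_test)[:, 1]

print(confusion_matrix(y_test, clf.predict(X_test)))
fpr, tpr, roc_thresh = roc_curve(y_test, proba)              # points on the ROC curve
prec, rec, pr_thresh = precision_recall_curve(y_test, proba) # points on the PR curve
print(roc_auc_score(y_test, proba))
# Sweeping the threshold arrays above is how a discrimination threshold is chosen
```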

B3.2.4. Regression metrics
  • Metrics

  • Residual plot

  • Prediction error plot
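
A minimal sketch of common metrics plus the residuals behind a residual plot:

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

X, y = make_regression(n_samples=300, noise=10.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

reg = LinearRegression().fit(X_train, y_train)
y_pred = reg.predict(X_test)

print(mean_absolute_error(y_test, y_pred),
      mean_squared_error(y_test, y_pred),
      r2_score(y_test, y_pred))

# Plot residuals against y_pred; visible structure suggests model misfit
residuals = y_test - y_pred
```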

B4. Model interpretation

B4.1. Coefficients

  • Generally only available with linear model family

B4.2. Feature importances

  • Several ways of calculating - widely available for tree-based methods

  • No direction

B4.3. Partial dependence plots

  • Adds direction to feature importances
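
A minimal sketch of permutation importances (magnitudes only) and a partial dependence display, which restores direction for chosen features:

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import PartialDependenceDisplay, permutation_importance

X, y = make_regression(n_samples=300, n_features=5, random_state=0)
model = RandomForestRegressor(random_state=0).fit(X, y)

# How much does shuffling each feature degrade the score?
imp = permutation_importance(model, X, y, n_repeats=10, random_state=0)
print(imp.importances_mean)  # importance magnitudes, no sign

# Average predicted response as each selected feature varies
PartialDependenceDisplay.from_estimator(model, X, features=[0, 1])
```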

B4.4. Surrogate models

  • Use an explainable model to simulate a black box model

B4.5. Shapley values

  • SHAP

  • For one instance

  • Over all instances

  • Interactions / dependencies

  • Summary plot
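
A minimal sketch using the third-party shap package; the calls follow its documented Explainer API, though exact plot names can vary across versions:

```python
import shap
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

X, y = make_regression(n_samples=200, n_features=5, random_state=0)
model = RandomForestRegressor(random_state=0).fit(X, y)

explainer = shap.Explainer(model)  # dispatches to TreeExplainer for tree models
sv = explainer(X)                  # one additive contribution per feature per row

shap.plots.waterfall(sv[0])   # explanation for one instance
shap.plots.beeswarm(sv)       # summary plot over all instances
shap.plots.scatter(sv[:, 0])  # dependence of one feature's contribution on its value
```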

C. Pipelines

  • Using sklearn mixins to create custom ML classes

  • Using a pipeline for consistency and deployability
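
A minimal sketch of a custom transformer built from the scikit-learn mixins, dropped into a pipeline so preprocessing and model travel (and deploy) as one object; ClipOutliers is a hypothetical example, not a scikit-learn class:

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

class ClipOutliers(BaseEstimator, TransformerMixin):
    """Hypothetical transformer: clip each feature to percentile bounds."""

    def __init__(self, low=1, high=99):
        self.low = low
        self.high = high

    def fit(self, X, y=None):
        # All estimation happens here, on training data only
        self.bounds_ = np.percentile(X, [self.low, self.high], axis=0)
        return self

    def transform(self, X):
        return np.clip(X, self.bounds_[0], self.bounds_[1])

X, y = make_classification(n_samples=500, random_state=0)
pipe = Pipeline([("clip", ClipOutliers()), ("model", LogisticRegression())])
pipe.fit(X, y)  # one fit call runs every stage in order
```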