In [1]:
import warnings
warnings.simplefilter('ignore', FutureWarning)
warnings.simplefilter('ignore', UserWarning)

# Imbalanced data

Imbalanced data occurs in classification when the number of instances in each class are not the same. Some care is required to learn to predict the *rare* classes effectively. 

There is no one-size-fits-all approach to handling imbalanced data. A reasonable strategy is to consider this as a model selection problem, and use cross-validation to find an approach that works well for your data sets. We will show how to do this in the hyper-parameter optimization notebook. 

**Warning**: Like most things in ML, techniques should not be applied blindly, but considered carefully with the problem goal in mind. In many cases, there is a decision-theoretic problem of assigning the appropriate costs to minority and majority case mistakes that requires domain knowledge to model correctly. As you will see in this example, blind application of a technique does not necessarily improve performance.

## Simulate an imbalanced data set

In [2]:
import pandas as pd
import numpy as np

In [3]:
X_train = pd.read_csv('data/X_train.csv')
X_test = pd.read_csv('data/X_test.csv')
y_train = pd.read_csv('data/y_train.csv')
y_test = pd.read_csv('data/y_test.csv')
X = pd.concat([X_train, X_test])
y = pd.concat([y_train, y_test]).squeeze()

In [4]:
y.value_counts()

0    809
1    500
Name: survived, dtype: int64

In [5]:
np.random.seed(0)

In [6]:
idx = (
    (y == 0) | 
    ((y == 1) & (np.random.uniform(0, 1, y.shape) < 0.2))
).squeeze()

In [7]:
X_im, y_im = X.loc[idx, :], y[idx]

In [8]:
from sklearn.model_selection import train_test_split

In [9]:
X_train, X_test, y_train, y_test = train_test_split(X_im, y_im, random_state=0)

In [10]:
y_test.value_counts(), y_train.value_counts()

(0    204
 1     25
 Name: survived, dtype: int64,
 0    605
 1     80
 Name: survived, dtype: int64)

## Collect more data

This is the best but often impractical solution. Synthetic data generation may also be an option.

## Use evaluation metrics that are less sensitive to imbalance

For example, the `F1` score (harmonic mean of precision and recall) is less sensitive than the accuracy score.

In [11]:
from sklearn.linear_model import LogisticRegression
from sklearn.utils import class_weight
from sklearn.metrics import roc_auc_score, confusion_matrix

In [12]:
from sklearn.dummy import DummyClassifier

In [13]:
clf = DummyClassifier(strategy='prior')

In [14]:
clf.fit(X_train, y_train)

DummyClassifier(strategy='prior')

In [15]:
clf.score(X_test, y_test)

0.8908296943231441

In [16]:
from sklearn.metrics import accuracy_score, f1_score, balanced_accuracy_score

In [17]:
accuracy_score(clf.predict(X_test), y_test)

0.8908296943231441

In [18]:
f1_score(clf.predict(X_test), y_test)

0.0

In [19]:
lr = LogisticRegression()

In [20]:
lr.fit(X_train, y_train)

LogisticRegression()

In [21]:
accuracy_score(lr.predict(X_test), y_test)

0.9213973799126638

In [22]:
balanced_accuracy_score(lr.predict(X_test), y_test)

0.8723936613844872

In [23]:
f1_score(lr.predict(X_test), y_test)

0.5

## Over-sample the minority class

There are many ways to over-sample the minority class. A popular algorithm is known as SMOTE (Synthetic Minority Oversampling Technique) 

![img](https://ars.els-cdn.com/content/image/1-s2.0-S0020025517310083-gr3.jpg)

In [24]:
! python3 -m pip install --quiet imbalanced-learn

In [25]:
import imblearn

In [26]:
X_train_resampled, y_train_resampled = \
imblearn.over_sampling.SMOTE().fit_resample(X_train, y_train)

In [27]:
X_train.shape

(685, 11)

In [28]:
X_train_resampled.shape

(1210, 11)

In [29]:
y_train.value_counts()

0    605
1     80
Name: survived, dtype: int64

### Evaluate if this helps

In [30]:
lr = LogisticRegression()

In [31]:
lr.fit(X_train, y_train)

LogisticRegression()

In [32]:
f1_score(lr.predict(X_test), y_test)

0.5

In [33]:
confusion_matrix(lr.predict(X_test), y_test)

array([[202,  16],
       [  2,   9]])

In [34]:
lr.fit(X_train_resampled, y_train_resampled)

LogisticRegression()

In [35]:
f1_score(lr.predict(X_test), y_test)

0.4109589041095891

In [36]:
confusion_matrix(lr.predict(X_test), y_test)

array([[171,  10],
       [ 33,  15]])

## Under-sample the majority class

Tomek pairs are nearest neighbor pairs of instances where the classes are different. Under-sampling is done by removing the majority member of the pair. 

![img](https://miro.medium.com/max/2788/1*pR35KsLpz7-_zvbvdm0frg.png)

In [37]:
X_train_resampled, y_train_resampled = \
imblearn.under_sampling.TomekLinks().fit_resample(X_train, y_train)

In [38]:
X_train.shape

(685, 11)

In [39]:
X_train_resampled.shape

(665, 11)

In [40]:
y_train.value_counts()

0    605
1     80
Name: survived, dtype: int64

In [41]:
y_train_resampled.value_counts()

0    585
1     80
Name: survived, dtype: int64

### Evaluate if this helps

In [42]:
lr = LogisticRegression()

In [43]:
lr.fit(X_train, y_train)

LogisticRegression()

In [44]:
f1_score(lr.predict(X_test), y_test)

0.5

In [45]:
confusion_matrix(lr.predict(X_test), y_test)

array([[202,  16],
       [  2,   9]])

In [46]:
lr.fit(X_train_resampled, y_train_resampled)

LogisticRegression()

In [47]:
f1_score(lr.predict(X_test), y_test)

0.5

In [48]:
confusion_matrix(lr.predict(X_test), y_test)

array([[202,  16],
       [  2,   9]])

## Combine over- and under-sampling

For example, over-sample using SMOTE then clean using Tomek.

In [49]:
X_train_resampled, y_train_resampled = \
imblearn.combine.SMOTETomek().fit_resample(X_train, y_train)

In [50]:
X_train.shape

(685, 11)

In [51]:
X_train_resampled.shape

(1174, 11)

In [52]:
y_train.value_counts()

0    605
1     80
Name: survived, dtype: int64

In [53]:
y_train_resampled.value_counts()

1    587
0    587
Name: survived, dtype: int64

### Evaluate if this helps

In [54]:
lr = LogisticRegression()

In [55]:
lr.fit(X_train, y_train)

LogisticRegression()

In [56]:
f1_score(lr.predict(X_test), y_test)

0.5

In [57]:
confusion_matrix(lr.predict(X_test), y_test)

array([[202,  16],
       [  2,   9]])

In [58]:
lr.fit(X_train_resampled, y_train_resampled)

LogisticRegression()

In [59]:
f1_score(lr.predict(X_test), y_test)

0.40540540540540543

In [60]:
confusion_matrix(lr.predict(X_test), y_test)

array([[170,  10],
       [ 34,  15]])

## Use class weights to adjust the loss function

We make prediction errors in the minority class more costly than prediction errors in the majority class.

In [61]:
wts = class_weight.compute_class_weight('balanced', np.unique(y_train), y_train)

In [62]:
wts

array([0.5661157, 4.28125  ])

You can then pass in the class weights. Note that there are several alternative ways to calculate possible class weights to use, and you can also do a GridSearch on weights.

This is actually built-in to most classifiers. The defaults are equal weights to each class.

In [63]:
lr = LogisticRegression(class_weight=wts)

In [64]:
lr.fit(X_train, y_train)

LogisticRegression(class_weight=array([0.5661157, 4.28125  ]))

In [65]:
lr.class_weight

array([0.5661157, 4.28125  ])

In [66]:
f1_score(lr.predict(X_test), y_test)

0.5

In [67]:
roc_auc_score(lr.predict(X_test), y_test)

0.8723936613844872

In [68]:
confusion_matrix(lr.predict(X_test), y_test)

array([[202,  16],
       [  2,   9]])

In [69]:
lr_balanced = LogisticRegression(class_weight='balabced')

In [70]:
lr_balanced.class_weight

'balabced'

In [71]:
lr_balanced.fit(X_train, y_train)

LogisticRegression(class_weight='balabced')

In [72]:
roc_auc_score(lr_balanced.predict(X_test), y_test)

0.8723936613844872

In [73]:
confusion_matrix(lr_balanced.predict(X_test), y_test)

array([[202,  16],
       [  2,   9]])

In [74]:
f1_score(lr_balanced.predict(X_test), y_test)

0.5

## Use a classifier that is less sensitive to imbalance

Boosted trees are generally good because of their sequential nature.

In [75]:
from catboost import CatBoostClassifier

In [76]:
cb = CatBoostClassifier()

In [77]:
cb.fit(X_train, y_train, verbose=0)

<catboost.core.CatBoostClassifier at 0x135202880>

In [78]:
f1_score(cb.predict(X_test), y_test)

0.5263157894736842

In [79]:
confusion_matrix(cb.predict(X_test), y_test)

array([[201,  15],
       [  3,  10]])

### Imbalanced learn has classifiers that balance the data automatically

In [80]:
from imblearn.ensemble import BalancedRandomForestClassifier

In [81]:
brf = BalancedRandomForestClassifier()

In [82]:
brf.fit(X_train, y_train)

BalancedRandomForestClassifier()

In [83]:
confusion_matrix(brf.predict(X_test), y_test)

array([[154,   8],
       [ 50,  17]])

In [84]:
f1_score(brf.predict(X_test), y_test)

0.3695652173913044