Processing data

Many ML algorithms need data that is numeric, complete (no missing values), and standardized. Ensembles of trees are the most accommodating and require the least preprocessing.

[1]:
url = (
    'http://biostat.mc.vanderbilt.edu/'
    'wiki/pub/Main/DataSets/titanic3.xls'
)
[2]:
import pandas as pd
[3]:
df = pd.read_excel(url)
df_orig = df.copy()

Basic inspection

[4]:
df.sample(5)
[4]:
pclass survived name sex age sibsp parch ticket fare cabin embarked boat body home.dest
1117 3 0 Pekoniemi, Mr. Edvard male 21.0 0 0 STON/O 2. 3101294 7.9250 NaN S NaN NaN NaN
412 2 0 Fynney, Mr. Joseph J male 35.0 0 0 239865 26.0000 NaN S NaN 322.0 Liverpool / Montreal, PQ
64 1 1 Chambers, Mr. Norman Campbell male 27.0 1 0 113806 53.1000 E8 S 5 NaN New York, NY / Ithaca, NY
650 3 0 Attalah, Miss. Malake female 17.0 0 0 2627 14.4583 NaN C NaN NaN NaN
912 3 0 Karaic, Mr. Milan male 30.0 0 0 349246 7.8958 NaN S NaN NaN NaN
[5]:
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1309 entries, 0 to 1308
Data columns (total 14 columns):
 #   Column     Non-Null Count  Dtype
---  ------     --------------  -----
 0   pclass     1309 non-null   int64
 1   survived   1309 non-null   int64
 2   name       1309 non-null   object
 3   sex        1309 non-null   object
 4   age        1046 non-null   float64
 5   sibsp      1309 non-null   int64
 6   parch      1309 non-null   int64
 7   ticket     1309 non-null   object
 8   fare       1308 non-null   float64
 9   cabin      295 non-null    object
 10  embarked   1307 non-null   object
 11  boat       486 non-null    object
 12  body       121 non-null    float64
 13  home.dest  745 non-null    object
dtypes: float64(3), int64(4), object(7)
memory usage: 143.3+ KB

Detailed inspection

[6]:
import pandas_profiling as pp
[7]:
pp.ProfileReport(df)
[7]:
(interactive HTML profile report rendered in the notebook; not reproduced here)
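
The report can also be written to a standalone HTML file for sharing; a minimal sketch (the output filename is just an example):

profile = pp.ProfileReport(df)
profile.to_file('titanic_profile.html')  # writes a self-contained HTML report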

Create new features

Sometimes we want to create new features from existing columns. For example, the name column can be mined to extract titles. We illustrate how to do this, but will not use the result here.

[8]:
# extract the last capitalized word followed by a period (e.g. Mr, Mrs, Dr)
df['title'] = df.name.str.extract(r'.*([A-Z][a-z]+)\..*', expand=False)
df.title.value_counts()
[8]:
Mr          757
Miss        260
Mrs         197
Master       61
Dr            8
Rev           8
Col           4
Major         2
Mlle          2
Ms            2
Don           1
Lady          1
Mme           1
Jonkheer      1
Sir           1
Capt          1
Dona          1
Countess      1
Name: title, dtype: int64
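
If we did want to keep this feature, a common follow-up is to collapse the rare titles into a small set of groups; a minimal sketch (the choice of groups below is our own, and the result is not used later):

# hypothetical grouping: keep the four common titles, lump the rest as 'Other'
common = ['Mr', 'Miss', 'Mrs', 'Master']
df.title.where(df.title.isin(common), 'Other').value_counts()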

Drop features

These features are either uninformative or leak information about the outcome.

[9]:
target = df.survived
df = df.drop(columns=[
    'survived', 'name', 'ticket', 'cabin',
    'boat', 'body', 'home.dest', 'title'])

Inspect for missing data

[10]:
import missingno as mn
[11]:
mn.matrix(df);
../_images/notebooks_B05_Data_Preporocessing_16_0.png
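
missingno also offers other views of missingness; for example, a bar chart of non-null counts and a nullity correlation heatmap:

mn.bar(df);      # non-null count per column
mn.heatmap(df);  # correlation of missingness between columns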

Fill in missing values for categorical values

[12]:
df.select_dtypes('object').isnull().sum()
[12]:
sex         0
embarked    2
dtype: int64
[13]:
# treat missing embarkation port as its own (empty string) category
df['embarked'] = df['embarked'].fillna('')

Tangent: catboost is nice

Minimal processing or tuning is required to use catboost, making it a nice “default” algorithm.

[14]:
! python3 -m pip install --quiet catboost
[15]:
import catboost
[16]:
from sklearn.model_selection import train_test_split
[17]:
X_train, X_test, y_train, y_test = train_test_split(df, target, random_state=0)
[18]:
cb = catboost.CatBoostClassifier()
[19]:
cb.fit(X_train, y_train,
       cat_features=['sex', 'embarked'],
       verbose=0);
[20]:
cb.score(X_test, y_test)
[20]:
0.8170731707317073
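
CatBoost also reports feature importances out of the box, which give a quick sense of what the model relies on (output not shown):

pd.Series(cb.get_feature_importance(), index=X_train.columns) \
    .sort_values(ascending=False)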

Category encoding

Vanilla encoding

For variables with only a few distinct values, one-hot encoding (dummy variables) is often used. For variables with many distinct values, we can use hash encoding, which is essentially the same idea but bins values using a hash function (see the sketch after the one-hot example below).

We may choose to drop one of the created columns to avoid multicollinearity.

[21]:
pd.get_dummies(df, drop_first=True).head()
[21]:
pclass age sibsp parch fare sex_male embarked_C embarked_Q embarked_S
0 1 29.0000 0 0 211.3375 0 0 0 1
1 1 0.9167 1 2 151.5500 1 0 0 1
2 1 2.0000 1 2 151.5500 0 0 0 1
3 1 30.0000 1 2 151.5500 1 0 0 1
4 1 25.0000 1 2 151.5500 0 0 0 1
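
The hash encoding mentioned above is available in the category_encoders package (imported again below for target encoding); a minimal sketch, assuming 8 hash components:

import category_encoders as ce

# hash the two categorical columns into 8 shared component columns
he = ce.HashingEncoder(cols=['sex', 'embarked'], n_components=8)
he.fit_transform(df).head()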

Target encoding

We can use the target to find a more informative encoding. Note that these methods leak information and are prone to over-fitting.

[22]:
import warnings
warnings.simplefilter('ignore', FutureWarning)
[23]:
import category_encoders as ce
[24]:
te = ce.TargetEncoder()
[25]:
te.fit_transform(df.select_dtypes('number'), target).head()
[25]:
pclass age sibsp parch fare
0 1 29.0000 0 0 211.3375
1 1 0.9167 1 2 151.5500
2 1 2.0000 1 2 151.5500
3 1 30.0000 1 2 151.5500
4 1 25.0000 1 2 151.5500
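
Applied only to the numeric columns as above, the encoder returns them unchanged; it is the categorical columns that actually get replaced by (smoothed) target means. A minimal illustration on the full data (which, as noted, leaks the target):

# for illustration only: encode sex and embarked by target means
te_cat = ce.TargetEncoder(cols=['sex', 'embarked'])
te_cat.fit_transform(df, target)[['sex', 'embarked']].head()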

Split data into train and test data sets

Before we go further, we split into test and train data sets to avoid data leakage.

[26]:
X_train, X_test, y_train, y_test = train_test_split(df, target)

Category encoding

We will be conservative and avoid the risk of leakage.

Note that we don't bother to drop a column: multicollinearity is only a problem when fitting linear models without regularization, which is rarely done in ML (cf. statistics).

[27]:
ohe = ce.OneHotEncoder(cols=['sex', 'embarked'], use_cat_names=True)
[28]:
X_train = ohe.fit_transform(X_train)
X_test = ohe.transform(X_test)
[29]:
X_train.head()
[29]:
pclass sex_male sex_female age sibsp parch fare embarked_S embarked_C embarked_Q embarked_
414 2 1 0 34.0 1 0 21.0000 1 0 0 0
287 1 1 0 61.0 0 0 32.3208 1 0 0 0
1279 3 0 1 14.0 0 0 7.8542 1 0 0 0
711 3 1 0 28.0 0 0 7.2500 1 0 0 0
358 2 0 1 42.0 0 0 13.0000 1 0 0 0

Impute missing numeric values

Vanilla imputation

A simple strategy is to fill missing values with the column mean or median.

[30]:
from sklearn.impute import SimpleImputer
[31]:
si = SimpleImputer(strategy='mean')
[32]:
X_train.select_dtypes('number').head(3)
[32]:
pclass sex_male sex_female age sibsp parch fare embarked_S embarked_C embarked_Q embarked_
414 2 1 0 34.0 1 0 21.0000 1 0 0 0
287 1 1 0 61.0 0 0 32.3208 1 0 0 0
1279 3 0 1 14.0 0 0 7.8542 1 0 0 0

We illustrate the code here, but will use fancier imputation instead.

X_train[X_train.select_dtypes('number').columns] = \
    si.fit_transform(X_train.select_dtypes('number'))
X_test[X_test.select_dtypes('number').columns] = \
    si.transform(X_test.select_dtypes('number'))

Fancy imputation

This does essentially the same thing as MICE (multivariate imputation by chained equations) in R.

[33]:
from sklearn.experimental import enable_iterative_imputer
from sklearn.impute import IterativeImputer
[34]:
ii = IterativeImputer(random_state=0)
[35]:
X_train[X_train.select_dtypes('number').columns] = \
    ii.fit_transform(X_train.select_dtypes('number'))
[36]:
X_test[X_test.select_dtypes('number').columns] = \
    ii.transform(X_test.select_dtypes('number'))
[37]:
X_train.isnull().sum().sum(), X_test.isnull().sum().sum()
[37]:
(0, 0)

Simple example to illustrate differences

[38]:
import numpy as np
[39]:
x = np.array([
    [10, 10],
    [1, 1],
    [2,2],
    [10, 10],
    [10, np.nan],
    [np.nan, 10],
    [np.nan, np.nan]
])
[40]:
si.fit_transform(x)
[40]:
array([[10. , 10. ],
       [ 1. ,  1. ],
       [ 2. ,  2. ],
       [10. , 10. ],
       [10. ,  6.6],
       [ 6.6, 10. ],
       [ 6.6,  6.6]])
[41]:
ii.fit_transform(x)
[41]:
array([[10.        , 10.        ],
       [ 1.        ,  1.        ],
       [ 2.        ,  2.        ],
       [10.        , 10.        ],
       [10.        ,  9.99968523],
       [10.00094638, 10.        ],
       [ 7.1668244 ,  7.1666142 ]])
[42]:
X_train.to_csv('data/X_train_unscaled.csv', index=False)
X_test.to_csv('data/X_test_unscaled.csv', index=False)
y_train.to_csv('data/y_train_unscaled.csv', index=False)
y_test.to_csv('data/y_test_unscaled.csv', index=False)

Standardize data

[43]:
from sklearn.preprocessing import StandardScaler
[44]:
scaler = StandardScaler()
[45]:
X_train.iloc[:, :] = scaler.fit_transform(X_train)
X_test.iloc[:, :] = scaler.transform(X_test)
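
As a quick sanity check, the scaled training columns should now have roughly zero mean and unit standard deviation:

X_train.mean().round(2), X_train.std().round(2)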

Save processed data for future use

[46]:
X_train.to_csv('data/X_train.csv', index=False)
X_test.to_csv('data/X_test.csv', index=False)
y_train.to_csv('data/y_train.csv', index=False)
y_test.to_csv('data/y_test.csv', index=False)