{ "cells": [ { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "import warnings\n", "warnings.simplefilter('ignore', FutureWarning)\n", "warnings.simplefilter('ignore', UserWarning)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Imbalanced data" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Imbalanced data occurs in classification when some classes have far fewer instances than others. Some care is required to learn to predict the *rare* classes effectively.\n", "\n", "There is no one-size-fits-all approach to handling imbalanced data. A reasonable strategy is to treat this as a model selection problem and use cross-validation to find an approach that works well for your data set. We will show how to do this in the hyper-parameter optimization notebook.\n", "\n", "**Warning**: Like most things in ML, these techniques should not be applied blindly; choose them with the problem goal in mind. In many cases, there is a decision-theoretic problem of assigning appropriate costs to minority-class and majority-class mistakes, which requires domain knowledge to model correctly. As you will see in this example, blind application of a technique does not necessarily improve performance." 
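, "\n", "\n", "As a rough sketch of that decision-theoretic view (the costs $c_{FN}$ and $c_{FP}$ are hypothetical and are not part of this example): suppose missing a minority case costs $c_{FN}$, a false alarm costs $c_{FP}$, and the model outputs a calibrated probability $p$ that an instance belongs to the minority class. Predicting the minority class then has the lower expected cost when\n", "\n", "$$p \\cdot c_{FN} > (1 - p) \\cdot c_{FP} \\iff p > \\frac{c_{FP}}{c_{FP} + c_{FN}}$$\n", "\n", "For instance, if a missed minority case is taken to be nine times as costly as a false alarm, the decision threshold drops from 0.5 to $1 / (1 + 9) = 0.1$." 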
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Simulate an imbalanced data set" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "import pandas as pd\n", "import numpy as np" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [], "source": [ "X_train = pd.read_csv('data/X_train.csv')\n", "X_test = pd.read_csv('data/X_test.csv')\n", "y_train = pd.read_csv('data/y_train.csv')\n", "y_test = pd.read_csv('data/y_test.csv')\n", "X = pd.concat([X_train, X_test])\n", "y = pd.concat([y_train, y_test]).squeeze()" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0 809\n", "1 500\n", "Name: survived, dtype: int64" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "y.value_counts()" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [], "source": [ "np.random.seed(0)" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [], "source": [ "idx = (\n", " (y == 0) | \n", " ((y == 1) & (np.random.uniform(0, 1, y.shape) < 0.2))\n", ").squeeze()" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [], "source": [ "X_im, y_im = X.loc[idx, :], y[idx]" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [], "source": [ "from sklearn.model_selection import train_test_split" ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [], "source": [ "X_train, X_test, y_train, y_test = train_test_split(X_im, y_im, random_state=0)" ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(0 204\n", " 1 25\n", " Name: survived, dtype: int64,\n", " 0 605\n", " 1 80\n", " Name: survived, dtype: int64)" ] }, "execution_count": 10, "metadata": {}, "output_type": "execute_result" } ], "source": [ "y_test.value_counts(), y_train.value_counts()" ] }, 
{ "cell_type": "markdown", "metadata": {}, "source": [ "## Collect more data\n", "\n", "This is the best solution, but it is often impractical. Synthetic data generation may also be an option." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Use evaluation metrics that are less sensitive to imbalance\n", "\n", "For example, the `F1` score (the harmonic mean of precision and recall) is less sensitive than the accuracy score. It ignores true negatives, so a classifier that always predicts the majority class gets an `F1` of 0 despite its high accuracy." ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [], "source": [ "from sklearn.linear_model import LogisticRegression\n", "from sklearn.utils import class_weight\n", "from sklearn.metrics import roc_auc_score, confusion_matrix" ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [], "source": [ "from sklearn.dummy import DummyClassifier" ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [], "source": [ "# Baseline that ignores the features and always predicts the most frequent class\n", "clf = DummyClassifier(strategy='prior')" ] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "DummyClassifier(strategy='prior')" ] }, "execution_count": 14, "metadata": {}, "output_type": "execute_result" } ], "source": [ "clf.fit(X_train, y_train)" ] }, { "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0.8908296943231441" ] }, "execution_count": 15, "metadata": {}, "output_type": "execute_result" } ], "source": [ "clf.score(X_test, y_test)" ] }, { "cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [], "source": [ "from sklearn.metrics import accuracy_score, f1_score, balanced_accuracy_score" ] }, { "cell_type": "code", "execution_count": 17, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0.8908296943231441" ] }, "execution_count": 17, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# scikit-learn metrics take (y_true, y_pred) in that order\n", "accuracy_score(y_test, clf.predict(X_test))" ] }, { "cell_type": "code", "execution_count": 18, "metadata": {}, "outputs": 
[ { "data": { "text/plain": [ "0.0" ] }, "execution_count": 18, "metadata": {}, "output_type": "execute_result" } ], "source": [ "f1_score(y_test, clf.predict(X_test))" ] }, { "cell_type": "code", "execution_count": 19, "metadata": {}, "outputs": [], "source": [ "lr = LogisticRegression()" ] }, { "cell_type": "code", "execution_count": 20, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "LogisticRegression()" ] }, "execution_count": 20, "metadata": {}, "output_type": "execute_result" } ], "source": [ "lr.fit(X_train, y_train)" ] }, { "cell_type": "code", "execution_count": 21, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0.9213973799126638" ] }, "execution_count": 21, "metadata": {}, "output_type": "execute_result" } ], "source": [ "accuracy_score(y_test, lr.predict(X_test))" ] }, { "cell_type": "code", "execution_count": 22, "metadata": {}, "outputs": [], "source": [ "# Balanced accuracy is the mean of the per-class recalls, so the true labels must come first\n", "balanced_accuracy_score(y_test, lr.predict(X_test))" ] }, { "cell_type": "code", "execution_count": 23, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0.5" ] }, "execution_count": 23, "metadata": {}, "output_type": "execute_result" } ], "source": [ "f1_score(y_test, lr.predict(X_test))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Over-sample the minority class\n", "\n", "There are many ways to over-sample the minority class. A popular algorithm is SMOTE (Synthetic Minority Oversampling Technique), which creates synthetic minority examples by interpolating between a minority instance and one of its nearest minority-class neighbors.\n", "\n", "![img](https://ars.els-cdn.com/content/image/1-s2.0-S0020025517310083-gr3.jpg)" ] }, { "cell_type": "code", "execution_count": 24, "metadata": {}, "outputs": [], "source": [ "! 
python3 -m pip install --quiet imbalanced-learn" ] }, { "cell_type": "code", "execution_count": 25, "metadata": {}, "outputs": [], "source": [ "import imblearn" ] }, { "cell_type": "code", "execution_count": 26, "metadata": {}, "outputs": [], "source": [ "X_train_resampled, y_train_resampled = \\\n", "imblearn.over_sampling.SMOTE().fit_resample(X_train, y_train)" ] }, { "cell_type": "code", "execution_count": 27, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(685, 11)" ] }, "execution_count": 27, "metadata": {}, "output_type": "execute_result" } ], "source": [ "X_train.shape" ] }, { "cell_type": "code", "execution_count": 28, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(1210, 11)" ] }, "execution_count": 28, "metadata": {}, "output_type": "execute_result" } ], "source": [ "X_train_resampled.shape" ] }, { "cell_type": "code", "execution_count": 29, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0 605\n", "1 80\n", "Name: survived, dtype: int64" ] }, "execution_count": 29, "metadata": {}, "output_type": "execute_result" } ], "source": [ "y_train.value_counts()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Evaluate if this helps" ] }, { "cell_type": "code", "execution_count": 30, "metadata": {}, "outputs": [], "source": [ "lr = LogisticRegression()" ] }, { "cell_type": "code", "execution_count": 31, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "LogisticRegression()" ] }, "execution_count": 31, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Baseline: fit on the original (imbalanced) training data\n", "lr.fit(X_train, y_train)" ] }, { "cell_type": "code", "execution_count": 32, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0.5" ] }, "execution_count": 32, "metadata": {}, "output_type": "execute_result" } ], "source": [ "f1_score(y_test, lr.predict(X_test))" ] }, { "cell_type": "code", "execution_count": 33, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array([[202,   2],\n", "       [ 16,   9]])" ] }, "execution_count": 33, 
"metadata": {}, "output_type": "execute_result" } ], "source": [ "# Rows are true classes, columns are predicted classes\n", "confusion_matrix(y_test, lr.predict(X_test))" ] }, { "cell_type": "code", "execution_count": 34, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "LogisticRegression()" ] }, "execution_count": 34, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Refit on the SMOTE-resampled training data\n", "lr.fit(X_train_resampled, y_train_resampled)" ] }, { "cell_type": "code", "execution_count": 35, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0.4109589041095891" ] }, "execution_count": 35, "metadata": {}, "output_type": "execute_result" } ], "source": [ "f1_score(y_test, lr.predict(X_test))" ] }, { "cell_type": "code", "execution_count": 36, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array([[171,  33],\n", "       [ 10,  15]])" ] }, "execution_count": 36, "metadata": {}, "output_type": "execute_result" } ], "source": [ "confusion_matrix(y_test, lr.predict(X_test))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Under-sample the majority class\n", "\n", "A Tomek link is a pair of instances from different classes that are each other's nearest neighbors. Under-sampling is done by removing the majority-class member of each pair. 
\n", "\n", "![img](https://miro.medium.com/max/2788/1*pR35KsLpz7-_zvbvdm0frg.png)" ] }, { "cell_type": "code", "execution_count": 37, "metadata": {}, "outputs": [], "source": [ "X_train_resampled, y_train_resampled = \\\n", "imblearn.under_sampling.TomekLinks().fit_resample(X_train, y_train)" ] }, { "cell_type": "code", "execution_count": 38, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(685, 11)" ] }, "execution_count": 38, "metadata": {}, "output_type": "execute_result" } ], "source": [ "X_train.shape" ] }, { "cell_type": "code", "execution_count": 39, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(665, 11)" ] }, "execution_count": 39, "metadata": {}, "output_type": "execute_result" } ], "source": [ "X_train_resampled.shape" ] }, { "cell_type": "code", "execution_count": 40, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0 605\n", "1 80\n", "Name: survived, dtype: int64" ] }, "execution_count": 40, "metadata": {}, "output_type": "execute_result" } ], "source": [ "y_train.value_counts()" ] }, { "cell_type": "code", "execution_count": 41, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0 585\n", "1 80\n", "Name: survived, dtype: int64" ] }, "execution_count": 41, "metadata": {}, "output_type": "execute_result" } ], "source": [ "y_train_resampled.value_counts()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Evaluate if this helps" ] }, { "cell_type": "code", "execution_count": 42, "metadata": {}, "outputs": [], "source": [ "lr = LogisticRegression()" ] }, { "cell_type": "code", "execution_count": 43, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "LogisticRegression()" ] }, "execution_count": 43, "metadata": {}, "output_type": "execute_result" } ], "source": [ "lr.fit(X_train, y_train)" ] }, { "cell_type": "code", "execution_count": 44, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0.5" ] }, "execution_count": 44, "metadata": {}, "output_type": "execute_result" } ], "source": [ 
"f1_score(lr.predict(X_test), y_test)" ] }, { "cell_type": "code", "execution_count": 45, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array([[202, 16],\n", " [ 2, 9]])" ] }, "execution_count": 45, "metadata": {}, "output_type": "execute_result" } ], "source": [ "confusion_matrix(lr.predict(X_test), y_test)" ] }, { "cell_type": "code", "execution_count": 46, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "LogisticRegression()" ] }, "execution_count": 46, "metadata": {}, "output_type": "execute_result" } ], "source": [ "lr.fit(X_train_resampled, y_train_resampled)" ] }, { "cell_type": "code", "execution_count": 47, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0.5" ] }, "execution_count": 47, "metadata": {}, "output_type": "execute_result" } ], "source": [ "f1_score(lr.predict(X_test), y_test)" ] }, { "cell_type": "code", "execution_count": 48, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array([[202, 16],\n", " [ 2, 9]])" ] }, "execution_count": 48, "metadata": {}, "output_type": "execute_result" } ], "source": [ "confusion_matrix(lr.predict(X_test), y_test)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Combine over- and under-sampling\n", "\n", "For example, over-sample using SMOTE then clean using Tomek." 
] }, { "cell_type": "code", "execution_count": 49, "metadata": {}, "outputs": [], "source": [ "X_train_resampled, y_train_resampled = \\\n", "imblearn.combine.SMOTETomek().fit_resample(X_train, y_train)" ] }, { "cell_type": "code", "execution_count": 50, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(685, 11)" ] }, "execution_count": 50, "metadata": {}, "output_type": "execute_result" } ], "source": [ "X_train.shape" ] }, { "cell_type": "code", "execution_count": 51, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(1174, 11)" ] }, "execution_count": 51, "metadata": {}, "output_type": "execute_result" } ], "source": [ "X_train_resampled.shape" ] }, { "cell_type": "code", "execution_count": 52, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0 605\n", "1 80\n", "Name: survived, dtype: int64" ] }, "execution_count": 52, "metadata": {}, "output_type": "execute_result" } ], "source": [ "y_train.value_counts()" ] }, { "cell_type": "code", "execution_count": 53, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "1 587\n", "0 587\n", "Name: survived, dtype: int64" ] }, "execution_count": 53, "metadata": {}, "output_type": "execute_result" } ], "source": [ "y_train_resampled.value_counts()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Evaluate if this helps" ] }, { "cell_type": "code", "execution_count": 54, "metadata": {}, "outputs": [], "source": [ "lr = LogisticRegression()" ] }, { "cell_type": "code", "execution_count": 55, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "LogisticRegression()" ] }, "execution_count": 55, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Baseline: fit on the original (imbalanced) training data\n", "lr.fit(X_train, y_train)" ] }, { "cell_type": "code", "execution_count": 56, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0.5" ] }, "execution_count": 56, "metadata": {}, "output_type": "execute_result" } ], "source": [ "f1_score(y_test, lr.predict(X_test))" ] }, { "cell_type": "code", "execution_count": 57, 
"metadata": {}, "outputs": [ { "data": { "text/plain": [ "array([[202, 16],\n", " [ 2, 9]])" ] }, "execution_count": 57, "metadata": {}, "output_type": "execute_result" } ], "source": [ "confusion_matrix(lr.predict(X_test), y_test)" ] }, { "cell_type": "code", "execution_count": 58, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "LogisticRegression()" ] }, "execution_count": 58, "metadata": {}, "output_type": "execute_result" } ], "source": [ "lr.fit(X_train_resampled, y_train_resampled)" ] }, { "cell_type": "code", "execution_count": 59, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0.40540540540540543" ] }, "execution_count": 59, "metadata": {}, "output_type": "execute_result" } ], "source": [ "f1_score(lr.predict(X_test), y_test)" ] }, { "cell_type": "code", "execution_count": 60, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array([[170, 10],\n", " [ 34, 15]])" ] }, "execution_count": 60, "metadata": {}, "output_type": "execute_result" } ], "source": [ "confusion_matrix(lr.predict(X_test), y_test)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Use class weights to adjust the loss function\n", "\n", "We make prediction errors in the minority class more costly than prediction errors in the majority class." ] }, { "cell_type": "code", "execution_count": 61, "metadata": {}, "outputs": [], "source": [ "wts = class_weight.compute_class_weight('balanced', np.unique(y_train), y_train)" ] }, { "cell_type": "code", "execution_count": 62, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array([0.5661157, 4.28125 ])" ] }, "execution_count": 62, "metadata": {}, "output_type": "execute_result" } ], "source": [ "wts" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "You can then pass in the class weights. Note that there are several alternative ways to calculate possible class weights to use, and you can also do a GridSearch on weights." 
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Many scikit-learn classifiers accept a `class_weight` argument, given either as a dictionary mapping each class label to its weight or as the string `'balanced'`. By default, all classes are weighted equally." ] }, { "cell_type": "code", "execution_count": 63, "metadata": {}, "outputs": [], "source": [ "# class_weight expects a {label: weight} dictionary, not a bare array\n", "lr = LogisticRegression(class_weight=dict(zip(np.unique(y_train), wts)))" ] }, { "cell_type": "code", "execution_count": 64, "metadata": {}, "outputs": [], "source": [ "lr.fit(X_train, y_train)" ] }, { "cell_type": "code", "execution_count": 65, "metadata": {}, "outputs": [], "source": [ "lr.class_weight" ] }, { "cell_type": "code", "execution_count": 66, "metadata": {}, "outputs": [], "source": [ "f1_score(y_test, lr.predict(X_test))" ] }, { "cell_type": "code", "execution_count": 67, "metadata": {}, "outputs": [], "source": [ "# ROC AUC is computed from the predicted probability of the positive class, not from hard labels\n", "roc_auc_score(y_test, lr.predict_proba(X_test)[:, 1])" ] }, { "cell_type": "code", "execution_count": 68, "metadata": {}, "outputs": [], "source": [ "# Rows are true classes, columns are predicted classes\n", "confusion_matrix(y_test, lr.predict(X_test))" ] }, { "cell_type": "code", "execution_count": 69, "metadata": {}, "outputs": [], "source": [ "lr_balanced = LogisticRegression(class_weight='balanced')" ] }, { "cell_type": "code", "execution_count": 70, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "'balanced'" ] }, "execution_count": 70, "metadata": {}, "output_type": "execute_result" } ], "source": [ 
"lr_balanced.class_weight" ] }, { "cell_type": "code", "execution_count": 71, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "LogisticRegression(class_weight='balanced')" ] }, "execution_count": 71, "metadata": {}, "output_type": "execute_result" } ], "source": [ "lr_balanced.fit(X_train, y_train)" ] }, { "cell_type": "code", "execution_count": 72, "metadata": {}, "outputs": [], "source": [ "roc_auc_score(y_test, lr_balanced.predict_proba(X_test)[:, 1])" ] }, { "cell_type": "code", "execution_count": 73, "metadata": {}, "outputs": [], "source": [ "confusion_matrix(y_test, lr_balanced.predict(X_test))" ] }, { "cell_type": "code", "execution_count": 74, "metadata": {}, "outputs": [], "source": [ "f1_score(y_test, lr_balanced.predict(X_test))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Use a classifier that is less sensitive to imbalance\n", "\n", "Boosted trees often cope well with imbalance because each new tree concentrates on the examples that the current ensemble predicts poorly, which tend to include the minority class." 
] }, { "cell_type": "code", "execution_count": 75, "metadata": {}, "outputs": [], "source": [ "from catboost import CatBoostClassifier" ] }, { "cell_type": "code", "execution_count": 76, "metadata": {}, "outputs": [], "source": [ "cb = CatBoostClassifier()" ] }, { "cell_type": "code", "execution_count": 77, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "" ] }, "execution_count": 77, "metadata": {}, "output_type": "execute_result" } ], "source": [ "cb.fit(X_train, y_train, verbose=0)" ] }, { "cell_type": "code", "execution_count": 78, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0.5263157894736842" ] }, "execution_count": 78, "metadata": {}, "output_type": "execute_result" } ], "source": [ "f1_score(y_test, cb.predict(X_test))" ] }, { "cell_type": "code", "execution_count": 79, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array([[201,   3],\n", "       [ 15,  10]])" ] }, "execution_count": 79, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Rows are true classes, columns are predicted classes\n", "confusion_matrix(y_test, cb.predict(X_test))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### `imbalanced-learn` has classifiers that balance the data automatically" ] }, { "cell_type": "code", "execution_count": 80, "metadata": {}, "outputs": [], "source": [ "from imblearn.ensemble import BalancedRandomForestClassifier" ] }, { "cell_type": "code", "execution_count": 81, "metadata": {}, "outputs": [], "source": [ "brf = BalancedRandomForestClassifier()" ] }, { "cell_type": "code", "execution_count": 82, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "BalancedRandomForestClassifier()" ] }, "execution_count": 82, "metadata": {}, "output_type": "execute_result" } ], "source": [ "brf.fit(X_train, y_train)" ] }, { "cell_type": "code", "execution_count": 83, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array([[154,  50],\n", "       [  8,  17]])" ] }, "execution_count": 83, "metadata": {}, "output_type": "execute_result" } ], "source": [ 
"confusion_matrix(y_test, brf.predict(X_test))" ] }, { "cell_type": "code", "execution_count": 84, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0.3695652173913044" ] }, "execution_count": 84, "metadata": {}, "output_type": "execute_result" } ], "source": [ "f1_score(y_test, brf.predict(X_test))" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3.8.5 64-bit", "language": "python", "name": "python38564bit02a66c47ce504b05b2ef5646cfed96c2" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.8.5" } }, "nbformat": 4, "nbformat_minor": 4 }