
Boosting Overfit

Load data

The dataset is the Adult dataset, which can be downloaded from the UCI Machine Learning Repository.

Dua, D. and Graff, C. (2019). UCI Machine Learning Repository [http://archive.ics.uci.edu/ml]. Irvine, CA: University of California, School of Information and Computer Science.

[1]:
import pandas as pd
from sklearn.preprocessing import LabelEncoder

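# The raw files have no header row, so supply generic column names
# (13 feature names plus the target).
names = [*(f'col_{i}' for i in range(1,14)), 'target']
train_df = pd.read_csv('http://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data',
                       header=None, names=names)
# Skip the first line of adult.test, which is not a data record.
val_df = pd.read_csv('http://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.test',
                     skiprows=1, header=None, names=names)
# Drop the last character of each test label (a trailing '.') so the labels match the train set.
val_df['target'] = val_df['target'].str[:-1]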

# Run label encoder on all categorical columns
for column in train_df.columns:
    if train_df[column].dtype == 'object':
        le = LabelEncoder()
        le.fit(pd.concat([train_df[column], val_df[column]]))
        train_df[column] = le.transform(train_df[column])
        val_df[column] = le.transform(val_df[column])
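
The loop above fits each encoder on the train and validation values together. A minimal sketch of why this matters, using made-up toy values rather than the Adult data: an encoder fitted on the train column alone raises an error when the validation column contains a category it has never seen, while an encoder fitted on both columns maps each category to the same integer in both sets.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy columns (illustrative values, not taken from the real files).
train_col = pd.Series([' Private', ' State-gov', ' Private'])
val_col = pd.Series([' Private', ' Never-worked'])

# Fitting on the train column only: the unseen validation category fails.
train_only = LabelEncoder().fit(train_col)
try:
    train_only.transform(val_col)
    unseen_error = None
except ValueError as err:
    unseen_error = err

# Fitting on both columns: one shared mapping covers every category.
shared = LabelEncoder().fit(pd.concat([train_col, val_col]))
train_codes = shared.transform(train_col)
val_codes = shared.transform(val_col)

print('train-only encoder failed:', unseen_error is not None)
print('shared classes:', list(shared.classes_))
print('train codes:', list(train_codes), 'validation codes:', list(val_codes))
```

The integer codes only reflect the sorted order of the category strings, so they carry no meaningful ordering of their own.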

Create Dataset

[2]:
from deepchecks import Dataset
from deepchecks.checks.methodology.boosting_overfit import BoostingOverfit

train_ds = Dataset(train_df, label='target')
validation_ds = Dataset(val_df, label='target')
Some columns have been inferred as categorical features: col_1, col_3, col_4, col_5, col_6, col_7, col_8.
 and more...
 For the full list of columns, use dataset.cat_features

Classification model

[3]:
from sklearn.ensemble import AdaBoostClassifier

clf = AdaBoostClassifier(random_state=0)
clf.fit(train_ds.data[train_ds.features], train_ds.data[train_ds.label_name])
BoostingOverfit().run(train_ds, validation_ds, clf)

Boosting Overfit

Check for overfit caused by using too many iterations in a gradient boosted model.

Additional Outputs
The check limits the boosting model to at most N estimators at a time and plots the accuracy calculated for each subset of estimators, for both the train dataset and the test dataset.
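
The idea described above can be approximated with scikit-learn alone. The following is a rough sketch, not the deepchecks implementation: it assumes synthetic data from `make_classification` in place of the Adult dataset, and it relies on `AdaBoostClassifier.staged_predict`, which yields the ensemble's predictions after each boosting iteration. The `best_n` and `drop_from_best` names are hypothetical helpers introduced only for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data so the sketch runs without downloading anything.
X, y = make_classification(n_samples=600, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

clf = AdaBoostClassifier(n_estimators=30, random_state=0)
clf.fit(X_train, y_train)

# staged_predict yields predictions using only the first 1, 2, ..., N
# estimators, which gives one accuracy value per ensemble size.
train_acc = [accuracy_score(y_train, pred) for pred in clf.staged_predict(X_train)]
test_acc = [accuracy_score(y_test, pred) for pred in clf.staged_predict(X_test)]

for n in (1, max(1, len(test_acc) // 2), len(test_acc)):
    print(f'{n:>3} estimators: train={train_acc[n - 1]:.3f}  test={test_acc[n - 1]:.3f}')

# Hypothetical summary (an assumption, not the rule deepchecks applies):
# how far the final test accuracy sits below the best one seen on the way.
best_n = max(range(len(test_acc)), key=test_acc.__getitem__) + 1
drop_from_best = test_acc[best_n - 1] - test_acc[-1]
print(f'best test accuracy at {best_n} estimators, drop by the end: {drop_from_best:.3f}')
```

A train curve that keeps rising while the test curve flattens or falls after some number of estimators is the pattern the check is meant to surface.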