
Train Test Label Drift

[1]:
import numpy as np
import pandas as pd

from deepchecks import Dataset
from deepchecks.checks import TrainTestLabelDrift
import pprint

Generate data - Classification label

[2]:
np.random.seed(42)

train_data = np.concatenate([np.random.randn(1000,2), np.random.choice(a=[1,0], p=[0.5, 0.5], size=(1000, 1))], axis=1)
# Create test_data with label drift: the share of class 1 drops from 0.5 to 0.35
test_data = np.concatenate([np.random.randn(1000,2), np.random.choice(a=[1,0], p=[0.35, 0.65], size=(1000, 1))], axis=1)

df_train = pd.DataFrame(train_data, columns=['col1', 'col2', 'target'])
df_test = pd.DataFrame(test_data, columns=['col1', 'col2', 'target'])

train_dataset = Dataset(df_train, label='target')
test_dataset = Dataset(df_test, label='target')

[3]:
df_train.head()
[3]:
       col1      col2  target
0  0.496714 -0.138264     1.0
1  0.647689  1.523030     1.0
2 -0.234153 -0.234137     1.0
3  1.579213  0.767435     1.0
4 -0.469474  0.542560     0.0
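The first rows alone do not show whether the label distribution differs between the two datasets. A quick way to confirm the injected drift is to compare class shares directly. The sketch below is standalone and uses only NumPy; `class_shares` is a hypothetical helper, and the generator is seeded separately, so the sampled values differ from the notebook's own `np.random.seed(42)` data.

```python
import numpy as np

# Separate generator for illustration; not the same stream as np.random.seed(42) above.
rng = np.random.default_rng(42)

# Same sampling probabilities as the train and test labels in this notebook.
train_labels = rng.choice([1, 0], p=[0.5, 0.5], size=1000)
test_labels = rng.choice([1, 0], p=[0.35, 0.65], size=1000)


def class_shares(labels):
    """Return the fraction of samples in each class (hypothetical helper)."""
    values, counts = np.unique(labels, return_counts=True)
    return {int(v): float(c / counts.sum()) for v, c in zip(values, counts)}


train_share = class_shares(train_labels)
test_share = class_shares(test_labels)
print("train:", train_share)
print("test: ", test_share)
```

On the notebook's own data the same comparison could be made with `df_train['target'].value_counts(normalize=True)` and its `df_test` counterpart.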

Run check

[4]:
check = TrainTestLabelDrift()
result = check.run(train_dataset=train_dataset, test_dataset=test_dataset)
result

Train Test Label Drift

Calculate label drift between train dataset and test dataset, using statistical measures.

Additional Outputs
The drift score measures the difference between two distributions; in this check, the label distributions of the train and test datasets.
The check displays the drift score and the label distribution in each dataset.
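For a categorical label, one widely used drift score is the Population Stability Index (PSI), which the condition output later in this notebook also names. The sketch below shows the standard PSI formula, sum over classes of (test share - train share) * ln(test share / train share). The function `psi` and its `min_share` floor are illustrative assumptions, not the library's implementation.

```python
import numpy as np


def psi(expected_labels, actual_labels, min_share=1e-4):
    """Population Stability Index between two categorical samples (illustrative sketch)."""
    expected_labels = np.asarray(expected_labels)
    actual_labels = np.asarray(actual_labels)
    categories = np.union1d(expected_labels, actual_labels)

    # Share of each category in each sample.
    expected = np.array([np.mean(expected_labels == c) for c in categories])
    actual = np.array([np.mean(actual_labels == c) for c in categories])

    # Floor the shares so a category missing from one sample does not produce log(0).
    expected = np.clip(expected, min_share, None)
    actual = np.clip(actual, min_share, None)

    return float(np.sum((actual - expected) * np.log(actual / expected)))


# Idealized labels with exactly the class shares used in this notebook.
train_labels = np.array([1] * 500 + [0] * 500)
test_labels = np.array([1] * 350 + [0] * 650)

score = psi(train_labels, test_labels)
print(round(score, 4))  # about 0.093
```

With exact shares of 0.5 and 0.35 for class 1, this formula gives roughly 0.093. The score reported by the check on sampled data can differ.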

Generate data - Regression label

[5]:
train_data = np.concatenate([np.random.randn(1000,2), np.random.randn(1000, 1)], axis=1)
test_data = np.concatenate([np.random.randn(1000,2), np.random.randn(1000, 1)], axis=1)

df_train = pd.DataFrame(train_data, columns=['col1', 'col2', 'target'])
df_test = pd.DataFrame(test_data, columns=['col1', 'col2', 'target'])
# Create drift in the test label: add positive noise plus a linear trend rising from 0 to about 4
df_test['target'] = df_test['target'] + np.abs(np.random.randn(1000)) + np.arange(0, 1, 0.001) * 4

train_dataset = Dataset(df_train, label='target')
test_dataset = Dataset(df_test, label='target')

Run check

[6]:
check = TrainTestLabelDrift()
result = check.run(train_dataset=train_dataset, test_dataset=test_dataset)
result

Train Test Label Drift

Calculate label drift between train dataset and test dataset, using statistical measures.

Additional Outputs
The drift score measures the difference between two distributions; in this check, the label distributions of the train and test datasets.
The check displays the drift score and the label distribution in each dataset.
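For a numeric label, a common drift score is the Earth Mover's Distance (the 1-D Wasserstein distance), which the condition output in this notebook also names. For two samples of equal size it reduces to the mean absolute difference between the sorted values. The sketch below assumes equal sample sizes, as in this notebook (1000 rows each). Dividing by the combined value range is one way to make the score scale-free; whether the library normalizes in exactly this way is an assumption.

```python
import numpy as np


def earth_movers_distance(a, b, normalize=True):
    """1-D Earth Mover's Distance for two equal-size samples (illustrative sketch)."""
    a = np.sort(np.asarray(a, dtype=float))
    b = np.sort(np.asarray(b, dtype=float))
    if a.shape != b.shape:
        raise ValueError("this sketch only handles samples of equal size")

    # With equal sizes, the optimal transport plan pairs the i-th smallest values.
    distance = np.mean(np.abs(a - b))

    if normalize:
        # Assumption: scale by the combined range so the score lies in [0, 1].
        value_range = max(a[-1], b[-1]) - min(a[0], b[0])
        distance = distance / value_range if value_range > 0 else 0.0
    return float(distance)


train_target = np.array([0.0, 1.0, 2.0])
test_target = train_target + 1.0  # every value shifted up by 1

raw = earth_movers_distance(train_target, test_target, normalize=False)
scaled = earth_movers_distance(train_target, test_target)
print(raw, round(scaled, 3))  # 1.0 0.333
```

For samples of unequal size, `scipy.stats.wasserstein_distance` computes the unnormalized distance.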

Add condition

[7]:
check_cond = TrainTestLabelDrift().add_condition_drift_score_not_greater_than()
check_cond.run(train_dataset=train_dataset, test_dataset=test_dataset)

Train Test Label Drift

Calculate label drift between train dataset and test dataset, using statistical measures.

Conditions Summary
Status | Condition | More Info
Fail | PSI <= 0.2 and Earth Mover's Distance <= 0.1 for label drift | Label's Earth Mover's Distance above threshold: 0.25
Additional Outputs
The drift score measures the difference between two distributions; in this check, the label distributions of the train and test datasets.
The check displays the drift score and the label distribution in each dataset.
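The condition summary above compares the drift score against a threshold that depends on the metric: 0.2 for PSI and 0.1 for Earth Mover's Distance. That logic can be sketched in a few lines. `drift_condition` is a hypothetical helper written for illustration, not part of the deepchecks API; the thresholds and the message wording are taken from the summary row.

```python
def drift_condition(score, method, max_psi=0.2, max_emd=0.1):
    """Return (passed, message) for a drift score (hypothetical helper, not library code)."""
    limits = {"PSI": max_psi, "Earth Mover's Distance": max_emd}
    if method not in limits:
        raise ValueError(f"unknown drift method: {method}")

    passed = score <= limits[method]
    message = "" if passed else f"Label's {method} above threshold: {score:.2f}"
    return passed, message


# The regression label in this notebook is scored with Earth Mover's Distance.
passed, message = drift_condition(0.25, "Earth Mover's Distance")
print(passed, message)
```

A score of 0.25 exceeds the 0.1 limit for Earth Mover's Distance, so the helper reports a failure with the same message as the summary row.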