Train Test Label Drift
[1]:
import numpy as np
import pandas as pd
from deepchecks import Dataset
from deepchecks.checks import TrainTestLabelDrift
import pprint
Generate data - Classification label
[2]:
np.random.seed(42)
# Train set: two Gaussian features and a balanced binary label (50% ones).
train_data = np.concatenate([np.random.randn(1000, 2), np.random.choice(a=[1, 0], p=[0.5, 0.5], size=(1000, 1))], axis=1)
# Create test_data with drift in the label: the share of ones drops from 50% to 35%.
test_data = np.concatenate([np.random.randn(1000, 2), np.random.choice(a=[1, 0], p=[0.35, 0.65], size=(1000, 1))], axis=1)
df_train = pd.DataFrame(train_data, columns=['col1', 'col2', 'target'])
df_test = pd.DataFrame(test_data, columns=['col1', 'col2', 'target'])
train_dataset = Dataset(df_train, label='target')
test_dataset = Dataset(df_test, label='target')
[3]:
df_train.head()
[3]:
| | col1 | col2 | target |
|---|---|---|---|
| 0 | 0.496714 | -0.138264 | 1.0 |
| 1 | 0.647689 | 1.523030 | 1.0 |
| 2 | -0.234153 | -0.234137 | 1.0 |
| 3 | 1.579213 | 0.767435 | 1.0 |
| 4 | -0.469474 | 0.542560 | 0.0 |
Run check
[4]:
check = TrainTestLabelDrift()
result = check.run(train_dataset=train_dataset, test_dataset=test_dataset)
result
Train Test Label Drift
Calculate label drift between train dataset and test dataset, using statistical measures.
Additional Outputs
The drift score measures the difference between two distributions; in this check, the train and test label distributions.
The check shows the drift score and distributions for the label.
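To make the drift score for a categorical label concrete, here is a minimal, dependency-free sketch of the Population Stability Index (PSI), a measure commonly applied to categorical distributions and the one named in this check's condition. The function name `psi`, the `eps` clipping value, and the toy label lists are assumptions for illustration; deepchecks' own implementation may treat empty categories and other details differently.

```python
import math
from collections import Counter


def psi(expected, actual, eps=1e-4):
    """Population Stability Index between two samples of categorical labels."""
    categories = set(expected) | set(actual)
    exp_counts, act_counts = Counter(expected), Counter(actual)
    score = 0.0
    for cat in categories:
        # Clip proportions away from zero so the logarithm stays finite
        # (the eps value is an assumption of this sketch).
        p = max(exp_counts[cat] / len(expected), eps)
        q = max(act_counts[cat] / len(actual), eps)
        score += (q - p) * math.log(q / p)
    return score


# Toy labels with the same class proportions used to generate the data above.
train_labels = [1] * 500 + [0] * 500   # 50% ones
test_labels = [1] * 350 + [0] * 650    # 35% ones
psi_score = psi(train_labels, test_labels)
print(round(psi_score, 4))
```

Each category contributes a non-negative term, so the score is zero only when the two label distributions match and grows as they diverge.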
Generate data - Regression label
[5]:
train_data = np.concatenate([np.random.randn(1000,2), np.random.randn(1000, 1)], axis=1)
test_data = np.concatenate([np.random.randn(1000,2), np.random.randn(1000, 1)], axis=1)
df_train = pd.DataFrame(train_data, columns=['col1', 'col2', 'target'])
df_test = pd.DataFrame(test_data, columns=['col1', 'col2', 'target'])
# Create drift in the test label: add non-negative noise and a linear ramp from 0 to about 4.
df_test['target'] = df_test['target'].astype('float') + abs(np.random.randn(1000)) + np.arange(0, 1, 0.001) * 4
train_dataset = Dataset(df_train, label='target')
test_dataset = Dataset(df_test, label='target')
Run check
[6]:
check = TrainTestLabelDrift()
result = check.run(train_dataset=train_dataset, test_dataset=test_dataset)
result
Train Test Label Drift
Calculate label drift between train dataset and test dataset, using statistical measures.
Additional Outputs
The drift score measures the difference between two distributions; in this check, the train and test label distributions.
The check shows the drift score and distributions for the label.
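For a numerical label, the Earth Mover's Distance named in this check's condition can be sketched as below. This is a simplified illustration, not the library's code: it only handles equal-sized samples, and dividing by the combined value range to obtain a score in [0, 1] is an assumption of this sketch. The function name `earth_movers_distance` and the toy data are hypothetical.

```python
def earth_movers_distance(sample_a, sample_b):
    """1-D Earth Mover's Distance for two equal-sized samples, scaled to [0, 1]."""
    if len(sample_a) != len(sample_b):
        raise ValueError("this sketch only handles equal-sized samples")
    a, b = sorted(sample_a), sorted(sample_b)
    # For equal-sized 1-D samples, optimal transport pairs the i-th smallest values.
    raw = sum(abs(x - y) for x, y in zip(a, b)) / len(a)
    # Assumption: normalize by the combined value range to get a unitless score.
    value_range = max(a[-1], b[-1]) - min(a[0], b[0])
    return raw / value_range if value_range > 0 else 0.0


train_target = [0.0, 1.0, 2.0, 3.0]
test_target = [2.0, 3.0, 4.0, 5.0]   # same shape, shifted up by 2
emd = earth_movers_distance(train_target, test_target)
print(emd)
```

A pure shift of the label, like the one added to the test target above, moves every sorted value by the same amount, so the unscaled distance equals the size of the shift.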
Add condition
[7]:
check_cond = TrainTestLabelDrift().add_condition_drift_score_not_greater_than()
check_cond.run(train_dataset=train_dataset, test_dataset=test_dataset)
Train Test Label Drift
Calculate label drift between train dataset and test dataset, using statistical measures.
Conditions Summary
| Status | Condition | More Info |
|---|---|---|
| ✖ | PSI <= 0.2 and Earth Mover's Distance <= 0.1 for label drift | Label's Earth Mover's Distance above threshold: 0.25 |
Additional Outputs
The drift score measures the difference between two distributions; in this check, the train and test label distributions.
The check shows the drift score and distributions for the label.
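The condition summary above compares the label's drift score with a fixed threshold: 0.2 for PSI and 0.1 for Earth Mover's Distance. The logic can be sketched in plain Python as follows. The helper `label_drift_condition` and its parameter names are hypothetical and are not part of the deepchecks API; only the thresholds and the message format are taken from the output shown above.

```python
def label_drift_condition(drift_score, method, max_psi=0.2, max_emd=0.1):
    """Return (passed, message) for a label drift score and its measure name."""
    thresholds = {"PSI": max_psi, "Earth Mover's Distance": max_emd}
    if method not in thresholds:
        raise ValueError(f"unknown drift method: {method}")
    # The condition fails only when the score is strictly above the threshold.
    if drift_score > thresholds[method]:
        return False, f"Label's {method} above threshold: {drift_score:.2f}"
    return True, ""


# The regression example above reported an Earth Mover's Distance of 0.25.
passed, message = label_drift_condition(0.25, "Earth Mover's Distance")
print(passed, message)
```

Keeping a separate threshold per measure matters because PSI and Earth Mover's Distance live on different scales, so a single cutoff would not be meaningful for both.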