API Reference - TrainTestLabelDrift

Train Test Label Drift

[1]:

import numpy as np
import pandas as pd

from deepchecks import Dataset
from deepchecks.checks import TrainTestLabelDrift

Generate data - Classification label

[2]:

np.random.seed(42)

# Train set: two Gaussian features and a balanced binary label (50% ones).
train_data = np.concatenate([np.random.randn(1000, 2), np.random.choice(a=[1, 0], p=[0.5, 0.5], size=(1000, 1))], axis=1)
# Test set: same feature distribution, but the label drifts to 35% ones.
test_data = np.concatenate([np.random.randn(1000, 2), np.random.choice(a=[1, 0], p=[0.35, 0.65], size=(1000, 1))], axis=1)

df_train = pd.DataFrame(train_data, columns=['col1', 'col2', 'target'])
df_test = pd.DataFrame(test_data, columns=['col1', 'col2', 'target'])

train_dataset = Dataset(df_train, label='target')
test_dataset = Dataset(df_test, label='target')

[3]:

df_train.head()

[3]:

	col1	col2	target
0	0.496714	-0.138264	1.0
1	0.647689	1.523030	1.0
2	-0.234153	-0.234137	1.0
3	1.579213	0.767435	1.0
4	-0.469474	0.542560	0.0

Run check

[4]:

check = TrainTestLabelDrift()
result = check.run(train_dataset=train_dataset, test_dataset=test_dataset)
result

Train Test Label Drift

Calculate label drift between train dataset and test dataset, using statistical measures.

Additional Outputs

The drift score measures the difference between two distributions; in this check, these are the label distributions of the train and test datasets.
The check shows the drift score and the distribution of the label in each dataset.
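
To make the idea of a drift score concrete for a categorical label, here is a minimal sketch of the Population Stability Index (PSI), the measure named for label drift in the condition output later on this page. The helper `psi` and its `eps` smoothing constant are illustrative assumptions, not the deepchecks implementation, so the value may differ from the score the check reports.

```python
import numpy as np


def psi(expected, actual, eps=1e-4):
    """PSI between two samples of a categorical variable (hypothetical helper)."""
    expected = np.asarray(expected)
    actual = np.asarray(actual)
    categories = np.union1d(expected, actual)
    # Share of each category in each sample, floored at eps to avoid log(0).
    p = np.array([max(np.mean(expected == c), eps) for c in categories])
    q = np.array([max(np.mean(actual == c), eps) for c in categories])
    # Each category contributes (q - p) * ln(q / p), which is never negative.
    return float(np.sum((q - p) * np.log(q / p)))


# Synthetic labels with the same class balances as the example above.
rng = np.random.default_rng(0)
train_label = rng.choice([1, 0], p=[0.5, 0.5], size=1000)
test_label = rng.choice([1, 0], p=[0.35, 0.65], size=1000)

print(psi(train_label, train_label))  # identical samples give 0.0
print(psi(train_label, test_label))   # positive when the class balance shifts
```

PSI is symmetric in its two arguments and grows as the class proportions move apart, which is why a single threshold can be applied to it.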

Generate data - Regression label

[5]:

# Train and test start from the same distribution: Gaussian features and a Gaussian label.
train_data = np.concatenate([np.random.randn(1000, 2), np.random.randn(1000, 1)], axis=1)
test_data = np.concatenate([np.random.randn(1000, 2), np.random.randn(1000, 1)], axis=1)

df_train = pd.DataFrame(train_data, columns=['col1', 'col2', 'target'])
df_test = pd.DataFrame(test_data, columns=['col1', 'col2', 'target'])
# Create drift in the test label: add positive noise plus a linear trend rising from 0 to about 4.
df_test['target'] = df_test['target'] + np.abs(np.random.randn(1000)) + np.arange(0, 1, 0.001) * 4

train_dataset = Dataset(df_train, label='target')
test_dataset = Dataset(df_test, label='target')

Run check

[6]:

check = TrainTestLabelDrift()
result = check.run(train_dataset=train_dataset, test_dataset=test_dataset)
result

Train Test Label Drift

Calculate label drift between train dataset and test dataset, using statistical measures.
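
For a numeric label, the condition output on this page names the Earth Mover's Distance. A minimal sketch of the one-dimensional distance follows, computed as the area between the two empirical CDFs. The helper name `earth_movers_distance` and the optional rescaling by the pooled value range are assumptions made for illustration; the check's own normalization may differ.

```python
import numpy as np


def earth_movers_distance(a, b, scale_by_range=True):
    """1-D Earth Mover's Distance between two samples (hypothetical helper)."""
    a = np.sort(np.asarray(a, dtype=float))
    b = np.sort(np.asarray(b, dtype=float))
    pooled = np.sort(np.concatenate([a, b]))
    widths = np.diff(pooled)
    # Empirical CDF of each sample, evaluated at every pooled point but the last.
    cdf_a = np.searchsorted(a, pooled[:-1], side="right") / a.size
    cdf_b = np.searchsorted(b, pooled[:-1], side="right") / b.size
    # Area between the two step-function CDFs.
    distance = float(np.sum(np.abs(cdf_a - cdf_b) * widths))
    if scale_by_range:
        # Assumed normalization: divide by the pooled range so the score is unit-free.
        value_range = pooled[-1] - pooled[0]
        distance = distance / value_range if value_range > 0 else 0.0
    return distance


rng = np.random.default_rng(0)
train_label = rng.standard_normal(1000)
test_label = rng.standard_normal(1000) + 2.0  # same shape, shifted by 2

print(earth_movers_distance(train_label, train_label))  # identical samples give 0.0
print(earth_movers_distance(train_label, test_label, scale_by_range=False))  # close to the shift of 2
```

Without rescaling, the distance is expressed in the units of the label, so a fixed threshold would mean different things for differently scaled targets; that is the motivation for the assumed range normalization.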


Add condition

[7]:

# Called without arguments, the condition uses its default thresholds.
check_cond = TrainTestLabelDrift().add_condition_drift_score_not_greater_than()
check_cond.run(train_dataset=train_dataset, test_dataset=test_dataset)

Train Test Label Drift

Calculate label drift between train dataset and test dataset, using statistical measures.

Conditions Summary

Status	Condition	More Info
✖	PSI <= 0.2 and Earth Mover's Distance <= 0.1 for label drift	Label's Earth Mover's Distance above threshold: 0.25
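
The condition row above can be read as a threshold test on whichever measure applies to the label. The sketch below mirrors that logic in plain Python. `label_drift_condition` is a hypothetical helper, not part of deepchecks; only the two thresholds (0.2 for PSI, 0.1 for Earth Mover's Distance) and the score 0.25 come from the output shown here.

```python
def label_drift_condition(score, method, max_psi=0.2, max_emd=0.1):
    """Return (passed, detail) for a label drift score (hypothetical helper)."""
    limits = {"PSI": max_psi, "Earth Mover's Distance": max_emd}
    if method not in limits:
        raise ValueError(f"unknown drift method: {method!r}")
    # The condition passes when the score does not exceed the method's limit.
    passed = score <= limits[method]
    detail = "" if passed else f"Label's {method} above threshold: {score:.2f}"
    return passed, detail


print(label_drift_condition(0.25, "Earth Mover's Distance"))
# (False, "Label's Earth Mover's Distance above threshold: 0.25")
print(label_drift_condition(0.09, "PSI"))
# (True, '')
```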

