Train Test Samples Mix
Imports
[1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from deepchecks.checks.methodology import TrainTestSamplesMix
from deepchecks.base import Dataset
Generating data:
[2]:
iris = load_iris(return_X_y=False, as_frame=True)
X = iris.data
y = iris.target
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=55)
train_dataset = Dataset(pd.concat([X_train, y_train], axis=1),
                        features=iris.feature_names,
                        label='target')
test_df = pd.concat([X_test, y_test], axis=1)
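# Deliberately contaminate the test set with rows copied from the training set (row 1 is copied twice).
bad_test = pd.concat([test_df, train_dataset.data.iloc[[0, 1, 1, 2, 3, 4]]], ignore_index=True)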
test_dataset = Dataset(bad_test,
                       features=iris.feature_names,
                       label='target')
Running the TrainTestSamplesMix check:
[3]:
check = TrainTestSamplesMix()
[4]:
check.run(test_dataset=test_dataset, train_dataset=train_dataset)
Train Test Samples Mix
Detect samples in the test data that appear also in training data.
Additional Outputs
11.76% (6 / 51) of test data samples appear in train data
| | sepal length (cm) | sepal width (cm) | petal length (cm) | petal width (cm) | target |
|---|---|---|---|---|---|
| Train indices: 131 Test indices: 46, 47 | 7.90 | 3.80 | 6.40 | 2.00 | 2 |
| Train indices: 23 Test indices: 49 | 5.10 | 3.30 | 1.70 | 0.50 | 0 |
| Train indices: 101, 142 Test indices: 45 | 5.80 | 2.70 | 5.10 | 1.90 | 2 |
| Train indices: 115 Test indices: 50 | 6.40 | 3.20 | 5.30 | 2.30 | 2 |
| Train indices: 110 Test indices: 48 | 6.50 | 3.20 | 5.10 | 2.00 | 2 |
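The report above groups identical rows and lists the train and test indices where each one occurs. As a rough illustration of that idea, here is a minimal sketch in plain Python. It is not the deepchecks implementation: the function name `find_mixed_samples` and the toy rows are hypothetical, and it assumes rows count as "the same sample" only when every column value is exactly equal.

```python
from collections import defaultdict


def find_mixed_samples(train_rows, test_rows):
    """Hypothetical helper: find test rows that also occur in train.

    Returns (ratio, matches). `ratio` is the fraction of test rows that
    have an exact duplicate in train. `matches` maps each shared row to
    a (train_indices, test_indices) pair.
    """
    # Index every train row by its values so lookups are O(1).
    train_index = defaultdict(list)
    for i, row in enumerate(train_rows):
        train_index[tuple(row)].append(i)

    matches = {}
    n_test = 0
    n_mixed = 0
    for j, row in enumerate(test_rows):
        n_test += 1
        key = tuple(row)
        if key in train_index:
            n_mixed += 1
            # Record the test index next to the train indices for this row.
            matches.setdefault(key, (train_index[key], []))[1].append(j)

    ratio = n_mixed / n_test if n_test else 0.0
    return ratio, matches


# Toy data (made up for illustration): two feature columns plus a label.
train = [(5.1, 3.5, 0), (4.9, 3.0, 0), (6.3, 3.3, 2), (4.9, 3.0, 0)]
test = [(5.8, 2.7, 2), (4.9, 3.0, 0), (6.3, 3.3, 2), (6.3, 3.3, 2)]

ratio, matches = find_mixed_samples(train, test)
n_mixed = sum(len(test_idx) for _, test_idx in matches.values())
print(f"{ratio:.2%} ({n_mixed} / {len(test)}) of test rows appear in train")
# prints: 75.00% (3 / 4) of test rows appear in train
```

To try this on the DataFrames from this notebook, the rows could be passed as plain tuples, for example `train_dataset.data.itertuples(index=False, name=None)`. Exact equality on floats is a simplifying assumption that suits copied rows like the ones injected here; it would miss near-duplicates.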