API Reference - IdentifierLeakage

Identifier Leakage

Imports

[1]:

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Import only the check used in this notebook instead of a wildcard import
from deepchecks.checks.methodology import IdentifierLeakage
from deepchecks.base import Dataset

Generating data:

[2]:

np.random.seed(42)
df = pd.DataFrame(np.random.randn(100, 3), columns=['x1', 'x2', 'x3'])
df['x4'] = df['x1'] * 0.05 + df['x2']
df['x5'] = df['x2']*121 + 0.01 * df['x1']
df['label'] = df['x5'].apply(lambda x: 0 if x < 0 else 1)

[3]:

dataset = Dataset(df, label='label', index_name='x1', datetime_name='x2')
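Because `label` is the sign of `x5`, and `x5` is dominated by the `121 * x2` term, the column registered as the datetime (`x2`) determines the label almost exactly, while the index column (`x1`) enters only through the tiny `0.01 * x1` term. The sketch below rebuilds that construction with the standard library's `random` module instead of NumPy, so the individual numbers differ from the notebook's data, and measures how often the sign of each column matches the label. The helper `sign_agreement` is a hypothetical name used only for this illustration.

```python
import random

rng = random.Random(42)
n_rows = 100
x1 = [rng.gauss(0, 1) for _ in range(n_rows)]
x2 = [rng.gauss(0, 1) for _ in range(n_rows)]

# Same construction as the notebook: x5 is almost a pure multiple of x2.
x5 = [121 * b + 0.01 * a for a, b in zip(x1, x2)]
label = [0 if v < 0 else 1 for v in x5]


def sign_agreement(column, labels):
    """Fraction of rows where (value >= 0) matches the label."""
    hits = sum((1 if v >= 0 else 0) == y for v, y in zip(column, labels))
    return hits / len(labels)


agree_x1 = sign_agreement(x1, label)
agree_x2 = sign_agreement(x2, label)
print(f"x1 sign matches label: {agree_x1:.2f}")
print(f"x2 sign matches label: {agree_x2:.2f}")
```

With this construction, `x2` should agree with the label on nearly every row and `x1` only about half the time, so the datetime column is the one a leakage check would be expected to flag.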

Running identifier_leakage check:

[4]:

IdentifierLeakage().run(dataset)

Identifier Leakage

Check if identifiers (Index/Date) can be used to predict the label.

Additional Outputs

The PPS represents the ability of a feature to single-handedly predict another feature or label.

For Identifier columns (Index/Date) PPS should be nearly 0, otherwise date and index have some predictive effect on the label.
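To make the idea of a single-feature predictive score concrete, here is a minimal sketch in plain Python. It is not the algorithm used by the `ppscore` package (which, as far as I know, relies on a cross-validated decision tree and a different metric). It is a simplified stand-in that fits a one-threshold rule on half of the rows and reports how much it beats the majority-class baseline on the other half. All function names and the example columns are hypothetical.

```python
def majority_accuracy(labels):
    """Accuracy of always predicting the most frequent label."""
    counts = {}
    for y in labels:
        counts[y] = counts.get(y, 0) + 1
    return max(counts.values()) / len(labels)


def fit_stump(train):
    """Pick the threshold and direction with the best training accuracy."""
    best = None
    for t in sorted({x for x, _ in train}):
        for low, high in ((0, 1), (1, 0)):
            acc = sum((low if x < t else high) == y for x, y in train) / len(train)
            if best is None or acc > best[0]:
                best = (acc, t, low, high)
    return best[1:]


def toy_predictive_score(xs, ys):
    """0 means no better than the majority baseline, 1 means perfect."""
    pairs = list(zip(xs, ys))
    train, test = pairs[::2], pairs[1::2]  # simple alternating hold-out split
    t, low, high = fit_stump(train)
    model_acc = sum((low if x < t else high) == y for x, y in test) / len(test)
    baseline = majority_accuracy([y for _, y in test])
    if baseline == 1.0:
        return 0.0  # only one class present, nothing to predict
    return max(0.0, (model_acc - baseline) / (1.0 - baseline))


# A made-up identifier column that fully determines the label...
leaky = list(range(-50, 50))
labels = [0 if v < 0 else 1 for v in leaky]
# ...and a constant column that carries no information at all.
constant = [7] * len(leaky)

leaky_score = toy_predictive_score(leaky, labels)
constant_score = toy_predictive_score(constant, labels)
print(leaky_score, constant_score)  # prints: 1.0 0.0
```

The normalisation against a naive baseline is the design point worth noting: a score near 0 says the column adds nothing beyond guessing the most common class, which is what one hopes to see for an index or date column.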

Using the IdentifierLeakage check class:

[5]:

my_check = IdentifierLeakage(ppscore_params={'sample': 10})
my_check.run(dataset=dataset)

Identifier Leakage

Check if identifiers (Index/Date) can be used to predict the label.

Additional Outputs

The PPS represents the ability of a feature to single-handedly predict another feature or label.

For Identifier columns (Index/Date) PPS should be nearly 0, otherwise date and index have some predictive effect on the label.
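The `ppscore_params` dictionary in the cell above is presumably passed through to the underlying `ppscore` call, where, to my understanding, `sample` caps the number of rows used for scoring. With only 10 rows, the resulting score can vary noticeably from one sample to the next. The sketch below illustrates just the row-capping idea with a hypothetical `cap_rows` helper. It is not the library's code.

```python
import random


def cap_rows(rows, sample, seed=0):
    """Return at most `sample` rows, drawn without replacement."""
    if sample is None or len(rows) <= sample:
        return list(rows)
    return random.Random(seed).sample(rows, sample)


rows = [(i, i % 2) for i in range(100)]  # 100 made-up (feature, label) rows
subset = cap_rows(rows, sample=10)
print(len(subset))  # prints: 10
```

A small `sample` keeps the check fast on large datasets, at the cost of a noisier estimate.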
