
Single Feature Contribution Train Test

This notebook provides an overview of how to use and understand the “Single Feature Contribution Train Test” check.

Structure:

What is the purpose of the check?
Generate data
Run the check
Define a condition

What is the purpose of the check?

The check estimates for every feature its ability to predict the label by itself. This check can help find:

A potential leakage (between the label and a feature) in both datasets, e.g. due to incorrect sampling during data collection. This is a critical problem that will likely stay hidden without this check, because it won’t show up when comparing model performance on train and test.
A strong drift in the feature-label relation between the two datasets, possibly originating from a leakage in one of them, e.g. a leakage that exists in the training data but not necessarily in a “fresh” dataset that may have been built differently.

The check is based on calculating the predictive power score (PPS) of each feature. For more details, see the section on how the PPS is calculated.

What is a problematic result?

Features with a high predictive score: this can indicate a leakage between the label and the feature, meaning that the feature holds information that is, at least in part, derived from the label to begin with.

For example: a bank uses its loans database to build a model that predicts whether a customer will be able to repay a loan. One of the features it extracts is “number of late payments”. This feature clearly has very strong predictive power for the customer’s ability to repay the loan, but it is based on data the bank learns only after the loan is given. It won’t be available at prediction time, and is therefore a type of leakage.
A large difference between the PPS of a certain feature in the train and in the test datasets: this indicates a drift in the relation between the feature and the label, and a possible leakage in one of the datasets.

For example: a coffee shop chain trained a model to predict the number of coffee cups ordered in a store. The model was trained on data from a specific state and tested on data from all states. Running the Single Feature Contribution check on this split revealed a large difference in the PPS of the feature “time_in_day”: it had much higher predictive power on the training data than on the test data. Investigating this led to the root cause: the time of day was saved in UTC for all states, which made the feature much less indicative on the test data, since that data covered several time zones (and many more coffee cups are ordered during the morning and noon than during the evening and night). The fix was to change the feature to the time in the store's local time zone, which restored its predictive power and improved the model’s overall performance.
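The time zone fix in the coffee shop example can be sketched as follows. This is a minimal illustration, not the chain's actual pipeline: the function name `local_hour` and the per-store UTC offsets are hypothetical, and fixed offsets ignore daylight saving time.

```python
from datetime import datetime, timedelta, timezone


def local_hour(utc_timestamp: datetime, utc_offset_hours: float) -> int:
    """Return the hour of day in a store's local time (hypothetical helper)."""
    # Assumption: each store is described by a fixed UTC offset (no DST handling).
    store_tz = timezone(timedelta(hours=utc_offset_hours))
    return utc_timestamp.astimezone(store_tz).hour


# The same UTC timestamp falls in a different part of the local day per store.
order_time = datetime(2022, 1, 10, 13, 0, tzinfo=timezone.utc)
east_hour = local_hour(order_time, -5)  # store at UTC-5
west_hour = local_hour(order_time, -8)  # store at UTC-8
print(east_hour, west_hour)
```

With the raw UTC hour, both orders would share the same feature value even though one was placed in the morning and the other before dawn, local time.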

How is the Predictive Power Score (PPS) calculated?

The PPS of a feature is a numeric score between 0 (the feature has no predictive power) and 1 (the feature can fully predict the label alone).

The process of calculating the PPS is the following:

1. Extract from the data only the label and the feature being tested
2. Drop samples with missing values
3. Keep 5000 samples from the data (this number is a configurable parameter)
4. Preprocess categorical columns: the label using sklearn.preprocessing.LabelEncoder and the feature using sklearn.preprocessing.OneHotEncoder
5. Partition the data with 4-fold cross-validation
6. Train a decision tree
7. Compare the trained model’s performance with the naive model’s performance as follows:

Regression: The naive model always predicts the median of the label column. The metric used is MAE, and the PPS calculation is: \(1 - \frac{\text{MAE model}}{\text{MAE naive}}\)
Classification: The naive model always predicts the most common class of the label column. The metric used is F1, and the PPS calculation is: \(\frac{\text{F1 model} - \text{F1 naive}}{1 - \text{F1 naive}}\)
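The final comparison step can be sketched in plain Python. The function names below are illustrative and not part of any library. Clamping negative results to 0, and returning 0 when the naive model is already perfect, are assumptions made here so that the result stays within the 0 to 1 range described above.

```python
def pps_regression(mae_model: float, mae_naive: float) -> float:
    """PPS for regression: 1 - MAE_model / MAE_naive."""
    if mae_naive == 0:
        # Assumption: a perfect naive model leaves nothing to improve on.
        return 0.0
    # Assumption: a model worse than the naive one scores 0 rather than negative.
    return max(0.0, 1.0 - mae_model / mae_naive)


def pps_classification(f1_model: float, f1_naive: float) -> float:
    """PPS for classification: (F1_model - F1_naive) / (1 - F1_naive)."""
    if f1_naive >= 1:
        # Assumption: a perfect naive model leaves nothing to improve on.
        return 0.0
    return max(0.0, (f1_model - f1_naive) / (1.0 - f1_naive))


regression_score = pps_regression(mae_model=2.0, mae_naive=8.0)
classification_score = pps_classification(f1_model=0.75, f1_naive=0.5)
print(regression_score, classification_score)
```

In both cases the score measures how much of the gap between the naive model and a perfect model the single-feature decision tree closes.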

Note: all the PPS parameters can be changed by passing the ``ppscore_params`` parameter to the check.

For further information about PPS, you can visit the ppscore GitHub repository or the following blog post: RIP correlation. Introducing the Predictive Power Score

Generate data

To see the Single Feature Contribution Train Test check in action, we’ll take an existing dataset and add a direct relation between two features and the label.

[1]:

from deepchecks.datasets.classification.phishing import load_data

def relate_column_to_label(dataset, column, label_power):
    """Shift `column` by label * column mean * label_power, tying the column to the label."""
    col_data = dataset.data[column]
    dataset.data[column] = col_data + (dataset.data[dataset.label_name] * col_data.mean() * label_power)

train_dataset, test_dataset = load_data()

# Relate two features to the label in the train dataset, and only one of them (more weakly) in the test dataset
relate_column_to_label(train_dataset, 'numDigits', 10)
relate_column_to_label(train_dataset, 'numLinks', 10)
relate_column_to_label(test_dataset, 'numDigits', 0.1)
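To see what this transformation does to the values, here is a toy version of the same arithmetic on plain lists. The helper `relate_values_to_label` and the sample numbers are made up for illustration; the real function above operates on the dataset's columns.

```python
def relate_values_to_label(values, labels, label_power):
    """Toy list-based version of the column transformation (hypothetical helper)."""
    mean_value = sum(values) / len(values)
    # Rows with label 1 are shifted by mean * label_power; rows with label 0 are unchanged.
    return [v + label * mean_value * label_power for v, label in zip(values, labels)]


values = [2.0, 4.0, 6.0, 8.0]
labels = [0, 1, 0, 1]

strong = relate_values_to_label(values, labels, 10)
weak = relate_values_to_label(values, labels, 0.1)
print(strong)
print(weak)
```

In this toy data, a label_power of 10 pushes every label-1 row far above every label-0 row, so a single threshold on the feature separates the classes. With 0.1 the two value ranges still overlap.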

Run the check

[2]:

from deepchecks.checks.methodology import SingleFeatureContributionTrainTest

result = SingleFeatureContributionTrainTest().run(train_dataset=train_dataset, test_dataset=test_dataset)
result

Single Feature Contribution Train-Test

Return the Predictive Power Score of all features, in order to estimate each feature's ability to predict the label. Read More...

Additional Outputs

The Predictive Power Score (PPS) is used to estimate the ability of a feature to predict the label by itself. (Read more about Predictive Power Score)

In the graph above, we should suspect we have problems in our data if:

1. Train dataset PPS values are high:

Can indicate that this feature's success in predicting the label is actually due to data leakage, meaning that the feature holds information that is based on the label to begin with.

2. Large difference between train and test PPS (train PPS is larger):

An even more powerful indication of data leakage, as a feature that was powerful in train but not in test can be explained by leakage in train that is not relevant to a new dataset.

3. Large difference between test and train PPS (test PPS is larger):

An anomalous value, could indicate drift in test dataset that caused a coincidental correlation to the target label.

Observe the check’s output

The check shows the top features with the highest PPS difference in the datasets, and elaborates how to interpret the results.

By default, only the top 5 features are displayed; this can be changed using the n_show_top parameter of the check.

In addition to the graphic output, the check also returns a value which includes all of the information that is needed for defining the conditions for validation.

The value is a dictionary with the following keys:
train - for each column, its numeric PPS score (0 to 1) in the train dataset
test - for each column, its numeric PPS score (0 to 1) in the test dataset
train-test difference - for each column, the difference between the two scores, calculated as train minus test

[3]:

result.value

[3]:

{'train': {'numDigits': 0.9527027027027027,
  'numLinks': 0.8851351351351351,
  'urlLength': 0.23646714795497537,
  'month': 0.0,
  'ext': 0.0,
  'numParams': 0.0,
  'num_%20': 0.0,
  'num_@': 0.0,
  'entropy': 0.0,
  'has_ip': 0.0,
  'hasHttp': 0.0,
  'hasHttps': 0.0,
  'urlIsLive': 0.0,
  'dsr': 0.0,
  'dse': 0.0,
  'bodyLength': 0.0,
  'numTitles': 0.0,
  'numImages': 0.0,
  'specialChars': 0.0,
  'scriptLength': 0.0,
  'sbr': 0.0,
  'bscr': 0.0,
  'sscr': 0.0},
 'test': {'numDigits': 0.8367346292159752,
  'urlLength': 0.2723349191922806,
  'month': 0.0,
  'ext': 0.0,
  'numParams': 0.0,
  'num_%20': 0.0,
  'num_@': 0.0,
  'entropy': 0.0,
  'has_ip': 0.0,
  'hasHttp': 0.0,
  'hasHttps': 0.0,
  'urlIsLive': 0.0,
  'dsr': 0.0,
  'dse': 0.0,
  'bodyLength': 0.0,
  'numTitles': 0.0,
  'numImages': 0.0,
  'numLinks': 0.0,
  'specialChars': 0.0,
  'scriptLength': 0.0,
  'sbr': 0.0,
  'bscr': 0.0,
  'sscr': 0.0},
 'train-test difference': {'bodyLength': 0.0,
  'bscr': 0.0,
  'dse': 0.0,
  'dsr': 0.0,
  'entropy': 0.0,
  'ext': 0.0,
  'hasHttp': 0.0,
  'hasHttps': 0.0,
  'has_ip': 0.0,
  'month': 0.0,
  'numDigits': 0.11596807348672755,
  'numImages': 0.0,
  'numLinks': 0.8851351351351351,
  'numParams': 0.0,
  'numTitles': 0.0,
  'num_%20': 0.0,
  'num_@': 0.0,
  'sbr': 0.0,
  'scriptLength': 0.0,
  'specialChars': 0.0,
  'sscr': 0.0,
  'urlIsLive': 0.0,
  'urlLength': -0.035867771237305224}}
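The relation between the three keys can be checked by hand. The sketch below uses a small excerpt of the values shown above (three features only) rather than a live result object, recomputes the train minus test difference, and ranks the features by it. Ranking by the signed difference is an assumption made for this illustration.

```python
# Excerpt of the output above, limited to three features.
value = {
    "train": {"numDigits": 0.9527027027027027,
              "numLinks": 0.8851351351351351,
              "urlLength": 0.23646714795497537},
    "test": {"numDigits": 0.8367346292159752,
             "numLinks": 0.0,
             "urlLength": 0.2723349191922806},
}

# 'train-test difference' is the train score minus the test score, per feature.
difference = {
    feature: value["train"][feature] - value["test"][feature]
    for feature in value["train"]
}

# Rank features by their difference, largest first, and keep the top 2.
top_features = sorted(difference.items(), key=lambda item: item[1], reverse=True)[:2]
for feature, diff in top_features:
    print(f"{feature}: {diff:.2f}")
```

A negative difference, as for urlLength, means the feature was slightly more predictive on the test dataset than on the train dataset.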

Define a condition

We can add a condition to our check that validates that our PPS scores aren’t too high. The check has 2 built-in conditions:
add_condition_feature_pps_difference_not_greater_than - Validate that the difference in the PPS between train and test is not larger than a defined amount (default 0.2)
add_condition_feature_pps_in_train_not_greater_than - Validate that the PPS scores on the train dataset do not exceed a defined amount (default 0.7)

Let’s add the conditions, and re-run the check:

[4]:

check = SingleFeatureContributionTrainTest().add_condition_feature_pps_difference_not_greater_than().add_condition_feature_pps_in_train_not_greater_than()
result = check.run(train_dataset=train_dataset, test_dataset=test_dataset)
result.show(show_additional_outputs=False)

Single Feature Contribution Train-Test

Return the Predictive Power Score of all features, in order to estimate each feature's ability to predict the label. Read More...

Conditions Summary

Status	Condition	More Info
✖	Train-Test features' Predictive Power Score difference is not greater than 0.2	Features with PPS difference above threshold: {'numLinks': '0.89'}
✖	Train features' Predictive Power Score is not greater than 0.7	Features in train dataset with PPS above threshold: {'numDigits': '0.95', 'numLinks': '0.89'}

We see that the conditions have caught the changes we introduced to the datasets, and alert us that there is a possible problem with the given features.
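For intuition, the two conditions can be reproduced by hand on the scores reported earlier. This is a rough sketch of the threshold logic, not the library's implementation. The variable names are illustrative, only three features are included, and the strict "greater than" comparison is an assumption based on the condition names.

```python
# Scores copied from the check's value shown earlier (three features only).
train_pps = {"numDigits": 0.9527027027027027,
             "numLinks": 0.8851351351351351,
             "urlLength": 0.23646714795497537}
difference = {"numDigits": 0.11596807348672755,
              "numLinks": 0.8851351351351351,
              "urlLength": -0.035867771237305224}

max_difference = 0.2  # default of the difference condition
max_train_pps = 0.7   # default of the train PPS condition

# Collect the features that violate each threshold, formatted to two decimals.
failed_difference = {f: f"{d:.2f}" for f, d in difference.items() if d > max_difference}
failed_train = {f: f"{s:.2f}" for f, s in train_pps.items() if s > max_train_pps}

print(failed_difference)
print(failed_train)
```

numDigits fails only the train PPS condition because it was related to the label in both datasets, so its train-test difference stays below 0.2.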
