Single Feature Contribution Train Test¶
This notebook provides an overview of how to use and understand the “Single Feature Contribution Train Test” check.
What is the purpose of the check?¶
The check estimates for every feature its ability to predict the label by itself. This check can help find:
- A potential leakage (between the label and a feature) in both datasets, e.g. due to incorrect sampling during data collection. This is a critical problem that will likely stay hidden without this check, as it won’t show up when comparing model performance on train and test.
- A strong drift in the feature-label relation between the two datasets, possibly originating from a leakage in one of them, e.g. a leakage that exists in the training data but not necessarily in a “fresh” dataset that may have been built differently.
The check is based on calculating the predictive power score (PPS) of each feature. For details, see the section below on how the PPS is calculated.
What is a problematic result?¶
- Features with a high predictive score - can indicate that there is a leakage between the label and the feature, meaning that the feature holds information that is somewhat based on the label to begin with. For example: a bank uses its loans database to create a model of whether a customer will be able to repay a loan. One of the features they extract is “number of late payments”. This feature clearly has very strong predictive power for the customer’s ability to repay the loan, but it is based on data the bank knows only after the loan is given. It won’t be available at prediction time, so it is a type of leakage.
- A high difference between the PPS scores of a certain feature in the train and in the test datasets - this indicates a drift in the relation between the feature and the label, and a possible leakage in one of the datasets. For example: a coffee shop chain trained a model to predict the number of coffee cups ordered in a store. The model was trained on data from a specific state and tested on data from all states. Running the Single Feature Contribution check on this split found a high difference in the PPS score of the feature “time_in_day”: it had a much higher predictive power on the training data than on the test data. Investigating this led to the detection of the problem: the time in day was saved in UTC for all states, which made the feature much less indicative for the test data, as it contained data from several time zones (and many more coffee cups are ordered during the morning/noon than during the evening/night). This was fixed by changing the feature to be the time in the local time zone, thus restoring its predictive power and improving the model’s overall performance.
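The fix described in the coffee shop example can be sketched with pandas. This is only an illustration: the column names `order_time_utc` and `store_timezone`, and the sample rows, are hypothetical and do not come from the original dataset.

```python
import pandas as pd

# Hypothetical orders: timestamps stored in UTC, plus each store's time zone.
orders = pd.DataFrame({
    "order_time_utc": pd.to_datetime(
        ["2022-01-10 13:00", "2022-01-10 13:00", "2022-01-10 16:00"], utc=True
    ),
    "store_timezone": ["America/New_York", "America/Los_Angeles", "America/Los_Angeles"],
})


def local_hour(row):
    # Convert the UTC timestamp to the store's own time zone and take the hour.
    return row["order_time_utc"].tz_convert(row["store_timezone"]).hour


# The UTC hour mixes different local times across states ...
orders["time_in_day_utc"] = orders["order_time_utc"].dt.hour
# ... while the local hour lines up "morning" orders regardless of state.
orders["time_in_day_local"] = orders.apply(local_hour, axis=1)

print(orders[["store_timezone", "time_in_day_utc", "time_in_day_local"]])
```

In this toy data the New York order at 13:00 UTC and the Los Angeles order at 16:00 UTC both fall at 8 in the morning local time, which is the kind of alignment the UTC-based feature was missing.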
How is the Predictive Power Score (PPS) calculated?¶
4. Encode categorical columns: the label using sklearn.preprocessing.LabelEncoder and the feature using sklearn.preprocessing.OneHotEncoder
5. Partition the data with 4-fold cross-validation
6. Train a decision tree
7. Compare the trained model’s performance with a naive model’s performance
Note: all the PPS parameters can be changed by passing the ``ppscore_params`` parameter to the check.
For further information about PPS, see the ppscore GitHub repository or the following blog post: RIP correlation. Introducing the Predictive Power Score
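The cross-validation, decision tree, and naive-model comparison steps can be illustrated with scikit-learn. This is a minimal sketch for a numeric feature and a classification label, not the ppscore implementation. It assumes a weighted F1 metric, a naive baseline taken as the better of "always predict the most common class" and "predict a shuffled copy of the labels", and a score normalised as (model - naive) / (1 - naive). The function name `pps_classification_sketch` is hypothetical.

```python
import numpy as np
from sklearn.metrics import f1_score
from sklearn.model_selection import cross_val_predict
from sklearn.tree import DecisionTreeClassifier


def pps_classification_sketch(feature, label, random_state=0):
    """Rough PPS-style score of one numeric feature for a classification label."""
    X = np.asarray(feature, dtype=float).reshape(-1, 1)
    y = np.asarray(label)

    # Out-of-fold predictions from a decision tree with 4-fold cross-validation.
    model = DecisionTreeClassifier(random_state=random_state)
    predictions = cross_val_predict(model, X, y, cv=4)
    model_f1 = f1_score(y, predictions, average="weighted", zero_division=0)

    # Naive baseline 1 (assumption): always predict the most common class.
    classes, counts = np.unique(y, return_counts=True)
    most_common = np.full(y.shape, classes[np.argmax(counts)])
    common_f1 = f1_score(y, most_common, average="weighted", zero_division=0)

    # Naive baseline 2 (assumption): predict a random shuffle of the labels.
    rng = np.random.default_rng(random_state)
    random_f1 = f1_score(y, rng.permutation(y), average="weighted", zero_division=0)

    naive_f1 = max(common_f1, random_f1)
    if naive_f1 >= 1.0:
        return 0.0
    # 0 means "no better than the naive model", 1 means "perfect prediction".
    return max(0.0, (model_f1 - naive_f1) / (1.0 - naive_f1))


# Usage: a feature built from the label scores high, pure noise scores low.
rng = np.random.default_rng(0)
label = rng.integers(0, 2, size=1000)
leaky_score = pps_classification_sketch(label * 10 + rng.normal(size=1000), label)
noise_score = pps_classification_sketch(rng.normal(size=1000), label)
print(leaky_score, noise_score)
```

Taking the better of two naive baselines keeps a feature from looking predictive merely because one baseline happens to score poorly on a balanced label.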
Generate data¶
We’ll add to a given dataset a direct relation between two features and the label, in order to see the Single Feature Contribution Train Test check in action.
[1]:
from deepchecks.datasets.classification.phishing import load_data


def relate_column_to_label(dataset, column, label_power):
    # Shift the column by the label value times the column mean times label_power,
    # which ties the feature directly to the label.
    col_data = dataset.data[column]
    dataset.data[column] = col_data + (dataset.data[dataset.label_name] * col_data.mean() * label_power)


train_dataset, test_dataset = load_data()

# Transform two features in the given datasets to add correlation to the label
relate_column_to_label(train_dataset, 'numDigits', 10)
relate_column_to_label(train_dataset, 'numLinks', 10)
relate_column_to_label(test_dataset, 'numDigits', 0.1)
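To see what this transformation does to a column, here is a small sketch that applies the same arithmetic to plain pandas Series instead of a deepchecks dataset. The helper `relate_series_to_label` and the toy values are hypothetical and used only for illustration.

```python
import pandas as pd


def relate_series_to_label(values, label, label_power):
    # Same arithmetic as relate_column_to_label, on plain pandas Series.
    return values + label * values.mean() * label_power


toy = pd.DataFrame({"numDigits": [2.0, 4.0, 6.0, 8.0], "label": [0, 1, 0, 1]})

# label_power=10: rows with label 1 are pushed far above every label-0 row.
strong = relate_series_to_label(toy["numDigits"], toy["label"], 10)
# label_power=0.1: rows with label 1 move only slightly, so the classes still overlap.
weak = relate_series_to_label(toy["numDigits"], toy["label"], 0.1)

print(strong.tolist())
print(weak.tolist())
```

With a power of 10 a single threshold on the feature separates the two label values in this toy data, while with a power of 0.1 the feature values of the two classes remain interleaved.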
Run the check¶
[2]:
from deepchecks.checks.methodology import SingleFeatureContributionTrainTest
result = SingleFeatureContributionTrainTest().run(train_dataset=train_dataset, test_dataset=test_dataset)
result
Single Feature Contribution Train-Test
Return the Predictive Power Score of all features, in order to estimate each feature's ability to predict the label.
Additional Outputs
Observe the check’s output¶
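The number of top features displayed can be changed with the n_show_top parameter of the check. The check’s value is a dictionary with the keys train, test, and train-test difference, each mapping feature names to scores:
[3]: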
result.value
[3]:
{'train': {'numDigits': 0.9527027027027027,
'numLinks': 0.8851351351351351,
'urlLength': 0.23646714795497537,
'month': 0.0,
'ext': 0.0,
'numParams': 0.0,
'num_%20': 0.0,
'num_@': 0.0,
'entropy': 0.0,
'has_ip': 0.0,
'hasHttp': 0.0,
'hasHttps': 0.0,
'urlIsLive': 0.0,
'dsr': 0.0,
'dse': 0.0,
'bodyLength': 0.0,
'numTitles': 0.0,
'numImages': 0.0,
'specialChars': 0.0,
'scriptLength': 0.0,
'sbr': 0.0,
'bscr': 0.0,
'sscr': 0.0},
'test': {'numDigits': 0.8367346292159752,
'urlLength': 0.2723349191922806,
'month': 0.0,
'ext': 0.0,
'numParams': 0.0,
'num_%20': 0.0,
'num_@': 0.0,
'entropy': 0.0,
'has_ip': 0.0,
'hasHttp': 0.0,
'hasHttps': 0.0,
'urlIsLive': 0.0,
'dsr': 0.0,
'dse': 0.0,
'bodyLength': 0.0,
'numTitles': 0.0,
'numImages': 0.0,
'numLinks': 0.0,
'specialChars': 0.0,
'scriptLength': 0.0,
'sbr': 0.0,
'bscr': 0.0,
'sscr': 0.0},
'train-test difference': {'bodyLength': 0.0,
'bscr': 0.0,
'dse': 0.0,
'dsr': 0.0,
'entropy': 0.0,
'ext': 0.0,
'hasHttp': 0.0,
'hasHttps': 0.0,
'has_ip': 0.0,
'month': 0.0,
'numDigits': 0.11596807348672755,
'numImages': 0.0,
'numLinks': 0.8851351351351351,
'numParams': 0.0,
'numTitles': 0.0,
'num_%20': 0.0,
'num_@': 0.0,
'sbr': 0.0,
'scriptLength': 0.0,
'specialChars': 0.0,
'sscr': 0.0,
'urlIsLive': 0.0,
'urlLength': -0.035867771237305224}}
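The relation between the three parts of this dictionary can be checked by hand. The sketch below copies a few of the scores shown above into a plain dictionary (rather than reading them from the check result) and recomputes the difference as the train score minus the test score.

```python
# A few scores copied from the output above.
value = {
    "train": {"numDigits": 0.9527027027027027, "numLinks": 0.8851351351351351,
              "urlLength": 0.23646714795497537},
    "test": {"numDigits": 0.8367346292159752, "numLinks": 0.0,
             "urlLength": 0.2723349191922806},
}

# Recompute the per-feature difference: train PPS minus test PPS.
difference = {
    feature: value["train"][feature] - value["test"][feature]
    for feature in value["train"]
}

# Rank features from the largest to the smallest difference.
ranked = sorted(difference.items(), key=lambda item: item[1], reverse=True)
for feature, diff in ranked:
    print(f"{feature}: {diff:.4f}")
```

A negative difference, as for urlLength, means the feature was slightly more predictive on the test data than on the train data.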
Define a condition¶
- add_condition_feature_pps_difference_not_greater_than - Validate that the difference in the PPS between train and test is not larger than a defined amount (default 0.2)
- add_condition_feature_pps_in_train_not_greater_than - Validate that the PPS scores on the train dataset do not exceed a defined amount (default 0.7)
Let’s add the conditions and re-run the check:
[4]:
check = SingleFeatureContributionTrainTest().add_condition_feature_pps_difference_not_greater_than().add_condition_feature_pps_in_train_not_greater_than()
result = check.run(train_dataset=train_dataset, test_dataset=test_dataset)
result.show(show_additional_outputs=False)
Single Feature Contribution Train-Test
Return the Predictive Power Score of all features, in order to estimate each feature's ability to predict the label.
Conditions Summary
| Status | Condition | More Info |
|---|---|---|
| ✖ | Train-Test features' Predictive Power Score difference is not greater than 0.2 | Features with PPS difference above threshold: {'numLinks': '0.89'} |
| ✖ | Train features' Predictive Power Score is not greater than 0.7 | Features in train dataset with PPS above threshold: {'numDigits': '0.95', 'numLinks': '0.89'} |
We see that the conditions have caught the changes we introduced to the datasets, and alert us that there is a possible problem with the given features.
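The logic behind the two failed conditions can be reproduced in plain Python. This is a sketch of the threshold comparison only, not the deepchecks implementation. The scores are copied from the result value shown earlier, and the thresholds are the defaults stated above (0.2 and 0.7).

```python
# Scores copied from the check's result value for three of the features.
train_pps = {"numDigits": 0.9527027027027027, "numLinks": 0.8851351351351351,
             "urlLength": 0.23646714795497537}
pps_difference = {"numDigits": 0.11596807348672755, "numLinks": 0.8851351351351351,
                  "urlLength": -0.035867771237305224}

DIFFERENCE_THRESHOLD = 0.2  # default of the train-test difference condition
TRAIN_THRESHOLD = 0.7       # default of the train PPS condition

# Features whose train-test PPS difference exceeds the threshold.
failed_difference = {
    feature: format(score, ".2f")
    for feature, score in pps_difference.items()
    if score > DIFFERENCE_THRESHOLD
}

# Features whose PPS on the train dataset exceeds the threshold.
failed_train = {
    feature: format(score, ".2f")
    for feature, score in train_pps.items()
    if score > TRAIN_THRESHOLD
}

print(failed_difference)
print(failed_train)
```

The two dictionaries match the "More Info" column of the conditions summary: numLinks fails both conditions, while numDigits fails only the train PPS condition because its scores are high in both datasets.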