# Trust Score Comparison

This notebook provides an overview of how to use and understand the trust score comparison check.

Structure:

What is trust score?
Loading the data
Run the check
Define a condition

## What is trust score?

Trust score is an alternative measure of model confidence, used in classification problems to assign a higher score to samples whose prediction is more likely to end up correct.

### What is model confidence

Model confidence commonly refers to the probability that a classification model assigns to its predicted class. This quantity is useful for a variety of tasks:

1. Detecting “problematic samples” before labels become available: predictions with low probability are more likely to be wrong.
2. Risk management: in use cases such as loan approval, we may want to weigh the probability that the loan will be repaid against the loaned sum and the expected return.
3. Early warning of concept drift: a significant decline in the average confidence of samples encountered in production or test data indicates that the model is predicting on more and more samples about which it is unsure.
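As a minimal sketch of the definition above, the confidence of a prediction is the largest entry in its row of predicted probabilities (the kind of matrix a `predict_proba` call returns). The probability values, the review cut-off and the helper names below are made up for illustration and are not part of deepchecks.

```python
from statistics import mean

# Hypothetical predicted probabilities for four samples of a binary
# classifier: one row per sample, one column per class.
probabilities = [
    [0.95, 0.05],
    [0.40, 0.60],
    [0.10, 0.90],
    [0.52, 0.48],
]


def predicted_class(row):
    """Index of the class with the highest predicted probability."""
    return max(range(len(row)), key=row.__getitem__)


def confidence(row):
    """Model confidence: the probability assigned to the predicted class."""
    return max(row)


predictions = [predicted_class(row) for row in probabilities]
confidences = [confidence(row) for row in probabilities]

# Samples below an (arbitrary) cut-off are candidates for manual review
# before labels become available.
REVIEW_THRESHOLD = 0.65
needs_review = [i for i, c in enumerate(confidences) if c < REVIEW_THRESHOLD]

# Tracking the average confidence over time gives a crude drift signal.
average_confidence = mean(confidences)

print(predictions)   # [0, 1, 1, 0]
print(needs_review)  # [1, 3]
```

Samples 1 and 3 are flagged because the model assigns their predicted class a probability of only 0.60 and 0.52.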

### Trust Score compared to predicted probability

“Regular” model confidence is easy to compute: just use the model’s predict_proba function. The danger of relying on the values produced by the model itself is that they are often uncalibrated, which means that the predicted probabilities don’t correspond to the actual percentage of correct predictions (see the calibration score check for more info). This is because the methods and loss functions used by these models are often not designed to produce actual probabilities. Additionally, the most common classification metrics (such as precision, recall and accuracy) measure only the quality of the final prediction (after a threshold is applied to the predicted probability) and not the quality of the probability itself. This reinforces the tendency to ignore the quality of the probabilities themselves.
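To make the notion of calibration concrete, the sketch below compares the average confidence of a set of predictions with the fraction of them that were actually correct. The numbers are toy values chosen for illustration; this is not how the calibration score check itself is computed.

```python
from statistics import mean

# Toy data: the model's confidence in each prediction and whether that
# prediction turned out to be correct.
confidences = [0.95, 0.90, 0.92, 0.88, 0.91, 0.93, 0.89, 0.94]
was_correct = [True, False, True, True, False, True, False, True]

mean_confidence = mean(confidences)
accuracy = sum(was_correct) / len(was_correct)

# For a well-calibrated model these two numbers are close to each other.
# A large positive gap means the model is over-confident.
calibration_gap = mean_confidence - accuracy

print(f"mean confidence: {mean_confidence:.3f}")
print(f"accuracy:        {accuracy:.3f}")
print(f"gap:             {calibration_gap:.3f}")
```

Here the model claims roughly 92% confidence on average while being right only 62.5% of the time, which is the kind of mismatch the paragraph above warns about.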

Trust Score is an alternative method for scoring the “trust-worthiness” of the model predictions that is completely independent of model implementation. The method and code used by the deepchecks package were published in To Trust Or Not To Trust A Classifier.

Trust score has been shown to perform better than predicted probability in identifying correctly classified samples, and is used by the TrustScoreComparison check for:

1. Identifying the samples with the highest (and lowest) scores, which are the samples most likely (and least likely) to be correctly classified by the model. This is useful for visually detecting common qualities among the highest and lowest confidence samples.
2. Identifying a degradation of the trust score on the test data compared to the training data, which may indicate that the model will perform worse on test than on train, and which serves as a method to detect concept drift. This condition is especially useful when the test labels are not available, such as when performing inference on new and unknown data.

## Loading the data

We’ll load the scikit-learn breast cancer dataset to test out the Trust Score check.

[1]:

import pandas as pd
import numpy as np
from sklearn.ensemble import AdaBoostClassifier
from deepchecks.datasets.classification.breast_cancer import load_data
from deepchecks import Dataset

label = 'target'

train_df, test_df = load_data(data_format='Dataframe')
train = Dataset(train_df, label=label)
test = Dataset(test_df, label=label)

clf = AdaBoostClassifier()
features = train_df.drop(label, axis=1)
target = train_df[label]
clf = clf.fit(features, target)

## Run the check

Next, we’ll run the check on the dataset and model, modifying the default value of min_test_samples in order to enable us to run this check on the small dataset. In this case, we’ll run the check “as is”, and introduce the condition in the next section.

Additional optional parameters include the maximal sample size, the random state, the number of highest and lowest Trust Score samples to show and various hyperparameters controlling the trust score algorithm.

[2]:

from deepchecks.checks import TrustScoreComparison

TrustScoreComparison(min_test_samples=100).run(train, test, clf)

Trust Score Comparison: Train vs. Test

Compares the model's trust score for the train dataset with scores of the test dataset.

Additional Outputs

Trust score roughly measures the following quantity:

$$\textrm{Trust Score} = \frac{\textrm{Distance from the sample to the nearest training samples belonging to a class different than the predicted class}}{\textrm{Distance from the sample to the nearest training samples belonging to the predicted class}}$$

Higher values therefore represent samples that are "close" to training examples with the same label as the sample's prediction, and lower values represent samples that are "far" from training samples with labels matching their prediction. For more information, refer to the original paper, arXiv:1805.11783, or to the version of the paper presented at NeurIPS 2018.
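The ratio described above can be sketched in a few lines. This is a deliberately simplified illustration, not the implementation used by deepchecks: it uses only the single nearest training point per class and skips the filtering of outlying training points that the original method applies. The function name and the toy points are assumptions made for this example.

```python
import math


def simplified_trust_score(sample, predicted_label, train_points, train_labels):
    """Distance to the nearest training point of any other class, divided by
    the distance to the nearest training point of the predicted class."""
    # Smallest distance from the sample to each class seen in training.
    nearest = {}
    for point, label in zip(train_points, train_labels):
        distance = math.dist(sample, point)
        nearest[label] = min(distance, nearest.get(label, math.inf))

    same_class = nearest[predicted_label]
    other_class = min(d for lbl, d in nearest.items() if lbl != predicted_label)
    # The small epsilon avoids division by zero when the sample coincides
    # with a training point of the predicted class.
    return other_class / (same_class + 1e-12)


# Toy two-class training set laid out along a line.
train_points = [(0.0, 0.0), (1.0, 0.0), (5.0, 0.0), (6.0, 0.0)]
train_labels = [0, 0, 1, 1]

# A sample surrounded by class 0 points, predicted as class 0.
confident = simplified_trust_score((0.5, 0.0), 0, train_points, train_labels)
# A sample that lies closer to class 1 points but is predicted as class 0.
borderline = simplified_trust_score((3.5, 0.0), 0, train_points, train_labels)

print(round(confident, 2))   # 9.0
print(round(borderline, 2))  # 0.6
```

A score well above 1 means the predicted class is much closer than any other class, while a score below 1 means some other class is closer than the predicted one.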

The test trust score distribution should be quite similar to that of the train data. If it is skewed to the left, the confidence of the model on the test data is lower than on the train data, indicating a difference that may affect model performance on similar data. If it is skewed to the right, it indicates an underlying problem with the creation of the test dataset (test confidence isn't expected to be higher than train confidence).

Worst Trust Score Samples

	Trust Score	target	mean radius	mean texture	mean perimeter	mean area	mean smoothness	mean compactness	mean concavity	mean concave points	mean symmetry	mean fractal dimension	radius error	texture error	perimeter error	area error	smoothness error	compactness error	concavity error	concave points error	symmetry error	worst radius	worst texture	worst perimeter	worst area	worst smoothness	worst compactness	worst concavity	worst concave points	worst symmetry	worst fractal dimension
92	0.81	1	14.99	22.11	97.53	693.70	0.09	0.10	0.07	0.04	0.19	0.06	0.32	1.34	2.31	28.51	0.00	0.03	0.03	0.01	0.02	16.76	31.55	110.20	867.10	0.11	0.33	0.31	0.13	0.32	0.09
40	0.80	0	15.13	29.81	96.71	719.50	0.08	0.05	0.05	0.03	0.19	0.05	0.47	1.63	3.04	45.38	0.01	0.01	0.02	0.01	0.03	17.26	36.91	110.10	931.40	0.11	0.10	0.15	0.07	0.32	0.06
108	0.78	0	15.61	19.38	100.00	758.60	0.08	0.06	0.04	0.03	0.15	0.05	0.23	1.00	1.53	22.18	0.00	0.01	0.01	0.01	0.01	17.91	31.67	115.90	988.60	0.11	0.18	0.23	0.09	0.27	0.07
136	0.77	1	14.74	25.42	94.70	668.60	0.08	0.07	0.04	0.03	0.18	0.06	0.30	1.39	2.18	27.41	0.00	0.01	0.02	0.01	0.02	16.51	32.29	107.40	826.40	0.11	0.14	0.16	0.11	0.27	0.07
65	0.69	1	12.04	28.14	76.85	449.90	0.09	0.06	0.02	0.02	0.19	0.06	0.61	2.64	4.10	44.96	0.01	0.02	0.01	0.01	0.02	13.60	33.33	87.24	567.60	0.10	0.10	0.06	0.06	0.24	0.07

Top Trust Score Samples

	Trust Score	Model Prediction	target	mean radius	mean texture	mean perimeter	mean area	mean smoothness	mean compactness	mean concavity	mean concave points	mean symmetry	mean fractal dimension	radius error	texture error	perimeter error	area error	smoothness error	compactness error	concavity error	concave points error	symmetry error	worst radius	worst texture	worst perimeter	worst area	worst smoothness	worst compactness	worst concavity	worst concave points	worst symmetry	worst fractal dimension
138	3.28	0	0	23.21	26.97	153.50	1670.00	0.10	0.17	0.20	0.12	0.19	0.06	1.06	0.96	7.25	155.80	0.01	0.03	0.04	0.02	0.02	31.01	34.51	206.00	2944.00	0.15	0.41	0.58	0.26	0.31	0.09
127	3.14	1	1	11.29	13.04	72.23	388.00	0.10	0.08	0.03	0.03	0.18	0.06	0.19	0.53	1.16	13.17	0.01	0.01	0.01	0.01	0.02	12.32	16.18	78.27	457.50	0.14	0.15	0.13	0.09	0.27	0.08
142	2.90	0	0	19.59	18.15	130.70	1214.00	0.11	0.17	0.25	0.13	0.20	0.06	0.74	1.05	4.79	97.07	0.00	0.02	0.04	0.01	0.02	26.73	26.39	174.90	2232.00	0.14	0.38	0.68	0.22	0.36	0.09
86	2.84	1	1	13.50	12.71	85.69	566.20	0.07	0.04	0.00	0.00	0.14	0.05	0.22	0.69	1.51	20.39	0.00	0.00	0.00	0.00	0.01	14.97	16.94	95.48	698.70	0.09	0.06	0.01	0.02	0.23	0.06
43	2.68	1	1	10.32	16.35	65.31	324.90	0.09	0.05	0.01	0.01	0.19	0.06	0.21	0.97	1.36	12.97	0.01	0.01	0.01	0.01	0.02	11.25	21.77	71.12	384.90	0.13	0.09	0.04	0.02	0.27	0.07

### Analyzing the output

From here we can see that high trust score predictions are mostly correct, while the lowest trust score samples are wrong more often than not and are always predicted to belong to the negative class.

Furthermore, we may notice some other common characteristics, such as the fact that worst texture and mean texture both seem to be lower in the top scoring samples, while the worst scoring samples have high worst texture and mean texture values, both features with high feature importance for the AdaBoost model. Might it be that high texture samples are getting worse predictions by the model?

[3]:

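# Index by the columns the model was actually fitted on, rather than assuming the label is the last column
pd.Series(index=features.columns, data=clf.feature_importances_, name='Model Feature importance').sort_values(ascending=False).to_frame().head(7)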

[3]:

	Model Feature importance
compactness error	0.08
worst texture	0.08
fractal dimension error	0.08
area error	0.08
mean concave points	0.06
worst perimeter	0.06
mean texture	0.06
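The observation that low trust scores tend to go with wrong predictions can be quantified by ranking samples by trust score and comparing accuracy at both ends of the ranking. The sketch below uses toy values that are not taken from the tables above, and the helper name is hypothetical.

```python
# Toy values: (trust score, model prediction, true label) for eight samples.
samples = [
    (2.4, 1, 1),
    (0.7, 0, 1),
    (3.0, 1, 1),
    (0.9, 1, 0),
    (0.6, 0, 1),
    (2.8, 0, 0),
    (0.8, 0, 0),
    (2.1, 1, 1),
]


def accuracy(rows):
    """Fraction of rows whose prediction matches the true label."""
    return sum(pred == label for _, pred, label in rows) / len(rows)


# Rank samples from lowest to highest trust score.
ranked = sorted(samples, key=lambda row: row[0])

k = 4
lowest_accuracy = accuracy(ranked[:k])    # the k least trusted samples
highest_accuracy = accuracy(ranked[-k:])  # the k most trusted samples

print(lowest_accuracy)   # 0.25
print(highest_accuracy)  # 1.0
```

With real data, the same comparison can be run on the trust scores and predictions of a labeled test set to check whether the score separates correct from incorrect predictions.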

## Define a condition

### Introducing concept drift

First, we introduce concept drift into the data by changing the relation between the worst texture and mean concave points features, both important features for the model.

[4]:

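mod_test_df = test_df.copy()
np.random.seed(0)
# Pick 80 random test rows whose 'worst texture' value will be overwritten
sample_idx = np.random.choice(test_df.index, 80, replace=False)
# Overwrite 'worst texture' with target * (mean concave points > 0.05), tying the feature to the label
mod_test_df.loc[sample_idx, 'worst texture'] = mod_test_df.loc[sample_idx, 'target'] * (mod_test_df.loc[sample_idx, 'mean concave points'] > 0.05)
mod_test = Dataset(mod_test_df, label=label)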

### Checking for decline in Trust Score

Now, we define a condition on the Trust Score check to alert us to a significant degradation in the mean Trust Score of the test data compared to the training data. The allowed decline can be changed by passing a different threshold to the condition (the default is 0.2, or a 20% decline); here we pass 0.19.
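The logic of such a condition can be sketched as follows. This is an illustration under assumptions, not the library's code: the helper names are hypothetical, the scores are toy values, and the sign convention (a positive number means a decline) may differ from what the check prints.

```python
from statistics import mean


def mean_score_percent_decline(train_scores, test_scores):
    """Relative drop of the mean test score compared with the mean train score."""
    train_mean = mean(train_scores)
    test_mean = mean(test_scores)
    return (train_mean - test_mean) / train_mean


def condition_passes(train_scores, test_scores, threshold=0.2):
    """True when the decline does not exceed the allowed threshold."""
    return mean_score_percent_decline(train_scores, test_scores) <= threshold


# Toy trust scores: the train mean is 2.0 and the test mean is 1.5.
train_scores = [2.0, 1.5, 2.5, 2.0]
test_scores = [1.5, 1.0, 2.0, 1.5]

decline = mean_score_percent_decline(train_scores, test_scores)
print(decline)                                                  # 0.25
print(condition_passes(train_scores, test_scores))              # False
print(condition_passes(train_scores, test_scores, threshold=0.3))  # True
```

A 25% decline fails the default 20% threshold but would pass if the threshold were relaxed to 30%.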

[5]:

from deepchecks.checks import TrustScoreComparison

TrustScoreComparison(min_test_samples=100).add_condition_mean_score_percent_decline_not_greater_than(threshold=0.19).run(train, mod_test, clf)

Trust Score Comparison: Train vs. Test

Compares the model's trust score for the train dataset with scores of the test dataset.

Conditions Summary

Status	Condition	More Info
!	Mean trust score decline is not greater than 19%	Found decline of: -21.09%

Additional Outputs

Worst Trust Score Samples

	Trust Score	target	mean radius	mean texture	mean perimeter	mean area	mean smoothness	mean compactness	mean concavity	mean concave points	mean symmetry	mean fractal dimension	radius error	texture error	perimeter error	area error	smoothness error	compactness error	concavity error	concave points error	symmetry error	fractal dimension error	worst radius	worst texture	worst perimeter	worst area	worst smoothness	worst compactness	worst concavity	worst concave points	worst symmetry	worst fractal dimension
183	0.90	0	13.61	24.98	88.05	582.70	0.09	0.09	0.09	0.04	0.16	0.06	0.46	1.29	2.86	43.14	0.01	0.01	0.03	0.01	0.01	0.00	16.99	0.00	108.60	906.50	0.13	0.19	0.32	0.12	0.27	0.07
151	0.83	1	14.26	19.65	97.83	629.90	0.08	0.22	0.30	0.08	0.17	0.08	0.36	1.49	3.40	29.25	0.01	0.07	0.14	0.02	0.03	0.01	15.30	23.73	107.00	709.00	0.09	0.42	0.68	0.15	0.24	0.11
92	0.81	1	14.99	22.11	97.53	693.70	0.09	0.10	0.07	0.04	0.19	0.06	0.32	1.34	2.31	28.51	0.00	0.03	0.03	0.01	0.02	0.00	16.76	31.55	110.20	867.10	0.11	0.33	0.31	0.13	0.32	0.09
136	0.77	1	14.74	25.42	94.70	668.60	0.08	0.07	0.04	0.03	0.18	0.06	0.30	1.39	2.18	27.41	0.00	0.01	0.02	0.01	0.02	0.00	16.51	32.29	107.40	826.40	0.11	0.14	0.16	0.11	0.27	0.07
65	0.69	1	12.04	28.14	76.85	449.90	0.09	0.06	0.02	0.02	0.19	0.06	0.61	2.64	4.10	44.96	0.01	0.02	0.01	0.01	0.02	0.00	13.60	33.33	87.24	567.60	0.10	0.10	0.06	0.06	0.24	0.07

Top Trust Score Samples

	Trust Score	Model Prediction	target	mean radius	mean texture	mean perimeter	mean area	mean smoothness	mean compactness	mean concavity	mean concave points	mean symmetry	mean fractal dimension	radius error	texture error	perimeter error	area error	smoothness error	compactness error	concavity error	concave points error	symmetry error	worst radius	worst texture	worst perimeter	worst area	worst smoothness	worst compactness	worst concavity	worst concave points	worst symmetry	worst fractal dimension
138	3.28	0	0	23.21	26.97	153.50	1670.00	0.10	0.17	0.20	0.12	0.19	0.06	1.06	0.96	7.25	155.80	0.01	0.03	0.04	0.02	0.02	31.01	34.51	206.00	2944.00	0.15	0.41	0.58	0.26	0.31	0.09
127	3.14	1	1	11.29	13.04	72.23	388.00	0.10	0.08	0.03	0.03	0.18	0.06	0.19	0.53	1.16	13.17	0.01	0.01	0.01	0.01	0.02	12.32	16.18	78.27	457.50	0.14	0.15	0.13	0.09	0.27	0.08
142	2.90	0	0	19.59	18.15	130.70	1214.00	0.11	0.17	0.25	0.13	0.20	0.06	0.74	1.05	4.79	97.07	0.00	0.02	0.04	0.01	0.02	26.73	26.39	174.90	2232.00	0.14	0.38	0.68	0.22	0.36	0.09
43	2.68	1	1	10.32	16.35	65.31	324.90	0.09	0.05	0.01	0.01	0.19	0.06	0.21	0.97	1.36	12.97	0.01	0.01	0.01	0.01	0.02	11.25	21.77	71.12	384.90	0.13	0.09	0.04	0.02	0.27	0.07
180	2.63	0	0	18.63	25.11	124.80	1088.00	0.11	0.19	0.23	0.12	0.22	0.06	0.83	1.47	5.57	105.00	0.01	0.03	0.05	0.01	0.02	23.15	34.01	160.50	1670.00	0.15	0.43	0.61	0.18	0.34	0.10

### Analyzing the output

The condition alerts us to the fact that the mean Trust Score has declined by ~21%, which is more than the 19% we allowed!

The decline is also evident in the plot showing the distribution of Trust Scores in each dataset, in which the test data has significantly more samples with a Trust Score around 1 than the training data. The distribution of the Trust Score for the modified test data is visibly skewed to the left (low Trust Score) because of the concept drift we introduced. The condition helps us detect this new skew. Did this skew in the data really change the performance of the model?

[6]:

from deepchecks.checks.performance import MultiModelPerformanceReport

[7]:

MultiModelPerformanceReport().run([train, train], [test, mod_test], {'unmodified test': clf, 'modified test': clf})

Multi Model Performance Report

Summarize performance scores for multiple models on test datasets.

Additional Outputs

Using the MultiModelPerformanceReport we can clearly see that several metrics (such as f1 and recall) have declined on the modified test dataset. In a use case in which labels were not available for the test data, we would still have known to be wary of the model's predictions, thanks to the condition raised by the Trust Score check on the modified data!
