
Label Ambiguity

This notebook provides an overview of how to use and understand the label ambiguity check.

Structure:

- What is Label Ambiguity?
- Load Data
- Run the check
- Define a condition

[1]:
from deepchecks.checks.integrity import LabelAmbiguity
from deepchecks.base import Dataset
import pandas as pd

## What is Label Ambiguity?

Label Ambiguity searches for identical samples with different labels. This can occur when data is mislabeled, or when the collected data is missing features needed to separate the labels. Mislabeled data can confuse the model and lower its performance.
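
To make the idea concrete, here is a minimal sketch of the underlying logic in plain Python, independent of deepchecks. The helper `find_ambiguous_samples` and the toy rows are hypothetical and invented for illustration; this is not the library's implementation. It groups rows by their feature values and keeps the groups that carry more than one distinct label.

```python
from collections import defaultdict

def find_ambiguous_samples(rows, feature_names, label_name):
    """Group rows by feature values and keep groups with more than one label.

    Hypothetical helper for illustration only, not part of deepchecks.
    """
    labels_by_features = defaultdict(set)
    for row in rows:
        key = tuple(row[name] for name in feature_names)
        labels_by_features[key].add(row[label_name])
    return {
        key: tuple(sorted(labels))
        for key, labels in labels_by_features.items()
        if len(labels) > 1
    }

# Toy data, invented for illustration (not taken from the phishing dataset).
toy_rows = [
    {"urlLength": 81, "numDigits": 6, "target": 0},
    {"urlLength": 81, "numDigits": 6, "target": 1},  # same features, different label
    {"urlLength": 90, "numDigits": 3, "target": 0},
    {"urlLength": 90, "numDigits": 3, "target": 0},  # duplicate, but the labels agree
]

ambiguous = find_ambiguous_samples(toy_rows, ["urlLength", "numDigits"], "target")
print(ambiguous)  # {(81, 6): (0, 1)}
```

Exact duplicates with matching labels are not flagged; only feature combinations observed with conflicting labels are.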

## Load Data

[2]:
from deepchecks.datasets.classification.phishing import load_data

phishing_dataframe = load_data(as_train_test=False, data_format='Dataframe')
features = ['urlLength', 'numDigits', 'numParams', 'num_%20', 'num_@',
            'bodyLength', 'numTitles', 'numImages', 'numLinks', 'specialChars']
phishing_dataset = Dataset(phishing_dataframe, label='target', features=features)
Automatically inferred these columns as categorical features: numParams, num_%20, num_@.

## Run the check

[3]:
LabelAmbiguity().run(phishing_dataset)

Label Ambiguity

Find samples with multiple labels.

Additional Outputs
Each row in the table shows an example of a data sample and its observed labels as found in the dataset. Showing top 5 of 17.

| Observed Labels | urlLength | numDigits | numParams | num_%20 | num_@ | bodyLength | numTitles | numImages | numLinks | specialChars |
|---|---|---|---|---|---|---|---|---|---|---|
| (0, 1) | 81 | 6 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| (0, 1) | 82 | 2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| (0, 1) | 85 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| (0, 1) | 85 | 20 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| (0, 1) | 88 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |

We can also check label ambiguity on a subset of the features:

[4]:
LabelAmbiguity(columns=['urlLength', 'numDigits']).run(phishing_dataset)

Label Ambiguity

Find samples with multiple labels.

Additional Outputs
Each row in the table shows an example of a data sample and its observed labels as found in the dataset. Showing top 5 of 78.

| Observed Labels | urlLength | numDigits |
|---|---|---|
| (0, 1) | 81 | 0 |
| (0, 1) | 81 | 6 |
| (0, 1) | 82 | 2 |
| (0, 1) | 84 | 2 |
| (0, 1) | 85 | 0 |
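
Restricting the check to fewer columns finds more ambiguity here (78 feature combinations instead of 17), because rows that differ only in the dropped columns collapse into the same group. The sketch below illustrates this effect on invented toy rows; `count_ambiguous_groups` is a hypothetical helper, not a deepchecks function.

```python
from collections import defaultdict

def count_ambiguous_groups(rows, feature_names, label_name="target"):
    """Count feature combinations that appear with more than one label.

    Hypothetical helper for illustration only, not part of deepchecks.
    """
    labels_by_features = defaultdict(set)
    for row in rows:
        key = tuple(row[name] for name in feature_names)
        labels_by_features[key].add(row[label_name])
    return sum(1 for labels in labels_by_features.values() if len(labels) > 1)

# Toy data, invented for illustration (not taken from the phishing dataset).
toy_rows = [
    {"urlLength": 81, "numDigits": 0, "bodyLength": 120, "target": 0},
    {"urlLength": 81, "numDigits": 0, "bodyLength": 950, "target": 1},  # differs only in bodyLength
    {"urlLength": 85, "numDigits": 2, "bodyLength": 0, "target": 0},
    {"urlLength": 85, "numDigits": 2, "bodyLength": 0, "target": 1},
]

with_all_features = count_ambiguous_groups(toy_rows, ["urlLength", "numDigits", "bodyLength"])
with_subset = count_ambiguous_groups(toy_rows, ["urlLength", "numDigits"])
print(with_all_features, with_subset)  # 1 2
```

With `bodyLength` included, the first two rows are distinguishable and only one combination is ambiguous; once it is dropped, they become identical samples with conflicting labels.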

## Define a condition

Now we define a condition that enforces that the ratio of ambiguous samples is 0. A condition is deepchecks’ way of validating model and data quality, and it lets you know if anything goes wrong.

[5]:
check = LabelAmbiguity()
check.add_condition_ambiguous_sample_ratio_not_greater_than(0)
result = check.run(phishing_dataset)
result.show(show_additional_outputs=False)

Label Ambiguity

Find samples with multiple labels.

Conditions Summary
| Status | Condition | More Info |
|---|---|---|
| Fail | Ambiguous sample ratio is not greater than 0% | Found ratio of samples with multiple labels above threshold: 0.6% |
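
To see what such a ratio condition amounts to, here is a rough sketch in plain Python. It assumes the ratio is the number of rows belonging to an ambiguous feature combination divided by the total number of rows; the exact definition deepchecks uses may differ. The helper `ambiguous_sample_ratio` and the toy rows are hypothetical.

```python
from collections import defaultdict

def ambiguous_sample_ratio(rows, feature_names, label_name="target"):
    """Fraction of rows whose feature values also appear with a different label.

    Assumption: ratio = ambiguous rows / total rows. Hypothetical helper,
    not part of deepchecks.
    """
    if not rows:
        return 0.0
    labels_by_features = defaultdict(list)
    for row in rows:
        key = tuple(row[name] for name in feature_names)
        labels_by_features[key].append(row[label_name])
    ambiguous_rows = sum(
        len(labels)
        for labels in labels_by_features.values()
        if len(set(labels)) > 1
    )
    return ambiguous_rows / len(rows)

# Toy data, invented for illustration (not taken from the phishing dataset).
toy_rows = [
    {"urlLength": 81, "numDigits": 6, "target": 0},
    {"urlLength": 81, "numDigits": 6, "target": 1},  # conflicts with the row above
    {"urlLength": 90, "numDigits": 3, "target": 0},
    {"urlLength": 90, "numDigits": 3, "target": 0},
    {"urlLength": 95, "numDigits": 1, "target": 1},
]

max_ratio = 0  # same threshold as the condition defined above
ratio = ambiguous_sample_ratio(toy_rows, ["urlLength", "numDigits"])
passed = ratio <= max_ratio
print(f"{ratio:.1%}", passed)  # 40.0% False
```

A threshold of 0 is strict: a single pair of conflicting rows is enough to fail the condition, which is what happens with the 0.6% found in the phishing dataset.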