# Label Ambiguity
This notebook provides an overview of how to use and understand the label ambiguity check.
[1]:
from deepchecks.checks.integrity import LabelAmbiguity
from deepchecks.base import Dataset
import pandas as pd
## What is Label Ambiguity?
Label Ambiguity searches for identical samples with different labels. This can occur either because the data is mislabeled, or because the collected data is missing features needed to separate the labels. Mislabeled data can confuse the model and lower its performance.
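To make the definition concrete, here is a minimal pandas sketch of the idea, not deepchecks' actual implementation. The toy values and the `feature_cols` name are assumptions chosen for illustration: rows are grouped by their feature values, and any group that carries more than one distinct label is treated as ambiguous.

```python
import pandas as pd

# Toy data (hypothetical values): rows 0 and 1 share identical features
# but carry different labels, so together they form an ambiguous group.
df = pd.DataFrame({
    "urlLength": [81, 81, 90, 90],
    "numDigits": [6, 6, 2, 2],
    "target":    [0, 1, 1, 1],
})
feature_cols = ["urlLength", "numDigits"]

# Count the distinct labels within each group of identical feature values.
labels_per_group = df.groupby(feature_cols)["target"].nunique()

# Groups with more than one distinct label are ambiguous.
ambiguous_groups = labels_per_group[labels_per_group > 1]
print(ambiguous_groups)
```

In this toy frame only the combination `(81, 6)` is flagged, because the two rows with `(90, 2)` agree on their label.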
## Load Data
[2]:
from deepchecks.datasets.classification.phishing import load_data
phishing_dataframe = load_data(as_train_test=False, data_format='Dataframe')
phishing_dataset = Dataset(phishing_dataframe, label='target', features=['urlLength', 'numDigits', 'numParams', 'num_%20', 'num_@', 'bodyLength', 'numTitles', 'numImages', 'numLinks', 'specialChars'])
Automatically inferred these columns as categorical features: numParams, num_%20, num_@.
## Run the check
[3]:
LabelAmbiguity().run(phishing_dataset)
Label Ambiguity
Find samples with multiple labels.
Additional Outputs
| Observed Labels | urlLength | numDigits | numParams | num_%20 | num_@ | bodyLength | numTitles | numImages | numLinks | specialChars |
|---|---|---|---|---|---|---|---|---|---|---|
| (0, 1) | 81 | 6 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| (0, 1) | 82 | 2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| (0, 1) | 85 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| (0, 1) | 85 | 20 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| (0, 1) | 88 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
We can also check label ambiguity on a subset of the features:
[4]:
LabelAmbiguity(columns=['urlLength', 'numDigits']).run(phishing_dataset)
Label Ambiguity
Find samples with multiple labels.
Additional Outputs
| Observed Labels | urlLength | numDigits |
|---|---|---|
| (0, 1) | 81 | 0 |
| (0, 1) | 81 | 6 |
| (0, 1) | 82 | 2 |
| (0, 1) | 84 | 2 |
| (0, 1) | 85 | 0 |
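Restricting the check to fewer columns can only merge groups of rows, never split them, so a subset of features may expose ambiguity that the full feature set hides. The sketch below illustrates this with plain pandas on made-up values; the helper `ambiguous_groups` is hypothetical and is not part of deepchecks.

```python
import pandas as pd

# Toy data (hypothetical values): the first two rows differ only in bodyLength.
df = pd.DataFrame({
    "urlLength":  [81, 81, 85, 85],
    "numDigits":  [0, 0, 0, 0],
    "bodyLength": [10, 500, 0, 0],
    "target":     [0, 1, 0, 1],
})

def ambiguous_groups(frame, feature_cols, label_col="target"):
    """Return the feature combinations that appear with more than one label."""
    n_labels = frame.groupby(feature_cols)[label_col].nunique()
    return n_labels[n_labels > 1]

# With all three features, bodyLength separates the first two rows.
all_features = ambiguous_groups(df, ["urlLength", "numDigits", "bodyLength"])

# With only two features, those rows collapse into one ambiguous group.
subset = ambiguous_groups(df, ["urlLength", "numDigits"])

print(len(all_features), len(subset))
```

Here the full feature set yields one ambiguous combination, while the two-column subset yields two.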
## Define a condition
Next, we define a condition requiring the ratio of ambiguous samples to be 0. A condition is deepchecks’ way of validating model and data quality and letting you know if anything goes wrong.
[5]:
check = LabelAmbiguity()
check.add_condition_ambiguous_sample_ratio_not_greater_than(0)
result = check.run(phishing_dataset)
result.show(show_additional_outputs=False)
Label Ambiguity
Find samples with multiple labels.
Conditions Summary
| Status | Condition | More Info |
|---|---|---|
| ✖ | Ambiguous sample ratio is not greater than 0% | Found ratio of samples with multiple labels above threshold: 0.6% |
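As a rough illustration of what such a ratio might measure, the sketch below computes the share of rows that belong to an ambiguous group and compares it to a threshold. This is an assumed definition for illustration only; the exact formula deepchecks uses is not stated here, and the toy values and the `max_ratio` name are hypothetical.

```python
import pandas as pd

# Toy data (hypothetical values): three rows share (81, 6) with mixed labels.
df = pd.DataFrame({
    "urlLength": [81, 81, 81, 90, 90],
    "numDigits": [6, 6, 6, 2, 2],
    "target":    [0, 1, 1, 1, 1],
})
feature_cols = ["urlLength", "numDigits"]

# For every row, the number of distinct labels among rows with identical features.
labels_in_group = df.groupby(feature_cols)["target"].transform("nunique")

# Assumed definition: fraction of rows that sit in a group with more than one label.
ambiguous_ratio = (labels_in_group > 1).mean()

# The condition passes only if the ratio does not exceed the threshold.
max_ratio = 0
passed = ambiguous_ratio <= max_ratio
print(f"ambiguous ratio: {ambiguous_ratio:.0%}, passed: {passed}")
```

With a threshold of 0, a single ambiguous group is enough to fail the condition, which matches the strictness of the condition defined above.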