API Reference - LabelAmbiguity

# Label Ambiguity

This notebook provides an overview of how to use and interpret the label ambiguity check.

Structure:

What is Label Ambiguity?
Load data
Run the check
Define a condition

[1]:

from deepchecks.checks.integrity import LabelAmbiguity
from deepchecks.base import Dataset
import pandas as pd

## What is Label Ambiguity?

Label Ambiguity searches for identical samples with different labels. This can occur either because the data is mislabeled or because the collected data is missing features needed to separate the labels. Mislabeled data can confuse the model and lower its performance.
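To make the definition concrete, here is a minimal sketch in plain Python of what "identical samples with different labels" means. It is not the deepchecks implementation; the helper name `find_ambiguous_samples` and the toy data are assumptions made for illustration only.

```python
from collections import defaultdict

def find_ambiguous_samples(features, labels):
    """Map each feature tuple seen with more than one label to its sorted labels.

    Hypothetical helper for illustration, not part of deepchecks.
    """
    seen = defaultdict(set)
    for row, label in zip(features, labels):
        seen[tuple(row)].add(label)
    return {row: sorted(found) for row, found in seen.items() if len(found) > 1}

# Toy data (made up): (urlLength, numDigits) pairs with a binary label.
features = [(81, 6), (81, 6), (82, 2), (85, 0), (85, 0)]
labels = [0, 1, 0, 1, 1]

ambiguous = find_ambiguous_samples(features, labels)
print(ambiguous)  # {(81, 6): [0, 1]}
```

Only `(81, 6)` is reported: `(85, 0)` appears twice but always with the same label, so it is a duplicate rather than an ambiguous sample.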

## Load Data

[2]:

from deepchecks.datasets.classification.phishing import load_data

phishing_dataframe = load_data(as_train_test=False, data_format='Dataframe')
features = ['urlLength', 'numDigits', 'numParams', 'num_%20', 'num_@',
            'bodyLength', 'numTitles', 'numImages', 'numLinks', 'specialChars']
phishing_dataset = Dataset(phishing_dataframe, label='target', features=features)

Automatically inferred these columns as categorical features: numParams, num_%20, num_@.

## Run the check

[3]:

LabelAmbiguity().run(phishing_dataset)

Label Ambiguity

Find samples with multiple labels.

Additional Outputs

Each row in the table shows an example of a data sample and its observed labels as found in the dataset. Showing top 5 of 17

Observed Labels	urlLength	numDigits	numParams	num_%20	num_@	bodyLength	numTitles	numImages	numLinks	specialChars
(0, 1)	81	6	0	0	0	0	0	0	0	0
(0, 1)	82	2	0	0	0	0	0	0	0	0
(0, 1)	85	0	0	0	0	0	0	0	0	0
(0, 1)	85	20	0	0	0	0	0	0	0	0
(0, 1)	88	0	0	0	0	0	0	0	0	0

We can also check label ambiguity on a subset of the features:

[4]:

LabelAmbiguity(columns=['urlLength', 'numDigits']).run(phishing_dataset)

Label Ambiguity

Find samples with multiple labels.

Additional Outputs

Each row in the table shows an example of a data sample and its observed labels as found in the dataset. Showing top 5 of 78

Observed Labels	urlLength	numDigits
(0, 1)	81	0
(0, 1)	81	6
(0, 1)	82	2
(0, 1)	84	2
(0, 1)	85	0
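The number of ambiguous samples reported above rises from 17 to 78 when only two columns are compared. A likely reason is that rows differing only in the dropped columns collapse into the same sample. The sketch below illustrates that effect on made-up data; `count_ambiguous_groups` is a hypothetical helper, not a deepchecks function.

```python
from collections import defaultdict

def count_ambiguous_groups(rows, labels, columns):
    """Count distinct value combinations over `columns` that carry more than one label."""
    seen = defaultdict(set)
    for row, label in zip(rows, labels):
        key = tuple(row[col] for col in columns)
        seen[key].add(label)
    return sum(1 for found in seen.values() if len(found) > 1)

# Toy rows (made up); the first two differ only in bodyLength.
rows = [
    {"urlLength": 81, "numDigits": 0, "bodyLength": 0},
    {"urlLength": 81, "numDigits": 0, "bodyLength": 120},
    {"urlLength": 84, "numDigits": 2, "bodyLength": 0},
    {"urlLength": 84, "numDigits": 2, "bodyLength": 0},
]
labels = [0, 1, 0, 1]

groups_all = count_ambiguous_groups(rows, labels, ["urlLength", "numDigits", "bodyLength"])
groups_subset = count_ambiguous_groups(rows, labels, ["urlLength", "numDigits"])
print(groups_all, groups_subset)  # 1 2
```

With all three columns, only the `(84, 2, 0)` rows conflict. Once `bodyLength` is ignored, the two `(81, 0)` rows also become identical and are counted as ambiguous.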

## Define a condition

Now we define a condition that requires the ratio of ambiguous samples to be 0. A condition is deepchecks’ way of validating model and data quality and letting you know if anything goes wrong.

[5]:

check = LabelAmbiguity()
check.add_condition_ambiguous_sample_ratio_not_greater_than(0)
result = check.run(phishing_dataset)
result.show(show_additional_outputs=False)

Label Ambiguity

Find samples with multiple labels.

Conditions Summary

Status	Condition	More Info
✖	Ambiguous sample ratio is not greater than 0%	Found ratio of samples with multiple labels above threshold: 0.6%
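The condition fails here because 0.6% of the samples are ambiguous and the threshold is 0. As a rough sketch of the logic, assuming the ratio is the share of rows whose feature values also appear with a different label, the comparison could look like the following. Both function names are hypothetical and the data is made up; this is not the deepchecks implementation.

```python
from collections import defaultdict

def ambiguous_sample_ratio(features, labels):
    """Fraction of samples whose feature values also appear with a different label.

    Assumed definition for illustration; deepchecks may compute it differently.
    """
    labels_by_row = defaultdict(set)
    for row, label in zip(features, labels):
        labels_by_row[tuple(row)].add(label)
    ambiguous = sum(1 for row in features if len(labels_by_row[tuple(row)]) > 1)
    return ambiguous / len(features) if features else 0.0

def ratio_not_greater_than(ratio, max_ratio=0.0):
    """Return True when the condition passes, i.e. the ratio is within the threshold."""
    return ratio <= max_ratio

# Toy data (made up): two of the five rows share features but disagree on the label.
features = [(81, 6), (81, 6), (82, 2), (85, 0), (90, 1)]
labels = [0, 1, 0, 1, 1]

ratio = ambiguous_sample_ratio(features, labels)
passed = ratio_not_greater_than(ratio, max_ratio=0)
print(f"{ratio:.1%}", passed)  # 40.0% False
```

A threshold of 0 is strict: a single conflicting pair is enough to fail. A small positive `max_ratio` would tolerate some label noise.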
