API Reference - DataDuplicates

Data Duplicates

This notebook provides an overview of how to use and understand the data duplicates check.

Structure:

Why data duplicates?
Load data
Run the check
Define a condition

[1]:

from deepchecks.checks.integrity.data_duplicates import DataDuplicates
from deepchecks.base import Dataset, Suite
from datetime import datetime
import pandas as pd

## Why data duplicates?

The DataDuplicates check finds multiple instances of identical samples in the Dataset. Duplicate samples increase the weight the model gives to those samples. If the duplicates are intentional (e.g. the result of deliberate oversampling, or because the dataset naturally contains identical-looking samples), this may be valid. If they are a hidden issue that we do not expect, however, they may indicate a problem in the data pipeline that requires attention.
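To make the idea concrete, here is a minimal pandas sketch on a small made-up DataFrame (the `toy` frame and its values are illustrative assumptions, not part of the phishing dataset). It counts rows that exactly repeat an earlier row, which is one plausible way to express a duplicate ratio. It is not presented as the check's actual implementation.

```python
import pandas as pd

# Toy data (hypothetical): rows 0 and 2 are identical
toy = pd.DataFrame({
    "url_length": [10, 20, 10, 30],
    "ext": ["com", "net", "com", "org"],
})

# duplicated() flags every row that repeats an earlier row
# (keep="first" is the default, so the first occurrence is not flagged)
repeat_mask = toy.duplicated()

# The mean of a boolean mask is the fraction of flagged rows
duplicate_ratio = repeat_mask.mean()
print(f"{duplicate_ratio:.2%} of rows repeat an earlier row")
```

In this toy frame one of the four rows repeats an earlier one, so the ratio is 0.25.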

## Load data

[2]:

from deepchecks.datasets.classification.phishing import load_data

phishing_dataset = load_data(as_train_test=False, data_format='DataFrame')
phishing_dataset

[2]:

	target	month	scrape_date	ext	urlLength	numDigits	numParams	num_%20	num_@	entropy	...	dse	bodyLength	numTitles	numImages	numLinks	specialChars	scriptLength	sbr	bscr	sscr
0	0	1	2019-01-01	net	102	8	0	0	0	-4.384032	...	191	32486	3	5	330	9419	23919	0.736286	0.289940	2.539442
1	0	1	2019-01-01	country	154	60	0	2	0	-3.566515	...	0	16199	0	4	39	2735	794	0.049015	0.168838	0.290311
2	0	1	2019-01-01	net	171	5	11	0	0	-4.608755	...	104	103344	18	9	302	27798	83817	0.811049	0.268985	2.412174
3	0	1	2019-01-01	com	94	10	0	0	0	-4.548921	...	466	34093	11	43	199	9087	19427	0.569824	0.266536	2.137889
4	0	1	2019-01-01	other	95	11	0	0	0	-4.717188	...	928	202	1	0	0	39	0	0.000000	0.193069	0.000000
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
11345	0	1	2020-01-15	country	89	7	0	0	0	-4.254491	...	0	4117	5	0	1	971	1866	0.625302	0.213266	2.932029
11346	0	1	2020-01-15	other	107	13	0	0	0	-4.758879	...	1882	17788	47	58	645	3185	4228	0.291069	0.214348	1.357928
11347	0	1	2020-01-15	com	112	10	0	0	0	-4.723014	...	1011	0	0	0	0	0	0	0.000000	0.000000	0.000000
11348	0	1	2020-01-15	html	111	3	0	0	0	-4.289384	...	265	0	0	0	0	0	0	0.000000	0.000000	0.000000
11349	0	1	2020-01-15	html	97	0	0	0	0	-4.304523	...	298	149	1	0	0	25	0	0.000000	0.167785	0.000000

11350 rows × 25 columns

## Run the check

[3]:

from deepchecks.checks import DataDuplicates
DataDuplicates().run(phishing_dataset)

Data Duplicates

Checks for duplicate samples in the dataset. Read More...

Additional Outputs

0.0088% of data samples are duplicates.

Each row in the table shows an example of duplicate data and the number of times it appears.

Instances	Number of Duplicates	target	month	scrape_date	ext	urlLength	numDigits	numParams	num_%20	num_@	entropy	has_ip	hasHttp	hasHttps	urlIsLive	dsr	dse	bodyLength	numTitles	numImages	numLinks	specialChars	scriptLength	sbr	bscr	sscr
4696, 4719	2	0	6	2019-06-06	other	123	28	4	0	0	-4.91	0	True	False	False	0	0	0	0	0	0	0	0	0.00	0.00	0.00

With Check Parameters

The DataDuplicates check can also look for duplicates in a specific subset of columns (the columns parameter) or, alternatively, in all columns except those listed in ignore_columns:

[4]:

DataDuplicates(columns=["entropy", "numParams"]).run(phishing_dataset)

Data Duplicates

Checks for duplicate samples in the dataset. Read More...

Additional Outputs

4.11% of data samples are duplicates.

Each row in the table shows an example of duplicate data and the number of times it appears.

Instances	Number of Duplicates	entropy	numParams
82, 974, 1557, 2150, 2360, 3528, 6560, 7...	13	-4.31	0
1641, 1729, 2213, 2234, 4412, 4638, 6328...	8	-4.57	4
2719, 4634, 6504, 6774, 6783, 7528, 9592...	8	-4.49	8
929, 2499, 4047, 7989, 8391, 9348, 9932,...	8	-4.25	0
1020, 1670, 1802, 2984, 6666, 9138, 1092...	7	-4.65	5

[5]:

DataDuplicates(ignore_columns=["scrape_date"], n_to_show=10).run(phishing_dataset)

Data Duplicates

Checks for duplicate samples in the dataset. Read More...

Additional Outputs

0.22% of data samples are duplicates.

Each row in the table shows an example of duplicate data and the number of times it appears.

Instances	Number of Duplicates	target	month	ext	urlLength	numDigits	numParams	num_%20	num_@	entropy	has_ip	hasHttp	hasHttps	urlIsLive	dsr	dse	bodyLength	numTitles	numImages	numLinks	specialChars	scriptLength	sbr	bscr	sscr
4696, 4719, 5398	3	0	6	other	123	28	4	0	0	-4.91	0	True	False	False	0	0	0	0	0	0	0	0	0.00	0.00	0.00
82, 11342	2	0	1	html	92	2	0	0	0	-4.31	0	True	False	False	0	0	149	1	0	0	25	0	0.00	0.17	0.00
250, 790	2	0	1	php	107	4	8	0	0	-4.53	0	True	False	False	1381	79	0	1	0	0	0	0	0.00	0.00	0.00
6, 217	2	0	1	php	107	5	8	0	0	-4.52	0	True	False	False	1381	79	0	1	0	0	0	0	0.00	0.00	0.00
609, 763	2	0	1	php	113	6	8	0	0	-4.63	0	True	False	False	1381	79	0	1	0	0	0	0	0.00	0.00	0.00
974, 1557	2	0	2	html	92	2	0	0	0	-4.31	0	True	False	False	0	0	149	1	0	0	25	0	0.00	0.17	0.00
2150, 2360	2	0	3	html	92	2	0	0	0	-4.31	0	True	False	False	0	0	149	1	0	0	25	0	0.00	0.17	0.00
2238, 2489	2	0	3	php	108	3	8	0	0	-4.51	0	True	False	False	1381	79	0	1	0	0	0	0	0.00	0.00	0.00
3192, 3444	2	0	4	other	123	28	4	0	0	-4.92	0	True	False	False	0	0	0	0	0	0	0	0	0.00	0.00	0.00
3277, 3498	2	0	4	php	93	31	1	0	0	-4.93	0	True	False	False	0	0	281	0	0	0	74	142	0.51	0.26	1.92
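The effect of restricting or excluding columns can be illustrated with plain pandas. The sketch below uses a made-up four-row frame (the values are assumptions chosen for illustration) and `DataFrame.duplicated(subset=...)` as an analogue of the `columns` and `ignore_columns` parameters. It only mirrors the idea and makes no claim about how deepchecks implements it.

```python
import pandas as pd

# Toy data (hypothetical values, loosely modelled on the phishing columns)
toy = pd.DataFrame({
    "entropy": [-4.31, -4.31, -4.31, -4.57],
    "numParams": [0, 0, 0, 4],
    "ext": ["html", "html", "php", "php"],
    "scrape_date": ["2019-01-01", "2019-02-01", "2019-03-01", "2019-03-01"],
})

# All columns: no two rows are fully identical
full_ratio = toy.duplicated().mean()

# Analogue of columns=[...]: compare rows on a chosen subset only
subset_ratio = toy.duplicated(subset=["entropy", "numParams"]).mean()

# Analogue of ignore_columns=[...]: compare on everything except the listed columns
ignored = ["scrape_date"]
kept = [c for c in toy.columns if c not in ignored]
ignore_ratio = toy.duplicated(subset=kept).mean()

print(full_ratio, subset_ratio, ignore_ratio)
```

The fewer columns are compared, the more rows tend to collide, which is the same pattern seen in the outputs above: the ratio rises when only two columns are checked or when `scrape_date` is ignored.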

## Define a condition

Now we define a condition that enforces a duplicate ratio of 0. A condition is deepchecks’ way of validating model and data quality and letting you know if anything goes wrong.

[6]:

check = DataDuplicates()
check.add_condition_ratio_not_greater_than(0)
result = check.run(phishing_dataset)
result.show(show_additional_outputs=False)

Data Duplicates

Checks for duplicate samples in the dataset. Read More...

Conditions Summary

Status	Condition	More Info
!	Duplicate data ratio is not greater than 0%	Found 0.0088% duplicate data

As the summary shows, the condition is not met: the dataset contains duplicate samples.
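The logic behind such a threshold can be sketched in plain pandas. The helper names `duplicate_ratio` and `ratio_not_greater_than` below are hypothetical, chosen for illustration, and the toy data is made up. The condition mechanism in deepchecks may differ in its details.

```python
import pandas as pd


def duplicate_ratio(df: pd.DataFrame) -> float:
    """Fraction of rows that exactly repeat an earlier row."""
    if len(df) == 0:
        return 0.0
    return float(df.duplicated().mean())


def ratio_not_greater_than(df: pd.DataFrame, max_ratio: float = 0.0) -> bool:
    """Return True when the duplicate ratio does not exceed max_ratio."""
    return duplicate_ratio(df) <= max_ratio


# Toy data (hypothetical): row 1 repeats row 0
toy = pd.DataFrame({"a": [1, 1, 2, 3], "b": ["x", "x", "y", "z"]})

strict_ok = ratio_not_greater_than(toy, max_ratio=0.0)    # any duplicate fails
lenient_ok = ratio_not_greater_than(toy, max_ratio=0.25)  # tolerates up to 25%
print(strict_ok, lenient_ok)
```

With a threshold of 0, a single repeated row is enough to fail. For scale, one repeated row among 11350 rows gives 1/11350 ≈ 0.0088%, which is consistent with the figure reported in the summary above.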
