
# Data Duplicates

This notebook provides an overview of how to use and interpret the data duplicates check.

Structure:

- Why data duplicates?
- Load data
- Running the check
- Define a condition

[1]:
from deepchecks.checks.integrity.data_duplicates import DataDuplicates
from deepchecks.base import Dataset, Suite
from datetime import datetime
import pandas as pd

## Why data duplicates?

The DataDuplicates check finds multiple instances of identical samples in the Dataset. Duplicate samples increase the weight the model gives to those samples. If the duplicates are intentional (e.g. the result of deliberate oversampling, or a dataset that naturally contains identical-looking samples), this may be valid. If they are a hidden issue that we do not expect, they may indicate a problem in the data pipeline that requires attention.
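To make the idea of "identical samples" concrete, here is a minimal pandas sketch on a made-up toy DataFrame. It is not the deepchecks implementation; the column names and values are invented for illustration, and only standard pandas calls (`DataFrame.duplicated`, `groupby(...).groups`) are used.

```python
import pandas as pd

# Toy data (hypothetical): rows 0 and 2 are identical, as are rows 1 and 4.
df = pd.DataFrame({
    "url_length": [102, 154, 102, 94, 154],
    "ext": ["net", "com", "net", "com", "com"],
})

# keep=False flags every member of a duplicate group, not only the repeats.
in_duplicate_group = df.duplicated(keep=False)

# Map each duplicated combination of values to the row indices that share it.
grouped = df[in_duplicate_group].groupby(list(df.columns)).groups
duplicate_groups = {key: list(index) for key, index in grouped.items()}

print(duplicate_groups)
```

Each key is one duplicated sample and each value lists the rows where it occurs, which is roughly the information the check's output table presents.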

## Load data

[2]:
from deepchecks.datasets.classification.phishing import load_data

phishing_dataset = load_data(as_train_test=False, data_format='DataFrame')
phishing_dataset
[2]:
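| | target | month | scrape_date | ext | urlLength | numDigits | numParams | num_%20 | num_@ | entropy | ... | dse | bodyLength | numTitles | numImages | numLinks | specialChars | scriptLength | sbr | bscr | sscr |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 1 | 2019-01-01 | net | 102 | 8 | 0 | 0 | 0 | -4.384032 | ... | 191 | 32486 | 3 | 5 | 330 | 9419 | 23919 | 0.736286 | 0.289940 | 2.539442 |
| 1 | 0 | 1 | 2019-01-01 | country | 154 | 60 | 0 | 2 | 0 | -3.566515 | ... | 0 | 16199 | 0 | 4 | 39 | 2735 | 794 | 0.049015 | 0.168838 | 0.290311 |
| 2 | 0 | 1 | 2019-01-01 | net | 171 | 5 | 11 | 0 | 0 | -4.608755 | ... | 104 | 103344 | 18 | 9 | 302 | 27798 | 83817 | 0.811049 | 0.268985 | 2.412174 |
| 3 | 0 | 1 | 2019-01-01 | com | 94 | 10 | 0 | 0 | 0 | -4.548921 | ... | 466 | 34093 | 11 | 43 | 199 | 9087 | 19427 | 0.569824 | 0.266536 | 2.137889 |
| 4 | 0 | 1 | 2019-01-01 | other | 95 | 11 | 0 | 0 | 0 | -4.717188 | ... | 928 | 202 | 1 | 0 | 0 | 39 | 0 | 0.000000 | 0.193069 | 0.000000 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 11345 | 0 | 1 | 2020-01-15 | country | 89 | 7 | 0 | 0 | 0 | -4.254491 | ... | 0 | 4117 | 5 | 0 | 1 | 971 | 1866 | 0.625302 | 0.213266 | 2.932029 |
| 11346 | 0 | 1 | 2020-01-15 | other | 107 | 13 | 0 | 0 | 0 | -4.758879 | ... | 1882 | 17788 | 47 | 58 | 645 | 3185 | 4228 | 0.291069 | 0.214348 | 1.357928 |
| 11347 | 0 | 1 | 2020-01-15 | com | 112 | 10 | 0 | 0 | 0 | -4.723014 | ... | 1011 | 0 | 0 | 0 | 0 | 0 | 0 | 0.000000 | 0.000000 | 0.000000 |
| 11348 | 0 | 1 | 2020-01-15 | html | 111 | 3 | 0 | 0 | 0 | -4.289384 | ... | 265 | 0 | 0 | 0 | 0 | 0 | 0 | 0.000000 | 0.000000 | 0.000000 |
| 11349 | 0 | 1 | 2020-01-15 | html | 97 | 0 | 0 | 0 | 0 | -4.304523 | ... | 298 | 149 | 1 | 0 | 0 | 25 | 0 | 0.000000 | 0.167785 | 0.000000 |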

11350 rows × 25 columns

## Running the check

[3]:
from deepchecks.checks import DataDuplicates
DataDuplicates().run(phishing_dataset)

Data Duplicates

Checks for duplicate samples in the dataset. Read More...

Additional Outputs
0.0088% of data samples are duplicates.
Each row in the table shows an example of duplicate data and the number of times it appears.
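| Instances | Number of Duplicates | target | month | scrape_date | ext | urlLength | numDigits | numParams | num_%20 | num_@ | entropy | has_ip | hasHttp | hasHttps | urlIsLive | dsr | dse | bodyLength | numTitles | numImages | numLinks | specialChars | scriptLength | sbr | bscr | sscr |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 4696, 4719 | 2 | 0 | 6 | 2019-06-06 | other | 123 | 28 | 4 | 0 | 0 | -4.91 | 0 | True | False | False | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.00 | 0.00 | 0.00 |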
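The reported percentage is consistent with counting every row beyond the first occurrence of a sample as a duplicate: one extra copy among 11350 rows. The arithmetic below is a sketch under that assumption, not a statement of how deepchecks computes the figure internally.

```python
n_samples = 11350  # rows in the phishing dataset, taken from the output above
n_unique = 11349   # assumption: rows 4696 and 4719 collapse into one unique sample

# Share of rows that repeat an earlier row.
duplicate_ratio = 1 - n_unique / n_samples

print(f"{duplicate_ratio:.4%}")  # prints 0.0088%
```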

### With check parameters

The DataDuplicates check can also look for duplicates in a specific subset of columns, or in all columns except those listed in ignore_columns:

[4]:
DataDuplicates(columns=["entropy", "numParams"]).run(phishing_dataset)

Data Duplicates

Checks for duplicate samples in the dataset. Read More...

Additional Outputs
4.11% of data samples are duplicates.
Each row in the table shows an example of duplicate data and the number of times it appears.
| Instances | Number of Duplicates | entropy | numParams |
|---|---|---|---|
| 82, 974, 1557, 2150, 2360, 3528, 6560, 7... | 13 | -4.31 | 0 |
| 1641, 1729, 2213, 2234, 4412, 4638, 6328... | 8 | -4.57 | 4 |
| 2719, 4634, 6504, 6774, 6783, 7528, 9592... | 8 | -4.49 | 8 |
| 929, 2499, 4047, 7989, 8391, 9348, 9932,... | 8 | -4.25 | 0 |
| 1020, 1670, 1802, 2984, 6666, 9138, 1092... | 7 | -4.65 | 5 |

[5]:
DataDuplicates(ignore_columns=["scrape_date"], n_to_show=10).run(phishing_dataset)

Data Duplicates

Checks for duplicate samples in the dataset. Read More...

Additional Outputs
0.22% of data samples are duplicates.
Each row in the table shows an example of duplicate data and the number of times it appears.
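| Instances | Number of Duplicates | target | month | ext | urlLength | numDigits | numParams | num_%20 | num_@ | entropy | has_ip | hasHttp | hasHttps | urlIsLive | dsr | dse | bodyLength | numTitles | numImages | numLinks | specialChars | scriptLength | sbr | bscr | sscr |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 4696, 4719, 5398 | 3 | 0 | 6 | other | 123 | 28 | 4 | 0 | 0 | -4.91 | 0 | True | False | False | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.00 | 0.00 | 0.00 |
| 82, 11342 | 2 | 0 | 1 | html | 92 | 2 | 0 | 0 | 0 | -4.31 | 0 | True | False | False | 0 | 0 | 149 | 1 | 0 | 0 | 25 | 0 | 0.00 | 0.17 | 0.00 |
| 250, 790 | 2 | 0 | 1 | php | 107 | 4 | 8 | 0 | 0 | -4.53 | 0 | True | False | False | 1381 | 79 | 0 | 1 | 0 | 0 | 0 | 0 | 0.00 | 0.00 | 0.00 |
| 6, 217 | 2 | 0 | 1 | php | 107 | 5 | 8 | 0 | 0 | -4.52 | 0 | True | False | False | 1381 | 79 | 0 | 1 | 0 | 0 | 0 | 0 | 0.00 | 0.00 | 0.00 |
| 609, 763 | 2 | 0 | 1 | php | 113 | 6 | 8 | 0 | 0 | -4.63 | 0 | True | False | False | 1381 | 79 | 0 | 1 | 0 | 0 | 0 | 0 | 0.00 | 0.00 | 0.00 |
| 974, 1557 | 2 | 0 | 2 | html | 92 | 2 | 0 | 0 | 0 | -4.31 | 0 | True | False | False | 0 | 0 | 149 | 1 | 0 | 0 | 25 | 0 | 0.00 | 0.17 | 0.00 |
| 2150, 2360 | 2 | 0 | 3 | html | 92 | 2 | 0 | 0 | 0 | -4.31 | 0 | True | False | False | 0 | 0 | 149 | 1 | 0 | 0 | 25 | 0 | 0.00 | 0.17 | 0.00 |
| 2238, 2489 | 2 | 0 | 3 | php | 108 | 3 | 8 | 0 | 0 | -4.51 | 0 | True | False | False | 1381 | 79 | 0 | 1 | 0 | 0 | 0 | 0 | 0.00 | 0.00 | 0.00 |
| 3192, 3444 | 2 | 0 | 4 | other | 123 | 28 | 4 | 0 | 0 | -4.92 | 0 | True | False | False | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.00 | 0.00 | 0.00 |
| 3277, 3498 | 2 | 0 | 4 | php | 93 | 31 | 1 | 0 | 0 | -4.93 | 0 | True | False | False | 0 | 0 | 281 | 0 | 0 | 0 | 74 | 142 | 0.51 | 0.26 | 1.92 |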
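The effect of restricting or excluding columns can be reproduced in plain pandas. The helper below, `duplicate_ratio`, is hypothetical and only mirrors the meaning of the `columns` and `ignore_columns` parameters described above; the toy data is invented, and the ratio definition (rows that repeat an earlier row) is an assumption rather than the library's documented formula.

```python
import pandas as pd

# Toy data (hypothetical): rows differ only in scrape_date for the first two rows.
df = pd.DataFrame({
    "scrape_date": ["2019-01-01", "2019-02-01", "2019-03-01"],
    "entropy": [-4.31, -4.31, -4.57],
    "numParams": [0, 0, 4],
})


def duplicate_ratio(frame, columns=None, ignore_columns=None):
    """Hypothetical helper: share of rows that repeat an earlier row."""
    if columns is not None and ignore_columns is not None:
        raise ValueError("pass either columns or ignore_columns, not both")
    if columns is None:
        ignored = set(ignore_columns or [])
        columns = [c for c in frame.columns if c not in ignored]
    # duplicated() marks every occurrence after the first as True.
    return frame.duplicated(subset=columns).mean()


all_columns = duplicate_ratio(df)
subset_only = duplicate_ratio(df, columns=["entropy", "numParams"])
without_date = duplicate_ratio(df, ignore_columns=["scrape_date"])
```

With all columns, no row repeats another; once `scrape_date` is left out of the comparison, the second row matches the first. This is the same pattern as in the outputs above, where comparing fewer columns raises the duplicate percentage.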

## Define a condition

Now we define a condition that enforces a duplicate ratio of 0. A condition is deepchecks’ way to validate model and data quality, and it lets you know if anything goes wrong.

[6]:
check = DataDuplicates()
check.add_condition_ratio_not_greater_than(0)
result = check.run(phishing_dataset)
result.show(show_additional_outputs=False)

Data Duplicates

Checks for duplicate samples in the dataset. Read More...

Conditions Summary

| Status | Condition | More Info |
|---|---|---|
| ! | Duplicate data ratio is not greater than 0% | Found 0.0088% duplicate data |

As the summary shows, the condition did not pass: the dataset contains duplicate samples (0.0088%), which exceeds the allowed ratio of 0.
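The logic of such a threshold condition can be sketched in a few lines of plain Python. `ConditionOutcome` and `ratio_not_greater_than` below are hypothetical stand-ins written for illustration; they are not deepchecks classes or functions, and the 1% threshold in the second call is an arbitrary example value.

```python
from dataclasses import dataclass


@dataclass
class ConditionOutcome:
    """Hypothetical container for a condition's verdict."""
    passed: bool
    details: str


def ratio_not_greater_than(duplicate_ratio: float, max_ratio: float = 0.0) -> ConditionOutcome:
    """Hypothetical stand-in: pass only if the ratio does not exceed max_ratio."""
    passed = duplicate_ratio <= max_ratio
    return ConditionOutcome(passed, f"Found {duplicate_ratio:.4%} duplicate data")


# One repeated row among 11350, matching the figures reported above.
strict = ratio_not_greater_than(1 / 11350, max_ratio=0.0)
lenient = ratio_not_greater_than(1 / 11350, max_ratio=0.01)

print(strict)
print(lenient)
```

With a threshold of 0, a single repeated row is enough to fail; a looser threshold would tolerate a small share of duplicates, which may suit datasets where some repetition is expected.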