# Data Duplicates
This notebook provides an overview of how to use and interpret the data duplicates check.
Structure:
- Why data duplicates?
- Load data
- Running the check
- Define a condition
[1]:
from deepchecks.checks.integrity.data_duplicates import DataDuplicates
from deepchecks.base import Dataset, Suite
from datetime import datetime
import pandas as pd
## Why data duplicates?
The DataDuplicates check finds multiple instances of identical samples in the Dataset. Duplicate samples increase the weight the model gives to those samples. If the duplicates are intentional (e.g. the result of deliberate oversampling, or a dataset that naturally contains identical-looking samples), this may be valid. If they are unexpected, however, they may indicate a problem in the data pipeline that requires attention.
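To make "identical samples" concrete, here is a minimal sketch in plain pandas, independent of deepchecks. The toy DataFrame and its values are made up for illustration; only the column names are borrowed from the phishing dataset.

```python
import pandas as pd

# Toy data (illustrative values): rows 0 and 2 are identical, row 1 is unique.
df = pd.DataFrame({
    "urlLength": [102, 154, 102],
    "ext": ["net", "country", "net"],
})

# keep=False flags every member of a duplicate group, not only the repeats.
is_duplicate = df.duplicated(keep=False)
duplicate_rows = df[is_duplicate]

print(duplicate_rows.index.tolist())
```

With the default `keep="first"`, only the later copies would be flagged, which is the more useful view when the goal is to drop redundant rows rather than inspect them.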
## Load data
[2]:
from deepchecks.datasets.classification.phishing import load_data
phishing_dataset = load_data(as_train_test=False, data_format='DataFrame')
phishing_dataset
[2]:
| | target | month | scrape_date | ext | urlLength | numDigits | numParams | num_%20 | num_@ | entropy | ... | dse | bodyLength | numTitles | numImages | numLinks | specialChars | scriptLength | sbr | bscr | sscr |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 1 | 2019-01-01 | net | 102 | 8 | 0 | 0 | 0 | -4.384032 | ... | 191 | 32486 | 3 | 5 | 330 | 9419 | 23919 | 0.736286 | 0.289940 | 2.539442 |
| 1 | 0 | 1 | 2019-01-01 | country | 154 | 60 | 0 | 2 | 0 | -3.566515 | ... | 0 | 16199 | 0 | 4 | 39 | 2735 | 794 | 0.049015 | 0.168838 | 0.290311 |
| 2 | 0 | 1 | 2019-01-01 | net | 171 | 5 | 11 | 0 | 0 | -4.608755 | ... | 104 | 103344 | 18 | 9 | 302 | 27798 | 83817 | 0.811049 | 0.268985 | 2.412174 |
| 3 | 0 | 1 | 2019-01-01 | com | 94 | 10 | 0 | 0 | 0 | -4.548921 | ... | 466 | 34093 | 11 | 43 | 199 | 9087 | 19427 | 0.569824 | 0.266536 | 2.137889 |
| 4 | 0 | 1 | 2019-01-01 | other | 95 | 11 | 0 | 0 | 0 | -4.717188 | ... | 928 | 202 | 1 | 0 | 0 | 39 | 0 | 0.000000 | 0.193069 | 0.000000 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 11345 | 0 | 1 | 2020-01-15 | country | 89 | 7 | 0 | 0 | 0 | -4.254491 | ... | 0 | 4117 | 5 | 0 | 1 | 971 | 1866 | 0.625302 | 0.213266 | 2.932029 |
| 11346 | 0 | 1 | 2020-01-15 | other | 107 | 13 | 0 | 0 | 0 | -4.758879 | ... | 1882 | 17788 | 47 | 58 | 645 | 3185 | 4228 | 0.291069 | 0.214348 | 1.357928 |
| 11347 | 0 | 1 | 2020-01-15 | com | 112 | 10 | 0 | 0 | 0 | -4.723014 | ... | 1011 | 0 | 0 | 0 | 0 | 0 | 0 | 0.000000 | 0.000000 | 0.000000 |
| 11348 | 0 | 1 | 2020-01-15 | html | 111 | 3 | 0 | 0 | 0 | -4.289384 | ... | 265 | 0 | 0 | 0 | 0 | 0 | 0 | 0.000000 | 0.000000 | 0.000000 |
| 11349 | 0 | 1 | 2020-01-15 | html | 97 | 0 | 0 | 0 | 0 | -4.304523 | ... | 298 | 149 | 1 | 0 | 0 | 25 | 0 | 0.000000 | 0.167785 | 0.000000 |
11350 rows × 25 columns
## Running the check
[3]:
from deepchecks.checks import DataDuplicates
DataDuplicates().run(phishing_dataset)
Data Duplicates
Checks for duplicate samples in the dataset.
Additional Outputs
| Instances | Number of Duplicates | target | month | scrape_date | ext | urlLength | numDigits | numParams | num_%20 | num_@ | entropy | has_ip | hasHttp | hasHttps | urlIsLive | dsr | dse | bodyLength | numTitles | numImages | numLinks | specialChars | scriptLength | sbr | bscr | sscr |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 4696, 4719 | 2 | 0 | 6 | 2019-06-06 | other | 123 | 28 | 4 | 0 | 0 | -4.91 | 0 | True | False | False | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.00 | 0.00 | 0.00 |
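The output above groups identical rows and reports their index labels ("Instances") together with the group size ("Number of Duplicates"). A rough sketch of how such a summary could be built with pandas follows. `summarize_duplicates` is a hypothetical helper written for this example, not part of deepchecks, and the check's actual implementation may differ.

```python
import pandas as pd

def summarize_duplicates(frame: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper: one output row per group of identical samples."""
    rows = []
    # Group on all columns; .groups maps each distinct row to its index labels.
    grouped = frame.groupby(list(frame.columns), sort=False).groups
    for labels in grouped.values():
        if len(labels) > 1:  # keep only groups that really contain duplicates
            rows.append({
                "Instances": ", ".join(str(label) for label in labels),
                "Number of Duplicates": len(labels),
            })
    summary = pd.DataFrame(rows, columns=["Instances", "Number of Duplicates"])
    # Largest groups first, similar to the ordering seen in the check output.
    return summary.sort_values("Number of Duplicates", ascending=False)

# Toy data (illustrative values): rows 0, 2 and 4 are identical.
df = pd.DataFrame({
    "entropy": [-4.31, -4.57, -4.31, -4.25, -4.31],
    "numParams": [0, 4, 0, 0, 0],
})
summary = summarize_duplicates(df)
print(summary)
```

Note that `groupby` drops rows with missing values in the grouping columns by default, so this sketch assumes complete data.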
### With Check Parameters
The DataDuplicates check can also be restricted to a specific subset of columns with columns, or run on all columns except those listed in ignore_columns. The n_to_show parameter sets how many duplicate groups are displayed:
[4]:
DataDuplicates(columns=["entropy", "numParams"]).run(phishing_dataset)
Data Duplicates
Checks for duplicate samples in the dataset.
Additional Outputs
| Instances | Number of Duplicates | entropy | numParams |
|---|---|---|---|
| 82, 974, 1557, 2150, 2360, 3528, 6560, 7... | 13 | -4.31 | 0 |
| 1641, 1729, 2213, 2234, 4412, 4638, 6328... | 8 | -4.57 | 4 |
| 2719, 4634, 6504, 6774, 6783, 7528, 9592... | 8 | -4.49 | 8 |
| 929, 2499, 4047, 7989, 8391, 9348, 9932,... | 8 | -4.25 | 0 |
| 1020, 1670, 1802, 2984, 6666, 9138, 1092... | 7 | -4.65 | 5 |
[5]:
DataDuplicates(ignore_columns=["scrape_date"], n_to_show=10).run(phishing_dataset)
Data Duplicates
Checks for duplicate samples in the dataset.
Additional Outputs
| Instances | Number of Duplicates | target | month | ext | urlLength | numDigits | numParams | num_%20 | num_@ | entropy | has_ip | hasHttp | hasHttps | urlIsLive | dsr | dse | bodyLength | numTitles | numImages | numLinks | specialChars | scriptLength | sbr | bscr | sscr |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 4696, 4719, 5398 | 3 | 0 | 6 | other | 123 | 28 | 4 | 0 | 0 | -4.91 | 0 | True | False | False | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.00 | 0.00 | 0.00 |
| 82, 11342 | 2 | 0 | 1 | html | 92 | 2 | 0 | 0 | 0 | -4.31 | 0 | True | False | False | 0 | 0 | 149 | 1 | 0 | 0 | 25 | 0 | 0.00 | 0.17 | 0.00 |
| 250, 790 | 2 | 0 | 1 | php | 107 | 4 | 8 | 0 | 0 | -4.53 | 0 | True | False | False | 1381 | 79 | 0 | 1 | 0 | 0 | 0 | 0 | 0.00 | 0.00 | 0.00 |
| 6, 217 | 2 | 0 | 1 | php | 107 | 5 | 8 | 0 | 0 | -4.52 | 0 | True | False | False | 1381 | 79 | 0 | 1 | 0 | 0 | 0 | 0 | 0.00 | 0.00 | 0.00 |
| 609, 763 | 2 | 0 | 1 | php | 113 | 6 | 8 | 0 | 0 | -4.63 | 0 | True | False | False | 1381 | 79 | 0 | 1 | 0 | 0 | 0 | 0 | 0.00 | 0.00 | 0.00 |
| 974, 1557 | 2 | 0 | 2 | html | 92 | 2 | 0 | 0 | 0 | -4.31 | 0 | True | False | False | 0 | 0 | 149 | 1 | 0 | 0 | 25 | 0 | 0.00 | 0.17 | 0.00 |
| 2150, 2360 | 2 | 0 | 3 | html | 92 | 2 | 0 | 0 | 0 | -4.31 | 0 | True | False | False | 0 | 0 | 149 | 1 | 0 | 0 | 25 | 0 | 0.00 | 0.17 | 0.00 |
| 2238, 2489 | 2 | 0 | 3 | php | 108 | 3 | 8 | 0 | 0 | -4.51 | 0 | True | False | False | 1381 | 79 | 0 | 1 | 0 | 0 | 0 | 0 | 0.00 | 0.00 | 0.00 |
| 3192, 3444 | 2 | 0 | 4 | other | 123 | 28 | 4 | 0 | 0 | -4.92 | 0 | True | False | False | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.00 | 0.00 | 0.00 |
| 3277, 3498 | 2 | 0 | 4 | php | 93 | 31 | 1 | 0 | 0 | -4.93 | 0 | True | False | False | 0 | 0 | 281 | 0 | 0 | 0 | 74 | 142 | 0.51 | 0.26 | 1.92 |
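The effect of columns and ignore_columns seen in the two runs above can be approximated in plain pandas by choosing which columns take part in the comparison. `duplicate_mask` below is a hypothetical helper for illustration only. Rejecting both arguments at once is a choice made for this sketch, not a statement about how deepchecks behaves.

```python
import pandas as pd

def duplicate_mask(frame, columns=None, ignore_columns=None):
    """Hypothetical helper: flag rows that repeat on the selected columns."""
    if columns is not None and ignore_columns is not None:
        raise ValueError("Pass either columns or ignore_columns, not both.")
    if columns is None:
        ignored = set(ignore_columns or [])
        columns = [c for c in frame.columns if c not in ignored]
    # keep=False flags every member of a duplicate group.
    return frame.duplicated(subset=columns, keep=False)

# Toy data (illustrative values): rows 0 and 1 differ only in scrape_date.
df = pd.DataFrame({
    "scrape_date": ["2019-06-06", "2019-06-07", "2019-06-08", "2019-06-09"],
    "ext": ["other", "other", "php", "php"],
    "urlLength": [123, 123, 107, 108],
})

all_columns = duplicate_mask(df)                                # no duplicates
without_date = duplicate_mask(df, ignore_columns=["scrape_date"])  # rows 0 and 1
only_ext = duplicate_mask(df, columns=["ext"])                  # every row
```

Narrowing the compared columns can only keep or increase the number of rows flagged, which fits the larger groups reported when scrape_date is ignored.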
## Define a condition
Now we define a condition that enforces a duplicate ratio of 0. A condition is deepchecks’ way of validating model and data quality and letting you know if anything goes wrong.
[6]:
check = DataDuplicates()
check.add_condition_ratio_not_greater_than(0)
result = check.run(phishing_dataset)
result.show(show_additional_outputs=False)
Data Duplicates
Checks for duplicate samples in the dataset.
Conditions Summary
| Status | Condition | More Info |
|---|---|---|
| ! | Duplicate data ratio is not greater than 0% | Found 0.0088% duplicate data |
As the summary shows, the condition did not pass: the dataset contains duplicate samples.
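For intuition about the reported ratio, here is a small sketch that computes a duplicate ratio in plain pandas and compares it with a threshold. It assumes the ratio is defined as the share of redundant rows, 1 - n_unique / n_samples. Under that assumption, a single redundant row among 11350 gives 1/11350 ≈ 0.0088%, which matches the figure above. `duplicate_ratio` is a hypothetical helper, not a deepchecks function.

```python
import pandas as pd

def duplicate_ratio(frame: pd.DataFrame) -> float:
    """Hypothetical helper: fraction of rows that are redundant copies."""
    n_samples = len(frame)
    if n_samples == 0:
        return 0.0
    n_unique = len(frame.drop_duplicates())
    return 1 - n_unique / n_samples

# Toy data (illustrative values): row 1 is a copy of row 0.
df = pd.DataFrame({"a": [1, 1, 2, 3], "b": ["x", "x", "y", "z"]})

ratio = duplicate_ratio(df)   # 1 - 3/4
max_ratio = 0                 # same threshold as the condition above
passed = ratio <= max_ratio

print(f"Found {ratio:.2%} duplicate data, condition passed: {passed}")
```

A threshold of 0 is strict: any repeated row fails it. When some duplication is expected, for example after oversampling, a small positive threshold may be more practical.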