Classifying Malicious URLs¶

This notebook demonstrates how the deepchecks package can help you validate your basic data science workflow right out of the box!

The scenario is a real business use case: You work as a data scientist at a cyber security startup, and the company wants to provide its clients with a tool to automatically detect phishing attempts performed through emails and warn clients about them. The idea is to scan emails and determine, for each web URL they include, whether it points to a phishing-related web page.

Since phishing is an ever-adapting effort, static blacklists or whitelists composed of good or bad URLs seen in the past are simply not enough to make a good filtering system for the future. The way the company chose to deal with this challenge is to have you train a Machine Learning model to generalize what a phishing URL looks like from historical data!

To enable you to do this, the company's security team has collected a set of benign (meaning OK, or kosher) URLs and phishing URLs observed during 2019 (not necessarily in clients' emails). They have also written a script that extracts features they believe should help discern phishing URLs from benign ones.

These features are divided into three subsets:

  • String Characteristics - Extracted from the URL string itself.

  • Domain Characteristics - Extracted by interacting with the domain provider.

  • Web Page Characteristics - Extracted from the content of the web page the URL points to.

The string characteristics are based on the way URLs are structured and on what their different parts do. Here is an informative illustration; you can read more in Mozilla's What is a URL article. We'll see the specific features soon.

[1]:
from IPython.display import Image
Image(url="https://developer.mozilla.org/en-US/docs/Learn/Common_questions/What_is_a_URL/mdn-url-all.png")
[1]:

(Note: This is a slightly synthetic dataset based on a great project by Rohith Ramakrishnan and others, accompanied by a blog post. The authors have released it under an open license at our request, and for that we are very grateful to them.)

Installing requirements

[2]:
import sys
!{sys.executable} -m pip install deepchecks --quiet

🔷 Loading the data¶

OK, let’s take a look at the data!

[3]:
import numpy as np
import pandas as pd
import sklearn
import deepchecks

pd.set_option('display.max_columns', 45)
SEED = 832
np.random.seed(SEED)
[4]:
from deepchecks.datasets.classification.phishing import load_data
[5]:
df = load_data(data_format='dataframe', as_train_test=False)
[6]:
df.shape
[6]:
(11350, 25)
[7]:
df.head(5)
[7]:
target month scrape_date ext urlLength numDigits numParams num_%20 num_@ entropy has_ip hasHttp hasHttps urlIsLive dsr dse bodyLength numTitles numImages numLinks specialChars scriptLength sbr bscr sscr
0 0 1 2019-01-01 net 102 8 0 0 0 -4.384032 0 True False False 4921 191 32486 3 5 330 9419 23919 0.736286 0.289940 2.539442
1 0 1 2019-01-01 country 154 60 0 2 0 -3.566515 0 True False False 0 0 16199 0 4 39 2735 794 0.049015 0.168838 0.290311
2 0 1 2019-01-01 net 171 5 11 0 0 -4.608755 0 True False False 5374 104 103344 18 9 302 27798 83817 0.811049 0.268985 2.412174
3 0 1 2019-01-01 com 94 10 0 0 0 -4.548921 0 True False False 6107 466 34093 11 43 199 9087 19427 0.569824 0.266536 2.137889
4 0 1 2019-01-01 other 95 11 0 0 0 -4.717188 0 True False False 3819 928 202 1 0 0 39 0 0.000000 0.193069 0.000000

Here is the actual list of features:

[8]:
df.columns
[8]:
Index(['target', 'month', 'scrape_date', 'ext', 'urlLength', 'numDigits',
       'numParams', 'num_%20', 'num_@', 'entropy', 'has_ip', 'hasHttp',
       'hasHttps', 'urlIsLive', 'dsr', 'dse', 'bodyLength', 'numTitles',
       'numImages', 'numLinks', 'specialChars', 'scriptLength', 'sbr', 'bscr',
       'sscr'],
      dtype='object')

🔹 Feature List¶

And here is a short explanation of each:

| Feature Name | Feature Group | Description |
|---|---|---|
| target | Meta Features | 0 if the URL is benign, 1 if it is related to phishing |
| month | Meta Features | The month this URL was first encountered, as an int |
| scrape_date | Meta Features | The exact date this URL was first encountered |
| ext | String Characteristics | The domain extension |
| urlLength | String Characteristics | The number of characters in the URL |
| numDigits | String Characteristics | The number of digits in the URL |
| numParams | String Characteristics | The number of query parameters in the URL |
| num_%20 | String Characteristics | The number of '%20' substrings in the URL |
| num_@ | String Characteristics | The number of @ characters in the URL |
| entropy | String Characteristics | The entropy of the URL |
| has_ip | String Characteristics | True if the URL string contains an IP address |
| hasHttp | Domain Characteristics | True if the url's domain supports http |
| hasHttps | Domain Characteristics | True if the url's domain supports https |
| urlIsLive | Domain Characteristics | The URL was live at the time of scraping |
| dsr | Domain Characteristics | The number of days since domain registration |
| dse | Domain Characteristics | The number of days since domain registration expired |
| bodyLength | Web Page Characteristics | The number of characters in the URL's web page |
| numTitles | Web Page Characteristics | The number of HTML titles (H1/H2/…) in the page |
| numImages | Web Page Characteristics | The number of images in the page |
| numLinks | Web Page Characteristics | The number of links in the page |
| specialChars | Web Page Characteristics | The number of special characters in the page |
| scriptLength | Web Page Characteristics | The number of characters in scripts embedded in the page |
| sbr | Web Page Characteristics | The ratio of scriptLength to bodyLength (= scriptLength / bodyLength) |
| bscr | Web Page Characteristics | The ratio of specialChars to bodyLength (= specialChars / bodyLength) |
| sscr | Web Page Characteristics | The ratio of scriptLength to specialChars (= scriptLength / specialChars) |
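To make the string characteristics concrete, here is a minimal sketch of how features like urlLength, numDigits, num_@ and entropy could be computed from a raw URL string. The security team's actual extraction script isn't shown here, so treat this as an illustrative approximation rather than the exact feature definitions:

import math
from collections import Counter
from urllib.parse import urlparse, parse_qs

def url_string_features(url: str) -> dict:
    """Illustrative approximation of a few of the string characteristics."""
    counts = Counter(url)
    total = len(url)
    # Shannon entropy of the character distribution, left negated to match
    # the negative values seen in the 'entropy' column above.
    entropy = sum((c / total) * math.log2(c / total) for c in counts.values())
    return {
        'urlLength': total,
        'numDigits': sum(ch.isdigit() for ch in url),
        'numParams': len(parse_qs(urlparse(url).query)),
        'num_%20': url.count('%20'),
        'num_@': url.count('@'),
        'entropy': entropy,
    }

url_string_features('https://example.com/login?user=admin@evil.net&next=%20home')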

🔷 Data Integrity with Deepchecks!¶

The nice thing about the deepchecks package is that we can already use it out of the box! Instead of running a single check, we use a pre-defined test suite to run a host of data validation checks.

We think it’s valuable to start off with these types of suites, as there are various issues we can identify at the get-go just by looking at the raw data.

We will first import the appropriate factory function from the deepchecks.suites module - in this case, an integrity suite tailored for a single dataset (as opposed to a train-test split, for example) - and use it to create a new suite object:

[9]:
from deepchecks.suites import single_dataset_integrity
integ_suite = single_dataset_integrity()

We will now run that suite on our dataframe:

[10]:
integ_suite.run(test_dataset=df)

Single Dataset Integrity Suite

The suite is composed of various checks such as: Mixed Nulls, String Mismatch, String Length Out Of Bounds, etc...
Each check may contain conditions (which will result in pass / fail / warning, represented by ✓ / ✖ / ! ) as well as other outputs such as plots or tables.
Suites, checks and conditions can all be modified (see the Create a Custom Suite tutorial).


Conditions Summary

Status Check Condition More Info
✖
Single Value in Column - Test Dataset Does not contain only a single value Found columns with a single value: ['has_ip', 'urlIsLive']
!
Data Duplicates - Test Dataset Duplicate data ratio is not greater than 0% Found 8.81E-3% duplicate data
✓
Mixed Nulls - Test Dataset Not more than 1 different null types
✓
Mixed Data Types - Test Dataset Rare data types in column are either more than 10.00% or less than 1.00% of the data
✓
String Mismatch - Test Dataset No string variants
✓
String Length Out Of Bounds - Test Dataset Ratio of outliers not greater than 0% string length outliers
✓
Special Characters - Test Dataset Ratio of entirely special character samples not greater than 0.10%

Check With Conditions Output

Single Value in Column - Test Dataset

Check if there are columns which have only a single unique value in all rows.

Conditions Summary
Status Condition More Info
✖
Does not contain only a single value Found columns with a single value: ['has_ip', 'urlIsLive']
Additional Outputs
The following columns have only one unique value
  has_ip urlIsLive
Single unique value 0 False


Data Duplicates - Test Dataset

Search for duplicate data in dataset.

Conditions Summary
Status Condition More Info
!
Duplicate data ratio is not greater than 0% Found 8.81E-3% duplicate data
Additional Outputs
8.81E-3% of data samples are duplicates
Each row in the table shows an example of duplicate data and the number of times it appears.
    target month scrape_date ext urlLength numDigits numParams num_%20 num_@ entropy has_ip hasHttp hasHttps urlIsLive dsr dse bodyLength numTitles numImages numLinks specialChars scriptLength sbr bscr sscr
Instances Number of Duplicates                                                  
4696, 4719 2 0 6 2019-06-06 other 123 28 4 0 0 -4.91 0 True False False 0 0 0 0 0 0 0 0 0.00 0.00 0.00


Check Without Conditions Output

No outputs to show.


Other Checks That Weren't Displayed

Check Reason
Label Ambiguity - Test Dataset DeepchecksValueError: Check requires dataset to be of type Dataset. instead got: DataFrame
Mixed Nulls - Test Dataset Nothing found
Mixed Data Types - Test Dataset Nothing found
String Mismatch - Test Dataset Nothing found
String Length Out Of Bounds - Test Dataset Nothing found
Special Characters - Test Dataset Nothing found

🔴 Understanding the checks’ results!¶

Ok, so we’ve got some interesting results! Even though this is quite a tidy dataset without any preprocessing, deepchecks has found a couple of columns (has_ip and urlIsLive) containing only a single value, and a couple of duplicate rows.

We also get a nice list of all checks that turned out ok, and what each check is about.

So nothing dramatic, but we will be sure to drop those useless columns. :)
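If we wanted to act on these findings by hand right away, it would look something like the snippet below. This is just for illustration; judging by the processed columns shown further down, the prebuilt preprocessing pipeline we use in the next section already excludes these columns.

# Drop the single-value columns flagged by the suite and de-duplicate rows.
clean_df = df.drop(columns=['has_ip', 'urlIsLive']).drop_duplicates()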

🔷 Preprocessing¶

Let’s split the data into train and test sets first. Since we want to examine how well a model can generalize from the past to the future, we’ll simply assign the first months of the dataset to the training set, and the last few months to the test set.

[11]:
raw_train_df = df[df.month <= 9]
len(raw_train_df)
[11]:
8626
[12]:
raw_test_df = df[df.month > 9]
len(raw_test_df)
[12]:
2724

Ok! Let’s process the data real quick and see how some baseline classifiers perform!

We’ll just set the scrape date as our index, drop a few useless columns, one-hot encode our categorical ext column and scale all numeric data:

[13]:
from deepchecks.datasets.classification.phishing import get_url_preprocessor
pipeline = get_url_preprocessor()
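For the curious, a hand-rolled preprocessor with roughly the same behavior might look like the sketch below. This is an illustrative stand-in, not the actual implementation of get_url_preprocessor; details such as the exact dropped columns are inferred from the processed output shown further down:

import pandas as pd
from sklearn.preprocessing import StandardScaler

class UrlPreprocessorSketch:
    """Rough stand-in for the pipeline returned by get_url_preprocessor():
    index by scrape_date, drop unneeded columns, one-hot encode 'ext', and
    scale all feature columns (the one-hot columns included, judging by the
    scaled ext_* values in the output below)."""

    def __init__(self):
        self.scaler = StandardScaler()

    def _prepare(self, raw_df):
        df = (raw_df.set_index('scrape_date')
                    .drop(columns=['month', 'has_ip', 'urlIsLive']))
        # NOTE: a real pipeline must also align the one-hot columns
        # between the train and test sets.
        return pd.get_dummies(df, columns=['ext'])

    def fit_transform(self, raw_df):
        df = self._prepare(raw_df)
        feats = df.columns.drop('target')
        df[feats] = self.scaler.fit_transform(df[feats])
        return df

    def transform(self, raw_df):
        df = self._prepare(raw_df)
        feats = df.columns.drop('target')
        df[feats] = self.scaler.transform(df[feats])
        return df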

Now we’ll fit on and transform the raw train dataframe:

[14]:
train_df = pipeline.fit_transform(raw_train_df)
train_X = train_df.drop('target', axis=1)
train_y = train_df['target']
train_X.head(3)
[14]:
urlLength numDigits numParams num_%20 num_@ entropy hasHttp hasHttps dsr dse bodyLength numTitles numImages numLinks specialChars scriptLength sbr bscr sscr ext_com ext_country ext_html ext_info ext_net ext_other ext_php
scrape_date
2019-01-01 -0.271569 -0.329581 -0.327303 -0.089699 -0.068846 0.314615 0.239243 -0.241671 0.280235 -0.356485 -0.125958 -0.255521 -0.264688 1.393957 -0.059321 -0.068217 0.753133 0.753298 -0.054849 -0.859105 -0.434899 -0.401599 -0.035733 3.553473 -0.426577 -0.226517
2019-01-01 0.917509 2.357675 -0.327303 5.663025 -0.068846 2.991389 0.239243 -0.241671 -1.093947 -0.629844 -0.254032 -0.344488 -0.290751 -0.358447 -0.269256 -0.282689 -1.087302 -0.414405 -0.174310 -0.859105 2.299385 -0.401599 -0.035733 -0.281415 -0.426577 -0.226517
2019-01-01 1.306246 -0.484615 6.957823 -0.089699 -0.068846 -0.421190 0.239243 -0.241671 0.406734 -0.480999 0.431238 0.189313 -0.160433 1.225340 0.517939 0.487306 0.953338 0.551243 -0.061609 -0.859105 -0.434899 -0.401599 -0.035733 3.553473 -0.426577 -0.226517

And apply the same fitted preprocessing pipeline (with the fitted scaler, for example) to the test dataframe:

[15]:
test_df = pipeline.transform(raw_test_df)
test_X = test_df.drop('target', axis=1)
test_y = test_df['target']
test_X.head(3)
[15]:
urlLength numDigits numParams num_%20 num_@ entropy hasHttp hasHttps dsr dse bodyLength numTitles numImages numLinks specialChars scriptLength sbr bscr sscr ext_com ext_country ext_html ext_info ext_net ext_other ext_php
scrape_date
2019-10-01 -0.500238 -0.691327 -0.327303 -0.089699 -0.068846 0.956667 0.239243 -0.241671 -1.093947 -0.629844 -0.381413 -0.344488 -0.395006 -0.593305 -0.355159 -0.290053 -1.218560 -2.042381 -0.189730 -0.859105 2.299385 -0.401599 -0.035733 -0.281415 -0.426577 -0.226517
2019-10-01 0.002834 0.238877 -0.327303 -0.089699 -0.068846 -0.498665 0.239243 -0.241671 -1.093947 -0.629844 10.879221 -0.136899 1.533700 0.153424 9.579742 8.281871 0.509814 0.087470 -0.034532 1.164002 -0.434899 -0.401599 -0.035733 -0.281415 -0.426577 -0.226517
2019-10-01 -0.614572 0.342233 -0.327303 -0.089699 -0.068846 -0.030503 0.239243 -0.241671 -0.247266 -0.266319 -0.200150 -0.314833 -0.082243 -0.448777 -0.127258 -0.174697 0.020147 0.559584 -0.098683 1.164002 -0.434899 -0.401599 -0.035733 -0.281415 -0.426577 -0.226517
[16]:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

hyperparameters = {'penalty': 'l2', 'fit_intercept': True, 'random_state': SEED, 'C': 0.009}
[17]:
logreg = LogisticRegression(**hyperparameters)
logreg.fit(train_X, train_y);
[18]:
pred_y = logreg.predict(test_X)
[19]:
accuracy_score(test_y, pred_y)
[19]:
0.9698972099853157

Ok, so we’ve got a nice accuracy score from the get go! Let’s see what deepchecks can tell us about our model…

[20]:
from deepchecks.suites import train_test_validation
[21]:
vsuite = train_test_validation()

First, we have to wrap the dataframes in deepchecks.Dataset objects to give the package a bit more context, namely what the label column is, whether we have a datetime column (we do, as an index, so we’ll set set_datetime_from_dataframe_index=True), and whether there are any categorical features (we have none after one-hot encoding them, so we’ll set cat_features=[] explicitly).

[22]:
ds_train = deepchecks.Dataset(df=train_X, label=train_y, set_datetime_from_dataframe_index=True, cat_features=[])
ds_test = deepchecks.Dataset(df=test_X, label=test_y, set_datetime_from_dataframe_index=True, cat_features=[])

Now we just have to provide the run method of the suite object with both the model and the Dataset objects.

[23]:
vsuite.run(model=logreg, train_dataset=ds_train, test_dataset=ds_test)

Train Test Validation Suite

The suite is composed of various checks such as: String Mismatch Comparison, Train Test Samples Mix, Date Train Test Leakage Duplicates, etc...
Each check may contain conditions (which will result in pass / fail / warning, represented by ✓ / ✖ / ! ) as well as other outputs such as plots or tables.
Suites, checks and conditions can all be modified (see the Create a Custom Suite tutorial).


Conditions Summary

Status Check Condition More Info
✖
Date Train-Test Leakage (overlap) Date leakage ratio is not greater than 0% Found 100% leaked dates
✓
Train Test Drift PSI <= 0.2 and Earth Mover's Distance <= 0.1
✓
Train Test Label Drift PSI <= 0.2 and Earth Mover's Distance <= 0.1 for label drift
✓
Whole Dataset Drift Drift value is not greater than 0.25
✓
Datasets Size Comparison Test-Train size ratio is not smaller than 0.01
✓
Single Feature Contribution Train-Test Train-Test features' Predictive Power Score (PPS) difference is not greater than 0.2
✓
Single Feature Contribution Train-Test Train features' Predictive Power Score (PPS) is not greater than 0.7
✓
Dominant Frequency Change Change in ratio of dominant value in data is not greater than 25.00%
✓
Category Mismatch Train Test Ratio of samples with a new category is not greater than 0%
✓
New Label Train Test Number of new label values is not greater than 0
✓
String Mismatch Comparison No new variants allowed in test data
✓
Date Train-Test Leakage (duplicates) Date leakage ratio is not greater than 0%

Check With Conditions Output

Date Train-Test Leakage (overlap)

Check test data that is dated earlier than latest date in train.

Conditions Summary
Status Condition More Info
✖
Date leakage ratio is not greater than 0% Found 100% leaked dates
Additional Outputs
100% of test data dates before last training data date (2020/01/15 00:00:00.000000 )


Train Test Drift

Calculate drift between train dataset and test dataset per feature, using statistical measures.

Conditions Summary
Status Condition More Info
✓
PSI <= 0.2 and Earth Mover's Distance <= 0.1
Additional Outputs
The Drift score is a measure for the difference between two distributions, in this check - the test and train distributions.
The check shows the drift score and distributions for the features, sorted by feature importance and showing only the top 5 features, according to feature importance.
If available, the plot titles also show the feature importance (FI) rank.

Train Test Label Drift

Calculate label drift between train dataset and test dataset, using statistical measures.

Conditions Summary
Status Condition More Info
✓
PSI <= 0.2 and Earth Mover's Distance <= 0.1 for label drift
Additional Outputs
The Drift score is a measure for the difference between two distributions, in this check - the test and train distributions.
The check shows the drift score and distributions for the label.

Whole Dataset Drift

Calculate drift between the entire train and test datasets using a model trained to distinguish between them.

Conditions Summary
Status Condition More Info
✓
Drift value is not greater than 0.25
Additional Outputs
The shown features are the features that are most important for the domain classifier - the domain_classifier trained to distinguish between the train and test datasets.
The percents of explained dataset difference are the calculated feature importance values for the feature.


Main features contributing to drift

* showing only the top 3 columns, you can change it using n_top_columns param

Datasets Size Comparison

Verify test dataset size comparing it to the train dataset size.

Conditions Summary
Status Condition More Info
✓
Test-Train size ratio is not smaller than 0.01
Additional Outputs
  Train Test
Size 8626 2724


Single Feature Contribution Train-Test

Return the Predictive Power Score of all features, in order to estimate each feature's ability to predict the label.

Conditions Summary
Status Condition More Info
✓
Train-Test features' Predictive Power Score (PPS) difference is not greater than 0.2
✓
Train features' Predictive Power Score (PPS) is not greater than 0.7
Additional Outputs
The PPS (Predictive Power Score) is used to estimate the ability of a feature to predict the label by itself.
In the graph above, we should suspect we have problems in our data if:
1. Train dataset PPS values are high:
Can indicate that this feature's success in predicting the label is actually due to data leakage,
meaning that the feature holds information that is based on the label to begin with.
2. Large difference between train and test PPS (train PPS is larger):
An even more powerful indication of data leakage, as a feature that was powerful in train but not in test
can be explained by leakage in train that is not relevant to a new dataset.
3. Large difference between test and train PPS (test PPS is larger):
An anomalous value, could indicate drift in test dataset that caused a coincidental correlation to the target label.


Check Without Conditions Output

No outputs to show.


Other Checks That Weren't Displayed

Check Reason
Train Test Samples Mix TypeError: int() argument must be a string, a bytes-like object or a number, not 'Timestamp'
Identifier Leakage - Train Dataset DeepchecksValueError: Dataset needs to have a date or index column.
Identifier Leakage - Test Dataset DeepchecksValueError: Dataset needs to have a date or index column.
Index Train Test Leakage DeepchecksValueError: Check requires dataset to have an index column
Dominant Frequency Change Nothing found
Category Mismatch Train Test Nothing found
New Label Train Test Nothing found
String Mismatch Comparison Nothing found
Date Train-Test Leakage (duplicates) Nothing found

🔴 Understanding the checks’ results!¶

Whoa! It looks like we have some time leakage!

The Conditions Summary section showed that the Date Train-Test Leakage (overlap) check was the only failed check. The Additional Outputs section helped us understand that the latest date in the train set belongs to January 2020!

It seems some entries from January 2020 made their way into the train set. We assumed the month column was enough to split the data with (which it would have been, had all the data indeed been from 2019), but as in real life, things were a bit messier. We’ll adjust our preprocessing real quick, and with methodological errors out of the way we’ll get to checking our model’s performance.

It is also worth mentioning that deepchecks found that urlLength is the only feature that, on its own, can predict the target with some measure of success. This is worth investigating!
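A quick way to follow up on that finding ourselves is to score a one-feature model on urlLength alone, which is roughly in the spirit of the Predictive Power Score the check computes (a simplified proxy, not the check's exact calculation):

from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# How well does urlLength alone separate the classes?
single_feature_model = DecisionTreeClassifier(max_depth=3, random_state=SEED)
scores = cross_val_score(single_feature_model, train_X[['urlLength']], train_y,
                         scoring='f1', cv=5)
print(f"F1 using urlLength alone: {scores.mean():.3f} (+/- {scores.std():.3f})")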

🔹 Adjusting our preprocessing and refitting the model¶

Let’s just drop any rows from 2020 from the raw dataframe and take it from there.

[24]:
df = df[~df['scrape_date'].str.contains('2020')]
df.shape
[24]:
(10896, 25)
[25]:
pipeline = get_url_preprocessor()
# Re-derive the raw train/test split from the filtered dataframe, so the
# 2020 rows are actually removed from the training set.
raw_train_df = df[df.month <= 9]
raw_test_df = df[df.month > 9]
[26]:
train_df = pipeline.fit_transform(raw_train_df)
train_X = train_df.drop('target', axis=1)
train_y = train_df['target']
train_X.head(3)
[26]:
urlLength numDigits numParams num_%20 num_@ entropy hasHttp hasHttps dsr dse bodyLength numTitles numImages numLinks specialChars scriptLength sbr bscr sscr ext_com ext_country ext_html ext_info ext_net ext_other ext_php
scrape_date
2019-01-01 -0.271569 -0.329581 -0.327303 -0.089699 -0.068846 0.314615 0.239243 -0.241671 0.280235 -0.356485 -0.125958 -0.255521 -0.264688 1.393957 -0.059321 -0.068217 0.753133 0.753298 -0.054849 -0.859105 -0.434899 -0.401599 -0.035733 3.553473 -0.426577 -0.226517
2019-01-01 0.917509 2.357675 -0.327303 5.663025 -0.068846 2.991389 0.239243 -0.241671 -1.093947 -0.629844 -0.254032 -0.344488 -0.290751 -0.358447 -0.269256 -0.282689 -1.087302 -0.414405 -0.174310 -0.859105 2.299385 -0.401599 -0.035733 -0.281415 -0.426577 -0.226517
2019-01-01 1.306246 -0.484615 6.957823 -0.089699 -0.068846 -0.421190 0.239243 -0.241671 0.406734 -0.480999 0.431238 0.189313 -0.160433 1.225340 0.517939 0.487306 0.953338 0.551243 -0.061609 -0.859105 -0.434899 -0.401599 -0.035733 3.553473 -0.426577 -0.226517
[27]:
test_df = pipeline.transform(raw_test_df)
test_X = test_df.drop('target', axis=1)
test_y = test_df['target']
test_X.head(3)
[27]:
urlLength numDigits numParams num_%20 num_@ entropy hasHttp hasHttps dsr dse bodyLength numTitles numImages numLinks specialChars scriptLength sbr bscr sscr ext_com ext_country ext_html ext_info ext_net ext_other ext_php
scrape_date
2019-10-01 -0.500238 -0.691327 -0.327303 -0.089699 -0.068846 0.956667 0.239243 -0.241671 -1.093947 -0.629844 -0.381413 -0.344488 -0.395006 -0.593305 -0.355159 -0.290053 -1.218560 -2.042381 -0.189730 -0.859105 2.299385 -0.401599 -0.035733 -0.281415 -0.426577 -0.226517
2019-10-01 0.002834 0.238877 -0.327303 -0.089699 -0.068846 -0.498665 0.239243 -0.241671 -1.093947 -0.629844 10.879221 -0.136899 1.533700 0.153424 9.579742 8.281871 0.509814 0.087470 -0.034532 1.164002 -0.434899 -0.401599 -0.035733 -0.281415 -0.426577 -0.226517
2019-10-01 -0.614572 0.342233 -0.327303 -0.089699 -0.068846 -0.030503 0.239243 -0.241671 -0.247266 -0.266319 -0.200150 -0.314833 -0.082243 -0.448777 -0.127258 -0.174697 0.020147 0.559584 -0.098683 1.164002 -0.434899 -0.401599 -0.035733 -0.281415 -0.426577 -0.226517
[28]:
logreg.fit(train_X, train_y)
[28]:
LogisticRegression(C=0.009, random_state=832)
[29]:
pred_y = logreg.predict(test_X)
[30]:
accuracy_score(test_y, pred_y)
[30]:
0.9698972099853157

🔷 Deepchecks’ Performance Checks¶

Ok! Now that we’re back on track, let’s run some performance checks to see how we did.

[31]:
from deepchecks.suites import model_evaluation
[32]:
msuite = model_evaluation()
[33]:
ds_train = deepchecks.Dataset(df=train_X, label=train_y, set_datetime_from_dataframe_index=True, cat_features=[])
ds_test = deepchecks.Dataset(df=test_X, label=test_y, set_datetime_from_dataframe_index=True, cat_features=[])
[34]:
msuite.run(model=logreg, train_dataset=ds_train, test_dataset=ds_test)

Model Evaluation Suite

The suite is composed of various checks such as: Simple Model Comparison, Roc Report, Unused Features, etc...
Each check may contain conditions (which will result in pass / fail / warning, represented by ✓ / ✖ / ! ) as well as other outputs such as plots or tables.
Suites, checks and conditions can all be modified (see the Create a Custom Suite tutorial).


Conditions Summary

Status Check Condition More Info
✖
Simple Model Comparison Model performance gain over simple model is not less than 10.00% Found metrics with gain below threshold: {'F1': {0: '2.34%', 1: '4.65%'}}
!
Model Error Analysis The performance difference of the detected segments must not be greater than 5.00% Found change in Accuracy in features above threshold: {'urlLength': '31.20%'}
!
Trust Score Comparison: Train vs. Test Mean trust score decline is not greater than 20.00% Found decline of: -97.21%
✓
Performance Report Train-Test scores relative degradation is not greater than 0.1
✓
ROC Report - Train Dataset AUC score for all the classes is not less than 0.7
✓
ROC Report - Test Dataset AUC score for all the classes is not less than 0.7
✓
Unused Features Number of high variance unused features is not greater than 5
✓
Model Inference Time Check - Train Dataset Average model inference time for one sample is not greater than 0.001
✓
Model Inference Time Check - Test Dataset Average model inference time for one sample is not greater than 0.001

Check With Conditions Output

Simple Model Comparison

Compare given model score to simple model score (according to given model type).

Conditions Summary
Status Condition More Info
✖
Model performance gain over simple model is not less than 10.00% Found metrics with gain below threshold: {'F1': {0: '2.34%', 1: '4.65%'}}
Additional Outputs

Model Error Analysis

Find features that best split the data into segments of high and low model error.

Conditions Summary
Status Condition More Info
!
The performance difference of the detected segments must not be greater than 5.00% Found change in Accuracy in features above threshold: {'urlLength': '31.20%'}
Additional Outputs
The following graphs show the distribution of error for top features that are most useful for distinguishing high error samples from low error samples.

Trust Score Comparison: Train vs. Test

Compares the model's trust scores of the train dataset with scores of the test dataset.

Conditions Summary
Status Condition More Info
!
Mean trust score decline is not greater than 20.00% Found decline of: -97.21%
Additional Outputs
Trust score measures the agreement between the classifier and a modified nearest-neighbor classifier on the testing example. Higher values represent samples that are "close" to training examples with the same label as sample prediction, and lower values represent samples that are "far" from training samples with labels matching their prediction. (arxiv 1805.11783)
The test trust score distribution should be quite similar to the train's. If it is skewed to the left, the confidence of the model in the test data is lower than the train, indicating a difference that may affect model performance on similar data. If it is skewed to the right, it indicates an underlying problem with the creation of the test dataset (test confidence isn't expected to be higher than train's).
Worst Trust Score Samples
  Trust Score Model Prediction target urlLength numDigits numParams num_%20 num_@ entropy hasHttp hasHttps dsr dse bodyLength numTitles numImages numLinks specialChars scriptLength sbr bscr sscr ext_com ext_country ext_html ext_info ext_net ext_other ext_php
scrape_date                                                          
2019-10-04 00:00:00 0.08 0 1 0.21 0.14 -0.33 -0.09 -0.07 -0.76 0.24 -0.24 -1.09 -0.63 -0.38 -0.34 -0.40 -0.59 -0.36 -0.29 -1.22 -2.04 -0.19 1.16 -0.43 -0.40 -0.04 -0.28 -0.43 -0.23
2019-12-10 00:00:00 0.04 0 1 1.95 0.24 -0.33 -0.09 -0.07 -3.82 0.24 -0.24 -1.09 -0.63 -0.38 -0.34 -0.37 -0.59 -0.36 -0.29 -1.22 -2.04 -0.19 -0.86 -0.43 -0.40 -0.04 3.55 -0.43 -0.23
2019-11-05 00:00:00 0.03 0 1 1.99 0.50 -0.33 -0.09 -0.07 -3.42 0.24 -0.24 -1.09 -0.63 -0.38 -0.34 -0.40 -0.59 -0.36 -0.29 -1.22 -2.04 -0.19 -0.86 2.30 -0.40 -0.04 -0.28 -0.43 -0.23
2019-12-17 00:00:00 0.03 0 1 0.37 -0.69 1.66 -0.09 -0.07 0.30 0.24 4.14 1.47 0.86 -0.38 -0.34 -0.40 -0.59 -0.36 -0.29 -1.22 -2.04 -0.19 -0.86 -0.43 -0.40 -0.04 3.55 -0.43 -0.23
2019-11-26 00:00:00 0.02 0 1 -1.62 -0.69 -0.33 -0.09 -0.07 1.06 0.24 -0.24 0.16 -0.27 -0.37 -0.31 -0.34 -0.56 -0.34 -0.28 0.48 0.57 -0.07 -0.86 -0.43 2.49 -0.04 -0.28 -0.43 -0.23
Top Trust Score Samples
  Trust Score Model Prediction target urlLength numDigits numParams num_%20 num_@ entropy hasHttp hasHttps dsr dse bodyLength numTitles numImages numLinks specialChars scriptLength sbr bscr sscr ext_com ext_country ext_html ext_info ext_net ext_other ext_php
scrape_date                                                          
2019-12-03 00:00:00 339388615871.85 0 0 0.21 0.70 2.32 -0.09 -0.07 -1.40 0.24 -0.24 -1.09 -0.63 -0.38 -0.34 -0.40 -0.59 -0.36 -0.29 -1.22 -2.04 -0.19 -0.86 -0.43 -0.40 -0.04 -0.28 2.34 -0.23
2019-11-29 00:00:00 254814779268.18 0 0 0.85 -0.43 2.32 -0.09 -0.07 -1.53 0.24 -0.24 -1.09 -0.63 -0.38 -0.34 -0.40 -0.59 -0.36 -0.29 -1.22 -2.04 -0.19 -0.86 2.30 -0.40 -0.04 -0.28 -0.43 -0.23
2019-11-20 00:00:00 254814779268.18 0 0 0.85 -0.43 2.32 -0.09 -0.07 -1.53 0.24 -0.24 -1.09 -0.63 -0.38 -0.34 -0.40 -0.59 -0.36 -0.29 -1.22 -2.04 -0.19 -0.86 2.30 -0.40 -0.04 -0.28 -0.43 -0.23
2019-11-20 00:00:00 9250.03 0 0 -0.50 -0.64 -0.33 -0.09 -0.07 0.57 0.24 -0.24 -1.09 -0.63 -0.38 -0.31 -0.40 -0.59 -0.35 -0.29 -1.22 -0.42 -0.19 -0.86 -0.43 2.49 -0.04 -0.28 -0.43 -0.23
2019-11-02 00:00:00 3167.88 0 0 -0.18 -0.54 4.97 -0.09 -0.07 -0.02 0.24 -0.24 -0.71 -0.52 -0.38 -0.31 -0.40 -0.59 -0.36 -0.29 -1.22 -2.04 -0.19 -0.86 -0.43 -0.40 -0.04 -0.28 -0.43 4.41


Performance Report

Summarize given scores on a dataset and model.

Conditions Summary
Status Condition More Info
✓
Train-Test scores relative degradation is not greater than 0.1
Additional Outputs

ROC Report - Train Dataset

Calculate the ROC curve for each class.

Conditions Summary
Status Condition More Info
✓
AUC score for all the classes is not less than 0.7
Additional Outputs
The marked points are the optimal threshold cut-off points. They are determined using Youden's index defined as sensitivity + specificity - 1


ROC Report - Test Dataset

Calculate the ROC curve for each class.

Conditions Summary
Status Condition More Info
✓
AUC score for all the classes is not less than 0.7
Additional Outputs
The marked points are the optimal threshold cut-off points. They are determined using Youden's index defined as sensitivity + specificity - 1


Unused Features

Detect features that are nearly unused by the model.

Conditions Summary
Status Condition More Info
✓
Number of high variance unused features is not greater than 5
Additional Outputs
Features above the line are a sample of the most important features, while the features below the line are the unused features with highest variance, as defined by check parameters

Model Inference Time Check - Train Dataset

Measure model average inference time (in seconds) per sample.

Conditions Summary
Status Condition More Info
✓
Average model inference time for one sample is not greater than 0.001
Additional Outputs
Average model inference time for one sample (in seconds): 8.23e-06


Model Inference Time Check - Test Dataset

Measure model average inference time (in seconds) per sample.

Conditions Summary
Status Condition More Info
✓
Average model inference time for one sample is not greater than 0.001
Additional Outputs
Average model inference time for one sample (in seconds): 4.46e-06


Check Without Conditions Output

Confusion Matrix Report - Train Dataset

Calculate the confusion matrix of the model on the given dataset.

Additional Outputs

Confusion Matrix Report - Test Dataset

Calculate the confusion matrix of the model on the given dataset.

Additional Outputs

Calibration Metric - Train Dataset

Calculate the calibration curve with brier score for each class.

Additional Outputs
Calibration curves (also known as reliability diagrams) compare how well the probabilistic predictions of a binary classifier are calibrated. It plots the true frequency of the positive label against its predicted probability, for binned predictions.
The Brier score metric may be used to assess how well a classifier is calibrated. For more info, please visit https://en.wikipedia.org/wiki/Brier_score


Calibration Metric - Test Dataset

Calculate the calibration curve with brier score for each class.

Additional Outputs
Calibration curves (also known as reliability diagrams) compare how well the probabilistic predictions of a binary classifier are calibrated. It plots the true frequency of the positive label against its predicted probability, for binned predictions.
The Brier score metric may be used to assess how well a classifier is calibrated. For more info, please visit https://en.wikipedia.org/wiki/Brier_score


Other Checks That Weren't Displayed

Check Reason
Regression Systematic Error - Train Dataset DeepchecksValueError: Expected model to be a type from ['regression'], but received model of type: binary
Regression Systematic Error - Test Dataset DeepchecksValueError: Expected model to be a type from ['regression'], but received model of type: binary
Regression Error Distribution - Train Dataset DeepchecksValueError: Expected model to be a type from ['regression'], but received model of type: binary
Regression Error Distribution - Test Dataset DeepchecksValueError: Expected model to be a type from ['regression'], but received model of type: binary
Boosting Overfit DeepchecksValueError: Unsupported model of type: LogisticRegression

🔴 Understanding the checks’ results!¶

We have one failed check and two warnings that look very important:

  • Simple Model Comparison - This check makes sure our model outperforms a very simple model by some margin. Having it fail means we might have a serious problem.

  • Model Error Analysis - This check analyzes model errors and tries to segment our data in a way that is informative for error analysis. It seems it found a valuable way to segment our data, error-wise, using the urlLength feature. We’ll look into it soon enough.

  • Trust Score Comparison - Found a very significant decline in the trust score between the train and test sets. This means that test samples are more likely to disagree with their counterparts (or neighbours) in the train set than we would want or expect, and thus our predictions on them are expected to be erroneous (see more in the paper introducing the trust score, and the simplified sketch below).
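To build intuition for what the trust score measures, here is a deliberately crude proxy - not the paper's actual algorithm, which uses per-class high-density sets and distance ratios - that checks how often each test sample's nearest neighbour in the train set agrees with the model's prediction for it:

from sklearn.neighbors import NearestNeighbors

# For each test sample, find its nearest train-set neighbour and check
# whether that neighbour's label matches the model's prediction.
# Low agreement roughly corresponds to low trust.
nn = NearestNeighbors(n_neighbors=1).fit(train_X)
_, idx = nn.kneighbors(test_X)
neighbour_labels = train_y.to_numpy()[idx.ravel()]
agreement = (neighbour_labels == pred_y).mean()
print(f"Nearest-train-neighbour agreement with predictions: {agreement:.2%}")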

Looking at the metric plots for F1 for both our model and a simple one, we see their performance is almost identical! How can this be? Fortunately, the confusion matrices automagically generated for both the training and test sets help us understand what has happened.

Our evidently over-regularized classifier was over-impressed by the majority class (0, or non-malicious URL), and predicted a value of 0 for almost all samples in both the train and the test set, which yielded a seemingly impressive 97% accuracy on the test set just due to the imbalanced nature of the problem.

deepchecks also generated plots of F1, precision and recall on both the train and test sets as part of the performance report, and these also help us see that recall scores are almost zero for both sets and understand what happened.
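We can confirm this ourselves with a quick per-class report on the test set, using the fitted logistic regression and its predictions from above:

from sklearn.metrics import classification_report

# Per-class precision/recall/F1. With an over-regularized model on
# imbalanced data, recall for class 1 (phishing) is expected to be
# near zero despite the high overall accuracy.
print(classification_report(test_y, pred_y, target_names=['benign', 'phishing']))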

Additionally, the best and worst trust score sample tables can help us identify the samples on which the classifier should possibly not be trusted. In this case, the worst trust score table is dominated by samples with target=1, also pointing us at a problem in generalizing the notion of maliciousness from the train set to the test set.

🔷 Trying out a different classifier¶

So let’s throw something a bit richer in expressive power at the problem - a decision tree! 🌲

[35]:
from sklearn.tree import DecisionTreeClassifier
model = DecisionTreeClassifier(criterion='entropy', splitter='random', random_state=SEED)
model.fit(train_X, train_y)
msuite.run(model=model, train_dataset=ds_train, test_dataset=ds_test)

Model Evaluation Suite

The suite is composed of various checks such as: Simple Model Comparison, Roc Report, Unused Features, etc...
Each check may contain conditions (which will result in pass / fail / warning, represented by ✓ / ✖ / ! ) as well as other outputs such as plots or tables.
Suites, checks and conditions can all be modified (see the Create a Custom Suite tutorial).


Conditions Summary

Status Check Condition More Info
✖
Performance Report Train-Test scores relative degradation is not greater than 0.1 F1 for class 1 (train=1 test=0.82) Precision for class 1 (train=1 test=0.79) Recall for class 1 (train=1 test=0.85)
!
Model Error Analysis The performance difference of the detected segments must not be greater than 5.00% Found change in Accuracy in features above threshold: {'urlLength': '5.73%'}
!
Trust Score Comparison: Train vs. Test Mean trust score decline is not greater than 20.00% Found decline of: -97.21%
!
Unused Features Number of high variance unused features is not greater than 5 Found number of unused high variance features above threshold: ['scriptLength', 'sscr', 'ext_info', 'numImages', 'num_@', 'hasHttp']
✓
ROC Report - Train Dataset AUC score for all the classes is not less than 0.7
✓
ROC Report - Test Dataset AUC score for all the classes is not less than 0.7
✓
Simple Model Comparison Model performance gain over simple model is not less than 10.00%
✓
Model Inference Time Check - Train Dataset Average model inference time for one sample is not greater than 0.001
✓
Model Inference Time Check - Test Dataset Average model inference time for one sample is not greater than 0.001

Check With Conditions Output

Performance Report

Summarize given scores on a dataset and model.

Conditions Summary
Status Condition More Info
✖
Train-Test scores relative degradation is not greater than 0.1 F1 for class 1 (train=1 test=0.82) Precision for class 1 (train=1 test=0.79) Recall for class 1 (train=1 test=0.85)
Additional Outputs

Model Error Analysis

Find features that best split the data into segments of high and low model error.

Conditions Summary
Status Condition More Info
!
The performance difference of the detected segments must not be greater than 5.00% Found change in Accuracy in features above threshold: {'urlLength': '5.73%'}
Additional Outputs
The following graphs show the distribution of error for top features that are most useful for distinguishing high error samples from low error samples.

Trust Score Comparison: Train vs. Test

Compares the model's trust scores of the train dataset with scores of the test dataset.

Conditions Summary
Status Condition More Info
!
Mean trust score decline is not greater than 20.00% Found decline of: -97.21%
Additional Outputs
Trust score measures the agreement between the classifier and a modified nearest-neighbor classifier on the testing example. Higher values represent samples that are "close" to training examples with the same label as sample prediction, and lower values represent samples that are "far" from training samples with labels matching their prediction. (arxiv 1805.11783)
The test trust score distribution should be quite similar to the train's. If it is skewed to the left, the confidence of the model in the test data is lower than the train, indicating a difference that may affect model performance on similar data. If it is skewed to the right, it indicates an underlying problem with the creation of the test dataset (test confidence isn't expected to be higher than train's).
Worst Trust Score Samples
  Trust Score Model Prediction target urlLength numDigits numParams num_%20 num_@ entropy hasHttp hasHttps dsr dse bodyLength numTitles numImages numLinks specialChars scriptLength sbr bscr sscr ext_com ext_country ext_html ext_info ext_net ext_other ext_php
scrape_date                                                          
2019-11-01 00:00:00 0.07 1 0 -0.57 0.45 -0.33 -0.09 -0.07 0.83 0.24 -0.24 -0.26 -0.20 -0.38 -0.31 -0.37 -0.59 -0.35 -0.29 -1.22 -0.14 -0.19 -0.86 -0.43 -0.40 -0.04 3.55 -0.43 -0.23
2019-11-14 00:00:00 0.07 1 0 -0.13 -0.69 -0.33 -0.09 -0.07 0.75 0.24 -0.24 -1.09 -0.63 -0.15 1.23 0.80 0.88 -0.05 -0.09 0.53 0.85 -0.07 -0.86 -0.43 2.49 -0.04 -0.28 -0.43 -0.23
2019-10-01 00:00:00 0.05 1 0 -0.45 -0.74 -0.33 -0.09 -0.07 0.67 0.24 -0.24 -1.09 -0.63 -0.24 -0.02 -0.32 0.44 -0.23 -0.19 0.36 0.09 -0.05 -0.86 -0.43 2.49 -0.04 -0.28 -0.43 -0.23
2019-12-24 00:00:00 0.04 1 0 -0.09 -0.69 -0.33 -0.09 -0.07 0.80 0.24 -0.24 -1.09 -0.63 -0.30 -0.05 0.02 0.12 -0.29 -0.26 0.05 0.69 -0.10 -0.86 -0.43 2.49 -0.04 -0.28 -0.43 -0.23
2019-10-30 00:00:00 0.04 1 0 -0.20 -0.74 -0.33 -0.09 -0.07 0.76 0.24 -0.24 -1.09 -0.63 -0.33 -0.05 0.02 0.12 -0.29 -0.26 0.03 0.72 -0.10 -0.86 -0.43 2.49 -0.04 -0.28 -0.43 -0.23
Top Trust Score Samples
  Trust Score Model Prediction target urlLength numDigits numParams num_%20 num_@ entropy hasHttp hasHttps dsr dse bodyLength numTitles numImages numLinks specialChars scriptLength sbr bscr sscr ext_com ext_country ext_html ext_info ext_net ext_other ext_php
scrape_date                                                          
2019-12-03 00:00:00 339388615871.85 0 0 0.21 0.70 2.32 -0.09 -0.07 -1.40 0.24 -0.24 -1.09 -0.63 -0.38 -0.34 -0.40 -0.59 -0.36 -0.29 -1.22 -2.04 -0.19 -0.86 -0.43 -0.40 -0.04 -0.28 2.34 -0.23
2019-11-29 00:00:00 254814779268.18 0 0 0.85 -0.43 2.32 -0.09 -0.07 -1.53 0.24 -0.24 -1.09 -0.63 -0.38 -0.34 -0.40 -0.59 -0.36 -0.29 -1.22 -2.04 -0.19 -0.86 2.30 -0.40 -0.04 -0.28 -0.43 -0.23
2019-11-20 00:00:00 254814779268.18 0 0 0.85 -0.43 2.32 -0.09 -0.07 -1.53 0.24 -0.24 -1.09 -0.63 -0.38 -0.34 -0.40 -0.59 -0.36 -0.29 -1.22 -2.04 -0.19 -0.86 2.30 -0.40 -0.04 -0.28 -0.43 -0.23
2019-11-20 00:00:00 9250.03 0 0 -0.50 -0.64 -0.33 -0.09 -0.07 0.57 0.24 -0.24 -1.09 -0.63 -0.38 -0.31 -0.40 -0.59 -0.35 -0.29 -1.22 -0.42 -0.19 -0.86 -0.43 2.49 -0.04 -0.28 -0.43 -0.23
2019-11-02 00:00:00 3167.88 0 0 -0.18 -0.54 4.97 -0.09 -0.07 -0.02 0.24 -0.24 -0.71 -0.52 -0.38 -0.31 -0.40 -0.59 -0.36 -0.29 -1.22 -2.04 -0.19 -0.86 -0.43 -0.40 -0.04 -0.28 -0.43 4.41


Unused Features

Detect features that are nearly unused by the model.

Conditions Summary
Status Condition More Info
!
Number of high variance unused features is not greater than 5 Found number of unused high variance features above threshold: ['scriptLength', 'sscr', 'ext_info', 'numImages', 'num_@', 'hasHttp']
Additional Outputs
Features above the line are a sample of the most important features, while the features below the line are the unused features with highest variance, as defined by check parameters

ROC Report - Train Dataset

Calculate the ROC curve for each class.

Conditions Summary
Status Condition More Info
✓
AUC score for all the classes is not less than 0.7
Additional Outputs
The marked points are the optimal threshold cut-off points. They are determined using Youden's index defined as sensitivity + specificity - 1


ROC Report - Test Dataset

Calculate the ROC curve for each class.

Conditions Summary
Status Condition More Info
✓
AUC score for all the classes is not less than 0.7
Additional Outputs
The marked points are the optimal threshold cut-off points. They are determined using Youden's index defined as sensitivity + specificity - 1


Simple Model Comparison

Compare given model score to simple model score (according to given model type).

Conditions Summary
Status Condition More Info
✓
Model performance gain over simple model is not less than 10.00%
Additional Outputs

Model Inference Time Check - Train Dataset

Measure model average inference time (in seconds) per sample.

Conditions Summary
Status Condition More Info
✓
Average model inference time for one sample is not greater than 0.001
Additional Outputs
Average model inference time for one sample (in seconds): 2.5e-06


Model Inference Time Check - Test Dataset

Measure model average inference time (in seconds) per sample.

Conditions Summary
Status Condition More Info
✓
Average model inference time for one sample is not greater than 0.001
Additional Outputs
Average model inference time for one sample (in seconds): 2.3e-06


Check Without Conditions Output

Confusion Matrix Report - Train Dataset

Calculate the confusion matrix of the model on the given dataset.

Additional Outputs

Confusion Matrix Report - Test Dataset

Calculate the confusion matrix of the model on the given dataset.

Additional Outputs

Calibration Metric - Train Dataset

Calculate the calibration curve with brier score for each class.

Additional Outputs
Calibration curves (also known as reliability diagrams) compare how well the probabilistic predictions of a binary classifier are calibrated. It plots the true frequency of the positive label against its predicted probability, for binned predictions.
The Brier score metric may be used to assess how well a classifier is calibrated. For more info, please visit https://en.wikipedia.org/wiki/Brier_score


Calibration Metric - Test Dataset

Calculate the calibration curve with brier score for each class.

Additional Outputs
Calibration curves (also known as reliability diagrams) compare how well the probabilistic predictions of a binary classifier are calibrated. It plots the true frequency of the positive label against its predicted probability, for binned predictions.
The Brier score metric may be used to assess how well a classifier is calibrated. For more info, please visit https://en.wikipedia.org/wiki/Brier_score


Other Checks That Weren't Displayed

Check Reason
Regression Systematic Error - Train Dataset DeepchecksValueError: Expected model to be a type from ['regression'], but received model of type: binary
Regression Systematic Error - Test Dataset DeepchecksValueError: Expected model to be a type from ['regression'], but received model of type: binary
Regression Error Distribution - Train Dataset DeepchecksValueError: Expected model to be a type from ['regression'], but received model of type: binary
Regression Error Distribution - Test Dataset DeepchecksValueError: Expected model to be a type from ['regression'], but received model of type: binary
Boosting Overfit DeepchecksValueError: Unsupported model of type: DecisionTreeClassifier

🔴 Understanding the checks’ results!¶

Right off the bat, deepchecks alerts us to a significant degradation in performance - in F1, precision and recall - from the train set to the test set. This immediately points us in the direction of overfitting.

Indeed, looking at the confusion matrices for the train and test sets, we can immediately see that our decision tree fits the train set perfectly but gets significantly lower performance on the test set. A classic indicator of overfitting, and decision trees are known for being prone to it.

In addition, the Model Error Analysis section helps us see that we underperform for samples with low urlLength values. This can help us with more advanced feature generation and selection later in the project.
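To poke at that segment ourselves, we can, for instance, compare test accuracy below and above the median urlLength - a crude manual version of the segmentation the check performs (the split point here is arbitrary):

# Compare test accuracy on short vs. long URLs. urlLength is scaled here,
# but scaling preserves the ordering, so a median split is still meaningful.
tree_pred = model.predict(test_X)
short = (test_X['urlLength'] <= test_X['urlLength'].median()).to_numpy()
print('accuracy on short URLs:', accuracy_score(test_y[short], tree_pred[short]))
print('accuracy on long URLs: ', accuracy_score(test_y[~short], tree_pred[~short]))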

🔷 Boosting our model!¶

To try to solve the overfitting issue, let’s throw at the problem an ensemble model with a bit more resilience to overfitting than a single decision tree: a gradient-boosted ensemble of them!

[36]:
from sklearn.ensemble import GradientBoostingClassifier
model = GradientBoostingClassifier(n_estimators=250, random_state=SEED, max_depth=20, subsample=0.8, loss='exponential')
model.fit(train_X, train_y)
msuite.run(model=model, train_dataset=ds_train, test_dataset=ds_test)

Model Evaluation Suite

The suite is composed of various checks such as: Simple Model Comparison, Roc Report, Unused Features, etc...
Each check may contain conditions (which will result in pass / fail / warning, represented by ✓ / ✖ / ! ) as well as other outputs such as plots or tables.
Suites, checks and conditions can all be modified (see the Create a Custom Suite tutorial).


Conditions Summary

Status Check Condition More Info
✖
Performance Report Train-Test scores relative degradation is not greater than 0.1 F1 for class 1 (train=1 test=0.87) Precision for class 1 (train=1 test=0.89) Recall for class 1 (train=1 test=0.86)
!
Trust Score Comparison: Train vs. Test Mean trust score decline is not greater than 20.00% Found decline of: -97.21%
!
Unused Features Number of high variance unused features is not greater than 5 Found number of unused high variance features above threshold: ['sscr', 'ext_info', 'ext_country', 'ext_html', 'ext_other', 'num_@', 'hasHttps', 'hasHttp', 'numLinks', 'ext_php']
✓
ROC Report - Train Dataset AUC score for all the classes is not less than 0.7
✓
ROC Report - Test Dataset AUC score for all the classes is not less than 0.7
✓
Simple Model Comparison Model performance gain over simple model is not less than 10.00%
✓
Model Error Analysis The performance difference of the detected segments must not be greater than 5.00%
✓
Boosting Overfit Test score over iterations doesn't decline by more than 5.00% from the best score
✓
Model Inference Time Check - Train Dataset Average model inference time for one sample is not greater than 0.001
✓
Model Inference Time Check - Test Dataset Average model inference time for one sample is not greater than 0.001

Check With Conditions Output

Performance Report

Summarize given scores on a dataset and model.

Conditions Summary
Status Condition More Info
✖
Train-Test scores relative degradation is not greater than 0.1 F1 for class 1 (train=1 test=0.87) Precision for class 1 (train=1 test=0.89) Recall for class 1 (train=1 test=0.86)
Additional Outputs

Trust Score Comparison: Train vs. Test

Compares the model's trust scores of the train dataset with scores of the test dataset.

Conditions Summary
Status Condition More Info
!
Mean trust score decline is not greater than 20.00% Found decline of: -97.21%
Additional Outputs
Trust score measures the agreement between the classifier and a modified nearest-neighbor classifier on the testing example. Higher values represent samples that are "close" to training examples with the same label as sample prediction, and lower values represent samples that are "far" from training samples with labels matching their prediction. (arxiv 1805.11783)
The test trust score distribution should be quite similar to the train's. If it is skewed to the left, the confidence of the model in the test data is lower than the train, indicating a difference that may affect model performance on similar data. If it is skewed to the right, it indicates an underlying problem with the creation of the test dataset (test confidence isn't expected to be higher than train's).
Worst Trust Score Samples
  Trust Score Model Prediction target urlLength numDigits numParams num_%20 num_@ entropy hasHttp hasHttps dsr dse bodyLength numTitles numImages numLinks specialChars scriptLength sbr bscr sscr ext_com ext_country ext_html ext_info ext_net ext_other ext_php
scrape_date                                                          
2019-10-20 00:00:00 0.27 1 1 -0.66 0.29 -0.33 -0.09 -0.07 -1.20 0.24 -0.24 -1.09 -0.63 -0.38 -0.34 -0.40 -0.59 -0.36 -0.29 -1.22 -2.04 -0.19 -0.86 2.30 -0.40 -0.04 -0.28 -0.43 -0.23
2019-10-05 00:00:00 0.18 1 0 0.05 -0.74 -0.33 -0.09 -0.07 0.52 0.24 -0.24 -1.09 -0.63 -0.20 0.22 -0.26 -0.12 -0.19 -0.12 1.13 0.18 0.01 -0.86 -0.43 2.49 -0.04 -0.28 -0.43 -0.23
2019-11-24 00:00:00 0.17 1 1 -0.96 -0.74 -0.33 -0.09 -0.07 0.18 0.24 -0.24 -1.09 -0.63 -0.36 -0.26 -0.16 -0.42 -0.34 -0.28 -0.16 0.50 -0.11 -0.86 -0.43 2.49 -0.04 -0.28 -0.43 -0.23
2019-12-25 00:00:00 0.16 1 0 -0.71 -0.74 -0.33 -0.09 -0.07 0.48 0.24 -0.24 -1.09 -0.63 -0.09 0.22 -0.29 -0.12 -0.19 -0.17 1.13 0.18 0.01 -0.86 -0.43 2.49 -0.04 -0.28 -0.43 -0.23
2019-11-28 00:00:00 0.02 1 0 -0.73 -0.64 -0.33 -0.09 -0.07 1.22 0.24 -0.24 0.63 -0.59 -0.38 -0.34 -0.40 -0.59 -0.36 -0.29 -1.22 -2.04 -0.19 1.16 -0.43 -0.40 -0.04 -0.28 -0.43 -0.23
Top Trust Score Samples
  Trust Score Model Prediction target urlLength numDigits numParams num_%20 num_@ entropy hasHttp hasHttps dsr dse bodyLength numTitles numImages numLinks specialChars scriptLength sbr bscr sscr ext_com ext_country ext_html ext_info ext_net ext_other ext_php
scrape_date                                                          
2019-12-03 00:00:00 339388615871.85 0 0 0.21 0.70 2.32 -0.09 -0.07 -1.40 0.24 -0.24 -1.09 -0.63 -0.38 -0.34 -0.40 -0.59 -0.36 -0.29 -1.22 -2.04 -0.19 -0.86 -0.43 -0.40 -0.04 -0.28 2.34 -0.23
2019-11-29 00:00:00 254814779268.18 0 0 0.85 -0.43 2.32 -0.09 -0.07 -1.53 0.24 -0.24 -1.09 -0.63 -0.38 -0.34 -0.40 -0.59 -0.36 -0.29 -1.22 -2.04 -0.19 -0.86 2.30 -0.40 -0.04 -0.28 -0.43 -0.23
2019-11-20 00:00:00 254814779268.18 0 0 0.85 -0.43 2.32 -0.09 -0.07 -1.53 0.24 -0.24 -1.09 -0.63 -0.38 -0.34 -0.40 -0.59 -0.36 -0.29 -1.22 -2.04 -0.19 -0.86 2.30 -0.40 -0.04 -0.28 -0.43 -0.23
2019-11-20 00:00:00 9250.03 0 0 -0.50 -0.64 -0.33 -0.09 -0.07 0.57 0.24 -0.24 -1.09 -0.63 -0.38 -0.31 -0.40 -0.59 -0.35 -0.29 -1.22 -0.42 -0.19 -0.86 -0.43 2.49 -0.04 -0.28 -0.43 -0.23
2019-11-02 00:00:00 3167.88 0 0 -0.18 -0.54 4.97 -0.09 -0.07 -0.02 0.24 -0.24 -0.71 -0.52 -0.38 -0.31 -0.40 -0.59 -0.36 -0.29 -1.22 -2.04 -0.19 -0.86 -0.43 -0.40 -0.04 -0.28 -0.43 4.41


Unused Features

Detect features that are nearly unused by the model.

Conditions Summary
Status Condition More Info
!
Number of high variance unused features is not greater than 5 Found number of unused high variance features above threshold: ['sscr', 'ext_info', 'ext_country', 'ext_html', 'ext_other', 'num_@', 'hasHttps', 'hasHttp', 'numLinks', 'ext_php']
Additional Outputs
Features above the line are a sample of the most important features, while the features below the line are the unused features with highest variance, as defined by check parameters

ROC Report - Train Dataset

Calculate the ROC curve for each class.

Conditions Summary
Status Condition More Info
✓
AUC score for all the classes is not less than 0.7
Additional Outputs
The marked points are the optimal threshold cut-off points. They are determined using Youden's index defined as sensitivity + specificity - 1


ROC Report - Test Dataset

Calculate the ROC curve for each class.

Conditions Summary
Status Condition More Info
✓
AUC score for all the classes is not less than 0.7
Additional Outputs
The marked points are the optimal threshold cut-off points. They are determined using Youden's index defined as sensitivity + specificity - 1


Simple Model Comparison

Compare given model score to simple model score (according to given model type).

Conditions Summary
Status Condition More Info
✓
Model performance gain over simple model is not less than 10.00%
Additional Outputs

Model Error Analysis

Find features that best split the data into segments of high and low model error.

Conditions Summary
Status Condition More Info
✓
The performance difference of the detected segments must not be greater than 5.00%
Additional Outputs
The following graphs show the distribution of error for top features that are most useful for distinguishing high error samples from low error samples.

Boosting Overfit

Check for overfit caused by using too many iterations in a gradient boosted model.

Conditions Summary
Status Condition More Info
✓
Test score over iterations doesn't decline by more than 5.00% from the best score
Additional Outputs
The check limits the boosting model to using up to N estimators each time, and plotting the Accuracy calculated for each subset of estimators for both the train dataset and the test dataset.

Model Inference Time Check - Train Dataset

Measure model average inference time (in seconds) per sample.

Conditions Summary
Status Condition More Info
✓
Average model inference time for one sample is not greater than 0.001
Additional Outputs
Average model inference time for one sample (in seconds): 3.533e-05


Model Inference Time Check - Test Dataset

Measure model average inference time (in seconds) per sample.

Conditions Summary
Status Condition More Info
✓
Average model inference time for one sample is not greater than 0.001
Additional Outputs
Average model inference time for one sample (in seconds): 3.311e-05


Check Without Conditions Output

Confusion Matrix Report - Train Dataset

Calculate the confusion matrix of the model on the given dataset.

Additional Outputs

Confusion Matrix Report - Test Dataset

Calculate the confusion matrix of the model on the given dataset.

Additional Outputs

Calibration Metric - Train Dataset

Calculate the calibration curve with brier score for each class.

Additional Outputs
Calibration curves (also known as reliability diagrams) compare how well the probabilistic predictions of a binary classifier are calibrated. It plots the true frequency of the positive label against its predicted probability, for binned predictions.
The Brier score metric may be used to assess how well a classifier is calibrated. For more info, please visit https://en.wikipedia.org/wiki/Brier_score


Calibration Metric - Test Dataset

Calculate the calibration curve with brier score for each class.

Additional Outputs
Calibration curves (also known as reliability diagrams) compare how well the probabilistic predictions of a binary classifier are calibrated. It plots the true frequency of the positive label against its predicted probability, for binned predictions.
The Brier score metric may be used to assess how well a classifier is calibrated. For more info, please visit https://en.wikipedia.org/wiki/Brier_score


Other Checks That Weren't Displayed

Check Reason
Regression Systematic Error - Train Dataset DeepchecksValueError: Expected model to be a type from ['regression'], but received model of type: binary
Regression Systematic Error - Test Dataset DeepchecksValueError: Expected model to be a type from ['regression'], but received model of type: binary
Regression Error Distribution - Train Dataset DeepchecksValueError: Expected model to be a type from ['regression'], but received model of type: binary
Regression Error Distribution - Test Dataset DeepchecksValueError: Expected model to be a type from ['regression'], but received model of type: binary

🔴 Understanding the checks’ results!¶

Again, deepchecks supplied some interesting insights, including a considerable performance degradation between the train and test sets. The degradation we witnessed before was only slightly mitigated.

However, for a boosted model we get a pretty cool Boosting Overfit check that plots the model’s accuracy over increasing boosting iterations. It helps us see that we might have a minor case of overfitting here: train set accuracy peaks rather early on, and while test set performance improves for a little while longer, it shows some degradation starting from iteration 135.

This at least points to possible value in adjusting the n_estimators parameter - either reducing it, or increasing it to see whether the degradation continues or the trend shifts.
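If we want to trace this by hand, GradientBoostingClassifier exposes staged predictions, so we can compute test accuracy after each boosting iteration ourselves:

# staged_predict yields predictions after each boosting stage, letting us
# trace test accuracy per iteration, much like the Boosting Overfit check.
staged_acc = [accuracy_score(test_y, stage_pred)
              for stage_pred in model.staged_predict(test_X)]
best_iter = int(np.argmax(staged_acc)) + 1
print(f"Best test accuracy {max(staged_acc):.4f} at iteration {best_iter} "
      f"of {len(staged_acc)}")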

🗞 Wrapping it all up!¶

We haven’t got a decent model yet, but deepchecks provides us with numerous tools to help us navigate our development and make better feature engineering and model selection decisions, by making critical issues in data drift, overfitting, leakage, feature importance and model calibration readily accessible.

And this is just what deepchecks can do out of the box, with the prebuilt checks and suites! There is a lot more potential in the way the package lends itself to easy customization and creation of checks and suites tailored to your needs. We will touch upon some such advanced uses in future guides.
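As a small taste, composing a custom suite from individual checks might look roughly like the sketch below. The class names and constructor signature here are illustrative and may differ between deepchecks versions; the Create a Custom Suite tutorial documents the actual API:

from deepchecks import Suite
from deepchecks.checks import DataDuplicates, IsSingleValue

# Illustrative only -- consult the Create a Custom Suite tutorial for the
# exact check names and Suite signature in your deepchecks version.
my_suite = Suite('My Phishing Validation Suite',
                 IsSingleValue(),
                 DataDuplicates())
my_suite.run(train_dataset=ds_train, test_dataset=ds_test, model=model)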

We hope, however, that this example already provides you with a good starting point for getting some immediate benefit out of deepchecks! Have fun, and reach out to us if you need assistance! :)