Classifying Malicious URLs¶
This notebook demonstrates how the deepchecks package can help you validate your basic data science workflow right out of the box!
The scenario is a real business use case: You work as a data scientist at a cyber security startup, and the company wants to provide the clients with a tool to automatically detect phishing attempts performed through emails and warn clients about them. The idea is to scan emails and determine for each web URL they include whether it points to a phishing-related web page or not.
Since phishing attempts are an ever-adapting effort, static blacklists or whitelists of good or bad URLs seen in the past are simply not enough to build a good filtering system for the future. The way the company chose to deal with this challenge is to have you train a machine learning model to generalize what a phishing URL looks like from historic data!
To enable you to do this, the company’s security team has collected a set of benign (meaning OK, or Kosher) URLs and phishing URLs observed during 2019 (not necessarily in clients’ emails). They have also written a script that extracts features they believe should help discern phishing URLs from benign ones.
These features are divided into three subsets:
* String Characteristics - Extracted from the URL string itself.
* Domain Characteristics - Extracted by interacting with the domain provider.
* Web Page Characteristics - Extracted from the content of the web page the URL points to.
The string characteristics are based on the way URLs are structured, and what their different parts do. Here is an informative illustration. You can read more at Mozilla’s What is a URL article. We’ll see the specific features soon.
[1]:
from IPython.display import Image

# Display the MDN illustration of the parts of a URL
Image(url="https://developer.mozilla.org/en-US/docs/Learn/Common_questions/What_is_a_URL/mdn-url-all.png")
[1]:
(Note: This is a slightly synthetic dataset based on a great project by Rohith Ramakrishnan and others, accompanied by a blog post. The authors have released it under an open license per our request, and for that we are very grateful to them.)
Installing requirements
[2]:
import sys
!{sys.executable} -m pip install deepchecks --quiet
🔷 Loading the data¶
OK, let’s take a look at the data!
[3]:
import numpy as np
import pandas as pd
import sklearn
import deepchecks

pd.set_option('display.max_columns', 45)
SEED = 832
np.random.seed(SEED)
[4]:
from deepchecks.datasets.classification.phishing import load_data
[5]:
df = load_data(data_format='dataframe', as_train_test=False)
[6]:
df.shape
[6]:
(11350, 25)
[7]:
df.head(5)
[7]:
| target | month | scrape_date | ext | urlLength | numDigits | numParams | num_%20 | num_@ | entropy | has_ip | hasHttp | hasHttps | urlIsLive | dsr | dse | bodyLength | numTitles | numImages | numLinks | specialChars | scriptLength | sbr | bscr | sscr |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0 | 1 | 2019-01-01 | net | 102 | 8 | 0 | 0 | 0 | -4.384032 | 0 | True | False | False | 4921 | 191 | 32486 | 3 | 5 | 330 | 9419 | 23919 | 0.736286 | 0.289940 | 2.539442 |
1 | 0 | 1 | 2019-01-01 | country | 154 | 60 | 0 | 2 | 0 | -3.566515 | 0 | True | False | False | 0 | 0 | 16199 | 0 | 4 | 39 | 2735 | 794 | 0.049015 | 0.168838 | 0.290311 |
2 | 0 | 1 | 2019-01-01 | net | 171 | 5 | 11 | 0 | 0 | -4.608755 | 0 | True | False | False | 5374 | 104 | 103344 | 18 | 9 | 302 | 27798 | 83817 | 0.811049 | 0.268985 | 2.412174 |
3 | 0 | 1 | 2019-01-01 | com | 94 | 10 | 0 | 0 | 0 | -4.548921 | 0 | True | False | False | 6107 | 466 | 34093 | 11 | 43 | 199 | 9087 | 19427 | 0.569824 | 0.266536 | 2.137889 |
4 | 0 | 1 | 2019-01-01 | other | 95 | 11 | 0 | 0 | 0 | -4.717188 | 0 | True | False | False | 3819 | 928 | 202 | 1 | 0 | 0 | 39 | 0 | 0.000000 | 0.193069 | 0.000000 |
Here is the actual list of features:
[8]:
df.columns
[8]:
Index(['target', 'month', 'scrape_date', 'ext', 'urlLength', 'numDigits',
'numParams', 'num_%20', 'num_@', 'entropy', 'has_ip', 'hasHttp',
'hasHttps', 'urlIsLive', 'dsr', 'dse', 'bodyLength', 'numTitles',
'numImages', 'numLinks', 'specialChars', 'scriptLength', 'sbr', 'bscr',
'sscr'],
dtype='object')
🔹 Feature List¶
And here is a short explanation of each:
Feature Name | Feature Group | Description |
---|---|---|
target | Meta Features | 0 if the URL is benign, 1 if it is related to phishing |
month | Meta Features | The month this URL was first encountered, as an int |
scrape_date | Meta Features | The exact date this URL was first encountered |
ext | String Characteristics | The domain extension |
urlLength | String Characteristics | The number of characters in the URL |
numDigits | String Characteristics | The number of digits in the URL |
numParams | String Characteristics | The number of query parameters in the URL |
num_%20 | String Characteristics | The number of ‘%20’ substrings in the URL |
num_@ | String Characteristics | The number of @ characters in the URL |
entropy | String Characteristics | The entropy of the URL |
has_ip | String Characteristics | True if the URL string contains an IP address |
hasHttp | Domain Characteristics | True if the URL’s domain supports http |
hasHttps | Domain Characteristics | True if the URL’s domain supports https |
urlIsLive | Domain Characteristics | The URL was live at the time of scraping |
dsr | Domain Characteristics | The number of days since domain registration |
dse | Domain Characteristics | The number of days since domain registration expired |
bodyLength | Web Page Characteristics | The number of characters in the URL’s web page |
numTitles | Web Page Characteristics | The number of HTML titles (H1/H2/…) in the page |
numImages | Web Page Characteristics | The number of images in the page |
numLinks | Web Page Characteristics | The number of links in the page |
specialChars | Web Page Characteristics | The number of special characters in the page |
scriptLength | Web Page Characteristics | The number of characters in scripts embedded in the page |
sbr | Web Page Characteristics | The ratio of scriptLength to bodyLength |
bscr | Web Page Characteristics | The ratio of specialChars to bodyLength |
sscr | Web Page Characteristics | The ratio of scriptLength to specialChars |
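To make the string characteristics and the three ratio features more concrete, here is a minimal sketch of how such values could be computed. The function names `string_features` and `page_ratios` are hypothetical and are not the security team's script. The entropy below is plain Shannon entropy in bits, which is an assumption: the dataset's entropy values are negative, so the original script evidently uses a different sign or normalization. The ratio directions follow the values in the first row of the `df.head(5)` output.

```python
import math
from collections import Counter
from urllib.parse import parse_qsl, urlsplit


def string_features(url: str) -> dict:
    """Hypothetical re-implementation of the string characteristics."""
    counts = Counter(url)
    total = len(url)
    # Shannon entropy of the character distribution, in bits (assumed definition).
    entropy = -sum((c / total) * math.log2(c / total) for c in counts.values()) if total else 0.0
    return {
        "urlLength": total,
        "numDigits": sum(ch.isdigit() for ch in url),
        "numParams": len(parse_qsl(urlsplit(url).query, keep_blank_values=True)),
        "num_%20": url.count("%20"),
        "num_@": url.count("@"),
        "entropy": entropy,
    }


def page_ratios(body_length: int, special_chars: int, script_length: int) -> dict:
    """Ratio features; a zero denominator yields 0.0, as in the all-zero rows of the data."""
    def ratio(num, den):
        return num / den if den else 0.0

    return {
        "sbr": ratio(script_length, body_length),
        "bscr": ratio(special_chars, body_length),
        "sscr": ratio(script_length, special_chars),
    }


# A made-up URL, and the page counts from the first row of the dataframe above.
features = string_features("http://example.com/a%20b?x=1&y=22")
ratios = page_ratios(body_length=32486, special_chars=9419, script_length=23919)
print(features)
print(ratios)
```

The ratios printed for the first row agree with the `sbr`, `bscr` and `sscr` values shown in the dataframe to the displayed precision.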
🔷 Data Integrity with Deepchecks!¶
The nice thing about the deepchecks package is that we can already use it out of the box! Instead of running a single check, we use a pre-defined test suite to run a host of data validation checks.
We think it’s valuable to start off with these types of suites as there are various issues we can identify at the get go just by looking at raw data.
We will first import the appropriate factory function from the deepchecks.suites module - in this case, an integrity suite tailored for a single dataset (as opposed to a division into a train and test, for example) - and use it to create a new suite object:
[9]:
from deepchecks.suites import single_dataset_integrity
integ_suite = single_dataset_integrity()
We will now run that suite on our dataframe:
[10]:
integ_suite.run(test_dataset=df)
Single Dataset Integrity Suite
The suite is composed of various checks such as: Mixed Nulls, String Mismatch, String Length Out Of Bounds, etc...
Each check may contain conditions (which will result in pass / fail / warning, represented by ✓ / ✖ / !) as well as other outputs such as plots or tables.
Suites, checks and conditions can all be modified (see the Create a Custom Suite tutorial).
Conditions Summary
Status | Check | Condition | More Info |
---|---|---|---|
✖ | Single Value in Column - Test Dataset | Does not contain only a single value | Found columns with a single value: ['has_ip', 'urlIsLive'] |
! | Data Duplicates - Test Dataset | Duplicate data ratio is not greater than 0% | Found 8.81E-3% duplicate data |
✓ | Mixed Nulls - Test Dataset | Not more than 1 different null types | |
✓ | Mixed Data Types - Test Dataset | Rare data types in column are either more than 10.00% or less than 1.00% of the data | |
✓ | String Mismatch - Test Dataset | No string variants | |
✓ | String Length Out Of Bounds - Test Dataset | Ratio of outliers not greater than 0% string length outliers | |
✓ | Special Characters - Test Dataset | Ratio of entirely special character samples not greater than 0.10% | |
Check With Conditions Output
Single Value in Column - Test Dataset
Check if there are columns which have only a single unique value in all rows.
Conditions Summary
Status | Condition | More Info |
---|---|---|
✖ | Does not contain only a single value | Found columns with a single value: ['has_ip', 'urlIsLive'] |
Additional Outputs
| has_ip | urlIsLive |
---|---|---|
Single unique value | 0 | False |
Data Duplicates - Test Dataset
Search for duplicate data in dataset.
Conditions Summary
Status | Condition | More Info |
---|---|---|
! | Duplicate data ratio is not greater than 0% | Found 8.81E-3% duplicate data |
Additional Outputs
Instances | Number of Duplicates | target | month | scrape_date | ext | urlLength | numDigits | numParams | num_%20 | num_@ | entropy | has_ip | hasHttp | hasHttps | urlIsLive | dsr | dse | bodyLength | numTitles | numImages | numLinks | specialChars | scriptLength | sbr | bscr | sscr |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
4696, 4719 | 2 | 0 | 6 | 2019-06-06 | other | 123 | 28 | 4 | 0 | 0 | -4.91 | 0 | True | False | False | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.00 | 0.00 | 0.00 |
Check Without Conditions Output
No outputs to show.
Other Checks That Weren't Displayed
Check | Reason |
---|---|
Label Ambiguity - Test Dataset | DeepchecksValueError: Check requires dataset to be of type Dataset. instead got: DataFrame |
Mixed Nulls - Test Dataset | Nothing found |
Mixed Data Types - Test Dataset | Nothing found |
String Mismatch - Test Dataset | Nothing found |
String Length Out Of Bounds - Test Dataset | Nothing found |
Special Characters - Test Dataset | Nothing found |
🔴 Understanding the checks’ results!¶
Ok, so we’ve got some interesting results! Even though this is quite a tidy dataset, even without any preprocessing, deepchecks has found a couple of columns (has_ip and urlIsLive) containing only a single value, and a pair of duplicate rows.
We also get a nice list of all checks that turned out ok, and what each check is about.
So nothing dramatic, but we will be sure to drop those useless columns. :)
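For intuition, the two findings can be reproduced by hand with plain pandas. The following is a minimal sketch on a made-up toy frame, not the suite's actual implementation:

```python
import pandas as pd

# Toy frame standing in for the real data (values are made up for illustration).
toy = pd.DataFrame({
    "target":    [0, 1, 0, 0],
    "urlLength": [102, 154, 123, 123],
    "has_ip":    [0, 0, 0, 0],
    "urlIsLive": [False, False, False, False],
})

# Columns holding a single unique value carry no information for a model.
constant_cols = [col for col in toy.columns if toy[col].nunique(dropna=False) == 1]

# Rows that exactly repeat an earlier row.
duplicate_rows = toy[toy.duplicated(keep="first")]

# Drop the constant columns and the repeated rows.
cleaned = toy.drop(columns=constant_cols).drop_duplicates()
print(constant_cols, len(duplicate_rows), cleaned.shape)
```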
🔷 Preprocessing¶
Let’s split the data into train and test sets first. Since we want to examine how well a model can generalize from the past to the future, we’ll simply assign the first months of the dataset to the training set, and the last few months to the test set.
[11]:
raw_train_df = df[df.month <= 9]
len(raw_train_df)
[11]:
8626
[12]:
raw_test_df = df[df.month > 9]
len(raw_test_df)
[12]:
2724
Ok! Let’s process the data real quick and see how some baseline classifiers perform!
We’ll just set the scrape date as our index, drop a few useless columns, one-hot encode our categorical ext column and scale all numeric data:
[13]:
from deepchecks.datasets.classification.phishing import get_url_preprocessor
pipeline = get_url_preprocessor()
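The internals of `get_url_preprocessor` are not shown here, so the following is only a rough pandas sketch of the steps just described, run on made-up toy frames. The dropped columns (`month`, `has_ip`, `urlIsLive`) are inferred from the columns missing in the transformed output, and scaling the one-hot columns as well is an assumption based on the non-binary `ext_*` values in that output. The helper name `encode` is hypothetical.

```python
import pandas as pd


def encode(df: pd.DataFrame):
    """Index by scrape date, drop uninformative columns, one-hot encode `ext`."""
    out = df.set_index("scrape_date").drop(columns=["month", "has_ip", "urlIsLive"])
    X = pd.get_dummies(out.drop(columns="target"), columns=["ext"]).astype(float)
    return X, out["target"]


toy_train = pd.DataFrame({
    "scrape_date": ["2019-01-01", "2019-02-01", "2019-03-01"],
    "month": [1, 2, 3], "ext": ["com", "net", "com"],
    "urlLength": [100, 150, 200], "hasHttp": [True, True, False],
    "has_ip": [0, 0, 0], "urlIsLive": [False, False, False],
    "target": [0, 1, 0],
})
toy_test = pd.DataFrame({
    "scrape_date": ["2019-10-01"], "month": [10], "ext": ["org"],
    "urlLength": [150], "hasHttp": [True],
    "has_ip": [0], "urlIsLive": [False], "target": [1],
})

toy_train_X, toy_train_y = encode(toy_train)
toy_test_X, toy_test_y = encode(toy_test)

# Align test columns to the training columns: categories unseen in training
# are dropped, training categories missing from the test set become 0.
toy_test_X = toy_test_X.reindex(columns=toy_train_X.columns, fill_value=0.0)

# Learn the scaling statistics on the training data only, then reuse them.
mean = toy_train_X.mean()
std = toy_train_X.std(ddof=0).replace(0, 1.0)
toy_train_X = (toy_train_X - mean) / std
toy_test_X = (toy_test_X - mean) / std
print(toy_test_X)
```

The point of the design is that nothing is ever fitted on the test data: both the set of dummy columns and the mean and standard deviation come from the training frame.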
Now we’ll fit on and transform the raw train dataframe:
[14]:
train_df = pipeline.fit_transform(raw_train_df)
train_X = train_df.drop('target', axis=1)
train_y = train_df['target']
train_X.head(3)
[14]:
scrape_date | urlLength | numDigits | numParams | num_%20 | num_@ | entropy | hasHttp | hasHttps | dsr | dse | bodyLength | numTitles | numImages | numLinks | specialChars | scriptLength | sbr | bscr | sscr | ext_com | ext_country | ext_html | ext_info | ext_net | ext_other | ext_php |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
2019-01-01 | -0.271569 | -0.329581 | -0.327303 | -0.089699 | -0.068846 | 0.314615 | 0.239243 | -0.241671 | 0.280235 | -0.356485 | -0.125958 | -0.255521 | -0.264688 | 1.393957 | -0.059321 | -0.068217 | 0.753133 | 0.753298 | -0.054849 | -0.859105 | -0.434899 | -0.401599 | -0.035733 | 3.553473 | -0.426577 | -0.226517 |
2019-01-01 | 0.917509 | 2.357675 | -0.327303 | 5.663025 | -0.068846 | 2.991389 | 0.239243 | -0.241671 | -1.093947 | -0.629844 | -0.254032 | -0.344488 | -0.290751 | -0.358447 | -0.269256 | -0.282689 | -1.087302 | -0.414405 | -0.174310 | -0.859105 | 2.299385 | -0.401599 | -0.035733 | -0.281415 | -0.426577 | -0.226517 |
2019-01-01 | 1.306246 | -0.484615 | 6.957823 | -0.089699 | -0.068846 | -0.421190 | 0.239243 | -0.241671 | 0.406734 | -0.480999 | 0.431238 | 0.189313 | -0.160433 | 1.225340 | 0.517939 | 0.487306 | 0.953338 | 0.551243 | -0.061609 | -0.859105 | -0.434899 | -0.401599 | -0.035733 | 3.553473 | -0.426577 | -0.226517 |
And apply the same fitted preprocessing pipeline (with the fitted scaler, for example) to the test dataframe:
[15]:
test_df = pipeline.transform(raw_test_df)
test_X = test_df.drop('target', axis=1)
test_y = test_df['target']
test_X.head(3)
[15]:
scrape_date | urlLength | numDigits | numParams | num_%20 | num_@ | entropy | hasHttp | hasHttps | dsr | dse | bodyLength | numTitles | numImages | numLinks | specialChars | scriptLength | sbr | bscr | sscr | ext_com | ext_country | ext_html | ext_info | ext_net | ext_other | ext_php |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
2019-10-01 | -0.500238 | -0.691327 | -0.327303 | -0.089699 | -0.068846 | 0.956667 | 0.239243 | -0.241671 | -1.093947 | -0.629844 | -0.381413 | -0.344488 | -0.395006 | -0.593305 | -0.355159 | -0.290053 | -1.218560 | -2.042381 | -0.189730 | -0.859105 | 2.299385 | -0.401599 | -0.035733 | -0.281415 | -0.426577 | -0.226517 |
2019-10-01 | 0.002834 | 0.238877 | -0.327303 | -0.089699 | -0.068846 | -0.498665 | 0.239243 | -0.241671 | -1.093947 | -0.629844 | 10.879221 | -0.136899 | 1.533700 | 0.153424 | 9.579742 | 8.281871 | 0.509814 | 0.087470 | -0.034532 | 1.164002 | -0.434899 | -0.401599 | -0.035733 | -0.281415 | -0.426577 | -0.226517 |
2019-10-01 | -0.614572 | 0.342233 | -0.327303 | -0.089699 | -0.068846 | -0.030503 | 0.239243 | -0.241671 | -0.247266 | -0.266319 | -0.200150 | -0.314833 | -0.082243 | -0.448777 | -0.127258 | -0.174697 | 0.020147 | 0.559584 | -0.098683 | 1.164002 | -0.434899 | -0.401599 | -0.035733 | -0.281415 | -0.426577 | -0.226517 |
[16]:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# L2-regularized logistic regression with strong regularization (small C)
hyperparameters = {'penalty': 'l2', 'fit_intercept': True, 'random_state': SEED, 'C': 0.009}
[17]:
logreg = LogisticRegression(**hyperparameters)
logreg.fit(train_X, train_y);
[18]:
pred_y = logreg.predict(test_X)
[19]:
accuracy_score(test_y, pred_y)
[19]:
0.9698972099853157
Ok, so we’ve got a nice accuracy score from the get go! Let’s see what deepchecks can tell us about our model…
[20]:
from deepchecks.suites import train_test_validation
[21]:
vsuite = train_test_validation()
First, we have to wrap the dataframes in deepchecks.Dataset objects to give the package a bit more context, namely what the label column is, whether we have a datetime column (we do, as the index, so we’ll set set_datetime_from_dataframe_index=True), and whether we have any categorical features (we have none after one-hot encoding them, so we’ll set cat_features=[] explicitly).
[22]:
ds_train = deepchecks.Dataset(df=train_X, label=train_y, set_datetime_from_dataframe_index=True, cat_features=[])
ds_test = deepchecks.Dataset(df=test_X, label=test_y, set_datetime_from_dataframe_index=True, cat_features=[])
Now we just have to provide the run method of the suite object with both the model and the Dataset objects.
[23]:
vsuite.run(model=logreg, train_dataset=ds_train, test_dataset=ds_test)
Train Test Validation Suite
The suite is composed of various checks such as: String Mismatch Comparison, Train Test Samples Mix, Date Train Test Leakage Duplicates, etc...
Each check may contain conditions (which will result in pass / fail / warning, represented by ✓ / ✖ / !) as well as other outputs such as plots or tables.
Suites, checks and conditions can all be modified (see the Create a Custom Suite tutorial).
Conditions Summary
Status | Check | Condition | More Info |
---|---|---|---|
✖ | Date Train-Test Leakage (overlap) | Date leakage ratio is not greater than 0% | Found 100% leaked dates |
✓ | Train Test Drift | PSI <= 0.2 and Earth Mover's Distance <= 0.1 | |
✓ | Train Test Label Drift | PSI <= 0.2 and Earth Mover's Distance <= 0.1 for label drift | |
✓ | Whole Dataset Drift | Drift value is not greater than 0.25 | |
✓ | Datasets Size Comparison | Test-Train size ratio is not smaller than 0.01 | |
✓ | Single Feature Contribution Train-Test | Train-Test features' Predictive Power Score (PPS) difference is not greater than 0.2 | |
✓ | Single Feature Contribution Train-Test | Train features' Predictive Power Score (PPS) is not greater than 0.7 | |
✓ | Dominant Frequency Change | Change in ratio of dominant value in data is not greater than 25.00% | |
✓ | Category Mismatch Train Test | Ratio of samples with a new category is not greater than 0% | |
✓ | New Label Train Test | Number of new label values is not greater than 0 | |
✓ | String Mismatch Comparison | No new variants allowed in test data | |
✓ | Date Train-Test Leakage (duplicates) | Date leakage ratio is not greater than 0% | |
Check With Conditions Output
Date Train-Test Leakage (overlap)
Check test data that is dated earlier than latest date in train.
Conditions Summary
Status | Condition | More Info |
---|---|---|
✖ | Date leakage ratio is not greater than 0% | Found 100% leaked dates |
Additional Outputs
Train Test Drift
Calculate drift between train dataset and test dataset per feature, using statistical measures.
Conditions Summary
Status | Condition | More Info |
---|---|---|
✓ | PSI <= 0.2 and Earth Mover's Distance <= 0.1 | |
Additional Outputs
The check shows the drift score and distributions for the features, sorted by feature importance and showing only the top 5 features, according to feature importance.
If available, the plot titles also show the feature importance (FI) rank.
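As a reminder of what the PSI threshold in this condition measures, here is a minimal sketch of the Population Stability Index for a categorical feature, PSI = Σ (qᵢ − pᵢ) · ln(qᵢ / pᵢ), where pᵢ and qᵢ are the train and test shares of category i. The function name `psi` and the small `eps` floor for empty categories are assumptions of this sketch; the library's own computation may differ in its details.

```python
import math
from collections import Counter


def psi(expected, actual, eps=1e-6):
    """Population Stability Index between two samples of category labels."""
    expected_counts, actual_counts = Counter(expected), Counter(actual)
    total = 0.0
    for category in set(expected) | set(actual):
        # Floor the shares at eps so a category missing on one side stays finite.
        p = max(expected_counts[category] / len(expected), eps)
        q = max(actual_counts[category] / len(actual), eps)
        total += (q - p) * math.log(q / p)
    return total


toy_train_ext = ["com"] * 80 + ["net"] * 20
toy_test_ext = ["com"] * 50 + ["net"] * 50
print(psi(toy_train_ext, toy_train_ext))  # identical distributions
print(psi(toy_train_ext, toy_test_ext))   # shifted distribution
```

Identical distributions give a PSI of zero, and the score grows as the category shares drift apart, which is why the condition is phrased as an upper bound.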
Train Test Label Drift
Calculate label drift between train dataset and test dataset, using statistical measures.
Conditions Summary
Status | Condition | More Info |
---|---|---|
✓ | PSI <= 0.2 and Earth Mover's Distance <= 0.1 for label drift | |
Additional Outputs
The check shows the drift score and distributions for the label.
Whole Dataset Drift
Calculate drift between the entire train and test datasets using a model trained to distinguish between them.
Conditions Summary
Status | Condition | More Info |
---|---|---|
✓ | Drift value is not greater than 0.25 | |
Additional Outputs
The percents of explained dataset difference are the calculated feature importance values for the feature.
Main features contributing to drift
Datasets Size Comparison
Verify test dataset size comparing it to the train dataset size.
Conditions Summary
Status | Condition | More Info |
---|---|---|
✓ | Test-Train size ratio is not smaller than 0.01 | |
Additional Outputs
| Train | Test |
---|---|---|
Size | 8626 | 2724 |
Single Feature Contribution Train-Test
Return the Predictive Power Score of all features, in order to estimate each feature's ability to predict the label.
Conditions Summary
Status | Condition | More Info |
---|---|---|
✓ | Train-Test features' Predictive Power Score (PPS) difference is not greater than 0.2 | |
✓ | Train features' Predictive Power Score (PPS) is not greater than 0.7 | |
Additional Outputs
Check Without Conditions Output
No outputs to show.
Other Checks That Weren't Displayed
Check | Reason |
---|---|
Train Test Samples Mix | TypeError: int() argument must be a string, a bytes-like object or a number, not 'Timestamp' |
Identifier Leakage - Train Dataset | DeepchecksValueError: Dataset needs to have a date or index column. |
Identifier Leakage - Test Dataset | DeepchecksValueError: Dataset needs to have a date or index column. |
Index Train Test Leakage | DeepchecksValueError: Check requires dataset to have an index column |
Dominant Frequency Change | Nothing found |
Category Mismatch Train Test | Nothing found |
New Label Train Test | Nothing found |
String Mismatch Comparison | Nothing found |
Date Train-Test Leakage (duplicates) | Nothing found |
🔴 Understanding the checks’ results!¶
Whoa! It looks like we have some time leakage!
The Conditions Summary section showed that the Date Train-Test Leakage (overlap) check was the only failed check. The Additional Outputs section helped us understand that the latest date in the train set belongs to January 2020!
It seems some entries from January 2020 made their way into the train set. We assumed the month column was enough to split the data (which it would have been, had all the data indeed been from 2019), but as in real life, things were a bit messy. We’ll adjust our preprocessing real quick, and with methodological errors out of the way we’ll get to checking our model’s performance.
It is also worth mentioning that deepchecks found that urlLength is the only feature that alone can predict the target with some measure of success. This is worth investigating!
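To see how a month-only split can leak future rows into the train set, consider this minimal pandas sketch on made-up dates. The leakage ratio here is computed as the share of test rows dated earlier than the latest train date, following the check's description; the exact comparison the library uses is an assumption.

```python
import pandas as pd

# Toy data: the last row is from January 2020, so its month is 1 again.
toy = pd.DataFrame({
    "scrape_date": ["2019-01-15", "2019-09-30", "2019-10-02", "2019-12-20", "2020-01-05"],
    "month": [1, 9, 10, 12, 1],
})
toy["scrape_date"] = pd.to_datetime(toy["scrape_date"])

# Splitting on month alone puts the January 2020 row in the train set.
train_by_month = toy[toy.month <= 9]
test_by_month = toy[toy.month > 9]
latest_train = train_by_month["scrape_date"].max()
leak_ratio = (test_by_month["scrape_date"] < latest_train).mean()

# Splitting on the full date keeps the train set strictly in the past.
cutoff = pd.Timestamp("2019-10-01")
train_by_date = toy[toy["scrape_date"] < cutoff]
test_by_date = toy[toy["scrape_date"] >= cutoff]
print(leak_ratio, train_by_date["scrape_date"].max(), test_by_date["scrape_date"].min())
```

A single stray 2020 row in the train set is enough to make every test date "earlier than the latest train date", which is how a handful of rows can produce a 100% leakage ratio. Splitting on a date cutoff is one alternative; this notebook instead removes the 2020 rows.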
🔹 Adjusting our preprocessing and refitting the model¶
Let’s just drop every row from 2020 from the raw dataframe and take it all from there:
[24]:
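df = df[~df['scrape_date'].str.contains('2020')]
raw_train_df = df[df.month <= 9]  # re-split so the 2020 rows leave the train set
raw_test_df = df[df.month > 9]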
df.shape
[24]:
(10896, 25)
[25]:
pipeline = get_url_preprocessor()
[26]:
train_df = pipeline.fit_transform(raw_train_df)
train_X = train_df.drop('target', axis=1)
train_y = train_df['target']
train_X.head(3)
[26]:
scrape_date | urlLength | numDigits | numParams | num_%20 | num_@ | entropy | hasHttp | hasHttps | dsr | dse | bodyLength | numTitles | numImages | numLinks | specialChars | scriptLength | sbr | bscr | sscr | ext_com | ext_country | ext_html | ext_info | ext_net | ext_other | ext_php |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
2019-01-01 | -0.271569 | -0.329581 | -0.327303 | -0.089699 | -0.068846 | 0.314615 | 0.239243 | -0.241671 | 0.280235 | -0.356485 | -0.125958 | -0.255521 | -0.264688 | 1.393957 | -0.059321 | -0.068217 | 0.753133 | 0.753298 | -0.054849 | -0.859105 | -0.434899 | -0.401599 | -0.035733 | 3.553473 | -0.426577 | -0.226517 |
2019-01-01 | 0.917509 | 2.357675 | -0.327303 | 5.663025 | -0.068846 | 2.991389 | 0.239243 | -0.241671 | -1.093947 | -0.629844 | -0.254032 | -0.344488 | -0.290751 | -0.358447 | -0.269256 | -0.282689 | -1.087302 | -0.414405 | -0.174310 | -0.859105 | 2.299385 | -0.401599 | -0.035733 | -0.281415 | -0.426577 | -0.226517 |
2019-01-01 | 1.306246 | -0.484615 | 6.957823 | -0.089699 | -0.068846 | -0.421190 | 0.239243 | -0.241671 | 0.406734 | -0.480999 | 0.431238 | 0.189313 | -0.160433 | 1.225340 | 0.517939 | 0.487306 | 0.953338 | 0.551243 | -0.061609 | -0.859105 | -0.434899 | -0.401599 | -0.035733 | 3.553473 | -0.426577 | -0.226517 |
[27]:
test_df = pipeline.transform(raw_test_df)
test_X = test_df.drop('target', axis=1)
test_y = test_df['target']
test_X.head(3)
[27]:
scrape_date | urlLength | numDigits | numParams | num_%20 | num_@ | entropy | hasHttp | hasHttps | dsr | dse | bodyLength | numTitles | numImages | numLinks | specialChars | scriptLength | sbr | bscr | sscr | ext_com | ext_country | ext_html | ext_info | ext_net | ext_other | ext_php |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
2019-10-01 | -0.500238 | -0.691327 | -0.327303 | -0.089699 | -0.068846 | 0.956667 | 0.239243 | -0.241671 | -1.093947 | -0.629844 | -0.381413 | -0.344488 | -0.395006 | -0.593305 | -0.355159 | -0.290053 | -1.218560 | -2.042381 | -0.189730 | -0.859105 | 2.299385 | -0.401599 | -0.035733 | -0.281415 | -0.426577 | -0.226517 |
2019-10-01 | 0.002834 | 0.238877 | -0.327303 | -0.089699 | -0.068846 | -0.498665 | 0.239243 | -0.241671 | -1.093947 | -0.629844 | 10.879221 | -0.136899 | 1.533700 | 0.153424 | 9.579742 | 8.281871 | 0.509814 | 0.087470 | -0.034532 | 1.164002 | -0.434899 | -0.401599 | -0.035733 | -0.281415 | -0.426577 | -0.226517 |
2019-10-01 | -0.614572 | 0.342233 | -0.327303 | -0.089699 | -0.068846 | -0.030503 | 0.239243 | -0.241671 | -0.247266 | -0.266319 | -0.200150 | -0.314833 | -0.082243 | -0.448777 | -0.127258 | -0.174697 | 0.020147 | 0.559584 | -0.098683 | 1.164002 | -0.434899 | -0.401599 | -0.035733 | -0.281415 | -0.426577 | -0.226517 |
[28]:
logreg.fit(train_X, train_y)
[28]:
LogisticRegression(C=0.009, random_state=832)
[29]:
pred_y = logreg.predict(test_X)
[30]:
accuracy_score(test_y, pred_y)
[30]:
0.9698972099853157
🔷 Deepchecks’ Performance Checks¶
Ok! Now that we’re back on track, let’s run some performance checks to see how we did.
[31]:
from deepchecks.suites import model_evaluation
[32]:
msuite = model_evaluation()
[33]:
ds_train = deepchecks.Dataset(df=train_X, label=train_y, set_datetime_from_dataframe_index=True, cat_features=[])
ds_test = deepchecks.Dataset(df=test_X, label=test_y, set_datetime_from_dataframe_index=True, cat_features=[])
[34]:
msuite.run(model=logreg, train_dataset=ds_train, test_dataset=ds_test)
Model Evaluation Suite
The suite is composed of various checks such as: Simple Model Comparison, Roc Report, Unused Features, etc...
Each check may contain conditions (which will result in pass / fail / warning, represented by ✓ / ✖ / !) as well as other outputs such as plots or tables.
Suites, checks and conditions can all be modified (see the Create a Custom Suite tutorial).
Conditions Summary
Status | Check | Condition | More Info |
---|---|---|---|
✖ | Simple Model Comparison | Model performance gain over simple model is not less than 10.00% | Found metrics with gain below threshold: {'F1': {0: '2.34%', 1: '4.65%'}} |
! | Model Error Analysis | The performance difference of the detected segments must not be greater than 5.00% | Found change in Accuracy in features above threshold: {'urlLength': '31.20%'} |
! | Trust Score Comparison: Train vs. Test | Mean trust score decline is not greater than 20.00% | Found decline of: -97.21% |
✓ | Performance Report | Train-Test scores relative degradation is not greater than 0.1 | |
✓ | ROC Report - Train Dataset | AUC score for all the classes is not less than 0.7 | |
✓ | ROC Report - Test Dataset | AUC score for all the classes is not less than 0.7 | |
✓ | Unused Features | Number of high variance unused features is not greater than 5 | |
✓ | Model Inference Time Check - Train Dataset | Average model inference time for one sample is not greater than 0.001 | |
✓ | Model Inference Time Check - Test Dataset | Average model inference time for one sample is not greater than 0.001 | |
Check With Conditions Output
Simple Model Comparison
Compare given model score to simple model score (according to given model type).
Conditions Summary
Status | Condition | More Info |
---|---|---|
✖ | Model performance gain over simple model is not less than 10.00% | Found metrics with gain below threshold: {'F1': {0: '2.34%', 1: '4.65%'}} |
Additional Outputs
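To give a feel for what a per-class "gain over a simple model" can mean, here is a small sketch in plain Python. It assumes the gain is the share of the remaining distance to a perfect score that the model closes, (model − simple) / (perfect − simple), and it uses a constant majority-class predictor as the simple model. Both are assumptions of this sketch, and the library's exact formula and baseline may differ. The helper names are hypothetical and the labels are made up.

```python
def f1_for_class(y_true, y_pred, cls):
    """F1 score treating `cls` as the positive class."""
    tp = sum(t == cls and p == cls for t, p in zip(y_true, y_pred))
    fp = sum(t != cls and p == cls for t, p in zip(y_true, y_pred))
    fn = sum(t == cls and p != cls for t, p in zip(y_true, y_pred))
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def gain(model_score, simple_score, perfect_score=1.0):
    """Assumed definition: fraction of the gap to a perfect score that is closed."""
    if perfect_score == simple_score:
        return 0.0
    return (model_score - simple_score) / (perfect_score - simple_score)


y_true = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
y_simple = [0] * len(y_true)              # always predict the majority class
y_model = [0, 0, 0, 0, 0, 1, 1, 1, 1, 0]  # made-up model predictions

gains = {
    cls: gain(f1_for_class(y_true, y_model, cls), f1_for_class(y_true, y_simple, cls))
    for cls in (0, 1)
}
print(gains)
```

Under this definition a model can have high accuracy and still show a small gain, if the simple model already scores well on the same metric.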
Model Error Analysis
Find features that best split the data into segments of high and low model error.
Conditions Summary
Status | Condition | More Info |
---|---|---|
! | The performance difference of the detected segments must not be greater than 5.00% | Found change in Accuracy in features above threshold: {'urlLength': '31.20%'} |
Additional Outputs
Trust Score Comparison: Train vs. Test
Compares the model's trust scores of the train dataset with scores of the test dataset.
Conditions Summary
Status | Condition | More Info |
---|---|---|
! | Mean trust score decline is not greater than 20.00% | Found decline of: -97.21% |
Additional Outputs
Worst Trust Score Samples
scrape_date | Trust Score | Model Prediction | target | urlLength | numDigits | numParams | num_%20 | num_@ | entropy | hasHttp | hasHttps | dsr | dse | bodyLength | numTitles | numImages | numLinks | specialChars | scriptLength | sbr | bscr | sscr | ext_com | ext_country | ext_html | ext_info | ext_net | ext_other | ext_php |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
2019-10-04 00:00:00 | 0.08 | 0 | 1 | 0.21 | 0.14 | -0.33 | -0.09 | -0.07 | -0.76 | 0.24 | -0.24 | -1.09 | -0.63 | -0.38 | -0.34 | -0.40 | -0.59 | -0.36 | -0.29 | -1.22 | -2.04 | -0.19 | 1.16 | -0.43 | -0.40 | -0.04 | -0.28 | -0.43 | -0.23 |
2019-12-10 00:00:00 | 0.04 | 0 | 1 | 1.95 | 0.24 | -0.33 | -0.09 | -0.07 | -3.82 | 0.24 | -0.24 | -1.09 | -0.63 | -0.38 | -0.34 | -0.37 | -0.59 | -0.36 | -0.29 | -1.22 | -2.04 | -0.19 | -0.86 | -0.43 | -0.40 | -0.04 | 3.55 | -0.43 | -0.23 |
2019-11-05 00:00:00 | 0.03 | 0 | 1 | 1.99 | 0.50 | -0.33 | -0.09 | -0.07 | -3.42 | 0.24 | -0.24 | -1.09 | -0.63 | -0.38 | -0.34 | -0.40 | -0.59 | -0.36 | -0.29 | -1.22 | -2.04 | -0.19 | -0.86 | 2.30 | -0.40 | -0.04 | -0.28 | -0.43 | -0.23 |
2019-12-17 00:00:00 | 0.03 | 0 | 1 | 0.37 | -0.69 | 1.66 | -0.09 | -0.07 | 0.30 | 0.24 | 4.14 | 1.47 | 0.86 | -0.38 | -0.34 | -0.40 | -0.59 | -0.36 | -0.29 | -1.22 | -2.04 | -0.19 | -0.86 | -0.43 | -0.40 | -0.04 | 3.55 | -0.43 | -0.23 |
2019-11-26 00:00:00 | 0.02 | 0 | 1 | -1.62 | -0.69 | -0.33 | -0.09 | -0.07 | 1.06 | 0.24 | -0.24 | 0.16 | -0.27 | -0.37 | -0.31 | -0.34 | -0.56 | -0.34 | -0.28 | 0.48 | 0.57 | -0.07 | -0.86 | -0.43 | 2.49 | -0.04 | -0.28 | -0.43 | -0.23 |
Top Trust Score Samples
scrape_date | Trust Score | Model Prediction | target | urlLength | numDigits | numParams | num_%20 | num_@ | entropy | hasHttp | hasHttps | dsr | dse | bodyLength | numTitles | numImages | numLinks | specialChars | scriptLength | sbr | bscr | sscr | ext_com | ext_country | ext_html | ext_info | ext_net | ext_other | ext_php |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
2019-12-03 00:00:00 | 339388615871.85 | 0 | 0 | 0.21 | 0.70 | 2.32 | -0.09 | -0.07 | -1.40 | 0.24 | -0.24 | -1.09 | -0.63 | -0.38 | -0.34 | -0.40 | -0.59 | -0.36 | -0.29 | -1.22 | -2.04 | -0.19 | -0.86 | -0.43 | -0.40 | -0.04 | -0.28 | 2.34 | -0.23 |
2019-11-29 00:00:00 | 254814779268.18 | 0 | 0 | 0.85 | -0.43 | 2.32 | -0.09 | -0.07 | -1.53 | 0.24 | -0.24 | -1.09 | -0.63 | -0.38 | -0.34 | -0.40 | -0.59 | -0.36 | -0.29 | -1.22 | -2.04 | -0.19 | -0.86 | 2.30 | -0.40 | -0.04 | -0.28 | -0.43 | -0.23 |
2019-11-20 00:00:00 | 254814779268.18 | 0 | 0 | 0.85 | -0.43 | 2.32 | -0.09 | -0.07 | -1.53 | 0.24 | -0.24 | -1.09 | -0.63 | -0.38 | -0.34 | -0.40 | -0.59 | -0.36 | -0.29 | -1.22 | -2.04 | -0.19 | -0.86 | 2.30 | -0.40 | -0.04 | -0.28 | -0.43 | -0.23 |
2019-11-20 00:00:00 | 9250.03 | 0 | 0 | -0.50 | -0.64 | -0.33 | -0.09 | -0.07 | 0.57 | 0.24 | -0.24 | -1.09 | -0.63 | -0.38 | -0.31 | -0.40 | -0.59 | -0.35 | -0.29 | -1.22 | -0.42 | -0.19 | -0.86 | -0.43 | 2.49 | -0.04 | -0.28 | -0.43 | -0.23 |
2019-11-02 00:00:00 | 3167.88 | 0 | 0 | -0.18 | -0.54 | 4.97 | -0.09 | -0.07 | -0.02 | 0.24 | -0.24 | -0.71 | -0.52 | -0.38 | -0.31 | -0.40 | -0.59 | -0.36 | -0.29 | -1.22 | -2.04 | -0.19 | -0.86 | -0.43 | -0.40 | -0.04 | -0.28 | -0.43 | 4.41 |
Performance Report
Summarize given scores on a dataset and model.
Conditions Summary
Status | Condition | More Info |
---|---|---|
✓ | Train-Test scores relative degradation is not greater than 0.1 | |
Additional Outputs
ROC Report - Train Dataset
Calculate the ROC curve for each class.
Conditions Summary
Status | Condition | More Info |
---|---|---|
✓ | AUC score for all the classes is not less than 0.7 | |
Additional Outputs
ROC Report - Test Dataset
Calculate the ROC curve for each class.
Conditions Summary
Status | Condition | More Info |
---|---|---|
✓ | AUC score for all the classes is not less than 0.7 | |
Additional Outputs
Unused Features
Detect features that are nearly unused by the model.
Conditions Summary
Status | Condition | More Info |
---|---|---|
✓ | Number of high variance unused features is not greater than 5 | |
Additional Outputs
Model Inference Time Check - Train Dataset
Measure model average inference time (in seconds) per sample.
Conditions Summary
Status | Condition | More Info |
---|---|---|
✓ | Average model inference time for one sample is not greater than 0.001 | |
Additional Outputs
Model Inference Time Check - Test Dataset
Measure model average inference time (in seconds) per sample.
Conditions Summary
Status | Condition | More Info |
---|---|---|
✓ | Average model inference time for one sample is not greater than 0.001 | |
Additional Outputs
Check Without Conditions Output
Confusion Matrix Report - Train Dataset
Calculate the confusion matrix of the model on the given dataset.
Additional Outputs
Confusion Matrix Report - Test Dataset
Calculate the confusion matrix of the model on the given dataset.
Additional Outputs
Calibration Metric - Train Dataset
Calculate the calibration curve with brier score for each class.
Additional Outputs
Calibration Metric - Test Dataset
Calculate the calibration curve with brier score for each class.
Additional Outputs
Other Checks That Weren't Displayed
Check | Reason |
---|---|
Regression Systematic Error - Train Dataset | DeepchecksValueError: Expected model to be a type from ['regression'], but received model of type: binary |
Regression Systematic Error - Test Dataset | DeepchecksValueError: Expected model to be a type from ['regression'], but received model of type: binary |
Regression Error Distribution - Train Dataset | DeepchecksValueError: Expected model to be a type from ['regression'], but received model of type: binary |
Regression Error Distribution - Test Dataset | DeepchecksValueError: Expected model to be a type from ['regression'], but received model of type: binary |
Boosting Overfit | DeepchecksValueError: Unsupported model of type: LogisticRegression |
🔴 Understanding the checks’ results!
Three checks either failed or raised warnings that look very important:
- Simple Model Comparison: This check makes sure our model outperforms a very simple model to some degree. Having it fail means we might have a serious problem.
- Model Error Analysis: This check analyses model errors and tries to segment our data in a way that is informative for error analysis. It seems that it found a valuable way to segment our data, error-wise, using the urlLength feature. We’ll look into it soon enough.
- Trust Score Comparison: This check found a very significant decline in the trust score between the train and test sets. This means that test samples are more likely to disagree with their counterparts (or neighbours) in the train set than we would want or expect, and thus our predictions on them are expected to be erroneous (see more in the paper introducing the trust score).
Looking at the metric plots for F1 for both our model and a simple one, we see that their performance is almost identical! How can this be? Fortunately, the confusion matrices automagically generated for both the training and test sets help us understand what has happened.
Our evidently over-regularized classifier was over-impressed by the majority class (0, or non-malicious URL), and predicted a value of 0 for almost all samples in both the train and the test set, which yielded a seemingly impressive 97% accuracy on the test set just due to the imbalanced nature of the problem.
deepchecks also generated plots for F1, precision and recall on both the train and test sets, as part of the performance report, and these also help us see that recall scores are almost zero for both sets and understand what happened.
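The effect described here can be reproduced in a few lines. The following is a minimal sketch, not part of the original notebook: it assumes a hypothetical label vector with 97 benign and 3 malicious samples (chosen only to mirror the 97% figure) and uses scikit-learn's `DummyClassifier` as a stand-in for a model that always predicts the majority class.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, f1_score, recall_score

# Hypothetical labels: 97 benign (0) and 3 malicious (1) URLs.
y = np.array([0] * 97 + [1] * 3)
X = np.zeros((len(y), 1))  # features are irrelevant to this baseline

# A "model" that always predicts the most frequent class.
baseline = DummyClassifier(strategy="most_frequent").fit(X, y)
pred = baseline.predict(X)

accuracy = accuracy_score(y, pred)
recall = recall_score(y, pred, zero_division=0)  # recall for class 1
f1 = f1_score(y, pred, zero_division=0)          # F1 for class 1

print(f"accuracy={accuracy:.2f} recall={recall:.2f} f1={f1:.2f}")
```

Accuracy looks high because it is dominated by the benign class, while recall and F1 for the malicious class collapse to zero. This is why the suite's per-class metrics are more informative than accuracy for this problem.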
Additionally, the best and worst trust score sample tables can help us see on which samples the classifier should possibly not be trusted. In this case, the worst trust score table is dominated by samples with target=1, also pointing us at a problem in generalizing the notion of maliciousness from the train to the test set.
🔷 Trying out a different classifier
So let’s throw something with a bit more expressive power at the problem - a decision tree! 🌲
[35]:
from sklearn.tree import DecisionTreeClassifier
model = DecisionTreeClassifier(criterion='entropy', splitter='random', random_state=SEED)
model.fit(train_X, train_y)
msuite.run(model=model, train_dataset=ds_train, test_dataset=ds_test)
Model Evaluation Suite
The suite is composed of various checks such as: Simple Model Comparison, Roc Report, Unused Features, etc...
Each check may contain conditions (which will result in pass / fail / warning, represented by ✓ / ✖ / !) as well as other outputs such as plots or tables.
Suites, checks and conditions can all be modified (see the Create a Custom Suite tutorial).
Conditions Summary
Status | Check | Condition | More Info |
---|---|---|---|
✖ | Performance Report | Train-Test scores relative degradation is not greater than 0.1 | F1 for class 1 (train=1 test=0.82) Precision for class 1 (train=1 test=0.79) Recall for class 1 (train=1 test=0.85) |
! | Model Error Analysis | The performance difference of the detected segments must not be greater than 5.00% | Found change in Accuracy in features above threshold: {'urlLength': '5.73%'} |
! | Trust Score Comparison: Train vs. Test | Mean trust score decline is not greater than 20.00% | Found decline of: -97.21% |
! | Unused Features | Number of high variance unused features is not greater than 5 | Found number of unused high variance features above threshold: ['scriptLength', 'sscr', 'ext_info', 'numImages', 'num_@', 'hasHttp'] |
✓ | ROC Report - Train Dataset | AUC score for all the classes is not less than 0.7 | |
✓ | ROC Report - Test Dataset | AUC score for all the classes is not less than 0.7 | |
✓ | Simple Model Comparison | Model performance gain over simple model is not less than 10.00% | |
✓ | Model Inference Time Check - Train Dataset | Average model inference time for one sample is not greater than 0.001 | |
✓ | Model Inference Time Check - Test Dataset | Average model inference time for one sample is not greater than 0.001 | |
Check With Conditions Output
Performance Report
Summarize given scores on a dataset and model.
Conditions Summary
Status | Condition | More Info |
---|---|---|
✖ | Train-Test scores relative degradation is not greater than 0.1 | F1 for class 1 (train=1 test=0.82) Precision for class 1 (train=1 test=0.79) Recall for class 1 (train=1 test=0.85) |
Additional Outputs
Model Error Analysis
Find features that best split the data into segments of high and low model error.
Conditions Summary
Status | Condition | More Info |
---|---|---|
! | The performance difference of the detected segments must not be greater than 5.00% | Found change in Accuracy in features above threshold: {'urlLength': '5.73%'} |
Additional Outputs
Trust Score Comparison: Train vs. Test
Compares the model's trust scores of the train dataset with scores of the test dataset.
Conditions Summary
Status | Condition | More Info |
---|---|---|
! | Mean trust score decline is not greater than 20.00% | Found decline of: -97.21% |
Additional Outputs
Worst Trust Score Samples
scrape_date | Trust Score | Model Prediction | target | urlLength | numDigits | numParams | num_%20 | num_@ | entropy | hasHttp | hasHttps | dsr | dse | bodyLength | numTitles | numImages | numLinks | specialChars | scriptLength | sbr | bscr | sscr | ext_com | ext_country | ext_html | ext_info | ext_net | ext_other | ext_php |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
2019-11-01 00:00:00 | 0.07 | 1 | 0 | -0.57 | 0.45 | -0.33 | -0.09 | -0.07 | 0.83 | 0.24 | -0.24 | -0.26 | -0.20 | -0.38 | -0.31 | -0.37 | -0.59 | -0.35 | -0.29 | -1.22 | -0.14 | -0.19 | -0.86 | -0.43 | -0.40 | -0.04 | 3.55 | -0.43 | -0.23 |
2019-11-14 00:00:00 | 0.07 | 1 | 0 | -0.13 | -0.69 | -0.33 | -0.09 | -0.07 | 0.75 | 0.24 | -0.24 | -1.09 | -0.63 | -0.15 | 1.23 | 0.80 | 0.88 | -0.05 | -0.09 | 0.53 | 0.85 | -0.07 | -0.86 | -0.43 | 2.49 | -0.04 | -0.28 | -0.43 | -0.23 |
2019-10-01 00:00:00 | 0.05 | 1 | 0 | -0.45 | -0.74 | -0.33 | -0.09 | -0.07 | 0.67 | 0.24 | -0.24 | -1.09 | -0.63 | -0.24 | -0.02 | -0.32 | 0.44 | -0.23 | -0.19 | 0.36 | 0.09 | -0.05 | -0.86 | -0.43 | 2.49 | -0.04 | -0.28 | -0.43 | -0.23 |
2019-12-24 00:00:00 | 0.04 | 1 | 0 | -0.09 | -0.69 | -0.33 | -0.09 | -0.07 | 0.80 | 0.24 | -0.24 | -1.09 | -0.63 | -0.30 | -0.05 | 0.02 | 0.12 | -0.29 | -0.26 | 0.05 | 0.69 | -0.10 | -0.86 | -0.43 | 2.49 | -0.04 | -0.28 | -0.43 | -0.23 |
2019-10-30 00:00:00 | 0.04 | 1 | 0 | -0.20 | -0.74 | -0.33 | -0.09 | -0.07 | 0.76 | 0.24 | -0.24 | -1.09 | -0.63 | -0.33 | -0.05 | 0.02 | 0.12 | -0.29 | -0.26 | 0.03 | 0.72 | -0.10 | -0.86 | -0.43 | 2.49 | -0.04 | -0.28 | -0.43 | -0.23 |
Top Trust Score Samples
scrape_date | Trust Score | Model Prediction | target | urlLength | numDigits | numParams | num_%20 | num_@ | entropy | hasHttp | hasHttps | dsr | dse | bodyLength | numTitles | numImages | numLinks | specialChars | scriptLength | sbr | bscr | sscr | ext_com | ext_country | ext_html | ext_info | ext_net | ext_other | ext_php |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
2019-12-03 00:00:00 | 339388615871.85 | 0 | 0 | 0.21 | 0.70 | 2.32 | -0.09 | -0.07 | -1.40 | 0.24 | -0.24 | -1.09 | -0.63 | -0.38 | -0.34 | -0.40 | -0.59 | -0.36 | -0.29 | -1.22 | -2.04 | -0.19 | -0.86 | -0.43 | -0.40 | -0.04 | -0.28 | 2.34 | -0.23 |
2019-11-29 00:00:00 | 254814779268.18 | 0 | 0 | 0.85 | -0.43 | 2.32 | -0.09 | -0.07 | -1.53 | 0.24 | -0.24 | -1.09 | -0.63 | -0.38 | -0.34 | -0.40 | -0.59 | -0.36 | -0.29 | -1.22 | -2.04 | -0.19 | -0.86 | 2.30 | -0.40 | -0.04 | -0.28 | -0.43 | -0.23 |
2019-11-20 00:00:00 | 254814779268.18 | 0 | 0 | 0.85 | -0.43 | 2.32 | -0.09 | -0.07 | -1.53 | 0.24 | -0.24 | -1.09 | -0.63 | -0.38 | -0.34 | -0.40 | -0.59 | -0.36 | -0.29 | -1.22 | -2.04 | -0.19 | -0.86 | 2.30 | -0.40 | -0.04 | -0.28 | -0.43 | -0.23 |
2019-11-20 00:00:00 | 9250.03 | 0 | 0 | -0.50 | -0.64 | -0.33 | -0.09 | -0.07 | 0.57 | 0.24 | -0.24 | -1.09 | -0.63 | -0.38 | -0.31 | -0.40 | -0.59 | -0.35 | -0.29 | -1.22 | -0.42 | -0.19 | -0.86 | -0.43 | 2.49 | -0.04 | -0.28 | -0.43 | -0.23 |
2019-11-02 00:00:00 | 3167.88 | 0 | 0 | -0.18 | -0.54 | 4.97 | -0.09 | -0.07 | -0.02 | 0.24 | -0.24 | -0.71 | -0.52 | -0.38 | -0.31 | -0.40 | -0.59 | -0.36 | -0.29 | -1.22 | -2.04 | -0.19 | -0.86 | -0.43 | -0.40 | -0.04 | -0.28 | -0.43 | 4.41 |
Unused Features
Detect features that are nearly unused by the model.
Conditions Summary
Status | Condition | More Info |
---|---|---|
! | Number of high variance unused features is not greater than 5 | Found number of unused high variance features above threshold: ['scriptLength', 'sscr', 'ext_info', 'numImages', 'num_@', 'hasHttp'] |
Additional Outputs
ROC Report - Train Dataset
Calculate the ROC curve for each class.
Conditions Summary
Status | Condition | More Info |
---|---|---|
✓ | AUC score for all the classes is not less than 0.7 | |
Additional Outputs
ROC Report - Test Dataset
Calculate the ROC curve for each class.
Conditions Summary
Status | Condition | More Info |
---|---|---|
✓ | AUC score for all the classes is not less than 0.7 | |
Additional Outputs
Simple Model Comparison
Compare given model score to simple model score (according to given model type).
Conditions Summary
Status | Condition | More Info |
---|---|---|
✓ | Model performance gain over simple model is not less than 10.00% | |
Additional Outputs
Model Inference Time Check - Train Dataset
Measure model average inference time (in seconds) per sample.
Conditions Summary
Status | Condition | More Info |
---|---|---|
✓ | Average model inference time for one sample is not greater than 0.001 | |
Additional Outputs
Model Inference Time Check - Test Dataset
Measure model average inference time (in seconds) per sample.
Conditions Summary
Status | Condition | More Info |
---|---|---|
✓ | Average model inference time for one sample is not greater than 0.001 | |
Additional Outputs
Check Without Conditions Output
Confusion Matrix Report - Train Dataset
Calculate the confusion matrix of the model on the given dataset.
Additional Outputs
Confusion Matrix Report - Test Dataset
Calculate the confusion matrix of the model on the given dataset.
Additional Outputs
Calibration Metric - Train Dataset
Calculate the calibration curve with brier score for each class.
Additional Outputs
Calibration Metric - Test Dataset
Calculate the calibration curve with brier score for each class.
Additional Outputs
Other Checks That Weren't Displayed
Check | Reason |
---|---|
Regression Systematic Error - Train Dataset | DeepchecksValueError: Expected model to be a type from ['regression'], but received model of type: binary |
Regression Systematic Error - Test Dataset | DeepchecksValueError: Expected model to be a type from ['regression'], but received model of type: binary |
Regression Error Distribution - Train Dataset | DeepchecksValueError: Expected model to be a type from ['regression'], but received model of type: binary |
Regression Error Distribution - Test Dataset | DeepchecksValueError: Expected model to be a type from ['regression'], but received model of type: binary |
Boosting Overfit | DeepchecksValueError: Unsupported model of type: DecisionTreeClassifier |
🔴 Understanding the checks’ results!
Right off the bat, deepchecks alerts us to a significant degradation in performance - for F1, Precision and Recall - from the train to the test set. This immediately points us in the direction of overfitting.
Indeed, looking at the confusion matrices for the train and test sets, we can immediately see that our decision tree fits the train set perfectly, but gets significantly lower performance on the test set. This is a classic indicator of overfitting, and decision trees are known for being prone to it.
In addition, the Model Error Analysis section helps us see that we underperform for samples with low urlLength values. This can help us with more advanced feature generation and selection later in the project.
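To make the failed condition concrete, the arithmetic behind a "relative degradation" can be worked through by hand. The sketch below assumes the degradation is defined as `(train - test) / train`. This is an illustrative assumption rather than a confirmed description of how deepchecks computes it, and `relative_degradation` is a hypothetical helper. The scores are the class-1 values reported in the Performance Report for the decision tree.

```python
def relative_degradation(train_score: float, test_score: float) -> float:
    """Fraction of the train score that is lost on the test set (assumed definition)."""
    return (train_score - test_score) / train_score


# Class-1 scores reported by the Performance Report: (train, test).
reported = {
    "F1": (1.0, 0.82),
    "Precision": (1.0, 0.79),
    "Recall": (1.0, 0.85),
}
THRESHOLD = 0.1  # taken from the condition text

degradations = {
    name: relative_degradation(train, test) for name, (train, test) in reported.items()
}
for name, value in degradations.items():
    status = "fails" if value > THRESHOLD else "passes"
    print(f"{name}: {value:.2f} ({status})")
```

Under this assumed definition all three metrics lose between 15% and 21% of their train value, well above the 0.1 threshold, which is consistent with the ✖ status shown for the Performance Report.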
🔷 Boosting our model!
To try and solve the overfitting issue, let’s throw at the problem an ensemble model that is a bit more resilient to overfitting than a single decision tree: a gradient-boosted ensemble of them!
[36]:
from sklearn.ensemble import GradientBoostingClassifier
model = GradientBoostingClassifier(n_estimators=250, random_state=SEED, max_depth=20, subsample=0.8, loss='exponential')
model.fit(train_X, train_y)
msuite.run(model=model, train_dataset=ds_train, test_dataset=ds_test)
Model Evaluation Suite
The suite is composed of various checks such as: Simple Model Comparison, Roc Report, Unused Features, etc...
Each check may contain conditions (which will result in pass / fail / warning, represented by ✓ / ✖ / !) as well as other outputs such as plots or tables.
Suites, checks and conditions can all be modified (see the Create a Custom Suite tutorial).
Conditions Summary
Status | Check | Condition | More Info |
---|---|---|---|
✖ | Performance Report | Train-Test scores relative degradation is not greater than 0.1 | F1 for class 1 (train=1 test=0.87) Precision for class 1 (train=1 test=0.89) Recall for class 1 (train=1 test=0.86) |
! | Trust Score Comparison: Train vs. Test | Mean trust score decline is not greater than 20.00% | Found decline of: -97.21% |
! | Unused Features | Number of high variance unused features is not greater than 5 | Found number of unused high variance features above threshold: ['sscr', 'ext_info', 'ext_country', 'ext_html', 'ext_other', 'num_@', 'hasHttps', 'hasHttp', 'numLinks', 'ext_php'] |
✓ | ROC Report - Train Dataset | AUC score for all the classes is not less than 0.7 | |
✓ | ROC Report - Test Dataset | AUC score for all the classes is not less than 0.7 | |
✓ | Simple Model Comparison | Model performance gain over simple model is not less than 10.00% | |
✓ | Model Error Analysis | The performance difference of the detected segments must not be greater than 5.00% | |
✓ | Boosting Overfit | Test score over iterations doesn't decline by more than 5.00% from the best score | |
✓ | Model Inference Time Check - Train Dataset | Average model inference time for one sample is not greater than 0.001 | |
✓ | Model Inference Time Check - Test Dataset | Average model inference time for one sample is not greater than 0.001 | |
Check With Conditions Output
Performance Report
Summarize given scores on a dataset and model.
Conditions Summary
Status | Condition | More Info |
---|---|---|
✖ | Train-Test scores relative degradation is not greater than 0.1 | F1 for class 1 (train=1 test=0.87) Precision for class 1 (train=1 test=0.89) Recall for class 1 (train=1 test=0.86) |
Additional Outputs
Trust Score Comparison: Train vs. Test
Compares the model's trust scores of the train dataset with scores of the test dataset.
Conditions Summary
Status | Condition | More Info |
---|---|---|
! | Mean trust score decline is not greater than 20.00% | Found decline of: -97.21% |
Additional Outputs
Worst Trust Score Samples
scrape_date | Trust Score | Model Prediction | target | urlLength | numDigits | numParams | num_%20 | num_@ | entropy | hasHttp | hasHttps | dsr | dse | bodyLength | numTitles | numImages | numLinks | specialChars | scriptLength | sbr | bscr | sscr | ext_com | ext_country | ext_html | ext_info | ext_net | ext_other | ext_php |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
2019-10-20 00:00:00 | 0.27 | 1 | 1 | -0.66 | 0.29 | -0.33 | -0.09 | -0.07 | -1.20 | 0.24 | -0.24 | -1.09 | -0.63 | -0.38 | -0.34 | -0.40 | -0.59 | -0.36 | -0.29 | -1.22 | -2.04 | -0.19 | -0.86 | 2.30 | -0.40 | -0.04 | -0.28 | -0.43 | -0.23 |
2019-10-05 00:00:00 | 0.18 | 1 | 0 | 0.05 | -0.74 | -0.33 | -0.09 | -0.07 | 0.52 | 0.24 | -0.24 | -1.09 | -0.63 | -0.20 | 0.22 | -0.26 | -0.12 | -0.19 | -0.12 | 1.13 | 0.18 | 0.01 | -0.86 | -0.43 | 2.49 | -0.04 | -0.28 | -0.43 | -0.23 |
2019-11-24 00:00:00 | 0.17 | 1 | 1 | -0.96 | -0.74 | -0.33 | -0.09 | -0.07 | 0.18 | 0.24 | -0.24 | -1.09 | -0.63 | -0.36 | -0.26 | -0.16 | -0.42 | -0.34 | -0.28 | -0.16 | 0.50 | -0.11 | -0.86 | -0.43 | 2.49 | -0.04 | -0.28 | -0.43 | -0.23 |
2019-12-25 00:00:00 | 0.16 | 1 | 0 | -0.71 | -0.74 | -0.33 | -0.09 | -0.07 | 0.48 | 0.24 | -0.24 | -1.09 | -0.63 | -0.09 | 0.22 | -0.29 | -0.12 | -0.19 | -0.17 | 1.13 | 0.18 | 0.01 | -0.86 | -0.43 | 2.49 | -0.04 | -0.28 | -0.43 | -0.23 |
2019-11-28 00:00:00 | 0.02 | 1 | 0 | -0.73 | -0.64 | -0.33 | -0.09 | -0.07 | 1.22 | 0.24 | -0.24 | 0.63 | -0.59 | -0.38 | -0.34 | -0.40 | -0.59 | -0.36 | -0.29 | -1.22 | -2.04 | -0.19 | 1.16 | -0.43 | -0.40 | -0.04 | -0.28 | -0.43 | -0.23 |
Top Trust Score Samples
scrape_date | Trust Score | Model Prediction | target | urlLength | numDigits | numParams | num_%20 | num_@ | entropy | hasHttp | hasHttps | dsr | dse | bodyLength | numTitles | numImages | numLinks | specialChars | scriptLength | sbr | bscr | sscr | ext_com | ext_country | ext_html | ext_info | ext_net | ext_other | ext_php |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
2019-12-03 00:00:00 | 339388615871.85 | 0 | 0 | 0.21 | 0.70 | 2.32 | -0.09 | -0.07 | -1.40 | 0.24 | -0.24 | -1.09 | -0.63 | -0.38 | -0.34 | -0.40 | -0.59 | -0.36 | -0.29 | -1.22 | -2.04 | -0.19 | -0.86 | -0.43 | -0.40 | -0.04 | -0.28 | 2.34 | -0.23 |
2019-11-29 00:00:00 | 254814779268.18 | 0 | 0 | 0.85 | -0.43 | 2.32 | -0.09 | -0.07 | -1.53 | 0.24 | -0.24 | -1.09 | -0.63 | -0.38 | -0.34 | -0.40 | -0.59 | -0.36 | -0.29 | -1.22 | -2.04 | -0.19 | -0.86 | 2.30 | -0.40 | -0.04 | -0.28 | -0.43 | -0.23 |
2019-11-20 00:00:00 | 254814779268.18 | 0 | 0 | 0.85 | -0.43 | 2.32 | -0.09 | -0.07 | -1.53 | 0.24 | -0.24 | -1.09 | -0.63 | -0.38 | -0.34 | -0.40 | -0.59 | -0.36 | -0.29 | -1.22 | -2.04 | -0.19 | -0.86 | 2.30 | -0.40 | -0.04 | -0.28 | -0.43 | -0.23 |
2019-11-20 00:00:00 | 9250.03 | 0 | 0 | -0.50 | -0.64 | -0.33 | -0.09 | -0.07 | 0.57 | 0.24 | -0.24 | -1.09 | -0.63 | -0.38 | -0.31 | -0.40 | -0.59 | -0.35 | -0.29 | -1.22 | -0.42 | -0.19 | -0.86 | -0.43 | 2.49 | -0.04 | -0.28 | -0.43 | -0.23 |
2019-11-02 00:00:00 | 3167.88 | 0 | 0 | -0.18 | -0.54 | 4.97 | -0.09 | -0.07 | -0.02 | 0.24 | -0.24 | -0.71 | -0.52 | -0.38 | -0.31 | -0.40 | -0.59 | -0.36 | -0.29 | -1.22 | -2.04 | -0.19 | -0.86 | -0.43 | -0.40 | -0.04 | -0.28 | -0.43 | 4.41 |
Unused Features
Detect features that are nearly unused by the model.
Conditions Summary
Status | Condition | More Info |
---|---|---|
! | Number of high variance unused features is not greater than 5 | Found number of unused high variance features above threshold: ['sscr', 'ext_info', 'ext_country', 'ext_html', 'ext_other', 'num_@', 'hasHttps', 'hasHttp', 'numLinks', 'ext_php'] |
Additional Outputs
ROC Report - Train Dataset
Calculate the ROC curve for each class.
Conditions Summary
Status | Condition | More Info |
---|---|---|
✓ | AUC score for all the classes is not less than 0.7 | |
Additional Outputs
ROC Report - Test Dataset
Calculate the ROC curve for each class.
Conditions Summary
Status | Condition | More Info |
---|---|---|
✓ | AUC score for all the classes is not less than 0.7 | |
Additional Outputs
Simple Model Comparison
Compare given model score to simple model score (according to given model type).
Conditions Summary
Status | Condition | More Info |
---|---|---|
✓ | Model performance gain over simple model is not less than 10.00% | |
Additional Outputs
Model Error Analysis
Find features that best split the data into segments of high and low model error.
Conditions Summary
Status | Condition | More Info |
---|---|---|
✓ | The performance difference of the detected segments must not be greater than 5.00% | |
Additional Outputs
Boosting Overfit
Check for overfit caused by using too many iterations in a gradient boosted model.
Conditions Summary
Status | Condition | More Info |
---|---|---|
✓ | Test score over iterations doesn't decline by more than 5.00% from the best score | |
Additional Outputs
Model Inference Time Check - Train Dataset
Measure model average inference time (in seconds) per sample.
Conditions Summary
Status | Condition | More Info |
---|---|---|
✓ | Average model inference time for one sample is not greater than 0.001 | |
Additional Outputs
Model Inference Time Check - Test Dataset
Measure model average inference time (in seconds) per sample.
Conditions Summary
Status | Condition | More Info |
---|---|---|
✓ | Average model inference time for one sample is not greater than 0.001 | |
Additional Outputs
Check Without Conditions Output
Confusion Matrix Report - Train Dataset
Calculate the confusion matrix of the model on the given dataset.
Additional Outputs
Confusion Matrix Report - Test Dataset
Calculate the confusion matrix of the model on the given dataset.
Additional Outputs
Calibration Metric - Train Dataset
Calculate the calibration curve with brier score for each class.
Additional Outputs
Calibration Metric - Test Dataset
Calculate the calibration curve with brier score for each class.
Additional Outputs
Other Checks That Weren't Displayed
Check | Reason |
---|---|
Regression Systematic Error - Train Dataset | DeepchecksValueError: Expected model to be a type from ['regression'], but received model of type: binary |
Regression Systematic Error - Test Dataset | DeepchecksValueError: Expected model to be a type from ['regression'], but received model of type: binary |
Regression Error Distribution - Train Dataset | DeepchecksValueError: Expected model to be a type from ['regression'], but received model of type: binary |
Regression Error Distribution - Test Dataset | DeepchecksValueError: Expected model to be a type from ['regression'], but received model of type: binary |
🔴 Understanding the checks’ results!
Again, deepchecks supplied some interesting insights, including a considerable performance degradation between the train and test sets. The degradation we witnessed before was mitigated only very little.
However, for a boosted model we get a pretty cool Boosting Overfit check that plots the accuracy of the model along increasing boosting iterations. This can help us see that we might have a minor case of overfitting here, as peak train set accuracy is reached rather early on, and while test set performance improves for a little while longer, it shows some degradation starting from iteration 135.
This at least points to possible value in adjusting the n_estimators parameter, either reducing it or increasing it to see if the degradation continues or perhaps the trend shifts.
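One way to explore the effect of n_estimators without retraining for every value is scikit-learn's `staged_predict`, which yields a fitted gradient boosting model's predictions after each boosting iteration. The following is a minimal sketch on synthetic data from `make_classification`. The dataset, split and hyperparameters are assumptions for illustration and are not the notebook's `train_X` / `train_y` or its model settings.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced stand-in data (not the phishing dataset).
X, y = make_classification(
    n_samples=600, n_features=10, weights=[0.9, 0.1], random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0
)

gbc = GradientBoostingClassifier(n_estimators=100, max_depth=3, random_state=0)
gbc.fit(X_tr, y_tr)

# Test accuracy after each boosting iteration, using a single fitted model.
test_acc = [accuracy_score(y_te, pred) for pred in gbc.staged_predict(X_te)]
best_iter = int(np.argmax(test_acc)) + 1  # iterations are counted from 1

print(f"best test accuracy {max(test_acc):.3f} at iteration {best_iter} of {len(test_acc)}")
```

If the best iteration lands well before the last one, a smaller n_estimators is a reasonable candidate to try. Selecting it on the test set itself would leak information, so a separate validation split is the safer choice in practice.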
🗞 Wrapping it all up!
We haven’t got a decent model yet, but deepchecks provides us with numerous tools to help us navigate our development and make better feature engineering and model selection decisions, by making critical issues in data drift, overfitting, leakage, feature importance and model calibration readily accessible.
And this is just what deepchecks can do out of the box, with the prebuilt checks and suites! There is a lot more potential in the way the package lends itself to easy customization and creation of checks and suites tailored to your needs. We will touch upon some such advanced uses in future guides.
We hope, however, that this example already provides you with a good starting point for getting some immediate benefit out of using deepchecks! Have fun, and reach out to us if you need assistance! :)