WholeDatasetDrift

class WholeDatasetDrift

Calculate drift between the entire train and test datasets using a model trained to distinguish between them.

The check fits a new model, called a Domain Classifier, to distinguish between the train and test datasets. Once the Domain Classifier is fitted, the check calculates its feature importance. The result of the check is based on the AUC of the Domain Classifier, and the check displays the change in distribution between train and test for the top features according to the calculated feature importance.
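As a rough illustration of why AUC measures drift, the following standard-library sketch computes how well a score separates test rows from train rows. It is not the check's implementation: the check fits a model over all features, whereas this toy uses a single numeric feature value as the classifier score. The names `domain_auc`, `train_values`, and `test_values` are hypothetical.

```python
from typing import Sequence


def domain_auc(train_values: Sequence[float], test_values: Sequence[float]) -> float:
    """AUC of telling test rows (label 1) from train rows (label 0).

    The raw feature value is used as the score. AUC equals the probability
    that a randomly chosen test row scores higher than a randomly chosen
    train row, counting ties as one half.
    """
    if not train_values or not test_values:
        raise ValueError("both samples must be non-empty")
    wins = 0.0
    for test_value in test_values:
        for train_value in train_values:
            if test_value > train_value:
                wins += 1.0
            elif test_value == train_value:
                wins += 0.5
    return wins / (len(train_values) * len(test_values))


# Identical samples cannot be told apart, so the AUC is 0.5.
same = domain_auc([1.0, 2.0, 3.0], [1.0, 2.0, 3.0])
# Fully shifted samples are perfectly separable, so the AUC is 1.0.
shifted = domain_auc([1.0, 2.0, 3.0], [4.0, 5.0, 6.0])
```

An AUC near 0.5 suggests the two datasets look alike to the classifier, while an AUC near 1 suggests they are easy to tell apart.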

Parameters
n_top_columns : int, default: 3

Number of columns to show, ordered by Domain Classifier feature importance. This limit is applied together with (AND) min_feature_importance, so fewer than n_top_columns features may be displayed.

min_feature_importance : float, default: 0.05

Minimum feature importance to show in the check display. Feature importance sums to 1, so, for example, the default value of 0.05 means that features contributing less than 5% to the predictive power of the Domain Classifier won’t be displayed. This limit is applied together with (AND) n_top_columns, so features more important than min_feature_importance may still be hidden.

max_num_categories : int, default: 10

Only for categorical columns. Maximum number of categories to display in distribution plots. If there are more, they are binned into an “Other” category in the display. If max_num_categories=None, there is no limit.

sample_size : int, default: 10_000

Maximum number of rows to use from each dataset for training and evaluating the Domain Classifier.

random_state : int, default: 42

Random seed for the check.

test_size : float, default: 0.3

Fraction of the combined datasets to use for the evaluation of the domain classifier.
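The display parameters above can be sketched in plain Python. This is only an illustration under stated assumptions, not the library's code: `features_to_display` and `bin_categories` are hypothetical helpers, and keeping the most frequent categories before binning the rest into “Other” is an assumed policy.

```python
from collections import Counter
from typing import Dict, Hashable, Iterable, List, Optional


def features_to_display(
    importances: Dict[str, float],
    n_top_columns: int = 3,
    min_feature_importance: float = 0.05,
) -> List[str]:
    """Apply both limits (AND): take the top-n features, then drop weak ones."""
    ranked = sorted(importances, key=importances.get, reverse=True)
    return [
        name
        for name in ranked[:n_top_columns]
        if importances[name] >= min_feature_importance
    ]


def bin_categories(
    values: Iterable[Hashable], max_num_categories: Optional[int] = 10
) -> Dict[Hashable, int]:
    """Count categories, folding everything past the limit into "Other"."""
    counts = Counter(values)
    if max_num_categories is None or len(counts) <= max_num_categories:
        return dict(counts)  # no limit, or nothing to fold
    # Assumption: the most frequent categories are the ones kept.
    top = counts.most_common(max_num_categories)
    binned: Dict[Hashable, int] = dict(top)
    remainder = sum(counts.values()) - sum(count for _, count in top)
    binned["Other"] = binned.get("Other", 0) + remainder
    return binned


importances = {"age": 0.6, "income": 0.3, "city": 0.07, "zip": 0.03}
# "city" is above the 0.05 threshold but is hidden by n_top_columns=2.
top_two = features_to_display(importances, n_top_columns=2)
# "zip" is within the top 4 but is hidden by the 0.05 threshold.
top_four = features_to_display(importances, n_top_columns=4)
# With a limit of 2, categories "c" and "d" are folded into "Other".
binned = bin_categories(["a"] * 5 + ["b"] * 3 + ["c"] * 2 + ["d"], max_num_categories=2)
```

The two calls to `features_to_display` show both directions of the AND: a feature can be hidden by either limit even when it passes the other.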

__init__(n_top_columns: int = 3, min_feature_importance: float = 0.05, max_num_categories: int = 10, sample_size: int = 10000, random_state: int = 42, test_size: float = 0.3)
__new__(*args, **kwargs)

Methods

WholeDatasetDrift.add_condition(name, ...)

Add new condition function to the check.

WholeDatasetDrift.add_condition_overall_drift_value_not_greater_than([...])

Add a condition requiring that the overall drift value is not greater than a given threshold.

WholeDatasetDrift.auc_to_drift_score(auc)

Calculate the drift score, which is 2*auc - 1, where auc is the AUC of the Domain Classifier.

WholeDatasetDrift.clean_conditions()

Remove all conditions from this check instance.

WholeDatasetDrift.conditions_decision(result)

Run conditions on given result.

WholeDatasetDrift.name()

Name of class in split camel case.

WholeDatasetDrift.params([show_defaults])

Return parameters to show when printing the check.

WholeDatasetDrift.remove_condition(index)

Remove given condition by index.

WholeDatasetDrift.run(train_dataset, ...[, ...])

Run check.

WholeDatasetDrift.run_logic(context)

Run check.
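The formula given for auc_to_drift_score can be written out as a small standalone function. This is a sketch of the stated formula only, not the library's method: `drift_score_from_auc` is a hypothetical name, and since the text above does not say how AUC values below 0.5 are handled, the sketch simply returns the raw value of 2*auc - 1.

```python
def drift_score_from_auc(auc: float) -> float:
    """Map a Domain Classifier AUC to a drift score using 2 * auc - 1."""
    if not 0.0 <= auc <= 1.0:
        raise ValueError("auc must be between 0 and 1")
    return 2 * auc - 1


# AUC of 0.5 (train and test are indistinguishable) maps to a score of 0.
no_drift = drift_score_from_auc(0.5)
# AUC of 1.0 (train and test are perfectly separable) maps to a score of 1.
full_drift = drift_score_from_auc(1.0)
# An intermediate AUC of 0.75 maps to a score of 0.5.
partial_drift = drift_score_from_auc(0.75)
```

The rescaling moves the chance-level AUC of 0.5 to zero, so the score reads as the amount of separability above random guessing.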

Example