Dataset

class Dataset

Dataset wraps a pandas DataFrame together with ML-related metadata.

The Dataset class contains additional data and methods for easily accessing the metadata relevant to training or validating an ML model.

Parameters
df : pd.DataFrame

A pandas DataFrame containing the data relevant to training or validating an ML model.

label : t.Union[Hashable, pd.Series, pd.DataFrame, np.ndarray], default: None

The label column, provided either as the name of an existing column in the DataFrame or as a label object holding the label data (a pandas Series/DataFrame or a numpy array) that will be concatenated to the data in the DataFrame. When label data is passed, the label name is set as follows:
- Series: takes the series name, or 'target' if the name is empty
- DataFrame: expects a single column in the dataframe and uses its name
- numpy: uses 'target'

features : t.Optional[t.Sequence[Hashable]], default: None

List of names for the feature columns in the DataFrame.

cat_features : t.Optional[t.Sequence[Hashable]], default: None

List of names for the categorical features in the DataFrame. To disable categorical feature inference, pass cat_features=[].

index_name : t.Optional[Hashable], default: None

Name of the index column in the dataframe. If set_index_from_dataframe_index is True and index_name is not None, index will be created from the dataframe index level with the given name. If index levels have no names, an int must be used to select the appropriate level by order.

set_index_from_dataframe_index : bool, default: False

If True, the index is created from the dataframe index instead of from the dataframe columns (the default). If index_name is None and the index is multilevel, the first level of the index is used.

datetime_name : t.Optional[Hashable], default: None

Name of the datetime column in the dataframe. If set_datetime_from_dataframe_index is True and datetime_name is not None, date will be created from the dataframe index level with the given name. If index levels have no names, an int must be used to select the appropriate level by order.

set_datetime_from_dataframe_index : bool, default: False

If True, the date is created from the dataframe index instead of from the dataframe columns (the default). If datetime_name is None and the index is multilevel, the first level of the index is used.

convert_datetime : bool, default: True

If True, the date is converted to datetime using pandas.to_datetime.

datetime_args : t.Optional[t.Dict], default: None

Arguments passed to pandas.to_datetime when converting the datetime column (see https://pandas.pydata.org/docs/reference/api/pandas.to_datetime.html for details).

max_categorical_ratio : float, default: 0.01

The maximum ratio of unique values in a column for it to be inferred as a categorical feature.

max_categories : int, default: 30

The maximum number of categories in a column in order for it to be inferred as a categorical feature.

max_float_categories : int, default: 5

The maximum number of categories in a float column in order for it to be inferred as a categorical feature.

label_type : str, default: None

Used to assume the target model type if it is not found on the model. Accepted values: 'classification_label', 'regression_label'. If None, the label type is inferred from the label using the is_categorical logic.
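The naming rules listed for the label parameter can be sketched in a few lines. The helper below, resolve_label_name, is a hypothetical name introduced only for illustration and is not part of the Dataset API; it also assumes that an "empty" Series name means None or an empty string.

```python
import numpy as np
import pandas as pd


def resolve_label_name(label):
    """Hypothetical helper: choose a label column name per the documented rules."""
    if isinstance(label, pd.Series):
        # Series: reuse its name, or fall back to 'target' when the name is empty.
        # (Assumption: "empty" means None or an empty string.)
        return "target" if label.name is None or label.name == "" else label.name
    if isinstance(label, pd.DataFrame):
        # DataFrame: exactly one column is expected, and its name is used.
        if label.shape[1] != 1:
            raise ValueError("label DataFrame must have exactly one column")
        return label.columns[0]
    if isinstance(label, np.ndarray):
        # numpy array: there is no name to reuse, so 'target' is used.
        return "target"
    # Anything else is treated as the name of an existing column.
    return label


print(resolve_label_name(pd.Series([0, 1], name="churn")))
print(resolve_label_name(pd.Series([0, 1])))
print(resolve_label_name(pd.DataFrame({"y": [0, 1]})))
print(resolve_label_name(np.array([0, 1])))
print(resolve_label_name("price"))
```

The last branch mirrors the case where label is simply the name of a column already present in the DataFrame.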

__init__(df: DataFrame, label: Optional[Union[Hashable, Series, DataFrame, ndarray]] = None, features: Optional[Sequence[Hashable]] = None, cat_features: Optional[Sequence[Hashable]] = None, index_name: Optional[Hashable] = None, set_index_from_dataframe_index: bool = False, datetime_name: Optional[Hashable] = None, set_datetime_from_dataframe_index: bool = False, convert_datetime: bool = True, datetime_args: Optional[Dict] = None, max_categorical_ratio: float = 0.01, max_categories: int = 30, max_float_categories: int = 5, label_type: Optional[str] = None)
__new__(*args, **kwargs)
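To make the three inference thresholds (max_categorical_ratio, max_categories, max_float_categories) more concrete, here is a minimal pure-Python sketch. The function looks_categorical is hypothetical, and the way it combines the thresholds (float columns checked only against max_float_categories, other columns against both the count and the ratio) is an assumption; the documentation above does not spell out the exact rule, so the library may behave differently.

```python
import math


def looks_categorical(values, max_categorical_ratio=0.01,
                      max_categories=30, max_float_categories=5):
    """Hypothetical sketch of threshold-based categorical inference."""
    # Ignore missing values (None and NaN) before counting.
    present = [
        v for v in values
        if v is not None and not (isinstance(v, float) and math.isnan(v))
    ]
    if not present:
        return False
    n_unique = len(set(present))
    if any(isinstance(v, float) for v in present):
        # Assumption: float columns are judged by the stricter count limit only.
        return n_unique <= max_float_categories
    # Assumption: other columns must satisfy both the count and the ratio limit.
    ratio = n_unique / len(present)
    return n_unique <= max_categories and ratio <= max_categorical_ratio


codes = [1, 2, 3] * 200           # 3 unique values in 600 rows
row_ids = list(range(600))        # every value is unique
few_floats = [0.5, 1.5] * 10      # 2 unique float values
print(looks_categorical(codes), looks_categorical(row_ids), looks_categorical(few_floats))
```

Under these assumptions a short column can fail the ratio test even with very few distinct values, which is one reason to pass cat_features explicitly for small datasets.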

Attributes

Dataset.cat_features

Return list of categorical feature names.

Dataset.classes

Return the classes from the label column as a sorted list.

Dataset.columns_info

Return the role and logical type of each column.

Dataset.data

Return the data of the dataset.

Dataset.datetime_col

Return the datetime column if it exists.

Dataset.datetime_name

If datetime column exists, return its name.

Dataset.features

Return list of feature names.

Dataset.index_col

Return index column.

Dataset.index_name

If index column exists, return its name.

Dataset.label_name

If label column exists, return its name.

Dataset.label_type

Return the label type.

Dataset.n_samples

Return number of samples in dataframe.

Methods

Dataset.copy(new_data)

Create a copy of this Dataset with new data.

Dataset.datasets_share_categorical_features(...)

Verify that all provided datasets share the same categorical features.

Dataset.datasets_share_date(*datasets)

Verify that all provided datasets share the same date column.

Dataset.datasets_share_features(*datasets)

Verify that all provided datasets share the same features.

Dataset.datasets_share_index(*datasets)

Verify that all provided datasets share the same index column.

Dataset.datasets_share_label(*datasets)

Verify that all provided datasets share the same label column.

Dataset.datetime_exist()

Return whether a datetime column is defined.

Dataset.ensure_not_empty_dataset(obj)

Verify Dataset or transform to Dataset.

Dataset.from_numpy(*args[, columns, label_name])

Create Dataset instance from numpy arrays.

Dataset.get_datetime_column_from_index(...)

Retrieve the datetime info from the index if _set_datetime_from_dataframe_index is True.

Dataset.index_exist()

Return whether an index column is defined.

Dataset.is_categorical(col_name)

Check if uniques are few enough to count as categorical.

Dataset.sample(n_samples[, replace, ...])

Create a copy of the dataset object, with the internal dataframe being a sample of the original dataframe.

Dataset.select([columns, ignore_columns, ...])

Filter dataset columns by given params.

Dataset.train_test_split([train_size, ...])

Split dataset into random train and test datasets.
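The behaviour described for get_datetime_column_from_index and the set_datetime_from_dataframe_index parameter (select an index level by name, or the first level when no name is given, then convert it) can be approximated with plain pandas. The function datetime_from_index below is a hypothetical stand-in written for illustration, not the library's implementation; it relies only on Index.get_level_values and pandas.to_datetime.

```python
import pandas as pd


def datetime_from_index(df, datetime_name=None, datetime_args=None):
    """Hypothetical helper: build a datetime column from a DataFrame index level."""
    # With no name given, fall back to the first index level.
    # An int also works here and selects the level by position.
    level = 0 if datetime_name is None else datetime_name
    values = df.index.get_level_values(level)
    # Forward any extra keyword arguments to pandas.to_datetime.
    converted = pd.to_datetime(values, **(datetime_args or {}))
    return pd.Series(converted, index=df.index, name="datetime")


index = pd.MultiIndex.from_arrays(
    [["2023-01-01", "2023-01-02"], ["a", "b"]], names=["day", "store"]
)
df = pd.DataFrame({"sales": [10, 12]}, index=index)

by_name = datetime_from_index(df, "day", {"format": "%Y-%m-%d"})
by_position = datetime_from_index(df)  # first level, since no name is given
print(by_name.tolist())
```

The same level-selection idea applies to index_name with set_index_from_dataframe_index, minus the datetime conversion.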