Dataset

class Dataset

Dataset wraps a pandas DataFrame together with ML-related metadata.

The Dataset class holds additional data and methods that give easy access to the metadata relevant for training or validating an ML model.

Parameters

df : pd.DataFrame
    A pandas DataFrame containing the data relevant for training or validating an ML model.
label : t.Union[Hashable, pd.Series, pd.DataFrame, np.ndarray], default: None
    The label column, given either as the name of an existing column in the DataFrame or as a label object holding the label data (a pandas Series/DataFrame or a numpy array) that will be concatenated to the data in the DataFrame. When label data is given, the label name is set as follows:
    - Series: the series name, or 'target' if the name is empty
    - DataFrame: a single column is expected, and its name is used
    - numpy array: 'target'
features : t.Optional[t.Sequence[Hashable]], default: None
    List of names of the feature columns in the DataFrame.
cat_features : t.Optional[t.Sequence[Hashable]], default: None
    List of names of the categorical features in the DataFrame. To disable categorical feature inference, pass cat_features=[].
index_name : t.Optional[Hashable], default: None
    Name of the index column in the DataFrame. If set_index_from_dataframe_index is True and index_name is not None, the index is created from the DataFrame index level with the given name. If the index levels have no names, an int must be used to select the appropriate level by position.
set_index_from_dataframe_index : bool, default: False
    If True, the index is created from the DataFrame index instead of from the DataFrame columns (the default). If index_name is None and the index is multilevel, the first level of the index is used.
datetime_name : t.Optional[Hashable], default: None
    Name of the datetime column in the DataFrame. If set_datetime_from_dataframe_index is True and datetime_name is not None, the datetime is created from the DataFrame index level with the given name. If the index levels have no names, an int must be used to select the appropriate level by position.
set_datetime_from_dataframe_index : bool, default: False
    If True, the datetime is created from the DataFrame index instead of from the DataFrame columns (the default). If datetime_name is None and the index is multilevel, the first level of the index is used.
convert_datetime : bool, default: True
    If True, the datetime column is converted to datetime using pandas.to_datetime.
datetime_args : t.Optional[t.Dict], default: None
    Arguments passed to pandas.to_datetime when converting the datetime column. See https://pandas.pydata.org/docs/reference/api/pandas.to_datetime.html for details.
max_categorical_ratio : float, default: 0.01
    The maximum ratio of unique values in a column for it to be inferred as a categorical feature.
max_categories : int, default: 30
    The maximum number of categories in a column for it to be inferred as a categorical feature.
max_float_categories : int, default: 5
    The maximum number of categories in a float column for it to be inferred as a categorical feature.
label_type : t.Optional[str], default: None
    Used to assume the target model type if it is not found on the model. Accepted values: 'classification_label', 'regression_label'. If None, the label type is inferred from the label using the is_categorical logic.
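
The label-naming rules described for the label parameter can be illustrated with a minimal pandas-only sketch. The helpers resolve_label_name and attach_label are hypothetical names introduced here for illustration and are not part of the Dataset API; the sketch also assumes the label data is attached by position, which the description above does not specify.

```python
import numpy as np
import pandas as pd

DEFAULT_LABEL_NAME = "target"


def resolve_label_name(label):
    """Hypothetical helper that mirrors the documented label-naming rules."""
    if isinstance(label, pd.Series):
        # Series: use its name, or fall back to 'target' when the name is empty.
        name = label.name
        return DEFAULT_LABEL_NAME if name is None or name == "" else name
    if isinstance(label, pd.DataFrame):
        # DataFrame: exactly one column is expected, and its name is used.
        if label.shape[1] != 1:
            raise ValueError("label DataFrame must contain exactly one column")
        return label.columns[0]
    if isinstance(label, np.ndarray):
        # numpy array: there is no name to reuse, so 'target' is used.
        return DEFAULT_LABEL_NAME
    # Anything else is treated as the name of an existing column.
    return label


def attach_label(df, label):
    """Return a copy of df with the label data added, plus the label name."""
    name = resolve_label_name(label)
    out = df.copy()
    if isinstance(label, (pd.Series, pd.DataFrame, np.ndarray)):
        # Assumption: values are attached by position, ignoring any index.
        out[name] = np.asarray(label).reshape(-1)
    return out, name


features = pd.DataFrame({"age": [31, 45, 27]})
with_label, label_name = attach_label(features, np.array([0, 1, 0]))
print(label_name, list(with_label.columns))
```

Passing a column name instead of label data leaves the frame unchanged and simply returns that name.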

__init__(df: DataFrame, label: Optional[Union[Hashable, Series, DataFrame, ndarray]] = None, features: Optional[Sequence[Hashable]] = None, cat_features: Optional[Sequence[Hashable]] = None, index_name: Optional[Hashable] = None, set_index_from_dataframe_index: bool = False, datetime_name: Optional[Hashable] = None, set_datetime_from_dataframe_index: bool = False, convert_datetime: bool = True, datetime_args: Optional[Dict] = None, max_categorical_ratio: float = 0.01, max_categories: int = 30, max_float_categories: int = 5, label_type: Optional[str] = None)
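
As a rough illustration of what set_index_from_dataframe_index and set_datetime_from_dataframe_index describe, the sketch below pulls a single level out of a DataFrame index with plain pandas. The helper column_from_index is a hypothetical name and not part of the Dataset API; it only shows how a level can be chosen by name, by integer position, or by defaulting to the first level, followed by a pandas.to_datetime conversion.

```python
import pandas as pd


def column_from_index(df, name=None):
    """Hypothetical helper: return one index level as a Series."""
    # With no name given, fall back to the first level (position 0).
    level = 0 if name is None else name
    values = df.index.get_level_values(level)
    return pd.Series(values.to_numpy(), index=df.index, name=values.name)


index = pd.MultiIndex.from_tuples(
    [("u1", "2021-01-01"), ("u2", "2021-01-02")], names=["user", "day"]
)
frame = pd.DataFrame({"x": [1, 2]}, index=index)

users = column_from_index(frame, "user")  # level selected by name
first = column_from_index(frame)          # first level by default
# Conversion comparable to what convert_datetime=True describes.
days = pd.to_datetime(column_from_index(frame, "day"))
```

When the index levels are unnamed, an integer such as 1 selects the level by position, matching the rule stated for index_name and datetime_name.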

__new__(*args, **kwargs)

Attributes

`Dataset.cat_features`	Return list of categorical feature names.
`Dataset.classes`	Return the classes from the label column as a sorted list.
`Dataset.columns_info`	Return the role and logical type of each column.
`Dataset.data`	Return the data of the dataset.
`Dataset.datetime_col`	Return the datetime column if it exists.
`Dataset.datetime_name`	If datetime column exists, return its name.
`Dataset.features`	Return list of feature names.
`Dataset.index_col`	Return index column.
`Dataset.index_name`	If index column exists, return its name.
`Dataset.label_name`	If label column exists, return its name.
`Dataset.label_type`	Return the label type.
`Dataset.n_samples`	Return the number of samples in the dataframe.

Methods

`Dataset.copy`(new_data)	Create a copy of this Dataset with new data.
`Dataset.datasets_share_categorical_features`(...)	Verify that all provided datasets share the same categorical features.
`Dataset.datasets_share_date`(*datasets)	Verify that all provided datasets share the same date column.
`Dataset.datasets_share_features`(*datasets)	Verify that all provided datasets share the same features.
`Dataset.datasets_share_index`(*datasets)	Verify that all provided datasets share the same index column.
`Dataset.datasets_share_label`(*datasets)	Verify that all provided datasets share the same label column.
`Dataset.datetime_exist`()	Return whether a datetime column is defined.
`Dataset.ensure_not_empty_dataset`(obj)	Verify Dataset or transform to Dataset.
`Dataset.from_numpy`(*args[, columns, label_name])	Create Dataset instance from numpy arrays.
`Dataset.get_datetime_column_from_index`(...)	Retrieve the datetime info from the index if _set_datetime_from_dataframe_index is True.
`Dataset.index_exist`()	Return whether an index column is defined.
`Dataset.is_categorical`(col_name)	Check if uniques are few enough to count as categorical.
`Dataset.sample`(n_samples[, replace, ...])	Create a copy of the dataset object, with the internal dataframe being a sample of the original dataframe.
`Dataset.select`([columns, ignore_columns, ...])	Filter dataset columns by given params.
`Dataset.train_test_split`([train_size, ...])	Split dataset into random train and test datasets.
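
To make the role of max_categorical_ratio, max_categories and max_float_categories more concrete, here is a minimal sketch of a uniqueness-based heuristic in the spirit of `Dataset.is_categorical`. The function looks_categorical is a hypothetical name, and the way the three thresholds are combined is an assumption made for illustration; the library's actual rule may differ.

```python
import pandas as pd
from pandas.api.types import is_float_dtype


def looks_categorical(
    column,
    max_categorical_ratio=0.01,
    max_categories=30,
    max_float_categories=5,
):
    """Hypothetical heuristic: are the unique values few enough?"""
    values = column.dropna()
    if values.empty:
        return False
    n_unique = values.nunique()
    if is_float_dtype(values):
        # Assumption: float columns are judged only by the float-specific cap.
        return n_unique <= max_float_categories
    # Assumption: other columns must satisfy both the count and the ratio cap.
    ratio = n_unique / len(values)
    return n_unique <= max_categories and ratio <= max_categorical_ratio


colors = pd.Series(["red", "blue"] * 500)  # 2 unique values in 1000 rows
ids = pd.Series(range(1000))               # every value is unique
print(looks_categorical(colors), looks_categorical(ids))
```

Under these assumptions a short column rarely passes the ratio test, which is one reason to pass cat_features explicitly when the data is small.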
