General preprocessing methods

class TSCC.preprocessing.general.Config

colname_target_det : str, optional

The column name of the target variable for detection models.

colname_target_corr : str, optional

The column name of the target variable for correction models.

colname_raw : str, optional

The column name of the raw data to be used as input.

colname_id : str, optional

The column name containing unique identifiers for each data point or entity.

colname_isErroneousPred : str, optional

The column name indicating whether a prediction considers an observation erroneous.

exclude_cols : list, optional

A list of column names to exclude from processing.

colname_isEvent : str, optional

The column name indicating whether an event occurred (used in event-based modeling).

sensortype : str, optional

Specifies the type of sensor used in data collection.

target_sensor_uncertainty : float, optional

The uncertainty associated with the target sensor measurement strategy.

frequency : str, optional

The frequency of the time series data.

det_ML_methods : list, optional

A list of machine learning methods for detection.

det_stat_methods : list, optional

A list of statistical methods for detection.

corr_ML_methods : list, optional

A list of machine learning methods for correction.

corr_stat_methods : list, optional

A list of statistical methods for correction.

cross_validation : bool, default False

Whether to perform cross-validation during model training.

train_test_IDs : list, optional

A list of IDs used to split the dataset into training and testing sets.

threshold_nrObs : int, default 10000

The threshold number of observations required to perform analysis.

detMethod_for_corr : str, optional

The detection method used for data correction.

random_state : int, default 1

The seed for random number generation to ensure reproducibility.

dataset_size : str, default "small"

The size of the dataset, used to optimize computation resources.

det_ML_model_savefolder : str, optional

The folder path to save detection machine learning models.

corr_ML_model_savefolder : str, optional

The folder path to save correction machine learning models.

Methods

update_frequency

__init__(colname_target_det=None, colname_target_corr=None, colname_raw=None, colname_id=None, colname_isErroneousPred=None, exclude_cols=None, colname_isEvent=None, sensortype=None, target_sensor_uncertainty=None, frequency=None, cross_validation=False, train_test_IDs=None, threshold_nrObs=10000, detMethod_for_corr=None, random_state=1, dataset_size='small', det_ML_model_savefolder=None, corr_ML_model_savefolder=None)

class TSCC.preprocessing.general.DataSetHandler

Handles splitting the data into training and test sets, as well as resampling operations.

Parameters:
- data (pd.DataFrame): The input DataFrame.
- test_size (float): The proportion of the dataset to include in the test split.
- random_state (int): Random seed for reproducibility.
Attributes:
df_train_list_fea : list of pandas DataFrame

List of training feature DataFrames.

df_train_list_tar : list of pandas DataFrame

List of training target DataFrames.

df_train_list_excl : list of pandas DataFrame

List of excluded columns in training DataFrames.

df_test_list_fea : list of pandas DataFrame

List of testing feature DataFrames.

df_test_list_tar : list of pandas DataFrame

List of testing target DataFrames.

df_test_list_excl : list of pandas DataFrame

List of excluded columns in testing DataFrames.

list_len : int

Length of the training feature DataFrame list.

Methods

append_col_to_df_results(series)

Append a new column to the DataFrame.

append_col_to_df_train_test(series)

Returns the list of training DataFrames for features with an additional column.

get_complete_dataset([exclude_columns])

Return the original data, optionally excluding specific columns.

get_test_features([exclude_columns])

Returns the list of testing DataFrames for features.

get_test_targets([extract_single_target])

Returns the list of testing DataFrames for targets.

get_train_features([exclude_columns])

Returns the list of training DataFrames for features.

get_train_targets([extract_single_target])

Returns the list of training DataFrames for targets.

getCharacteristics

preprocess_training_ds

__init__(df, config)

append_col_to_df_results(series)

Append a new column to the DataFrame. The new column should have a name.

append_col_to_df_train_test(series)

Returns the list of training DataFrames for features with an additional column.

Parameters:
series : pandas.Series

The series to be appended as a new column to the DataFrame.

Returns:
list of pandas DataFrame

A list of DataFrames for training with the additional column appended.

get_complete_dataset(exclude_columns=None)

Return the original data, optionally excluding specific columns.

Parameters: exclude_columns (list): List of columns to exclude from the original data.

get_test_features(exclude_columns=None)

Returns the list of testing DataFrames for features.

Returns:
list of pandas DataFrame

get_test_targets(extract_single_target=None)

Returns the list of testing DataFrames for targets.

Returns:
list of pandas DataFrame

get_train_features(exclude_columns=None)

Returns the list of training DataFrames for features.

Returns:
list of pandas DataFrame

get_train_targets(extract_single_target=None)

Returns the list of training DataFrames for targets.

Returns:
list of pandas DataFrame

TSCC.preprocessing.general.SMOGN(df_fea, df_tar, colname, colname_idx, rel_thres=0.8)

Apply the SMOGN (Synthetic Minority Over-sampling Technique for Regression with Gaussian Noise) to handle imbalanced regression datasets by oversampling both rare and extreme values.

Parameters:
df_fea : pandas DataFrame

The input feature dataframe containing the independent variables.

df_tar : pandas DataFrame

The target dataframe containing the dependent variable(s).

colname : str

The column name of the target variable that requires oversampling.

colname_idx : str

The column name that serves as the unique identifier for each instance in the data.

rel_thres : float, optional

The relevance threshold for identifying rare and extreme values, default is 0.8.

Returns:
smogn_fea : pandas DataFrame

The oversampled feature dataframe with rare and extreme values oversampled, indexed by colname_idx.

smogn_tar : pandas DataFrame

The oversampled target dataframe with rare and extreme values oversampled, indexed by colname_idx.

TSCC.preprocessing.general.SMOTE(df_fea, df_tar, colname_target, colname_id, exclude_columns, random_state=42)

Apply the SMOTE (Synthetic Minority Over-sampling Technique) to handle imbalanced datasets.

Parameters:
df_fea : pandas DataFrame

The input feature dataframe containing independent variables.

df_tar : pandas DataFrame

The target dataframe containing the dependent variable(s).

colname_target : str

The column name of the target variable that requires oversampling.

colname_id : str

The column name used as the unique identifier for each instance in the data.

exclude_columns : list

A list of column names to be excluded from the SMOTE process.

random_state : int, optional

The seed used by SMOTE for random number generation, default is 42.

Returns:
smote_fea : pandas DataFrame

The oversampled feature dataframe with the same structure as the original df_fea, indexed by colname_id.

smote_tar : pandas DataFrame

The oversampled target dataframe with the same structure as the original df_tar, indexed by colname_id.
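
The core step of SMOTE, creating a synthetic point on the line segment between a minority sample and one of its nearest minority neighbors, can be illustrated with a minimal sketch. It uses only the standard library and plain lists instead of DataFrames; the helper names `nearest_neighbors` and `smote_sample` are hypothetical and are not part of TSCC, whose internals may differ.

```python
import random

def nearest_neighbors(x, candidates, k):
    """Return the k candidates closest to x (squared Euclidean distance)."""
    def sq_dist(c):
        return sum((a - b) ** 2 for a, b in zip(x, c))
    others = [c for c in candidates if c is not x]
    return sorted(others, key=sq_dist)[:k]

def smote_sample(x, neighbor, rng):
    """Interpolate one synthetic point between x and a minority neighbor."""
    u = rng.random()  # interpolation weight in [0, 1)
    return [xi + u * (ni - xi) for xi, ni in zip(x, neighbor)]

rng = random.Random(42)  # fixed seed, analogous to random_state
minority = [[1.0, 2.0], [1.5, 1.8], [1.2, 2.4], [5.0, 5.0]]

x = minority[0]
neighbor = rng.choice(nearest_neighbors(x, minority, k=2))
synthetic = smote_sample(x, neighbor, rng)
print(synthetic)
```

Each coordinate of the synthetic point lies between the corresponding coordinates of the sample and the chosen neighbor, so the new point stays within the region already occupied by the minority class.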

TSCC.preprocessing.general.SMOTEwithCat(df_fea, df_tar, colname, random_state=42)

Apply the SMOTE-NC (Synthetic Minority Over-sampling Technique for Nominal and Continuous features) to handle imbalanced datasets that include both categorical and continuous features.

Parameters:
df_fea : pandas DataFrame

The input feature dataframe containing both categorical and continuous independent variables.

df_tar : pandas DataFrame

The target dataframe containing the dependent variable(s).

colname : str

The column name of the target variable that requires oversampling.

random_state : int, optional

The seed used by SMOTE-NC for random number generation, default is 42.

Returns:
smote_fea : pandas DataFrame

The oversampled feature dataframe with categorical and continuous features.

smote_tar : pandas DataFrame

The oversampled target dataframe.
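
SMOTE-NC differs from plain SMOTE in how a synthetic sample is assembled: continuous features are interpolated towards one neighbor, while each categorical feature takes the most frequent value among the k nearest neighbors. The following is a minimal sketch of that synthesis step only, assuming the neighbors have already been found; `smote_nc_sample` is a hypothetical helper and not a TSCC function.

```python
import random
from collections import Counter

def smote_nc_sample(x_num, neighbors, rng):
    """Build one synthetic sample from a minority sample and its neighbors.

    neighbors is a list of (numeric_features, categorical_features) pairs.
    """
    # Continuous part: interpolate towards one randomly chosen neighbor.
    n_num, _ = rng.choice(neighbors)
    u = rng.random()
    new_num = [a + u * (b - a) for a, b in zip(x_num, n_num)]
    # Categorical part: majority vote over all neighbors, per feature.
    cat_columns = zip(*(cats for _, cats in neighbors))
    new_cat = [Counter(col).most_common(1)[0][0] for col in cat_columns]
    return new_num, new_cat

rng = random.Random(42)
neighbors = [
    ([0.4, 12.0], ["rain", "A"]),
    ([0.1, 9.0], ["rain", "B"]),
    ([0.3, 11.0], ["snow", "A"]),
]
new_num, new_cat = smote_nc_sample([0.2, 10.0], neighbors, rng)
print(new_num, new_cat)
```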

TSCC.preprocessing.general.check_equidistant_minute_timestamps(df)

Check if the datetime index of a DataFrame has equidistant time stamps each minute.

Parameters: df (pd.DataFrame): DataFrame with a datetime index.

Returns: bool: True if time stamps are equidistant each minute, False otherwise.
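
The idea behind such a check can be sketched without pandas: a sequence of timestamps is equidistant at minute resolution if every pair of consecutive timestamps is exactly one minute apart. The function name `is_equidistant_by_minute` is hypothetical, and this sketch is an illustration of the concept, not the TSCC implementation.

```python
from datetime import datetime, timedelta

def is_equidistant_by_minute(timestamps):
    """Return True if consecutive timestamps are exactly one minute apart."""
    one_minute = timedelta(minutes=1)
    pairs = zip(timestamps, timestamps[1:])
    return all(later - earlier == one_minute for earlier, later in pairs)

start = datetime(2024, 1, 1, 0, 0)
regular = [start + timedelta(minutes=i) for i in range(5)]
with_gap = regular[:2] + regular[3:]  # the 00:02 timestamp is missing

print(is_equidistant_by_minute(regular))
print(is_equidistant_by_minute(with_gap))
```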

TSCC.preprocessing.general.check_equidistant_problems(df, frequency='T')

Check the datetime index of a DataFrame for problems that prevent it from being equidistant.

Parameters:
- df (pd.DataFrame): DataFrame with a datetime index.
- frequency (str): Frequency string (e.g., 'T' for minute, 'H' for hour).

Returns: dict: Dictionary with the problems found in the dataset.
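
The kinds of problems such a report might contain can be illustrated with a small standard-library sketch. The dictionary keys used here (`is_sorted`, `duplicates`, `missing`) and the function name are assumptions chosen for illustration; the keys actually returned by TSCC are not documented in this section.

```python
from collections import Counter
from datetime import datetime, timedelta

def find_equidistance_problems(timestamps, step):
    """Report ordering issues, duplicates, and gaps for a fixed step."""
    counts = Counter(timestamps)
    duplicates = sorted(t for t, n in counts.items() if n > 1)
    present = set(counts)
    missing = []
    if present:
        expected, last = min(present), max(present)
        # Walk the expected grid and record every absent timestamp.
        while expected < last:
            if expected not in present:
                missing.append(expected)
            expected += step
    return {
        "is_sorted": list(timestamps) == sorted(timestamps),
        "duplicates": duplicates,
        "missing": missing,
    }

start = datetime(2024, 1, 1)
minute = timedelta(minutes=1)
stamps = [start, start + minute, start + minute, start + 3 * minute]
problems = find_equidistance_problems(stamps, minute)
print(problems)
```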

TSCC.preprocessing.general.train_test_fea_tar_split(df_model, config, tar_columns)

Generate lists of train and test datasets. Each list contains a single entry if cross_validation is False.

Parameters:
df_model : pandas DataFrame

including all features, target, and supplementary columns

cross_validation : boolean

train_test_IDs

only possible when cross_validation is False

Returns:
train_list : list

list of train datasets

df_list : list

list of test datasets
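
An ID-based split into feature and target parts might look like the following sketch. It works on plain dictionaries rather than DataFrames and assumes that the supplied IDs mark the test set; the function `split_by_ids` and its parameters are hypothetical, and how TSCC interprets train_test_IDs is not specified here.

```python
def split_by_ids(rows, id_col, target_cols, test_ids):
    """Split rows into train/test by ID, then into features and targets."""
    train = [r for r in rows if r[id_col] not in test_ids]
    test = [r for r in rows if r[id_col] in test_ids]

    def features(part):
        return [{k: v for k, v in r.items() if k not in target_cols} for r in part]

    def targets(part):
        return [{k: r[k] for k in target_cols} for r in part]

    # Single-entry lists mirror the case without cross-validation.
    return [features(train)], [targets(train)], [features(test)], [targets(test)]

rows = [
    {"id": 1, "x": 0.1, "y": 0},
    {"id": 2, "x": 0.7, "y": 0},
    {"id": 3, "x": 4.2, "y": 1},
]
train_fea, train_tar, test_fea, test_tar = split_by_ids(rows, "id", ["y"], {3})
print(len(train_fea[0]), test_tar[0])
```

With cross-validation, each list would hold one entry per fold instead of a single entry.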

TSCC.preprocessing.general.undersampling_valrange(df_fea, df_tar, colname, undersampling_rate, val_range, inclusive_range='both')

Undersampling method for imbalanced data: undersamples the observations whose value in colname lies within val_range.

Parameters:
df : pandas data frame

Data frame to be undersampled

undersampling_rate : float

Undersampling rate, in the range [0, 1]

colname : string

Column name to be undersampled

val_range : list

List of min, max value range

inclusive_range : string

inclusive_range "both" means [val_range[0], val_range[1]]

Returns:
u_df_fea: pandas data frame

The undersampled feature dataframe.

u_df_tar: pandas data frame

The undersampled target dataframe.
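
Value-range undersampling can be sketched as follows: rows whose value falls inside the closed interval are thinned out, while all other rows are kept. The sketch assumes that undersampling_rate is the fraction of in-range rows to remove, which this section does not state explicitly; the function `undersample_value_range` is hypothetical and operates on plain dictionaries.

```python
import random

def undersample_value_range(rows, colname, undersampling_rate, val_range, rng):
    """Drop a fraction of the rows whose value lies in [val_range[0], val_range[1]]."""
    low, high = val_range
    in_range = [i for i, r in enumerate(rows) if low <= r[colname] <= high]
    n_drop = int(len(in_range) * undersampling_rate)
    dropped = set(rng.sample(in_range, n_drop))
    return [r for i, r in enumerate(rows) if i not in dropped]

rng = random.Random(1)
# Ten dry observations (value 0.0) dominate two wet ones (value 5.0).
rows = [{"rain": 0.0} for _ in range(10)] + [{"rain": 5.0} for _ in range(2)]
reduced = undersample_value_range(rows, "rain", 0.8, [0.0, 0.0], rng)
print(len(reduced))
```

In the documented function the same row selection would apply to df_fea and df_tar together, so that features and targets stay aligned.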