General preprocessing methods

class TSCC.preprocessing.general.Config

colname_target_det (str, optional): The column name of the target variable for detection models.
colname_target_corr (str, optional): The column name of the target variable for correction models.
colname_raw (str, optional): The column name of the raw data to be used as input.
colname_id (str, optional): The column name containing unique identifiers for each data point or entity.
colname_isErroneousPred (str, optional): The column name indicating whether a prediction considers an observation erroneous.
exclude_cols (list, optional): A list of column names to exclude from processing.
colname_isEvent (str, optional): The column name indicating whether an event occurred (used in event-based modeling).
sensortype (str, optional): Specifies the type of sensor used in data collection.
target_sensor_uncertainty (float, optional): The uncertainty associated with the target sensor measurement strategy.
frequency (str, optional): The frequency of the time series data.
det_ML_methods (list, optional): A list of machine learning methods for detection.
det_stat_methods (list, optional): A list of statistical methods for detection.
corr_ML_methods (list, optional): A list of machine learning methods for correction.
corr_stat_methods (list, optional): A list of statistical methods for correction.
cross_validation (bool, default False): Whether to perform cross-validation during model training.
train_test_IDs (list, optional): A list of IDs used to split the dataset into training and testing sets.
threshold_nrObs (int, default 10000): The threshold number of observations required to perform analysis.
detMethod_for_corr (str, optional): The detection method used for data correction.
random_state (int, default 1): The seed for random number generation to ensure reproducibility.
dataset_size (str, default “small”): The size of the dataset, used to optimize computation resources.
det_ML_model_savefolder (str, optional): The folder path to save detection machine learning models.
corr_ML_model_savefolder (str, optional): The folder path to save correction machine learning models.

Methods

update_frequency

__init__(colname_target_det=None, colname_target_corr=None, colname_raw=None, colname_id=None, colname_isErroneousPred=None, exclude_cols=None, colname_isEvent=None, sensortype=None, target_sensor_uncertainty=None, frequency=None, cross_validation=False, train_test_IDs=None, threshold_nrObs=10000, detMethod_for_corr=None, random_state=1, dataset_size='small', det_ML_model_savefolder=None, corr_ML_model_savefolder=None)
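
To make the documented defaults easier to scan, the following stand-in dataclass mirrors the `__init__` signature of `Config`. It is only a sketch and not the TSCC implementation: the class name `ConfigSketch` and the column names `value_raw` and `sensor_id` are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class ConfigSketch:
    """Hypothetical stand-in that mirrors the documented __init__ defaults."""

    colname_target_det: Optional[str] = None
    colname_target_corr: Optional[str] = None
    colname_raw: Optional[str] = None
    colname_id: Optional[str] = None
    colname_isErroneousPred: Optional[str] = None
    exclude_cols: Optional[List[str]] = None
    colname_isEvent: Optional[str] = None
    sensortype: Optional[str] = None
    target_sensor_uncertainty: Optional[float] = None
    frequency: Optional[str] = None
    cross_validation: bool = False
    train_test_IDs: Optional[list] = None
    threshold_nrObs: int = 10000
    detMethod_for_corr: Optional[str] = None
    random_state: int = 1
    dataset_size: str = "small"
    det_ML_model_savefolder: Optional[str] = None
    corr_ML_model_savefolder: Optional[str] = None


# Only the fields that differ from the defaults need to be passed.
cfg = ConfigSketch(colname_raw="value_raw", colname_id="sensor_id", frequency="min")
print(cfg.threshold_nrObs, cfg.random_state, cfg.dataset_size)
```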

class TSCC.preprocessing.general.DataSetHandler

Handles splitting a dataset into train and test sets as well as resampling operations.

Parameters:

- data (pd.DataFrame): The input DataFrame.
- test_size (float): The proportion of the dataset to include in the test split.
- random_state (int): Random seed for reproducibility.

Attributes:

df_train_list_fea (list of pandas DataFrame): List of training feature DataFrames.
df_train_list_tar (list of pandas DataFrame): List of training target DataFrames.
df_train_list_excl (list of pandas DataFrame): List of excluded columns in training DataFrames.
df_test_list_fea (list of pandas DataFrame): List of testing feature DataFrames.
df_test_list_tar (list of pandas DataFrame): List of testing target DataFrames.
df_test_list_excl (list of pandas DataFrame): List of excluded columns in testing DataFrames.
list_len (int): Length of the training feature DataFrame list.

Methods

`append_col_to_df_results`(series)	Append a new column to the DataFrame.
`append_col_to_df_train_test`(series)	Returns the list of training DataFrames for features with an additional column.
`get_complete_dataset`([exclude_columns])	Return the original data, optionally excluding specific columns.
`get_test_features`([exclude_columns])	Returns the list of testing DataFrames for features.
`get_test_targets`([extract_single_target])	Returns the list of testing DataFrames for targets.
`get_train_features`([exclude_columns])	Returns the list of training DataFrames for features.
`get_train_targets`([extract_single_target])	Returns the list of training DataFrames for targets.

getCharacteristics
preprocess_training_ds

__init__(df, config)

append_col_to_df_results(series)

Append a new column to the DataFrame. The new column should have a name.

append_col_to_df_train_test(series)

Returns the list of training DataFrames for features with an additional column.

Parameters:

series (pandas.Series): The series to be appended as a new column to the DataFrame.

Returns:

list of pandas DataFrame: A list of DataFrames for training with the additional column appended.

get_complete_dataset(exclude_columns=None)

Return the original data, optionally excluding specific columns.

Parameters:

exclude_columns (list, optional): List of columns to exclude from the original data.
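
The effect of `exclude_columns` can be pictured with plain pandas. The helper below is a hypothetical stand-in, not the method's actual body, and the column names are invented for the example.

```python
import pandas as pd


def complete_dataset_sketch(df, exclude_columns=None):
    """Return a copy of df without the columns named in exclude_columns."""
    if not exclude_columns:
        return df.copy()
    # drop() returns a new DataFrame, so the original stays untouched.
    return df.drop(columns=list(exclude_columns))


df = pd.DataFrame({"value": [1.0, 2.0], "flag": [0, 1], "note": ["a", "b"]})
trimmed = complete_dataset_sketch(df, exclude_columns=["note"])
print(list(trimmed.columns))
```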

get_test_features(exclude_columns=None)

Returns the list of testing DataFrames for features.

Returns:

list of pandas DataFrame

get_test_targets(extract_single_target=None)

Returns the list of testing DataFrames for targets.

Returns:

list of pandas DataFrame

get_train_features(exclude_columns=None)

Returns the list of training DataFrames for features.

Returns:

list of pandas DataFrame

get_train_targets(extract_single_target=None)

Returns the list of training DataFrames for targets.

Returns:

list of pandas DataFrame

TSCC.preprocessing.general.SMOGN(df_fea, df_tar, colname, colname_idx, rel_thres=0.8)

Apply the SMOGN (Synthetic Minority Over-sampling Technique for Regression with Gaussian Noise) to handle imbalanced regression datasets by oversampling both rare and extreme values.

Parameters:

df_fea (pandas DataFrame): The input feature dataframe containing the independent variables.
df_tar (pandas DataFrame): The target dataframe containing the dependent variable(s).
colname (str): The column name of the target variable that requires oversampling.
colname_idx (str): The column name that serves as the unique identifier for each instance in the data.
rel_thres (float, optional): The relevance threshold for identifying rare and extreme values, default is 0.8.

Returns:

smogn_fea (pandas DataFrame): The oversampled feature dataframe with rare and extreme values oversampled, indexed by colname_idx.
smogn_tar (pandas DataFrame): The oversampled target dataframe with rare and extreme values oversampled, indexed by colname_idx.

TSCC.preprocessing.general.SMOTE(df_fea, df_tar, colname_target, colname_id, exclude_columns, random_state=42)

Apply the SMOTE (Synthetic Minority Over-sampling Technique) to handle imbalanced datasets.

Parameters:

df_fea (pandas DataFrame): The input feature dataframe containing independent variables.
df_tar (pandas DataFrame): The target dataframe containing the dependent variable(s).
colname_target (str): The column name of the target variable that requires oversampling.
colname_id (str): The column name used as the unique identifier for each instance in the data.
exclude_columns (list): A list of column names to be excluded from the SMOTE process.
random_state (int, optional): The seed used by SMOTE for random number generation, default is 42.

Returns:

smote_fea (pandas DataFrame): The oversampled feature dataframe with the same structure as the original df_fea, indexed by colname_id.
smote_tar (pandas DataFrame): The oversampled target dataframe with the same structure as the original df_tar, indexed by colname_id.
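
For readers unfamiliar with SMOTE, its core idea is to place each synthetic sample on the line segment between a minority sample and one of its nearest minority neighbours. The NumPy sketch below illustrates only that interpolation step. It is not the code behind this function, and `smote_sketch`, `n_new`, and `k` are names chosen for the example.

```python
import numpy as np


def smote_sketch(X_minority, n_new, k=3, random_state=42):
    """Create n_new synthetic rows by interpolating between minority neighbours."""
    rng = np.random.default_rng(random_state)
    X = np.asarray(X_minority, dtype=float)
    # Pairwise Euclidean distances between all minority samples.
    dists = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    np.fill_diagonal(dists, np.inf)  # a sample is not its own neighbour
    k = min(k, len(X) - 1)
    neighbours = np.argsort(dists, axis=1)[:, :k]  # k nearest per sample
    synthetic = np.empty((n_new, X.shape[1]))
    for i in range(n_new):
        base = rng.integers(len(X))
        nb = neighbours[base, rng.integers(k)]
        gap = rng.random()  # position on the segment between the two points
        synthetic[i] = X[base] + gap * (X[nb] - X[base])
    return synthetic


X_min = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
X_new = smote_sketch(X_min, n_new=5)
print(X_new.shape)
```

Because every synthetic point is a convex combination of two existing minority points, the new samples stay inside the region the minority class already occupies.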

TSCC.preprocessing.general.SMOTEwithCat(df_fea, df_tar, colname, random_state=42)

Apply the SMOTE-NC (Synthetic Minority Over-sampling Technique for Nominal and Continuous features) to handle imbalanced datasets that include both categorical and continuous features.

Parameters:

df_fea (pandas DataFrame): The input feature dataframe containing both categorical and continuous independent variables.
df_tar (pandas DataFrame): The target dataframe containing the dependent variable(s).
colname (str): The column name of the target variable that requires oversampling.
random_state (int, optional): The seed used by SMOTE-NC for random number generation, default is 42.

Returns:

smote_fea (pandas DataFrame): The oversampled feature dataframe with categorical and continuous features.
smote_tar (pandas DataFrame): The oversampled target dataframe.

TSCC.preprocessing.general.check_equidistant_minute_timestamps(df)

Check whether the datetime index of a DataFrame consists of equidistant timestamps spaced exactly one minute apart.

Parameters: df (pd.DataFrame): DataFrame with a datetime index.

Returns: bool: True if time stamps are equidistant each minute, False otherwise.
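
One way to express such a check with pandas is to compare every consecutive index difference with a one-minute `Timedelta`. The function below is a minimal sketch under that assumption, not the library's implementation, and its name is hypothetical.

```python
import pandas as pd


def is_equidistant_minutely(df):
    """True if consecutive index timestamps are exactly one minute apart."""
    # diff() yields NaT for the first row, which dropna() removes.
    steps = df.index.to_series().diff().dropna()
    return bool((steps == pd.Timedelta(minutes=1)).all())


idx = pd.date_range("2024-01-01 00:00:00", periods=5, freq="min")
regular = pd.DataFrame({"value": range(5)}, index=idx)
gappy = regular.drop(idx[2])  # remove one timestamp to create a gap

print(is_equidistant_minutely(regular), is_equidistant_minutely(gappy))
```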

TSCC.preprocessing.general.check_equidistant_problems(df, frequency='T')

Check problems in a DataFrame regarding equidistant datetime index values.

Parameters:

df (pd.DataFrame): DataFrame with a datetime index.
frequency (str): Frequency string (e.g., ‘T’ for minute, ‘H’ for hour).

Returns: dict: Dictionary with the problems found in the dataset.
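
The kinds of problems such a check can surface are sketched below by comparing the index against a regular grid built with `pd.date_range`. The dictionary keys are hypothetical, since the keys the library actually returns are not listed here. The example passes `'min'` rather than `'T'` because recent pandas releases deprecate the `'T'` and `'H'` aliases in favour of `'min'` and `'h'`.

```python
import pandas as pd


def equidistant_problems_sketch(df, frequency="min"):
    """Compare the index with a regular grid spanning its first and last timestamp."""
    idx = df.index
    unique_idx = idx.unique()
    expected = pd.date_range(idx.min(), idx.max(), freq=frequency)
    return {
        "duplicated": idx[idx.duplicated()].tolist(),          # repeated timestamps
        "missing": expected.difference(unique_idx).tolist(),   # grid points without data
        "off_grid": unique_idx.difference(expected).tolist(),  # timestamps between grid points
        "is_sorted": bool(idx.is_monotonic_increasing),
    }


idx = pd.to_datetime([
    "2024-01-01 00:00:00", "2024-01-01 00:01:00", "2024-01-01 00:01:00",
    "2024-01-01 00:03:00", "2024-01-01 00:03:30", "2024-01-01 00:05:00",
])
problems = equidistant_problems_sketch(pd.DataFrame({"value": range(6)}, index=idx))
print(problems)
```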

TSCC.preprocessing.general.train_test_fea_tar_split(df_model, config, tar_columns)

Generate lists of train and test datasets. Each list contains a single entry if cross_validation is False.

Parameters:

df_model (pandas DataFrame): The dataframe including all features, targets, and supplementary columns.
cross_validation (boolean): …
train_test_IDs (…): only possible when cross_validation is False

Returns:

train_list (list): list of train datasets
df_list (list): list of test datasets
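
The single-split case (no cross-validation) can be pictured as follows. This sketch assumes the split is driven by a list of IDs reserved for the test set. The structure of `train_test_IDs` is not specified above, so the `test_ids` argument, the function name, and the column names are all hypothetical.

```python
import pandas as pd


def split_by_ids_sketch(df_model, colname_id, test_ids, tar_columns):
    """Single train/test split; each returned list holds exactly one entry."""
    is_test = df_model[colname_id].isin(test_ids)
    train, test = df_model[~is_test], df_model[is_test]
    # Everything that is not a target column is treated as a feature here.
    fea_columns = [c for c in df_model.columns if c not in tar_columns]
    return {
        "train_fea": [train[fea_columns]],
        "train_tar": [train[tar_columns]],
        "test_fea": [test[fea_columns]],
        "test_tar": [test[tar_columns]],
    }


df_model = pd.DataFrame({
    "sensor_id": ["a", "a", "b", "c"],
    "value_raw": [1.0, 2.0, 3.0, 4.0],
    "is_error": [0, 1, 0, 0],
})
parts = split_by_ids_sketch(df_model, "sensor_id", test_ids=["c"], tar_columns=["is_error"])
print(len(parts["train_fea"][0]), len(parts["test_fea"][0]))
```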

TSCC.preprocessing.general.undersampling_valrange(df_fea, df_tar, colname, undersampling_rate, val_range, inclusive_range='both')

Method for imbalanced data: undersample the observations whose value in colname lies within val_range.

Parameters:

df (pandas DataFrame): Data frame to be undersampled.
undersampling_rate (float): Ranges in [0, 1].
colname (str): Column name to be undersampled.
val_range (list): List of the min and max values of the range.
inclusive_range (str): Which bounds of the range are inclusive. “both” means [val_range[0], val_range[1]].

Returns:

u_df_fea (pandas DataFrame): The undersampled feature dataframe.
u_df_tar (pandas DataFrame): The undersampled target dataframe.
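
Value-range undersampling can be sketched with `Series.between`, whose `inclusive` argument accepts the same `"both"` value as `inclusive_range`. The sketch rests on two assumptions that the description above does not confirm: that `colname` is a column of `df_tar`, and that `undersampling_rate` is the fraction of in-range rows to remove. The function name, the `random_state` argument, and the column names are hypothetical.

```python
import pandas as pd


def undersample_valrange_sketch(df_fea, df_tar, colname, undersampling_rate,
                                val_range, inclusive_range="both", random_state=1):
    """Drop a fraction of the rows whose target value lies inside val_range."""
    in_range = df_tar[colname].between(val_range[0], val_range[1],
                                       inclusive=inclusive_range)
    # Assumption: undersampling_rate is the share of in-range rows removed.
    to_drop = df_tar[in_range].sample(frac=undersampling_rate,
                                      random_state=random_state).index
    # Drop the same rows from features and targets so they stay aligned.
    return df_fea.drop(index=to_drop), df_tar.drop(index=to_drop)


df_tar = pd.DataFrame({"rain": [0.0, 0.0, 0.0, 0.0, 5.0, 7.0]})
df_fea = pd.DataFrame({"temperature": [10, 11, 12, 13, 14, 15]})
u_df_fea, u_df_tar = undersample_valrange_sketch(
    df_fea, df_tar, "rain", undersampling_rate=0.5, val_range=[0.0, 0.0]
)
print(len(u_df_fea), len(u_df_tar))
```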