General preprocessing methods
- class TSCC.preprocessing.general.Config
- colname_target_det : str, optional
The column name of the target variable for detection models.
- colname_target_corr : str, optional
The column name of the target variable for correction models.
- colname_raw : str, optional
The column name of the raw data to be used as input.
- colname_id : str, optional
The column name containing unique identifiers for each data point or entity.
- colname_isErroneousPred : str, optional
The column name indicating whether a prediction considers an observation erroneous.
- exclude_cols : list, optional
A list of column names to exclude from processing.
- colname_isEvent : str, optional
The column name indicating whether an event occurred (used in event-based modeling).
- sensortype : str, optional
Specifies the type of sensor used in data collection.
- target_sensor_uncertainty : float, optional
The uncertainty associated with the target sensor measurement strategy.
- frequency : str, optional
The frequency of the time series data.
- det_ML_methods : list, optional
A list of machine learning methods for detection.
- det_stat_methods : list, optional
A list of statistical methods for detection.
- corr_ML_methods : list, optional
A list of machine learning methods for correction.
- corr_stat_methods : list, optional
A list of statistical methods for correction.
- cross_validation : bool, default False
Whether to perform cross-validation during model training.
- train_test_IDs : list, optional
A list of IDs used to split the dataset into training and testing sets.
- threshold_nrObs : int, default 10000
The threshold number of observations required to perform analysis.
- detMethod_for_corr : str, optional
The detection method used for data correction.
- random_state : int, default 1
The seed for random number generation to ensure reproducibility.
- dataset_size : str, default “small”
The size of the dataset, used to optimize computation resources.
- det_ML_model_savefolder : str, optional
The folder path to save detection machine learning models.
- corr_ML_model_savefolder : str, optional
The folder path to save correction machine learning models.
Methods
update_frequency
- __init__(colname_target_det=None, colname_target_corr=None, colname_raw=None, colname_id=None, colname_isErroneousPred=None, exclude_cols=None, colname_isEvent=None, sensortype=None, target_sensor_uncertainty=None, frequency=None, cross_validation=False, train_test_IDs=None, threshold_nrObs=10000, detMethod_for_corr=None, random_state=1, dataset_size='small', det_ML_model_savefolder=None, corr_ML_model_savefolder=None)
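As a rough illustration of how the documented defaults behave, the sketch below uses a hypothetical stand-in dataclass, ConfigSketch, that mirrors a subset of the __init__ signature above. It is not the TSCC class itself, and the example values ("value_raw", "sensor_id", "1min") are assumptions chosen for illustration only.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class ConfigSketch:
    """Hypothetical stand-in mirroring part of the documented Config signature."""
    colname_raw: Optional[str] = None
    colname_id: Optional[str] = None
    exclude_cols: Optional[List[str]] = None
    frequency: Optional[str] = None
    cross_validation: bool = False
    threshold_nrObs: int = 10000
    random_state: int = 1
    dataset_size: str = "small"


# Only the fields that differ from the documented defaults are passed.
cfg = ConfigSketch(colname_raw="value_raw", colname_id="sensor_id", frequency="1min")
print(cfg.threshold_nrObs, cfg.random_state, cfg.dataset_size)
```

Every argument in the documented signature has a default, so a configuration can presumably be built by naming only the fields that matter for a given dataset.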
- class TSCC.preprocessing.general.DataSetHandler
Handles splitting data into train and test sets as well as resampling operations.
- Parameters:
- data : pd.DataFrame
The input DataFrame.
- test_size : float
The proportion of the dataset to include in the test split.
- random_state : int
Random seed for reproducibility.
- Attributes:
- df_train_list_fea : list of pandas DataFrame
List of training feature DataFrames.
- df_train_list_tar : list of pandas DataFrame
List of training target DataFrames.
- df_train_list_excl : list of pandas DataFrame
List of training DataFrames containing the excluded columns.
- df_test_list_fea : list of pandas DataFrame
List of testing feature DataFrames.
- df_test_list_tar : list of pandas DataFrame
List of testing target DataFrames.
- df_test_list_excl : list of pandas DataFrame
List of testing DataFrames containing the excluded columns.
- list_len : int
Length of the training feature DataFrame list.
Methods
append_col_to_df_results(series): Append a new column to the DataFrame.
append_col_to_df_train_test(series): Returns the list of training DataFrames for features with an additional column.
get_complete_dataset([exclude_columns]): Return the original data, optionally excluding specific columns.
get_test_features([exclude_columns]): Returns the list of testing DataFrames for features.
get_test_targets([extract_single_target]): Returns the list of testing DataFrames for targets.
get_train_features([exclude_columns]): Returns the list of training DataFrames for features.
get_train_targets([extract_single_target]): Returns the list of training DataFrames for targets.
getCharacteristics
preprocess_training_ds
- append_col_to_df_results(series)
Append a new column to the DataFrame. The series should be named so that the new column has a name.
- append_col_to_df_train_test(series)
Returns the list of training DataFrames for features with an additional column.
- Parameters:
- series : pandas.Series
The series to be appended as a new column to the DataFrame.
- Returns:
- list of pandas DataFrame
A list of DataFrames for training with the additional column appended.
- get_complete_dataset(exclude_columns=None)
Return the original data, optionally excluding specific columns.
- Parameters:
- exclude_columns : list
List of columns to exclude from the original data.
- get_test_features(exclude_columns=None)
Returns the list of testing DataFrames for features.
- Returns:
- list of pandas DataFrame
- get_test_targets(extract_single_target=None)
Returns the list of testing DataFrames for targets.
- Returns:
- list of pandas DataFrame
- TSCC.preprocessing.general.SMOGN(df_fea, df_tar, colname, colname_idx, rel_thres=0.8)
Apply SMOGN (Synthetic Minority Over-sampling Technique for Regression with Gaussian Noise) to handle imbalanced regression datasets by oversampling both rare and extreme values.
- Parameters:
- df_fea : pandas DataFrame
The input feature dataframe containing the independent variables.
- df_tar : pandas DataFrame
The target dataframe containing the dependent variable(s).
- colname : str
The column name of the target variable that requires oversampling.
- colname_idx : str
The column name that serves as the unique identifier for each instance in the data.
- rel_thres : float, optional
The relevance threshold for identifying rare and extreme values, default is 0.8.
- Returns:
- smogn_fea : pandas DataFrame
The oversampled feature dataframe, indexed by colname_idx.
- smogn_tar : pandas DataFrame
The oversampled target dataframe, indexed by colname_idx.
- TSCC.preprocessing.general.SMOTE(df_fea, df_tar, colname_target, colname_id, exclude_columns, random_state=42)
Apply SMOTE (Synthetic Minority Over-sampling Technique) to handle imbalanced datasets.
- Parameters:
- df_fea : pandas DataFrame
The input feature dataframe containing independent variables.
- df_tar : pandas DataFrame
The target dataframe containing the dependent variable(s).
- colname_target : str
The column name of the target variable that requires oversampling.
- colname_id : str
The column name used as the unique identifier for each instance in the data.
- exclude_columns : list
A list of column names to be excluded from the SMOTE process.
- random_state : int, optional
The seed used by SMOTE for random number generation, default is 42.
- Returns:
- smote_fea : pandas DataFrame
The oversampled feature dataframe with the same structure as the original df_fea, indexed by colname_id.
- smote_tar : pandas DataFrame
The oversampled target dataframe with the same structure as the original df_tar, indexed by colname_id.
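The interpolation idea behind SMOTE can be sketched in plain Python: each synthetic sample lies on the line segment between a minority sample and one of its nearest minority neighbours. The function below is a minimal illustration, not the TSCC implementation; the name smote_sketch and the parameters n_new and k are hypothetical, and it works on tuples of numbers rather than DataFrames.

```python
import math
import random


def smote_sketch(minority, n_new, k=2, random_state=42):
    """Create n_new synthetic points by interpolating between a minority
    point and one of its k nearest minority neighbours (illustrative only)."""
    rng = random.Random(random_state)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest neighbours of the base point, excluding the point itself.
        neighbours = sorted(
            (p for p in minority if p is not base),
            key=lambda p: math.dist(p, base),
        )[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # position along the segment, in [0, 1)
        synthetic.append(
            tuple(b + gap * (n - b) for b, n in zip(base, neighbour))
        )
    return synthetic


minority = [(1.0, 1.0), (1.2, 0.9), (0.8, 1.1), (1.1, 1.3)]
new_points = smote_sketch(minority, n_new=3)
```

Because every synthetic point is a convex combination of two existing minority points, it stays inside the region already covered by the minority class.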
- TSCC.preprocessing.general.SMOTEwithCat(df_fea, df_tar, colname, random_state=42)
Apply SMOTE-NC (Synthetic Minority Over-sampling Technique for Nominal and Continuous features) to handle imbalanced datasets that include both categorical and continuous features.
- Parameters:
- df_fea : pandas DataFrame
The input feature dataframe containing both categorical and continuous independent variables.
- df_tar : pandas DataFrame
The target dataframe containing the dependent variable(s).
- colname : str
The column name of the target variable that requires oversampling.
- random_state : int, optional
The seed used by SMOTE-NC for random number generation, default is 42.
- Returns:
- smote_fea : pandas DataFrame
The oversampled feature dataframe with categorical and continuous features.
- smote_tar : pandas DataFrame
The oversampled target dataframe.
- TSCC.preprocessing.general.check_equidistant_minute_timestamps(df)
Check whether the datetime index of a DataFrame has equidistant timestamps at one-minute intervals.
- Parameters:
- df : pd.DataFrame
DataFrame with a datetime index.
- Returns:
- bool
True if the timestamps are equidistant at one-minute intervals, False otherwise.
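The kind of check described here can be sketched without the library. The helper below is a hypothetical stand-in that works on a plain list of datetime objects instead of a DataFrame index; how the TSCC function treats edge cases such as empty or unsorted indexes is not documented here, so that behaviour is not modelled.

```python
from datetime import datetime, timedelta


def is_equidistant_minutes(timestamps):
    """Return True if consecutive timestamps are exactly one minute apart."""
    one_minute = timedelta(minutes=1)
    return all(
        later - earlier == one_minute
        for earlier, later in zip(timestamps, timestamps[1:])
    )


start = datetime(2024, 1, 1, 0, 0)
regular = [start + timedelta(minutes=i) for i in range(5)]
with_gap = regular[:2] + regular[3:]  # drops the 00:02 timestamp

print(is_equidistant_minutes(regular))   # True
print(is_equidistant_minutes(with_gap))  # False
```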
- TSCC.preprocessing.general.check_equidistant_problems(df, frequency='T')
Check problems in a DataFrame regarding equidistant datetime index values.
- Parameters:
- df : pd.DataFrame
DataFrame with a datetime index.
- frequency : str
Frequency string (e.g., 'T' for minute, 'H' for hour).
- Returns:
- dict
Dictionary with the problems found in the dataset.
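To make the idea of a problem dictionary concrete, the sketch below reports three typical issues of a time index: duplicated timestamps, timestamps missing from the expected grid, and timestamps that fall between grid points. The function name and the dictionary keys ("duplicates", "missing", "off_grid") are assumptions for illustration; the keys actually returned by the TSCC function are not documented here. The sketch expects a non-empty list of datetime objects.

```python
from collections import Counter
from datetime import datetime, timedelta


def equidistant_problems_sketch(timestamps, step=timedelta(minutes=1)):
    """Report duplicates, missing grid points and off-grid timestamps."""
    counts = Counter(timestamps)
    first, last = min(timestamps), max(timestamps)
    # Expected grid from the first timestamp up to the last one.
    n_steps = (last - first) // step
    grid = [first + i * step for i in range(n_steps + 1)]
    return {
        "duplicates": sorted(t for t, c in counts.items() if c > 1),
        "missing": [t for t in grid if t not in counts],
        # A non-zero remainder means the timestamp lies between grid points.
        "off_grid": sorted(t for t in counts if (t - first) % step),
    }


start = datetime(2024, 1, 1)
stamps = [
    start,
    start + timedelta(minutes=1),
    start + timedelta(minutes=1),              # duplicate
    start + timedelta(minutes=3),              # minute 2 is missing
    start + timedelta(minutes=3, seconds=30),  # off the minute grid
]
problems = equidistant_problems_sketch(stamps)
```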
- TSCC.preprocessing.general.train_test_fea_tar_split(df_model, config, tar_columns)
Generate lists of train and test datasets. Each list has a single entry if cross_validation is False.
- Parameters:
- df_model : pandas DataFrame
Includes all features, targets, and supplementary columns.
- cross_validation : bool
…
- train_test_IDs : …
Only possible when cross_validation is False.
- Returns:
- train_list : list
List of train datasets.
- df_list : list
List of test datasets.
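The shape of the returned lists can be illustrated with a small sketch: one train/test pair when cross-validation is off, and one pair per fold when it is on. Everything in it is hypothetical, including the function name, the test_ids and n_folds parameters, the round-robin fold assignment, and the reading of the ID list as the IDs held out for testing. It operates on a list of dictionaries rather than a DataFrame and is not the TSCC implementation.

```python
def split_by_ids_sketch(rows, id_key, cross_validation=False,
                        test_ids=None, n_folds=3):
    """Return (train_list, test_list); one entry per list without
    cross-validation, n_folds entries with it."""
    ids = sorted({row[id_key] for row in rows})
    if cross_validation:
        # Round-robin assignment of IDs to folds; each fold is the test set once.
        test_id_sets = [set(ids[i::n_folds]) for i in range(n_folds)]
    else:
        # Assumption: the given IDs are the ones held out for testing.
        test_id_sets = [set(test_ids or [])]
    train_list, test_list = [], []
    for test_set in test_id_sets:
        train_list.append([r for r in rows if r[id_key] not in test_set])
        test_list.append([r for r in rows if r[id_key] in test_set])
    return train_list, test_list


rows = [{"sensor_id": i % 6, "value": float(i)} for i in range(12)]
train_single, test_single = split_by_ids_sketch(rows, "sensor_id", test_ids=[4, 5])
train_cv, test_cv = split_by_ids_sketch(rows, "sensor_id", cross_validation=True)
```

Keeping the return type a list in both cases lets downstream code loop over folds without a special case for the single-split setting.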
- TSCC.preprocessing.general.undersampling_valrange(df_fea, df_tar, colname, undersampling_rate, val_range, inclusive_range='both')
Undersampling method for imbalanced data.
- Parameters:
- df : pandas DataFrame
Data frame to be undersampled.
- undersampling_rate : float
Undersampling rate in the range [0, 1].
- colname : str
Name of the column to be undersampled.
- val_range : list
Value range given as [min, max].
- inclusive_range : str
Which bounds of val_range are inclusive; "both" means the closed interval [val_range[0], val_range[1]].
- Returns:
- u_df_fea : pandas DataFrame
The undersampled feature dataframe.
- u_df_tar : pandas DataFrame
The undersampled target dataframe.
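One plausible reading of value-range undersampling is sketched below: rows whose value in colname falls inside val_range are thinned out at random, while all other rows are kept. Several points are assumptions rather than documented behaviour: undersampling_rate is taken to be the fraction of in-range rows that is dropped, the inclusive_range options other than "both" ("left", "right", "neither") are illustrative additions, and the sketch works on one list of dictionaries instead of paired feature and target DataFrames.

```python
import random


def in_range(value, val_range, inclusive_range="both"):
    """Check whether value lies in val_range, honouring the inclusive bounds."""
    low, high = val_range
    left_ok = value >= low if inclusive_range in ("both", "left") else value > low
    right_ok = value <= high if inclusive_range in ("both", "right") else value < high
    return left_ok and right_ok


def undersample_sketch(rows, colname, undersampling_rate, val_range,
                       inclusive_range="both", random_state=1):
    """Drop a fraction undersampling_rate of the rows whose value is in range."""
    rng = random.Random(random_state)
    hits = [i for i, row in enumerate(rows)
            if in_range(row[colname], val_range, inclusive_range)]
    n_drop = int(len(hits) * undersampling_rate)
    dropped = set(rng.sample(hits, n_drop))
    return [row for i, row in enumerate(rows) if i not in dropped]


# Eight zero readings dominate the data; half of them are removed.
rows = [{"value": v} for v in [0.0] * 8 + [2.5, 7.1]]
kept = undersample_sketch(rows, "value", undersampling_rate=0.5,
                          val_range=[0.0, 0.0])
```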