Machine learning error correction
- class TSCC.correction.ml.ML_byGAIN_DW
This class implements a model for data imputation using the GAIN (Generative Adversarial Imputation Nets) framework by jsyoon0823/GAIN. The model is designed to fill in missing values in datasets by leveraging a deep learning approach.
Currently only the fit_transform step is implemented, so a new model is also trained when a test dataset is imputed.
- Parameters:
- gain_parameters: dict, optional
A dictionary containing the parameters for the GAIN model, including ‘batch_size’, ‘hint_rate’, ‘alpha’, and ‘iterations’. Default parameters will be used if not provided.
- alpha: float, default=1.0
The weight for the dense loss function, influencing the training dynamics.
- exclude_cols: list, optional
A list of columns to exclude from the training process. Not implemented in the current version.
Methods
- load_model(path): Load the model from a file.
- save_model(path): Save the model to a file.
- fit
- predict
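The `gain_parameters` dictionary described above could be assembled as in the following sketch. Only the four key names come from the parameter description; the helper `resolve_gain_parameters` and the numeric values are hypothetical placeholders, not TSCC's confirmed defaults.

```python
# Placeholder values for illustration only; TSCC's real defaults may differ.
ILLUSTRATIVE_DEFAULTS = {
    "batch_size": 128,
    "hint_rate": 0.9,
    "alpha": 100,
    "iterations": 10000,
}


def resolve_gain_parameters(gain_parameters=None):
    """Merge user-supplied GAIN settings with the illustrative defaults."""
    params = dict(ILLUSTRATIVE_DEFAULTS)
    if gain_parameters:
        # Reject keys that the parameter description does not mention.
        unknown = set(gain_parameters) - set(ILLUSTRATIVE_DEFAULTS)
        if unknown:
            raise KeyError(f"Unknown GAIN parameter(s): {sorted(unknown)}")
        params.update(gain_parameters)
    # The hint rate is a probability, so it must lie in [0, 1].
    if not 0.0 <= params["hint_rate"] <= 1.0:
        raise ValueError("hint_rate must lie in [0, 1]")
    return params


# Override two settings and keep the remaining placeholders.
custom = resolve_gain_parameters({"batch_size": 64, "iterations": 2000})
```

Validating the keys up front keeps a misspelled setting from being silently ignored in favor of a default.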
- class TSCC.correction.ml.ML_byMissForest
This class implements the MissForest algorithm for imputation of missing values using Random Forests. It is designed to handle datasets with missing entries, providing a mechanism to fill in these gaps effectively.
- Parameters:
- max_features: int, default=100
The maximum number of features to consider when looking for the best split at each node in the forest. This can help in controlling overfitting and improving computational efficiency.
Methods
- fit(df_fea, df_tar, df_flag_erroneous, config): Fits the MissForest model to the training data for imputation of missing values.
- load_model(path): Load the model from a file.
- predict(df_fea, df_flag_erroneous, config): Imputes missing values in the input feature DataFrame using the fitted MissForest model.
- save_model(path): Save the model to a file.
- __init__(max_features=100)
Initializes the ML_byMissForest class.
- Parameters:
- n_estimators: int, default=100
The number of trees in the forest.
- random_state: int, RandomState instance or None, default=None
Controls the randomness of the estimator.
- fit(df_fea, df_tar, df_flag_erroneous, config)
Fits the MissForest model to the training data for imputation of missing values.
This method prepares the feature DataFrame for training the MissForest model. It replaces missing values based on the specified criteria and fits the model to the non-empty feature columns.
- Parameters:
- df_fea: pandas DataFrame
The input features with missing values that need to be imputed.
- df_tar: pandas DataFrame
The target values (not used in MissForest but can be included for consistency).
- df_flag_erroneous: pandas Series
A boolean Series indicating which rows contain erroneous or missing values.
- config: object
Configuration object containing information about the dataset, including the name of the column to be predicted.
- Returns:
- None
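The preparation step that `fit` performs is described only in outline above. The following is a minimal sketch of one plausible reading, assuming flagged entries of the target column are masked as NaN and columns without any observed value are dropped. The helper `prepare_features`, the argument `target_col`, and the sample columns are hypothetical and not part of TSCC.

```python
import numpy as np
import pandas as pd


def prepare_features(df_fea, df_flag_erroneous, target_col):
    """Return a copy with flagged target values set to NaN and empty columns dropped."""
    prepared = df_fea.copy()
    # Treat values flagged as erroneous like missing values.
    prepared.loc[df_flag_erroneous, target_col] = np.nan
    # Keep only columns that contain at least one observed value.
    return prepared.dropna(axis=1, how="all")


df_fea = pd.DataFrame(
    {
        "level": [1.0, 2.0, 50.0, 4.0],   # third value is an implausible spike
        "rain": [0.1, 0.0, 0.3, 0.2],
        "unused": [np.nan] * 4,           # column without any observation
    }
)
flags = pd.Series([False, False, True, False])

prepared = prepare_features(df_fea, flags, target_col="level")
```

Working on a copy leaves the caller's DataFrame unchanged, so the original values stay available for comparison after imputation.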
- load_model(path)
Load the model from a file.
- Parameters:
- path: str
The path to the model file.
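The serialization format behind `save_model` and `load_model` is not stated here. As a minimal sketch, assuming plain pickle-based persistence (an assumption, not TSCC's confirmed behavior), a save and load round trip could look like this. The free functions and the stand-in model object are hypothetical.

```python
import os
import pickle
import tempfile


def save_model(model, path):
    """Serialize a model object to the given file path."""
    with open(path, "wb") as fh:
        pickle.dump(model, fh)


def load_model(path):
    """Restore a model object from the given file path."""
    with open(path, "rb") as fh:
        return pickle.load(fh)


# A plain dict stands in for a fitted model in this sketch.
model = {"max_features": 100}

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "model.pkl")
    save_model(model, path)
    restored = load_model(path)
```

Pickle files can execute arbitrary code when loaded, so only model files from trusted sources should be opened this way.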
- predict(df_fea, df_flag_erroneous, config)
Imputes missing values in the input feature DataFrame using the fitted MissForest model.
This method predicts values for the specified column in the DataFrame where the original values are missing (indicated by the df_flag_erroneous). It retains the original values where valid and fills in NaN where the values are flagged as erroneous.
- Parameters:
- df_fea: pandas DataFrame
The input features with missing values to be imputed.
- df_flag_erroneous: str
The name of a column in df_fea that holds a boolean Series indicating which rows contain erroneous or missing values.
- config: object
Configuration object that contains information about the dataset, including the name of the column to be predicted.
- Returns:
- pandas Series
A Series containing the predicted values for the specified column, with NaN where the original values were not valid.
- class TSCC.correction.ml.ML_byRF
This class implements a Random Forest model for regression tasks using the RandomForestRegressor from scikit-learn.
- Parameters:
- n_estimators: int, default=100
The number of trees in the forest, influencing the model’s complexity and performance.
- random_state: int, RandomState instance or None, default=None
Controls the randomness of the estimator for reproducibility.
- max_depth: int, default=15
The maximum depth of the trees. It helps prevent overfitting by limiting how deep the trees can grow.
Methods
- fit(df_fea, df_tar, df_flag_erroneous, config): Fits the random forest model to the data.
- load_model(path): Load the model from a file.
- predict(df_fea, df_flag_erroneous, config): Predicts target values for the input features.
- save_model(path): Save the model to a file.
- __init__(n_estimators=100, random_state=None, max_depth=15)
Initializes the ML_byRF class with a random forest model.
- Parameters:
- n_estimators: int, default=100
The number of trees in the forest.
- random_state: int, RandomState instance or None, default=None
Controls the randomness of the estimator.
- max_depth: int, default=15
The maximum depth of the trees.
- fit(df_fea, df_tar, df_flag_erroneous, config)
Fits the random forest model to the data. NaN values are not supported.
- Parameters:
- df_fea: pandas DataFrame
The training input samples.
- df_tar: pandas DataFrame
The target values.
- Returns:
- decision_clf: the fitted random forest model object
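Because the fit step does not accept NaN values, incomplete rows have to be removed or imputed beforehand. The sketch below uses scikit-learn's RandomForestRegressor, which the class description names, directly with the documented hyperparameters. It is not TSCC's own code; the sample data, the column names, and the row-filtering strategy are assumptions for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

# Hypothetical training data with one incomplete feature row.
df_fea = pd.DataFrame(
    {
        "x1": [0.0, 1.0, 2.0, 3.0, np.nan, 5.0],
        "x2": [1.0, 0.0, 1.0, 0.0, 1.0, 0.0],
    }
)
df_tar = pd.Series([0.0, 2.0, 4.0, 6.0, 8.0, 10.0], name="y")

# Keep only rows in which every feature and the target are observed.
complete = df_fea.notna().all(axis=1) & df_tar.notna()

# Hyperparameters follow the documented defaults; the seed is fixed for reproducibility.
model = RandomForestRegressor(n_estimators=100, random_state=0, max_depth=15)
model.fit(df_fea[complete], df_tar[complete])

predictions = model.predict(df_fea[complete])
```

Dropping incomplete rows is the simplest option; imputing them first would keep more training data at the cost of an extra modeling step.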
- load_model(path)
Load the model from a file.
- Parameters:
- path: str
The path to the model file.