Machine learning error detection
- class TSCC.detection.ml.ML_byIsolationForest
A class that encapsulates the functionality of an Isolation Forest model for detecting anomalies and provides methods for fitting the model, predicting outliers, and saving/loading the trained model.
The class is designed for unsupervised anomaly detection in time series or tabular data. It leverages the Isolation Forest algorithm, which isolates observations by random recursive partitioning in an ensemble of trees; anomalies tend to be isolated in fewer splits than normal points. Model parameters such as the number of estimators and the contamination rate are configurable.
- Parameters:
- n_estimators: int, default=100
The number of base estimators (trees) in the Isolation Forest ensemble.
- max_samples: int, float, or 'auto', default='auto'
The number of samples to draw from the input data to train each base estimator.
- contamination: 'auto' or float, default='auto'
The expected proportion of outliers in the dataset. If 'auto', the decision threshold is determined automatically instead of from a fixed proportion.
- random_state: int, RandomState instance, or None, default=None
Controls the randomness of the algorithm for reproducibility.
- Attributes:
- model: IsolationForest
The underlying Isolation Forest model used for anomaly detection.
Methods
fit(df_fea, df_tar, config)
Fits the Isolation Forest model to the provided feature data.
predict(df_fea)
Predicts whether each observation in the input dataframe is an anomaly.
save_model(path)
Saves the trained Isolation Forest model to a specified file path.
load_model(path)
Loads a previously saved Isolation Forest model from a file.
- __init__(n_estimators=100, max_samples='auto', contamination='auto', random_state=None)
Initialize the IsolationForest model.
- fit(df_fea, df_tar=None, config=None)
Fits the Isolation Forest model to the input feature data.
This method trains the Isolation Forest model on the provided dataset, identifying patterns in the normal data points and learning how to detect anomalies. It requires that the input data contains no NaN values, as the model cannot handle missing data.
- Parameters:
- df_fea: pandas DataFrame or pandas Series
The feature data on which to train the model. This data should not contain any missing values (NaN).
- df_tar: pandas DataFrame or pandas Series, optional
Target data. Not used in unsupervised anomaly detection; accepted only for API consistency. Default is None.
- config: object, optional
Configuration object. Not used in unsupervised anomaly detection; accepted only for API consistency. Default is None.
- Returns:
- None
This method fits the Isolation Forest model and updates the model’s internal state.
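As a rough illustration of the fitting behaviour described above, the sketch below trains scikit-learn's IsolationForest directly on a small synthetic DataFrame after checking for NaN values. It assumes that ML_byIsolationForest forwards its constructor parameters to that estimator; the synthetic data, the column names, and the explicit NaN check are illustrative and not taken from TSCC.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import IsolationForest

# Synthetic feature frame (illustrative): 200 ordinary points plus 5 distant ones.
rng = np.random.default_rng(0)
normal = rng.normal(loc=0.0, scale=1.0, size=(200, 2))
distant = rng.normal(loc=8.0, scale=0.5, size=(5, 2))
df_fea = pd.DataFrame(np.vstack([normal, distant]), columns=["f1", "f2"])

# The model cannot handle missing values, so fail early if any are present.
if df_fea.isna().any().any():
    raise ValueError("df_fea must not contain NaN values")

# Same parameter names and defaults as documented for the wrapper class.
model = IsolationForest(
    n_estimators=100,
    max_samples="auto",
    contamination="auto",
    random_state=0,
)
model.fit(df_fea)

# scikit-learn labels inliers as +1 and outliers as -1.
labels = pd.Series(model.predict(df_fea), index=df_fea.index)
```

The +1/-1 encoding shown here is scikit-learn's; how the wrapper's predict method encodes its output is not specified in this reference.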
- load_model(path)
Load the model from a file.
- Parameters:
- path: str
The path to the model file.
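The serialization format used by save_model and load_model is not documented here. A minimal sketch of one common approach for scikit-learn estimators, assuming joblib is used (an assumption, not confirmed TSCC behaviour), round-trips a fitted model through a file. The file name is hypothetical.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.ensemble import IsolationForest

# Fit a small model on synthetic data so there is something to persist.
X = np.random.default_rng(0).normal(size=(100, 3))
model = IsolationForest(random_state=0).fit(X)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "isolation_forest.joblib")  # hypothetical file name
    joblib.dump(model, path)      # roughly what a save_model(path) could do
    restored = joblib.load(path)  # roughly what a load_model(path) could do

# The restored estimator should reproduce the original predictions exactly.
same_predictions = bool((model.predict(X) == restored.predict(X)).all())
```

Files written this way should only be loaded from trusted sources, since joblib relies on pickle.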
- class TSCC.detection.ml.ML_byMLP
A class for building and using a Multi-Layer Perceptron (MLP) classifier for supervised machine learning tasks.
This class utilizes the MLPClassifier from the scikit-learn library to model data using a feedforward neural network. The architecture of the network, including the number of hidden layers and the activation function, can be customized through the constructor parameters.
- Parameters:
- hidden_layer_sizes: tuple, optional, default=(100,)
Defines the number of neurons in each hidden layer. The length of the tuple indicates the number of hidden layers.
- activation: {'identity', 'logistic', 'tanh', 'relu'}, optional, default='relu'
The activation function for the hidden layers.
- solver: {'lbfgs', 'sgd', 'adam'}, optional, default='adam'
The algorithm used for weight optimization.
- alpha: float, optional, default=0.0001
Strength of the L2 regularization term, which helps prevent overfitting.
- batch_size: int or 'auto', optional, default='auto'
The size of minibatches for stochastic optimizers.
- learning_rate: {'constant', 'invscaling', 'adaptive'}, optional, default='constant'
The learning rate schedule for weight updates.
- learning_rate_init: float, optional, default=0.001
The initial learning rate.
- max_iter: int, optional, default=200
The maximum number of iterations for training.
- random_state: int, RandomState instance or None, optional, default=None
Controls the random seed for reproducibility.
Methods
fit(df_fea, df_tar, config):
Trains the MLP model on the provided feature data and target labels.
predict(df_fea):
Predicts class labels for new input samples based on the trained model.
save_model(path):
Saves the trained model to a specified file path.
load_model(path):
Loads a trained model from a specified file path.
- __init__(hidden_layer_sizes=(100,), activation='relu', solver='adam', alpha=0.0001, batch_size='auto', learning_rate='constant', learning_rate_init=0.001, max_iter=200, random_state=None)
Initialize the MLPClassifier model.
- fit(df_fea, df_tar, config=None)
Trains the Multi-Layer Perceptron (MLP) classifier using the provided feature data and target labels.
- Parameters:
- df_fea: pandas DataFrame
The input feature data used for training the model, where each row represents a sample and each column represents a feature.
- df_tar: pandas Series or array-like
The target labels corresponding to the input feature data. Each value indicates the class label for the respective sample.
- config: object, optional
Configuration object for any additional settings (not currently used in this method).
- load_model(path)
Load the model from a file.
- Parameters:
- path: str
The path to the model file.
- predict(df_fea)
Predicts class labels for the given input feature data using the trained MLP classifier.
This method uses the MLP model previously fitted on this instance to generate predictions.
- Parameters:
- df_fea: pandas DataFrame
The input feature data for which predictions are to be made. Each row should represent a sample, and each column should represent a feature.
- Returns:
- pandas Series
A Series containing the predicted class labels for each input sample. The values are cast to float for compatibility with other numerical operations.
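The fit and predict flow described above can be sketched with scikit-learn's MLPClassifier directly. This is an illustration under assumptions, not the TSCC implementation: the synthetic data and column names are made up, max_iter is raised above the documented default of 200 so the small example converges, and keeping the input index on the result is assumed rather than stated for this class.

```python
import numpy as np
import pandas as pd
from sklearn.neural_network import MLPClassifier

# Synthetic, linearly separable two-class problem (illustrative only).
rng = np.random.default_rng(0)
df_fea = pd.DataFrame(rng.normal(size=(300, 4)), columns=["a", "b", "c", "d"])
df_tar = (df_fea["a"] + df_fea["b"] > 0).astype(int)

# Parameter names mirror the constructor documented above.
clf = MLPClassifier(
    hidden_layer_sizes=(100,),
    activation="relu",
    solver="adam",
    alpha=0.0001,
    learning_rate_init=0.001,
    max_iter=500,  # higher than the documented default, to help convergence here
    random_state=0,
)
clf.fit(df_fea, df_tar)

# Wrap the predictions in a Series and cast to float, as the Returns entry describes.
pred = pd.Series(clf.predict(df_fea), index=df_fea.index).astype(float)
```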
- class TSCC.detection.ml.ML_byOneClassSVM
A class that encapsulates the functionality of a OneClassSVM model for anomaly detection and provides methods for reducing training samples, fitting the model, and making predictions.
The class is designed to handle large datasets and includes functionality for reducing the number of training samples by undersampling based on conditions such as sensor type and the distribution of target values. It uses a OneClassSVM model to detect anomalies in time series data.
- Parameters:
- threshold_nrObs: int, default=10000
The maximum number of observations to be used for training. If the dataset exceeds this threshold, undersampling or random sampling is applied.
- Attributes:
- model: OneClassSVM
The OneClassSVM model used for anomaly detection.
Methods
reduce_training_samples(df_fea, gt_series_num, config)
Reduces the number of training samples based on ground truth values, sensor type, and predefined thresholds.
fit(df_fea, df_tar, config)
Fits the OneClassSVM model to the input features.
predict(df_fea)
Predicts whether each observation in the input dataframe is an anomaly.
save_model(path)
Saves the trained OneClassSVM model to a specified path.
load_model(path)
Loads a previously saved OneClassSVM model from a file.
- __init__(threshold_nrObs=10000)
Initializes the ML_byOneClassSVM class with a default threshold for the number of observations.
- fit(df_fea, df_tar=None, config=None)
Fits the OneClassSVM model to the provided dataframe.
- Parameters:
- df_fea: pandas DataFrame
- load_model(path)
Load the model from a file.
- Parameters:
- path: str
The path to the model file.
- predict(df_fea)
Predicts, for each row of the provided dataframe, whether it is an anomaly according to the fitted OneClassSVM model. NaN values are not supported.
- Parameters:
- df_fea: pandas DataFrame
- oneClass_SVM: OneClassSVM object
- Returns:
- hasGoodDQ: pandas Series
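A minimal sketch of this kind of prediction, using scikit-learn's OneClassSVM directly on synthetic data. The kernel settings, the NaN check, and the mapping of scikit-learn's +1 (inlier) label to a True value in hasGoodDQ are assumptions made for illustration; the reference above does not state how hasGoodDQ is encoded.

```python
import numpy as np
import pandas as pd
from sklearn.svm import OneClassSVM

# Synthetic feature frame (illustrative only).
rng = np.random.default_rng(0)
df_fea = pd.DataFrame(rng.normal(size=(200, 2)), columns=["f1", "f2"])

# NaN values are not supported, so reject them up front.
if df_fea.isna().any().any():
    raise ValueError("df_fea must not contain NaN values")

# Kernel, nu and gamma are illustrative choices, not TSCC defaults.
svm = OneClassSVM(kernel="rbf", nu=0.05, gamma="scale").fit(df_fea)

# scikit-learn returns +1 for inliers and -1 for outliers.
raw = svm.predict(df_fea)

# Assumed encoding: True means the observation looks like good data quality.
hasGoodDQ = pd.Series(raw == 1, index=df_fea.index, name="hasGoodDQ")
```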
- reduce_training_samples(df_fea, gt_series_num, config)
Reduces the number of training samples in the dataframe based on specific conditions and performs undersampling.
This function performs undersampling on the input dataframe based on the target values (gt_series_num) and sensor type specified in the config. The function supports precipitation-type data by undersampling within specified ranges of ground truth values and target detection. The number of samples is reduced to match a predefined threshold.
If the size of the filtered dataframe still exceeds the threshold, the function reduces it further by random sampling or concatenating a small subset of the original dataframe.
- Parameters:
- df_fea: pandas DataFrame
The input features for training.
- gt_series_num: pandas Series or ndarray
The target ground truth values (used for filtering).
- config: object
Configuration object that includes sensor type and column names.
- Returns:
- df_filtered: pandas DataFrame
The dataframe with reduced samples, after applying undersampling based on conditions.
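The general idea of capping the training set at a row budget can be sketched with pandas alone. The helper below is hypothetical and much simpler than the method described above: it ignores sensor type and value ranges, treats non-zero ground truth values as the rare class (an assumption), keeps those rows, and fills the remaining budget with a random sample of the other rows.

```python
import numpy as np
import pandas as pd


def cap_training_samples(df_fea, gt_series_num, threshold_nrObs=10000, seed=0):
    """Hypothetical helper: undersample rows with gt == 0 to meet a row budget."""
    if len(df_fea) <= threshold_nrObs:
        return df_fea  # already within budget, nothing to do

    # Align the ground truth with the feature index (accepts Series or ndarray).
    gt = pd.Series(np.asarray(gt_series_num), index=df_fea.index)
    rare = df_fea[gt != 0]
    common = df_fea[gt == 0]

    # If the rare class alone exceeds the budget, sample from it directly.
    if len(rare) >= threshold_nrObs:
        return rare.sample(n=threshold_nrObs, random_state=seed).sort_index()

    # Otherwise keep every rare row and fill the rest with random common rows.
    n_common = min(threshold_nrObs - len(rare), len(common))
    kept = pd.concat([rare, common.sample(n=n_common, random_state=seed)])
    return kept.sort_index()


# Synthetic example: 50,000 rows, roughly 2 % of them flagged in the ground truth.
rng = np.random.default_rng(0)
df_fea = pd.DataFrame({"value": rng.normal(size=50_000)})
gt_series_num = pd.Series((rng.random(50_000) < 0.02).astype(int))

df_filtered = cap_training_samples(df_fea, gt_series_num, threshold_nrObs=10_000)
```

Sorting by index after sampling keeps the reduced frame in its original order, which is usually convenient for time series features.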
- class TSCC.detection.ml.ML_byRF
A class for implementing a Random Forest Classifier for classification tasks.
This class utilizes the RandomForestClassifier from the scikit-learn library to model the relationship between input features and target class labels. It provides methods for training the model, making predictions, and saving/loading the model.
- Parameters:
- n_estimators: int, optional, default=100
The number of trees in the forest. A higher number can improve performance but increases computation time.
- random_state: int, RandomState instance or None, optional, default=None
Controls the randomness of the estimator for reproducibility.
Methods
fit(df_fea, df_tar, config):
Fits the RandomForestClassifier model to the training data.
predict(df_fea):
Predicts class labels for the given input feature data.
save_model(path):
Saves the trained model to a specified file path.
load_model(path):
Loads a previously saved model from a specified file path.
- __init__(n_estimators=100, random_state=None)
Initializes the ML_byRF class with a RandomForestClassifier model.
- fit(df_fea, df_tar, config=None)
Trains the RandomForestClassifier model using the provided feature and target data.
This method fits the Random Forest model to the training dataset, enabling it to learn the relationship between the input features and their corresponding target class labels. Note that the function does not handle NaN values, so the input data must be preprocessed to ensure it is free of missing values.
- Parameters:
- df_fea: pandas DataFrame
A DataFrame containing the input features for training. Each column represents a feature, and each row represents a training sample.
- df_tar: pandas DataFrame or Series
The target values (class labels) corresponding to the input features. This should be a one-dimensional array-like structure where each entry matches the class label for the respective training sample in df_fea.
- config: object, optional
A configuration object that can be used to specify additional parameters for model fitting, although it is not utilized in this method.
- Returns:
- None
The fitted model is stored within the instance, allowing for subsequent predictions.
- load_model(path)
Load the model from a file.
- Parameters:
- path: str
The path to the model file.
- predict(df_fea)
Predicts the class labels for the given input samples using the trained RandomForestClassifier model.
This method takes a DataFrame of input features and uses the fitted model to generate predictions for each sample. The predictions indicate whether each sample is classified as an outlier or not. The method does not support NaN values, so ensure that the input data is clean before calling this function.
- Parameters:
- df_fea: pandas DataFrame
A DataFrame containing the input features for which predictions are to be made. Each column represents a feature, and each row corresponds to a sample.
- Returns:
- pandas Series
A Series containing the predicted class labels for each input sample, with the same index as the input DataFrame. The values are converted to floats for consistency.
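The training and prediction behaviour described for ML_byRF can be approximated with scikit-learn's RandomForestClassifier directly. The sketch below is an illustration under assumptions: the time-indexed synthetic data, the column names, and the labelling rule are made up, and only the documented properties (no NaN values, float output, same index as the input) are reproduced.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Synthetic daily series with three features (illustrative only).
rng = np.random.default_rng(0)
index = pd.date_range("2024-01-01", periods=500, freq="D")
df_fea = pd.DataFrame(rng.normal(size=(500, 3)), index=index, columns=["x1", "x2", "x3"])

# Made-up labelling rule: 1 marks an outlier, 0 a regular observation.
df_tar = (df_fea["x1"].abs() > 1.5).astype(int)

# NaN values are not handled, so the input must be clean before fitting.
if df_fea.isna().any().any():
    raise ValueError("df_fea must not contain NaN values")

rf = RandomForestClassifier(n_estimators=100, random_state=0)
rf.fit(df_fea, df_tar)

# Predicted labels as a float Series that shares the input index.
pred = pd.Series(rf.predict(df_fea), index=df_fea.index).astype(float)
```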