Machine learning error detection

class TSCC.detection.ml.ML_byIsolationForest

A class that encapsulates the functionality of an Isolation Forest model for detecting anomalies and provides methods for fitting the model, predicting outliers, and saving/loading the trained model.

The class is designed for unsupervised anomaly detection in time series or tabular data. It uses the Isolation Forest algorithm, which isolates observations by recursively partitioning the data with random splits across an ensemble of trees; anomalies tend to be isolated in fewer splits than normal points. Model parameters such as the number of estimators and the contamination rate are configurable.
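The isolation idea can be illustrated without any library. The following is a minimal one-dimensional sketch, not the implementation used by this class: `isolation_depth` and `mean_depth` are hypothetical helpers, and the data is synthetic. It counts how many random splits are needed until a given value is alone in its partition.

```python
import random

def isolation_depth(values, target, rng, max_depth=50):
    """Count the random splits needed until `target` is alone in its partition."""
    points = list(values)
    depth = 0
    while len(points) > 1 and depth < max_depth:
        lo, hi = min(points), max(points)
        if lo == hi:                      # identical values cannot be separated
            break
        split = rng.uniform(lo, hi)
        # Keep only the side of the split that still contains the target.
        if target < split:
            points = [p for p in points if p < split]
        else:
            points = [p for p in points if p >= split]
        depth += 1
    return depth

rng = random.Random(0)
inliers = [rng.gauss(0.0, 1.0) for _ in range(200)]
outlier = 15.0                            # synthetic, clearly separated value
data = inliers + [outlier]
typical = sorted(inliers)[len(inliers) // 2]   # a value in the middle of the cluster

def mean_depth(target, trials=200):
    """Average isolation depth over several random partitionings."""
    return sum(isolation_depth(data, target, rng) for _ in range(trials)) / trials

depth_outlier = mean_depth(outlier)
depth_typical = mean_depth(typical)
print(f"outlier: {depth_outlier:.2f} splits, typical point: {depth_typical:.2f} splits")
```

The outlier needs far fewer splits on average than a point inside the cluster, which is the signal an Isolation Forest turns into an anomaly score.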

Parameters:
n_estimators : int, default=100

The number of base estimators (trees) in the Isolation Forest ensemble.

max_samples : int, float, or 'auto', default='auto'

The number of samples to draw from the input data to train each base estimator.

contamination : 'auto' or float, default='auto'

The expected proportion of outliers in the dataset. If 'auto', the decision threshold is chosen automatically; in scikit-learn's IsolationForest it is set as in the original Isolation Forest paper.

random_state : int, RandomState instance, or None, default=None

Controls the randomness of the algorithm for reproducibility.

Attributes:
model : IsolationForest

The underlying Isolation Forest model used for anomaly detection.

Methods

fit(df_fea, df_tar, config)

Fits the Isolation Forest model to the provided feature data.

predict(df_fea)

Predicts whether each observation in the input dataframe is an anomaly.

save_model(path)

Saves the trained Isolation Forest model to a specified file path.

load_model(path)

Loads a previously saved Isolation Forest model from a file.

__init__(n_estimators=100, max_samples='auto', contamination='auto', random_state=None)

Initialize the IsolationForest model.

fit(df_fea, df_tar=None, config=None)

Fits the Isolation Forest model to the input feature data.

This method trains the Isolation Forest model on the provided dataset so that it can later distinguish normal observations from anomalies. The input data must not contain NaN values, because the model cannot handle missing data.

Parameters:
df_fea : pandas DataFrame or pandas Series

The feature data on which to train the model. This data must not contain any missing values (NaN).

df_tar : pandas DataFrame or pandas Series, optional

Target data. Not used in unsupervised anomaly detection; accepted only for API consistency. Default is None.

config : object, optional

Configuration object. Not used in unsupervised anomaly detection; accepted only for API consistency. Default is None.

Returns:
None

This method fits the Isolation Forest model and updates the model’s internal state.
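As an illustration of the NaN requirement, the following sketch validates the features before fitting scikit-learn's `IsolationForest` directly. It assumes the class wraps that estimator; `fit_isolation_forest` is a hypothetical helper, not part of TSCC, and the data is synthetic.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import IsolationForest

def fit_isolation_forest(df_fea, n_estimators=100, max_samples="auto",
                         contamination="auto", random_state=None):
    """Hypothetical helper: reject NaN input, then fit an IsolationForest."""
    if isinstance(df_fea, pd.Series):
        df_fea = df_fea.to_frame()        # a Series becomes a one-column frame
    if df_fea.isna().any().any():
        raise ValueError("df_fea contains NaN values; impute or drop them first")
    model = IsolationForest(n_estimators=n_estimators, max_samples=max_samples,
                            contamination=contamination, random_state=random_state)
    model.fit(df_fea)
    return model

rng = np.random.default_rng(0)
df_fea = pd.DataFrame({"value": rng.normal(size=200)})
model = fit_isolation_forest(df_fea, random_state=0)

# The same call with a missing value is rejected before training starts.
df_bad = df_fea.copy()
df_bad.iloc[0, 0] = np.nan
try:
    fit_isolation_forest(df_bad)
    nan_rejected = False
except ValueError:
    nan_rejected = True
```

Failing early with an explicit message is a design choice of this sketch; how the class itself reacts to NaN input is not stated in this reference.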

load_model(path)

Load the model from a file.

Parameters:
path : str

The path to the model file.

predict(df_fea)

Predicts, using the fitted Isolation Forest model, whether each observation in the provided dataframe is an anomaly. NaN values are not supported.

Parameters:
df_fea : pandas DataFrame or pandas Series

The feature data to evaluate. Must not contain NaN values.

Returns:
hasGoodDQ : pandas Series
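
How the returned Series relates to the raw model output is not spelled out here. The sketch below shows one plausible mapping, assuming that `hasGoodDQ` is True for inliers and that the underlying estimator is scikit-learn's `IsolationForest`; both are assumptions, and the data is synthetic.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)
values = np.append(rng.normal(size=200), 25.0)   # last point is an obvious outlier
index = pd.date_range("2024-01-01", periods=len(values), freq="D")
df_fea = pd.DataFrame({"value": values}, index=index)

model = IsolationForest(n_estimators=100, random_state=0).fit(df_fea)

# scikit-learn returns +1 for inliers and -1 for outliers.
labels = model.predict(df_fea)

# Assumed convention: True means the observation passed the check.
hasGoodDQ = pd.Series(labels == 1, index=df_fea.index, name="hasGoodDQ")
print(hasGoodDQ.tail(3))
```

Keeping the input index on the result lets the flags be joined back onto the original time series.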

save_model(path)

Save the model to a file.

Parameters:
path : str

The path to save the model file.
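
The serialization format used by `save_model` and `load_model` is not documented here. As a rough sketch of the round trip, the following uses Python's standard `pickle` module with a scikit-learn estimator; the module-level `save_model` and `load_model` functions are hypothetical stand-ins, not the TSCC methods.

```python
import os
import pickle
import tempfile

import numpy as np
from sklearn.ensemble import IsolationForest

X = np.random.default_rng(0).normal(size=(100, 2))
model = IsolationForest(random_state=0).fit(X)

def save_model(model, path):
    """Hypothetical stand-in: write the fitted estimator to `path`."""
    with open(path, "wb") as f:
        pickle.dump(model, f)

def load_model(path):
    """Hypothetical stand-in: read a fitted estimator back from `path`."""
    with open(path, "rb") as f:
        return pickle.load(f)

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "isolation_forest.pkl")
    save_model(model, path)
    restored = load_model(path)

# A restored model should reproduce the original predictions.
same_predictions = bool((restored.predict(X) == model.predict(X)).all())
```

Pickle files can execute arbitrary code when loaded, so only model files from a trusted source should be opened this way.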

class TSCC.detection.ml.ML_byMLP

A class for building and using a Multi-Layer Perceptron (MLP) classifier for supervised machine learning tasks.

This class utilizes the MLPClassifier from the scikit-learn library to model data using a feedforward neural network. The architecture of the network, including the number of hidden layers and the activation function, can be customized through the constructor parameters.

Parameters:
hidden_layer_sizes : tuple, optional, default=(100,)

Defines the number of neurons in each hidden layer. The length of the tuple indicates the number of hidden layers.

activation : {'identity', 'logistic', 'tanh', 'relu'}, optional, default='relu'

The activation function for the hidden layers.

solver : {'lbfgs', 'sgd', 'adam'}, optional, default='adam'

The algorithm used for weight optimization.

alpha : float, optional, default=0.0001

Strength of the L2 regularization term, which helps prevent overfitting.

batch_size : int or 'auto', optional, default='auto'

The size of minibatches for stochastic optimizers.

learning_rate : {'constant', 'invscaling', 'adaptive'}, optional, default='constant'

The learning rate schedule for weight updates.

learning_rate_init : float, optional, default=0.001

The initial learning rate.

max_iter : int, optional, default=200

The maximum number of iterations for training.

random_state : int, RandomState instance or None, optional, default=None

Controls the random seed for reproducibility.

Methods

fit(df_fea, df_tar, config)

Trains the MLP model on the provided feature data and target labels.

predict(df_fea)

Predicts class labels for new input samples based on the trained model.

save_model(path)

Saves the trained model to a specified file path.

load_model(path)

Loads a trained model from a specified file path.

__init__(hidden_layer_sizes=(100,), activation='relu', solver='adam', alpha=0.0001, batch_size='auto', learning_rate='constant', learning_rate_init=0.001, max_iter=200, random_state=None)

Initialize the MLPClassifier model.
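
To make the meaning of `hidden_layer_sizes` concrete, the sketch below configures scikit-learn's `MLPClassifier` directly with two hidden layers and inspects the resulting weight matrices. The remaining arguments use the defaults listed above; the data and labels are synthetic, and with `max_iter=200` scikit-learn may emit a convergence warning.

```python
import numpy as np
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))                  # 3 input features
y = (X[:, 0] + X[:, 1] > 0).astype(int)        # synthetic binary labels

clf = MLPClassifier(hidden_layer_sizes=(16, 8), activation="relu", solver="adam",
                    alpha=0.0001, batch_size="auto", learning_rate="constant",
                    learning_rate_init=0.001, max_iter=200, random_state=0)
clf.fit(X, y)

# One weight matrix per connection between consecutive layers:
# input -> 16 neurons -> 8 neurons -> single output unit (binary case).
weight_shapes = [w.shape for w in clf.coefs_]
print(weight_shapes)
```

A tuple of length two therefore yields two hidden layers and three weight matrices.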

fit(df_fea, df_tar, config=None)

Trains the Multi-Layer Perceptron (MLP) classifier using the provided feature data and target labels.

Parameters:
df_fea : pandas DataFrame

The input feature data used for training the model, where each row represents a sample and each column represents a feature.

df_tar : pandas Series or array-like

The target labels corresponding to the input feature data. Each value indicates the class label for the respective sample.

config : object, optional

Configuration object for any additional settings (not currently used in this method).

load_model(path)

Load the model from a file.

Parameters:
path : str

The path to the model file.

predict(df_fea)

Predicts class labels for the given input feature data using the trained MLP classifier.

This method uses the fitted MLP model stored on the class instance to generate predictions.

Parameters:
df_fea : pandas DataFrame

The input feature data for which predictions are to be made. Each row should represent a sample, and each column should represent a feature.

Returns:
pandas Series

A Series containing the predicted class labels for each input sample. The values are cast to float for compatibility with other numerical operations.
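
The float cast described above can be sketched as follows, assuming the predictions are wrapped in a Series that reuses the input index. The code calls scikit-learn's `MLPClassifier` directly on synthetic data and is not the class's own implementation.

```python
import numpy as np
import pandas as pd
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(1)
df_fea = pd.DataFrame(rng.normal(size=(150, 2)), columns=["f1", "f2"],
                      index=pd.RangeIndex(start=1000, stop=1150))
df_tar = (df_fea["f1"] > 0).astype(int)        # synthetic integer class labels

clf = MLPClassifier(hidden_layer_sizes=(10,), max_iter=500, random_state=0)
clf.fit(df_fea, df_tar)

# Wrap the raw integer labels in a Series and cast them to float.
pred = pd.Series(clf.predict(df_fea), index=df_fea.index).astype(float)
print(pred.head())
```

Float labels can hold NaN, which is convenient when predictions are later reindexed against rows that were excluded from prediction.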

save_model(path)

Save the model to a file.

Parameters:
path : str

The path to save the model file.

class TSCC.detection.ml.ML_byOneClassSVM

A class that encapsulates the functionality of a OneClassSVM model for anomaly detection and provides methods for reducing training samples, fitting the model, and making predictions.

The class is designed to handle large datasets and includes functionality for reducing the number of training samples by undersampling based on conditions such as sensor type and the distribution of target values. It uses a OneClassSVM model to detect anomalies in time series data.

Parameters:
threshold_nrObs : int, default=10000

The maximum number of observations to be used for training. If the dataset exceeds this threshold, undersampling or random sampling is applied.

Attributes:
model : OneClassSVM

The OneClassSVM model used for anomaly detection.

Methods

reduce_training_samples(df_fea, gt_series_num, config)

Reduces the number of training samples based on ground truth values, sensor type, and predefined thresholds.

fit(df_fea, df_tar, config)

Fits the OneClassSVM model to the input features.

predict(df_fea)

Predicts whether each observation in the input dataframe is an anomaly.

save_model(path)

Saves the trained OneClassSVM model to a specified path.

load_model(path)

Loads a previously saved OneClassSVM model from a file.

__init__(threshold_nrObs=10000)

Initializes the ML_byOneClassSVM class with a default threshold for the number of observations.

fit(df_fea, df_tar=None, config=None)

Fits the OneClassSVM model to the provided dataframe.

Parameters:
df_fea : pandas DataFrame

The input features for training.

load_model(path)

Load the model from a file.

Parameters:
path : str

The path to the model file.

predict(df_fea)

Predicts, using the fitted OneClassSVM model, whether each observation in the provided dataframe is an anomaly. NaN values are not supported.

Parameters:
df_fea : pandas DataFrame

The input features to evaluate. Must not contain NaN values.

Returns:
hasGoodDQ : pandas Series

reduce_training_samples(df_fea, gt_series_num, config)

Reduces the number of training samples in the dataframe based on specific conditions and performs undersampling.

This function performs undersampling on the input dataframe based on the target values (gt_series_num) and sensor type specified in the config. The function supports precipitation-type data by undersampling within specified ranges of ground truth values and target detection. The number of samples is reduced to match a predefined threshold.

If the size of the filtered dataframe still exceeds the threshold, the function reduces it further by random sampling or concatenating a small subset of the original dataframe.

Parameters:
df_fea : pandas DataFrame

The input features for training.

gt_series_num : pandas Series or ndarray

The target ground truth values (used for filtering).

config : object

Configuration object that includes sensor type and column names.

Returns:
df_filtered : pandas DataFrame

The dataframe with reduced samples, after applying undersampling based on conditions.
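
The exact undersampling rules (value ranges, sensor-type handling) are not given in this reference. The following is only a generic sketch of capping a training set at a threshold while balancing label groups; `undersample_to_threshold` is a hypothetical helper and does not reproduce the logic of `reduce_training_samples`.

```python
import numpy as np
import pandas as pd

def undersample_to_threshold(df_fea, gt_series, threshold, random_state=0):
    """Hypothetical helper: cap the row count at `threshold`, balancing label groups."""
    if len(df_fea) <= threshold:
        return df_fea                          # already small enough
    labels = np.asarray(gt_series)
    groups = [g for _, g in df_fea.groupby(labels)]
    per_group = threshold // len(groups)       # equal budget per label value
    parts = [g.sample(n=min(len(g), per_group), random_state=random_state)
             for g in groups]
    return pd.concat(parts).sort_index()

# Synthetic example: 950 rows labelled 0 and 50 rows labelled 1.
rng = np.random.default_rng(0)
df_fea = pd.DataFrame({"value": rng.normal(size=1000)})
gt = pd.Series([0] * 950 + [1] * 50)

df_small = undersample_to_threshold(df_fea, gt, threshold=200)
print(len(df_small))
```

With a budget of 100 rows per label, the majority group is cut to 100 rows while all 50 minority rows are kept, so the result stays below the threshold.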

save_model(path)

Save the model to a file.

Parameters:
path : str

The path to save the model file.

class TSCC.detection.ml.ML_byRF

A class for implementing a Random Forest Classifier for classification tasks.

This class utilizes the RandomForestClassifier from the scikit-learn library to model the relationship between input features and target class labels. It provides methods for training the model, making predictions, and saving/loading the model.

Parameters:
n_estimators : int, optional, default=100

The number of trees in the forest. A higher number can improve performance but increases computation time.

random_state : int, RandomState instance or None, optional, default=None

Controls the randomness of the estimator for reproducibility.

Methods

fit(df_fea, df_tar, config)

Fits the RandomForestClassifier model to the training data.

predict(df_fea)

Predicts class labels for the given input feature data.

save_model(path)

Saves the trained model to a specified file path.

load_model(path)

Loads a previously saved model from a specified file path.

__init__(n_estimators=100, random_state=None)

Initializes the ML_byRF class with a RandomForestClassifier model.

fit(df_fea, df_tar, config=None)

Trains the RandomForestClassifier model using the provided feature and target data.

This method fits the Random Forest model to the training dataset, enabling it to learn the relationship between the input features and their corresponding target class labels. Note that the function does not handle NaN values, so the input data must be preprocessed to ensure it is free of missing values.

Parameters:
df_fea : pandas DataFrame

A DataFrame containing the input features for training. Each column represents a feature, and each row represents a training sample.

df_tar : pandas DataFrame or Series

The target values (class labels) corresponding to the input features. This should be a one-dimensional array-like structure where each entry matches the class label for the respective training sample in df_fea.

config : object, optional

A configuration object that can be used to specify additional parameters for model fitting, although it is not utilized in this method.

Returns:
None

The fitted model is stored within the instance, allowing for subsequent predictions.
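
Because the method does not handle NaN values, the rows with missing data have to be removed (or imputed) beforehand, consistently for features and target. A minimal preprocessing sketch, using scikit-learn's `RandomForestClassifier` directly on synthetic data, might look like this:

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
df_fea = pd.DataFrame(rng.normal(size=(300, 2)), columns=["f1", "f2"])
df_tar = (df_fea["f1"] + df_fea["f2"] > 0).astype(int)   # synthetic class labels
df_fea.iloc[::50, 0] = np.nan       # inject a few missing values for the example

# Keep only the rows that are complete in both features and target,
# so that samples and labels stay aligned.
complete = df_fea.notna().all(axis=1) & df_tar.notna()
X_train, y_train = df_fea[complete], df_tar[complete]

model = RandomForestClassifier(n_estimators=100, random_state=0)
model.fit(X_train, y_train)
print(f"trained on {len(X_train)} of {len(df_fea)} rows")
```

Dropping rows is the simplest option; whether imputation is more appropriate depends on how much data would be lost.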

load_model(path)

Load the model from a file.

Parameters:
path : str

The path to the model file.

predict(df_fea)

Predicts the class labels for the given input samples using the trained RandomForestClassifier model.

This method takes a DataFrame of input features and uses the fitted model to generate predictions for each sample. The predictions indicate whether each sample is classified as an outlier or not. The method does not support NaN values, so ensure that the input data is clean before calling this function.

Parameters:
df_fea : pandas DataFrame

A DataFrame containing the input features for which predictions are to be made. Each column represents a feature, and each row corresponds to a sample.

Returns:
pandas Series

A Series containing the predicted class labels for each input sample, with the same index as the input DataFrame. The values are converted to floats for consistency.

save_model(path)

Save the model to a file.

Parameters:
path : str

The path to save the model file.