Configurations and data set handling

Function overview

Configurations

Data quality enhancement procedures are configured through the Config class. It is initialized with the column names of the detection target, the ground truth values, and the raw data, together with further settings such as the sensor type and the event classification column. These settings are used for model development and assessment in a data processing pipeline that focuses on extreme events such as heavy precipitation.

import TSCC

# instance of the class
config = TSCC.assessment.Config(
    colname_target_det="isCorrect_gt",
    colname_target_corr="ground_truth",
    colname_raw="raw",
    df_id_column=None,
    exclude_cols=[],
    colname_isEvent="isHeavyRain",
    sensortype="precipitation",
    cross_validation=False,
)
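
Because the configuration refers to columns by name, it can help to confirm that a DataFrame actually contains those columns before handing it to the pipeline. The following is a minimal sketch in plain pandas; the helper `missing_columns` and the dictionary `expected` are hypothetical illustrations and not part of TSCC.

```python
import pandas as pd

# Column names as passed to Config in the example above.
expected = {
    "colname_target_det": "isCorrect_gt",
    "colname_target_corr": "ground_truth",
    "colname_raw": "raw",
    "colname_isEvent": "isHeavyRain",
}


def missing_columns(df: pd.DataFrame, names: dict) -> list:
    """Return the config keys whose column is absent from df (hypothetical helper)."""
    return [key for key, col in names.items() if col not in df.columns]


# A frame that lacks the event column on purpose.
demo = pd.DataFrame({"ground_truth": [0.0], "raw": [0.1], "isCorrect_gt": [False]})
print(missing_columns(demo, expected))  # ['colname_isEvent']
```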

Data set handling

This section explains how to handle a data set within the TSCC framework. In the example, df is the DataFrame to be quality-checked; it holds synthetic data with columns such as “ground_truth” and “raw” sensor values. config is an instance of TSCC.assessment.Config and holds the settings for the processing pipeline, including the data set handling. The DataSetHandler class is initialized with df and config and divides the data set into training and test sets. If the config parameter cross_validation is False, a single train-test split is used; otherwise, 5 training and test sets are extracted by default. In this example, dataSetHandler.list_len shows that the handler contains one train-test split.
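
To picture what five training and test sets look like, the sketch below builds five folds from row positions with NumPy alone. It is an illustration under assumptions, not TSCC's implementation: the source does not state whether the folds are contiguous, shuffled, or time-ordered, and `kfold_indices` is a hypothetical helper.

```python
import numpy as np


def kfold_indices(n_rows: int, n_splits: int = 5):
    """Yield (train_idx, test_idx) pairs from contiguous, unshuffled folds."""
    folds = np.array_split(np.arange(n_rows), n_splits)
    for i, test_idx in enumerate(folds):
        # Every fold except the i-th one forms the training part.
        train_idx = np.concatenate([f for j, f in enumerate(folds) if j != i])
        yield train_idx, test_idx


splits = list(kfold_indices(10))
print(len(splits))   # 5
print(splits[0][1])  # [0 1]
```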

import pandas as pd
import numpy as np

n = 10
# Set random seed for reproducibility
np.random.seed(0)
# Generate random data
data = np.random.randn(n, 4)
date_range = pd.date_range(start="2024-08-01", periods=n, freq="5min", name="timestamp")
# Create DataFrame with specified column names
df = pd.DataFrame(data, columns=["ground_truth", "raw", "fea_1", "fea_2"], index=date_range)
df["raw"] = df["ground_truth"] + np.random.normal(0, 5, n)*np.random.randint(0, 2, n)
df["isCorrect_gt"] = df["ground_truth"] == df["raw"]
df["isHeavyRain"] = 1

dataSetHandler = TSCC.preprocessing.DataSetHandler(df, config)
>>> dataSetHandler.list_len
1
>>> print(dataSetHandler.get_train_features()[0])
                           raw     fea_1     fea_2  isHeavyRain
timestamp
2024-08-01 00:00:00  -3.478712  0.978738  2.240893            1
2024-08-01 00:05:00  -5.232532  0.950088 -0.151357            1
2024-08-01 00:10:00  -8.634570  0.144044  1.454274            1
2024-08-01 00:15:00  10.514915  0.443863  0.333674            1
2024-08-01 00:20:00  -1.054182  0.313068 -0.854096            1
2024-08-01 00:25:00  -2.552990  0.864436 -0.742165            1
2024-08-01 00:30:00  -3.994222  0.045759 -0.187184            1
2024-08-01 00:35:00   5.420231  0.154947  0.378163            1
>>> print(dataSetHandler.get_test_features()[0])
                          raw     fea_1     fea_2  isHeavyRain
timestamp
2024-08-01 00:40:00 -8.957275 -0.347912  0.156349            1
2024-08-01 00:45:00  0.166589 -0.387327 -0.302303            1
>>> print(dataSetHandler.get_train_targets()[0])
                     isCorrect_gt  ground_truth
timestamp
2024-08-01 00:00:00         False      1.764052
2024-08-01 00:05:00         False      1.867558
2024-08-01 00:10:00         False     -0.103219
2024-08-01 00:15:00         False      0.761038
2024-08-01 00:20:00         False      1.494079
2024-08-01 00:25:00          True     -2.552990
2024-08-01 00:30:00         False      2.269755
2024-08-01 00:35:00         False      1.532779
>>> print(dataSetHandler.get_test_targets()[0])
                      isCorrect_gt  ground_truth
timestamp
2024-08-01 00:40:00         False     -0.887786
2024-08-01 00:45:00         False      1.230291
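
In the output above, the first eight rows form the training set and the last two form the test set, which is consistent with an unshuffled 80/20 split along the time index. A minimal pandas sketch of such a split follows; the 0.2 test fraction is inferred from the printed output rather than from documented TSCC behavior, and `chronological_split` is a hypothetical helper.

```python
import numpy as np
import pandas as pd


def chronological_split(df: pd.DataFrame, test_fraction: float = 0.2):
    """Split a time-indexed frame into a leading train part and a trailing test part."""
    n_test = max(1, int(round(len(df) * test_fraction)))
    return df.iloc[:-n_test], df.iloc[-n_test:]


idx = pd.date_range("2024-08-01", periods=10, freq="5min", name="timestamp")
demo = pd.DataFrame({"raw": np.arange(10.0)}, index=idx)
train, test = chronological_split(demo)
print(len(train), len(test))  # 8 2
```

Keeping the order intact avoids training on observations that lie after the test period, which matters for time series.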

Preprocessing for skewed distributions

This section gives an example of preprocessing for skewed distributions with the techniques available in the TSCC framework. Functions such as undersampling_valrange, SMOTEwithCat, and SMOGN change the distribution of observations in the data set to address imbalance in the targets. In the code below, synthetic data is generated and the SMOTE function is applied to the training features and targets to obtain a more balanced training set. Oversampling the minority class in this way is aimed in particular at rare, extreme events such as heavy precipitation.
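
Before resampling, it is worth checking how skewed the target actually is. The sketch below computes a simple imbalance ratio with pandas on a made-up label series; the data and the variable names are illustrative only and do not come from TSCC.

```python
import pandas as pd

# Made-up detection labels: nine correct values and one erroneous value.
labels = pd.Series([True] * 9 + [False] * 1, name="isCorrect_gt")

counts = labels.value_counts()
# Size of the largest class divided by the size of the smallest class.
imbalance_ratio = counts.max() / counts.min()
print(imbalance_ratio)  # 9.0
```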

import pandas as pd
import numpy as np
import TSCC

# instance of the class
config = TSCC.assessment.Config(
    colname_target_det="isCorrect_gt",
    colname_target_corr="ground_truth",
    colname_raw="raw",
    df_id_column="timestamp",
    exclude_cols=[],
    colname_isEvent="isHeavyRain",
    sensortype='precipitation')

# Set random seed for reproducibility
np.random.seed(0)
# Generate random data
data = np.random.randn(100, 4)
date_range = pd.date_range(start="2024-08-01", periods=100, freq="5min", name="timestamp")
# Create DataFrame with specified column names
df = pd.DataFrame(data, columns=["ground_truth", "raw", "fea_1", "fea_2"], index=date_range)
df["raw"] = df["ground_truth"] + np.random.normal(0, 5, 100)*np.random.randint(0, 2, 100)
df["isCorrect_gt"] = df["ground_truth"] == df["raw"]
df["isHeavyRain"] = 1

dataSetHandler = TSCC.preprocessing.DataSetHandler(df, config)

# Oversample the first (and here only) train split
fea_smote, tar_smote = TSCC.preprocessing.SMOTE(
    dataSetHandler.get_train_features(exclude_columns=config.exclude_cols)[0],
    dataSetHandler.get_train_targets()[0],
    config.colname_target_det,
    config.df_id_column,
    ["isHeavyRain"],
)
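
For readers unfamiliar with the idea behind SMOTE, the following NumPy sketch shows its core step: a synthetic row is placed at a random point on the line between a minority-class row and one of its nearest minority-class neighbours. This is a simplified illustration and not the code behind TSCC.preprocessing.SMOTE; the function `smote_sketch` and its parameters are hypothetical, and it assumes numeric features only and `k` smaller than the number of minority rows.

```python
import numpy as np


def smote_sketch(X_min: np.ndarray, n_new: int, k: int = 3, seed: int = 0) -> np.ndarray:
    """Create n_new synthetic rows by interpolating between minority rows
    and one of their k nearest minority neighbours (hypothetical helper)."""
    rng = np.random.default_rng(seed)
    # Pairwise Euclidean distances within the minority class.
    dist = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=2)
    np.fill_diagonal(dist, np.inf)  # a row is not its own neighbour
    neighbours = np.argsort(dist, axis=1)[:, :k]
    # Pick a base row and one of its neighbours for every synthetic row.
    base = rng.integers(0, len(X_min), size=n_new)
    picked = neighbours[base, rng.integers(0, k, size=n_new)]
    gap = rng.random((n_new, 1))  # interpolation weight in [0, 1)
    return X_min[base] + gap * (X_min[picked] - X_min[base])


X_min = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
synthetic = smote_sketch(X_min, n_new=6)
print(synthetic.shape)  # (6, 2)
```

Because every synthetic row is a convex combination of two existing rows, it stays inside the range already covered by the minority class.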