Statistical error detection

TSCC.detection.statistical.STAT_byCorrelationAnalysis(df, target_col, correlation_threshold=0.2)

Check the consistency of the target variable (e.g., rainfall) by comparing it with other variables.

Parameters:

df (pandas dataframe): The dataframe containing the target variable and other meteorological variables.
target_col (str): The target variable (e.g., ‘rainfall’) to be validated.
correlation_threshold (float): Minimum acceptable correlation between the target variable and the other variables. The default is 0.2.

Returns:

pandas series: A series named ‘consistency_error’ that flags consistency errors (True if error, False otherwise).

Examples

>>> import numpy as np
>>> import pandas as pd
>>> import TSCC
>>> nr_obs = 6
>>> np.random.seed(0)
>>> # Generate random data
>>> data = np.random.randn(nr_obs, 4)
>>> time_index = pd.date_range(start='2023-01-01 00:00', periods=nr_obs, freq='30T')
>>> # Create DataFrame with specified column names
>>> df = pd.DataFrame(data, columns=["ground_truth", "raw", 'fea_1', 'fea_2'], index = time_index)
>>> df["raw"] = df["ground_truth"] + np.random.normal(0, 5, nr_obs)*np.random.randint(0, 2, nr_obs)
>>> df["isCorrect_gt"] = df["ground_truth"] == df["raw"]
>>> TSCC.detection.STAT_byCorrelationAnalysis(df, "raw")
2023-01-01 00:00:00    0.5
2023-01-01 00:30:00    0.0
2023-01-01 01:00:00    0.0
2023-01-01 01:30:00    0.0
2023-01-01 02:00:00    0.0
2023-01-01 02:30:00    0.5
Freq: 30T, Name: consistency_error, dtype: float64
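
The general idea behind a correlation-based consistency check can be sketched with plain pandas. This is a minimal illustration, not TSCC's implementation: the helper name weakly_correlated_columns and the demo data are hypothetical, and the sketch does not cover how TSCC turns correlations into the per-row consistency_error values shown above.

```python
import numpy as np
import pandas as pd


def weakly_correlated_columns(df, target_col, correlation_threshold=0.2):
    """Hypothetical helper: list the numeric columns whose absolute Pearson
    correlation with target_col is below correlation_threshold."""
    numeric = df.select_dtypes(include="number")
    # Correlation of every numeric column with the target, target itself removed.
    corr = numeric.corr()[target_col].drop(target_col)
    return corr[corr.abs() < correlation_threshold].index.tolist()


x = np.arange(50, dtype=float)
demo = pd.DataFrame({
    "raw": x,
    "fea_1": 2.0 * x + 1.0,             # linear in "raw", strongly correlated
    "fea_2": np.tile([1.0, -1.0], 25),  # alternating sign, almost uncorrelated
})

weak = weakly_correlated_columns(demo, "raw")
print(weak)
```

Here fea_1 tracks the target closely, while fea_2 alternates in sign and falls below the 0.2 threshold, so only fea_2 is reported.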

TSCC.detection.statistical.STAT_byDistFromCenter(series, eps, center_measure='median', dynamic_window=None)

Check whether each observation lies within a reasonable distance of the center. The distance is defined by eps, or by a dynamic epsilon (“Epsilon Pro”) when dynamic_window is set.

Parameters:

series (pandas series)
eps (float): Threshold (epsilon) for the acceptable distance from the center; must be a positive value (> 0).
center_measure (string, optional): One of [‘median’, ‘mean’]. The default is ‘median’.
dynamic_window (int, optional): Window size for the Epsilon Pro version, in which the epsilon threshold is dynamic according to the window frame. The default is None.

Returns:

s (pandas series): Float values; TRUE (1.0) if the value is regarded as correct by this method, FALSE (0.0) otherwise.

Examples

>>> import numpy as np
>>> import pandas as pd
>>> import TSCC
>>> time_index = pd.date_range(start='2023-01-01 00:00', periods=7, freq='30T')
>>> s = pd.Series([np.nan, 1, 2.0, "string", 5.4, -999,  None], index = time_index)
>>> TSCC.detection.STAT_byDistFromCenter(s, 3)
2023-01-01 00:00:00    0.0
2023-01-01 00:30:00    0.0
2023-01-01 01:00:00    0.0
2023-01-01 01:30:00    0.0
2023-01-01 02:00:00    0.5
2023-01-01 02:30:00    0.5
2023-01-01 03:00:00    0.0
Freq: 30T, dtype: float64
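
A static distance-from-center check can be approximated as follows. This sketch assumes that non-numeric entries are coerced to NaN and treated as incorrect. The helper within_eps_of_center is hypothetical and returns booleans rather than the float flags TSCC produces, so its output is not expected to match the example above.

```python
import numpy as np
import pandas as pd


def within_eps_of_center(series, eps, center_measure="median"):
    """Hypothetical helper: True where a value lies within eps of the
    series center (median or mean); NaN and non-numeric entries give False."""
    values = pd.to_numeric(series, errors="coerce")  # "string", None -> NaN
    center = values.median() if center_measure == "median" else values.mean()
    # Comparisons involving NaN evaluate to False.
    return (values - center).abs() <= eps


s = pd.Series([np.nan, 1, 2.0, "string", 5.4, -999, None])
ok = within_eps_of_center(s, eps=3)
print(ok.tolist())
```

The median of the four numeric values is 1.5, so 1 and 2.0 fall within eps=3, while 5.4 and -999 do not.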

TSCC.detection.statistical.STAT_byDistFromCenterRolling(series, threshold, window=3)

Check if values are within static distance to rolling center.

Parameters:

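series (pandas series)
threshold (number): Maximum allowed distance from the rolling center. The original docstring marks it as optional with a default of 1, with a note that the author is not entirely sure it is really optional; the signature above defines no default.
window (int, optional): Rolling center according to the window frame. The default is 3.
center (boolean, optional): The default is False.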

Returns:

s (pandas series): Boolean values; TRUE if the value is regarded as correct by this method, FALSE otherwise.

Examples

>>> import numpy as np
>>> import pandas as pd
>>> import TSCC
>>> time_index = pd.date_range(start='2023-01-01 00:00', periods=7, freq='30T')
>>> s = pd.Series([np.nan, 1, 2.0, "string", 5.4, -999,  None], index = time_index)
>>> TSCC.detection.STAT_byDistFromCenterRolling(s, 3, 2)
2023-01-01 00:00:00    0.0
2023-01-01 00:30:00    0.0
2023-01-01 01:00:00    0.0
2023-01-01 01:30:00    0.0
2023-01-01 02:00:00    0.0
2023-01-01 02:30:00    0.5
2023-01-01 03:00:00    0.0
Freq: 30T, dtype: float64
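
The rolling variant compares each value with a center computed over a moving window instead of the whole series. The sketch below assumes a trailing window and a rolling median; both are assumptions, as is the helper name within_threshold_of_rolling_median, and TSCC may define the window or the center differently.

```python
import pandas as pd


def within_threshold_of_rolling_median(series, threshold, window=3):
    """Hypothetical helper: True where a value lies within `threshold`
    of the trailing rolling median over `window` observations."""
    values = pd.to_numeric(series, errors="coerce")
    # min_periods=1 lets the first observations use a shorter window.
    rolling_center = values.rolling(window, min_periods=1).median()
    return (values - rolling_center).abs() <= threshold


s = pd.Series([1.0, 1.2, 0.9, 15.0, 1.1, 1.0])
ok = within_threshold_of_rolling_median(s, threshold=3, window=3)
print(ok.tolist())
```

Only the spike at 15.0 is far from its local median (1.2); the values after it are still accepted because the median of each window is not pulled up by a single extreme value.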

TSCC.detection.statistical.STAT_byIQR(series, lo=0.25, up=0.75, k=1.5)

Check if values are in the scaled interquantile range.

Parameters:

series (pandas series)
lo (number, optional): Lower quantile, given as a fraction in the range [0, 1]. The default is 0.25.
up (number, optional): Upper quantile, given as a fraction in the range [0, 1]. The default is 0.75.
k (number, optional): Scaling factor applied to the regular IQR to determine outliers. The default is the typical value 1.5.

Returns:

s (pandas series): Boolean values; TRUE if the value is regarded as correct by this method, FALSE otherwise.

Examples

>>> import numpy as np
>>> import pandas as pd
>>> import TSCC
>>> time_index = pd.date_range(start='2023-01-01 00:00', periods=7, freq='30T')
>>> s = pd.Series([np.nan, 1, 2.0, "string", 5.4, -999,  None], index = time_index)
>>> TSCC.detection.STAT_byIQR(s)
2023-01-01 00:00:00    0.5
2023-01-01 00:30:00    0.0
2023-01-01 01:00:00    0.0
2023-01-01 01:30:00    0.5
2023-01-01 02:00:00    0.0
2023-01-01 02:30:00    0.5
2023-01-01 03:00:00    0.5
Freq: 30T, dtype: float64
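
The scaled interquantile rule accepts values inside [Q_lo - k * (Q_up - Q_lo), Q_up + k * (Q_up - Q_lo)]. A minimal sketch of that rule is shown below; the helper within_scaled_iqr is hypothetical, uses pandas' default linear quantile interpolation, and is not taken from the TSCC source.

```python
import pandas as pd


def within_scaled_iqr(series, lo=0.25, up=0.75, k=1.5):
    """Hypothetical helper: True where a value lies inside the quantile
    range widened by k times its width on both sides."""
    values = pd.to_numeric(series, errors="coerce")
    q_lo, q_up = values.quantile(lo), values.quantile(up)
    spread = q_up - q_lo
    # between() is inclusive on both ends; NaN entries give False.
    return values.between(q_lo - k * spread, q_up + k * spread)


s = pd.Series([1, 2, 3, 4, 5, 100])
ok = within_scaled_iqr(s)
print(ok.tolist())
```

For this series the quartiles are 2.25 and 4.75, so the accepted interval is [-1.5, 8.5] and only 100 is rejected.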

TSCC.detection.statistical.STAT_byIQRdiff(series, lo=0.25, up=0.75, k=1.5)

Check if differences to previous values are in the scaled interquantile range.

Parameters:

series (pandas series)
lo (number, optional): Lower quantile, given as a fraction in the range [0, 1]. The default is 0.25.
up (number, optional): Upper quantile, given as a fraction in the range [0, 1]. The default is 0.75.
k (number, optional): Scaling factor applied to the regular IQR to determine outliers. The default is the typical value 1.5.

Returns:

s (pandas series): Boolean values; TRUE if the value is regarded as correct by this method, FALSE otherwise.

Examples

>>> import numpy as np
>>> import pandas as pd
>>> import TSCC
>>> time_index = pd.date_range(start='2023-01-01 00:00', periods=7, freq='30T')
>>> s = pd.Series([np.nan, 1, 2.0, "string", 5.4, -999,  None], index = time_index)
>>> TSCC.detection.STAT_byIQRdiff(s)
2023-01-01 00:00:00    0.0
2023-01-01 00:30:00    0.5
2023-01-01 01:00:00    0.0
2023-01-01 01:30:00    0.5
2023-01-01 02:00:00    0.5
2023-01-01 02:30:00    0.0
2023-01-01 03:00:00    0.5
Freq: 30T, dtype: float64
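
Applying the same rule to first differences flags abrupt jumps rather than extreme levels. The sketch below is an illustration under stated assumptions: the helper diff_within_scaled_iqr is hypothetical, and treating the first observation (which has no predecessor) as not accepted is a choice made here, not necessarily what TSCC does.

```python
import pandas as pd


def diff_within_scaled_iqr(series, lo=0.25, up=0.75, k=1.5):
    """Hypothetical helper: True where the step from the previous value
    lies inside the scaled quantile range of all steps."""
    values = pd.to_numeric(series, errors="coerce")
    steps = values.diff()  # first entry is NaN (no previous value)
    q_lo, q_up = steps.quantile(lo), steps.quantile(up)
    spread = q_up - q_lo
    return steps.between(q_lo - k * spread, q_up + k * spread)


s = pd.Series([10, 11, 13, 14, 16, 50, 51, 53])
ok = diff_within_scaled_iqr(s)
print(ok.tolist())
```

The steps are 1, 2, 1, 2, 34, 1, 2, giving an accepted interval of [-0.5, 3.5]. Only the jump of 34 is rejected; the values after it are accepted again because the series continues at the new level with ordinary steps.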

TSCC.detection.statistical.STAT_byZScore(series, z=3, b_modified=False, k=0.6745)

Check if values are within a standardized score from the mean. Best for normally distributed data.

Parameters:

series (pandas series)
z (float, optional): Z-score threshold for outliers. Typically, values are considered outliers when abs(z) > 3. The default is 3.
b_modified (boolean, optional): Whether the modified z-score is used. The default is False.
k (float, optional): The default is 0.6745.

Returns:

pandas series: Series of index of outliers or cleaned data of the original series.

Examples

>>> import numpy as np
>>> import pandas as pd
>>> import TSCC
>>> time_index = pd.date_range(start='2023-01-01 00:00', periods=7, freq='30T')
>>> s = pd.Series([np.nan, 1, 2.0, "string", 5.4, -999,  None], index = time_index)
>>> TSCC.detection.STAT_byZScore(s)
2023-01-01 00:00:00    0.0
2023-01-01 00:30:00    0.0
2023-01-01 01:00:00    0.0
2023-01-01 01:30:00    0.0
2023-01-01 02:00:00    0.0
2023-01-01 02:30:00    0.0
2023-01-01 03:00:00    0.0
Freq: 30T, dtype: float64
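
The difference between the ordinary and the modified z-score can be illustrated with a short sketch. It assumes that k is the constant of the common modified z-score, k * (x - median) / MAD, where MAD is the median absolute deviation; this interpretation of k and the helper within_z_score are assumptions, not confirmed by the TSCC documentation.

```python
import pandas as pd


def within_z_score(series, z=3, b_modified=False, k=0.6745):
    """Hypothetical helper: True where the absolute (modified) z-score
    of a value does not exceed z."""
    values = pd.to_numeric(series, errors="coerce")
    if b_modified:
        median = values.median()
        mad = (values - median).abs().median()
        # A MAD of zero would make this division produce inf or NaN.
        scores = k * (values - median) / mad
    else:
        scores = (values - values.mean()) / values.std()
    return scores.abs() <= z


s = pd.Series([1, 2, 3, 4, 5, 100])
standard = within_z_score(s)
modified = within_z_score(s, b_modified=True)
print(standard.tolist())
print(modified.tolist())
```

With only six observations, the single extreme value inflates the mean and standard deviation so much that its ordinary z-score stays at about 2.04 and nothing is rejected. The median-based score is about 43 for the same value, so the modified variant rejects it.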