Statistical error detection

TSCC.detection.statistical.STAT_byCorrelationAnalysis(df, target_col, correlation_threshold=0.2)

Check the consistency of the target variable (e.g., rainfall) by comparing it with other variables.

Parameters:
df : pandas dataframe

Dataframe containing the target variable and the other meteorological variables.

target_col : str

The target variable (e.g., 'rainfall') to be validated.

correlation_threshold : float, optional

Minimum acceptable correlation between the target variable and the other variables. The default is 0.2.

Returns:
pandas series

A series named 'consistency_error' (True if error, False otherwise).

Examples

>>> import numpy as np
>>> import pandas as pd
>>> import TSCC.detection
>>> nr_obs = 6
>>> np.random.seed(0)
>>> # Generate random data
>>> data = np.random.randn(nr_obs, 4)
>>> time_index = pd.date_range(start='2023-01-01 00:00', periods=nr_obs, freq='30T')
>>> # Create DataFrame with specified column names
>>> df = pd.DataFrame(data, columns=["ground_truth", "raw", 'fea_1', 'fea_2'], index = time_index)
>>> df["raw"] = df["ground_truth"] + np.random.normal(0, 5, nr_obs)*np.random.randint(0, 2, nr_obs)
>>> df["isCorrect_gt"] = df["ground_truth"] == df["raw"]
>>> TSCC.detection.STAT_byCorrelationAnalysis(df, "raw")
2023-01-01 00:00:00    0.5
2023-01-01 00:30:00    0.0
2023-01-01 01:00:00    0.0
2023-01-01 01:30:00    0.0
2023-01-01 02:00:00    0.0
2023-01-01 02:30:00    0.5
Freq: 30T, Name: consistency_error, dtype: float64
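
The idea of a minimum acceptable correlation can be illustrated with plain pandas. The following is a minimal sketch, not the TSCC implementation: the helper name `low_correlation_features` and the demo columns are hypothetical, and it only reports which companion variables fall below the threshold rather than producing the per-observation series shown above.

```python
import numpy as np
import pandas as pd


def low_correlation_features(df, target_col, correlation_threshold=0.2):
    """Hypothetical helper: list columns whose |Pearson r| with target_col is below the threshold."""
    numeric = df.select_dtypes(include="number")
    # Correlation of every numeric column with the target, excluding the target itself.
    corr = numeric.corr()[target_col].drop(target_col)
    return corr[corr.abs() < correlation_threshold].index.tolist()


x = np.arange(50, dtype=float)
demo = pd.DataFrame(
    {
        "rainfall": x,
        "humidity": 2.0 * x + 1.0,          # linear in the target, so r = 1
        "noise": np.tile([1.0, -1.0], 25),  # alternating sign, nearly uncorrelated
    }
)

weak = low_correlation_features(demo, "rainfall")
print(weak)
```

Here only the alternating column falls below the default threshold of 0.2, so it is the one reported as inconsistent with the target.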

TSCC.detection.statistical.STAT_byDistFromCenter(series, eps, center_measure='median', dynamic_window=None)

Check if each observation is within a reasonable distance of the center. The distance is defined by eps and, optionally, by dynamic_window.

Parameters:
series : pandas series
eps : float

Threshold (epsilon) for the acceptable distance from the center. Must be greater than 0.

center_measure : string, optional

Measure used as the center, either 'median' or 'mean'. The default is 'median'.

dynamic_window : int, optional

Window size for the "Epsilon Pro" variant, in which the epsilon threshold varies with the window frame. The default is None.

Returns:
s : pandas series

Float values: 1.0 (TRUE) if the value is regarded as correct by this method, 0.0 (FALSE) otherwise.

Examples

>>> time_index = pd.date_range(start='2023-01-01 00:00', periods=7, freq='30T')
>>> s = pd.Series([np.nan, 1, 2.0, "string", 5.4, -999,  None], index = time_index)
>>> TSCC.detection.STAT_byDistFromCenter(s, 3)
2023-01-01 00:00:00    0.0
2023-01-01 00:30:00    0.0
2023-01-01 01:00:00    0.0
2023-01-01 01:30:00    0.0
2023-01-01 02:00:00    0.5
2023-01-01 02:30:00    0.5
2023-01-01 03:00:00    0.0
Freq: 30T, dtype: float64
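
The static variant of this check (no dynamic window) amounts to comparing each value's absolute distance from a global center with eps. The sketch below shows that idea in plain pandas. The helper `within_distance_of_center` is a hypothetical name, and coercing non-numeric entries to NaN is an assumption made for the illustration; it is not taken from the TSCC source.

```python
import pandas as pd


def within_distance_of_center(series, eps, center_measure="median"):
    """Hypothetical helper: True where |value - center| <= eps."""
    if center_measure not in ("median", "mean"):
        raise ValueError("center_measure must be 'median' or 'mean'")
    # Assumption: anything that is not a number is treated as missing.
    values = pd.to_numeric(series, errors="coerce")
    center = values.median() if center_measure == "median" else values.mean()
    # Comparisons with NaN evaluate to False, so missing values are never accepted.
    return (values - center).abs() <= eps


s_demo = pd.Series([1.0, 2.0, 5.4, -999.0])
flags = within_distance_of_center(s_demo, eps=3)
print(flags.tolist())
```

The median of the demo values is 1.5, so 1.0 and 2.0 pass while 5.4 (distance 3.9) and -999.0 do not.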

TSCC.detection.statistical.STAT_byDistFromCenterRolling(series, threshold, window=3)

Check if values are within a static distance of the rolling center.

Parameters:
series : pandas series
threshold : number

Static distance allowed from the rolling center. The signature defines no default, so this argument is required.

window : int, optional

Window size used to compute the rolling center. The default is 3.

center : boolean, optional

The default is False.

Returns:
s : pandas series

Boolean values: TRUE if the value is regarded as correct by this method, FALSE otherwise.

Examples

>>> time_index = pd.date_range(start='2023-01-01 00:00', periods=7, freq='30T')
>>> s = pd.Series([np.nan, 1, 2.0, "string", 5.4, -999,  None], index = time_index)
>>> TSCC.detection.STAT_byDistFromCenterRolling(s, 3, 2)
2023-01-01 00:00:00    0.0
2023-01-01 00:30:00    0.0
2023-01-01 01:00:00    0.0
2023-01-01 01:30:00    0.0
2023-01-01 02:00:00    0.0
2023-01-01 02:30:00    0.5
2023-01-01 03:00:00    0.0
Freq: 30T, dtype: float64
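
A rolling center replaces the single global center with one computed over a moving window, while the allowed distance stays fixed. The sketch below illustrates this with a trailing rolling median. The helper name `within_distance_of_rolling_center`, the choice of the median as center, and `min_periods=1` are all assumptions for the example, not details confirmed by the TSCC source.

```python
import pandas as pd


def within_distance_of_rolling_center(series, threshold, window=3, center=False):
    """Hypothetical helper: True where |value - rolling median| <= threshold."""
    values = pd.to_numeric(series, errors="coerce")
    # min_periods=1 lets the first observations use a shorter window (assumption).
    rolling_center = values.rolling(window, center=center, min_periods=1).median()
    return (values - rolling_center).abs() <= threshold


s_demo = pd.Series([1.0, 2.0, 3.0, 50.0, 4.0, 5.0])
flags = within_distance_of_rolling_center(s_demo, threshold=3, window=3)
print(flags.tolist())
```

With a window of 3, the rolling median at the spike is 3.0, so 50.0 lies 47 away from it and is rejected, while its neighbours remain within the threshold.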

TSCC.detection.statistical.STAT_byIQR(series, lo=0.25, up=0.75, k=1.5)

Check if values are in the scaled interquantile range.

Parameters:
series : pandas series
lo : number, optional

Lower quantile, given as a fraction in the range [0, 1]. The default is 0.25.

up : number, optional

Upper quantile, given as a fraction in the range [0, 1]. The default is 0.75.

k : number, optional

Scaling factor applied to the IQR to determine outliers. The default is the typical value of 1.5.

Returns:
s : pandas series

Boolean values: TRUE if the value is regarded as correct by this method, FALSE otherwise.

Examples

>>> time_index = pd.date_range(start='2023-01-01 00:00', periods=7, freq='30T')
>>> s = pd.Series([np.nan, 1, 2.0, "string", 5.4, -999,  None], index = time_index)
>>> TSCC.detection.STAT_byIQR(s)
2023-01-01 00:00:00    0.5
2023-01-01 00:30:00    0.0
2023-01-01 01:00:00    0.0
2023-01-01 01:30:00    0.5
2023-01-01 02:00:00    0.0
2023-01-01 02:30:00    0.5
2023-01-01 03:00:00    0.5
Freq: 30T, dtype: float64
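
The scaled interquantile range is commonly turned into two fences, `Q_lo - k * IQR` and `Q_up + k * IQR`, with values outside the fences treated as outliers. The following sketch shows that rule in plain pandas. The helper `within_scaled_iqr` is a hypothetical name, and the inclusive fences are an assumption; the TSCC function may differ in its details.

```python
import pandas as pd


def within_scaled_iqr(series, lo=0.25, up=0.75, k=1.5):
    """Hypothetical helper: True where the value lies inside the scaled quantile fences."""
    values = pd.to_numeric(series, errors="coerce")
    q_lo, q_up = values.quantile(lo), values.quantile(up)
    spread = q_up - q_lo
    # Series.between is inclusive on both ends and returns False for NaN.
    return values.between(q_lo - k * spread, q_up + k * spread)


s_demo = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0, 100.0])
flags = within_scaled_iqr(s_demo)
print(flags.tolist())
```

For the demo data the quartiles are 2.25 and 4.75, giving fences at -1.5 and 8.5, so only 100.0 is rejected.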

TSCC.detection.statistical.STAT_byIQRdiff(series, lo=0.25, up=0.75, k=1.5)

Check if the differences from the preceding values lie within the scaled interquantile range.

Parameters:
series : pandas series
lo : number, optional

Lower quantile, given as a fraction in the range [0, 1]. The default is 0.25.

up : number, optional

Upper quantile, given as a fraction in the range [0, 1]. The default is 0.75.

k : number, optional

Scaling factor applied to the IQR to determine outliers. The default is the typical value of 1.5.

Returns:
s : pandas series

Boolean values: TRUE if the value is regarded as correct by this method, FALSE otherwise.

Examples

>>> time_index = pd.date_range(start='2023-01-01 00:00', periods=7, freq='30T')
>>> s = pd.Series([np.nan, 1, 2.0, "string", 5.4, -999,  None], index = time_index)
>>> TSCC.detection.STAT_byIQRdiff(s)
2023-01-01 00:00:00    0.0
2023-01-01 00:30:00    0.5
2023-01-01 01:00:00    0.0
2023-01-01 01:30:00    0.5
2023-01-01 02:00:00    0.5
2023-01-01 02:30:00    0.0
2023-01-01 03:00:00    0.5
Freq: 30T, dtype: float64
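
Applying the fence rule to first differences instead of raw values targets sudden jumps rather than unusual levels. The sketch below is one way to express that. The helper name `within_scaled_iqr_of_diffs` is hypothetical, and treating the first observation (which has no predecessor) as not accepted is an assumption of this example only.

```python
import pandas as pd


def within_scaled_iqr_of_diffs(series, lo=0.25, up=0.75, k=1.5):
    """Hypothetical helper: True where the step from the previous value is inside the fences."""
    values = pd.to_numeric(series, errors="coerce")
    steps = values.diff()  # first entry is NaN because it has no predecessor
    q_lo, q_up = steps.quantile(lo), steps.quantile(up)
    spread = q_up - q_lo
    # NaN steps are reported as False by Series.between.
    return steps.between(q_lo - k * spread, q_up + k * spread)


s_demo = pd.Series([10.0, 11.0, 13.0, 14.0, 16.0, 60.0, 61.0])
flags = within_scaled_iqr_of_diffs(s_demo)
print(flags.tolist())
```

The steps are 1, 2, 1, 2, 44, 1, with quartiles 1 and 2 and fences at -0.5 and 3.5. Only the jump to 60.0 is rejected; the following value 61.0 passes because its own step is small, even though its level is still unusual.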

TSCC.detection.statistical.STAT_byZScore(series, z=3, b_modified=False, k=0.6745)

Check if values are within a standardized score from the mean. Best for normally distributed data.

Parameters:
series : pandas series
z : float, optional

z-score threshold for outliers. Typically, values are considered outliers when abs(z) > 3. The default is 3.

b_modified : boolean, optional

Whether the modified z-score is used. The default is False.

k : float, optional

The default is 0.6745.

Returns:
pandas series

Series of index of outliers or cleaned data of the original series.

Examples

>>> time_index = pd.date_range(start='2023-01-01 00:00', periods=7, freq='30T')
>>> s = pd.Series([np.nan, 1, 2.0, "string", 5.4, -999,  None], index = time_index)
>>> TSCC.detection.STAT_byZScore(s)
2023-01-01 00:00:00    0.0
2023-01-01 00:30:00    0.0
2023-01-01 01:00:00    0.0
2023-01-01 01:30:00    0.0
2023-01-01 02:00:00    0.0
2023-01-01 02:30:00    0.0
2023-01-01 03:00:00    0.0
Freq: 30T, dtype: float64
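
The difference between the standard and the modified z-score can be seen on a small sample. The sketch below uses the common textbook definitions: `(x - mean) / std` for the standard score and `k * (x - median) / MAD` for the modified one, where MAD is the median absolute deviation. The helper `within_z_score` is a hypothetical name, and it is an assumption that TSCC uses `k` in exactly this way.

```python
import pandas as pd


def within_z_score(series, z=3, b_modified=False, k=0.6745):
    """Hypothetical helper: True where the (modified) z-score magnitude is at most z."""
    values = pd.to_numeric(series, errors="coerce")
    if b_modified:
        median = values.median()
        mad = (values - median).abs().median()
        # Note: a MAD of zero would make the scores infinite or NaN.
        scores = k * (values - median) / mad
    else:
        scores = (values - values.mean()) / values.std()
    return scores.abs() <= z


s_demo = pd.Series([10.0, 11.0, 9.0, 10.0, 12.0, 10.0, 11.0, 9.0, 10.0, 100.0])
plain = within_z_score(s_demo)
modified = within_z_score(s_demo, b_modified=True)
print(plain.tolist())
print(modified.tolist())
```

With only ten observations, the sample standard deviation is inflated by the outlier itself, and no standard z-score can exceed 9 / sqrt(10), roughly 2.85, so the standard check accepts every value. The modified score, based on the median and MAD, rejects 100.0.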