Statistical error detection
- TSCC.detection.statistical.STAT_byCorrelationAnalysis(df, target_col, correlation_threshold=0.2)
Check the consistency of the target variable (e.g., rainfall) by comparing it with other variables.
- Parameters:
- df : pandas dataframe
Dataframe containing the target variable and other meteorological variables.
- target_col : str
The target variable (e.g., ‘rainfall’) to be validated.
- correlation_threshold : float
Minimum acceptable correlation between the target variable and other variables. The default is 0.2.
- Returns:
- pandas series
A series named ‘consistency_error’ (True if error, False otherwise).
Examples
>>> import numpy as np
>>> import pandas as pd
>>> import TSCC.detection
>>> nr_obs = 6
>>> np.random.seed(0)
>>> # Generate random data
>>> data = np.random.randn(nr_obs, 4)
>>> time_index = pd.date_range(start='2023-01-01 00:00', periods=nr_obs, freq='30T')
>>> # Create DataFrame with specified column names
>>> df = pd.DataFrame(data, columns=["ground_truth", "raw", 'fea_1', 'fea_2'], index = time_index)
>>> # Perturb a random subset of observations to simulate errors
>>> df["raw"] = df["ground_truth"] + np.random.normal(0, 5, nr_obs)*np.random.randint(0, 2, nr_obs)
>>> df["isCorrect_gt"] = df["ground_truth"] == df["raw"]
>>> TSCC.detection.STAT_byCorrelationAnalysis(df, "raw")
2023-01-01 00:00:00    0.5
2023-01-01 00:30:00    0.0
2023-01-01 01:00:00    0.0
2023-01-01 01:30:00    0.0
2023-01-01 02:00:00    0.0
2023-01-01 02:30:00    0.5
Freq: 30T, Name: consistency_error, dtype: float64
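To illustrate the general idea of a correlation-based consistency check, here is a minimal sketch in plain pandas. It is not TSCC's implementation: the helper name `weakly_correlated_columns` is hypothetical, and the sketch only reports which companion variables correlate too weakly with the target. It does not attempt to reproduce the per-observation output shown above.

```python
import numpy as np
import pandas as pd


def weakly_correlated_columns(df, target_col, correlation_threshold=0.2):
    """Hypothetical helper: absolute Pearson correlation of every other
    numeric column with target_col, plus a mask of those below the threshold."""
    numeric = df.select_dtypes(include="number")
    corr = numeric.corr()[target_col].drop(target_col).abs()
    return corr, corr < correlation_threshold


base = np.arange(10.0)
df = pd.DataFrame({
    "raw": base,
    "fea_1": 2 * base + 1,       # perfectly linear in raw
    "fea_2": (base - 4.5) ** 2,  # U-shaped, no linear relation to raw
})
corr, too_weak = weakly_correlated_columns(df, "raw")
print(too_weak)  # fea_1 is False, fea_2 is True
```

Pearson correlation only captures linear relationships, which is why the U-shaped `fea_2` is reported as too weakly correlated even though it is a deterministic function of `raw`.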
- TSCC.detection.statistical.STAT_byDistFromCenter(series, eps, center_measure='median', dynamic_window=None)
Check whether each observation lies within a reasonable distance of the center. The distance is defined by eps or, when dynamic_window is set, by a dynamic epsilon ("Epsilon Pro").
- Parameters:
- series : pandas series
- eps : float
Threshold (epsilon) for the acceptable distance from the center. Must be greater than 0.
- center_measure : string, optional
One of ‘median’ or ‘mean’. The default is ‘median’.
- dynamic_window : int, optional
Epsilon Pro version: the epsilon threshold is dynamic according to the window frame. The default is None.
- Returns:
- s : pandas series
Float values: 1.0 (TRUE) if the value is regarded as correct by this method, 0.0 (FALSE) otherwise.
Examples
>>> import numpy as np
>>> import pandas as pd
>>> import TSCC.detection
>>> time_index = pd.date_range(start='2023-01-01 00:00', periods=7, freq='30T')
>>> s = pd.Series([np.nan, 1, 2.0, "string", 5.4, -999, None], index = time_index)
>>> TSCC.detection.STAT_byDistFromCenter(s, 3)
2023-01-01 00:00:00    0.0
2023-01-01 00:30:00    0.0
2023-01-01 01:00:00    0.0
2023-01-01 01:30:00    0.0
2023-01-01 02:00:00    0.5
2023-01-01 02:30:00    0.5
2023-01-01 03:00:00    0.0
Freq: 30T, dtype: float64
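The static variant of a distance-from-center check can be sketched as follows. This is an illustration of the technique under stated assumptions, not TSCC's code: the helper name `within_eps_of_center` is hypothetical, non-numeric entries are assumed to be coerced to NaN, and NaN is assumed to count as not correct. The dynamic (Epsilon Pro) variant is not covered.

```python
import numpy as np
import pandas as pd


def within_eps_of_center(series, eps, center_measure="median"):
    """Hypothetical helper: True where |value - center| <= eps."""
    # Coerce strings and None to NaN so the arithmetic below is well defined.
    values = pd.to_numeric(series, errors="coerce")
    center = values.median() if center_measure == "median" else values.mean()
    # Comparisons involving NaN evaluate to False.
    return (values - center).abs() <= eps


s = pd.Series([np.nan, 1, 2.0, "string", 5.4, -999, None])
ok = within_eps_of_center(s, eps=3)
print(ok.tolist())  # [False, True, True, False, False, False, False]
```

The median of the four numeric values is 1.5, so 1 and 2.0 fall within eps=3 of the center while 5.4 and -999 do not.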
- TSCC.detection.statistical.STAT_byDistFromCenterRolling(series, threshold, window=3)
Check if values are within static distance to rolling center.
- Parameters:
- series : pandas series
- threshold : number
Static distance threshold. Documented as optional with a default of 1, but the signature above gives no default, so pass it explicitly.
- window : int, optional
Window frame used to compute the rolling center. The default is 3.
- center : boolean, optional
Not listed in the signature above. The default is False.
- Returns:
- s : pandas series
Boolean values: TRUE if the value is regarded as correct by this method, FALSE otherwise.
Examples
>>> import numpy as np
>>> import pandas as pd
>>> import TSCC.detection
>>> time_index = pd.date_range(start='2023-01-01 00:00', periods=7, freq='30T')
>>> s = pd.Series([np.nan, 1, 2.0, "string", 5.4, -999, None], index = time_index)
>>> TSCC.detection.STAT_byDistFromCenterRolling(s, 3, 2)
2023-01-01 00:00:00    0.0
2023-01-01 00:30:00    0.0
2023-01-01 01:00:00    0.0
2023-01-01 01:30:00    0.0
2023-01-01 02:00:00    0.0
2023-01-01 02:30:00    0.5
2023-01-01 03:00:00    0.0
Freq: 30T, dtype: float64
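A rolling-center check compares each value with a local center instead of a global one. The sketch below assumes a trailing rolling median and a hypothetical helper name, `within_threshold_of_rolling_center`. Which center measure and window alignment TSCC actually uses is not stated above, so treat both as assumptions.

```python
import pandas as pd


def within_threshold_of_rolling_center(series, threshold, window=3):
    """Hypothetical helper: True where the value lies within `threshold`
    of the trailing rolling median over `window` observations."""
    values = pd.to_numeric(series, errors="coerce")
    # min_periods=1 lets the first observations use a shorter window.
    rolling_center = values.rolling(window, min_periods=1).median()
    return (values - rolling_center).abs() <= threshold


s = pd.Series([1.0, 1.2, 0.9, 10.0, 1.1, 1.0])
ok = within_threshold_of_rolling_center(s, threshold=2, window=3)
print(ok.tolist())  # [True, True, True, False, True, True]
```

Because the median is robust to a single extreme value per window, the spike at 10.0 is flagged while its neighbours are still accepted.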
- TSCC.detection.statistical.STAT_byIQR(series, lo=0.25, up=0.75, k=1.5)
Check if values are in the scaled interquantile range.
- Parameters:
- series : pandas series
- lo : number, optional
Lower quantile, given as a fraction in the range [0, 1]. The default is 0.25.
- up : number, optional
Upper quantile, given as a fraction in the range [0, 1]. The default is 0.75.
- k : number, optional
Scaling factor applied to the regular IQR to determine outliers. The default is the typical value of 1.5.
- Returns:
- s : pandas series
Boolean values: TRUE if the value is regarded as correct by this method, FALSE otherwise.
Examples
>>> import numpy as np
>>> import pandas as pd
>>> import TSCC.detection
>>> time_index = pd.date_range(start='2023-01-01 00:00', periods=7, freq='30T')
>>> s = pd.Series([np.nan, 1, 2.0, "string", 5.4, -999, None], index = time_index)
>>> TSCC.detection.STAT_byIQR(s)
2023-01-01 00:00:00    0.5
2023-01-01 00:30:00    0.0
2023-01-01 01:00:00    0.0
2023-01-01 01:30:00    0.5
2023-01-01 02:00:00    0.0
2023-01-01 02:30:00    0.5
2023-01-01 03:00:00    0.5
Freq: 30T, dtype: float64
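For reference, the classic scaled-IQR rule accepts values inside [Q_lo - k * IQR, Q_up + k * IQR]. The sketch below shows that rule in plain pandas. The helper name `within_scaled_iqr` is hypothetical, and TSCC's handling of NaN and non-numeric values may differ from the coercion assumed here.

```python
import pandas as pd


def within_scaled_iqr(series, lo=0.25, up=0.75, k=1.5):
    """Hypothetical helper: True where the value lies inside
    [Q_lo - k * IQR, Q_up + k * IQR]."""
    values = pd.to_numeric(series, errors="coerce")
    q_lo, q_up = values.quantile(lo), values.quantile(up)
    iqr = q_up - q_lo
    # between() is inclusive on both ends and returns False for NaN.
    return values.between(q_lo - k * iqr, q_up + k * iqr)


s = pd.Series([1.0, 2.0, 3.0, 4.0, 100.0])
ok = within_scaled_iqr(s)
print(ok.tolist())  # [True, True, True, True, False]
```

Here the quartiles are 2.0 and 4.0, so the IQR is 2.0 and the accepted interval is [-1.0, 7.0].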
- TSCC.detection.statistical.STAT_byIQRdiff(series, lo=0.25, up=0.75, k=1.5)
Check if differences to previous values are in the scaled interquantile range.
- Parameters:
- series : pandas series
- lo : number, optional
Lower quantile, given as a fraction in the range [0, 1]. The default is 0.25.
- up : number, optional
Upper quantile, given as a fraction in the range [0, 1]. The default is 0.75.
- k : number, optional
Scaling factor applied to the regular IQR to determine outliers. The default is the typical value of 1.5.
- Returns:
- s : pandas series
Boolean values: TRUE if the value is regarded as correct by this method, FALSE otherwise.
Examples
>>> import numpy as np
>>> import pandas as pd
>>> import TSCC.detection
>>> time_index = pd.date_range(start='2023-01-01 00:00', periods=7, freq='30T')
>>> s = pd.Series([np.nan, 1, 2.0, "string", 5.4, -999, None], index = time_index)
>>> TSCC.detection.STAT_byIQRdiff(s)
2023-01-01 00:00:00    0.0
2023-01-01 00:30:00    0.5
2023-01-01 01:00:00    0.0
2023-01-01 01:30:00    0.5
2023-01-01 02:00:00    0.5
2023-01-01 02:30:00    0.0
2023-01-01 03:00:00    0.5
Freq: 30T, dtype: float64
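Applying the IQR rule to first differences targets sudden jumps instead of extreme levels. The following sketch shows one way this could look. The helper name `diff_within_scaled_iqr` is hypothetical, and treating the first observation (which has no predecessor) as not accepted is an assumption of this sketch, not documented TSCC behaviour.

```python
import pandas as pd


def diff_within_scaled_iqr(series, lo=0.25, up=0.75, k=1.5):
    """Hypothetical helper: True where the difference to the previous
    value lies inside [Q_lo - k * IQR, Q_up + k * IQR] of all differences."""
    values = pd.to_numeric(series, errors="coerce")
    diffs = values.diff()  # first element is NaN (no predecessor)
    q_lo, q_up = diffs.quantile(lo), diffs.quantile(up)
    iqr = q_up - q_lo
    return diffs.between(q_lo - k * iqr, q_up + k * iqr)


s = pd.Series([1.0, 2.0, 4.0, 5.0, 50.0, 52.0])
ok = diff_within_scaled_iqr(s)
print(ok.tolist())  # [False, True, True, True, False, True]
```

The differences are 1, 2, 1, 45 and 2, giving an accepted interval of [-0.5, 3.5], so only the jump of 45 is rejected. The level 52.0 is accepted because its step from the previous value is ordinary.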
- TSCC.detection.statistical.STAT_byZScore(series, z=3, b_modified=False, k=0.6745)
Check if values are within a standardized score from the mean. Best for normally distributed data.
- Parameters:
- series : pandas series
- z : float, optional
z-score threshold for outliers. Values are typically considered outliers when abs(z) > 3. The default is 3.
- b_modified : boolean, optional
Whether the modified z-score is used. The default is False.
- k : float, optional
The default is 0.6745.
- Returns:
- pandas series
Series of index of outliers or cleaned data of the original series.
Examples
>>> import numpy as np
>>> import pandas as pd
>>> import TSCC.detection
>>> time_index = pd.date_range(start='2023-01-01 00:00', periods=7, freq='30T')
>>> s = pd.Series([np.nan, 1, 2.0, "string", 5.4, -999, None], index = time_index)
>>> TSCC.detection.STAT_byZScore(s)
2023-01-01 00:00:00    0.0
2023-01-01 00:30:00    0.0
2023-01-01 01:00:00    0.0
2023-01-01 01:30:00    0.0
2023-01-01 02:00:00    0.0
2023-01-01 02:30:00    0.0
2023-01-01 03:00:00    0.0
Freq: 30T, dtype: float64
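The difference between the standard and the modified z-score can be sketched as below. This is an illustration under assumptions, not TSCC's implementation: the helper name `within_z_score` is hypothetical, and `k` is assumed to play the role of the usual 0.6745 constant that scales the median absolute deviation (MAD) in the modified z-score.

```python
import pandas as pd


def within_z_score(series, z=3, b_modified=False, k=0.6745):
    """Hypothetical helper: True where the absolute (modified) z-score <= z."""
    values = pd.to_numeric(series, errors="coerce")
    if b_modified:
        # Modified z-score: k * (x - median) / MAD. Undefined when MAD is 0.
        median = values.median()
        mad = (values - median).abs().median()
        scores = k * (values - median) / mad
    else:
        # Standard z-score using the sample standard deviation.
        scores = (values - values.mean()) / values.std()
    return scores.abs() <= z


s = pd.Series([10.0, 11.0, 9.0, 10.0, 12.0, 10.0, 100.0])
plain = within_z_score(s)
robust = within_z_score(s, b_modified=True)
print(plain.tolist())   # all True: the outlier inflates mean and std
print(robust.tolist())  # only the last value (100.0) is False
```

In small samples a single extreme value inflates the mean and standard deviation enough to hide itself from the standard z-score. The median and MAD are barely affected by it, which is why the modified variant flags 100.0 here.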