Observation reliability
- TSCC.detection.basics.BASIC_byBoundaryVariance_60min(series, min_var)
Check whether the variance of the series is below a minimum threshold.
- Parameters:
- series : pandas series
A series with a timestamp index.
- min_var : str, int, or float
Minimum variance threshold.
- Returns:
- pandas series
A series where 0.5 indicates that the variance is below the minimum threshold (i.e., uncertain), and 0.0 indicates no issue.
Examples
>>> time_index = pd.date_range(start='2023-01-01 00:00', periods=5, freq='30T')
>>> s = pd.Series([0.05, 0.0, 0.1, 0.3, 0.25], index=time_index)
>>> TSCC.detection.BASIC_byBoundaryVariance_60min(s, 0.005)
2023-01-01 00:00:00    0.5
2023-01-01 00:30:00    0.5
2023-01-01 01:00:00    0.0
2023-01-01 01:30:00    0.0
2023-01-01 02:00:00    0.0
Freq: 30T, Name: belowMinVar, dtype: float64
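For intuition, a low-variance check of this kind can be sketched with a time-based rolling window. This is not the TSCC implementation: the helper name `flag_low_variance`, the trailing 60-minute window (suggested only by the function name), and the choice to leave points with undefined variance unflagged are all assumptions, so the output need not match the example above.

```python
import pandas as pd


def flag_low_variance(series: pd.Series, min_var: float) -> pd.Series:
    """Hypothetical helper: 0.5 where the 60-minute rolling variance is below min_var, else 0.0."""
    # Time-based window: each point sees the observations from the last
    # 60 minutes up to and including itself.
    rolling_var = series.rolling("60min").var()
    # A window with a single observation has undefined (NaN) variance;
    # NaN < min_var is False, so such points stay unflagged (an assumption).
    flags = (rolling_var < min_var).astype(float) * 0.5
    return flags.rename("belowMinVar")


time_index = pd.date_range(start="2023-01-01 00:00", periods=5, freq="30min")
s = pd.Series([0.05, 0.0, 0.1, 0.3, 0.25], index=time_index)
flags = flag_low_variance(s, 0.003)
print(flags)
```

With 30-minute data each window holds at most two observations, so the rolling variance here reduces to the sample variance of each value and its predecessor.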
- TSCC.detection.basics.BASIC_byClimaAverage(series, period='monthly', deviation_threshold=2)
Perform a temporal consistency check against climatological averages (weekly, monthly, or seasonal) computed from the time series itself.
- Parameters:
- series : pandas series
A series of the target variable with a datetime index.
- period : str, optional
The period for climatological comparison. Options are ‘weekly’, ‘monthly’, or ‘seasonal’. Default is ‘monthly’.
- deviation_threshold : float, optional
The number of standard deviations from the climatological mean to flag as an error. Default is 2.
- Returns:
- pandas series
A series where 0.5 indicates that the value deviates significantly from the climatological average, and 0.0 indicates no significant deviation.
Examples
>>> time_index = pd.date_range(start='2023-01-01 00:00', periods=5, freq='30T')
>>> s = pd.Series([0.05, 0.0, 0.1, 0.3, 0.25], index=time_index)
>>> TSCC.detection.BASIC_byClimaAverage(s)
index
2023-01-01 00:00:00    0.0
2023-01-01 00:30:00    0.0
2023-01-01 01:00:00    0.0
2023-01-01 01:30:00    0.0
2023-01-01 02:00:00    0.0
2023-01-01 02:30:00    0.0
2023-01-01 03:00:00    0.0
Name: clima_avg_err, dtype: float64
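The comparison against a climatological mean can be illustrated for the monthly case as follows. This is a minimal sketch under stated assumptions, not the library's code: `flag_clima_deviation` is a hypothetical helper, it handles only calendar months, and it uses the sample standard deviation of each month as the spread.

```python
import pandas as pd


def flag_clima_deviation(series: pd.Series, deviation_threshold: float = 2.0) -> pd.Series:
    """Hypothetical helper: 0.5 where a value is far from its calendar-month mean, else 0.0."""
    months = series.index.month
    # Mean and standard deviation per calendar month, broadcast back to every row.
    clim_mean = series.groupby(months).transform("mean")
    clim_std = series.groupby(months).transform("std")
    deviates = (series - clim_mean).abs() > deviation_threshold * clim_std
    return deviates.astype(float) * 0.5


# Made-up data: ten days in January with one spike, ten constant days in February.
idx = pd.date_range("2023-01-01", periods=10, freq="D").append(
    pd.date_range("2023-02-01", periods=10, freq="D")
)
s = pd.Series([1.0] * 9 + [10.0] + [5.0] * 10, index=idx)
flags = flag_clima_deviation(s)
print(flags[flags > 0])
```

Because the mean and standard deviation are computed from the same series that is being checked, a large outlier inflates the spread it is judged against, which is one reason to keep the threshold modest for short records.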
- TSCC.detection.basics.BASIC_byExistance(series, no_data_char=None, obs_type=None)
Check if observation values exist and have the correct data type.
- Parameters:
- series : pandas series
The input series to check for validity.
- no_data_char : optional
A placeholder value for missing data.
- obs_type : data type, optional
The expected data type of the observations, e.g. int, float, str.
- Returns:
- pandas series
A series where 1.0 indicates an error (i.e., invalid observation) and 0.0 indicates no error.
Examples
>>> s = pd.Series([np.nan, 1, 2.0, "string", 5.4, -999, None])
>>> TSCC.detection.BASIC_byExistance(s, -999, float)
0    1.0
1    1.0
2    0.0
3    1.0
4    0.0
5    1.0
6    1.0
dtype: float64
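The three conditions implied by the parameters (missing value, no-data placeholder, unexpected type) can be sketched element by element. The helper `flag_missing_or_mistyped` below is hypothetical and only illustrates the idea; the order of the checks and the use of `isinstance` for the type test are assumptions.

```python
import math

import pandas as pd


def flag_missing_or_mistyped(series, no_data_char=None, obs_type=None):
    """Hypothetical helper: 1.0 for missing, placeholder, or wrongly typed values, else 0.0."""

    def is_invalid(value):
        if value is None or (isinstance(value, float) and math.isnan(value)):
            return True  # missing value
        if no_data_char is not None and value == no_data_char:
            return True  # placeholder that stands for "no data"
        if obs_type is not None and not isinstance(value, obs_type):
            return True  # value has an unexpected data type
        return False

    return pd.Series([float(is_invalid(v)) for v in series], index=series.index)


s = pd.Series([float("nan"), 1, 2.0, "string", 5.4, -999, None])
flags = flag_missing_or_mistyped(s, -999, float)
print(flags)
```

Under these assumptions the integer `1` is flagged because it is not a `float`, which is consistent with the documented example output.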
- TSCC.detection.basics.BASIC_byNeighbors(series, neighbors_df, max_diff=100)
Compare rainfall data from a given station with neighboring stations to detect outliers.
- Parameters:
- series : pandas series
A series of rainfall data from a single station.
- neighbors_df : pandas.DataFrame
A DataFrame containing rainfall data from neighboring stations.
- max_diff : float, optional
The maximum allowable difference between the station’s rainfall value and the average of neighboring stations. Default is 100.
- Returns:
- pandas series
A series where 0.5 indicates that the difference between the station and its neighbors exceeds the maximum allowable difference, and 0.0 indicates no issue.
Examples
>>> nr_obs = 6
>>> np.random.seed(0)
>>> # Generate random data
>>> data = np.random.randn(nr_obs, 4)
>>> time_index = pd.date_range(start='2023-01-01 00:00', periods=nr_obs, freq='30T')
>>> # Create DataFrame with specified column names
>>> df = pd.DataFrame(data, columns=["ground_truth", "raw", 'fea_1', 'fea_2'], index = time_index)
>>> df["raw"] = df["ground_truth"] + np.random.normal(0, 5, nr_obs)*np.random.randint(0, 2, nr_obs)
>>> df["isCorrect_gt"] = df["ground_truth"] == df["raw"]
>>> TSCC.detection.BASIC_byNeighbors(df["raw"], df[["fea_1", "fea_2"]], max_diff=1)
2023-01-01 00:00:00    0.5
2023-01-01 00:30:00    0.5
2023-01-01 01:00:00    0.0
2023-01-01 01:30:00    0.0
2023-01-01 02:00:00    0.5
2023-01-01 02:30:00    0.5
Freq: 30T, dtype: float64
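The comparison described for `max_diff` (station value against the average of the neighbors) can be written in a few lines. The following is a sketch only: `flag_neighbor_mismatch` is a hypothetical name, the neighbor average is taken as an unweighted row-wise mean, and the input values are made up.

```python
import pandas as pd


def flag_neighbor_mismatch(series, neighbors_df, max_diff=100.0):
    """Hypothetical helper: 0.5 where the station differs from the neighbor mean by more than max_diff."""
    # Row-wise mean across neighboring stations (NaN neighbors are skipped by default).
    neighbor_mean = neighbors_df.mean(axis=1)
    exceeds = (series - neighbor_mean).abs() > max_diff
    return exceeds.astype(float) * 0.5


time_index = pd.date_range(start="2023-01-01 00:00", periods=4, freq="30min")
station = pd.Series([0.0, 2.0, 10.0, 1.0], index=time_index)
neighbors = pd.DataFrame(
    {"nb_1": [0.5, 1.0, 1.0, 1.5], "nb_2": [0.0, 2.0, 2.0, 0.5]}, index=time_index
)
flags = flag_neighbor_mismatch(station, neighbors, max_diff=1.0)
print(flags)
```

Only the third observation is flagged here: the neighbor mean is 1.5 while the station reports 10.0.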
- TSCC.detection.basics.BASIC_byPersistence(series, persistence_window=3)
Perform a persistence check to identify unrealistically constant values over consecutive time steps.
- Parameters:
- series : pandas series
A series of the target variable with a datetime index.
- persistence_window : int, optional
The number of consecutive time steps with the same value required to trigger an error flag. Default is 3.
- Returns:
- pandas series
A series where 0.5 indicates that the value has been constant over the specified persistence window, and 0.0 indicates no issue.
Examples
>>> time_index = pd.date_range(start='2023-01-01 00:00', periods=5, freq='30T')
>>> s = pd.Series([0.05, 0.0, 3, 0.3, 0.25], index=time_index)
>>> TSCC.detection.BASIC_byPersistence(s)
2023-01-01 00:00:00    0.0
2023-01-01 00:30:00    0.0
2023-01-01 01:00:00    0.0
2023-01-01 01:30:00    0.0
2023-01-01 02:00:00    0.0
Freq: 30T, Name: persistence_error, dtype: float64
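Since the example above contains no repeated values, a case that does trigger the flag may help. The sketch below uses a hypothetical helper, `flag_persistence`, and assumes that every member of a sufficiently long run is flagged; the library may instead flag only the observations from the end of the window onward.

```python
import pandas as pd


def flag_persistence(series, persistence_window=3):
    """Hypothetical helper: 0.5 for values inside a run of at least persistence_window equal values."""
    # Start a new run id whenever the value differs from the previous time step.
    run_id = (series != series.shift()).cumsum()
    # Length of the run that each observation belongs to.
    run_length = run_id.map(run_id.value_counts())
    return (run_length >= persistence_window).astype(float) * 0.5


time_index = pd.date_range(start="2023-01-01 00:00", periods=5, freq="30min")
stuck = pd.Series([0.05, 0.0, 0.0, 0.0, 0.25], index=time_index)
varying = pd.Series([0.05, 0.0, 3, 0.3, 0.25], index=time_index)
stuck_flags = flag_persistence(stuck)
varying_flags = flag_persistence(varying)
print(stuck_flags)
```

For rainfall, long runs of zeros are physically plausible, so in practice a check like this is often restricted to non-zero values or given a much larger window.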
- TSCC.detection.basics.BASIC_byRange(series, lower=None, upper=None)
Check if values are within plausible limits.
- Parameters:
- series : pandas series
The series to check against the limits.
- lower : number, optional
Lower boundary of the series. The default is None.
- upper : number, optional
Upper boundary of the series. The default is None.
- Returns:
- s : pandas series
Cleaned series.
Examples
The first example returns False for every value out of boundary.
>>> test_boolean = STAT_byBoundary(df['col1'], upper=9, lower=1, setValueTo = "Boolean")
0     True
1     True
2     True
3     True
4    False
5     True
6     True
7     True
8    False
9     True
Name: col1, dtype: bool
The second example returns numpy.NaN for every value out of boundary, else the initial value.
>>> BASIC_isValid_byRange(df['col1'], upper=9, lower=1, setValueTo = "OutOfBoundary_toNaN")
0    1.0
1    9.0
2    3.0
3    6.0
4    NaN
5    8.0
6    3.0
7    1.0
8    NaN
9    5.0
Name: col1, dtype: float64
The third example deletes every value which is out of boundary.
>>> BASIC_isValid_byRange(df['col1'], upper=9, lower=1, setValueTo = "OutOfBoundary_Delete")
0    1
1    9
2    3
3    6
5    8
6    3
7    1
9    5
Name: col1, dtype: int64
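The examples above do not show how `df['col1']` was built. As a self-contained illustration of the range idea with the `lower` and `upper` parameters from the signature, the sketch below masks out-of-range values with NaN. The helper `mask_out_of_range` and the input values are hypothetical, and the sketch does not reproduce the `setValueTo` modes used in the examples.

```python
import pandas as pd


def mask_out_of_range(series, lower=None, upper=None):
    """Hypothetical helper: replace values outside [lower, upper] with NaN."""
    in_range = pd.Series(True, index=series.index)
    if lower is not None:
        in_range &= series >= lower
    if upper is not None:
        in_range &= series <= upper
    # Series.where keeps values where the condition holds and writes NaN elsewhere.
    return series.where(in_range)


col1 = pd.Series([1, 9, 3, 6, 12, 8, 3, 1, 0, 5], name="col1")  # made-up input values
cleaned = mask_out_of_range(col1, lower=1, upper=9)
print(cleaned)
```

Calling `cleaned.dropna()` afterwards gives the "delete" behavior, and `cleaned.notna()` gives a Boolean validity mask, provided the input itself has no missing values.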
- TSCC.detection.basics.BASIC_byStepChangeMax(series, max_diff, timestep=Timedelta('0 days 00:30:00'))
Check whether the change between consecutive values exceeds a maximum difference.
- Parameters:
- series : pandas series
A series of numeric values with a timestamp index.
- max_diff : float
The maximum allowable difference between consecutive values.
- timestep : pandas Timedelta, optional
The maximum time interval between consecutive values. Default is 30 minutes.
- Returns:
- pandas series
A series where 0.5 indicates that the step change exceeds the maximum difference and 0.0 indicates no issue.
Examples
>>> time_index = pd.date_range(start='2023-01-01 00:00', periods=5, freq='30T')
>>> s = pd.Series([0.05, 0.0, 3, 0.3, 0.25], index=time_index)
>>> TSCC.detection.BASIC_byStepChangeMax(s, 2, pd.Timedelta(minutes = 30))
2023-01-01 00:00:00    0.0
2023-01-01 00:30:00    0.0
2023-01-01 01:00:00    0.5
2023-01-01 01:30:00    0.5
2023-01-01 02:00:00    0.0
Freq: 30T, dtype: float64
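In the example above, both the jump up to 3 and the drop back to 0.3 exceed `max_diff=2`, so two consecutive points are flagged. A step-change check with that behavior can be sketched as follows; `flag_step_change` is a hypothetical helper, and the rule that pairs further apart than `timestep` are not compared is an assumption about how the parameter is meant to work.

```python
import pandas as pd


def flag_step_change(series, max_diff, timestep=pd.Timedelta(minutes=30)):
    """Hypothetical helper: 0.5 where the change from the previous value exceeds max_diff."""
    # Absolute change relative to the previous observation.
    step = series.diff().abs()
    # Time elapsed since the previous observation.
    gap = series.index.to_series().diff()
    # Only compare observations that are at most `timestep` apart (an assumption).
    exceeds = (step > max_diff) & (gap <= timestep)
    return exceeds.astype(float) * 0.5


time_index = pd.date_range(start="2023-01-01 00:00", periods=5, freq="30min")
s = pd.Series([0.05, 0.0, 3, 0.3, 0.25], index=time_index)
flags = flag_step_change(s, 2, pd.Timedelta(minutes=30))
print(flags)
```

The first observation has no predecessor, so its difference is undefined and it is never flagged in this sketch.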