Observation reliability

TSCC.detection.basics.BASIC_byBoundaryVariance_60min(series, min_var)

Check if variance is too low.

Parameters:
series : pandas series

A series with a timestamp index.

min_var : string, integer, or float

The minimum variance; values below it are flagged.

Returns:
pandas series

A series where 0.5 indicates that the variance is below the minimum threshold (i.e., uncertain), and 0.0 indicates no issue.

Examples

>>> time_index = pd.date_range(start='2023-01-01 00:00', periods=5, freq='30T')
>>> s = pd.Series([0.05, 0.0, 0.1, 0.3, 0.25], index=time_index)
>>> TSCC.detection.BASIC_byBoundaryVariance_60min(s, 0.005)
2023-01-01 00:00:00    0.5
2023-01-01 00:30:00    0.5
2023-01-01 01:00:00    0.0
2023-01-01 01:30:00    0.0
2023-01-01 02:00:00    0.0
Freq: 30T, Name: belowMinVar, dtype: float64
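
One way to picture this check is to split the series into fixed 60-minute blocks and compare each block's sample variance with `min_var`. The sketch below does that with plain pandas. `flag_low_variance` is a hypothetical helper, not TSCC's implementation, and the block-wise grouping is an assumption inferred from the function name.

```python
import pandas as pd


def flag_low_variance(series, min_var, block="60min"):
    """Hypothetical helper (not part of TSCC): 0.5 where the block variance is below min_var."""
    # Assign every timestamp to the start of its block (assumed: fixed 60-minute blocks).
    block_start = series.index.floor(block)
    # Sample variance per block, broadcast back to every observation in that block.
    block_var = series.groupby(block_start).transform("var")
    # Blocks with a single observation have NaN variance and are left unflagged.
    flags = (block_var < min_var).astype(float) * 0.5
    return flags.rename("belowMinVar")


time_index = pd.date_range(start="2023-01-01 00:00", periods=5, freq="30min")
s = pd.Series([0.05, 0.0, 0.1, 0.3, 0.25], index=time_index)
low_var_flags = flag_low_variance(s, 0.005)
```

With these inputs only the first block (00:00 and 00:30) has a variance below 0.005, so only those two observations receive 0.5.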

TSCC.detection.basics.BASIC_byClimaAverage(series, period='monthly', deviation_threshold=2)

Perform a temporal consistency check against climatological averages (weekly, monthly, or seasonal) computed from the time series itself.

Parameters:
series : pandas series

A series of the target variable with a datetime index.

period : str, optional

The period for climatological comparison. Options are ‘weekly’, ‘monthly’, or ‘seasonal’. Default is ‘monthly’.

deviation_threshold : float, optional

The number of standard deviations from the climatological mean to flag as an error. Default is 2.

Returns:
pandas series

A series where 0.5 indicates that the value deviates significantly from the climatological average, and 0.0 indicates no significant deviation.

Examples

>>> time_index = pd.date_range(start='2023-01-01 00:00', periods=5, freq='30T')
>>> s = pd.Series([0.05, 0.0, 0.1, 0.3, 0.25], index=time_index)
>>> TSCC.detection.BASIC_byClimaAverage(s)
index
2023-01-01 00:00:00    0.0
2023-01-01 00:30:00    0.0
2023-01-01 01:00:00    0.0
2023-01-01 01:30:00    0.0
2023-01-01 02:00:00    0.0
Name: clima_avg_err, dtype: float64
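
The comparison against a climatological average can be sketched as a group-wise z-score test. The example below covers only the monthly case and assumes that the mean and standard deviation are computed per calendar month from the series itself. `flag_clima_deviation` is a hypothetical helper, not TSCC's implementation.

```python
import pandas as pd


def flag_clima_deviation(series, deviation_threshold=2.0):
    """Hypothetical helper (not part of TSCC): 0.5 where a value is far from its monthly mean."""
    # Group by calendar month across all years (assumed meaning of period='monthly').
    month = series.index.month
    clim_mean = series.groupby(month).transform("mean")
    clim_std = series.groupby(month).transform("std")
    # Flag values more than deviation_threshold standard deviations from the mean.
    deviates = (series - clim_mean).abs() > deviation_threshold * clim_std
    return (deviates.astype(float) * 0.5).rename("clima_avg_err")


# Made-up daily values with one obvious outlier on the last day.
days = pd.date_range(start="2023-01-01", periods=10, freq="D")
values = [1.0, 1.1, 0.9, 1.0, 1.2, 0.8, 1.0, 1.1, 0.9, 20.0]
clima_flags = flag_clima_deviation(pd.Series(values, index=days))
```

Because the outlier itself inflates the standard deviation, a check of this kind only fires when the group contains enough ordinary values. Here only the last observation is flagged.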

TSCC.detection.basics.BASIC_byExistance(series, no_data_char=None, obs_type=None)

Check if observation values exist and have the correct data type.

Parameters:
series : pandas series

The input series to check for validity.

no_data_char : optional

A placeholder value for missing data.

obs_type : data type, e.g. int, float, str; optional

The expected data type of the observations.

Returns:
pandas series

A series where 1.0 indicates an error (i.e., invalid observation) and 0.0 indicates no error.

Examples

>>> s = pd.Series([np.nan, 1, 2.0, "string", 5.4, -999,  None])
>>> TSCC.detection.BASIC_byExistance(s, -999, float)
0    1.0
1    1.0
2    0.0
3    1.0
4    0.0
5    1.0
6    1.0
dtype: float64
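
The three conditions implied by the example (missing value, no-data placeholder, wrong type) can be sketched as an element-wise test. `flag_missing_or_wrong_type` is a hypothetical helper, not TSCC's implementation, and the order of the checks is an assumption.

```python
import pandas as pd


def flag_missing_or_wrong_type(series, no_data_char=None, obs_type=None):
    """Hypothetical helper (not part of TSCC): 1.0 for invalid observations, else 0.0."""

    def is_invalid(value):
        # Missing values (None, NaN) are always invalid.
        if pd.isna(value):
            return True
        # The placeholder for missing data is invalid as well.
        if no_data_char is not None and value == no_data_char:
            return True
        # Finally, enforce the expected Python type if one was given.
        return obs_type is not None and not isinstance(value, obs_type)

    return pd.Series([float(is_invalid(v)) for v in series], index=series.index)


s = pd.Series([float("nan"), 1, 2.0, "string", 5.4, -999, None])
existence_flags = flag_missing_or_wrong_type(s, -999, float)
```

Note that the integer `1` is flagged because it is not an instance of `float`, which matches the second row of the documented output.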

TSCC.detection.basics.BASIC_byNeighbors(series, neighbors_df, max_diff=100)

Compare rainfall data from a given station with neighboring stations to detect outliers.

Parameters:
series : pandas series

A series of rainfall data from a single station.

neighbors_df : pandas.DataFrame

A DataFrame containing rainfall data from neighboring stations.

max_diff : float, optional

The maximum allowable difference between the station’s rainfall value and the average of neighboring stations. Default is 100.

Returns:
pandas series

A series where 0.5 indicates that the difference between the station and its neighbors exceeds the maximum allowable difference, and 0.0 indicates no issue.

Examples

>>> nr_obs = 6
>>> np.random.seed(0)
>>> # Generate random data
>>> data = np.random.randn(nr_obs, 4)
>>> time_index = pd.date_range(start='2023-01-01 00:00', periods=nr_obs, freq='30T')
>>> # Create DataFrame with specified column names
>>> df = pd.DataFrame(data, columns=["ground_truth", "raw", 'fea_1', 'fea_2'], index = time_index)
>>> df["raw"] = df["ground_truth"] + np.random.normal(0, 5, nr_obs)*np.random.randint(0, 2, nr_obs)
>>> df["isCorrect_gt"] = df["ground_truth"] == df["raw"]
>>> TSCC.detection.BASIC_byNeighbors(df["raw"], df[["fea_1", "fea_2"]], max_diff=1)
2023-01-01 00:00:00    0.5
2023-01-01 00:30:00    0.5
2023-01-01 01:00:00    0.0
2023-01-01 01:30:00    0.0
2023-01-01 02:00:00    0.5
2023-01-01 02:30:00    0.5
Freq: 30T, dtype: float64
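
The neighbor comparison described above amounts to a row-wise mean over the neighboring stations followed by a threshold on the absolute difference. A minimal sketch with made-up values follows. `flag_neighbor_mismatch` is a hypothetical helper, not TSCC's implementation.

```python
import pandas as pd


def flag_neighbor_mismatch(series, neighbors_df, max_diff=100.0):
    """Hypothetical helper (not part of TSCC): 0.5 where station and neighbor mean disagree."""
    # Row-wise mean over all neighboring stations at each timestamp.
    neighbor_mean = neighbors_df.mean(axis=1)
    # Flag timestamps where the station is further than max_diff from that mean.
    exceeds = (series - neighbor_mean).abs() > max_diff
    return exceeds.astype(float) * 0.5


time_index = pd.date_range(start="2023-01-01 00:00", periods=4, freq="30min")
station = pd.Series([1.0, 2.0, 9.0, 0.5], index=time_index)
neighbors = pd.DataFrame(
    {"fea_1": [1.2, 2.5, 2.0, 0.0], "fea_2": [0.8, 1.5, 3.0, 1.0]},
    index=time_index,
)
neighbor_flags = flag_neighbor_mismatch(station, neighbors, max_diff=1)
```

Only the third observation (9.0 against a neighbor mean of 2.5) exceeds the allowed difference of 1.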

TSCC.detection.basics.BASIC_byPersistence(series, persistence_window=3)

Perform a persistence check to identify unrealistically constant values over consecutive time steps.

Parameters:
series : pandas series

A series of the target variable with a datetime index.

persistence_window : int, optional

The number of consecutive time steps with the same value required to trigger an error flag. Default is 3.

Returns:
pandas series

A series where 0.5 indicates that the value has been constant over the specified persistence window, and 0.0 indicates no issue.

Examples

>>> time_index = pd.date_range(start='2023-01-01 00:00', periods=5, freq='30T')
>>> s = pd.Series([0.05, 0.0, 3, 0.3, 0.25], index=time_index)
>>> TSCC.detection.BASIC_byPersistence(s)
2023-01-01 00:00:00    0.0
2023-01-01 00:30:00    0.0
2023-01-01 01:00:00    0.0
2023-01-01 01:30:00    0.0
2023-01-01 02:00:00    0.0
Freq: 30T, Name: persistence_error, dtype: float64
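
A persistence check can be sketched by labelling runs of identical consecutive values and measuring their length. The example below flags every member of a run that is at least `persistence_window` long; whether TSCC flags the whole run or only its tail is not stated here, so treat that as an assumption. `flag_persistence` is a hypothetical helper, not TSCC's implementation.

```python
import pandas as pd


def flag_persistence(series, persistence_window=3):
    """Hypothetical helper (not part of TSCC): 0.5 for values inside a long constant run."""
    # A new run starts wherever the value differs from its predecessor.
    run_id = (series != series.shift()).cumsum()
    # Length of the run each observation belongs to.
    run_length = series.groupby(run_id).transform("count")
    flags = (run_length >= persistence_window).astype(float) * 0.5
    return flags.rename("persistence_error")


time_index = pd.date_range(start="2023-01-01 00:00", periods=6, freq="30min")
s = pd.Series([0.05, 0.3, 0.3, 0.3, 0.25, 0.25], index=time_index)
persistence_flags = flag_persistence(s)
```

The three consecutive 0.3 values form a run of length 3 and are flagged; the two 0.25 values stay below the window and are not.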

TSCC.detection.basics.BASIC_byRange(series, lower=None, upper=None)

Check if values are within plausible limits.

Parameters:
series : pandas series

The series to check.

lower : number, optional

Lower boundary of the series. The default is None.

upper : number, optional

Upper boundary of the series. The default is None.

Returns:
s : pandas series

The cleaned series.

Examples

The first example returns False for every value out of boundary.

>>> test_boolean = STAT_byBoundary(df['col1'], upper=9, lower=1, setValueTo = "Boolean")
0     True
1     True
2     True
3     True
4    False
5     True
6     True
7     True
8    False
9     True
Name: col1, dtype: bool

The second example returns numpy.NaN for every value out of boundary, else the initial value.

>>> BASIC_isValid_byRange(df['col1'], upper=9, lower=1, setValueTo = "OutOfBoundary_toNaN")
0    1.0
1    9.0
2    3.0
3    6.0
4    NaN
5    8.0
6    3.0
7    1.0
8    NaN
9    5.0
Name: col1, dtype: float64

The third example deletes every value which is out of boundary.

>>> BASIC_isValid_byRange(df['col1'], upper=9, lower=1, setValueTo = "OutOfBoundary_Delete")
0    1
1    9
2    3
3    6
5    8
6    3
7    1
9    5
Name: col1, dtype: int64
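
The underlying boundary test can be sketched independently of the output modes shown above. The example builds a boolean mask that is True for values inside the closed interval and skips any bound left as None. `within_range` and the values in `col1` are made up for illustration; this is not TSCC's implementation.

```python
import pandas as pd


def within_range(series, lower=None, upper=None):
    """Hypothetical helper (not part of TSCC): True where lower <= value <= upper."""
    in_range = pd.Series(True, index=series.index, name=series.name)
    # A bound of None means that side is unbounded.
    if lower is not None:
        in_range &= series >= lower
    if upper is not None:
        in_range &= series <= upper
    return in_range


# Made-up data: the values 12 and 0 fall outside the interval [1, 9].
col1 = pd.Series([1, 9, 3, 6, 12, 8, 3, 1, 0, 5], name="col1")
range_mask = within_range(col1, lower=1, upper=9)
```

Such a mask can then be turned into NaN replacement with `col1.where(range_mask)` or into row removal with `col1[range_mask]`.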

TSCC.detection.basics.BASIC_byStepChangeMax(series, max_diff, timestep=Timedelta('0 days 00:30:00'))

Check if the change between consecutive values is too large.

Parameters:
series : pandas series

A series of numeric values with a timestamp index.

max_diff : float

The maximum allowable difference between consecutive values.

timestep : timedelta, optional

The maximum time interval between consecutive values. Default is 30 minutes.

Returns:
pandas series

A series where 0.5 indicates that the step change exceeds the maximum difference, and 0.0 indicates no issue.

Examples

>>> time_index = pd.date_range(start='2023-01-01 00:00', periods=5, freq='30T')
>>> s = pd.Series([0.05, 0.0, 3, 0.3, 0.25], index=time_index)
>>> TSCC.detection.BASIC_byStepChangeMax(s, 2, pd.Timedelta(minutes = 30))
2023-01-01 00:00:00    0.0
2023-01-01 00:30:00    0.0
2023-01-01 01:00:00    0.5
2023-01-01 01:30:00    0.5
2023-01-01 02:00:00    0.0
Freq: 30T, dtype: float64
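
A step-change check can be sketched with two differences: one on the values and one on the timestamps, so that jumps across gaps longer than `timestep` are ignored. `flag_step_change` is a hypothetical helper, not TSCC's implementation, and the handling of gaps is an assumption based on the parameter description.

```python
import pandas as pd


def flag_step_change(series, max_diff, timestep=pd.Timedelta(minutes=30)):
    """Hypothetical helper (not part of TSCC): 0.5 where a value jumps by more than max_diff."""
    # Absolute change from the previous observation (NaN for the first one).
    value_jump = series.diff().abs()
    # Time elapsed since the previous observation (NaT for the first one).
    gap = series.index.to_series().diff()
    # Only compare neighbors that are at most one timestep apart.
    exceeds = (value_jump > max_diff) & (gap <= timestep)
    return exceeds.astype(float) * 0.5


time_index = pd.date_range(start="2023-01-01 00:00", periods=5, freq="30min")
s = pd.Series([0.05, 0.0, 3, 0.3, 0.25], index=time_index)
step_flags = flag_step_change(s, 2, pd.Timedelta(minutes=30))
```

Both the jump up to 3 and the drop back to 0.3 exceed a difference of 2, so the third and fourth observations are flagged, in line with the documented output.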