Observation reliability

TSCC.detection.basics.BASIC_byBoundaryVariance_60min(series, min_var)

Check if variance is too low.

Parameters:

series (pandas series): A series with a timestamp index.
min_var (string, integer, or float): The minimum variability.

Returns:

pandas series: A series where 0.5 indicates that the variance is below the minimum threshold (i.e., uncertain), and 0.0 indicates no issue.

Examples

>>> time_index = pd.date_range(start='2023-01-01 00:00', periods=5, freq='30T')
>>> s = pd.Series([0.05, 0.0, 0.1, 0.3, 0.25], index=time_index)
>>> TSCC.detection.BASIC_byBoundaryVariance_60min(s, 0.005)
2023-01-01 00:00:00    0.5
2023-01-01 00:30:00    0.5
2023-01-01 01:00:00    0.0
2023-01-01 01:30:00    0.0
2023-01-01 02:00:00    0.0
Freq: 30T, Name: belowMinVar, dtype: float64
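
The idea behind a minimum-variance check can be sketched with a time-based rolling window in pandas. This is only an illustration and not the TSCC implementation: the helper name `flag_low_variance`, the trailing 60-minute window, and the handling of the first sample are assumptions, so its flags need not match the output above.

```python
import pandas as pd


def flag_low_variance(series: pd.Series, min_var: float, window: str = "60min") -> pd.Series:
    """Hypothetical helper: 0.5 where the rolling variance is below min_var, else 0.0."""
    # Sample variance over a trailing time window (requires a DatetimeIndex).
    rolling_var = series.rolling(window).var()
    # A window with a single value has an undefined (NaN) variance;
    # NaN < min_var is False, so such samples stay unflagged in this sketch.
    flags = (rolling_var < min_var).astype(float) * 0.5
    return flags.rename("belowMinVar")


idx = pd.date_range("2023-01-01 00:00", periods=5, freq="30min")
values = pd.Series([1.0, 1.0, 1.0, 5.0, 9.0], index=idx)
flags = flag_low_variance(values, min_var=0.1)
print(flags)
```

Here the second and third samples are flagged because their 60-minute windows contain only identical values.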

TSCC.detection.basics.BASIC_byClimaAverage(series, period='monthly', deviation_threshold=2)

Perform a temporal consistency check against the series' own climatological averages (weekly, monthly, or seasonal).

Parameters:

series (pandas series): A series of the target variable with a datetime index.
period (str, optional): The period for climatological comparison. Options are ‘weekly’, ‘monthly’, or ‘seasonal’. Default is ‘monthly’.
deviation_threshold (float, optional): The number of standard deviations from the climatological mean to flag as an error. Default is 2.

Returns:

pandas series: A series where 0.5 indicates that the value deviates significantly from the climatological average, and 0.0 indicates no significant deviation.

Examples

>>> time_index = pd.date_range(start='2023-01-01 00:00', periods=5, freq='30T')
>>> s = pd.Series([0.05, 0.0, 0.1, 0.3, 0.25], index=time_index)
>>> TSCC.detection.BASIC_byClimaAverage(s)
index
2023-01-01 00:00:00    0.0
2023-01-01 00:30:00    0.0
2023-01-01 01:00:00    0.0
2023-01-01 01:30:00    0.0
2023-01-01 02:00:00    0.0
2023-01-01 02:30:00    0.0
2023-01-01 03:00:00    0.0
Name: clima_avg_err, dtype: float64
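
A climatological comparison of this kind can be sketched as a z-score against per-month statistics of the series itself. The helper `flag_clima_deviation` below is hypothetical and covers only the monthly case; how TSCC actually groups the data and computes the deviation is not shown in this documentation.

```python
import pandas as pd


def flag_clima_deviation(series: pd.Series, deviation_threshold: float = 2.0) -> pd.Series:
    """Hypothetical helper: 0.5 where a value is far from its own monthly mean, else 0.0."""
    month = series.index.month
    # Per-month mean and sample standard deviation, broadcast back to every row.
    clim_mean = series.groupby(month).transform("mean")
    clim_std = series.groupby(month).transform("std")
    z_score = (series - clim_mean).abs() / clim_std
    # NaN z-scores (e.g. a month with zero spread) compare as False and stay unflagged.
    return (z_score > deviation_threshold).astype(float) * 0.5


idx = pd.date_range("2023-01-01", periods=10, freq="D")
daily = pd.Series([1.0] * 9 + [10.0], index=idx)
flags = flag_clima_deviation(daily)
print(flags)
```

With nine values of 1.0 and one value of 10.0 in the same month, only the last value lies more than two standard deviations from the monthly mean.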

TSCC.detection.basics.BASIC_byExistance(series, no_data_char=None, obs_type=None)

Check if observation values exist and have the correct data type.

Parameters:

series (pandas series): The input series to check for validity.
no_data_char (optional): A placeholder value for missing data.
obs_type (data type, optional): The expected data type of the observations, e.g. int, float, str.

Returns:

pandas series: A series where 1.0 indicates an error (i.e., invalid observation) and 0.0 indicates no error.

Examples

>>> s = pd.Series([np.nan, 1, 2.0, "string", 5.4, -999,  None])
>>> TSCC.detection.BASIC_byExistance(s, -999, float)
0    1.0
1    1.0
2    0.0
3    1.0
4    0.0
5    1.0
6    1.0
dtype: float64
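
An existence and type check of this kind can be sketched as an element-wise test. The helper `flag_missing_or_wrong_type` is hypothetical, and the rule that a value must be an instance of `obs_type` is an assumption rather than TSCC's documented logic.

```python
import numpy as np
import pandas as pd


def flag_missing_or_wrong_type(series: pd.Series, no_data_char=None, obs_type=None) -> pd.Series:
    """Hypothetical helper: 1.0 for missing, placeholder, or wrongly typed values, else 0.0."""

    def is_invalid(value) -> bool:
        if pd.isna(value):  # covers both NaN and None
            return True
        if no_data_char is not None and value == no_data_char:
            return True
        if obs_type is not None and not isinstance(value, obs_type):
            return True
        return False

    return series.map(is_invalid).astype(float)


s = pd.Series([np.nan, 1, 2.0, "string", 5.4, -999, None])
flags = flag_missing_or_wrong_type(s, no_data_char=-999, obs_type=float)
print(flags)
```

Under these assumptions the integer `1` is flagged because it is not a `float`, and `-999` is flagged as the no-data placeholder.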

TSCC.detection.basics.BASIC_byNeighbors(series, neighbors_df, max_diff=100)

Compare rainfall data from a given station with neighboring stations to detect outliers.

Parameters:

series (pandas series): A series of rainfall data from a single station.
neighbors_df (pandas.DataFrame): A DataFrame containing rainfall data from neighboring stations.
max_diff (float, optional): The maximum allowable difference between the station’s rainfall value and the average of neighboring stations. Default is 100.

Returns:

pandas series: A series where 0.5 indicates that the difference between the station and its neighbors exceeds the maximum allowable difference, and 0.0 indicates no issue.

Examples

>>> nr_obs = 6
>>> np.random.seed(0)
>>> # Generate random data
>>> data = np.random.randn(nr_obs, 4)
>>> time_index = pd.date_range(start='2023-01-01 00:00', periods=nr_obs, freq='30T')
>>> # Create DataFrame with specified column names
>>> df = pd.DataFrame(data, columns=["ground_truth", "raw", 'fea_1', 'fea_2'], index = time_index)
>>> df["raw"] = df["ground_truth"] + np.random.normal(0, 5, nr_obs)*np.random.randint(0, 2, nr_obs)
>>> df["isCorrect_gt"] = df["ground_truth"] == df["raw"]
>>> TSCC.detection.BASIC_byNeighbors(df["raw"], df[["fea_1", "fea_2"]], max_diff=1)
2023-01-01 00:00:00    0.5
2023-01-01 00:30:00    0.5
2023-01-01 01:00:00    0.0
2023-01-01 01:30:00    0.0
2023-01-01 02:00:00    0.5
2023-01-01 02:30:00    0.5
Freq: 30T, dtype: float64
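
The comparison against the neighbor average described above can be sketched in a few lines of pandas. The helper `flag_neighbor_difference` and the sample values are hypothetical; TSCC may weight or filter neighbors differently.

```python
import pandas as pd


def flag_neighbor_difference(series: pd.Series, neighbors_df: pd.DataFrame,
                             max_diff: float = 100.0) -> pd.Series:
    """Hypothetical helper: 0.5 where |station - mean(neighbors)| exceeds max_diff, else 0.0."""
    # Row-wise average over all neighboring stations.
    neighbor_mean = neighbors_df.mean(axis=1)
    abs_diff = (series - neighbor_mean).abs()
    return (abs_diff > max_diff).astype(float) * 0.5


idx = pd.date_range("2023-01-01 00:00", periods=4, freq="30min")
station = pd.Series([1.0, 2.0, 15.0, 0.5], index=idx)
neighbors = pd.DataFrame(
    {"n1": [1.2, 1.8, 2.0, 0.4], "n2": [0.8, 2.4, 3.0, 0.6]}, index=idx
)
flags = flag_neighbor_difference(station, neighbors, max_diff=5.0)
print(flags)
```

Only the third timestamp is flagged, where the station reports 15.0 against a neighbor average of 2.5.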

TSCC.detection.basics.BASIC_byPersistence(series, persistence_window=3)

Perform a persistence check to identify unrealistically constant values over consecutive time steps.

Parameters:

series (pandas series): A series of the target variable with a datetime index.
persistence_window (int, optional): The number of consecutive time steps with the same value required to trigger an error flag. Default is 3.

Returns:

pandas series: A series where 0.5 indicates that the value has been constant over the specified persistence window, and 0.0 indicates no issue.

Examples

>>> time_index = pd.date_range(start='2023-01-01 00:00', periods=5, freq='30T')
>>> s = pd.Series([0.05, 0.0, 3, 0.3, 0.25], index=time_index)
>>> TSCC.detection.BASIC_byPersistence(s)
2023-01-01 00:00:00    0.0
2023-01-01 00:30:00    0.0
2023-01-01 01:00:00    0.0
2023-01-01 01:30:00    0.0
2023-01-01 02:00:00    0.0
Freq: 30T, Name: persistence_error, dtype: float64
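
One way to detect runs of constant values is to label each run and measure its length. The helper `flag_persistence` below is a hypothetical sketch; in particular, flagging every member of a long run (rather than only the samples after the window is reached) is an assumption.

```python
import pandas as pd


def flag_persistence(series: pd.Series, persistence_window: int = 3) -> pd.Series:
    """Hypothetical helper: 0.5 for values inside a run of >= persistence_window equal values."""
    # A new run starts wherever the value differs from its predecessor.
    run_id = (series != series.shift()).cumsum()
    # Length of the run that each sample belongs to.
    run_length = run_id.map(run_id.value_counts())
    flags = (run_length >= persistence_window).astype(float) * 0.5
    return flags.rename("persistence_error")


idx = pd.date_range("2023-01-01 00:00", periods=6, freq="30min")
readings = pd.Series([0.2, 0.0, 0.0, 0.0, 0.3, 0.3], index=idx)
flags = flag_persistence(readings)
print(flags)
```

The three consecutive zeros form a run of length 3 and are flagged, while the two trailing 0.3 values are not.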

TSCC.detection.basics.BASIC_byRange(series, lower=None, upper=None)

Check if values are within plausible limits.

Parameters:

series (pandas series): The series of values to check.
lower (number, optional): Lower boundary of the series. The default is None.
upper (number, optional): Upper boundary of the series. The default is None.

Returns:

s (pandas series): cleaned series.

Examples

The first example returns False for every value out of boundary.

>>> test_boolean = STAT_byBoundary(df['col1'], upper=9, lower=1, setValueTo = "Boolean")
   True
   True
   True
   True
  False
   True
   True
   True
  False
   True
Name: col1, dtype: bool

The second example returns numpy.NaN for every value out of boundary, else the initial value.

>>> BASIC_isValid_byRange(df['col1'], upper=9, lower=1, setValueTo = "OutOfBoundary_toNaN")
  1.0
  9.0
  3.0
  6.0
  NaN
  8.0
  3.0
  1.0
  NaN
  5.0
Name: col1, dtype: float64

The third example deletes every value which is out of boundary.

>>> BASIC_isValid_byRange(df['col1'], upper=9, lower=1, setValueTo = "OutOfBoundary_Delete")
  1
  9
  3
  6
  8
  3
  1
  5
Name: col1, dtype: int64
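
The three behaviours shown above (boolean mask, out-of-range values set to NaN, out-of-range values dropped) can be reproduced with plain pandas. This sketch does not call TSCC, and the input values are assumptions because the original `df['col1']` is not shown; the two out-of-range entries (12 and 0) are invented for illustration.

```python
import pandas as pd

# Assumed sample data; positions 4 and 8 lie outside the range [1, 9].
values = pd.Series([1, 9, 3, 6, 12, 8, 3, 1, 0, 5], name="col1")

# Boolean mask: True inside the inclusive range, False outside.
in_range = values.between(1, 9)

# Replace out-of-range values with NaN (the dtype becomes float64).
as_nan = values.where(in_range)

# Drop out-of-range values entirely (the dtype stays int64).
dropped = values[in_range]

print(in_range, as_nan, dropped, sep="\n\n")
```

`Series.between` is inclusive on both ends by default, so the boundary values 1 and 9 count as valid here.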

TSCC.detection.basics.BASIC_byStepChangeMax(series, max_diff, timestep=Timedelta('0 days 00:30:00'))

Check if the step change between consecutive values is too high.

Parameters:

series (pandas series): A series of numeric values with a timestamp index.
max_diff (float): The maximum allowable difference between consecutive values.
timestep (timedelta, optional): The maximum time interval between consecutive values. Default is 30 minutes.

Returns:

pandas series: A series where 0.5 indicates that the step change exceeds the maximum difference, and 0.0 indicates no issue.

Examples

>>> time_index = pd.date_range(start='2023-01-01 00:00', periods=5, freq='30T')
>>> s = pd.Series([0.05, 0.0, 3, 0.3, 0.25], index=time_index)
>>> TSCC.detection.BASIC_byStepChangeMax(s, 2, pd.Timedelta(minutes = 30))
2023-01-01 00:00:00    0.0
2023-01-01 00:30:00    0.0
2023-01-01 01:00:00    0.5
2023-01-01 01:30:00    0.5
2023-01-01 02:00:00    0.0
Freq: 30T, dtype: float64
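
A step-change check can be sketched with `Series.diff`. The helper `flag_step_change` is hypothetical; in particular, only comparing samples that are at most `timestep` apart is an assumption about how the `timestep` parameter is meant to be used.

```python
import pandas as pd


def flag_step_change(series: pd.Series, max_diff: float,
                     timestep: pd.Timedelta = pd.Timedelta(minutes=30)) -> pd.Series:
    """Hypothetical helper: 0.5 where the jump from the previous value exceeds max_diff."""
    value_step = series.diff().abs()
    # Time elapsed since the previous sample (NaT for the first sample).
    time_step = series.index.to_series().diff()
    # Only compare samples that are at most `timestep` apart.
    too_large = (value_step > max_diff) & (time_step <= timestep)
    return too_large.astype(float) * 0.5


idx = pd.date_range("2023-01-01 00:00", periods=5, freq="30min")
s = pd.Series([0.05, 0.0, 3, 0.3, 0.25], index=idx)
flags = flag_step_change(s, max_diff=2)
print(flags)
```

For this input both the jump up to 3 and the drop back to 0.3 exceed `max_diff`, so the third and fourth samples are flagged.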