General exploration methods#
- TSCC.exploration.general.aggregate_sums(df, freq='Y', date_column=None)[source]#
Compute yearly, quarterly, or monthly sums for each column of the dataframe.
- Parameters:
- dfpandas dataframe
A dataframe with data, index must be datetime.
- freqstr
The frequency of aggregation. Options are [‘Y’ (yearly), ‘Q’ (quarterly), ‘M’ (monthly)].
- date_columnstr
Name of the column to be used as the date index. If None, the dataframe’s index is used.
- Returns:
- pandas dataframe
Aggregated sums dataframe.
Examples
>>> # Step 1: Generate a date range for three consecutive years
>>> date_range = pd.date_range(start='2021-01-01', end='2023-12-31', freq='D')
>>> # Step 2: Create random data for the three columns
>>> np.random.seed(42)  # For reproducibility
>>> data = np.random.randint(0, 18, size=(len(date_range), 3))
>>> # Step 3: Combine the date range and the data into a DataFrame
>>> df = pd.DataFrame(data, index=date_range, columns=['Column1', 'Column2', 'Column3'])
>>> TSCC.exploration.aggregate_sums(df, freq="Y")
            Column1  Column2  Column3
2021-12-31     2875     3092     2946
2022-12-31     3189     3106     3119
2023-12-31     2938     3115     3176
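For intuition, this kind of yearly aggregation can be reproduced with plain pandas. The sketch below is not TSCC’s implementation, just an equivalent group-by-year sum over a datetime index:

```python
import pandas as pd

# Small daily frame spanning two years, with constant values so sums are easy to check
idx = pd.date_range("2021-01-01", "2022-12-31", freq="D")
df = pd.DataFrame({"Column1": 1, "Column2": 2}, index=idx)

# Yearly sums: group by the year of the datetime index and sum each column
yearly = df.groupby(df.index.year).sum()
print(yearly)
# Each year has 365 days, so Column1 sums to 365 and Column2 to 730 per year
```

Quarterly or monthly variants would group by `df.index.quarter` or `df.index.month` (or use `df.resample`).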
- TSCC.exploration.general.getCorrelationAnalysis(df, target_col, other_cols=[], plot=True)[source]#
Perform correlation analysis between the target variable and other meteorological variables.
- Parameters:
- dfpandas dataframe
A dataframe containing the target variable and other meteorological variables.
- target_colstr
The target variable (e.g., ‘rainfall’) to correlate with other variables.
- other_colslist of str
A list of other meteorological variables to check correlation against.
- plotbool, optional
If True, plot the correlation matrix as a heatmap. The default is True.
- Returns:
- pandas dataframe
A dataframe containing correlation values between target and other variables.
Examples
>>> # Step 1: Generate a date range for three consecutive years
>>> date_range = pd.date_range(start='2021-01-01', end='2023-12-31', freq='D')
>>> # Step 2: Create random data for the three columns
>>> np.random.seed(42)  # For reproducibility
>>> data = np.random.randint(0, 18, size=(len(date_range), 3))
>>> # Step 3: Combine the date range and the data into a DataFrame
>>> df = pd.DataFrame(data, index=date_range, columns=['Column1', 'Column2', 'Column3'])
>>> TSCC.exploration.getCorrelationAnalysis(df, "Column1", ["Column2", "Column3"])
          Column1  Column2  Column3
Column1  1.000000  0.01820  0.038464
Column2  0.018200  1.00000  0.021740
Column3  0.038464  0.02174  1.000000
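Assuming Pearson correlation (the pandas default), the matrix above reduces to a plain `DataFrame.corr()` over the selected columns; the sketch below is illustrative, not TSCC’s internals:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
x = rng.normal(size=500)
df = pd.DataFrame({
    "rain": x,
    "temp": 2 * x + rng.normal(scale=0.1, size=500),  # strongly correlated with rain
    "wind": rng.normal(size=500),                     # independent of rain
})

# Pearson correlation of the target column against the others
corr = df[["rain", "temp", "wind"]].corr()
print(corr.loc["rain"])
```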
- TSCC.exploration.general.getDoubleMassAnalysis(df, target_var, reference_vars, plot=True)[source]#
Perform Double Mass Analysis to assess the consistency of a target variable (e.g., rainfall).
- Parameters:
- dfpandas dataframe
A dataframe containing the target variable, reference variables, and a datetime index or column.
- target_varstr
The target variable (e.g., ‘rainfall’) whose consistency is being checked.
- reference_varslist of str
A list of reference variables, typically from nearby stations, to compare with the target variable.
- plotbool, optional
If True, a plot of the Double Mass Curve will be generated. The default is True.
- Returns:
- pandas dataframe
A dataframe containing the cumulative sums of the target variable and the reference variables.
Examples
>>> # Step 1: Generate a date range for three consecutive years
>>> date_range = pd.date_range(start='2021-01-01', end='2023-12-31', freq='D')
>>> # Step 2: Create random data for the three columns
>>> np.random.seed(42)  # For reproducibility
>>> data = np.random.randint(0, 18, size=(len(date_range), 3))
>>> # Step 3: Combine the date range and the data into a DataFrame
>>> df = pd.DataFrame(data, index=date_range, columns=['Column1', 'Column2', 'Column3'])
>>> TSCC.exploration.getDoubleMassAnalysis(df, "Column1", ["Column2", "Column3"])
            cum_target  cum_reference
2021-01-01           6           12.0
2021-01-02          13           20.0
2021-01-03          23           25.0
2021-01-04          25           31.0
2021-01-05          30           31.5
...                ...            ...
2023-12-27        8958         9254.0
2023-12-28        8974         9255.5
2023-12-29        8990         9260.0
2023-12-30        8997         9270.0
2023-12-31        9002         9277.0
1095 rows × 2 columns
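Judging from the output above, the curve cumulates the target against the row-wise mean of the reference stations. A minimal pandas sketch of that idea (an assumption about the internals, not the actual TSCC code):

```python
import pandas as pd

idx = pd.date_range("2021-01-01", periods=5, freq="D")
df = pd.DataFrame({
    "target": [6, 7, 10, 2, 5],
    "ref_a": [10, 6, 4, 8, 1],
    "ref_b": [14, 10, 6, 4, 0],
}, index=idx)

# Cumulative target vs. cumulative mean of the reference stations
double_mass = pd.DataFrame({
    "cum_target": df["target"].cumsum(),
    "cum_reference": df[["ref_a", "ref_b"]].mean(axis=1).cumsum(),
})
print(double_mass)
```

A consistent record traces a roughly straight line when `cum_target` is plotted against `cum_reference`; a break in slope suggests an inhomogeneity in the target station.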
- TSCC.exploration.general.getGapDist(series)[source]#
Calculate and summarize the distribution of gaps between consecutive time index values in a time series.
- Parameters:
- seriespandas series
The time series with a datetime index for which the gap distribution is to be calculated.
- Returns:
- pandas series
A series representing the distribution of time gaps between consecutive index values, where the index is the gap (timedelta) and the values are the counts of how often each gap occurs.
Examples
>>> # Step 1: Generate a date range for three consecutive years
>>> date_range = pd.date_range(start='2021-01-01', end='2023-12-31', freq='D')
>>> # Step 2: Create random data for the three columns
>>> np.random.seed(42)  # For reproducibility
>>> data = np.random.randint(0, 18, size=(len(date_range), 3))
>>> # Step 3: Combine the date range and the data into a DataFrame
>>> df = pd.DataFrame(data, index=date_range, columns=['Column1', 'Column2', 'Column3'])
>>> TSCC.exploration.getGapDist(df["Column1"])
1 days    1094
Name: count, dtype: int64
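In effect, the gap distribution is a value count of the differences between consecutive index timestamps. A plain-pandas sketch (not the TSCC source) on an index with one missing day:

```python
import pandas as pd

# A daily index with one hole: 2021-01-04 is missing, creating a 2-day gap
idx = pd.to_datetime(["2021-01-01", "2021-01-02", "2021-01-03",
                      "2021-01-05", "2021-01-06"])
series = pd.Series(range(5), index=idx)

# Differences between consecutive timestamps, counted per gap size
gaps = series.index.to_series().diff().value_counts()
print(gaps)
```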
- TSCC.exploration.general.identify_errorClasses(series_raw, series_gt, uncertainty_threshold=0.1)[source]#
This function compares raw values (series_raw) with ground truth values (series_gt) and assigns an error class for each observation. It identifies specific error types, including ‘Not evaluable’ for missing values in the ground truth, ‘No error’ for matching values, and ‘Missing value’ where the raw value is missing but the ground truth is present. Additionally, the function integrates several error detection functions (e.g., identify_stuckAtZero, identify_outlier, identify_drift, identify_constantValue, identify_uncertainty) to flag other error patterns, such as constant values or outliers. The final output is a series of categorized error classes.
- Parameters:
- series_rawpandas Series
The raw data series.
- series_gtpandas Series
The ground truth data series.
- uncertainty_thresholdfloat, optional
Threshold passed to the uncertainty check (identify_uncertainty). The default is 0.1.
- Returns:
- error_classpandas Series
A categorical series indicating the type of error for each observation.
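Examples
No doctest is given above, but the baseline part of the classification can be sketched as follows. This is a simplified illustration of the rules described in the summary, not the real function: the actual implementation additionally applies the specialized detectors (identify_stuckAtZero, identify_outlier, identify_drift, identify_constantValue, identify_uncertainty), and the "Deviating value" label here is hypothetical:

```python
import numpy as np
import pandas as pd

def classify_basic(raw: pd.Series, gt: pd.Series) -> pd.Series:
    """Simplified sketch of the baseline error classes described above."""
    error = pd.Series("No error", index=raw.index)
    error[gt.isna()] = "Not evaluable"                 # ground truth missing
    error[raw.isna() & gt.notna()] = "Missing value"   # raw missing, truth present
    # Hypothetical catch-all for present-but-mismatching values:
    error[raw.notna() & gt.notna() & (raw != gt)] = "Deviating value"
    return error

raw = pd.Series([1.0, np.nan, 3.0, 4.0])
gt = pd.Series([1.0, 2.0, np.nan, 5.0])
print(classify_basic(raw, gt).tolist())
# ['No error', 'Missing value', 'Not evaluable', 'Deviating value']
```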
- TSCC.exploration.general.isCredible_ByNeighbors_yearly(series, df_neighbors, threshold=0.1)[source]#
Assess the credibility of a sensor node through comparison with neighbouring sensor nodes, using the mean as key figure. Possible future extension: take the distance between nodes into account as a parameter.
- Parameters:
- seriespandas series
The time series of the node to be assessed. The index should be datetime-based.
- df_neighborspandas dataframe
A dataframe containing time series data from neighboring sensor nodes. The index should match the series index.
- thresholdfloat, optional
The acceptable deviation threshold between the node’s value and the mean of its neighbors. The default is 0.1 (10%).
- Returns:
- pandas series
A yearly series of boolean values indicating whether the observations of the node are within the credible range based on its neighbors. True means the node is credible, and False indicates a significant deviation.
Examples
>>> # Step 1: Generate a date range for three consecutive years
>>> date_range = pd.date_range(start='2021-01-01', end='2023-12-31', freq='D')
>>> # Step 2: Create random data for the three columns
>>> np.random.seed(42)  # For reproducibility
>>> data = np.random.randint(0, 18, size=(len(date_range), 3))
>>> # Step 3: Combine the date range and the data into a DataFrame
>>> df = pd.DataFrame(data, index=date_range, columns=['Column1', 'Column2', 'Column3'])
>>> TSCC.exploration.isCredible_ByNeighbors_yearly(series=df["Column1"], df_neighbors=df[["Column2", "Column3"]])
2021    True
2022    True
2023    True
dtype: bool
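The check the description implies can be sketched in plain pandas, assuming "credible" means the node's yearly mean stays within the relative threshold of the neighbours' yearly mean (an assumption about the internals, not the actual TSCC code):

```python
import pandas as pd

idx = pd.date_range("2021-01-01", "2022-12-31", freq="D")
node = pd.Series(10.0, index=idx)
neighbors = pd.DataFrame({"n1": 10.5, "n2": 9.5}, index=idx)

threshold = 0.1  # accept up to 10% relative deviation

# Yearly mean of the node, and yearly mean of the neighbours' row-wise mean
node_yearly = node.groupby(node.index.year).mean()
neigh_yearly = neighbors.mean(axis=1).groupby(neighbors.index.year).mean()

# Credible when the relative deviation from the neighbour mean is within threshold
credible = (node_yearly - neigh_yearly).abs() / neigh_yearly <= threshold
print(credible)
```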