General exploration methods#
- TSCC.exploration.general.aggregate_sums(df, freq='Y', date_column=None)[source]#
Compute yearly, quarterly, or monthly sums for each column of the dataframe.
- Parameters:
- dfpandas dataframe
A dataframe with data, index must be datetime.
- freqstr
The frequency of aggregation. Options are [‘Y’ (yearly), ‘Q’ (quarterly), ‘M’ (monthly)].
- date_columnstr
Name of the column to be used as the date index. If None, the dataframe’s index is used.
- Returns:
- pandas dataframe
Aggregated sums dataframe.
Examples
>>> # Step 1: Generate a date range for three consecutive years
>>> date_range = pd.date_range(start='2021-01-01', end='2023-12-31', freq='D')
>>> # Step 2: Create random data for the three columns
>>> np.random.seed(42)  # For reproducibility
>>> data = np.random.randint(0, 18, size=(len(date_range), 3))
>>> # Step 3: Combine the date range and the data into a DataFrame
>>> df = pd.DataFrame(data, index=date_range, columns=['Column1', 'Column2', 'Column3'])
>>> TSCC.exploration.aggregate_sums(df, freq="Y")
            Column1  Column2  Column3
2021-12-31     2875     3092     2946
2022-12-31     3189     3106     3119
2023-12-31     2938     3115     3176
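For intuition, this kind of yearly aggregation can be reproduced with plain pandas. The sketch below is not TSCC’s implementation, just an equivalent group-by-year sum over a datetime index:

```python
import pandas as pd

# Small daily frame spanning two years, with constant values so sums are easy to check
idx = pd.date_range("2021-01-01", "2022-12-31", freq="D")
df = pd.DataFrame({"Column1": 1, "Column2": 2}, index=idx)

# Yearly sums: group by the year of the datetime index and sum each column
yearly = df.groupby(df.index.year).sum()
print(yearly)
# Each year has 365 days, so Column1 sums to 365 and Column2 to 730 per year
```

Quarterly or monthly variants would group by `df.index.quarter` or `df.index.month` (or use `df.resample`).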
- TSCC.exploration.general.getCorrelationAnalysis(df, target_col, other_cols=[], plot=True)[source]#
Perform correlation analysis between the target variable and other meteorological variables.
- Parameters:
- dfpandas dataframe
A dataframe containing the target variable and other meteorological variables.
- target_colstr
The target variable (e.g., ‘rainfall’) to correlate with other variables.
- other_colslist of str
A list of other meteorological variables to check correlation against.
- plotbool, optional
If True, plot the correlation matrix as a heatmap. The default is True.
- Returns:
- pandas dataframe
A dataframe containing correlation values between target and other variables.
Examples
>>> # Step 1: Generate a date range for three consecutive years
>>> date_range = pd.date_range(start='2021-01-01', end='2023-12-31', freq='D')
>>> # Step 2: Create random data for the three columns
>>> np.random.seed(42)  # For reproducibility
>>> data = np.random.randint(0, 18, size=(len(date_range), 3))
>>> # Step 3: Combine the date range and the data into a DataFrame
>>> df = pd.DataFrame(data, index=date_range, columns=['Column1', 'Column2', 'Column3'])
>>> TSCC.exploration.getCorrelationAnalysis(df, "Column1", ["Column2", "Column3"])
          Column1  Column2  Column3
Column1  1.000000  0.01820  0.038464
Column2  0.018200  1.00000  0.021740
Column3  0.038464  0.02174  1.000000
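Assuming Pearson correlation (the pandas default), the matrix above reduces to a plain `DataFrame.corr()` over the selected columns; the sketch below is illustrative, not TSCC’s internals:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
x = rng.normal(size=500)
df = pd.DataFrame({
    "rain": x,
    "temp": 2 * x + rng.normal(scale=0.1, size=500),  # strongly correlated with rain
    "wind": rng.normal(size=500),                     # independent of rain
})

# Pearson correlation of the target column against the others
corr = df[["rain", "temp", "wind"]].corr()
print(corr.loc["rain"])
```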
- TSCC.exploration.general.getDoubleMassAnalysis(df, target_var, reference_vars, plot=True)[source]#
Perform Double Mass Analysis to assess the consistency of a target variable (e.g., rainfall).
- Parameters:
- dfpandas dataframe
A dataframe containing the target variable, reference variables, and a datetime index or column.
- target_varstr
The target variable (e.g., ‘rainfall’) whose consistency is being checked.
- reference_varslist of str
A list of reference variables, typically from nearby stations, to compare with the target variable.
- plotbool, optional
If True, a plot of the Double Mass Curve will be generated. The default is True.
- Returns:
- pandas dataframe
A dataframe containing the cumulative sums of the target variable and the reference variables.
Examples
>>> # Step 1: Generate a date range for three consecutive years
>>> date_range = pd.date_range(start='2021-01-01', end='2023-12-31', freq='D')
>>> # Step 2: Create random data for the three columns
>>> np.random.seed(42)  # For reproducibility
>>> data = np.random.randint(0, 18, size=(len(date_range), 3))
>>> # Step 3: Combine the date range and the data into a DataFrame
>>> df = pd.DataFrame(data, index=date_range, columns=['Column1', 'Column2', 'Column3'])
>>> TSCC.exploration.getDoubleMassAnalysis(df, "Column1", ["Column2", "Column3"])
            cum_target  cum_reference
2021-01-01           6           12.0
2021-01-02          13           20.0
2021-01-03          23           25.0
2021-01-04          25           31.0
2021-01-05          30           31.5
...                ...            ...
2023-12-27        8958         9254.0
2023-12-28        8974         9255.5
2023-12-29        8990         9260.0
2023-12-30        8997         9270.0
2023-12-31        9002         9277.0
1095 rows × 2 columns
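Judging from the output above, the curve cumulates the target against the row-wise mean of the reference stations. A minimal pandas sketch of that idea (an assumption about the internals, not the actual TSCC code):

```python
import pandas as pd

idx = pd.date_range("2021-01-01", periods=5, freq="D")
df = pd.DataFrame({
    "target": [6, 7, 10, 2, 5],
    "ref_a": [10, 6, 4, 8, 1],
    "ref_b": [14, 10, 6, 4, 0],
}, index=idx)

# Cumulative target vs. cumulative mean of the reference stations
double_mass = pd.DataFrame({
    "cum_target": df["target"].cumsum(),
    "cum_reference": df[["ref_a", "ref_b"]].mean(axis=1).cumsum(),
})
print(double_mass)
```

A consistent record traces a roughly straight line when `cum_target` is plotted against `cum_reference`; a break in slope suggests an inhomogeneity in the target station.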
- TSCC.exploration.general.getGapDist(series)[source]#
Calculate and summarize the distribution of gaps between consecutive time index values in a time series.
- Parameters:
- seriespandas series
The time series with a datetime index for which the gap distribution is to be calculated.
- Returns:
- pandas series
A series representing the distribution of time gaps between consecutive index values, where the index is the gap (timedelta) and the values are the counts of how often each gap occurs.
Examples
>>> # Step 1: Generate a date range for three consecutive years
>>> date_range = pd.date_range(start='2021-01-01', end='2023-12-31', freq='D')
>>> # Step 2: Create random data for the three columns
>>> np.random.seed(42)  # For reproducibility
>>> data = np.random.randint(0, 18, size=(len(date_range), 3))
>>> # Step 3: Combine the date range and the data into a DataFrame
>>> df = pd.DataFrame(data, index=date_range, columns=['Column1', 'Column2', 'Column3'])
>>> TSCC.exploration.getGapDist(df["Column1"])
1 days    1094
Name: count, dtype: int64
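In effect, the gap distribution is a value count of the differences between consecutive index timestamps. A plain-pandas sketch (not the TSCC source) on an index with one missing day:

```python
import pandas as pd

# A daily index with one hole: 2021-01-04 is missing, creating a 2-day gap
idx = pd.to_datetime(["2021-01-01", "2021-01-02", "2021-01-03",
                      "2021-01-05", "2021-01-06"])
series = pd.Series(range(5), index=idx)

# Differences between consecutive timestamps, counted per gap size
gaps = series.index.to_series().diff().value_counts()
print(gaps)
```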
- TSCC.exploration.general.identify_errorClasses(series_raw, series_gt, uncertainty_threshold=0.1)[source]#
This function compares raw values (series_raw) with ground truth values (series_gt) and assigns an error class for each observation. It identifies specific error types, including ‘Not evaluable’ for missing values in the ground truth, ‘No error’ for matching values, and ‘Missing value’ where the raw value is missing but the ground truth is present. Additionally, the function integrates several error detection functions (e.g., identify_stuckAtZero, identify_outlier, identify_drift, identify_constantValue, identify_uncertainty) to flag other error patterns, such as constant values or outliers. The final output is a series of categorized error classes.
- Parameters:
- series_rawpandas Series
The raw data series.
- series_gtpandas Series
The ground truth data series.
- uncertainty_thresholdfloat, optional
Threshold passed to the uncertainty check (identify_uncertainty). The default is 0.1.
- Returns:
- error_classpandas Series
A categorical series indicating the type of error for each observation.
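Examples
No doctest is given above, but the baseline part of the classification can be sketched as follows. This is a simplified illustration of the rules described in the summary, not the real function: the actual implementation additionally applies the specialized detectors (identify_stuckAtZero, identify_outlier, identify_drift, identify_constantValue, identify_uncertainty), and the "Deviating value" label here is hypothetical:

```python
import numpy as np
import pandas as pd

def classify_basic(raw: pd.Series, gt: pd.Series) -> pd.Series:
    """Simplified sketch of the baseline error classes described above."""
    error = pd.Series("No error", index=raw.index)
    error[gt.isna()] = "Not evaluable"                 # ground truth missing
    error[raw.isna() & gt.notna()] = "Missing value"   # raw missing, truth present
    # Hypothetical catch-all for present-but-mismatching values:
    error[raw.notna() & gt.notna() & (raw != gt)] = "Deviating value"
    return error

raw = pd.Series([1.0, np.nan, 3.0, 4.0])
gt = pd.Series([1.0, 2.0, np.nan, 5.0])
print(classify_basic(raw, gt).tolist())
# ['No error', 'Missing value', 'Not evaluable', 'Deviating value']
```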
- TSCC.exploration.general.isCredible_ByNeighbors_yearly(series, df_neighbors, threshold=0.1)[source]#
Assess the credibility of a sensor node through comparison with neighbouring sensor nodes, using the mean as key figure. Possible future extension: take the distance between nodes into account as a parameter.
- Parameters:
- seriespandas series
The time series of the node to be assessed. The index should be datetime-based.
- df_neighborspandas dataframe
A dataframe containing time series data from neighboring sensor nodes. The index should match the series index.
- thresholdfloat, optional
The acceptable deviation threshold between the node’s value and the mean of its neighbors. The default is 0.1 (10%).
- Returns:
- pandas series
A yearly series of boolean values indicating whether the observations of the node are within the credible range based on its neighbors. True means the node is credible, and False indicates a significant deviation.
Examples
>>> # Step 1: Generate a date range for three consecutive years
>>> date_range = pd.date_range(start='2021-01-01', end='2023-12-31', freq='D')
>>> # Step 2: Create random data for the three columns
>>> np.random.seed(42)  # For reproducibility
>>> data = np.random.randint(0, 18, size=(len(date_range), 3))
>>> # Step 3: Combine the date range and the data into a DataFrame
>>> df = pd.DataFrame(data, index=date_range, columns=['Column1', 'Column2', 'Column3'])
>>> TSCC.exploration.isCredible_ByNeighbors_yearly(series=df["Column1"], df_neighbors=df[["Column2", "Column3"]])
2021    True
2022    True
2023    True
dtype: bool
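The check the description implies can be sketched in plain pandas, assuming "credible" means the node's yearly mean stays within the relative threshold of the neighbours' yearly mean (an assumption about the internals, not the actual TSCC code):

```python
import pandas as pd

idx = pd.date_range("2021-01-01", "2022-12-31", freq="D")
node = pd.Series(10.0, index=idx)
neighbors = pd.DataFrame({"n1": 10.5, "n2": 9.5}, index=idx)

threshold = 0.1  # accept up to 10% relative deviation

# Yearly mean of the node, and yearly mean of the neighbours' row-wise mean
node_yearly = node.groupby(node.index.year).mean()
neigh_yearly = neighbors.mean(axis=1).groupby(neighbors.index.year).mean()

# Credible when the relative deviation from the neighbour mean is within threshold
credible = (node_yearly - neigh_yearly).abs() / neigh_yearly <= threshold
print(credible)
```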