============================== Detection Module (:mod:`marEx.detect`) ============================== .. currentmodule:: marEx.detect The :mod:`marEx.detect` module provides functions for data preprocessing, detrending, anomaly detection, and extreme event identification. This module forms the core of the marEx preprocessing pipeline, transforming raw oceanographic data into standardised anomalies and binary extreme event masks. Overview ======== The detection module implements a comprehensive workflow for converting raw oceanographic time series data into standardised anomalies and binary extreme event masks. It supports both structured (regular lat/lon grids) and unstructured (irregular mesh) data formats with advanced statistical methods for robust extreme event detection. **Key Features:** * **Dual Anomaly Methods**: Detrended baseline vs. shifting baseline approaches * **Flexible Extreme Detection**: Global percentile vs. day-of-year specific thresholds * **Dask Integration**: Memory-efficient processing of large datasets with parallel computation * **Grid Agnostic**: Works seamlessly with both structured and unstructured grids * **Statistical Rigor**: Advanced statistical methods for robust anomaly calculation Main Functions ============== .. autosummary:: :toctree: ../generated/ preprocess_data compute_normalised_anomaly rolling_climatology smoothed_rolling_climatology identify_extremes Detailed Documentation ====================== Main Preprocessing Function --------------------------- .. autofunction:: preprocess_data Core Processing Functions ------------------------- .. autofunction:: compute_normalised_anomaly .. autofunction:: rolling_climatology .. autofunction:: smoothed_rolling_climatology .. autofunction:: identify_extremes Basic Usage Examples ==================== Simple Preprocessing -------------------- .. code-block:: python import xarray as xr import marEx # Load sea surface temperature data sst = xr.open_dataset('sst_daily.nc', chunks={'time': 365}).sst # Basic preprocessing with default parameters extremes_ds = marEx.preprocess_data( sst, threshold_percentile=95 ) # Result contains anomalies and extreme events print(extremes_ds) Advanced Preprocessing ---------------------- .. code-block:: python # Advanced preprocessing with custom parameters extremes_ds = marEx.preprocess_data( sst, threshold_percentile=90, method_anomaly='shifting_baseline', # Use rolling climatology method_extreme='hobday_extreme', # Day-of-year specific thresholds window_year_baseline=20, # 20-year rolling baseline smooth_days_baseline=31, # 31-day smoothing window_days_hobday=11, # 11-day window for thresholds window_spatial_hobday=5, # 5-cell spatial clustering dask_chunks={'time': 25} ) Unstructured Grid Processing ---------------------------- .. code-block:: python # For unstructured grids, specify dimensions dimensions = {'time': 'time', 'x': 'ncells'} coordinates = {'time': 'time', 'x': 'lon', 'y': 'lat'} extremes_ds = marEx.preprocess_data( sst_unstructured, threshold_percentile=95, dimensions=dimensions, coordinates=coordinates ) Output Data Structure ===================== The preprocessing function returns an xarray Dataset with the following structure: .. code-block:: python # extremes_ds Dataset structure: xarray.Dataset Dimensions: (lat, lon, time, dayofyear) Coordinates: lat (lat) lon (lon) time (time) dayofyear (dayofyear) # Only for hobday_extreme method Data variables: dat_anomaly (time, lat, lon) float64 # Anomaly field mask (lat, lon) bool # Land-sea mask extreme_events (time, lat, lon) bool # Binary extreme events thresholds (dayofyear, lat, lon) float64 # Thresholds **Key Variables:** * **dat_anomaly**: Anomaly field (either detrended or from rolling climatology) * **mask**: Deduced land-sea mask (True = ocean, False = land) * **extreme_events**: Binary field locating extreme events (True = extreme) * **thresholds**: Thresholds Method Comparison ================= Anomaly Detection Methods ------------------------- **Harmonic Detrending** (``method_anomaly='detrend_harmonic'``): .. code-block:: python # Fast polynomial detrending with harmonic components extremes_ds = marEx.preprocess_data( sst, method_anomaly='detrend_harmonic', threshold_percentile=95, # Additional parameters: std_normalise=False, # Optional STD normalisation detrend_orders=[1], # Linear detrending (default) force_zero_mean=True # Enforce zero mean ) **Characteristics:** * Faster computation using polynomial fitting * Uses harmonic components (annual, semi-annual cycles) * May introduce biases in variability statistics * Best for: Quick analysis **Fixed Baseline** (``method_anomaly='fixed_baseline'``): .. code-block:: python # Daily climatology (calculated using the full time series) without detrending extremes_ds = marEx.preprocess_data( sst, method_anomaly='fixed_baseline', threshold_percentile=95, # Additional parameters: smooth_days_baseline=11 # Smoothing window for climatology ) # Or where the climatology is calculated for a specific reference period (e.g., 1990-2020) extremes_ds = marEx.preprocess_data( sst, method_anomaly='fixed_baseline', threshold_percentile=95, reference_period=(1990, 2020) # Climatology from 1990-2020 only ) **Characteristics:** * Anomaly relative to the daily climatology using full time series (or a specified ``reference_period``) * Preserves long-term / climate trends * Simple interpretation and fast computation * Best for: Baseline comparison studies, trend-inclusive analysis, public outreach **Detrend Fixed Baseline** (``method_anomaly='detrend_fixed_baseline'``): .. code-block:: python # Polynomial detrending followed by fixed daily climatology extremes_ds = marEx.preprocess_data( sst, method_anomaly='detrend_fixed_baseline', threshold_percentile=95, # Additional parameters: detrend_orders=[1], # Linear detrending (default) smooth_days_baseline=11, # Smoothing window for climatology force_zero_mean=True, # Enforce zero mean reference_period=(1990, 2020) # Optional: restrict climatology to a specific reference period ) **Characteristics:** * Polynomial detrending followed by removing the fixed daily climatology * Preserves the full time-series of data, but does not account for trends in the timing of seasonal transitions * Removes long-term trends * Best for: Climate variability studies with trend removal **Shifting Baseline** (``method_anomaly='shifting_baseline'``): .. code-block:: python # Rolling climatology from previous years extremes_ds = marEx.preprocess_data( sst, method_anomaly='shifting_baseline', threshold_percentile=95, window_year_baseline=15, # 15-year rolling baseline smooth_days_baseline=21 # 21-day smoothing window ) **Characteristics:** * More accurate climatology using rolling window * Shortens time series by baseline window length * Computationally intensive but scientifically rigorous * Best for: Research applications, intricate & accurate analysis Extreme Event Detection Methods ------------------------------- **Global Extreme** (``method_extreme='global_extreme'``): .. code-block:: python # Single threshold across all time points extremes_ds = marEx.preprocess_data( sst, method_extreme='global_extreme', threshold_percentile=95, # Optional STD normalisation std_normalise=True ) **Characteristics:** * Uses percentiles from entire time series * Single threshold value for all time points * Simple interpretation and fast computation * Best for: Exploratory analysis **Hobday Extreme** (``method_extreme='hobday_extreme'``): .. code-block:: python # Day-of-year specific thresholds extremes_ds = marEx.preprocess_data( sst, method_extreme='hobday_extreme', threshold_percentile=95, window_days_hobday=11, # 11-day window window_spatial_hobday=None # No spatial clustering (default) ) **Characteristics:** * Day-of-year specific percentile thresholds * Accounts for seasonal variations * Follows Hobday et al. (2016) methodology Spatial Window Enhancement (``window_spatial_hobday``) ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ **New in v3.0+**: The Hobday extreme method supports optional spatial pooling window for more robust, spatially coherent thresholds. **Algorithm Details**: For each grid cell ``(i, j)`` and each day-of-year ``d``: 1. **Temporal Sampling**: Collect anomalies from all years within ±``window_days_hobday`` days around day ``d`` * Traditional: Samples from single cell ``(i, j)`` * With spatial window: Samples from all cells in neighbourhood 2. **Spatial Pooling** (if ``window_spatial_hobday`` specified): * Define spatial window centered at ``(i, j)`` with radius ``r = (window_spatial_hobday - 1) / 2`` * Pool samples from cells ``(i-r:i+r+1, j-r:j+r+1)`` * Edge handling: Smaller windows near boundaries 3. **Percentile Calculation**: * Use histogram approximation (``method_percentile='approximate'``) * Precision controlled by ``precision`` parameter (default 0.01°C) * Calculate threshold at ``threshold_percentile`` (e.g., 95th) **Sample Size Comparison**:: Configuration Sample Count ────────────────────────────────── ───────────── Traditional (no spatial window) N_years × window_days_hobday Example: 30 years × 11 days = 330 samples With 5×5 spatial window N_years × window_days_hobday × 25 cells Example: 30 years × 11 days × 25 = 8,250 samples With 9×9 spatial window N_years × window_days_hobday × 81 cells Example: 30 years × 11 days × 81 = 26,730 samples **Example: Enabling Spatial Window**: .. code-block:: python # Very long time-series (50+ years): no spatial pooling needed extremes_highres = marEx.preprocess_data( sst_0.1deg, method_extreme='hobday_extreme', window_days_hobday=11, window_spatial_hobday=None ) # Short time-series: use spatial pooling to increase robustness of threshold calculation extremes_coarse = marEx.preprocess_data( sst_2deg, method_extreme='hobday_extreme', window_days_hobday=11, window_spatial_hobday=5 # Increase samples in anomaly distribution using a 5×5 window ) # High threshold percentile (>95%): use spatial pooling to robustly sample distribution tails extremes_99th = marEx.preprocess_data( sst, method_extreme='hobday_extreme', threshold_percentile=99, # Extreme percentiles require more samples window_days_hobday=11, window_spatial_hobday=7 # Larger window to sample tails (7×7 = 49 cells) ) **Performance Implications**: * **Memory**: Approximate method remains memory-efficient regardless of window size * **Computation**: Larger windows → more samples → slightly slower but still fast * **Typical**: 5×5 window adds ~10-15% to computation time vs. no spatial window **When to Use Spatial Windowing**: Spatial windowing increases the sample size for percentile calculation, improving statistical robustness. Note that this is not a spatial smoothing of the data itself, but rather a pooling of samples from neighbouring grid cells to better estimate the percentile thresholds. This is motivated from a spatial decorrelation length-scale argument, in the same way Hobday has argued for decorrelation time-scale for the 11-day time window. This is critical in several scenarios: **1. Short Time Series** (insufficient samples): When time series length is limited, sample size for each day-of-year may be inadequate for robust percentile estimation: .. code-block:: text Time Series Length Samples (no spatial) Extreme samples Recommendation ────────────────── ─────────────────── ─────────────── ────────────── 10 years 110 samples 5 samples ✓ Use spatial 20 years 220 samples 11 samples ✓ Use spatial 30 years 330 samples 16 samples ○ Optional 50+ years 550+ samples 27+ samples ✗ Not needed *Guideline*: Use spatial windowing if time series < 30 years for robust 95th percentile estimation. **2. Extreme Percentiles** (>95%, sampling distribution tails): Higher percentiles require more samples to characterise the tail of the distribution accurately. Without sufficient samples, extreme percentile thresholds become unreliable (Smith et al. 2025, https://doi.org/10.1016/j.pocean.2024.103404). .. code-block:: text Percentile Min Samples Needed 30-years (no spatial) 30-year with 5×5 Recommendation ────────── ────────────────── ───────────────────── ──────────────── ────────────── 90th 100-200 330 ✓ 8,250 ✓ Either works 95th 200-400 330 ○ 8,250 ✓ Spatial helps 97.5th 400-800 330 ✗ 8,250 ✓ ✓ Use spatial 99th 1000+ 330 ✗ 8,250 ✓ ✓ Use spatial 99.9th 10000+ 330 ✗ 8,250 ✗ ✓ Need 9×9+ *Guideline*: For percentiles >95th, spatial windowing is strongly recommended. For >97.5th, it is essential. Parameter Reference =================== Core Parameters --------------- **threshold_percentile** : float, default=95 Percentile threshold for extreme event identification (e.g., 95 for 95th percentile) **method_anomaly** : {'detrend_harmonic', 'fixed_baseline', 'detrend_fixed_baseline', 'shifting_baseline'}, default='shifting_baseline' Method for anomaly computation **method_extreme** : {'global_extreme', 'hobday_extreme'}, default='hobday_extreme' Method for extreme identification **dask_chunks** : dict, default={'time': 25} Chunk sizes for Dask arrays for memory management Anomaly Method Parameters ------------------------- **Harmonic Detrending Parameters:** **std_normalise** : bool, default=False Whether to normalise anomalies using 30-day rolling standard deviation **detrend_orders** : list of int, default=[1] Polynomial orders for detrending (e.g., [1, 2] for linear + quadratic) **force_zero_mean** : bool, default=True Whether to explicitly enforce zero mean in final anomalies **Fixed Baseline Parameters:** **smooth_days_baseline** : int, default=11 Number of days for smoothing the daily climatology **reference_period** : tuple of (int, int), optional Year range ``(start_year, end_year)`` inclusive for computing the daily climatology. If ``None`` (default), uses all available years. Anomalies are computed for the full time series regardless. Example: ``reference_period=(1990, 2020)`` **Detrend Fixed Baseline Parameters:** **detrend_orders** : list of int, default=[1] Polynomial orders for detrending (e.g., [1, 2] for linear + quadratic) **smooth_days_baseline** : int, default=11 Number of days for smoothing the daily climatology after detrending **force_zero_mean** : bool, default=True Whether to explicitly enforce zero mean in final anomalies **reference_period** : tuple of (int, int), optional Year range ``(start_year, end_year)`` inclusive for computing the daily climatology (only affects the climatology step; detrending still uses all data). If ``None`` (default), uses all available years. **Shifting Baseline Parameters:** **window_year_baseline** : int, default=15 Number of years for rolling climatology baseline **smooth_days_baseline** : int, default=21 Number of days for smoothing the rolling climatology baseline Extreme Detection Parameters ---------------------------- **Hobday Extreme Parameters:** **window_days_hobday** : int, default=11 Window size for day-of-year threshold calculation **window_spatial_hobday** : int, optional Spatial window size for clustering (None = no spatial clustering) **method_percentile** : {'exact', 'approximate'}, default='approximate' Method for percentile calculation **precision** : float, default=0.01 Precision for histogram bins in approximate percentile calculation **max_anomaly** : float, default=5.0 Maximum anomaly value for approximate percentile calculation Grid Configuration ------------------ **dimensions** : dict Dimension mapping for different grid types: * Structured: ``{'time': 'time', 'x': 'lon', 'y': 'lat'}`` * Unstructured: ``{'time': 'time', 'x': 'ncells'}`` **coordinates** : dict Coordinate mapping for different grid types: * Structured: ``{'time': 'time', 'x': 'lon', 'y': 'lat'}`` * Unstructured: ``{'time': 'time', 'x': 'lon', 'y': 'lat'}`` **neighbours** : xarray.DataArray, optional Neighbour connectivity array for spatial clustering (unstructured grids) **cell_areas** : xarray.DataArray, optional Cell areas for weighted spatial statistics (unstructured grids) Advanced Usage Examples ======================= Method Combinations ------------------- .. code-block:: python # Most rigorous combination (computationally intensive) extremes_ds = marEx.preprocess_data( sst, method_anomaly='shifting_baseline', method_extreme='hobday_extreme', threshold_percentile=90, window_year_baseline=20, smooth_days_baseline=31, window_days_hobday=11 ) # Fastest combination (less rigorous) extremes_ds = marEx.preprocess_data( sst, method_anomaly='detrend_harmonic', method_extreme='global_extreme', threshold_percentile=95, std_normalise=False ) Performance Optimisations ------------------------- .. code-block:: python # Optimised chunking for large datasets extremes_ds = marEx.preprocess_data( sst, threshold_percentile=95, dask_chunks={'time': 25, 'lat': 200, 'lon': 200}, # Use approximate percentiles for speed method_percentile='approximate', precision=0.05, # Coarser precision for speed max_anomaly=10.0 ) Multi-Variable Processing ------------------------- .. code-block:: python # Process multiple variables variables = ['sst', 'sss', 'chlorophyll'] extreme_datasets = {} for var_name in variables: data = xr.open_dataset(f'{var_name}_daily.nc')[var_name] extreme_datasets[var_name] = marEx.preprocess_data( data, threshold_percentile=95, method_anomaly='shifting_baseline', method_extreme='hobday_extreme' ) Integration with Tracking ========================= Complete Workflow ----------------- .. code-block:: python import xarray as xr import marEx # Step 1: Load data sst = xr.open_dataset('sst_daily.nc', chunks={'time': 365}).sst # Step 2: Preprocess extremes extremes_ds = marEx.preprocess_data( sst, method_anomaly='shifting_baseline', method_extreme='hobday_extreme', threshold_percentile=95, window_year_baseline=15, smooth_days_baseline=21, window_days_hobday=11, dask_chunks={'time': 25} ) # Step 3: Track events event_tracker = marEx.tracker( extremes_ds.extreme_events, extremes_ds.mask, R_fill=8, area_filter_quartile=0.5 ) tracked_events = event_tracker.run() # Step 4: Visualise results config = marEx.PlotConfig( title='Marine Heatwave Events', plot_IDs=True ) fig, ax, im = tracked_events.ID_field.isel(time=0).plotX.single_plot(config) Quality Control =============== Data Validation --------------- .. code-block:: python # Check preprocessing results extremes_ds = marEx.preprocess_data(sst, threshold_percentile=95) # Validate anomaly statistics anomaly_stats = extremes_ds.dat_anomaly.std() print(f"Anomaly standard deviation: {anomaly_stats.values:.3f}") # Check extreme event frequency event_frequency = extremes_ds.extreme_events.mean() * 100 print(f"Extreme event frequency: {event_frequency.values:.1f}%") # Validate mask coverage ocean_fraction = extremes_ds.mask.mean() * 100 print(f"Ocean coverage: {ocean_fraction.values:.1f}%") Threshold Validation -------------------- .. code-block:: python # For hobday_extreme method, examine thresholds if 'thresholds' in extremes_ds: # Check seasonal threshold variation seasonal_range = (extremes_ds.thresholds.max(dim='dayofyear') - extremes_ds.thresholds.min(dim='dayofyear')) print(f"Seasonal threshold range: {seasonal_range.mean().values:.3f}") # Plot threshold climatology threshold_clim = extremes_ds.thresholds.mean(dim=['lat', 'lon']) import matplotlib.pyplot as plt plt.plot(threshold_clim.dayofyear, threshold_clim.values) plt.xlabel('Day of Year') plt.ylabel('Threshold Value') plt.title('Seasonal Threshold Climatology') Error Handling ============== Common Issues and Solutions --------------------------- **Memory Errors**: .. code-block:: python # Solution: Optimise chunking and use approximate methods extremes_ds = marEx.preprocess_data( sst, threshold_percentile=95, dask_chunks={'time': 15}, # Smaller chunks method_percentile='approximate', precision=0.1 # Coarser precision ) **Performance Issues**: .. code-block:: python # Solution: Use faster methods for exploration extremes_ds = marEx.preprocess_data( sst, threshold_percentile=95, method_anomaly='detrend_harmonic', method_extreme='global_extreme', std_normalise=False ) **Threshold Calculation Issues**: .. code-block:: python # Solution: Adjust window sizes and use spatial clustering extremes_ds = marEx.preprocess_data( sst, method_extreme='hobday_extreme', window_days_hobday=21, # Larger window window_spatial_hobday=5, # Add spatial clustering precision=0.01 # Higher precision ) **Coordinate System Issues**: .. code-block:: python # Solution: Specify custom dimensions and coordinates extremes_ds = marEx.preprocess_data( sst, threshold_percentile=95, dimensions={'time': 'time', 'x': 'longitude', 'y': 'latitude'}, coordinates={'time': 'time', 'x': 'longitude', 'y': 'latitude'} ) Performance Benchmarks ====================== Method Performance Comparison ----------------------------- .. code-block:: python # Relative performance for global 0.25° daily data: # Fastest: detrend_harmonic + global_extreme # - Processing time: ~0.5 wall-minutes per decade (2 CPU-hours) # - Memory usage: ~1 GB # - Accuracy: Good for first analysis # Balanced: detrend_fixed_baseline + hobday_extreme # - Processing time: ~5 wall-minutes per decade (21 CPU-hours) # - Memory usage: ~8 GB # - Accuracy: Better climatology # Most rigorous: shifting_baseline + hobday_extreme # - Processing time: ~8 wall-minutes per decade (34 CPU-hours) # - Memory usage: ~12 GB # - Accuracy: Best for research applications Scaling Characteristics ----------------------- .. code-block:: python # Scaling with dask (64 cores, 25-day chunks): # - Linear scaling up to ~100 cores # - Memory usage: ~2-4 GB per core # - I/O becomes bottleneck beyond 200 cores # - Optimal chunk size depends on data resolution Best Practices ============== Method Selection Guidelines --------------------------- 1. **Quick Exploratory Analysis**: Use ``detrend_harmonic`` + ``global_extreme`` 2. **Climate Change Research Studies**: Use ``shifting_baseline`` + ``hobday_extreme`` 3. **Limited Timeseries**: Use ``detrend_fixed_baseline`` + ``hobday_extreme`` Chunking Guidelines ------------------- .. code-block:: python # Optimal chunking strategies: # For global 0.25° data: optimal_chunks = {'time': 25, 'lat': 200, 'lon': 200} # For regional high-resolution data: optimal_chunks = {'time': 50, 'lat': 100, 'lon': 100} # For unstructured grids: optimal_chunks = {'time': 25, 'ncells': 50000} See Also ======== * :mod:`marEx.track` - Event tracking module * :mod:`marEx.plotX` - Visualisation module * :mod:`marEx.helper` - HPC utilities