==============================
Detection Module (:mod:`marEx.detect`)
==============================

.. currentmodule:: marEx.detect

The :mod:`marEx.detect` module provides functions for data preprocessing, detrending,
anomaly detection, and extreme event identification. This module forms the core of the
marEx preprocessing pipeline, transforming raw oceanographic data into standardised
anomalies and binary extreme event masks.

Overview
========

The detection module implements a comprehensive workflow for converting raw oceanographic
time series data into standardised anomalies and binary extreme event masks. It supports
both structured (regular lat/lon grids) and unstructured (irregular mesh) data formats
with advanced statistical methods for robust extreme event detection.

**Key Features:**

* **Dual Anomaly Methods**: Detrended baseline vs. shifting baseline approaches
* **Flexible Extreme Detection**: Global percentile vs. day-of-year specific thresholds
* **Dask Integration**: Memory-efficient processing of large datasets with parallel computation
* **Grid Agnostic**: Works seamlessly with both structured and unstructured grids
* **Statistical Rigor**: Advanced statistical methods for robust anomaly calculation

Main Functions
==============

.. autosummary::
   :toctree: ../generated/

   preprocess_data
   compute_normalised_anomaly
   rolling_climatology
   smoothed_rolling_climatology
   identify_extremes

Detailed Documentation
======================

Main Preprocessing Function
---------------------------

.. autofunction:: preprocess_data

Core Processing Functions
-------------------------

.. autofunction:: compute_normalised_anomaly

.. autofunction:: rolling_climatology

.. autofunction:: smoothed_rolling_climatology

.. autofunction:: identify_extremes

Basic Usage Examples
====================

Simple Preprocessing
--------------------

.. code-block:: python

   import xarray as xr
   import marEx

   # Load sea surface temperature data
   sst = xr.open_dataset('sst_daily.nc', chunks={'time': 365}).sst

   # Basic preprocessing with default parameters
   extremes_ds = marEx.preprocess_data(
       sst,
       threshold_percentile=95
   )

   # Result contains anomalies and extreme events
   print(extremes_ds)

Advanced Preprocessing
----------------------

.. code-block:: python

   # Advanced preprocessing with custom parameters
   extremes_ds = marEx.preprocess_data(
       sst,
       threshold_percentile=90,
       method_anomaly='shifting_baseline',      # Use rolling climatology
       method_extreme='hobday_extreme',         # Day-of-year specific thresholds
       window_year_baseline=20,                 # 20-year rolling baseline
       smooth_days_baseline=31,                 # 31-day smoothing
       window_days_hobday=11,                   # 11-day window for thresholds
       window_spatial_hobday=5,                 # 5-cell spatial clustering
       dask_chunks={'time': 25}
   )

Unstructured Grid Processing
----------------------------

.. code-block:: python

   # For unstructured grids, specify dimensions
   dimensions = {'time': 'time', 'x': 'ncells'}
   coordinates = {'time': 'time', 'x': 'lon', 'y': 'lat'}

   extremes_ds = marEx.preprocess_data(
       sst_unstructured,
       threshold_percentile=95,
       dimensions=dimensions,
       coordinates=coordinates
   )

Output Data Structure
=====================

The preprocessing function returns an xarray Dataset with the following structure:

.. code-block:: python

   # extremes_ds Dataset structure:
   xarray.Dataset
   Dimensions:     (lat, lon, time, dayofyear)
   Coordinates:
       lat         (lat)
       lon         (lon)
       time        (time)
       dayofyear   (dayofyear)    # Only for hobday_extreme method
   Data variables:
       dat_anomaly     (time, lat, lon)        float64     # Anomaly field
       mask            (lat, lon)              bool        # Land-sea mask
       extreme_events  (time, lat, lon)        bool        # Binary extreme events
       thresholds      (dayofyear, lat, lon)   float64     # Thresholds

**Key Variables:**

* **dat_anomaly**: Anomaly field (either detrended or from rolling climatology)
* **mask**: Deduced land-sea mask (True = ocean, False = land)
* **extreme_events**: Binary field locating extreme events (True = extreme)
* **thresholds**: Thresholds

Method Comparison
=================

Anomaly Detection Methods
-------------------------

**Harmonic Detrending** (``method_anomaly='detrend_harmonic'``):

.. code-block:: python

   # Fast polynomial detrending with harmonic components
   extremes_ds = marEx.preprocess_data(
       sst,
       method_anomaly='detrend_harmonic',
       threshold_percentile=95,
       # Additional parameters:
       std_normalise=False,          # Optional STD normalisation
       detrend_orders=[1],           # Linear detrending (default)
       force_zero_mean=True          # Enforce zero mean
   )

**Characteristics:**
  * Faster computation using polynomial fitting
  * Uses harmonic components (annual, semi-annual cycles)
  * May introduce biases in variability statistics
  * Best for: Quick analysis

**Fixed Baseline** (``method_anomaly='fixed_baseline'``):

.. code-block:: python

   # Daily climatology (calculated using the full time series) without detrending
   extremes_ds = marEx.preprocess_data(
       sst,
       method_anomaly='fixed_baseline',
       threshold_percentile=95,
       # Additional parameters:
       smooth_days_baseline=11       # Smoothing window for climatology
   )

   # Or where the climatology is calculated for a specific reference period (e.g., 1990-2020)
   extremes_ds = marEx.preprocess_data(
       sst,
       method_anomaly='fixed_baseline',
       threshold_percentile=95,
       reference_period=(1990, 2020)  # Climatology from 1990-2020 only
   )

**Characteristics:**
  * Anomaly relative to the daily climatology using full time series (or a specified ``reference_period``)
  * Preserves long-term / climate trends
  * Simple interpretation and fast computation
  * Best for: Baseline comparison studies, trend-inclusive analysis, public outreach

**Detrend Fixed Baseline** (``method_anomaly='detrend_fixed_baseline'``):

.. code-block:: python

   # Polynomial detrending followed by fixed daily climatology
   extremes_ds = marEx.preprocess_data(
       sst,
       method_anomaly='detrend_fixed_baseline',
       threshold_percentile=95,
       # Additional parameters:
       detrend_orders=[1],           # Linear detrending (default)
       smooth_days_baseline=11,      # Smoothing window for climatology
       force_zero_mean=True,         # Enforce zero mean
       reference_period=(1990, 2020) # Optional: restrict climatology to a specific reference period
   )

**Characteristics:**
  * Polynomial detrending followed by removing the fixed daily climatology
  * Preserves the full time-series of data, but does not account for trends in the timing of seasonal transitions
  * Removes long-term trends
  * Best for: Climate variability studies with trend removal

**Shifting Baseline** (``method_anomaly='shifting_baseline'``):

.. code-block:: python

   # Rolling climatology from previous years
   extremes_ds = marEx.preprocess_data(
       sst,
       method_anomaly='shifting_baseline',
       threshold_percentile=95,
       window_year_baseline=15,      # 15-year rolling baseline
       smooth_days_baseline=21       # 21-day smoothing window
   )

**Characteristics:**
  * More accurate climatology using rolling window
  * Shortens time series by baseline window length
  * Computationally intensive but scientifically rigorous
  * Best for: Research applications, intricate & accurate analysis

Extreme Event Detection Methods
-------------------------------

**Global Extreme** (``method_extreme='global_extreme'``):

.. code-block:: python

   # Single threshold across all time points
   extremes_ds = marEx.preprocess_data(
       sst,
       method_extreme='global_extreme',
       threshold_percentile=95,
       # Optional STD normalisation
       std_normalise=True
   )

**Characteristics:**
  * Uses percentiles from entire time series
  * Single threshold value for all time points
  * Simple interpretation and fast computation
  * Best for: Exploratory analysis

**Hobday Extreme** (``method_extreme='hobday_extreme'``):

.. code-block:: python

   # Day-of-year specific thresholds
   extremes_ds = marEx.preprocess_data(
       sst,
       method_extreme='hobday_extreme',
       threshold_percentile=95,
       window_days_hobday=11,        # 11-day window
       window_spatial_hobday=None    # No spatial clustering (default)
   )

**Characteristics:**
  * Day-of-year specific percentile thresholds
  * Accounts for seasonal variations
  * Follows Hobday et al. (2016) methodology

Spatial Window Enhancement (``window_spatial_hobday``)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

**New in v3.0+**: The Hobday extreme method supports optional spatial pooling window for more robust, spatially coherent thresholds.

**Algorithm Details**:

For each grid cell ``(i, j)`` and each day-of-year ``d``:

1. **Temporal Sampling**: Collect anomalies from all years within ±``window_days_hobday`` days around day ``d``

   * Traditional: Samples from single cell ``(i, j)``
   * With spatial window: Samples from all cells in neighbourhood

2. **Spatial Pooling** (if ``window_spatial_hobday`` specified):

   * Define spatial window centered at ``(i, j)`` with radius ``r = (window_spatial_hobday - 1) / 2``
   * Pool samples from cells ``(i-r:i+r+1, j-r:j+r+1)``
   * Edge handling: Smaller windows near boundaries

3. **Percentile Calculation**:

   * Use histogram approximation (``method_percentile='approximate'``)
   * Precision controlled by ``precision`` parameter (default 0.01°C)
   * Calculate threshold at ``threshold_percentile`` (e.g., 95th)

**Sample Size Comparison**::

   Configuration                         Sample Count
   ──────────────────────────────────   ─────────────
   Traditional (no spatial window)       N_years × window_days_hobday
   Example: 30 years × 11 days          = 330 samples

   With 5×5 spatial window               N_years × window_days_hobday × 25 cells
   Example: 30 years × 11 days × 25     = 8,250 samples

   With 9×9 spatial window               N_years × window_days_hobday × 81 cells
   Example: 30 years × 11 days × 81     = 26,730 samples

**Example: Enabling Spatial Window**:

.. code-block:: python

   # Very long time-series (50+ years): no spatial pooling needed
   extremes_highres = marEx.preprocess_data(
       sst_0.1deg,
       method_extreme='hobday_extreme',
       window_days_hobday=11,
       window_spatial_hobday=None
   )

   # Short time-series: use spatial pooling to increase robustness of threshold calculation
   extremes_coarse = marEx.preprocess_data(
       sst_2deg,
       method_extreme='hobday_extreme',
       window_days_hobday=11,
       window_spatial_hobday=5         # Increase samples in anomaly distribution using a 5×5 window
   )

   # High threshold percentile (>95%): use spatial pooling to robustly sample distribution tails
   extremes_99th = marEx.preprocess_data(
       sst,
       method_extreme='hobday_extreme',
       threshold_percentile=99,        # Extreme percentiles require more samples
       window_days_hobday=11,
       window_spatial_hobday=7         # Larger window to sample tails (7×7 = 49 cells)
   )


**Performance Implications**:

* **Memory**: Approximate method remains memory-efficient regardless of window size
* **Computation**: Larger windows → more samples → slightly slower but still fast
* **Typical**: 5×5 window adds ~10-15% to computation time vs. no spatial window

**When to Use Spatial Windowing**:

Spatial windowing increases the sample size for percentile calculation, improving statistical robustness.
Note that this is not a spatial smoothing of the data itself, but rather a pooling of samples from
neighbouring grid cells to better estimate the percentile thresholds. This is motivated from a spatial decorrelation
length-scale argument, in the same way Hobday has argued for decorrelation time-scale for the 11-day time window.
This is critical in several scenarios:

**1. Short Time Series** (insufficient samples):

When time series length is limited, sample size for each day-of-year may be inadequate for robust percentile estimation:

.. code-block:: text

   Time Series Length    Samples (no spatial)    Extreme samples    Recommendation
   ──────────────────    ───────────────────     ───────────────    ──────────────
   10 years              110 samples             5 samples          ✓ Use spatial
   20 years              220 samples             11 samples         ✓ Use spatial
   30 years              330 samples             16 samples         ○ Optional
   50+ years             550+ samples            27+ samples        ✗ Not needed

*Guideline*: Use spatial windowing if time series < 30 years for robust 95th percentile estimation.

**2. Extreme Percentiles** (>95%, sampling distribution tails):

Higher percentiles require more samples to characterise the tail of the distribution accurately.
Without sufficient samples, extreme percentile thresholds become unreliable (Smith et al. 2025,
https://doi.org/10.1016/j.pocean.2024.103404).

.. code-block:: text

   Percentile    Min Samples Needed    30-years (no spatial)    30-year with 5×5    Recommendation
   ──────────    ──────────────────    ─────────────────────    ────────────────    ──────────────
   90th          100-200               330 ✓                    8,250 ✓             Either works
   95th          200-400               330 ○                    8,250 ✓             Spatial helps
   97.5th        400-800               330 ✗                    8,250 ✓             ✓ Use spatial
   99th          1000+                 330 ✗                    8,250 ✓             ✓ Use spatial
   99.9th        10000+                330 ✗                    8,250 ✗             ✓ Need 9×9+

*Guideline*: For percentiles >95th, spatial windowing is strongly recommended. For >97.5th, it is essential.

Parameter Reference
===================

Core Parameters
---------------

**threshold_percentile** : float, default=95
  Percentile threshold for extreme event identification (e.g., 95 for 95th percentile)

**method_anomaly** : {'detrend_harmonic', 'fixed_baseline', 'detrend_fixed_baseline', 'shifting_baseline'}, default='shifting_baseline'
  Method for anomaly computation

**method_extreme** : {'global_extreme', 'hobday_extreme'}, default='hobday_extreme'
  Method for extreme identification

**dask_chunks** : dict, default={'time': 25}
  Chunk sizes for Dask arrays for memory management

Anomaly Method Parameters
-------------------------

**Harmonic Detrending Parameters:**

**std_normalise** : bool, default=False
  Whether to normalise anomalies using 30-day rolling standard deviation

**detrend_orders** : list of int, default=[1]
  Polynomial orders for detrending (e.g., [1, 2] for linear + quadratic)

**force_zero_mean** : bool, default=True
  Whether to explicitly enforce zero mean in final anomalies

**Fixed Baseline Parameters:**

**smooth_days_baseline** : int, default=11
  Number of days for smoothing the daily climatology

**reference_period** : tuple of (int, int), optional
  Year range ``(start_year, end_year)`` inclusive for computing the daily climatology.
  If ``None`` (default), uses all available years. Anomalies are computed for the full
  time series regardless. Example: ``reference_period=(1990, 2020)``

**Detrend Fixed Baseline Parameters:**

**detrend_orders** : list of int, default=[1]
  Polynomial orders for detrending (e.g., [1, 2] for linear + quadratic)

**smooth_days_baseline** : int, default=11
  Number of days for smoothing the daily climatology after detrending

**force_zero_mean** : bool, default=True
  Whether to explicitly enforce zero mean in final anomalies

**reference_period** : tuple of (int, int), optional
  Year range ``(start_year, end_year)`` inclusive for computing the daily climatology
  (only affects the climatology step; detrending still uses all data).
  If ``None`` (default), uses all available years.

**Shifting Baseline Parameters:**

**window_year_baseline** : int, default=15
  Number of years for rolling climatology baseline

**smooth_days_baseline** : int, default=21
  Number of days for smoothing the rolling climatology baseline

Extreme Detection Parameters
----------------------------

**Hobday Extreme Parameters:**

**window_days_hobday** : int, default=11
  Window size for day-of-year threshold calculation

**window_spatial_hobday** : int, optional
  Spatial window size for clustering (None = no spatial clustering)

**method_percentile** : {'exact', 'approximate'}, default='approximate'
  Method for percentile calculation

**precision** : float, default=0.01
  Precision for histogram bins in approximate percentile calculation

**max_anomaly** : float, default=5.0
  Maximum anomaly value for approximate percentile calculation

Grid Configuration
------------------

**dimensions** : dict
  Dimension mapping for different grid types:

  * Structured: ``{'time': 'time', 'x': 'lon', 'y': 'lat'}``
  * Unstructured: ``{'time': 'time', 'x': 'ncells'}``

**coordinates** : dict
  Coordinate mapping for different grid types:

  * Structured: ``{'time': 'time', 'x': 'lon', 'y': 'lat'}``
  * Unstructured: ``{'time': 'time', 'x': 'lon', 'y': 'lat'}``

**neighbours** : xarray.DataArray, optional
  Neighbour connectivity array for spatial clustering (unstructured grids)

**cell_areas** : xarray.DataArray, optional
  Cell areas for weighted spatial statistics (unstructured grids)

Advanced Usage Examples
=======================

Method Combinations
-------------------

.. code-block:: python

   # Most rigorous combination (computationally intensive)
   extremes_ds = marEx.preprocess_data(
       sst,
       method_anomaly='shifting_baseline',
       method_extreme='hobday_extreme',
       threshold_percentile=90,
       window_year_baseline=20,
       smooth_days_baseline=31,
       window_days_hobday=11
   )

   # Fastest combination (less rigorous)
   extremes_ds = marEx.preprocess_data(
       sst,
       method_anomaly='detrend_harmonic',
       method_extreme='global_extreme',
       threshold_percentile=95,
       std_normalise=False
   )

Performance Optimisations
-------------------------

.. code-block:: python

   # Optimised chunking for large datasets
   extremes_ds = marEx.preprocess_data(
       sst,
       threshold_percentile=95,
       dask_chunks={'time': 25, 'lat': 200, 'lon': 200},
       # Use approximate percentiles for speed
       method_percentile='approximate',
       precision=0.05,  # Coarser precision for speed
       max_anomaly=10.0
   )

Multi-Variable Processing
-------------------------

.. code-block:: python

   # Process multiple variables
   variables = ['sst', 'sss', 'chlorophyll']
   extreme_datasets = {}

   for var_name in variables:
       data = xr.open_dataset(f'{var_name}_daily.nc')[var_name]

       extreme_datasets[var_name] = marEx.preprocess_data(
           data,
           threshold_percentile=95,
           method_anomaly='shifting_baseline',
           method_extreme='hobday_extreme'
       )

Integration with Tracking
=========================

Complete Workflow
-----------------

.. code-block:: python

   import xarray as xr
   import marEx

   # Step 1: Load data
   sst = xr.open_dataset('sst_daily.nc', chunks={'time': 365}).sst

   # Step 2: Preprocess extremes
   extremes_ds = marEx.preprocess_data(
       sst,
       method_anomaly='shifting_baseline',
       method_extreme='hobday_extreme',
       threshold_percentile=95,
       window_year_baseline=15,
       smooth_days_baseline=21,
       window_days_hobday=11,
       dask_chunks={'time': 25}
   )

   # Step 3: Track events
   event_tracker = marEx.tracker(
       extremes_ds.extreme_events,
       extremes_ds.mask,
       R_fill=8,
       area_filter_quartile=0.5
   )

   tracked_events = event_tracker.run()

   # Step 4: Visualise results
   config = marEx.PlotConfig(
       title='Marine Heatwave Events',
       plot_IDs=True
   )

   fig, ax, im = tracked_events.ID_field.isel(time=0).plotX.single_plot(config)

Quality Control
===============

Data Validation
---------------

.. code-block:: python

   # Check preprocessing results
   extremes_ds = marEx.preprocess_data(sst, threshold_percentile=95)

   # Validate anomaly statistics
   anomaly_stats = extremes_ds.dat_anomaly.std()
   print(f"Anomaly standard deviation: {anomaly_stats.values:.3f}")

   # Check extreme event frequency
   event_frequency = extremes_ds.extreme_events.mean() * 100
   print(f"Extreme event frequency: {event_frequency.values:.1f}%")

   # Validate mask coverage
   ocean_fraction = extremes_ds.mask.mean() * 100
   print(f"Ocean coverage: {ocean_fraction.values:.1f}%")

Threshold Validation
--------------------

.. code-block:: python

   # For hobday_extreme method, examine thresholds
   if 'thresholds' in extremes_ds:
       # Check seasonal threshold variation
       seasonal_range = (extremes_ds.thresholds.max(dim='dayofyear') -
                        extremes_ds.thresholds.min(dim='dayofyear'))
       print(f"Seasonal threshold range: {seasonal_range.mean().values:.3f}")

       # Plot threshold climatology
       threshold_clim = extremes_ds.thresholds.mean(dim=['lat', 'lon'])
       import matplotlib.pyplot as plt
       plt.plot(threshold_clim.dayofyear, threshold_clim.values)
       plt.xlabel('Day of Year')
       plt.ylabel('Threshold Value')
       plt.title('Seasonal Threshold Climatology')

Error Handling
==============

Common Issues and Solutions
---------------------------

**Memory Errors**:

.. code-block:: python

   # Solution: Optimise chunking and use approximate methods
   extremes_ds = marEx.preprocess_data(
       sst,
       threshold_percentile=95,
       dask_chunks={'time': 15},     # Smaller chunks
       method_percentile='approximate',
       precision=0.1                 # Coarser precision
   )

**Performance Issues**:

.. code-block:: python

   # Solution: Use faster methods for exploration
   extremes_ds = marEx.preprocess_data(
       sst,
       threshold_percentile=95,
       method_anomaly='detrend_harmonic',
       method_extreme='global_extreme',
       std_normalise=False
   )

**Threshold Calculation Issues**:

.. code-block:: python

   # Solution: Adjust window sizes and use spatial clustering
   extremes_ds = marEx.preprocess_data(
       sst,
       method_extreme='hobday_extreme',
       window_days_hobday=21,        # Larger window
       window_spatial_hobday=5,      # Add spatial clustering
       precision=0.01                # Higher precision
   )

**Coordinate System Issues**:

.. code-block:: python

   # Solution: Specify custom dimensions and coordinates
   extremes_ds = marEx.preprocess_data(
       sst,
       threshold_percentile=95,
       dimensions={'time': 'time', 'x': 'longitude', 'y': 'latitude'},
       coordinates={'time': 'time', 'x': 'longitude', 'y': 'latitude'}
   )

Performance Benchmarks
======================

Method Performance Comparison
-----------------------------

.. code-block:: python

   # Relative performance for global 0.25° daily data:

   # Fastest: detrend_harmonic + global_extreme
   # - Processing time: ~0.5 wall-minutes per decade (2 CPU-hours)
   # - Memory usage: ~1 GB
   # - Accuracy: Good for first analysis

   # Balanced: detrend_fixed_baseline + hobday_extreme
   # - Processing time: ~5 wall-minutes per decade (21 CPU-hours)
   # - Memory usage: ~8 GB
   # - Accuracy: Better climatology

   # Most rigorous: shifting_baseline + hobday_extreme
   # - Processing time: ~8 wall-minutes per decade (34 CPU-hours)
   # - Memory usage: ~12 GB
   # - Accuracy: Best for research applications

Scaling Characteristics
-----------------------

.. code-block:: python

   # Scaling with dask (64 cores, 25-day chunks):
   # - Linear scaling up to ~100 cores
   # - Memory usage: ~2-4 GB per core
   # - I/O becomes bottleneck beyond 200 cores
   # - Optimal chunk size depends on data resolution

Best Practices
==============

Method Selection Guidelines
---------------------------

1. **Quick Exploratory Analysis**: Use ``detrend_harmonic`` + ``global_extreme``
2. **Climate Change Research Studies**: Use ``shifting_baseline`` + ``hobday_extreme``
3. **Limited Timeseries**: Use ``detrend_fixed_baseline`` + ``hobday_extreme``

Chunking Guidelines
-------------------

.. code-block:: python

   # Optimal chunking strategies:

   # For global 0.25° data:
   optimal_chunks = {'time': 25, 'lat': 200, 'lon': 200}

   # For regional high-resolution data:
   optimal_chunks = {'time': 50, 'lat': 100, 'lon': 100}

   # For unstructured grids:
   optimal_chunks = {'time': 25, 'ncells': 50000}

See Also
========

* :mod:`marEx.track` - Event tracking module
* :mod:`marEx.plotX` - Visualisation module
* :mod:`marEx.helper` - HPC utilities