Detection Module (marEx.detect)

The marEx.detect module provides functions for data preprocessing, detrending, anomaly detection, and extreme event identification. This module forms the core of the marEx preprocessing pipeline, transforming raw oceanographic data into standardised anomalies and binary extreme event masks.

Overview

The detection module implements a comprehensive workflow for converting raw oceanographic time series data into standardised anomalies and binary extreme event masks. It supports both structured (regular lat/lon grids) and unstructured (irregular mesh) data formats with advanced statistical methods for robust extreme event detection.

Key Features:

  • Dual Anomaly Methods: Detrended baseline vs. shifting baseline approaches

  • Flexible Extreme Detection: Global percentile vs. day-of-year specific thresholds

  • Dask Integration: Memory-efficient processing of large datasets with parallel computation

  • Grid Agnostic: Works seamlessly with both structured and unstructured grids

  • Statistical Rigor: Advanced statistical methods for robust anomaly calculation

Main Functions

preprocess_data(da[, method_anomaly, ...])

Complete preprocessing pipeline for marine extreme event identification.

compute_normalised_anomaly(da[, ...])

Generate normalised anomalies using specified methodology.

rolling_climatology(da[, ...])

Compute rolling climatology efficiently using flox cohorts.

smoothed_rolling_climatology(da[, ...])

Compute a smoothed rolling climatology using the previous window_year_baseline years of data and reassemble it to match the original data structure.

identify_extremes(da[, method_extreme, ...])

Identify extreme events exceeding a percentile threshold using specified method.

Detailed Documentation

Main Preprocessing Function

marEx.detect.preprocess_data(da, method_anomaly='shifting_baseline', method_extreme='hobday_extreme', threshold_percentile=95, window_year_baseline=15, smooth_days_baseline=21, window_days_hobday=11, window_spatial_hobday=None, std_normalise=False, detrend_orders=None, force_zero_mean=True, reference_period=None, method_percentile='approximate', precision=0.01, max_anomaly=5.0, dask_chunks=None, dimensions=None, coordinates=None, neighbours=None, cell_areas=None, use_temp_checkpoints=False, verbose=None, quiet=None)[source]

Complete preprocessing pipeline for marine extreme event identification.

Supports separate methods for anomaly computation and extreme identification:

Anomaly Methods:

  • ‘detrend_harmonic’: Detrending with harmonics and polynomials – more efficient, but biases statistics

  • ‘shifting_baseline’: Rolling climatology using previous window_year_baseline years – more “correct”, but shortens time series by window_year_baseline years

  • ‘fixed_baseline’: Daily climatology using full time series – does _not_ remove climate trends !

  • ‘detrend_fixed_baseline’: Polynomial detrending followed by fixed daily climatology – keeps full time-series of data, but does not account for trends in the timing of seasonal transitions (which may appear as extremes)

Extreme Methods:

  • ‘global_extreme’: Global-in-time threshold value

  • ‘hobday_extreme’: Local day-of-year specific thresholds with windowing

Parameters:
  • da (xarray.DataArray) – Raw input data

  • method_anomaly (str, default='shifting_baseline') – Anomaly computation method (‘detrend_harmonic’, ‘shifting_baseline’, ‘fixed_baseline’, or ‘detrend_fixed_baseline’).

  • method_extreme (str, default='hobday_extreme') – Extreme identification method (‘global_extreme’ or ‘hobday_extreme’).

  • threshold_percentile (float, default=95) – Percentile threshold for extreme event detection.

  • window_year_baseline (int, default=15) – Number of previous years for rolling climatology (shifting_baseline method only).

  • smooth_days_baseline (int, default=21) – Days for smoothing rolling climatology (shifting_baseline method only).

  • window_days_hobday (int, default=11) – Window size for day-of-year threshold calculation (hobday_extreme method only).

  • window_spatial_hobday (int, default=None) – Spatial window size (2D centred window) for the day-of-year threshold calculation (hobday_extreme method only).

  • std_normalise (bool, default=False) – Whether to standardise anomalies by rolling standard deviation (detrend_harmonic only).

  • detrend_orders (list, default=[1]) – Polynomial orders for detrending (detrend_harmonic method only). Default is 1st order (linear) detrend. [1,2] e.g. would use a linear+quadratic detrending.

  • force_zero_mean (bool, default=True) – Whether to enforce zero mean in detrended anomalies (detrend_harmonic method only).

  • reference_period (tuple of (int, int), optional) – Year range (start_year, end_year) inclusive for computing the daily climatology (fixed_baseline and detrend_fixed_baseline only). If None (default), uses all available years. Anomalies are computed for the full time series regardless. Example: reference_period=(1990, 2020) computes the climatology from 1990-2020 but outputs anomalies for the entire input time range.

  • method_percentile (str, default='approximate') – Method for percentile calculation (‘exact’ or ‘approximate’) for both global_extreme & hobday_extreme methods. N.B.: Using the exact percentile calculation requires both careful/thoughtful chunking & sufficient memory, in consideration of the limitations inherent to distributed parallel I/O & processing.

  • precision (float, default=0.01) – Precision for histogram bins in approximate percentile method.

  • max_anomaly (float, default=5.0) – Maximum anomaly value for histogram binning in the approximate percentile method.

  • dask_chunks (dict, optional) – Chunking specification for distributed computation.

  • dimensions (dict, default={"time": "time", "x": "lon", "y": "lat"}) – Mapping of dimensions to names in the data.

  • coordinates (dict, optional) – Mapping of coordinates to names in the data. Defaults to dimensions mapping.

  • neighbours (xarray.DataArray, optional) – Neighbour connectivity for spatial clustering.

  • cell_areas (xarray.DataArray, optional) – Cell areas for weighted spatial statistics.

  • use_temp_checkpoints (bool, default=False) – Enable checkpointing to temporary zarr stores to break Dask graph dependencies. When True, intermediate results (anomalies, thresholds, extremes) are saved to temporary zarr files and immediately reloaded, preventing expensive recomputations. Recommended for large datasets on HPC systems where the 2D histogram computation is expensive. Temporary files are automatically cleaned up after reloading.

  • verbose (bool, default=None) – Enable verbose logging with detailed progress information. If None, uses current global logging configuration.

  • quiet (bool, default=None) – Enable quiet logging with minimal output (warnings and errors only). If None, uses current global logging configuration. Note: quiet takes precedence over verbose if both are True.

Returns:

Processed dataset with anomalies and extreme event identification

Return type:

xarray.Dataset

Examples

Basic usage with gridded SST data for marine heatwave detection:

>>> import xarray as xr
>>> import marEx
>>>
>>> # Load and chunk SST data
>>> sst = xr.open_dataset('sst_data.nc', chunks={}).sst.chunk({'time': 30})
>>>
>>> # Basic preprocessing with default shifting baseline method
>>> result = marEx.preprocess_data(sst, threshold_percentile=90)
>>> print(result)
<xarray.Dataset>
Dimensions:         (time: 1461, lat: 180, lon: 360)
Data variables:
    dat_anomaly     (time, lat, lon) float32 dask.array<chunksize=(30, 180, 360)>
    mask            (lat, lon) bool dask.array<chunksize=(180, 360)>
    extreme_events  (time, lat, lon) bool dask.array<chunksize=(30, 180, 360)>
    thresholds      (lat, lon) float32 dask.array<chunksize=(180, 360)>
>>> # Check which locations have extreme events
>>> print(f"Total extreme events: {result.extreme_events.sum().compute()}")
Total extreme events: 15847

Using shifting baseline method for more accurate climatology:

>>> # Requires at least 15 years of data by default
>>> result_shifting = marEx.preprocess_data(
...     sst,
...     method_anomaly="shifting_baseline",
...     window_year_baseline=10,  # Use shorter window if needed
...     smooth_days_baseline=31   # Longer smoothing window
... )
>>> # Note: First 10 years will be removed from output

Using Hobday extreme method with day-of-year specific thresholds:

>>> result_hobday = marEx.preprocess_data(
...     sst,
...     method_extreme="hobday_extreme",
...     window_days_hobday=11,  # 11-day window for each day-of-year
...     threshold_percentile=95
... )
>>> print(result_hobday.thresholds.dims)
('dayofyear', 'lat', 'lon')

Previous configuration (marEx v2.0 default) with polynomial detrending and standardisation:

>>> result_advanced = marEx.preprocess_data(
...     sst,
...     method_anomaly="detrend_harmonic",
...     detrend_orders=[1, 2],  # Linear and quadratic trends
...     std_normalise=True,     # Standardise by rolling std
...     force_zero_mean=True,
...     threshold_percentile=95
... )
>>> # Result includes both raw and standardised anomalies
>>> print('dat_stn' in result_advanced)
True

Processing unstructured data:

>>> # For ICON ocean model data
>>> icon_sst = xr.open_dataset('icon_sst.nc', chunks={}).to.chunk({'time': 50})
>>> result_unstructured = marEx.preprocess_data(
...     icon_sst,
...     dimensions={"x": "ncells"},   # Must specify the name of the spatial dimension
...     dask_chunks={"time": 50}
... )

Error handling - insufficient data for shifting baseline:

>>> short_data = sst.isel(time=slice(0, 1000))  # Only ~3 years
>>> try:
...     result = marEx.preprocess_data(
...         short_data,
...         method_anomaly="shifting_baseline",
...         window_year_baseline=15
...     )
... except ValueError as e:
...     print(f"Error: {e}")
Error: Insufficient data for shifting_baseline method. Dataset spans 3 years but window_year_baseline
requires at least 15 years.

Performance considerations with chunking:

>>> # For large datasets, adjust chunking for memory management
>>> large_sst = sst.chunk({"time": 25, "lat": 90, "lon": 180})
>>> result = marEx.preprocess_data(
...     large_sst,
...     dask_chunks={"time": 25},
...     method_percentile="approximate"  # Use approximate method (Default) for long time-series calculations
... )

Integration with tracking workflow:

>>> # Preprocess data then track events
>>> processed = marEx.preprocess_data(sst, threshold_percentile=95)
>>> tracker = marEx.tracker(
...     processed.extreme_events,
...     processed.mask,
...     R_fill=8,
...     area_filter_quartile=0.5
... )
>>> events = tracker.run()
>>> print(f"Identified {events.event.max().compute()} distinct events")

Simple fixed baseline approach:

>>> # Basic daily climatology across all years
>>> result_fixed = marEx.preprocess_data(
...     sst,
...     method_anomaly="fixed_baseline",
...     threshold_percentile=95
... )
>>> # Uses all available data for climatology computation

Combined trend removal and fixed climatology:

>>> # Remove long-term trends then compute daily climatology
>>> result_combined = marEx.preprocess_data(
...     sst,
...     method_anomaly="detrend_fixed_baseline",
...     detrend_orders=[1],  # Linear trend
...     threshold_percentile=95,
...     force_zero_mean=True
... )
>>> # Balances trend removal with simple climatology

Core Processing Functions

marEx.detect.compute_normalised_anomaly(da, method_anomaly='shifting_baseline', dimensions=None, coordinates=None, window_year_baseline=15, smooth_days_baseline=21, std_normalise=False, detrend_orders=None, force_zero_mean=True, reference_period=None, use_temp_checkpoints=False, verbose=None, quiet=None)[source]

Generate normalised anomalies using specified methodology.

Parameters:
  • da (xarray.DataArray) – Input data with dimensions matching the ‘dimensions’ parameter

  • method_anomaly (str, default='shifting_baseline') –

    Anomaly computation method. Options: - ‘detrend_harmonic’: Detrending with harmonics and polynomials (efficient, biased) - ‘shifting_baseline’: Rolling climatology (accurate, shortens time series) - ‘fixed_baseline’: Daily climatology using full time series (keeps long-term trends in the anomaly) - ‘detrend_fixed_baseline’: Polynomial detrending + fixed climatology (does not shorten time series,

    keeps trends in seasonal timing in the anomaly)

  • dimensions (dict, optional) – Mapping of conceptual dimensions to actual dimension names in the data

  • coordinates (dict, optional) – Mapping of conceptual coordinates to actual coordinate names in the data

  • window_year_baseline (int, default=15) – Number of years for rolling climatology (shifting_baseline only)

  • smooth_days_baseline (int, default=21) – Days for smoothing rolling climatology (shifting_baseline only)

  • std_normalise (bool, default=False) – Whether to normalise by 30-day rolling standard deviation (detrend_harmonic only)

  • detrend_orders (list, default=[1]) – Polynomial orders for trend removal (detrend_harmonic and detrend_fixed_baseline only)

  • force_zero_mean (bool, default=True) – Explicitly enforce zero mean in final anomalies (detrend_harmonic and detrend_fixed_baseline only)

  • reference_period (tuple of (int, int), optional) – Year range (start_year, end_year) inclusive for computing the daily climatology (fixed_baseline and detrend_fixed_baseline only). If None (default), uses all available years. Anomalies are computed for the full time series regardless.

  • use_temp_checkpoints (bool)

  • verbose (bool | None)

  • quiet (bool | None)

Returns:

Dataset containing anomalies, mask, and metadata

Return type:

xarray.Dataset

Examples

Basic detrended baseline anomaly computation:

>>> import xarray as xr
>>> import marEx
>>>
>>> # Load chunked SST data
>>> sst = xr.open_dataset('sst_data.nc', chunks={}).sst.chunk({'time': 30})
>>>
>>> # Compute anomalies using shifting baseline (default)
>>> result = marEx.compute_normalised_anomaly(sst)
>>> print(result.data_vars)
Data variables:
    dat_anomaly  (time, lat, lon) float32 dask.array<chunksize=(30, 180, 360)>
    mask         (lat, lon) bool dask.array<chunksize=(180, 360)>
>>> # Check that anomalies have approximately zero mean
>>> print(f"Mean anomaly: {result.dat_anomaly.mean().compute():.6f}")
Mean anomaly: 0.000023

Previous configuration (marEx v2.0 default) of detrended baseline with higher-order polynomials and standardisation. Note: marEx v3.0+ uses shifting_baseline as the default method:

>>> result_advanced = marEx.compute_normalised_anomaly(
...     sst,
...     method_anomaly="detrend_harmonic",
...     detrend_orders=[1, 2, 3],  # Linear, quadratic, cubic trends
...     std_normalise=True,        # Add standardised anomalies
...     force_zero_mean=True
... )
>>> print(result_advanced.data_vars)
Data variables:
    dat_anomaly  (time, lat, lon) float32 dask.array<chunksize=(30, 180, 360)>
    mask         (lat, lon) bool dask.array<chunksize=(180, 360)>
    dat_stn      (time, lat, lon) float32 dask.array<chunksize=(30, 180, 360)>
    STD          (dayofyear, lat, lon) float32 dask.array<chunksize=(366, 180, 360)>
>>> # Standardised anomalies have unit variance
>>> print(f"STD of standardised anomalies: {result_advanced.dat_stn.std().compute():.3f}")

Accurate shifting baseline method for climate-aware anomalies:

>>> result_shifting = marEx.compute_normalised_anomaly(
...     sst,
...     method_anomaly="shifting_baseline",
...     window_year_baseline=10,   # Use 10-year rolling climatology
...     smooth_days_baseline=31    # 31-day smoothing window
... )
>>> # Anomalies computed relative to recent past climatology

Processing unstructured data:

>>> # ICON ocean model with ncells dimension
>>> icon_data = xr.open_dataset('icon_sst.nc', chunks={}).to.chunk({'time': 25})
>>> result_unstructured = marEx.compute_normalised_anomaly(
...     icon_data,
...     dimensions={"time": "time", "x": "ncells"}
...     coordinates={"time": "time", "x": "lon", "y": "lat"},
... )
>>> print(result_unstructured.dims)
Frozen({'time': 1461, 'ncells': 83886})

Comparison of methods - detrended vs shifting baseline:

>>> # Detrended baseline - faster, slight bias
>>> detrended = marEx.compute_normalised_anomaly(
...     sst, method_anomaly="detrend_harmonic"
... )
>>>
>>> # Shifting baseline - slower, more accurate
>>> shifting = marEx.compute_normalised_anomaly(
...     sst, method_anomaly="shifting_baseline",
...     window_year_baseline=15
... )
>>>
>>> # Compare anomaly magnitudes
>>> print(f"Detrended RMS: {detrended.dat_anomaly.std().compute():.3f}")
>>> print(f"Shifting RMS: {shifting.dat_anomaly.std().compute():.3f}")

Fixed baseline climatology:

>>> # Use full time series for daily climatology
>>> result_fixed = marEx.compute_normalised_anomaly(
...     sst,
...     method_anomaly="fixed_baseline"
... )
>>> # Climatology computed from all available years

Fixed baseline with a restricted reference period:

>>> # Compute climatology from 1990-2020 only, but output anomalies for all years
>>> result_ref = marEx.compute_normalised_anomaly(
...     sst,
...     method_anomaly="fixed_baseline",
...     reference_period=(1990, 2020)
... )

Fixed detrended baseline:

>>> # Remove long-term trends then compute fixed climatology
>>> result_fixed_detrended = marEx.compute_normalised_anomaly(
...     sst,
...     method_anomaly="detrend_fixed_baseline",
...     detrend_orders=[1],  # Remove linear trend
...     force_zero_mean=True
... )
>>> # Combines trend removal with fixed climatology
marEx.detect.rolling_climatology(da, window_year_baseline=15, dimensions=None, coordinates=None, use_temp_checkpoints=False)[source]

Compute rolling climatology efficiently using flox cohorts. Uses the previous window_year_baseline years of data and reassemble it to match the original data structure. Years without enough previous data will be filled with NaN.

Parameters:
  • da (xarray.DataArray) – Input data with time coordinate

  • window_year_baseline (int, default=15) – Number of years to include in each climatology window

  • dimensions (dict, optional) – Mapping of dimensions to names in the data

  • coordinates (dict, optional) – Mapping of coordinates to names in the data

  • use_temp_checkpoints (bool)

Returns:

Rolling climatology with same shape as input data

Return type:

xarray.DataArray

Examples

Basic rolling climatology computation:

>>> import xarray as xr
>>> import marEx
>>>
>>> # Load 20 years of SST data
>>> sst = xr.open_dataset('sst_data.nc', chunks={}).sst.chunk({'time': 30})
>>>
>>> # Compute 15-year rolling climatology
>>> climatology = marEx.rolling_climatology(sst, window_year_baseline=15)
>>> print(climatology.shape)
(7305, 180, 360)  # Same as input
>>>
>>> # First 15 years will be NaN (insufficient history)
>>> print(f"NaN values in first year: {climatology.isel(time=slice(0, 365)).isnull().all().compute()}")
True

Shorter window for datasets with limited time span:

>>> # For datasets with only 10 years, use shorter window
>>> short_climatology = marEx.rolling_climatology(
...     sst, window_year_baseline=5
... )
>>> # First 5 years will be NaN instead of 15

Processing unstructured data:

>>> # ICON ocean model data
>>> icon_sst = xr.open_dataset('icon_sst.nc', chunks={}).to.chunk({'time': 25})
>>> icon_climatology = marEx.rolling_climatology(
...     icon_sst,
...     dimensions={"time": "time", "x": "ncells"}
...     coordinates={"time": "time", "x": "lon", "y": "lat"}
... )
>>> print(icon_climatology.dims)
Frozen({'time': 7305, 'ncells': 83886})

Comparing with fixed climatology:

>>> # Fixed climatology (traditional approach)
>>> fixed_clim = sst.groupby(sst.time.dt.dayofyear).mean()
>>>
>>> # Rolling climatology (adaptive approach)
>>> rolling_clim = marEx.rolling_climatology(sst)
>>>
>>> # Rolling climatology adapts to climate change
>>> clim_2000 = rolling_clim.sel(time='2000').mean()
>>> clim_2020 = rolling_clim.sel(time='2020').mean()
>>> print(f"Climate change signal: {(clim_2020 - clim_2000).compute():.3f} °C")

Memory considerations for large datasets:

>>> # Ensure appropriate chunking for memory efficiency
>>> large_sst = sst.chunk({'time': 30, 'lat': 45, 'lon': 90})
>>> large_climatology = marEx.rolling_climatology(large_sst)
>>> # Output maintains input chunking structure
marEx.detect.smoothed_rolling_climatology(da, window_year_baseline=15, smooth_days_baseline=21, dimensions=None, coordinates=None, use_temp_checkpoints=False)[source]

Compute a smoothed rolling climatology using the previous window_year_baseline years of data and reassemble it to match the original data structure. Years without enough previous data will be filled with NaN.

Parameters:
  • da (xarray.DataArray) – Input data with time coordinate

  • window_year_baseline (int, default=15) – Number of years to include in each climatology window

  • smooth_days_baseline (int, default=21) – Number of days for temporal smoothing window

  • dimensions (dict, optional) – Mapping of dimensions to names in the data

  • coordinates (dict, optional) – Mapping of coordinates to names in the data

  • use_temp_checkpoints (bool)

Returns:

Smoothed rolling climatology with same shape as input data

Return type:

xarray.DataArray

Examples

Basic smoothed rolling climatology:

>>> import xarray as xr
>>> import marEx
>>>
>>> # Load SST data
>>> sst = xr.open_dataset('sst_data.nc', chunks={}).sst.chunk({'time': 30})
>>>
>>> # Compute smoothed rolling climatology
>>> smooth_clim = marEx.smoothed_rolling_climatology(
...     sst,
...     window_year_baseline=15,
...     smooth_days_baseline=21
... )
>>> print(smooth_clim.shape)
(7305, 180, 360)

Comparing different smoothing windows:

>>> # Short smoothing - more day-to-day variability
>>> clim_short = marEx.smoothed_rolling_climatology(
...     sst, smooth_days_baseline=7
... )
>>>
>>> # Long smoothing - smoother seasonal cycle
>>> clim_long = marEx.smoothed_rolling_climatology(
...     sst, smooth_days_baseline=61
... )
>>>
>>> # Compare variability
>>> var_short = clim_short.std(dim='time').mean().compute()
>>> var_long = clim_long.std(dim='time').mean().compute()
>>> print(f"Variability: short={var_short:.3f}, long={var_long:.3f}")

Climatology for anomaly computation:

>>> # Compute smoothed climatology then anomalies
>>> climatology = marEx.smoothed_rolling_climatology(sst)
>>> anomalies = sst - climatology
>>>
>>> # Check that anomalies have reasonable properties
>>> print(f"Anomaly mean: {anomalies.mean().compute():.6f}")
>>> print(f"Anomaly std: {anomalies.std().compute():.3f}")

Unstructured data processing:

>>> # ICON ocean data
>>> icon_sst = xr.open_dataset('icon_sst.nc', chunks={}).to.chunk({'time': 25})
>>> icon_smooth_clim = marEx.smoothed_rolling_climatology(
...     icon_sst,
...     dimensions={"time": "time", "x": "ncells"},
...     coordinates={"time": "time", "x": "lon", "y": "lat"},
...     window_year_baseline=10,
...     smooth_days_baseline=31
... )

Effect of smoothing on seasonal cycle:

>>> # Raw rolling climatology (no temporal smoothing)
>>> raw_clim = marEx.rolling_climatology(sst, window_year_baseline=15)
>>>
>>> # Smoothed rolling climatology
>>> smooth_clim = marEx.smoothed_rolling_climatology(
...     sst, window_year_baseline=15, smooth_days_baseline=21
... )
>>>
>>> # Compare seasonal cycle smoothness
>>> # Extract annual cycle for a point
>>> point_raw = raw_clim.isel(lat=90, lon=180).sel(time='2010')
>>> point_smooth = smooth_clim.isel(lat=90, lon=180).sel(time='2010')
>>>
>>> print(f"Raw climatology range: {(point_raw.max() - point_raw.min()).compute():.3f}")
>>> print(f"Smooth climatology range: {(point_smooth.max() - point_smooth.min()).compute():.3f}")

Performance considerations:

>>> # Efficient implementation smooths raw data first, then computes climatology
>>> # This is more memory-efficient than smoothing the climatology
>>> large_sst = sst.chunk({'time': 25, 'lat': 45, 'lon': 90})
>>> efficient_clim = marEx.smoothed_rolling_climatology(large_sst)
marEx.detect.identify_extremes(da, method_extreme='hobday_extreme', threshold_percentile=95, dimensions=None, coordinates=None, window_days_hobday=11, window_spatial_hobday=None, method_percentile='approximate', precision=0.01, max_anomaly=5.0, use_temp_checkpoints=False, verbose=None, quiet=None)[source]

Identify extreme events exceeding a percentile threshold using specified method.

Parameters:
  • da (xarray.DataArray) – DataArray containing anomalies

  • method_extreme (str, default='hobday_extreme') – Method for threshold calculation (‘global_extreme’ or ‘hobday_extreme’)

  • threshold_percentile (float, default=95) – Percentile threshold (e.g., 95 for 95th percentile)

  • dimensions (dict, optional) – Mapping of dimensions to names in the data

  • coordinates (dict, optional) – Mapping of coordinates to names in the data

  • window_days_hobday (int, default=11) – Window for day-of-year threshold (hobday_extreme only)

  • window_spatial_hobday (int, default=None) – Window for day-of-year threshold spatial clustering (hobday_extreme only)

  • method_percentile (str, default='approximate') – Method for percentile computation (‘exact’ or ‘approximate’)

  • precision (float, default=0.01) – Precision for histogram bins in approximate method

  • max_anomaly (float, default=5.0) – Maximum anomaly value for histogram binning

  • use_temp_checkpoints (bool)

  • verbose (bool | None)

  • quiet (bool | None)

Returns:

Tuple of (extremes, thresholds) where extremes is a boolean array identifying extreme events and thresholds contains the threshold values used

Return type:

tuple

Examples

Basic extreme identification with global thresholds:

>>> import xarray as xr
>>> import marEx
>>>
>>> # Load anomaly data (from compute_normalised_anomaly)
>>> anomalies = xr.open_dataset('anomalies.nc', chunks={}).dat_anomaly
>>>
>>> # Identify extreme events using global-in-time 95th percentile
>>> extremes, thresholds = marEx.identify_extremes(
...     anomalies,
...     method_extreme="global_extreme",
...     threshold_percentile=95
... )
>>> print(f"Extreme events shape: {extremes.shape}")
Extreme events shape: (1461, 180, 360)
>>> print(f"Thresholds shape: {thresholds.shape}")
Thresholds shape: (180, 360)
>>> # Count total extreme events
>>> total_extremes = extremes.sum().compute()
>>> print(f"Total extreme events: {total_extremes}")

Using day-of-year specific thresholds (cf. Hobday et al. 2016 method):

>>> # More sophisticated threshold calculation
>>> extremes_hobday, thresholds_hobday = marEx.identify_extremes(
...     anomalies,
...     method_extreme="hobday_extreme",
...     threshold_percentile=95,
...     window_days_hobday=11  # 11-day window around each day-of-year
...     window_spatial_hobday=3  # 3x3 spatial window for clustering percentile calcuation
... )
>>> print(f"Hobday thresholds shape: {thresholds_hobday.shape}")
Hobday thresholds shape: (366, 180, 360)
>>> # Compare seasonal variation in thresholds
>>> summer_threshold = thresholds_hobday.sel(dayofyear=200).mean()
>>> winter_threshold = thresholds_hobday.sel(dayofyear=50).mean()
>>> print(f"Summer vs Winter thresholds: {summer_threshold:.3f} vs {winter_threshold:.3f}")

Comparison of exact vs approximate percentile methods:

>>> # Approximate method (faster, default)
>>> extremes_approx, thresh_approx = marEx.identify_extremes(
...     anomalies, method_percentile="approximate"
... )
>>>
>>> # Exact method (slower & memory intensive)
>>> extremes_exact, thresh_exact = marEx.identify_extremes(
...     anomalies, method_percentile="exact"
... )
>>>
>>> # Compare threshold precision — ~0.005C
>>> threshold_diff = (thresh_exact - thresh_approx).std().compute()
>>> print(f"Threshold difference (exact vs approx): {threshold_diff:.6f}")

Different percentile thresholds for varying event rarity:

>>> # Conservative threshold - very extreme events only
>>> extremes_98, _ = marEx.identify_extremes(
...     anomalies, threshold_percentile=98
... )
>>>
>>> # Moderate threshold - more frequent events
>>> extremes_90, _ = marEx.identify_extremes(
...     anomalies, threshold_percentile=90
... )
>>>
>>> # Compare event frequency
>>> print(f"99th percentile events: {extremes_99.sum().compute()}")
>>> print(f"90th percentile events: {extremes_90.sum().compute()}")

Processing unstructured data:

>>> # ICON ocean model data
>>> icon_anomalies = xr.open_dataset('icon_anomalies.nc', chunks={}).dat_anomaly
>>> extremes_unstructured, thresholds_unstructured = marEx.identify_extremes(
...     icon_anomalies,
...     dimensions={"time": "time", "x": "ncells"},
...     coordinates={"time": "time", "x": "lon", "y": "lat"},
...     threshold_percentile=95
... )
>>> print(f"Unstructured extremes shape: {extremes_unstructured.shape}")

Advanced Hobday method with custom temporal window:

>>> # Longer temporal window for smoother thresholds
>>> extremes_smooth, thresholds_smooth = marEx.identify_extremes(
...     anomalies,
...     method_extreme="hobday_extreme",
...     window_days_hobday=31,  # Longer smoothing window
...     threshold_percentile=95
... )
>>>
>>> # Compare threshold smoothness
>>> std_11day = thresholds_hobday.std(dim='dayofyear').mean().compute()
>>> std_31day = thresholds_smooth.std(dim='dayofyear').mean().compute()
>>> print(f"Threshold variability: 11-day={std_11day:.3f}, 31-day={std_31day:.3f}")

Basic Usage Examples

Simple Preprocessing

import xarray as xr
import marEx

# Load sea surface temperature data
sst = xr.open_dataset('sst_daily.nc', chunks={'time': 365}).sst

# Basic preprocessing with default parameters
extremes_ds = marEx.preprocess_data(
    sst,
    threshold_percentile=95
)

# Result contains anomalies and extreme events
print(extremes_ds)

Advanced Preprocessing

# Advanced preprocessing with custom parameters
extremes_ds = marEx.preprocess_data(
    sst,
    threshold_percentile=90,
    method_anomaly='shifting_baseline',      # Use rolling climatology
    method_extreme='hobday_extreme',         # Day-of-year specific thresholds
    window_year_baseline=20,                 # 20-year rolling baseline
    smooth_days_baseline=31,                 # 31-day smoothing
    window_days_hobday=11,                   # 11-day window for thresholds
    window_spatial_hobday=5,                 # 5-cell spatial clustering
    dask_chunks={'time': 25}
)

Unstructured Grid Processing

# For unstructured grids, specify dimensions
dimensions = {'time': 'time', 'x': 'ncells'}
coordinates = {'time': 'time', 'x': 'lon', 'y': 'lat'}

extremes_ds = marEx.preprocess_data(
    sst_unstructured,
    threshold_percentile=95,
    dimensions=dimensions,
    coordinates=coordinates
)

Output Data Structure

The preprocessing function returns an xarray Dataset with the following structure:

# extremes_ds Dataset structure:
xarray.Dataset
Dimensions:     (lat, lon, time, dayofyear)
Coordinates:
    lat         (lat)
    lon         (lon)
    time        (time)
    dayofyear   (dayofyear)    # Only for hobday_extreme method
Data variables:
    dat_anomaly     (time, lat, lon)        float64     # Anomaly field
    mask            (lat, lon)              bool        # Land-sea mask
    extreme_events  (time, lat, lon)        bool        # Binary extreme events
    thresholds      (dayofyear, lat, lon)   float64     # Thresholds

Key Variables:

  • dat_anomaly: Anomaly field (either detrended or from rolling climatology)

  • mask: Deduced land-sea mask (True = ocean, False = land)

  • extreme_events: Binary field locating extreme events (True = extreme)

  • thresholds: Thresholds

Method Comparison

Anomaly Detection Methods

Harmonic Detrending (method_anomaly='detrend_harmonic'):

# Fast polynomial detrending with harmonic components
extremes_ds = marEx.preprocess_data(
    sst,
    method_anomaly='detrend_harmonic',
    threshold_percentile=95,
    # Additional parameters:
    std_normalise=False,          # Optional STD normalisation
    detrend_orders=[1],           # Linear detrending (default)
    force_zero_mean=True          # Enforce zero mean
)
Characteristics:
  • Faster computation using polynomial fitting

  • Uses harmonic components (annual, semi-annual cycles)

  • May introduce biases in variability statistics

  • Best for: Quick analysis

Fixed Baseline (method_anomaly='fixed_baseline'):

# Daily climatology (calculated using the full time series) without detrending
extremes_ds = marEx.preprocess_data(
    sst,
    method_anomaly='fixed_baseline',
    threshold_percentile=95,
    # Additional parameters:
    smooth_days_baseline=11       # Smoothing window for climatology
)

# Or where the climatology is calculated for a specific reference period (e.g., 1990-2020)
extremes_ds = marEx.preprocess_data(
    sst,
    method_anomaly='fixed_baseline',
    threshold_percentile=95,
    reference_period=(1990, 2020)  # Climatology from 1990-2020 only
)
Characteristics:
  • Anomaly relative to the daily climatology using full time series (or a specified reference_period)

  • Preserves long-term / climate trends

  • Simple interpretation and fast computation

  • Best for: Baseline comparison studies, trend-inclusive analysis, public outreach

Detrend Fixed Baseline (method_anomaly='detrend_fixed_baseline'):

# Polynomial detrending followed by fixed daily climatology
extremes_ds = marEx.preprocess_data(
    sst,
    method_anomaly='detrend_fixed_baseline',
    threshold_percentile=95,
    # Additional parameters:
    detrend_orders=[1],           # Linear detrending (default)
    smooth_days_baseline=11,      # Smoothing window for climatology
    force_zero_mean=True,         # Enforce zero mean
    reference_period=(1990, 2020) # Optional: restrict climatology to a specific reference period
)
Characteristics:
  • Polynomial detrending followed by removing the fixed daily climatology

  • Preserves the full time-series of data, but does not account for trends in the timing of seasonal transitions

  • Removes long-term trends

  • Best for: Climate variability studies with trend removal

Shifting Baseline (method_anomaly='shifting_baseline'):

# Rolling climatology from previous years
extremes_ds = marEx.preprocess_data(
    sst,
    method_anomaly='shifting_baseline',
    threshold_percentile=95,
    window_year_baseline=15,      # 15-year rolling baseline
    smooth_days_baseline=21       # 21-day smoothing window
)
Characteristics:
  • More accurate climatology using rolling window

  • Shortens time series by baseline window length

  • Computationally intensive but scientifically rigorous

  • Best for: Research applications, intricate & accurate analysis

Extreme Event Detection Methods

Global Extreme (method_extreme='global_extreme'):

# Single threshold across all time points
extremes_ds = marEx.preprocess_data(
    sst,
    method_extreme='global_extreme',
    threshold_percentile=95,
    # Optional STD normalisation
    std_normalise=True
)
Characteristics:
  • Uses percentiles from entire time series

  • Single threshold value for all time points

  • Simple interpretation and fast computation

  • Best for: Exploratory analysis

Hobday Extreme (method_extreme='hobday_extreme'):

# Day-of-year specific thresholds
extremes_ds = marEx.preprocess_data(
    sst,
    method_extreme='hobday_extreme',
    threshold_percentile=95,
    window_days_hobday=11,        # 11-day window
    window_spatial_hobday=None    # No spatial clustering (default)
)
Characteristics:
  • Day-of-year specific percentile thresholds

  • Accounts for seasonal variations

  • Follows Hobday et al. (2016) methodology

Spatial Window Enhancement (window_spatial_hobday)

New in v3.0+: The Hobday extreme method supports optional spatial pooling window for more robust, spatially coherent thresholds.

Algorithm Details:

For each grid cell (i, j) and each day-of-year d:

  1. Temporal Sampling: Collect anomalies from all years within ±``window_days_hobday`` days around day d

    • Traditional: Samples from single cell (i, j)

    • With spatial window: Samples from all cells in neighbourhood

  2. Spatial Pooling (if window_spatial_hobday specified):

    • Define spatial window centered at (i, j) with radius r = (window_spatial_hobday - 1) / 2

    • Pool samples from cells (i-r:i+r+1, j-r:j+r+1)

    • Edge handling: Smaller windows near boundaries

  3. Percentile Calculation:

    • Use histogram approximation (method_percentile='approximate')

    • Precision controlled by precision parameter (default 0.01°C)

    • Calculate threshold at threshold_percentile (e.g., 95th)

Sample Size Comparison:

Configuration                         Sample Count
──────────────────────────────────   ─────────────
Traditional (no spatial window)       N_years × window_days_hobday
Example: 30 years × 11 days          = 330 samples

With 5×5 spatial window               N_years × window_days_hobday × 25 cells
Example: 30 years × 11 days × 25     = 8,250 samples

With 9×9 spatial window               N_years × window_days_hobday × 81 cells
Example: 30 years × 11 days × 81     = 26,730 samples

Example: Enabling Spatial Window:

# Very long time-series (50+ years): no spatial pooling needed
extremes_highres = marEx.preprocess_data(
    sst_0.1deg,
    method_extreme='hobday_extreme',
    window_days_hobday=11,
    window_spatial_hobday=None
)

# Short time-series: use spatial pooling to increase robustness of threshold calculation
extremes_coarse = marEx.preprocess_data(
    sst_2deg,
    method_extreme='hobday_extreme',
    window_days_hobday=11,
    window_spatial_hobday=5         # Increase samples in anomaly distribution using a 5×5 window
)

# High threshold percentile (>95%): use spatial pooling to robustly sample distribution tails
extremes_99th = marEx.preprocess_data(
    sst,
    method_extreme='hobday_extreme',
    threshold_percentile=99,        # Extreme percentiles require more samples
    window_days_hobday=11,
    window_spatial_hobday=7         # Larger window to sample tails (7×7 = 49 cells)
)

Performance Implications:

  • Memory: Approximate method remains memory-efficient regardless of window size

  • Computation: Larger windows → more samples → slightly slower but still fast

  • Typical: 5×5 window adds ~10-15% to computation time vs. no spatial window

When to Use Spatial Windowing:

Spatial windowing increases the sample size for percentile calculation, improving statistical robustness. Note that this is not a spatial smoothing of the data itself, but rather a pooling of samples from neighbouring grid cells to better estimate the percentile thresholds. This is motivated from a spatial decorrelation length-scale argument, in the same way Hobday has argued for decorrelation time-scale for the 11-day time window. This is critical in several scenarios:

1. Short Time Series (insufficient samples):

When time series length is limited, sample size for each day-of-year may be inadequate for robust percentile estimation:

Time Series Length    Samples (no spatial)    Extreme samples    Recommendation
──────────────────    ───────────────────     ───────────────    ──────────────
10 years              110 samples             5 samples          ✓ Use spatial
20 years              220 samples             11 samples         ✓ Use spatial
30 years              330 samples             16 samples         ○ Optional
50+ years             550+ samples            27+ samples        ✗ Not needed

Guideline: Use spatial windowing if time series < 30 years for robust 95th percentile estimation.

2. Extreme Percentiles (>95%, sampling distribution tails):

Higher percentiles require more samples to characterise the tail of the distribution accurately. Without sufficient samples, extreme percentile thresholds become unreliable (Smith et al. 2025, https://doi.org/10.1016/j.pocean.2024.103404).

Percentile    Min Samples Needed    30-years (no spatial)    30-year with 5×5    Recommendation
──────────    ──────────────────    ─────────────────────    ────────────────    ──────────────
90th          100-200               330 ✓                    8,250 ✓             Either works
95th          200-400               330 ○                    8,250 ✓             Spatial helps
97.5th        400-800               330 ✗                    8,250 ✓             ✓ Use spatial
99th          1000+                 330 ✗                    8,250 ✓             ✓ Use spatial
99.9th        10000+                330 ✗                    8,250 ✗             ✓ Need 9×9+

Guideline: For percentiles >95th, spatial windowing is strongly recommended. For >97.5th, it is essential.

Parameter Reference

Core Parameters

threshold_percentilefloat, default=95

Percentile threshold for extreme event identification (e.g., 95 for 95th percentile)

method_anomaly{‘detrend_harmonic’, ‘fixed_baseline’, ‘detrend_fixed_baseline’, ‘shifting_baseline’}, default=’shifting_baseline’

Method for anomaly computation

method_extreme{‘global_extreme’, ‘hobday_extreme’}, default=’hobday_extreme’

Method for extreme identification

dask_chunksdict, default={‘time’: 25}

Chunk sizes for Dask arrays for memory management

Anomaly Method Parameters

Harmonic Detrending Parameters:

std_normalisebool, default=False

Whether to normalise anomalies using 30-day rolling standard deviation

detrend_orderslist of int, default=[1]

Polynomial orders for detrending (e.g., [1, 2] for linear + quadratic)

force_zero_meanbool, default=True

Whether to explicitly enforce zero mean in final anomalies

Fixed Baseline Parameters:

smooth_days_baselineint, default=11

Number of days for smoothing the daily climatology

reference_periodtuple of (int, int), optional

Year range (start_year, end_year) inclusive for computing the daily climatology. If None (default), uses all available years. Anomalies are computed for the full time series regardless. Example: reference_period=(1990, 2020)

Detrend Fixed Baseline Parameters:

detrend_orderslist of int, default=[1]

Polynomial orders for detrending (e.g., [1, 2] for linear + quadratic)

smooth_days_baselineint, default=11

Number of days for smoothing the daily climatology after detrending

force_zero_meanbool, default=True

Whether to explicitly enforce zero mean in final anomalies

reference_periodtuple of (int, int), optional

Year range (start_year, end_year) inclusive for computing the daily climatology (only affects the climatology step; detrending still uses all data). If None (default), uses all available years.

Shifting Baseline Parameters:

window_year_baselineint, default=15

Number of years for rolling climatology baseline

smooth_days_baselineint, default=21

Number of days for smoothing the rolling climatology baseline

Extreme Detection Parameters

Hobday Extreme Parameters:

window_days_hobdayint, default=11

Window size for day-of-year threshold calculation

window_spatial_hobdayint, optional

Spatial window size for clustering (None = no spatial clustering)

method_percentile{‘exact’, ‘approximate’}, default=’approximate’

Method for percentile calculation

precisionfloat, default=0.01

Precision for histogram bins in approximate percentile calculation

max_anomalyfloat, default=5.0

Maximum anomaly value for approximate percentile calculation

Grid Configuration

dimensionsdict

Dimension mapping for different grid types:

  • Structured: {'time': 'time', 'x': 'lon', 'y': 'lat'}

  • Unstructured: {'time': 'time', 'x': 'ncells'}

coordinatesdict

Coordinate mapping for different grid types:

  • Structured: {'time': 'time', 'x': 'lon', 'y': 'lat'}

  • Unstructured: {'time': 'time', 'x': 'lon', 'y': 'lat'}

neighboursxarray.DataArray, optional

Neighbour connectivity array for spatial clustering (unstructured grids)

cell_areasxarray.DataArray, optional

Cell areas for weighted spatial statistics (unstructured grids)

Advanced Usage Examples

Method Combinations

# Most rigorous combination (computationally intensive)
extremes_ds = marEx.preprocess_data(
    sst,
    method_anomaly='shifting_baseline',
    method_extreme='hobday_extreme',
    threshold_percentile=90,
    window_year_baseline=20,
    smooth_days_baseline=31,
    window_days_hobday=11
)

# Fastest combination (less rigorous)
extremes_ds = marEx.preprocess_data(
    sst,
    method_anomaly='detrend_harmonic',
    method_extreme='global_extreme',
    threshold_percentile=95,
    std_normalise=False
)

Performance Optimisations

# Optimised chunking for large datasets
extremes_ds = marEx.preprocess_data(
    sst,
    threshold_percentile=95,
    dask_chunks={'time': 25, 'lat': 200, 'lon': 200},
    # Use approximate percentiles for speed
    method_percentile='approximate',
    precision=0.05,  # Coarser precision for speed
    max_anomaly=10.0
)

Multi-Variable Processing

# Process multiple variables
variables = ['sst', 'sss', 'chlorophyll']
extreme_datasets = {}

for var_name in variables:
    data = xr.open_dataset(f'{var_name}_daily.nc')[var_name]

    extreme_datasets[var_name] = marEx.preprocess_data(
        data,
        threshold_percentile=95,
        method_anomaly='shifting_baseline',
        method_extreme='hobday_extreme'
    )

Integration with Tracking

Complete Workflow

import xarray as xr
import marEx

# Step 1: Load data
sst = xr.open_dataset('sst_daily.nc', chunks={'time': 365}).sst

# Step 2: Preprocess extremes
extremes_ds = marEx.preprocess_data(
    sst,
    method_anomaly='shifting_baseline',
    method_extreme='hobday_extreme',
    threshold_percentile=95,
    window_year_baseline=15,
    smooth_days_baseline=21,
    window_days_hobday=11,
    dask_chunks={'time': 25}
)

# Step 3: Track events
event_tracker = marEx.tracker(
    extremes_ds.extreme_events,
    extremes_ds.mask,
    R_fill=8,
    area_filter_quartile=0.5
)

tracked_events = event_tracker.run()

# Step 4: Visualise results
config = marEx.PlotConfig(
    title='Marine Heatwave Events',
    plot_IDs=True
)

fig, ax, im = tracked_events.ID_field.isel(time=0).plotX.single_plot(config)

Quality Control

Data Validation

# Check preprocessing results
extremes_ds = marEx.preprocess_data(sst, threshold_percentile=95)

# Validate anomaly statistics
anomaly_stats = extremes_ds.dat_anomaly.std()
print(f"Anomaly standard deviation: {anomaly_stats.values:.3f}")

# Check extreme event frequency
event_frequency = extremes_ds.extreme_events.mean() * 100
print(f"Extreme event frequency: {event_frequency.values:.1f}%")

# Validate mask coverage
ocean_fraction = extremes_ds.mask.mean() * 100
print(f"Ocean coverage: {ocean_fraction.values:.1f}%")

Threshold Validation

# For hobday_extreme method, examine thresholds
if 'thresholds' in extremes_ds:
    # Check seasonal threshold variation
    seasonal_range = (extremes_ds.thresholds.max(dim='dayofyear') -
                     extremes_ds.thresholds.min(dim='dayofyear'))
    print(f"Seasonal threshold range: {seasonal_range.mean().values:.3f}")

    # Plot threshold climatology
    threshold_clim = extremes_ds.thresholds.mean(dim=['lat', 'lon'])
    import matplotlib.pyplot as plt
    plt.plot(threshold_clim.dayofyear, threshold_clim.values)
    plt.xlabel('Day of Year')
    plt.ylabel('Threshold Value')
    plt.title('Seasonal Threshold Climatology')

Error Handling

Common Issues and Solutions

Memory Errors:

# Solution: Optimise chunking and use approximate methods
extremes_ds = marEx.preprocess_data(
    sst,
    threshold_percentile=95,
    dask_chunks={'time': 15},     # Smaller chunks
    method_percentile='approximate',
    precision=0.1                 # Coarser precision
)

Performance Issues:

# Solution: Use faster methods for exploration
extremes_ds = marEx.preprocess_data(
    sst,
    threshold_percentile=95,
    method_anomaly='detrend_harmonic',
    method_extreme='global_extreme',
    std_normalise=False
)

Threshold Calculation Issues:

# Solution: Adjust window sizes and use spatial clustering
extremes_ds = marEx.preprocess_data(
    sst,
    method_extreme='hobday_extreme',
    window_days_hobday=21,        # Larger window
    window_spatial_hobday=5,      # Add spatial clustering
    precision=0.01                # Higher precision
)

Coordinate System Issues:

# Solution: Specify custom dimensions and coordinates
extremes_ds = marEx.preprocess_data(
    sst,
    threshold_percentile=95,
    dimensions={'time': 'time', 'x': 'longitude', 'y': 'latitude'},
    coordinates={'time': 'time', 'x': 'longitude', 'y': 'latitude'}
)

Performance Benchmarks

Method Performance Comparison

# Relative performance for global 0.25° daily data:

# Fastest: detrend_harmonic + global_extreme
# - Processing time: ~0.5 wall-minutes per decade (2 CPU-hours)
# - Memory usage: ~1 GB
# - Accuracy: Good for first analysis

# Balanced: detrend_fixed_baseline + hobday_extreme
# - Processing time: ~5 wall-minutes per decade (21 CPU-hours)
# - Memory usage: ~8 GB
# - Accuracy: Better climatology

# Most rigorous: shifting_baseline + hobday_extreme
# - Processing time: ~8 wall-minutes per decade (34 CPU-hours)
# - Memory usage: ~12 GB
# - Accuracy: Best for research applications

Scaling Characteristics

# Scaling with dask (64 cores, 25-day chunks):
# - Linear scaling up to ~100 cores
# - Memory usage: ~2-4 GB per core
# - I/O becomes bottleneck beyond 200 cores
# - Optimal chunk size depends on data resolution

Best Practices

Method Selection Guidelines

  1. Quick Exploratory Analysis: Use detrend_harmonic + global_extreme

  2. Climate Change Research Studies: Use shifting_baseline + hobday_extreme

  3. Limited Timeseries: Use detrend_fixed_baseline + hobday_extreme

Chunking Guidelines

# Optimal chunking strategies:

# For global 0.25° data:
optimal_chunks = {'time': 25, 'lat': 200, 'lon': 200}

# For regional high-resolution data:
optimal_chunks = {'time': 50, 'lat': 100, 'lon': 100}

# For unstructured grids:
optimal_chunks = {'time': 25, 'ncells': 50000}

See Also