API Reference

This page provides a comprehensive reference for all public functions, classes, and modules in marEx.

Main Package

MarEx: Marine Extremes Detection and Tracking

A Python package for efficient identification and tracking of marine extremes such as Marine Heatwaves (MHWs).

Core Functionality

  • detect: Convert raw time series into standardised anomalies

  • track: Identify and track extreme events through time

Example

>>> import xarray as xr
>>> import marEx
>>> # Load SST data
>>> sst = xr.open_dataset('sst_data.nc').sst
>>> # Preprocess data to identify extreme events
>>> extreme_events_ds = marEx.preprocess_data(sst, threshold_percentile=95)
>>> # Track events through time
>>> events_ds = marEx.tracker(extreme_events_ds.extreme_events, extreme_events_ds.mask,
...                         R_fill=8, area_filter_quartile=0.5).run()
marEx.preprocess_data(da, method_anomaly='shifting_baseline', method_extreme='hobday_extreme', threshold_percentile=95, window_year_baseline=15, smooth_days_baseline=21, window_days_hobday=11, window_spatial_hobday=None, std_normalise=False, detrend_orders=None, force_zero_mean=True, reference_period=None, method_percentile='approximate', precision=0.01, max_anomaly=5.0, dask_chunks=None, dimensions=None, coordinates=None, neighbours=None, cell_areas=None, use_temp_checkpoints=False, verbose=None, quiet=None)[source]

Complete preprocessing pipeline for marine extreme event identification.

Supports separate methods for anomaly computation and extreme identification:

Anomaly Methods:

  • ‘detrend_harmonic’: Detrending with harmonics and polynomials – more efficient, but biases statistics

  • ‘shifting_baseline’: Rolling climatology using previous window_year_baseline years – more “correct”, but shortens time series by window_year_baseline years

  • ‘fixed_baseline’: Daily climatology using full time series – does _not_ remove climate trends !

  • ‘detrend_fixed_baseline’: Polynomial detrending followed by fixed daily climatology – keeps full time-series of data, but does not account for trends in the timing of seasonal transitions (which may appear as extremes)

Extreme Methods:

  • ‘global_extreme’: Global-in-time threshold value

  • ‘hobday_extreme’: Local day-of-year specific thresholds with windowing

Parameters:
  • da (xarray.DataArray) – Raw input data

  • method_anomaly (str, default='shifting_baseline') – Anomaly computation method (‘detrend_harmonic’, ‘shifting_baseline’, ‘fixed_baseline’, or ‘detrend_fixed_baseline’).

  • method_extreme (str, default='hobday_extreme') – Extreme identification method (‘global_extreme’ or ‘hobday_extreme’).

  • threshold_percentile (float, default=95) – Percentile threshold for extreme event detection.

  • window_year_baseline (int, default=15) – Number of previous years for rolling climatology (shifting_baseline method only).

  • smooth_days_baseline (int, default=21) – Days for smoothing rolling climatology (shifting_baseline method only).

  • window_days_hobday (int, default=11) – Window size for day-of-year threshold calculation (hobday_extreme method only).

  • window_spatial_hobday (int, default=None) – Spatial window size (2D centred window) for the day-of-year threshold calculation (hobday_extreme method only).

  • std_normalise (bool, default=False) – Whether to standardise anomalies by rolling standard deviation (detrend_harmonic only).

  • detrend_orders (list, default=[1]) – Polynomial orders for detrending (detrend_harmonic method only). Default is 1st order (linear) detrend. [1,2] e.g. would use a linear+quadratic detrending.

  • force_zero_mean (bool, default=True) – Whether to enforce zero mean in detrended anomalies (detrend_harmonic method only).

  • reference_period (tuple of (int, int), optional) – Year range (start_year, end_year) inclusive for computing the daily climatology (fixed_baseline and detrend_fixed_baseline only). If None (default), uses all available years. Anomalies are computed for the full time series regardless. Example: reference_period=(1990, 2020) computes the climatology from 1990-2020 but outputs anomalies for the entire input time range.

  • method_percentile (str, default='approximate') – Method for percentile calculation (‘exact’ or ‘approximate’) for both global_extreme & hobday_extreme methods. N.B.: Using the exact percentile calculation requires both careful/thoughtful chunking & sufficient memory, in consideration of the limitations inherent to distributed parallel I/O & processing.

  • precision (float, default=0.01) – Precision for histogram bins in approximate percentile method.

  • max_anomaly (float, default=5.0) – Maximum anomaly value for histogram binning in the approximate percentile method.

  • dask_chunks (dict, optional) – Chunking specification for distributed computation.

  • dimensions (dict, default={"time": "time", "x": "lon", "y": "lat"}) – Mapping of dimensions to names in the data.

  • coordinates (dict, optional) – Mapping of coordinates to names in the data. Defaults to dimensions mapping.

  • neighbours (xarray.DataArray, optional) – Neighbour connectivity for spatial clustering.

  • cell_areas (xarray.DataArray, optional) – Cell areas for weighted spatial statistics.

  • use_temp_checkpoints (bool, default=False) – Enable checkpointing to temporary zarr stores to break Dask graph dependencies. When True, intermediate results (anomalies, thresholds, extremes) are saved to temporary zarr files and immediately reloaded, preventing expensive recomputations. Recommended for large datasets on HPC systems where the 2D histogram computation is expensive. Temporary files are automatically cleaned up after reloading.

  • verbose (bool, default=None) – Enable verbose logging with detailed progress information. If None, uses current global logging configuration.

  • quiet (bool, default=None) – Enable quiet logging with minimal output (warnings and errors only). If None, uses current global logging configuration. Note: quiet takes precedence over verbose if both are True.

Returns:

Processed dataset with anomalies and extreme event identification

Return type:

xarray.Dataset

Examples

Basic usage with gridded SST data for marine heatwave detection:

>>> import xarray as xr
>>> import marEx
>>>
>>> # Load and chunk SST data
>>> sst = xr.open_dataset('sst_data.nc', chunks={}).sst.chunk({'time': 30})
>>>
>>> # Basic preprocessing with default shifting baseline method
>>> result = marEx.preprocess_data(sst, threshold_percentile=90)
>>> print(result)
<xarray.Dataset>
Dimensions:         (time: 1461, lat: 180, lon: 360)
Data variables:
    dat_anomaly     (time, lat, lon) float32 dask.array<chunksize=(30, 180, 360)>
    mask            (lat, lon) bool dask.array<chunksize=(180, 360)>
    extreme_events  (time, lat, lon) bool dask.array<chunksize=(30, 180, 360)>
    thresholds      (lat, lon) float32 dask.array<chunksize=(180, 360)>
>>> # Check which locations have extreme events
>>> print(f"Total extreme events: {result.extreme_events.sum().compute()}")
Total extreme events: 15847

Using shifting baseline method for more accurate climatology:

>>> # Requires at least 15 years of data by default
>>> result_shifting = marEx.preprocess_data(
...     sst,
...     method_anomaly="shifting_baseline",
...     window_year_baseline=10,  # Use shorter window if needed
...     smooth_days_baseline=31   # Longer smoothing window
... )
>>> # Note: First 10 years will be removed from output

Using Hobday extreme method with day-of-year specific thresholds:

>>> result_hobday = marEx.preprocess_data(
...     sst,
...     method_extreme="hobday_extreme",
...     window_days_hobday=11,  # 11-day window for each day-of-year
...     threshold_percentile=95
... )
>>> print(result_hobday.thresholds.dims)
('dayofyear', 'lat', 'lon')

Previous configuration (marEx v2.0 default) with polynomial detrending and standardisation:

>>> result_advanced = marEx.preprocess_data(
...     sst,
...     method_anomaly="detrend_harmonic",
...     detrend_orders=[1, 2],  # Linear and quadratic trends
...     std_normalise=True,     # Standardise by rolling std
...     force_zero_mean=True,
...     threshold_percentile=95
... )
>>> # Result includes both raw and standardised anomalies
>>> print('dat_stn' in result_advanced)
True

Processing unstructured data:

>>> # For ICON ocean model data
>>> icon_sst = xr.open_dataset('icon_sst.nc', chunks={}).to.chunk({'time': 50})
>>> result_unstructured = marEx.preprocess_data(
...     icon_sst,
...     dimensions={"x": "ncells"},   # Must specify the name of the spatial dimension
...     dask_chunks={"time": 50}
... )

Error handling - insufficient data for shifting baseline:

>>> short_data = sst.isel(time=slice(0, 1000))  # Only ~3 years
>>> try:
...     result = marEx.preprocess_data(
...         short_data,
...         method_anomaly="shifting_baseline",
...         window_year_baseline=15
...     )
... except ValueError as e:
...     print(f"Error: {e}")
Error: Insufficient data for shifting_baseline method. Dataset spans 3 years but window_year_baseline
requires at least 15 years.

Performance considerations with chunking:

>>> # For large datasets, adjust chunking for memory management
>>> large_sst = sst.chunk({"time": 25, "lat": 90, "lon": 180})
>>> result = marEx.preprocess_data(
...     large_sst,
...     dask_chunks={"time": 25},
...     method_percentile="approximate"  # Use approximate method (Default) for long time-series calculations
... )

Integration with tracking workflow:

>>> # Preprocess data then track events
>>> processed = marEx.preprocess_data(sst, threshold_percentile=95)
>>> tracker = marEx.tracker(
...     processed.extreme_events,
...     processed.mask,
...     R_fill=8,
...     area_filter_quartile=0.5
... )
>>> events = tracker.run()
>>> print(f"Identified {events.event.max().compute()} distinct events")

Simple fixed baseline approach:

>>> # Basic daily climatology across all years
>>> result_fixed = marEx.preprocess_data(
...     sst,
...     method_anomaly="fixed_baseline",
...     threshold_percentile=95
... )
>>> # Uses all available data for climatology computation

Combined trend removal and fixed climatology:

>>> # Remove long-term trends then compute daily climatology
>>> result_combined = marEx.preprocess_data(
...     sst,
...     method_anomaly="detrend_fixed_baseline",
...     detrend_orders=[1],  # Linear trend
...     threshold_percentile=95,
...     force_zero_mean=True
... )
>>> # Balances trend removal with simple climatology
marEx.compute_normalised_anomaly(da, method_anomaly='shifting_baseline', dimensions=None, coordinates=None, window_year_baseline=15, smooth_days_baseline=21, std_normalise=False, detrend_orders=None, force_zero_mean=True, reference_period=None, use_temp_checkpoints=False, verbose=None, quiet=None)[source]

Generate normalised anomalies using specified methodology.

Parameters:
  • da (xarray.DataArray) – Input data with dimensions matching the ‘dimensions’ parameter

  • method_anomaly (str, default='shifting_baseline') –

    Anomaly computation method. Options: - ‘detrend_harmonic’: Detrending with harmonics and polynomials (efficient, biased) - ‘shifting_baseline’: Rolling climatology (accurate, shortens time series) - ‘fixed_baseline’: Daily climatology using full time series (keeps long-term trends in the anomaly) - ‘detrend_fixed_baseline’: Polynomial detrending + fixed climatology (does not shorten time series,

    keeps trends in seasonal timing in the anomaly)

  • dimensions (dict, optional) – Mapping of conceptual dimensions to actual dimension names in the data

  • coordinates (dict, optional) – Mapping of conceptual coordinates to actual coordinate names in the data

  • window_year_baseline (int, default=15) – Number of years for rolling climatology (shifting_baseline only)

  • smooth_days_baseline (int, default=21) – Days for smoothing rolling climatology (shifting_baseline only)

  • std_normalise (bool, default=False) – Whether to normalise by 30-day rolling standard deviation (detrend_harmonic only)

  • detrend_orders (list, default=[1]) – Polynomial orders for trend removal (detrend_harmonic and detrend_fixed_baseline only)

  • force_zero_mean (bool, default=True) – Explicitly enforce zero mean in final anomalies (detrend_harmonic and detrend_fixed_baseline only)

  • reference_period (tuple of (int, int), optional) – Year range (start_year, end_year) inclusive for computing the daily climatology (fixed_baseline and detrend_fixed_baseline only). If None (default), uses all available years. Anomalies are computed for the full time series regardless.

  • use_temp_checkpoints (bool)

  • verbose (bool | None)

  • quiet (bool | None)

Returns:

Dataset containing anomalies, mask, and metadata

Return type:

xarray.Dataset

Examples

Basic detrended baseline anomaly computation:

>>> import xarray as xr
>>> import marEx
>>>
>>> # Load chunked SST data
>>> sst = xr.open_dataset('sst_data.nc', chunks={}).sst.chunk({'time': 30})
>>>
>>> # Compute anomalies using shifting baseline (default)
>>> result = marEx.compute_normalised_anomaly(sst)
>>> print(result.data_vars)
Data variables:
    dat_anomaly  (time, lat, lon) float32 dask.array<chunksize=(30, 180, 360)>
    mask         (lat, lon) bool dask.array<chunksize=(180, 360)>
>>> # Check that anomalies have approximately zero mean
>>> print(f"Mean anomaly: {result.dat_anomaly.mean().compute():.6f}")
Mean anomaly: 0.000023

Previous configuration (marEx v2.0 default) of detrended baseline with higher-order polynomials and standardisation. Note: marEx v3.0+ uses shifting_baseline as the default method:

>>> result_advanced = marEx.compute_normalised_anomaly(
...     sst,
...     method_anomaly="detrend_harmonic",
...     detrend_orders=[1, 2, 3],  # Linear, quadratic, cubic trends
...     std_normalise=True,        # Add standardised anomalies
...     force_zero_mean=True
... )
>>> print(result_advanced.data_vars)
Data variables:
    dat_anomaly  (time, lat, lon) float32 dask.array<chunksize=(30, 180, 360)>
    mask         (lat, lon) bool dask.array<chunksize=(180, 360)>
    dat_stn      (time, lat, lon) float32 dask.array<chunksize=(30, 180, 360)>
    STD          (dayofyear, lat, lon) float32 dask.array<chunksize=(366, 180, 360)>
>>> # Standardised anomalies have unit variance
>>> print(f"STD of standardised anomalies: {result_advanced.dat_stn.std().compute():.3f}")

Accurate shifting baseline method for climate-aware anomalies:

>>> result_shifting = marEx.compute_normalised_anomaly(
...     sst,
...     method_anomaly="shifting_baseline",
...     window_year_baseline=10,   # Use 10-year rolling climatology
...     smooth_days_baseline=31    # 31-day smoothing window
... )
>>> # Anomalies computed relative to recent past climatology

Processing unstructured data:

>>> # ICON ocean model with ncells dimension
>>> icon_data = xr.open_dataset('icon_sst.nc', chunks={}).to.chunk({'time': 25})
>>> result_unstructured = marEx.compute_normalised_anomaly(
...     icon_data,
...     dimensions={"time": "time", "x": "ncells"}
...     coordinates={"time": "time", "x": "lon", "y": "lat"},
... )
>>> print(result_unstructured.dims)
Frozen({'time': 1461, 'ncells': 83886})

Comparison of methods - detrended vs shifting baseline:

>>> # Detrended baseline - faster, slight bias
>>> detrended = marEx.compute_normalised_anomaly(
...     sst, method_anomaly="detrend_harmonic"
... )
>>>
>>> # Shifting baseline - slower, more accurate
>>> shifting = marEx.compute_normalised_anomaly(
...     sst, method_anomaly="shifting_baseline",
...     window_year_baseline=15
... )
>>>
>>> # Compare anomaly magnitudes
>>> print(f"Detrended RMS: {detrended.dat_anomaly.std().compute():.3f}")
>>> print(f"Shifting RMS: {shifting.dat_anomaly.std().compute():.3f}")

Fixed baseline climatology:

>>> # Use full time series for daily climatology
>>> result_fixed = marEx.compute_normalised_anomaly(
...     sst,
...     method_anomaly="fixed_baseline"
... )
>>> # Climatology computed from all available years

Fixed baseline with a restricted reference period:

>>> # Compute climatology from 1990-2020 only, but output anomalies for all years
>>> result_ref = marEx.compute_normalised_anomaly(
...     sst,
...     method_anomaly="fixed_baseline",
...     reference_period=(1990, 2020)
... )

Fixed detrended baseline:

>>> # Remove long-term trends then compute fixed climatology
>>> result_fixed_detrended = marEx.compute_normalised_anomaly(
...     sst,
...     method_anomaly="detrend_fixed_baseline",
...     detrend_orders=[1],  # Remove linear trend
...     force_zero_mean=True
... )
>>> # Combines trend removal with fixed climatology
marEx.smoothed_rolling_climatology(da, window_year_baseline=15, smooth_days_baseline=21, dimensions=None, coordinates=None, use_temp_checkpoints=False)[source]

Compute a smoothed rolling climatology using the previous window_year_baseline years of data and reassemble it to match the original data structure. Years without enough previous data will be filled with NaN.

Parameters:
  • da (xarray.DataArray) – Input data with time coordinate

  • window_year_baseline (int, default=15) – Number of years to include in each climatology window

  • smooth_days_baseline (int, default=21) – Number of days for temporal smoothing window

  • dimensions (dict, optional) – Mapping of dimensions to names in the data

  • coordinates (dict, optional) – Mapping of coordinates to names in the data

  • use_temp_checkpoints (bool)

Returns:

Smoothed rolling climatology with same shape as input data

Return type:

xarray.DataArray

Examples

Basic smoothed rolling climatology:

>>> import xarray as xr
>>> import marEx
>>>
>>> # Load SST data
>>> sst = xr.open_dataset('sst_data.nc', chunks={}).sst.chunk({'time': 30})
>>>
>>> # Compute smoothed rolling climatology
>>> smooth_clim = marEx.smoothed_rolling_climatology(
...     sst,
...     window_year_baseline=15,
...     smooth_days_baseline=21
... )
>>> print(smooth_clim.shape)
(7305, 180, 360)

Comparing different smoothing windows:

>>> # Short smoothing - more day-to-day variability
>>> clim_short = marEx.smoothed_rolling_climatology(
...     sst, smooth_days_baseline=7
... )
>>>
>>> # Long smoothing - smoother seasonal cycle
>>> clim_long = marEx.smoothed_rolling_climatology(
...     sst, smooth_days_baseline=61
... )
>>>
>>> # Compare variability
>>> var_short = clim_short.std(dim='time').mean().compute()
>>> var_long = clim_long.std(dim='time').mean().compute()
>>> print(f"Variability: short={var_short:.3f}, long={var_long:.3f}")

Climatology for anomaly computation:

>>> # Compute smoothed climatology then anomalies
>>> climatology = marEx.smoothed_rolling_climatology(sst)
>>> anomalies = sst - climatology
>>>
>>> # Check that anomalies have reasonable properties
>>> print(f"Anomaly mean: {anomalies.mean().compute():.6f}")
>>> print(f"Anomaly std: {anomalies.std().compute():.3f}")

Unstructured data processing:

>>> # ICON ocean data
>>> icon_sst = xr.open_dataset('icon_sst.nc', chunks={}).to.chunk({'time': 25})
>>> icon_smooth_clim = marEx.smoothed_rolling_climatology(
...     icon_sst,
...     dimensions={"time": "time", "x": "ncells"},
...     coordinates={"time": "time", "x": "lon", "y": "lat"},
...     window_year_baseline=10,
...     smooth_days_baseline=31
... )

Effect of smoothing on seasonal cycle:

>>> # Raw rolling climatology (no temporal smoothing)
>>> raw_clim = marEx.rolling_climatology(sst, window_year_baseline=15)
>>>
>>> # Smoothed rolling climatology
>>> smooth_clim = marEx.smoothed_rolling_climatology(
...     sst, window_year_baseline=15, smooth_days_baseline=21
... )
>>>
>>> # Compare seasonal cycle smoothness
>>> # Extract annual cycle for a point
>>> point_raw = raw_clim.isel(lat=90, lon=180).sel(time='2010')
>>> point_smooth = smooth_clim.isel(lat=90, lon=180).sel(time='2010')
>>>
>>> print(f"Raw climatology range: {(point_raw.max() - point_raw.min()).compute():.3f}")
>>> print(f"Smooth climatology range: {(point_smooth.max() - point_smooth.min()).compute():.3f}")

Performance considerations:

>>> # Efficient implementation smooths raw data first, then computes climatology
>>> # This is more memory-efficient than smoothing the climatology
>>> large_sst = sst.chunk({'time': 25, 'lat': 45, 'lon': 90})
>>> efficient_clim = marEx.smoothed_rolling_climatology(large_sst)
marEx.rolling_climatology(da, window_year_baseline=15, dimensions=None, coordinates=None, use_temp_checkpoints=False)[source]

Compute rolling climatology efficiently using flox cohorts. Uses the previous window_year_baseline years of data and reassemble it to match the original data structure. Years without enough previous data will be filled with NaN.

Parameters:
  • da (xarray.DataArray) – Input data with time coordinate

  • window_year_baseline (int, default=15) – Number of years to include in each climatology window

  • dimensions (dict, optional) – Mapping of dimensions to names in the data

  • coordinates (dict, optional) – Mapping of coordinates to names in the data

  • use_temp_checkpoints (bool)

Returns:

Rolling climatology with same shape as input data

Return type:

xarray.DataArray

Examples

Basic rolling climatology computation:

>>> import xarray as xr
>>> import marEx
>>>
>>> # Load 20 years of SST data
>>> sst = xr.open_dataset('sst_data.nc', chunks={}).sst.chunk({'time': 30})
>>>
>>> # Compute 15-year rolling climatology
>>> climatology = marEx.rolling_climatology(sst, window_year_baseline=15)
>>> print(climatology.shape)
(7305, 180, 360)  # Same as input
>>>
>>> # First 15 years will be NaN (insufficient history)
>>> print(f"NaN values in first year: {climatology.isel(time=slice(0, 365)).isnull().all().compute()}")
True

Shorter window for datasets with limited time span:

>>> # For datasets with only 10 years, use shorter window
>>> short_climatology = marEx.rolling_climatology(
...     sst, window_year_baseline=5
... )
>>> # First 5 years will be NaN instead of 15

Processing unstructured data:

>>> # ICON ocean model data
>>> icon_sst = xr.open_dataset('icon_sst.nc', chunks={}).to.chunk({'time': 25})
>>> icon_climatology = marEx.rolling_climatology(
...     icon_sst,
...     dimensions={"time": "time", "x": "ncells"}
...     coordinates={"time": "time", "x": "lon", "y": "lat"}
... )
>>> print(icon_climatology.dims)
Frozen({'time': 7305, 'ncells': 83886})

Comparing with fixed climatology:

>>> # Fixed climatology (traditional approach)
>>> fixed_clim = sst.groupby(sst.time.dt.dayofyear).mean()
>>>
>>> # Rolling climatology (adaptive approach)
>>> rolling_clim = marEx.rolling_climatology(sst)
>>>
>>> # Rolling climatology adapts to climate change
>>> clim_2000 = rolling_clim.sel(time='2000').mean()
>>> clim_2020 = rolling_clim.sel(time='2020').mean()
>>> print(f"Climate change signal: {(clim_2020 - clim_2000).compute():.3f} °C")

Memory considerations for large datasets:

>>> # Ensure appropriate chunking for memory efficiency
>>> large_sst = sst.chunk({'time': 30, 'lat': 45, 'lon': 90})
>>> large_climatology = marEx.rolling_climatology(large_sst)
>>> # Output maintains input chunking structure
marEx.identify_extremes(da, method_extreme='hobday_extreme', threshold_percentile=95, dimensions=None, coordinates=None, window_days_hobday=11, window_spatial_hobday=None, method_percentile='approximate', precision=0.01, max_anomaly=5.0, use_temp_checkpoints=False, verbose=None, quiet=None)[source]

Identify extreme events exceeding a percentile threshold using specified method.

Parameters:
  • da (xarray.DataArray) – DataArray containing anomalies

  • method_extreme (str, default='hobday_extreme') – Method for threshold calculation (‘global_extreme’ or ‘hobday_extreme’)

  • threshold_percentile (float, default=95) – Percentile threshold (e.g., 95 for 95th percentile)

  • dimensions (dict, optional) – Mapping of dimensions to names in the data

  • coordinates (dict, optional) – Mapping of coordinates to names in the data

  • window_days_hobday (int, default=11) – Window for day-of-year threshold (hobday_extreme only)

  • window_spatial_hobday (int, default=None) – Window for day-of-year threshold spatial clustering (hobday_extreme only)

  • method_percentile (str, default='approximate') – Method for percentile computation (‘exact’ or ‘approximate’)

  • precision (float, default=0.01) – Precision for histogram bins in approximate method

  • max_anomaly (float, default=5.0) – Maximum anomaly value for histogram binning

  • use_temp_checkpoints (bool)

  • verbose (bool | None)

  • quiet (bool | None)

Returns:

Tuple of (extremes, thresholds) where extremes is a boolean array identifying extreme events and thresholds contains the threshold values used

Return type:

tuple

Examples

Basic extreme identification with global thresholds:

>>> import xarray as xr
>>> import marEx
>>>
>>> # Load anomaly data (from compute_normalised_anomaly)
>>> anomalies = xr.open_dataset('anomalies.nc', chunks={}).dat_anomaly
>>>
>>> # Identify extreme events using global-in-time 95th percentile
>>> extremes, thresholds = marEx.identify_extremes(
...     anomalies,
...     method_extreme="global_extreme",
...     threshold_percentile=95
... )
>>> print(f"Extreme events shape: {extremes.shape}")
Extreme events shape: (1461, 180, 360)
>>> print(f"Thresholds shape: {thresholds.shape}")
Thresholds shape: (180, 360)
>>> # Count total extreme events
>>> total_extremes = extremes.sum().compute()
>>> print(f"Total extreme events: {total_extremes}")

Using day-of-year specific thresholds (cf. Hobday et al. 2016 method):

>>> # More sophisticated threshold calculation
>>> extremes_hobday, thresholds_hobday = marEx.identify_extremes(
...     anomalies,
...     method_extreme="hobday_extreme",
...     threshold_percentile=95,
...     window_days_hobday=11  # 11-day window around each day-of-year
...     window_spatial_hobday=3  # 3x3 spatial window for clustering percentile calcuation
... )
>>> print(f"Hobday thresholds shape: {thresholds_hobday.shape}")
Hobday thresholds shape: (366, 180, 360)
>>> # Compare seasonal variation in thresholds
>>> summer_threshold = thresholds_hobday.sel(dayofyear=200).mean()
>>> winter_threshold = thresholds_hobday.sel(dayofyear=50).mean()
>>> print(f"Summer vs Winter thresholds: {summer_threshold:.3f} vs {winter_threshold:.3f}")

Comparison of exact vs approximate percentile methods:

>>> # Approximate method (faster, default)
>>> extremes_approx, thresh_approx = marEx.identify_extremes(
...     anomalies, method_percentile="approximate"
... )
>>>
>>> # Exact method (slower & memory intensive)
>>> extremes_exact, thresh_exact = marEx.identify_extremes(
...     anomalies, method_percentile="exact"
... )
>>>
>>> # Compare threshold precision — ~0.005C
>>> threshold_diff = (thresh_exact - thresh_approx).std().compute()
>>> print(f"Threshold difference (exact vs approx): {threshold_diff:.6f}")

Different percentile thresholds for varying event rarity:

>>> # Conservative threshold - very extreme events only
>>> extremes_98, _ = marEx.identify_extremes(
...     anomalies, threshold_percentile=98
... )
>>>
>>> # Moderate threshold - more frequent events
>>> extremes_90, _ = marEx.identify_extremes(
...     anomalies, threshold_percentile=90
... )
>>>
>>> # Compare event frequency
>>> print(f"99th percentile events: {extremes_99.sum().compute()}")
>>> print(f"90th percentile events: {extremes_90.sum().compute()}")

Processing unstructured data:

>>> # ICON ocean model data
>>> icon_anomalies = xr.open_dataset('icon_anomalies.nc', chunks={}).dat_anomaly
>>> extremes_unstructured, thresholds_unstructured = marEx.identify_extremes(
...     icon_anomalies,
...     dimensions={"time": "time", "x": "ncells"},
...     coordinates={"time": "time", "x": "lon", "y": "lat"},
...     threshold_percentile=95
... )
>>> print(f"Unstructured extremes shape: {extremes_unstructured.shape}")

Advanced Hobday method with custom temporal window:

>>> # Longer temporal window for smoother thresholds
>>> extremes_smooth, thresholds_smooth = marEx.identify_extremes(
...     anomalies,
...     method_extreme="hobday_extreme",
...     window_days_hobday=31,  # Longer smoothing window
...     threshold_percentile=95
... )
>>>
>>> # Compare threshold smoothness
>>> std_11day = thresholds_hobday.std(dim='dayofyear').mean().compute()
>>> std_31day = thresholds_smooth.std(dim='dayofyear').mean().compute()
>>> print(f"Threshold variability: 11-day={std_11day:.3f}, 31-day={std_31day:.3f}")
class marEx.tracker(data_bin, mask, R_fill, area_filter_quartile=None, area_filter_absolute=None, temp_dir=None, T_fill=2, allow_merging=True, nn_partitioning=False, overlap_threshold=0.5, unstructured_grid=False, dimensions=None, coordinates=None, neighbours=None, cell_areas=None, grid_resolution=None, max_iteration=40, checkpoint=None, debug=0, verbose=None, quiet=None, regional_mode=False, coordinate_units=None)[source]

Bases: object

Tracker identifies and tracks arbitrary binary objects in spatial data through time.

The tracker supports both structured (regular grid) and unstructured data, and seamlessly handles splitting & merging of objects. It identifies connected regions in binary data at each time step, and tracks these as evolving events through time.

Main workflow:

  1. Preprocessing: Fill spatiotemporal holes, filter small objects

  2. Object identification: Label connected components at each time

  3. Tracking: Determine object correspondences across time

  4. Optional splitting & merging: Handle complex event evolution

Parameters:
  • data_bin (xarray.DataArray) – Binary field of extreme points to group, label, and track (True = object, False = background) Must represent and underlying dask array.

  • mask (xarray.DataArray) – Binary mask indicating valid regions (True = valid, False = invalid)

  • R_fill (int) – The radius of the kernel used in morphological opening & closing, relating to the largest hole/gap that can be filled. In units of grid cells.

  • area_filter_quartile (float, optional) – The fraction of the smallest objects to discard, i.e. the quantile defining the smallest area object retained. Quantile must be in (0-1) (e.g., 0.25 removes smallest 25%). Mutually exclusive with area_filter_absolute. Default is 0.5 if neither parameter is provided.

  • area_filter_absolute (int, optional) – The minimum area (in grid cells) for an object to be retained. Mutually exclusive with area_filter_quartile. Use this for fixed minimum area thresholds (e.g., 10 cells minimum).

  • temp_dir (str, optional) – Path to temporary directory for storing intermediate results

  • T_fill (int, default=2) – The permissible temporal gap (in days) between objects for tracking continuity to be maintained (must be even)

  • allow_merging (bool, default=True) – Allow objects to split and merge across time. Apply splitting & merging criteria, track merge events, and maintain original identities of merged objects across time. N.B.: False reverts to classical ndmeasure.label with simplar time connectivity, i.e. Scannell et al.

  • nn_partitioning (bool, default=False) – Implement a better partitioning of merged child objects based on closest parent cell. False reverts to using parent centroids to determine partitioning between new child objects, i.e. Di Sun & Bohai Zhang 2023. N.B.: Centroid-based partitioning has major problems with small merging objects suddenly obtaining unrealistically-large (and often disjoint) fractions of the larger object.

  • overlap_threshold (float, default=0.5) – The fraction of the smaller object’s area that must overlap with the larger object’s area to be considered the same event and continue tracking with the same ID.

  • unstructured_grid (bool, default=False) – Whether data is on an unstructured grid

  • dimensions (dict, default={"time": "time", "x": "lon", "y": "lat"}) – Mapping of dimensions to names in the data

  • coordinates (dict, optional) – Coordinate names for unstructured grids. Should contain ‘x’ and ‘y’ keys for x and y coordinates. May also contain ‘time’ if the time coordinate name is different from the dimension name.

  • neighbours (xarray.DataArray, optional) – For unstructured grid, indicates connectivity between cells

  • cell_areas (xarray.DataArray, optional) – For unstructured grid, area of each cell (required). For structured grid, area of each cell (optional). If not provided, defaults to 1.0 for each cell (resulting in cell counts as areas). Note: Overridden by grid_resolution if provided for structured grids.

  • grid_resolution (float, optional) – Grid resolution in degrees for structured grids only (ignored for unstructured grids). When provided, automatically calculates cell areas using spherical geometry. Overrides any provided cell_areas parameter.

  • max_iteration (int, default=40) – Maximum number of iterations for merging/splitting algorithm

  • checkpoint (str, default='None') – Checkpoint strategy (‘save’, ‘load’, or None)

  • debug (int, default=0) – Debug level (0-2)

  • verbose (bool, optional) – Enable verbose logging with detailed progress information. If None, uses current global logging configuration.

  • quiet (bool, optional) – Enable quiet logging with minimal output (warnings and errors only). If None, uses current global logging configuration. Note: quiet takes precedence over verbose if both are True.

  • regional_mode (bool, default=False) – Enable regional mode for non-global coordinate ranges. When True, coordinate_units must be specified.

  • coordinate_units (str, optional) – Coordinate units when regional_mode=True. Must be either ‘degrees’ or ‘radians’.

Examples

Basic tracking of marine heatwave events from preprocessed data:

>>> import xarray as xr
>>> import marEx
>>>
>>> # Load preprocessed extreme events data
>>> processed = xr.open_dataset('extreme_events.nc', chunks={})
>>> extreme_events = processed.extreme_events  # Boolean array
>>> mask = processed.mask  # Ocean/land mask
>>>
>>> # Initialise tracker with basic parameters
>>> tracker = marEx.tracker(
...     extreme_events,
...     mask,
...     R_fill=8,                    # Fill holes up to 8 grid cells
...     area_filter_quartile=0.5     # Remove smallest 50% of objects
...     allow_merging=False          # Basic tracking without splitting/merging
... )
>>>
>>> # Run tracking algorithm
>>> events = tracker.run()
>>> print(f"Identified {events.ID.max().compute()} distinct events")
Identified 1247 distinct events

Using automatic grid area calculation from resolution:

>>> # For regular lat/lon grids, automatically calculate physical areas
>>> grid_tracker = marEx.tracker(
...     extreme_events,
...     mask,
...     R_fill=8,
...     area_filter_quartile=0.5,
...     grid_resolution=0.25  # Grid resolution in degrees
... )
>>> # Cell areas are calculated automatically using spherical geometry
>>> grid_events = grid_tracker.run()

Advanced tracking with merging and splitting enabled:

>>> # More sophisticated tracking with temporal gap filling
>>> advanced_tracker = marEx.tracker(
...     extreme_events,
...     mask,
...     R_fill=12,               # Larger spatial gap filling
...     T_fill=4,                # Fill up to 4-day temporal gaps
...     area_filter_quartile=0.25,  # More aggressive size filtering
...     allow_merging=True,      # Enable split/merge detection
...     overlap_threshold=0.3    # Lower threshold for object linking
... )
>>>
>>> events_advanced, merges_log = advanced_tracker.run(return_merges=True)
>>> print(events_advanced.data_vars)
Data variables:
    event           (time, lat, lon)        int32           dask.array<chunksize=(25, 180, 360)>
    event_centroid  (time, lat, lon)        int32           dask.array<chunksize=(25, 180, 360)>
    ID_field        (time, lat, lon)        int32           dask.array<chunksize=(25, 180, 360)>
    global_ID       (time, ID)              int32           dask.array<chunksize=(25, 1247)>
    area            (time, ID)              float32         dask.array<chunksize=(25, 1247)>
    centroid        (component, time, ID)   float64         dask.array<chunksize=(2, 25, 1247)>
    presence        (time, ID)              bool            dask.array<chunksize=(25, 1247)>
    time_start      (ID)                    datetime64[ns]  dask.array<chunksize=(1247,)>
    time_end        (ID)                    datetime64[ns]  dask.array<chunksize=(1247,)>
    merge_ledger    (time, ID, sibling_ID)  int32           dask.array<chunksize=(25, 1247, 10)>

Processing unstructured ocean model data (ICON):

>>> # Load ICON ocean model data with connectivity
>>> icon_data = xr.open_dataset('icon_extremes.nc', chunks={})
>>> icon_extremes = icon_data.extreme_events  # (time, ncells)
>>> icon_mask = icon_data.mask
>>> neighbours = icon_data.neighbours  # Cell connectivity
>>> cell_areas = icon_data.cell_areas  # Physical areas
>>>
>>> # Track events on unstructured grid
>>> unstructured_tracker = marEx.tracker(
...     icon_extremes,
...     icon_mask,
...     R_fill=5,                                   # 5-neighbor radius for gap filling
...     area_filter_quartile=0.6,                   # Remove 60% of smallest events
...     unstructured_grid=True,                     # Enable unstructured mode
...     dimensions={"x": "ncells"},                 # Must specify the name of the spatial dimension
...     coordinates={"x": "lon", "y": "lat"},       # Spatial coordinate names
...     neighbours=neighbours,                      # Required for unstructured
...     cell_areas=cell_areas                       # Required for area calculations
... )
>>> unstructured_events = unstructured_tracker.run()

Memory management and checkpointing for large datasets:

>>> # Use checkpointing for very large datasets
>>> large_tracker = marEx.tracker(
...     extreme_events,
...     mask,
...     R_fill=8,
...     area_filter_quartile=0.5,
...     temp_dir='/scratch/user/tracking_temp',  # Temporary storage
...     checkpoint='save'             # Save intermediate results
... )
>>> # Processing can be resumed if interrupted
>>> large_events = large_tracker.run()

Comparing different filtering strategies:

>>> # Conservative filtering - keep more events
>>> conservative = marEx.tracker(
...     extreme_events, mask, R_fill=5, area_filter_quartile=0.1
... )
>>> conservative_events = conservative.run()
>>>
>>> # Aggressive filtering - focus on largest events
>>> aggressive = marEx.tracker(
...     extreme_events, mask, R_fill=15, area_filter_quartile=0.8
... )
>>> aggressive_events = aggressive.run()
>>>
>>> print(f"Conservative: {conservative_events.ID.max().compute()} events")
>>> print(f"Aggressive: {aggressive_events.ID.max().compute()} events")

Using absolute area filtering instead of percentile-based:

>>> # Filter objects smaller than 25 grid cells
>>> absolute_tracker = marEx.tracker(
...     extreme_events, mask, R_fill=8, area_filter_absolute=25
... )
>>> absolute_events = absolute_tracker.run()
>>>
>>> # Default behavior (area_filter_quartile=0.5) when no parameters provided
>>> default_tracker = marEx.tracker(extreme_events, mask, R_fill=8)
>>> default_events = default_tracker.run()  # Uses quartile=0.5 filtering

Using physical cell areas for structured grids:

>>> # Load data with irregular grid cell areas
>>> grid_areas = xr.open_dataset('grid_areas.nc').cell_area  # (lat, lon) in m²
>>>
>>> # Track events using physical areas instead of cell counts
>>> physical_tracker = marEx.tracker(
...     extreme_events,
...     mask,
...     R_fill=8,
...     area_filter_quartile=0.5,
...     cell_areas=grid_areas  # Physical areas in m²
... )
>>> events = physical_tracker.run()
>>> # Now events.area contains physical areas in m² instead of cell counts

Integration with full marEx workflow:

>>> # Complete workflow from raw data to tracked events
>>> raw_sst = xr.open_dataset('sst_data.nc', chunks={}).sst.chunk({'time': 30})
>>>
>>> # Step 1: Preprocess to identify extremes
>>> processed = marEx.preprocess_data(raw_sst, threshold_percentile=95)
>>>
>>> # Step 2: Track extreme events
>>> tracker = marEx.tracker(
...     processed.extreme_events,
...     processed.mask,
...     R_fill=8,
...     area_filter_quartile=0.5
... )
>>> tracked_events = tracker.run()

Initialise the tracker with parameters and data.

__init__(data_bin, mask, R_fill, area_filter_quartile=None, area_filter_absolute=None, temp_dir=None, T_fill=2, allow_merging=True, nn_partitioning=False, overlap_threshold=0.5, unstructured_grid=False, dimensions=None, coordinates=None, neighbours=None, cell_areas=None, grid_resolution=None, max_iteration=40, checkpoint=None, debug=0, verbose=None, quiet=None, regional_mode=False, coordinate_units=None)[source]

Initialise the tracker with parameters and data.

Parameters:
Return type:

None

run(return_merges=False, checkpoint=None)[source]

Run the complete object identification and tracking pipeline.

This method executes the full workflow:

  1. Preprocessing: morphological operations and size filtering

  2. Identification and tracking of objects through time

  3. Computing and attaching statistics to the results

Parameters:
  • return_merges (bool, default=False) – If True, return merge events dataset alongside the main events

  • checkpoint (str, optional) – Override the instance checkpoint setting

Returns:

  • events_ds (xarray.Dataset) – Dataset containing tracked events and their properties

  • merges_ds (xarray.Dataset, optional) – Dataset with merge event information (only if return_merges=True)

Return type:

Dataset | Tuple[Dataset, Dataset]

run_preprocess(checkpoint=None)[source]

Preprocess binary data to prepare for tracking.

This performs morphological operations to fill holes/gaps in both space and time, then filters small objects according to the area_filter_quartile or area_filter_absolute.

Parameters:

checkpoint (str, optional) – Checkpoint strategy override

Returns:

  • data_bin_filtered (xarray.DataArray) – Preprocessed binary data

  • object_stats (tuple) – Statistics about the preprocessing

Return type:

Tuple[DataArray, Tuple[float, int, int, float, float, float]]

run_tracking(data_bin_preprocessed)[source]

Track objects through time to identify events.

Parameters:

data_bin_preprocessed (xarray.DataArray) – Preprocessed binary data

Returns:

  • events_ds (xarray.Dataset) – Dataset containing tracked events

  • merges_ds (xarray.Dataset) – Dataset with merge information

  • N_events_final (int) – Final number of unique events

Return type:

Tuple[Dataset, Dataset, int]

run_stats_attributes(events_ds, merges_ds, object_stats, N_events_final)[source]

Add statistics and attributes to the events dataset.

Parameters:
  • events_ds (xarray.Dataset) – Dataset containing tracked events

  • merges_ds (xarray.Dataset) – Dataset with merge information

  • object_stats (tuple) – Preprocessed object statistics

  • N_events_final (int) – Final number of events

Returns:

events_ds – Dataset with added statistics and attributes

Return type:

xarray.Dataset

compute_area(data_bin)[source]

Compute the total area of binary data at each time.

Parameters:

data_bin (xarray.DataArray) – Binary data

Returns:

area – Total area at each time (units: pixels for structured grid, matching cell_area for unstructured)

Return type:

xarray.DataArray

fill_holes(data_bin, R_fill=None)[source]

Fill holes and gaps using morphological operations.

This performs closing (dilation followed by erosion) to fill small gaps, then opening (erosion followed by dilation) to remove small isolated objects.

Parameters:
  • data_bin (xarray.DataArray) – Binary data to process

  • R_fill (int, optional) – Fill radius override

Returns:

data_bin_filled – Binary data with holes/gaps filled

Return type:

xarray.DataArray

fill_time_gaps(data_bin)[source]

Fill temporal gaps between objects.

Performs binary closing (dilation then erosion) along the time dimension to fill small time gaps between objects.

Parameters:

data_bin (xarray.DataArray) – Binary data to process

Returns:

data_bin_filled – Binary data with temporal gaps filled

Return type:

xarray.DataArray

refresh_dask_graph(data_bin)[source]

Clear and reset the Dask graph via save/load cycle.

This is needed to work around a memory leak bug in Dask where “Unmanaged Memory” builds up within loops.

Parameters:

data_bin (xarray.DataArray) – Data to refresh

Returns:

data_new – Data with fresh Dask graph

Return type:

xarray.DataArray

filter_small_objects(data_bin)[source]

Remove objects smaller than a threshold area.

Parameters:

data_bin (xarray.DataArray) – Binary data to filter

Returns:

  • data_bin_filtered (xarray.DataArray) – Binary data with small objects removed

  • area_threshold (float) – Area threshold used for filtering

  • object_areas (xarray.DataArray) – Areas of all objects pre-filtering

  • N_objects_prefiltered (int) – Number of objects before filtering

  • N_objects_filtered (int) – Number of objects after filtering

Return type:

Tuple[DataArray, float, DataArray, int, int]

identify_objects(data_bin, time_connectivity)[source]

Identify connected regions in binary data.

Parameters:
  • data_bin (xarray.DataArray) – Binary data to identify objects in

  • time_connectivity (bool) – Whether to connect objects across time

Returns:

  • object_id_field (xarray.DataArray) – Field of integer IDs for each object

  • None (NoneType) – Placeholder for compatibility with track_objects

  • N_objects (int) – Number of objects identified

Return type:

Tuple[DataArray, None, int]

calculate_centroid(binary_mask, original_centroid=None)[source]

Calculate object centroid, handling edge cases for periodic boundaries.

Parameters:
  • binary_mask (numpy.ndarray) – 2D binary array where True indicates the object (dimensions are (y,x))

  • original_centroid (tuple, optional) – (y_centroid, x_centroid) from regionprops_table

Returns:

(y_centroid, x_centroid)

Return type:

tuple

calculate_object_properties(object_id_field, properties=None)[source]

Calculate properties of objects from ID field.

Parameters:
  • object_id_field (xarray.DataArray) – Field containing object IDs

  • properties (list, optional) – List of properties to calculate (defaults to [‘label’, ‘area’])

Returns:

object_props – Dataset containing calculated properties with ‘ID’ dimension

Return type:

xarray.Dataset

check_overlap_slice(ids_t0, ids_next)[source]

Find overlapping objects between two consecutive time slices.

Parameters:
Returns:

Array of shape (n_overlaps, 3) with [id_t0, id_next, overlap_area]

Return type:

numpy.ndarray

find_overlapping_objects(object_id_field)[source]

Find all overlapping objects across time.

Parameters:

object_id_field (xarray.DataArray) – Field containing object IDs

Returns:

overlap_objects_list_unique_filtered – Array of object ID pairs that overlap across time, with overlap area The object in the first column precedes the second column in time. The third column contains:

  • For structured grid: number of overlapping pixels (int32)

  • For unstructured grid: total overlapping area in m^2 (float32)

Return type:

(N x 3) numpy.ndarray

enforce_overlap_threshold(overlap_objects_list, object_props)[source]

Filter object pairs based on overlap threshold.

Parameters:
  • overlap_objects_list ((N x 3) numpy.ndarray) – Array of object ID pairs with overlap area

  • object_props (xarray.Dataset) – Object properties including area

Returns:

overlap_objects_list_filtered – Filtered array of object ID pairs that meet the overlap threshold

Return type:

(M x 3) numpy.ndarray

consolidate_object_ids(data_t_minus_2, data_t_minus_1, object_props, timestep)[source]

Consolidate object IDs between t-2 and t-1 to ensure consistent tracking.

This identifies objects at t-1 that are actually continuations of objects from t-2 (but got different IDs due to partitioning) and renames them to maintain consistent IDs across timesteps.

Parameters:
  • data_t_minus_2 (xr.DataArray) – Object field at timestep t-2

  • data_t_minus_1 (xr.DataArray) – Object field at timestep t-1 (will be modified)

  • object_props (xr.Dataset) – Object properties dataset (will be modified)

  • timestep (int) – Current timestep number for logging purposes

Returns:

  • data_t_minus_1_consolidated (xr.DataArray) – Updated t-1 field with consolidated IDs

  • object_props_updated (xr.Dataset) – Updated object properties with merged/deleted objects

Return type:

Tuple[DataArray, Dataset]

Notes

  • Uses self.overlap_threshold for determining consolidation eligibility

  • Updates object properties by recalculating for consolidated objects

  • Removes redundant child objects from object_props

compute_id_time_dict(da, child_objects, max_objects, all_objects=True)[source]

Generate lookup table mapping object IDs to their time index.

Parameters:
  • da (xarray.DataArray) – Field of object IDs

  • child_objects (list or array) – Object IDs to include in the dictionary

  • max_objects (int) – Maximum number of objects

  • all_objects (bool, default=True) – Whether to process all objects or just child_objects

Returns:

time_index_map – Dictionary mapping object IDs to time indices

Return type:

dict

track_objects(data_bin)[source]

Track objects through time to form events.

This is the main tracking method that handles splitting and merging of objects.

Parameters:

data_bin (xarray.DataArray) – Preprocessed binary data: Field of globally unique integer IDs of each element in connected regions. ID = 0 indicates no object.

Returns:

  • split_merged_events_ds (xarray.Dataset) – Dataset containing tracked events

  • merge_events (xarray.Dataset) – Dataset with merge information

  • N_events (int) – Final number of events

Return type:

Tuple[Dataset, Dataset, int]

cluster_rename_objects_and_props(object_id_field_unique, object_props, overlap_objects_list, merge_events)[source]

Cluster the object pairs and relabel to determine final event IDs.

Parameters:
  • object_id_field_unique (xarray.DataArray) – Field of unique object IDs. IDs must not be repeated across time.

  • object_props (xarray.Dataset) – Properties of each object that also need to be relabeled.

  • overlap_objects_list ((N x 2) numpy.ndarray) – Array of object ID pairs that indicate which objects are in the same event. The object in the first column precedes the second column in time.

  • merge_events (xarray.Dataset) – Information about merge events

Returns:

split_merged_events_ds – Dataset with relabeled events and their properties. ID = 0 indicates no object.

Return type:

xarray.Dataset

split_and_merge_objects(object_id_field_unique, object_props)[source]

Implement object splitting and merging logic.

This identifies and processes cases where objects split or merge over time, creating new object IDs as needed.

Parameters:
  • object_id_field_unique (xarray.DataArray) – Field of unique object IDs. IDs are required to be monotonically increasing with time.

  • object_props (xarray.Dataset) – Properties of each object

Returns:

(object_id_field, object_props, overlap_objects_list, merge_events)

Return type:

tuple

split_and_merge_objects_parallel(object_id_field_unique, object_props)[source]

Optimised parallel implementation of object splitting and merging.

This version is specifically designed for unstructured grids with more efficient memory handling and better parallelism than the standard split_and_merge_objects method. It processes data in chunks, handles merging events, and efficiently updates object IDs.

Parameters:
Returns:

(object_id_field, object_props, overlap_objects_list, merge_events)

Return type:

tuple

marEx.regional_tracker(data_bin, mask, coordinate_units, R_fill, area_filter_quartile=None, area_filter_absolute=None, **kwargs)[source]

Create a tracker instance configured for regional (non-global) data.

This is a convenience function that automatically sets regional_mode=True and requires explicit specification of coordinate units, since auto-detection may fail for regional coordinate ranges.

Parameters:
  • data_bin (xr.DataArray) – Binary data to identify and track objects in (True = object, False = background)

  • mask (xr.DataArray) – Binary mask indicating valid regions (True = valid, False = invalid)

  • coordinate_units ({'degrees', 'radians'}) – Units of the coordinate system. Must be specified for regional data.

  • R_fill (int or float) – Radius for filling holes/gaps in spatial domain (in grid cells)

  • area_filter_quartile (float, optional) – Quantile (0-1) for filtering smallest objects (e.g., 0.25 removes smallest 25%). Mutually exclusive with area_filter_absolute. Default is 0.5 if neither parameter is provided.

  • area_filter_absolute (int, optional) – The minimum area (in grid cells) for an object to be retained. Mutually exclusive with area_filter_quartile.

  • **kwargs – Additional parameters passed to the tracker class

Returns:

Configured tracker instance with regional_mode=True

Return type:

tracker

Examples

Track events in regional Mediterranean Sea data:

>>> import marEx
>>> # For regional data with degree coordinates
>>> regional_tracker = marEx.regional_tracker(
...     extreme_events,
...     mask,
...     coordinate_units='degrees',
...     R_fill=5,
...     area_filter_quartile=0.3
... )
>>> events = regional_tracker.run()

Track events in regional data with radian coordinates:

>>> # For model output with radian coordinates
>>> regional_tracker = marEx.regional_tracker(
...     extreme_events,
...     mask,
...     coordinate_units='radians',
...     R_fill=8,
...     area_filter_quartile=0.5
... )
>>> events = regional_tracker.run()

Using absolute area filtering in regional mode:

>>> # Keep only features larger than 15 grid cells
>>> absolute_regional = marEx.regional_tracker(
...     extreme_events,
...     mask,
...     coordinate_units='degrees',
...     R_fill=5,
...     area_filter_absolute=15
... )
>>> events = absolute_regional.run()
marEx.specify_grid(grid_type=None, fpath_tgrid=None, fpath_ckdtree=None)[source]

Set the global grid specification that will be used by all plotters.

Parameters:
  • grid_type (str | None) – str, either ‘gridded’ or ‘unstructured’. If specified, this will be used as the primary method to determine grid type.

  • fpath_tgrid (str | Path | None) – Path to the triangulation grid file

  • fpath_ckdtree (str | Path | None) – Path to the pre-computed KDTree indices directory

Return type:

None

class marEx.PlotConfig(title=None, var_units='', issym=False, cmap=None, cperc=None, clim=None, show_colorbar=True, grid_lines=True, grid_labels=False, dimensions=None, coordinates=None, norm=None, plot_IDs=False, extend='both', verbose=None, quiet=None, projection=None, framerate=10)[source]

Bases: object

Configuration class for plot parameters

Parameters:
title

Plot title

Type:

str | None

var_units

Variable units for colorbar label

Type:

str

issym

Whether data is symmetric (centers colormap at 0)

Type:

bool

cmap

Colormap name or ListedColormap object

Type:

str | matplotlib.colors.ListedColormap | None

cperc

Percentile range for automatic color limits [min, max]

Type:

List[int]

clim

Manual color limits (vmin, vmax)

Type:

Tuple[float, float] | None

show_colorbar

Whether to display colorbar

Type:

bool

grid_lines

Whether to display grid lines

Type:

bool

grid_labels

Whether to display grid labels

Type:

bool

dimensions

Mapping of conceptual to actual dimension names

Type:

Dict[str, str]

coordinates

Mapping of conceptual to actual coordinate names

Type:

Dict[str, str]

norm

Custom normalization (BoundaryNorm or Normalize)

Type:

matplotlib.colors.BoundaryNorm | matplotlib.colors.Normalize | None

plot_IDs

Whether to plot object IDs with random colors

Type:

bool

extend

Colorbar extension (‘neither’, ‘both’, ‘min’, ‘max’)

Type:

str

verbose

Enable verbose logging

Type:

bool | None

quiet

Enable quiet logging

Type:

bool | None

projection

Cartopy projection for map plots

Type:

Any | None

framerate

Frames per second for animations (default 10)

Type:

int

title: str | None = None
var_units: str = ''
issym: bool = False
cmap: str | ListedColormap | None = None
cperc: List[int] = None
clim: Tuple[float, float] | None = None
show_colorbar: bool = True
grid_lines: bool = True
grid_labels: bool = False
dimensions: Dict[str, str] = None
coordinates: Dict[str, str] = None
norm: BoundaryNorm | Normalize | None = None
plot_IDs: bool = False
extend: str = 'both'
verbose: bool | None = None
quiet: bool | None = None
projection: Any | None = None
framerate: int = 10
__post_init__()[source]

Initialise default values and configure logging.

Return type:

None

__init__(title=None, var_units='', issym=False, cmap=None, cperc=None, clim=None, show_colorbar=True, grid_lines=True, grid_labels=False, dimensions=None, coordinates=None, norm=None, plot_IDs=False, extend='both', verbose=None, quiet=None, projection=None, framerate=10)
Parameters:
Return type:

None

exception marEx.MarExError(message, details=None, suggestions=None, error_code=None, context=None)[source]

Bases: Exception

Base exception class for all MarEx-specific errors.

This is the root of the MarEx exception hierarchy and provides common functionality for all marEx exceptions including:

  • Structured error context

  • Exception chaining support

  • Consistent error formatting

Parameters:
  • message (str) – Primary error message describing what went wrong

  • details (str, optional) – Additional technical details about the error

  • suggestions (list of str, optional) – Actionable suggestions for resolving the error

  • error_code (str, optional) – Structured error code for programmatic handling

  • context (dict, optional) – Additional context information (e.g., parameter values, data shapes)

Initialise the Error.

__init__(message, details=None, suggestions=None, error_code=None, context=None)[source]

Initialise the Error.

Parameters:
add_suggestion(suggestion)[source]

Add an additional suggestion for resolving the error.

Parameters:

suggestion (str)

Return type:

None

add_context(key, value)[source]

Add additional context information.

Parameters:
Return type:

None

exception marEx.DataValidationError(message, details=None, suggestions=None, error_code='DATA_VALIDATION', context=None)[source]

Bases: MarExError

Raise exception for input data validation issues.

This exception covers problems with input data structure, format, content, or compatibility with marEx processing requirements.

Common scenarios:

  • Non-Dask arrays when Dask is required

  • Missing required coordinates or dimensions

  • Invalid data types or ranges

  • Incompatible chunking strategies

  • Malformed input datasets

Examples

>>> raise DataValidationError(
...     "Input DataArray must be Dask-backed",
...     details="Found numpy array, but marEx requires chunked Dask arrays",
...     suggestions=["Use da.chunk() to convert to Dask array",
                      "Load data with dask chunking: xr.open_dataset(...).chunk()"],
...     context={"data_type": type(data), "shape": data.shape}
... )

Initialise the Error.

Parameters:
__init__(message, details=None, suggestions=None, error_code='DATA_VALIDATION', context=None)[source]

Initialise the Error.

Parameters:
exception marEx.CoordinateError(message, details=None, suggestions=None, error_code='COORDINATE_ERROR', context=None)[source]

Bases: MarExError

Raise exception for coordinate system problems.

This exception handles issues with geographic coordinates including unit mismatches, invalid ranges, missing coordinate information, and coordinate system inconsistencies.

Common scenarios:

  • Latitude/longitude values outside valid ranges

  • Unit mismatches (degrees vs radians)

  • Missing coordinate dimensions

  • Inconsistent coordinate systems between datasets

  • Auto-detection failures

Initialise the Error.

Parameters:
__init__(message, details=None, suggestions=None, error_code='COORDINATE_ERROR', context=None)[source]

Initialise the Error.

Parameters:
exception marEx.ProcessingError(message, details=None, suggestions=None, error_code='PROCESSING_ERROR', context=None)[source]

Bases: MarExError

Raise exception for computational and algorithmic issues.

This exception covers problems that occur during data processing, including numerical computation errors, algorithm convergence issues, and memory/performance problems.

Common scenarios:

  • Insufficient memory for computation

  • Numerical instability or overflow

  • Algorithm convergence failures

  • Chunking strategy problems

  • Dask computation errors

Initialise the Error.

Parameters:
__init__(message, details=None, suggestions=None, error_code='PROCESSING_ERROR', context=None)[source]

Initialise the Error.

Parameters:
exception marEx.ConfigurationError(message, details=None, suggestions=None, error_code='CONFIGURATION_ERROR', context=None)[source]

Bases: MarExError

Raise exception for parameter and setup issues.

This exception handles problems with function parameters, configuration settings, and setup requirements that prevent proper operation.

Common scenarios:

  • Invalid parameter values or combinations

  • Missing required configuration

  • Incompatible parameter settings

  • Environment setup issues

Examples

>>> raise ConfigurationError(
...     "Invalid threshold percentile value",
...     details="threshold_percentile must be between 0 and 100",
...     suggestions=["Use percentile value between 50-99 for extreme events",
                      "Common values: 90 (moderate), 95 (strong), 99 (severe)"],
...     context={"provided_value": 150, "valid_range": [0, 100]}
... )

Initialise the Error.

Parameters:
__init__(message, details=None, suggestions=None, error_code='CONFIGURATION_ERROR', context=None)[source]

Initialise the Error.

Parameters:
exception marEx.DependencyError(message, details=None, suggestions=None, error_code='DEPENDENCY_ERROR', context=None)[source]

Bases: MarExError

Raise exception for missing or incompatible dependencies.

This exception handles issues with optional or required dependencies that are missing, incompatible, or incorrectly configured.

Common scenarios:

  • Missing optional dependencies (JAX, ffmpeg)

  • Version incompatibilities

  • Import failures

  • System dependency issues

Examples

>>> raise DependencyError(
...     "JAX acceleration not available",
...     details="JAX package not found or incompatible version",
...     suggestions=["Install JAX: pip install marEx[full]",
                      "Check CUDA compatibility for GPU acceleration",
                      "Processing will continue with NumPy backend"],
...     context={"requested_feature": "GPU acceleration", "available": False}
... )

Initialise the Error.

Parameters:
__init__(message, details=None, suggestions=None, error_code='DEPENDENCY_ERROR', context=None)[source]

Initialise the Error.

Parameters:
exception marEx.TrackingError(message, details=None, suggestions=None, error_code='TRACKING_ERROR', context=None)[source]

Bases: MarExError

Raise exception for object tracking and identification issues.

This exception covers problems specific to the tracking module including binary object identification, temporal linking, and merge/split handling.

Common scenarios:

  • Invalid binary input data

  • Tracking parameter conflicts

  • Temporal continuity issues

  • Memory overflow during tracking

  • Checkpoint/resume failures

Examples

>>> raise TrackingError(
...     "Tracking failed due to excessive memory usage",
...     details="Event fragmentation created >100,000 objects per timestep",
...     suggestions=["Increase area_filter_quartile to remove small events",
                      "Apply stronger spatial smoothing before tracking",
                      "Consider processing shorter time periods"],
...     context={"objects_per_timestep": 150000, "memory_limit_gb": 32}
... )

Initialise the Error.

Parameters:
__init__(message, details=None, suggestions=None, error_code='TRACKING_ERROR', context=None)[source]

Initialise the Error.

Parameters:
exception marEx.VisualisationError(message, details=None, suggestions=None, error_code='VISUALISATION_ERROR', context=None)[source]

Bases: MarExError

Raise exception for plotting and visualisation problems.

This exception handles issues with the plotX visualisation system including matplotlib configuration, cartopy projections, and animation generation.

Common scenarios:

  • Missing plotting dependencies

  • Cartopy projection issues

  • Invalid plot configuration

  • Animation encoding failures

  • Grid type detection problems

Examples

>>> raise VisualisationError(
...     "Animation creation failed",
...     details="ffmpeg encoder not found for MP4 generation",
...     suggestions=["Install ffmpeg system package",
                      "Use alternative format: save_format='gif'",
                      "Install ffmpeg via conda: conda install ffmpeg"],
...     context={"requested_format": "mp4", "available_encoders": ["png", "gif"]}
... )

Initialise the Error.

Parameters:
__init__(message, details=None, suggestions=None, error_code='VISUALISATION_ERROR', context=None)[source]

Initialise the Error.

Parameters:
marEx.create_data_validation_error(message, data_info=None, **kwargs)[source]

Create DataValidationError with common data context.

Parameters:
  • message (str) – Error message

  • data_info (dict, optional) – Dictionary with data information (type, shape, dtype, etc.)

  • **kwargs – Additional arguments passed to DataValidationError

Returns:

Configured exception with data context

Return type:

DataValidationError

marEx.create_coordinate_error(message, coordinate_ranges=None, detected_system=None, **kwargs)[source]

Create CoordinateError with coordinate context.

Parameters:
  • message (str) – Error message

  • coordinate_ranges (dict, optional) – Dictionary with coordinate ranges (e.g., {‘lat’: (-90, 90), ‘lon’: (0, 360)})

  • detected_system (str, optional) – Auto-detected coordinate system

  • **kwargs – Additional arguments passed to CoordinateError

Returns:

Configured exception with coordinate context

Return type:

CoordinateError

marEx.create_processing_error(message, computation_info=None, **kwargs)[source]

Create ProcessingError with computation context.

Parameters:
  • message (str) – Error message

  • computation_info (dict, optional) – Dictionary with computation information (memory usage, chunk sizes, etc.)

  • **kwargs – Additional arguments passed to ProcessingError

Returns:

Configured exception with computation context

Return type:

ProcessingError

marEx.wrap_exception(original_exception, message=None, marex_exception_type=None)[source]

Wrap a generic exception in an appropriate MarEx exception.

This function helps maintain backward compatibility while migrating to the new exception hierarchy by wrapping generic exceptions in appropriate MarEx-specific exceptions.

Parameters:
  • original_exception (Exception) – The original exception to wrap

  • message (str, optional) – Custom message (uses original message if not provided)

  • marex_exception_type (type, optional) – Specific MarEx exception type to use

Returns:

Wrapped exception with original as cause

Return type:

MarExError

marEx.has_dependency(dep_name)[source]

Check if a dependency is available.

Parameters:

dep_name (str)

Return type:

bool

marEx.print_dependency_status()[source]

Print status of all tracked dependencies.

Return type:

None

marEx.get_installation_profile()[source]

Get the current installation profile.

Return type:

str

marEx.configure_logging(level=None, format_str=None, date_format=None, log_file=None, console_output=True, disable_external_loggers=True, verbose=None, quiet=None)[source]

Configure logging for the MarEx package.

Parameters:
  • level (int | str | None) – Logging level (DEBUG, INFO, WARNING, ERROR, CRITICAL)

  • format_str (str | None) – Custom log format string

  • date_format (str | None) – Date format for timestamps

  • log_file (str | Path | None) – Optional file path for logging output

  • console_output (bool) – Whether to log to console

  • disable_external_loggers (bool) – Whether to suppress noisy external library logs

  • verbose (bool | None) – Enable verbose mode (detailed logging with DEBUG level)

  • quiet (bool | None) – Enable quiet mode (minimal logging with WARNING level)

Return type:

None

marEx.set_verbose_mode(verbose=True)[source]

Enable or disable verbose logging mode.

Parameters:

verbose (bool) – If True, enables verbose mode with DEBUG level logging

Return type:

None

marEx.set_quiet_mode(quiet=True)[source]

Enable or disable quiet logging mode.

Parameters:

quiet (bool) – If True, enables quiet mode with WARNING level logging

Return type:

None

marEx.set_normal_logging()[source]

Reset logging to normal mode (INFO level).

Return type:

None

marEx.get_verbosity_level()[source]

Get the current verbosity level.

Returns:

‘quiet’, ‘normal’, or ‘verbose’

Return type:

Current verbosity level

marEx.is_verbose_mode()[source]

Check if verbose mode is currently enabled.

Return type:

bool

marEx.is_quiet_mode()[source]

Check if quiet mode is currently enabled.

Return type:

bool

marEx.get_logger(name='marEx')[source]

Get a configured logger instance for MarEx.

Parameters:

name (str) – Logger name, typically the module name

Returns:

Configured logger instance

Return type:

Logger

marEx.configure_dask(scratch_dir=None, config=None)[source]

Configure Dask with appropriate settings for HPC environments.

Parameters:
  • scratch_dir (str or Path, optional) – Directory to use for temporary files.

  • config (dict, optional) – Additional Dask configuration settings to apply.

Returns:

Temporary directory object that should be kept alive while Dask is in use.

Return type:

TemporaryDirectory

Core Functions

The main entry points for using marEx are:

marEx.preprocess_data(da[, method_anomaly, ...])

Complete preprocessing pipeline for marine extreme event identification.

marEx.tracker(data_bin, mask, R_fill[, ...])

Tracker identifies and tracks arbitrary binary objects in spatial data through time.

marEx.regional_tracker(data_bin, mask, ...)

Create a tracker instance configured for regional (non-global) data.

marEx.specify_grid([grid_type, fpath_tgrid, ...])

Set the global grid specification that will be used by all plotters.

Data Preprocessing

Functions for data preprocessing, anomaly detection, and extreme event identification:

marEx.compute_normalised_anomaly(da[, ...])

Generate normalised anomalies using specified methodology.

marEx.smoothed_rolling_climatology(da[, ...])

Compute a smoothed rolling climatology using the previous window_year_baseline years of data and reassemble it to match the original data structure.

marEx.rolling_climatology(da[, ...])

Compute rolling climatology efficiently using flox cohorts.

marEx.identify_extremes(da[, ...])

Identify extreme events exceeding a percentile threshold using specified method.

Visualisation

Plotting configuration and utilities:

marEx.PlotConfig([title, var_units, issym, ...])

Configuration class for plot parameters

marEx.specify_grid([grid_type, fpath_tgrid, ...])

Set the global grid specification that will be used by all plotters.

Exception Hierarchy

MarEx provides a structured exception hierarchy for precise error handling:

marEx.MarExError(message[, details, ...])

Base exception class for all MarEx-specific errors.

marEx.DataValidationError(message[, ...])

Raise exception for input data validation issues.

marEx.CoordinateError(message[, details, ...])

Raise exception for coordinate system problems.

marEx.ProcessingError(message[, details, ...])

Raise exception for computational and algorithmic issues.

marEx.ConfigurationError(message[, details, ...])

Raise exception for parameter and setup issues.

marEx.DependencyError(message[, details, ...])

Raise exception for missing or incompatible dependencies.

marEx.TrackingError(message[, details, ...])

Raise exception for object tracking and identification issues.

marEx.VisualisationError(message[, details, ...])

Raise exception for plotting and visualisation problems.

marEx.create_data_validation_error(message)

Create DataValidationError with common data context.

marEx.create_coordinate_error(message[, ...])

Create CoordinateError with coordinate context.

marEx.create_processing_error(message[, ...])

Create ProcessingError with computation context.

marEx.wrap_exception(original_exception[, ...])

Wrap a generic exception in an appropriate MarEx exception.

Logging System

Configurable logging system for development and debugging:

marEx.configure_logging([level, format_str, ...])

Configure logging for the MarEx package.

marEx.set_verbose_mode([verbose])

Enable or disable verbose logging mode.

marEx.set_quiet_mode([quiet])

Enable or disable quiet logging mode.

marEx.set_normal_logging()

Reset logging to normal mode (INFO level).

marEx.get_verbosity_level()

Get the current verbosity level.

marEx.is_verbose_mode()

Check if verbose mode is currently enabled.

marEx.is_quiet_mode()

Check if quiet mode is currently enabled.

marEx.get_logger([name])

Get a configured logger instance for MarEx.

Dependency Management

Utilities for managing optional dependencies:

marEx.has_dependency(dep_name)

Check if a dependency is available.

marEx.print_dependency_status()

Print status of all tracked dependencies.

marEx.get_installation_profile()

Get the current installation profile.

Modules

Detection Module (marEx.detect)

MarEx-Detect: Marine Extremes Detection Module

Preprocessing toolkit for marine extremes identification from scalar oceanographic data. Converts raw time series into standardised anomalies and identifies extreme events (e.g., Marine Heatwaves using Sea Surface Temperature).

Core capabilities:

  • Two preprocessing methodologies: Detrended Baseline and Shifting Baseline

  • Two definitions for extreme events: Global Extreme and Hobday Extreme

  • Threshold-based extreme event identification

  • Efficient processing of both structured (gridded) and unstructured data

Compatible data formats:

  • Structured data: 3D arrays (time, lat, lon)

  • Unstructured data: 2D arrays (time, cell)

marEx.detect.preprocess_data(da, method_anomaly='shifting_baseline', method_extreme='hobday_extreme', threshold_percentile=95, window_year_baseline=15, smooth_days_baseline=21, window_days_hobday=11, window_spatial_hobday=None, std_normalise=False, detrend_orders=None, force_zero_mean=True, reference_period=None, method_percentile='approximate', precision=0.01, max_anomaly=5.0, dask_chunks=None, dimensions=None, coordinates=None, neighbours=None, cell_areas=None, use_temp_checkpoints=False, verbose=None, quiet=None)[source]

Complete preprocessing pipeline for marine extreme event identification.

Supports separate methods for anomaly computation and extreme identification:

Anomaly Methods:

  • ‘detrend_harmonic’: Detrending with harmonics and polynomials – more efficient, but biases statistics

  • ‘shifting_baseline’: Rolling climatology using previous window_year_baseline years – more “correct”, but shortens time series by window_year_baseline years

  • ‘fixed_baseline’: Daily climatology using full time series – does _not_ remove climate trends !

  • ‘detrend_fixed_baseline’: Polynomial detrending followed by fixed daily climatology – keeps full time-series of data, but does not account for trends in the timing of seasonal transitions (which may appear as extremes)

Extreme Methods:

  • ‘global_extreme’: Global-in-time threshold value

  • ‘hobday_extreme’: Local day-of-year specific thresholds with windowing

Parameters:
  • da (xarray.DataArray) – Raw input data

  • method_anomaly (str, default='shifting_baseline') – Anomaly computation method (‘detrend_harmonic’, ‘shifting_baseline’, ‘fixed_baseline’, or ‘detrend_fixed_baseline’).

  • method_extreme (str, default='hobday_extreme') – Extreme identification method (‘global_extreme’ or ‘hobday_extreme’).

  • threshold_percentile (float, default=95) – Percentile threshold for extreme event detection.

  • window_year_baseline (int, default=15) – Number of previous years for rolling climatology (shifting_baseline method only).

  • smooth_days_baseline (int, default=21) – Days for smoothing rolling climatology (shifting_baseline method only).

  • window_days_hobday (int, default=11) – Window size for day-of-year threshold calculation (hobday_extreme method only).

  • window_spatial_hobday (int, default=None) – Spatial window size (2D centred window) for the day-of-year threshold calculation (hobday_extreme method only).

  • std_normalise (bool, default=False) – Whether to standardise anomalies by rolling standard deviation (detrend_harmonic only).

  • detrend_orders (list, default=[1]) – Polynomial orders for detrending (detrend_harmonic method only). Default is 1st order (linear) detrend. [1,2] e.g. would use a linear+quadratic detrending.

  • force_zero_mean (bool, default=True) – Whether to enforce zero mean in detrended anomalies (detrend_harmonic method only).

  • reference_period (tuple of (int, int), optional) – Year range (start_year, end_year) inclusive for computing the daily climatology (fixed_baseline and detrend_fixed_baseline only). If None (default), uses all available years. Anomalies are computed for the full time series regardless. Example: reference_period=(1990, 2020) computes the climatology from 1990-2020 but outputs anomalies for the entire input time range.

  • method_percentile (str, default='approximate') – Method for percentile calculation (‘exact’ or ‘approximate’) for both global_extreme & hobday_extreme methods. N.B.: Using the exact percentile calculation requires both careful/thoughtful chunking & sufficient memory, in consideration of the limitations inherent to distributed parallel I/O & processing.

  • precision (float, default=0.01) – Precision for histogram bins in approximate percentile method.

  • max_anomaly (float, default=5.0) – Maximum anomaly value for histogram binning in the approximate percentile method.

  • dask_chunks (dict, optional) – Chunking specification for distributed computation.

  • dimensions (dict, default={"time": "time", "x": "lon", "y": "lat"}) – Mapping of dimensions to names in the data.

  • coordinates (dict, optional) – Mapping of coordinates to names in the data. Defaults to dimensions mapping.

  • neighbours (xarray.DataArray, optional) – Neighbour connectivity for spatial clustering.

  • cell_areas (xarray.DataArray, optional) – Cell areas for weighted spatial statistics.

  • use_temp_checkpoints (bool, default=False) – Enable checkpointing to temporary zarr stores to break Dask graph dependencies. When True, intermediate results (anomalies, thresholds, extremes) are saved to temporary zarr files and immediately reloaded, preventing expensive recomputations. Recommended for large datasets on HPC systems where the 2D histogram computation is expensive. Temporary files are automatically cleaned up after reloading.

  • verbose (bool, default=None) – Enable verbose logging with detailed progress information. If None, uses current global logging configuration.

  • quiet (bool, default=None) – Enable quiet logging with minimal output (warnings and errors only). If None, uses current global logging configuration. Note: quiet takes precedence over verbose if both are True.

Returns:

Processed dataset with anomalies and extreme event identification

Return type:

xarray.Dataset

Examples

Basic usage with gridded SST data for marine heatwave detection:

>>> import xarray as xr
>>> import marEx
>>>
>>> # Load and chunk SST data
>>> sst = xr.open_dataset('sst_data.nc', chunks={}).sst.chunk({'time': 30})
>>>
>>> # Basic preprocessing with default shifting baseline method
>>> result = marEx.preprocess_data(sst, threshold_percentile=90)
>>> print(result)
<xarray.Dataset>
Dimensions:         (time: 1461, lat: 180, lon: 360)
Data variables:
    dat_anomaly     (time, lat, lon) float32 dask.array<chunksize=(30, 180, 360)>
    mask            (lat, lon) bool dask.array<chunksize=(180, 360)>
    extreme_events  (time, lat, lon) bool dask.array<chunksize=(30, 180, 360)>
    thresholds      (lat, lon) float32 dask.array<chunksize=(180, 360)>
>>> # Check which locations have extreme events
>>> print(f"Total extreme events: {result.extreme_events.sum().compute()}")
Total extreme events: 15847

Using shifting baseline method for more accurate climatology:

>>> # Requires at least 15 years of data by default
>>> result_shifting = marEx.preprocess_data(
...     sst,
...     method_anomaly="shifting_baseline",
...     window_year_baseline=10,  # Use shorter window if needed
...     smooth_days_baseline=31   # Longer smoothing window
... )
>>> # Note: First 10 years will be removed from output

Using Hobday extreme method with day-of-year specific thresholds:

>>> result_hobday = marEx.preprocess_data(
...     sst,
...     method_extreme="hobday_extreme",
...     window_days_hobday=11,  # 11-day window for each day-of-year
...     threshold_percentile=95
... )
>>> print(result_hobday.thresholds.dims)
('dayofyear', 'lat', 'lon')

Previous configuration (marEx v2.0 default) with polynomial detrending and standardisation:

>>> result_advanced = marEx.preprocess_data(
...     sst,
...     method_anomaly="detrend_harmonic",
...     detrend_orders=[1, 2],  # Linear and quadratic trends
...     std_normalise=True,     # Standardise by rolling std
...     force_zero_mean=True,
...     threshold_percentile=95
... )
>>> # Result includes both raw and standardised anomalies
>>> print('dat_stn' in result_advanced)
True

Processing unstructured data:

>>> # For ICON ocean model data
>>> icon_sst = xr.open_dataset('icon_sst.nc', chunks={}).to.chunk({'time': 50})
>>> result_unstructured = marEx.preprocess_data(
...     icon_sst,
...     dimensions={"x": "ncells"},   # Must specify the name of the spatial dimension
...     dask_chunks={"time": 50}
... )

Error handling - insufficient data for shifting baseline:

>>> short_data = sst.isel(time=slice(0, 1000))  # Only ~3 years
>>> try:
...     result = marEx.preprocess_data(
...         short_data,
...         method_anomaly="shifting_baseline",
...         window_year_baseline=15
...     )
... except ValueError as e:
...     print(f"Error: {e}")
Error: Insufficient data for shifting_baseline method. Dataset spans 3 years but window_year_baseline
requires at least 15 years.

Performance considerations with chunking:

>>> # For large datasets, adjust chunking for memory management
>>> large_sst = sst.chunk({"time": 25, "lat": 90, "lon": 180})
>>> result = marEx.preprocess_data(
...     large_sst,
...     dask_chunks={"time": 25},
...     method_percentile="approximate"  # Use approximate method (Default) for long time-series calculations
... )

Integration with tracking workflow:

>>> # Preprocess data then track events
>>> processed = marEx.preprocess_data(sst, threshold_percentile=95)
>>> tracker = marEx.tracker(
...     processed.extreme_events,
...     processed.mask,
...     R_fill=8,
...     area_filter_quartile=0.5
... )
>>> events = tracker.run()
>>> print(f"Identified {events.event.max().compute()} distinct events")

Simple fixed baseline approach:

>>> # Basic daily climatology across all years
>>> result_fixed = marEx.preprocess_data(
...     sst,
...     method_anomaly="fixed_baseline",
...     threshold_percentile=95
... )
>>> # Uses all available data for climatology computation

Combined trend removal and fixed climatology:

>>> # Remove long-term trends then compute daily climatology
>>> result_combined = marEx.preprocess_data(
...     sst,
...     method_anomaly="detrend_fixed_baseline",
...     detrend_orders=[1],  # Linear trend
...     threshold_percentile=95,
...     force_zero_mean=True
... )
>>> # Balances trend removal with simple climatology
marEx.detect.compute_normalised_anomaly(da, method_anomaly='shifting_baseline', dimensions=None, coordinates=None, window_year_baseline=15, smooth_days_baseline=21, std_normalise=False, detrend_orders=None, force_zero_mean=True, reference_period=None, use_temp_checkpoints=False, verbose=None, quiet=None)[source]

Generate normalised anomalies using specified methodology.

Parameters:
  • da (xarray.DataArray) – Input data with dimensions matching the ‘dimensions’ parameter

  • method_anomaly (str, default='shifting_baseline') –

    Anomaly computation method. Options: - ‘detrend_harmonic’: Detrending with harmonics and polynomials (efficient, biased) - ‘shifting_baseline’: Rolling climatology (accurate, shortens time series) - ‘fixed_baseline’: Daily climatology using full time series (keeps long-term trends in the anomaly) - ‘detrend_fixed_baseline’: Polynomial detrending + fixed climatology (does not shorten time series,

    keeps trends in seasonal timing in the anomaly)

  • dimensions (dict, optional) – Mapping of conceptual dimensions to actual dimension names in the data

  • coordinates (dict, optional) – Mapping of conceptual coordinates to actual coordinate names in the data

  • window_year_baseline (int, default=15) – Number of years for rolling climatology (shifting_baseline only)

  • smooth_days_baseline (int, default=21) – Days for smoothing rolling climatology (shifting_baseline only)

  • std_normalise (bool, default=False) – Whether to normalise by 30-day rolling standard deviation (detrend_harmonic only)

  • detrend_orders (list, default=[1]) – Polynomial orders for trend removal (detrend_harmonic and detrend_fixed_baseline only)

  • force_zero_mean (bool, default=True) – Explicitly enforce zero mean in final anomalies (detrend_harmonic and detrend_fixed_baseline only)

  • reference_period (tuple of (int, int), optional) – Year range (start_year, end_year) inclusive for computing the daily climatology (fixed_baseline and detrend_fixed_baseline only). If None (default), uses all available years. Anomalies are computed for the full time series regardless.

  • use_temp_checkpoints (bool)

  • verbose (bool | None)

  • quiet (bool | None)

Returns:

Dataset containing anomalies, mask, and metadata

Return type:

xarray.Dataset

Examples

Basic detrended baseline anomaly computation:

>>> import xarray as xr
>>> import marEx
>>>
>>> # Load chunked SST data
>>> sst = xr.open_dataset('sst_data.nc', chunks={}).sst.chunk({'time': 30})
>>>
>>> # Compute anomalies using shifting baseline (default)
>>> result = marEx.compute_normalised_anomaly(sst)
>>> print(result.data_vars)
Data variables:
    dat_anomaly  (time, lat, lon) float32 dask.array<chunksize=(30, 180, 360)>
    mask         (lat, lon) bool dask.array<chunksize=(180, 360)>
>>> # Check that anomalies have approximately zero mean
>>> print(f"Mean anomaly: {result.dat_anomaly.mean().compute():.6f}")
Mean anomaly: 0.000023

Previous configuration (marEx v2.0 default) of detrended baseline with higher-order polynomials and standardisation. Note: marEx v3.0+ uses shifting_baseline as the default method:

>>> result_advanced = marEx.compute_normalised_anomaly(
...     sst,
...     method_anomaly="detrend_harmonic",
...     detrend_orders=[1, 2, 3],  # Linear, quadratic, cubic trends
...     std_normalise=True,        # Add standardised anomalies
...     force_zero_mean=True
... )
>>> print(result_advanced.data_vars)
Data variables:
    dat_anomaly  (time, lat, lon) float32 dask.array<chunksize=(30, 180, 360)>
    mask         (lat, lon) bool dask.array<chunksize=(180, 360)>
    dat_stn      (time, lat, lon) float32 dask.array<chunksize=(30, 180, 360)>
    STD          (dayofyear, lat, lon) float32 dask.array<chunksize=(366, 180, 360)>
>>> # Standardised anomalies have unit variance
>>> print(f"STD of standardised anomalies: {result_advanced.dat_stn.std().compute():.3f}")

Accurate shifting baseline method for climate-aware anomalies:

>>> result_shifting = marEx.compute_normalised_anomaly(
...     sst,
...     method_anomaly="shifting_baseline",
...     window_year_baseline=10,   # Use 10-year rolling climatology
...     smooth_days_baseline=31    # 31-day smoothing window
... )
>>> # Anomalies computed relative to recent past climatology

Processing unstructured data:

>>> # ICON ocean model with ncells dimension
>>> icon_data = xr.open_dataset('icon_sst.nc', chunks={}).to.chunk({'time': 25})
>>> result_unstructured = marEx.compute_normalised_anomaly(
...     icon_data,
...     dimensions={"time": "time", "x": "ncells"}
...     coordinates={"time": "time", "x": "lon", "y": "lat"},
... )
>>> print(result_unstructured.dims)
Frozen({'time': 1461, 'ncells': 83886})

Comparison of methods - detrended vs shifting baseline:

>>> # Detrended baseline - faster, slight bias
>>> detrended = marEx.compute_normalised_anomaly(
...     sst, method_anomaly="detrend_harmonic"
... )
>>>
>>> # Shifting baseline - slower, more accurate
>>> shifting = marEx.compute_normalised_anomaly(
...     sst, method_anomaly="shifting_baseline",
...     window_year_baseline=15
... )
>>>
>>> # Compare anomaly magnitudes
>>> print(f"Detrended RMS: {detrended.dat_anomaly.std().compute():.3f}")
>>> print(f"Shifting RMS: {shifting.dat_anomaly.std().compute():.3f}")

Fixed baseline climatology:

>>> # Use full time series for daily climatology
>>> result_fixed = marEx.compute_normalised_anomaly(
...     sst,
...     method_anomaly="fixed_baseline"
... )
>>> # Climatology computed from all available years

Fixed baseline with a restricted reference period:

>>> # Compute climatology from 1990-2020 only, but output anomalies for all years
>>> result_ref = marEx.compute_normalised_anomaly(
...     sst,
...     method_anomaly="fixed_baseline",
...     reference_period=(1990, 2020)
... )

Fixed detrended baseline:

>>> # Remove long-term trends then compute fixed climatology
>>> result_fixed_detrended = marEx.compute_normalised_anomaly(
...     sst,
...     method_anomaly="detrend_fixed_baseline",
...     detrend_orders=[1],  # Remove linear trend
...     force_zero_mean=True
... )
>>> # Combines trend removal with fixed climatology
marEx.detect.identify_extremes(da, method_extreme='hobday_extreme', threshold_percentile=95, dimensions=None, coordinates=None, window_days_hobday=11, window_spatial_hobday=None, method_percentile='approximate', precision=0.01, max_anomaly=5.0, use_temp_checkpoints=False, verbose=None, quiet=None)[source]

Identify extreme events exceeding a percentile threshold using specified method.

Parameters:
  • da (xarray.DataArray) – DataArray containing anomalies

  • method_extreme (str, default='hobday_extreme') – Method for threshold calculation (‘global_extreme’ or ‘hobday_extreme’)

  • threshold_percentile (float, default=95) – Percentile threshold (e.g., 95 for 95th percentile)

  • dimensions (dict, optional) – Mapping of dimensions to names in the data

  • coordinates (dict, optional) – Mapping of coordinates to names in the data

  • window_days_hobday (int, default=11) – Window for day-of-year threshold (hobday_extreme only)

  • window_spatial_hobday (int, default=None) – Window for day-of-year threshold spatial clustering (hobday_extreme only)

  • method_percentile (str, default='approximate') – Method for percentile computation (‘exact’ or ‘approximate’)

  • precision (float, default=0.01) – Precision for histogram bins in approximate method

  • max_anomaly (float, default=5.0) – Maximum anomaly value for histogram binning

  • use_temp_checkpoints (bool)

  • verbose (bool | None)

  • quiet (bool | None)

Returns:

Tuple of (extremes, thresholds) where extremes is a boolean array identifying extreme events and thresholds contains the threshold values used

Return type:

tuple

Examples

Basic extreme identification with global thresholds:

>>> import xarray as xr
>>> import marEx
>>>
>>> # Load anomaly data (from compute_normalised_anomaly)
>>> anomalies = xr.open_dataset('anomalies.nc', chunks={}).dat_anomaly
>>>
>>> # Identify extreme events using global-in-time 95th percentile
>>> extremes, thresholds = marEx.identify_extremes(
...     anomalies,
...     method_extreme="global_extreme",
...     threshold_percentile=95
... )
>>> print(f"Extreme events shape: {extremes.shape}")
Extreme events shape: (1461, 180, 360)
>>> print(f"Thresholds shape: {thresholds.shape}")
Thresholds shape: (180, 360)
>>> # Count total extreme events
>>> total_extremes = extremes.sum().compute()
>>> print(f"Total extreme events: {total_extremes}")

Using day-of-year specific thresholds (cf. Hobday et al. 2016 method):

>>> # More sophisticated threshold calculation
>>> extremes_hobday, thresholds_hobday = marEx.identify_extremes(
...     anomalies,
...     method_extreme="hobday_extreme",
...     threshold_percentile=95,
...     window_days_hobday=11  # 11-day window around each day-of-year
...     window_spatial_hobday=3  # 3x3 spatial window for clustering percentile calcuation
... )
>>> print(f"Hobday thresholds shape: {thresholds_hobday.shape}")
Hobday thresholds shape: (366, 180, 360)
>>> # Compare seasonal variation in thresholds
>>> summer_threshold = thresholds_hobday.sel(dayofyear=200).mean()
>>> winter_threshold = thresholds_hobday.sel(dayofyear=50).mean()
>>> print(f"Summer vs Winter thresholds: {summer_threshold:.3f} vs {winter_threshold:.3f}")

Comparison of exact vs approximate percentile methods:

>>> # Approximate method (faster, default)
>>> extremes_approx, thresh_approx = marEx.identify_extremes(
...     anomalies, method_percentile="approximate"
... )
>>>
>>> # Exact method (slower & memory intensive)
>>> extremes_exact, thresh_exact = marEx.identify_extremes(
...     anomalies, method_percentile="exact"
... )
>>>
>>> # Compare threshold precision — ~0.005C
>>> threshold_diff = (thresh_exact - thresh_approx).std().compute()
>>> print(f"Threshold difference (exact vs approx): {threshold_diff:.6f}")

Different percentile thresholds for varying event rarity:

>>> # Conservative threshold - very extreme events only
>>> extremes_98, _ = marEx.identify_extremes(
...     anomalies, threshold_percentile=98
... )
>>>
>>> # Moderate threshold - more frequent events
>>> extremes_90, _ = marEx.identify_extremes(
...     anomalies, threshold_percentile=90
... )
>>>
>>> # Compare event frequency
>>> print(f"99th percentile events: {extremes_99.sum().compute()}")
>>> print(f"90th percentile events: {extremes_90.sum().compute()}")

Processing unstructured data:

>>> # ICON ocean model data
>>> icon_anomalies = xr.open_dataset('icon_anomalies.nc', chunks={}).dat_anomaly
>>> extremes_unstructured, thresholds_unstructured = marEx.identify_extremes(
...     icon_anomalies,
...     dimensions={"time": "time", "x": "ncells"},
...     coordinates={"time": "time", "x": "lon", "y": "lat"},
...     threshold_percentile=95
... )
>>> print(f"Unstructured extremes shape: {extremes_unstructured.shape}")

Advanced Hobday method with custom temporal window:

>>> # Longer temporal window for smoother thresholds
>>> extremes_smooth, thresholds_smooth = marEx.identify_extremes(
...     anomalies,
...     method_extreme="hobday_extreme",
...     window_days_hobday=31,  # Longer smoothing window
...     threshold_percentile=95
... )
>>>
>>> # Compare threshold smoothness
>>> std_11day = thresholds_hobday.std(dim='dayofyear').mean().compute()
>>> std_31day = thresholds_smooth.std(dim='dayofyear').mean().compute()
>>> print(f"Threshold variability: 11-day={std_11day:.3f}, 31-day={std_31day:.3f}")
marEx.detect.rolling_climatology(da, window_year_baseline=15, dimensions=None, coordinates=None, use_temp_checkpoints=False)[source]

Compute rolling climatology efficiently using flox cohorts. Uses the previous window_year_baseline years of data and reassemble it to match the original data structure. Years without enough previous data will be filled with NaN.

Parameters:
  • da (xarray.DataArray) – Input data with time coordinate

  • window_year_baseline (int, default=15) – Number of years to include in each climatology window

  • dimensions (dict, optional) – Mapping of dimensions to names in the data

  • coordinates (dict, optional) – Mapping of coordinates to names in the data

  • use_temp_checkpoints (bool)

Returns:

Rolling climatology with same shape as input data

Return type:

xarray.DataArray

Examples

Basic rolling climatology computation:

>>> import xarray as xr
>>> import marEx
>>>
>>> # Load 20 years of SST data
>>> sst = xr.open_dataset('sst_data.nc', chunks={}).sst.chunk({'time': 30})
>>>
>>> # Compute 15-year rolling climatology
>>> climatology = marEx.rolling_climatology(sst, window_year_baseline=15)
>>> print(climatology.shape)
(7305, 180, 360)  # Same as input
>>>
>>> # First 15 years will be NaN (insufficient history)
>>> print(f"NaN values in first year: {climatology.isel(time=slice(0, 365)).isnull().all().compute()}")
True

Shorter window for datasets with limited time span:

>>> # For datasets with only 10 years, use shorter window
>>> short_climatology = marEx.rolling_climatology(
...     sst, window_year_baseline=5
... )
>>> # First 5 years will be NaN instead of 15

Processing unstructured data:

>>> # ICON ocean model data
>>> icon_sst = xr.open_dataset('icon_sst.nc', chunks={}).to.chunk({'time': 25})
>>> icon_climatology = marEx.rolling_climatology(
...     icon_sst,
...     dimensions={"time": "time", "x": "ncells"}
...     coordinates={"time": "time", "x": "lon", "y": "lat"}
... )
>>> print(icon_climatology.dims)
Frozen({'time': 7305, 'ncells': 83886})

Comparing with fixed climatology:

>>> # Fixed climatology (traditional approach)
>>> fixed_clim = sst.groupby(sst.time.dt.dayofyear).mean()
>>>
>>> # Rolling climatology (adaptive approach)
>>> rolling_clim = marEx.rolling_climatology(sst)
>>>
>>> # Rolling climatology adapts to climate change
>>> clim_2000 = rolling_clim.sel(time='2000').mean()
>>> clim_2020 = rolling_clim.sel(time='2020').mean()
>>> print(f"Climate change signal: {(clim_2020 - clim_2000).compute():.3f} °C")

Memory considerations for large datasets:

>>> # Ensure appropriate chunking for memory efficiency
>>> large_sst = sst.chunk({'time': 30, 'lat': 45, 'lon': 90})
>>> large_climatology = marEx.rolling_climatology(large_sst)
>>> # Output maintains input chunking structure
marEx.detect.smoothed_rolling_climatology(da, window_year_baseline=15, smooth_days_baseline=21, dimensions=None, coordinates=None, use_temp_checkpoints=False)[source]

Compute a smoothed rolling climatology using the previous window_year_baseline years of data and reassemble it to match the original data structure. Years without enough previous data will be filled with NaN.

Parameters:
  • da (xarray.DataArray) – Input data with time coordinate

  • window_year_baseline (int, default=15) – Number of years to include in each climatology window

  • smooth_days_baseline (int, default=21) – Number of days for temporal smoothing window

  • dimensions (dict, optional) – Mapping of dimensions to names in the data

  • coordinates (dict, optional) – Mapping of coordinates to names in the data

  • use_temp_checkpoints (bool)

Returns:

Smoothed rolling climatology with same shape as input data

Return type:

xarray.DataArray

Examples

Basic smoothed rolling climatology:

>>> import xarray as xr
>>> import marEx
>>>
>>> # Load SST data
>>> sst = xr.open_dataset('sst_data.nc', chunks={}).sst.chunk({'time': 30})
>>>
>>> # Compute smoothed rolling climatology
>>> smooth_clim = marEx.smoothed_rolling_climatology(
...     sst,
...     window_year_baseline=15,
...     smooth_days_baseline=21
... )
>>> print(smooth_clim.shape)
(7305, 180, 360)

Comparing different smoothing windows:

>>> # Short smoothing - more day-to-day variability
>>> clim_short = marEx.smoothed_rolling_climatology(
...     sst, smooth_days_baseline=7
... )
>>>
>>> # Long smoothing - smoother seasonal cycle
>>> clim_long = marEx.smoothed_rolling_climatology(
...     sst, smooth_days_baseline=61
... )
>>>
>>> # Compare variability
>>> var_short = clim_short.std(dim='time').mean().compute()
>>> var_long = clim_long.std(dim='time').mean().compute()
>>> print(f"Variability: short={var_short:.3f}, long={var_long:.3f}")

Climatology for anomaly computation:

>>> # Compute smoothed climatology then anomalies
>>> climatology = marEx.smoothed_rolling_climatology(sst)
>>> anomalies = sst - climatology
>>>
>>> # Check that anomalies have reasonable properties
>>> print(f"Anomaly mean: {anomalies.mean().compute():.6f}")
>>> print(f"Anomaly std: {anomalies.std().compute():.3f}")

Unstructured data processing:

>>> # ICON ocean data
>>> icon_sst = xr.open_dataset('icon_sst.nc', chunks={}).to.chunk({'time': 25})
>>> icon_smooth_clim = marEx.smoothed_rolling_climatology(
...     icon_sst,
...     dimensions={"time": "time", "x": "ncells"},
...     coordinates={"time": "time", "x": "lon", "y": "lat"},
...     window_year_baseline=10,
...     smooth_days_baseline=31
... )

Effect of smoothing on seasonal cycle:

>>> # Raw rolling climatology (no temporal smoothing)
>>> raw_clim = marEx.rolling_climatology(sst, window_year_baseline=15)
>>>
>>> # Smoothed rolling climatology
>>> smooth_clim = marEx.smoothed_rolling_climatology(
...     sst, window_year_baseline=15, smooth_days_baseline=21
... )
>>>
>>> # Compare seasonal cycle smoothness
>>> # Extract annual cycle for a point
>>> point_raw = raw_clim.isel(lat=90, lon=180).sel(time='2010')
>>> point_smooth = smooth_clim.isel(lat=90, lon=180).sel(time='2010')
>>>
>>> print(f"Raw climatology range: {(point_raw.max() - point_raw.min()).compute():.3f}")
>>> print(f"Smooth climatology range: {(point_smooth.max() - point_smooth.min()).compute():.3f}")

Performance considerations:

>>> # Efficient implementation smooths raw data first, then computes climatology
>>> # This is more memory-efficient than smoothing the climatology
>>> large_sst = sst.chunk({'time': 25, 'lat': 45, 'lon': 90})
>>> efficient_clim = marEx.smoothed_rolling_climatology(large_sst)
marEx.detect.add_decimal_year(da, dim='time', coord=None)[source]

Add decimal year coordinate to DataArray for trend analysis.

Parameters:
  • da (xarray.DataArray) – Input data with datetime coordinate

  • dim (str, optional) – Name of the time dimension

  • coord (str, optional) – Name of the time coordinate (if different from dimension name)

Returns:

Input data with added ‘decimal_year’ coordinate

Return type:

xarray.DataArray

Functions

marEx.detect.preprocess_data(da[, ...])

Complete preprocessing pipeline for marine extreme event identification.

marEx.detect.compute_normalised_anomaly(da)

Generate normalised anomalies using specified methodology.

marEx.detect.smoothed_rolling_climatology(da)

Compute a smoothed rolling climatology using the previous window_year_baseline years of data and reassemble it to match the original data structure.

marEx.detect.rolling_climatology(da[, ...])

Compute rolling climatology efficiently using flox cohorts.

marEx.detect.identify_extremes(da[, ...])

Identify extreme events exceeding a percentile threshold using specified method.

Tracking Module (marEx.track)

MarEx-Track: Marine Extreme Event Identification, Tracking, and Splitting/Merging Module

MarEx identifies and tracks extreme events in oceanographic data across time, supporting both structured (regular grid) and unstructured datasets. It can identify discrete objects at single time points and track them as evolving events through time, seamlessly handling splitting and merging.

This package provides algorithms to:

  • Identify binary objects in spatial data at each time step

  • Track these objects across time to form coherent events

  • Handle merging and splitting of objects over time

  • Calculate and maintain object/event properties through time

  • Filter by size criteria to focus on significant events

Key terminology:

  • Object: A connected region in binary data at a single time point

  • Event: One or more objects tracked through time and identified as the same entity

class marEx.track.tracker(data_bin, mask, R_fill, area_filter_quartile=None, area_filter_absolute=None, temp_dir=None, T_fill=2, allow_merging=True, nn_partitioning=False, overlap_threshold=0.5, unstructured_grid=False, dimensions=None, coordinates=None, neighbours=None, cell_areas=None, grid_resolution=None, max_iteration=40, checkpoint=None, debug=0, verbose=None, quiet=None, regional_mode=False, coordinate_units=None)[source]

Bases: object

Tracker identifies and tracks arbitrary binary objects in spatial data through time.

The tracker supports both structured (regular grid) and unstructured data, and seamlessly handles splitting & merging of objects. It identifies connected regions in binary data at each time step, and tracks these as evolving events through time.

Main workflow:

  1. Preprocessing: Fill spatiotemporal holes, filter small objects

  2. Object identification: Label connected components at each time

  3. Tracking: Determine object correspondences across time

  4. Optional splitting & merging: Handle complex event evolution

Parameters:
  • data_bin (xarray.DataArray) – Binary field of extreme points to group, label, and track (True = object, False = background) Must represent and underlying dask array.

  • mask (xarray.DataArray) – Binary mask indicating valid regions (True = valid, False = invalid)

  • R_fill (int) – The radius of the kernel used in morphological opening & closing, relating to the largest hole/gap that can be filled. In units of grid cells.

  • area_filter_quartile (float, optional) – The fraction of the smallest objects to discard, i.e. the quantile defining the smallest area object retained. Quantile must be in (0-1) (e.g., 0.25 removes smallest 25%). Mutually exclusive with area_filter_absolute. Default is 0.5 if neither parameter is provided.

  • area_filter_absolute (int, optional) – The minimum area (in grid cells) for an object to be retained. Mutually exclusive with area_filter_quartile. Use this for fixed minimum area thresholds (e.g., 10 cells minimum).

  • temp_dir (str, optional) – Path to temporary directory for storing intermediate results

  • T_fill (int, default=2) – The permissible temporal gap (in days) between objects for tracking continuity to be maintained (must be even)

  • allow_merging (bool, default=True) – Allow objects to split and merge across time. Apply splitting & merging criteria, track merge events, and maintain original identities of merged objects across time. N.B.: False reverts to classical ndmeasure.label with simplar time connectivity, i.e. Scannell et al.

  • nn_partitioning (bool, default=False) – Implement a better partitioning of merged child objects based on closest parent cell. False reverts to using parent centroids to determine partitioning between new child objects, i.e. Di Sun & Bohai Zhang 2023. N.B.: Centroid-based partitioning has major problems with small merging objects suddenly obtaining unrealistically-large (and often disjoint) fractions of the larger object.

  • overlap_threshold (float, default=0.5) – The fraction of the smaller object’s area that must overlap with the larger object’s area to be considered the same event and continue tracking with the same ID.

  • unstructured_grid (bool, default=False) – Whether data is on an unstructured grid

  • dimensions (dict, default={"time": "time", "x": "lon", "y": "lat"}) – Mapping of dimensions to names in the data

  • coordinates (dict, optional) – Coordinate names for unstructured grids. Should contain ‘x’ and ‘y’ keys for x and y coordinates. May also contain ‘time’ if the time coordinate name is different from the dimension name.

  • neighbours (xarray.DataArray, optional) – For unstructured grid, indicates connectivity between cells

  • cell_areas (xarray.DataArray, optional) – For unstructured grid, area of each cell (required). For structured grid, area of each cell (optional). If not provided, defaults to 1.0 for each cell (resulting in cell counts as areas). Note: Overridden by grid_resolution if provided for structured grids.

  • grid_resolution (float, optional) – Grid resolution in degrees for structured grids only (ignored for unstructured grids). When provided, automatically calculates cell areas using spherical geometry. Overrides any provided cell_areas parameter.

  • max_iteration (int, default=40) – Maximum number of iterations for merging/splitting algorithm

  • checkpoint (str, default='None') – Checkpoint strategy (‘save’, ‘load’, or None)

  • debug (int, default=0) – Debug level (0-2)

  • verbose (bool, optional) – Enable verbose logging with detailed progress information. If None, uses current global logging configuration.

  • quiet (bool, optional) – Enable quiet logging with minimal output (warnings and errors only). If None, uses current global logging configuration. Note: quiet takes precedence over verbose if both are True.

  • regional_mode (bool, default=False) – Enable regional mode for non-global coordinate ranges. When True, coordinate_units must be specified.

  • coordinate_units (str, optional) – Coordinate units when regional_mode=True. Must be either ‘degrees’ or ‘radians’.

Examples

Basic tracking of marine heatwave events from preprocessed data:

>>> import xarray as xr
>>> import marEx
>>>
>>> # Load preprocessed extreme events data
>>> processed = xr.open_dataset('extreme_events.nc', chunks={})
>>> extreme_events = processed.extreme_events  # Boolean array
>>> mask = processed.mask  # Ocean/land mask
>>>
>>> # Initialise tracker with basic parameters
>>> tracker = marEx.tracker(
...     extreme_events,
...     mask,
...     R_fill=8,                    # Fill holes up to 8 grid cells
...     area_filter_quartile=0.5     # Remove smallest 50% of objects
...     allow_merging=False          # Basic tracking without splitting/merging
... )
>>>
>>> # Run tracking algorithm
>>> events = tracker.run()
>>> print(f"Identified {events.ID.max().compute()} distinct events")
Identified 1247 distinct events

Using automatic grid area calculation from resolution:

>>> # For regular lat/lon grids, automatically calculate physical areas
>>> grid_tracker = marEx.tracker(
...     extreme_events,
...     mask,
...     R_fill=8,
...     area_filter_quartile=0.5,
...     grid_resolution=0.25  # Grid resolution in degrees
... )
>>> # Cell areas are calculated automatically using spherical geometry
>>> grid_events = grid_tracker.run()

Advanced tracking with merging and splitting enabled:

>>> # More sophisticated tracking with temporal gap filling
>>> advanced_tracker = marEx.tracker(
...     extreme_events,
...     mask,
...     R_fill=12,               # Larger spatial gap filling
...     T_fill=4,                # Fill up to 4-day temporal gaps
...     area_filter_quartile=0.25,  # More aggressive size filtering
...     allow_merging=True,      # Enable split/merge detection
...     overlap_threshold=0.3    # Lower threshold for object linking
... )
>>>
>>> events_advanced, merges_log = advanced_tracker.run(return_merges=True)
>>> print(events_advanced.data_vars)
Data variables:
    event           (time, lat, lon)        int32           dask.array<chunksize=(25, 180, 360)>
    event_centroid  (time, lat, lon)        int32           dask.array<chunksize=(25, 180, 360)>
    ID_field        (time, lat, lon)        int32           dask.array<chunksize=(25, 180, 360)>
    global_ID       (time, ID)              int32           dask.array<chunksize=(25, 1247)>
    area            (time, ID)              float32         dask.array<chunksize=(25, 1247)>
    centroid        (component, time, ID)   float64         dask.array<chunksize=(2, 25, 1247)>
    presence        (time, ID)              bool            dask.array<chunksize=(25, 1247)>
    time_start      (ID)                    datetime64[ns]  dask.array<chunksize=(1247,)>
    time_end        (ID)                    datetime64[ns]  dask.array<chunksize=(1247,)>
    merge_ledger    (time, ID, sibling_ID)  int32           dask.array<chunksize=(25, 1247, 10)>

Processing unstructured ocean model data (ICON):

>>> # Load ICON ocean model data with connectivity
>>> icon_data = xr.open_dataset('icon_extremes.nc', chunks={})
>>> icon_extremes = icon_data.extreme_events  # (time, ncells)
>>> icon_mask = icon_data.mask
>>> neighbours = icon_data.neighbours  # Cell connectivity
>>> cell_areas = icon_data.cell_areas  # Physical areas
>>>
>>> # Track events on unstructured grid
>>> unstructured_tracker = marEx.tracker(
...     icon_extremes,
...     icon_mask,
...     R_fill=5,                                   # 5-neighbor radius for gap filling
...     area_filter_quartile=0.6,                   # Remove 60% of smallest events
...     unstructured_grid=True,                     # Enable unstructured mode
...     dimensions={"x": "ncells"},                 # Must specify the name of the spatial dimension
...     coordinates={"x": "lon", "y": "lat"},       # Spatial coordinate names
...     neighbours=neighbours,                      # Required for unstructured
...     cell_areas=cell_areas                       # Required for area calculations
... )
>>> unstructured_events = unstructured_tracker.run()

Memory management and checkpointing for large datasets:

>>> # Use checkpointing for very large datasets
>>> large_tracker = marEx.tracker(
...     extreme_events,
...     mask,
...     R_fill=8,
...     area_filter_quartile=0.5,
...     temp_dir='/scratch/user/tracking_temp',  # Temporary storage
...     checkpoint='save'             # Save intermediate results
... )
>>> # Processing can be resumed if interrupted
>>> large_events = large_tracker.run()

Comparing different filtering strategies:

>>> # Conservative filtering - keep more events
>>> conservative = marEx.tracker(
...     extreme_events, mask, R_fill=5, area_filter_quartile=0.1
... )
>>> conservative_events = conservative.run()
>>>
>>> # Aggressive filtering - focus on largest events
>>> aggressive = marEx.tracker(
...     extreme_events, mask, R_fill=15, area_filter_quartile=0.8
... )
>>> aggressive_events = aggressive.run()
>>>
>>> print(f"Conservative: {conservative_events.ID.max().compute()} events")
>>> print(f"Aggressive: {aggressive_events.ID.max().compute()} events")

Using absolute area filtering instead of percentile-based:

>>> # Filter objects smaller than 25 grid cells
>>> absolute_tracker = marEx.tracker(
...     extreme_events, mask, R_fill=8, area_filter_absolute=25
... )
>>> absolute_events = absolute_tracker.run()
>>>
>>> # Default behavior (area_filter_quartile=0.5) when no parameters provided
>>> default_tracker = marEx.tracker(extreme_events, mask, R_fill=8)
>>> default_events = default_tracker.run()  # Uses quartile=0.5 filtering

Using physical cell areas for structured grids:

>>> # Load data with irregular grid cell areas
>>> grid_areas = xr.open_dataset('grid_areas.nc').cell_area  # (lat, lon) in m²
>>>
>>> # Track events using physical areas instead of cell counts
>>> physical_tracker = marEx.tracker(
...     extreme_events,
...     mask,
...     R_fill=8,
...     area_filter_quartile=0.5,
...     cell_areas=grid_areas  # Physical areas in m²
... )
>>> events = physical_tracker.run()
>>> # Now events.area contains physical areas in m² instead of cell counts

Integration with full marEx workflow:

>>> # Complete workflow from raw data to tracked events
>>> raw_sst = xr.open_dataset('sst_data.nc', chunks={}).sst.chunk({'time': 30})
>>>
>>> # Step 1: Preprocess to identify extremes
>>> processed = marEx.preprocess_data(raw_sst, threshold_percentile=95)
>>>
>>> # Step 2: Track extreme events
>>> tracker = marEx.tracker(
...     processed.extreme_events,
...     processed.mask,
...     R_fill=8,
...     area_filter_quartile=0.5
... )
>>> tracked_events = tracker.run()

Initialise the tracker with parameters and data.

__init__(data_bin, mask, R_fill, area_filter_quartile=None, area_filter_absolute=None, temp_dir=None, T_fill=2, allow_merging=True, nn_partitioning=False, overlap_threshold=0.5, unstructured_grid=False, dimensions=None, coordinates=None, neighbours=None, cell_areas=None, grid_resolution=None, max_iteration=40, checkpoint=None, debug=0, verbose=None, quiet=None, regional_mode=False, coordinate_units=None)[source]

Initialise the tracker with parameters and data.

Parameters:
Return type:

None

run(return_merges=False, checkpoint=None)[source]

Run the complete object identification and tracking pipeline.

This method executes the full workflow:

  1. Preprocessing: morphological operations and size filtering

  2. Identification and tracking of objects through time

  3. Computing and attaching statistics to the results

Parameters:
  • return_merges (bool, default=False) – If True, return merge events dataset alongside the main events

  • checkpoint (str, optional) – Override the instance checkpoint setting

Returns:

  • events_ds (xarray.Dataset) – Dataset containing tracked events and their properties

  • merges_ds (xarray.Dataset, optional) – Dataset with merge event information (only if return_merges=True)

Return type:

Dataset | Tuple[Dataset, Dataset]

run_preprocess(checkpoint=None)[source]

Preprocess binary data to prepare for tracking.

This performs morphological operations to fill holes/gaps in both space and time, then filters small objects according to the area_filter_quartile or area_filter_absolute.

Parameters:

checkpoint (str, optional) – Checkpoint strategy override

Returns:

  • data_bin_filtered (xarray.DataArray) – Preprocessed binary data

  • object_stats (tuple) – Statistics about the preprocessing

Return type:

Tuple[DataArray, Tuple[float, int, int, float, float, float]]

run_tracking(data_bin_preprocessed)[source]

Track objects through time to identify events.

Parameters:

data_bin_preprocessed (xarray.DataArray) – Preprocessed binary data

Returns:

  • events_ds (xarray.Dataset) – Dataset containing tracked events

  • merges_ds (xarray.Dataset) – Dataset with merge information

  • N_events_final (int) – Final number of unique events

Return type:

Tuple[Dataset, Dataset, int]

run_stats_attributes(events_ds, merges_ds, object_stats, N_events_final)[source]

Add statistics and attributes to the events dataset.

Parameters:
  • events_ds (xarray.Dataset) – Dataset containing tracked events

  • merges_ds (xarray.Dataset) – Dataset with merge information

  • object_stats (tuple) – Preprocessed object statistics

  • N_events_final (int) – Final number of events

Returns:

events_ds – Dataset with added statistics and attributes

Return type:

xarray.Dataset

compute_area(data_bin)[source]

Compute the total area of binary data at each time.

Parameters:

data_bin (xarray.DataArray) – Binary data

Returns:

area – Total area at each time (units: pixels for structured grid, matching cell_area for unstructured)

Return type:

xarray.DataArray

fill_holes(data_bin, R_fill=None)[source]

Fill holes and gaps using morphological operations.

This performs closing (dilation followed by erosion) to fill small gaps, then opening (erosion followed by dilation) to remove small isolated objects.

Parameters:
  • data_bin (xarray.DataArray) – Binary data to process

  • R_fill (int, optional) – Fill radius override

Returns:

data_bin_filled – Binary data with holes/gaps filled

Return type:

xarray.DataArray

fill_time_gaps(data_bin)[source]

Fill temporal gaps between objects.

Performs binary closing (dilation then erosion) along the time dimension to fill small time gaps between objects.

Parameters:

data_bin (xarray.DataArray) – Binary data to process

Returns:

data_bin_filled – Binary data with temporal gaps filled

Return type:

xarray.DataArray

refresh_dask_graph(data_bin)[source]

Clear and reset the Dask graph via save/load cycle.

This is needed to work around a memory leak bug in Dask where “Unmanaged Memory” builds up within loops.

Parameters:

data_bin (xarray.DataArray) – Data to refresh

Returns:

data_new – Data with fresh Dask graph

Return type:

xarray.DataArray

filter_small_objects(data_bin)[source]

Remove objects smaller than a threshold area.

Parameters:

data_bin (xarray.DataArray) – Binary data to filter

Returns:

  • data_bin_filtered (xarray.DataArray) – Binary data with small objects removed

  • area_threshold (float) – Area threshold used for filtering

  • object_areas (xarray.DataArray) – Areas of all objects pre-filtering

  • N_objects_prefiltered (int) – Number of objects before filtering

  • N_objects_filtered (int) – Number of objects after filtering

Return type:

Tuple[DataArray, float, DataArray, int, int]

identify_objects(data_bin, time_connectivity)[source]

Identify connected regions in binary data.

Parameters:
  • data_bin (xarray.DataArray) – Binary data to identify objects in

  • time_connectivity (bool) – Whether to connect objects across time

Returns:

  • object_id_field (xarray.DataArray) – Field of integer IDs for each object

  • None (NoneType) – Placeholder for compatibility with track_objects

  • N_objects (int) – Number of objects identified

Return type:

Tuple[DataArray, None, int]

calculate_centroid(binary_mask, original_centroid=None)[source]

Calculate object centroid, handling edge cases for periodic boundaries.

Parameters:
  • binary_mask (numpy.ndarray) – 2D binary array where True indicates the object (dimensions are (y,x))

  • original_centroid (tuple, optional) – (y_centroid, x_centroid) from regionprops_table

Returns:

(y_centroid, x_centroid)

Return type:

tuple

calculate_object_properties(object_id_field, properties=None)[source]

Calculate properties of objects from ID field.

Parameters:
  • object_id_field (xarray.DataArray) – Field containing object IDs

  • properties (list, optional) – List of properties to calculate (defaults to [‘label’, ‘area’])

Returns:

object_props – Dataset containing calculated properties with ‘ID’ dimension

Return type:

xarray.Dataset

check_overlap_slice(ids_t0, ids_next)[source]

Find overlapping objects between two consecutive time slices.

Parameters:
Returns:

Array of shape (n_overlaps, 3) with [id_t0, id_next, overlap_area]

Return type:

numpy.ndarray

find_overlapping_objects(object_id_field)[source]

Find all overlapping objects across time.

Parameters:

object_id_field (xarray.DataArray) – Field containing object IDs

Returns:

overlap_objects_list_unique_filtered – Array of object ID pairs that overlap across time, with overlap area The object in the first column precedes the second column in time. The third column contains:

  • For structured grid: number of overlapping pixels (int32)

  • For unstructured grid: total overlapping area in m^2 (float32)

Return type:

(N x 3) numpy.ndarray

enforce_overlap_threshold(overlap_objects_list, object_props)[source]

Filter object pairs based on overlap threshold.

Parameters:
  • overlap_objects_list ((N x 3) numpy.ndarray) – Array of object ID pairs with overlap area

  • object_props (xarray.Dataset) – Object properties including area

Returns:

overlap_objects_list_filtered – Filtered array of object ID pairs that meet the overlap threshold

Return type:

(M x 3) numpy.ndarray

consolidate_object_ids(data_t_minus_2, data_t_minus_1, object_props, timestep)[source]

Consolidate object IDs between t-2 and t-1 to ensure consistent tracking.

This identifies objects at t-1 that are actually continuations of objects from t-2 (but got different IDs due to partitioning) and renames them to maintain consistent IDs across timesteps.

Parameters:
  • data_t_minus_2 (xr.DataArray) – Object field at timestep t-2

  • data_t_minus_1 (xr.DataArray) – Object field at timestep t-1 (will be modified)

  • object_props (xr.Dataset) – Object properties dataset (will be modified)

  • timestep (int) – Current timestep number for logging purposes

Returns:

  • data_t_minus_1_consolidated (xr.DataArray) – Updated t-1 field with consolidated IDs

  • object_props_updated (xr.Dataset) – Updated object properties with merged/deleted objects

Return type:

Tuple[DataArray, Dataset]

Notes

  • Uses self.overlap_threshold for determining consolidation eligibility

  • Updates object properties by recalculating for consolidated objects

  • Removes redundant child objects from object_props

compute_id_time_dict(da, child_objects, max_objects, all_objects=True)[source]

Generate lookup table mapping object IDs to their time index.

Parameters:
  • da (xarray.DataArray) – Field of object IDs

  • child_objects (list or array) – Object IDs to include in the dictionary

  • max_objects (int) – Maximum number of objects

  • all_objects (bool, default=True) – Whether to process all objects or just child_objects

Returns:

time_index_map – Dictionary mapping object IDs to time indices

Return type:

dict

track_objects(data_bin)[source]

Track objects through time to form events.

This is the main tracking method that handles splitting and merging of objects.

Parameters:

data_bin (xarray.DataArray) – Preprocessed binary data: Field of globally unique integer IDs of each element in connected regions. ID = 0 indicates no object.

Returns:

  • split_merged_events_ds (xarray.Dataset) – Dataset containing tracked events

  • merge_events (xarray.Dataset) – Dataset with merge information

  • N_events (int) – Final number of events

Return type:

Tuple[Dataset, Dataset, int]

cluster_rename_objects_and_props(object_id_field_unique, object_props, overlap_objects_list, merge_events)[source]

Cluster the object pairs and relabel to determine final event IDs.

Parameters:
  • object_id_field_unique (xarray.DataArray) – Field of unique object IDs. IDs must not be repeated across time.

  • object_props (xarray.Dataset) – Properties of each object that also need to be relabeled.

  • overlap_objects_list ((N x 2) numpy.ndarray) – Array of object ID pairs that indicate which objects are in the same event. The object in the first column precedes the second column in time.

  • merge_events (xarray.Dataset) – Information about merge events

Returns:

split_merged_events_ds – Dataset with relabeled events and their properties. ID = 0 indicates no object.

Return type:

xarray.Dataset

split_and_merge_objects(object_id_field_unique, object_props)[source]

Implement object splitting and merging logic.

This identifies and processes cases where objects split or merge over time, creating new object IDs as needed.

Parameters:
  • object_id_field_unique (xarray.DataArray) – Field of unique object IDs. IDs are required to be monotonically increasing with time.

  • object_props (xarray.Dataset) – Properties of each object

Returns:

(object_id_field, object_props, overlap_objects_list, merge_events)

Return type:

tuple

split_and_merge_objects_parallel(object_id_field_unique, object_props)[source]

Optimised parallel implementation of object splitting and merging.

This version is specifically designed for unstructured grids with more efficient memory handling and better parallelism than the standard split_and_merge_objects method. It processes data in chunks, handles merging events, and efficiently updates object IDs.

Parameters:
Returns:

(object_id_field, object_props, overlap_objects_list, merge_events)

Return type:

tuple

marEx.track.wrapped_euclidian_distance_mask_parallel(mask_values, parent_centroids_values, Nx, wrap)[source]

Optimised function for computing wrapped Euclidean distances.

Efficiently calculates distances between points in a binary mask and a set of centroids, accounting for periodic boundaries in the x dimension.

Parameters:
  • mask_values (np.ndarray) – 2D boolean array where True indicates points to calculate distances for

  • parent_centroids_values (np.ndarray) – Array of shape (n_parents, 2) containing (y, x) coordinates of parent centroids

  • Nx (int) – Size of the x-dimension for periodic boundary wrapping

  • wrap (bool) – Whether to treat x-dimension as periodic and wrap

Returns:

distances – Array of shape (n_true_points, n_parents) with minimum distances

Return type:

np.ndarray

marEx.track.create_grid_index_arrays(points_y, points_x, grid_size, ny, nx)[source]

Create a grid-based spatial index for efficient point lookup.

This function divides space into a grid and assigns points to grid cells for more efficient spatial queries compared to brute force comparisons.

Parameters:
  • points_y (np.ndarray) – Coordinates of points to index

  • points_x (np.ndarray) – Coordinates of points to index

  • grid_size (int) – Size of each grid cell

  • ny (int) – Dimensions of the overall grid

  • nx (int) – Dimensions of the overall grid

Returns:

  • grid_points (np.ndarray) – 3D array mapping grid cells to point indices

  • grid_counts (np.ndarray) – 2D array with count of points in each grid cell

Return type:

Tuple[ndarray[tuple[Any, …], dtype[int32]], ndarray[tuple[Any, …], dtype[int32]]]

marEx.track.wrapped_euclidian_distance_points(y1, x1, y2, x2, nx, half_nx, wrap)[source]

Calculate distance with periodic boundary conditions in x dimension.

Parameters:
  • y1 (float) – Coordinates of first point

  • x1 (float) – Coordinates of first point

  • y2 (float) – Coordinates of second point

  • x2 (float) – Coordinates of second point

  • nx (int) – Size of x dimension

  • half_nx (float) – Half the size of x dimension

  • wrap (bool) – Whether to apply periodic boundary conditions in x

Returns:

Euclidean distance accounting for periodic boundary in x (or not)

Return type:

float

marEx.track.partition_nn_grid(child_mask, parent_masks, child_ids, parent_centroids, Nx, max_distance=20, wrap=True)[source]

Partition a child object based on nearest parent object points.

This implementation uses spatial indexing and highly-threaded processing for efficient distance calculations. The algorithm assigns each point in the child object to the closest parent object.

Parameters:
  • child_mask (np.ndarray) – Binary mask of the child object

  • parent_masks (np.ndarray) – List of binary masks for each parent object

  • child_ids (np.ndarray) – List of IDs to assign to partitions

  • parent_centroids (np.ndarray) – Array of shape (n_parents, 2) with parent centroids

  • Nx (int) – Size of x dimension for periodic boundaries

  • max_distance (int, default=20) – Maximum search distance

  • wrap (bool, default=True) – Whether to apply periodic boundary conditions in the x dimension

Returns:

new_labels – Array containing assigned child_ids for each point

Return type:

np.ndarray

marEx.track.partition_nn_unstructured(child_mask, parent_masks, child_ids, parent_centroids, neighbours_int, lat, lon, max_distance=20)[source]

Partition a child object on an unstructured grid based on nearest parent points.

This function implements an efficient algorithm for assigning each cell in a child object to the nearest parent object, using graph traversal and spatial distances. It is optimised for unstructured grids.

Parameters:
  • child_mask (np.ndarray) – 1D boolean array where True indicates points in the child object

  • parent_masks (np.ndarray) – 2D boolean array of shape (n_parents, n_points) where True indicates points in each parent object

  • child_ids (np.ndarray) – 1D array containing the IDs to assign to each partition of the child object

  • parent_centroids (np.ndarray) – Array of shape (n_parents, 2) containing (lat, lon) coordinates of parent centroids in degrees

  • neighbours_int (np.ndarray) – 2D array of shape (3, n_points) containing indices of neighboring cells for each point

  • lat (np.ndarray) – Latitude/longitude arrays in degrees

  • lon (np.ndarray) – Latitude/longitude arrays in degrees

  • max_distance (int, default=20) – Maximum number of edge hops to search for parent points

Returns:

new_labels – 1D array containing the assigned child_ids for each True point in child_mask

Return type:

np.ndarray

marEx.track.partition_nn_unstructured_optimised(child_mask, parent_frontiers, parent_centroids, neighbours_int, lat, lon, max_distance=20)[source]

Memory-optimised nearest neighbor partitioning for unstructured grids.

This version uses more efficient memory management compared to partition_nn_unstructured, making it suitable for very large grids. It uses a compact representation of parent frontiers to reduce memory usage during graph traversal.

Parameters:
  • child_mask (np.ndarray) – 1D boolean array indicating which cells belong to the child object

  • parent_frontiers (np.ndarray) – 1D uint8 array with parent indices (255 for unvisited points)

  • parent_centroids (np.ndarray) – Array of shape (n_parents, 2) containing (lat, lon) coordinates

  • neighbours_int (np.ndarray) – 2D array of shape (3, n_points) containing indices of neighboring cells

  • lat (np.ndarray) – 1D arrays of latitude/longitude in degrees

  • lon (np.ndarray) – 1D arrays of latitude/longitude in degrees

  • max_distance (int, default=20) – Maximum number of edge hops to search for parent points

Returns:

result – 1D array containing the assigned parent indices for points in child_mask

Return type:

np.ndarray

marEx.track.partition_centroid_unstructured(child_mask, parent_centroids, child_ids, lat, lon)[source]

Partition a child object based on closest parent centroids on an unstructured grid.

This function assigns each cell in the child object to the parent with the closest centroid, using great circle distances on a spherical grid.

Parameters:
  • child_mask (np.ndarray) – 1D boolean array indicating which cells belong to the child object

  • parent_centroids (np.ndarray) – Array of shape (n_parents, 2) containing (lat, lon) coordinates of parent centroids in degrees

  • child_ids (np.ndarray) – Array of IDs to assign to each partition of the child object

  • lat (np.ndarray) – Latitude/longitude arrays in degrees

  • lon (np.ndarray) – Latitude/longitude arrays in degrees

Returns:

new_labels – 1D array containing assigned child_ids for cells in child_mask

Return type:

np.ndarray

marEx.track.sparse_bool_power(vec, sp_data, indices, indptr, exponent)[source]

Efficient sparse boolean matrix power operation.

This function implements a fast sparse matrix power operation for boolean matrices, avoiding memory leaks present in scipy+Dask implementations. It’s used for efficient morphological operations on unstructured grids.

Parameters:
  • vec (np.ndarray) – Boolean vector to multiply

  • sp_data (np.ndarray) – Sparse matrix in CSR format

  • indices (np.ndarray) – Sparse matrix in CSR format

  • indptr (np.ndarray) – Sparse matrix in CSR format

  • exponent (int) – Number of times to apply the matrix

Returns:

Result of (sparse_matrix ^ exponent) * vec

Return type:

np.ndarray

marEx.track.regional_tracker(data_bin, mask, coordinate_units, R_fill, area_filter_quartile=None, area_filter_absolute=None, **kwargs)[source]

Create a tracker instance configured for regional (non-global) data.

This is a convenience function that automatically sets regional_mode=True and requires explicit specification of coordinate units, since auto-detection may fail for regional coordinate ranges.

Parameters:
  • data_bin (xr.DataArray) – Binary data to identify and track objects in (True = object, False = background)

  • mask (xr.DataArray) – Binary mask indicating valid regions (True = valid, False = invalid)

  • coordinate_units ({'degrees', 'radians'}) – Units of the coordinate system. Must be specified for regional data.

  • R_fill (int or float) – Radius for filling holes/gaps in spatial domain (in grid cells)

  • area_filter_quartile (float, optional) – Quantile (0-1) for filtering smallest objects (e.g., 0.25 removes smallest 25%). Mutually exclusive with area_filter_absolute. Default is 0.5 if neither parameter is provided.

  • area_filter_absolute (int, optional) – The minimum area (in grid cells) for an object to be retained. Mutually exclusive with area_filter_quartile.

  • **kwargs – Additional parameters passed to the tracker class

Returns:

Configured tracker instance with regional_mode=True

Return type:

tracker

Examples

Track events in regional Mediterranean Sea data:

>>> import marEx
>>> # For regional data with degree coordinates
>>> regional_tracker = marEx.regional_tracker(
...     extreme_events,
...     mask,
...     coordinate_units='degrees',
...     R_fill=5,
...     area_filter_quartile=0.3
... )
>>> events = regional_tracker.run()

Track events in regional data with radian coordinates:

>>> # For model output with radian coordinates
>>> regional_tracker = marEx.regional_tracker(
...     extreme_events,
...     mask,
...     coordinate_units='radians',
...     R_fill=8,
...     area_filter_quartile=0.5
... )
>>> events = regional_tracker.run()

Using absolute area filtering in regional mode:

>>> # Keep only features larger than 15 grid cells
>>> absolute_regional = marEx.regional_tracker(
...     extreme_events,
...     mask,
...     coordinate_units='degrees',
...     R_fill=5,
...     area_filter_absolute=15
... )
>>> events = absolute_regional.run()

Classes

marEx.track.tracker(data_bin, mask, R_fill)

Tracker identifies and tracks arbitrary binary objects in spatial data through time.

Functions

marEx.track.regional_tracker(data_bin, mask, ...)

Create a tracker instance configured for regional (non-global) data.

Plotting Module (marEx.plotX)

MarEx-PlotX: Marine Extremes Visualisation Module

Comprehensive visualisation module for marine extreme events supporting both structured (regular grid) and unstructured oceanographic data. Provides automated grid detection and specialised plotting capabilities optimised for each data structure type.

Core capabilities:

  • Polymorphic plotting with automatic grid type detection

  • Single plot generation with customisable styling and projections

  • Multi-panel plotting for comparative analysis

  • Animation generation for temporal visualisation

  • Memory-efficient handling of large oceanographic datasets

Supported data formats:

  • Structured data: 3D arrays (time, lat, lon) for regular rectangular grids

  • Unstructured data: 2D arrays (time, cell) for irregular triangular meshes

class marEx.plotX.PlotXAccessor(xarray_obj)[source]

Bases: object

Xarray accessor for plotX functionality with support for custom dimensions and coordinates.

Initialise the PlotXAccessor.

Parameters:

xarray_obj (DataArray)

__init__(xarray_obj)[source]

Initialise the PlotXAccessor.

Parameters:

xarray_obj (DataArray)

__call__(dimensions=None, coordinates=None)[source]

Create a plotter instance with optional custom dimensions and coordinates.

Parameters:
  • dimensions (Dict[str, str] | None) – Optional mapping of conceptual dimensions to actual dimension names

  • coordinates (Dict[str, str] | None) – Optional mapping of conceptual coordinates to actual coordinate names

Returns:

Appropriate plotter instance for the data structure

Return type:

GriddedPlotter | UnstructuredPlotter

single_plot(config, **kwargs)[source]

Create a single plot with default dimension detection.

Parameters:

config (PlotConfig)

multi_plot(config, **kwargs)[source]

Create multiple plots with default dimension detection.

Parameters:

config (PlotConfig)

animate(config, **kwargs)[source]

Create animation with default dimension detection.

Parameters:

config (PlotConfig)

marEx.plotX.specify_grid(grid_type=None, fpath_tgrid=None, fpath_ckdtree=None)[source]

Set the global grid specification that will be used by all plotters.

Parameters:
  • grid_type (str | None) – str, either ‘gridded’ or ‘unstructured’. If specified, this will be used as the primary method to determine grid type.

  • fpath_tgrid (str | Path | None) – Path to the triangulation grid file

  • fpath_ckdtree (str | Path | None) – Path to the pre-computed KDTree indices directory

Return type:

None

Configuration Classes

marEx.plotX.PlotConfig([title, var_units, ...])

Configuration class for plot parameters

Functions

marEx.plotX.specify_grid([grid_type, ...])

Set the global grid specification that will be used by all plotters.

Submodules

Helper Module (marEx.helper)

HPC Dask Helper: Utilities for High-Performance Computing with Dask

This module provides utilities for setting up and managing Dask clusters in HPC environments, with specific support for the DKRZ Levante Supercomputer.

marEx.helper.configure_dask(scratch_dir=None, config=None)[source]

Configure Dask with appropriate settings for HPC environments.

Parameters:
  • scratch_dir (str or Path, optional) – Directory to use for temporary files.

  • config (dict, optional) – Additional Dask configuration settings to apply.

Returns:

Temporary directory object that should be kept alive while Dask is in use.

Return type:

TemporaryDirectory

marEx.helper.get_cluster_info(client)[source]

Get and print cluster connection information.

Parameters:

client (Client) – Dask client connected to a cluster.

Returns:

Dictionary containing connection information.

Return type:

dict

Examples

Basic cluster info retrieval:

>>> import marEx
>>>
>>> # Start local cluster
>>> client = marEx.start_local_cluster(n_workers=2)
>>>
>>> # Get connection information
>>> info = marEx.get_cluster_info(client)
Hostname: login01
Forward Port: login01:8787
Dashboard Link: localhost:8787/status
>>>
>>> print(f"Connect via: {info['dashboard_link']}")
Connect via: localhost:8787/status
>>> client.close()

SSH tunneling for remote access:

>>> # Start cluster on HPC system
>>> client = marEx.start_distributed_cluster(
...     n_workers=8, workers_per_node=4, dashboard_address=8889
... )
>>>
>>> # Get tunneling information
>>> info = marEx.get_cluster_info(client)
Hostname: levante-login01
Forward Port: levante-login01:8889
Dashboard Link: localhost:8889/status
>>>
>>> # Use this info to set up SSH tunnel:
>>> # ssh -L 8889:localhost:8889 levante-login01.dkrz.de
>>> # Then access dashboard at localhost:8889/status
>>>
>>> client.close()

Monitoring cluster status:

>>> client = marEx.start_local_cluster(n_workers=4)
>>> info = marEx.get_cluster_info(client)
>>>
>>> # Access cluster details
>>> print(f"Dashboard URL: {client.dashboard_link}")
>>> print(f"Cluster address: {client.cluster.scheduler_address}")
>>> print(f"Total threads: {client.nthreads()}")
>>>
>>> # Use port for programmatic access
>>> import requests
>>> try:
...     response = requests.get(f"http://localhost:{info['port']}/info")
...     print(f"Scheduler info: {response.status_code}")
... except:
...     print("Dashboard not accessible")
>>>
>>> client.close()
marEx.helper.start_local_cluster(n_workers=4, threads_per_worker=1, scratch_dir=None, verbose=None, quiet=None, **kwargs)[source]

Start a local Dask cluster.

Parameters:
  • n_workers (int, default=4) – Number of worker processes to start.

  • threads_per_worker (int, default=1) – Number of threads per worker.

  • scratch_dir (str or Path, optional) – Directory to use for temporary files.

  • verbose (bool, optional) – Enable verbose logging with detailed cluster startup information. If None, uses current global logging configuration.

  • quiet (bool, optional) – Enable quiet logging with minimal output (warnings and errors only). If None, uses current global logging configuration. Note: quiet takes precedence over verbose if both are True.

  • **kwargs – Additional keyword arguments to pass to LocalCluster.

Returns:

Dask client connected to the local cluster.

Return type:

Client

Examples

Basic local cluster for development:

>>> import marEx
>>>
>>> # Start simple local cluster
>>> client = marEx.start_local_cluster(n_workers=2, threads_per_worker=1)
>>> print(client)
<Client: 'tcp://127.0.0.1:xxxxx' processes=2 threads=2, memory=15.7 GB>
>>>
>>> # Check cluster status
>>> print(f"Dashboard: {client.dashboard_link}")
Dashboard: http://127.0.0.1:8787/status
>>>
>>> client.close()

Optimised cluster for CPU-intensive work:

>>> # Use one worker per physical core
>>> client = marEx.start_local_cluster(
...     n_workers=8,           # Number of physical cores
...     threads_per_worker=1   # Avoid hyperthreading for compute
... )
>>>
>>> # Process data with the cluster
>>> import xarray as xr
>>> data = xr.open_dataset('large_data.nc').chunk({'time': 25})
>>> result = data.mean().compute()
>>> client.close()

Memory-optimised cluster:

>>> # Configure for large datasets
>>> client = marEx.start_local_cluster(
...     n_workers=4,
...     threads_per_worker=2,
...     memory_limit='8GB',      # Limit memory per worker
...     scratch_dir='/tmp/dask'  # Fast (e.g. Lustre) local storage
... )

Integration with marEx preprocessing:

>>> # Start cluster then process data
>>> client = marEx.start_local_cluster(n_workers=16)
>>>
>>> # Load and preprocess SST data
>>> sst = xr.open_dataset('sst_data.nc').sst.chunk({'time': 30})
>>> processed = marEx.preprocess_data(sst, threshold_percentile=95)
>>>
>>> # Track events using the cluster
>>> tracker = marEx.tracker(
...     processed.extreme_events,
...     processed.mask,
...     R_fill=8,
...     area_filter_quartile=0.5
... )
>>> events = tracker.run()
>>>
>>> print(f"Processed {len(sst.time)} time steps with {client.nthreads()} threads")
>>> client.close()

Custom worker configuration:

>>> # Advanced configuration for specific workloads
>>> client = marEx.start_local_cluster(
...     n_workers=4,
...     threads_per_worker=2,
...     processes=True,         # Use separate processes (default)
...     silence_logs=False,     # Keep logs for debugging
...     dashboard_address=':8787'  # Specific dashboard port
... )
marEx.helper.start_distributed_cluster(n_workers, workers_per_node, runtime=9, node_memory=256, dashboard_address=8889, queue='compute', scratch_dir=None, account=None, verbose=None, quiet=None, **kwargs)[source]

Start a distributed Dask cluster on a SLURM-based supercomputer.

Parameters:
  • n_workers (int) – Total number of workers to request.

  • workers_per_node (int) – Number of workers per node.

  • runtime (int, default=9) – Maximum runtime in minutes.

  • node_memory (int, default=256) – Memory per node in GB (256, 512, or 1024).

  • dashboard_address (int, default=8889) – Port for the Dask dashboard.

  • queue (str, default='compute') – SLURM queue to submit jobs to.

  • scratch_dir (str or Path, optional) – Directory to use for temporary files.

  • account (str, optional) – SLURM account to charge. Defaults to DKRZ_ACCOUNT.

  • **kwargs – Additional keyword arguments to pass to SLURMCluster.

  • verbose (bool | None)

  • quiet (bool | None)

Returns:

Dask client connected to the distributed cluster.

Return type:

Client

Examples

Basic SLURM cluster for medium-scale processing:

>>> import marEx
>>>
>>> # Start cluster with 16 workers on 4 nodes (4 workers per node)
>>> client = marEx.start_distributed_cluster(
...     n_workers=16,
...     workers_per_node=4,
...     runtime=60,      # 1 hour
...     node_memory=128   # 128GB nodes
... )
>>> print(f"Cluster: {client}")
>>> # Access dashboard via SSH tunnel
>>> cluster_info = marEx.get_cluster_info(client)
>>> client.close()

Processing large marEx workflow on HPC:

>>> # Start cluster for full marEx pipeline
>>> client = marEx.start_distributed_cluster(
...     n_workers=32,
...     workers_per_node=8,
...     runtime=60,  # 1 hour
...     node_memory=256   # 256GB nodes
... )
>>>
>>> # Load very large SST dataset
>>> import xarray as xr
>>> sst = xr.open_zarr('/work/data/large_sst.zarr').chunk({'time': 25})
>>>
>>> # Preprocess with distributed computing
>>> processed = marEx.preprocess_data(
...     sst,
...     method_anomaly="shifting_baseline",
...     threshold_percentile=95
... )
>>>
>>> # Track events using full cluster
>>> tracker = marEx.tracker(
...     processed.extreme_events,
...     processed.mask,
...     R_fill=12,
...     area_filter_quartile=0.3
... )
>>> events = tracker.run()
>>>
>>> # Save results
>>> events.to_zarr('/work/results/tracked_events.zarr')
>>> client.close()

Custom SLURM configuration:

>>> # Advanced SLURM configuration
>>> client = marEx.start_distributed_cluster(
...     n_workers=64,
...     workers_per_node=32,
...     runtime=20,           # 20 minutes
...     node_memory=256,
...     account='myproject',   # Custom account
...     queue='debug',           # debug queue
... )

Dashboard access and monitoring:

>>> # Start cluster and set up monitoring
>>> client = marEx.start_distributed_cluster(
...     n_workers=16,
...     workers_per_node=4,
...     dashboard_address=8890  # Custom dashboard port
... )
>>>
>>> # Get connection info for SSH tunneling
>>> info = marEx.get_cluster_info(client)
>>> print(f"SSH tunnel: ssh -L {info['port']}:localhost:{info['port']} {info['hostname']}")
>>> print(f"Dashboard: {info['dashboard_link']}")
>>>
>>> # Monitor cluster performance
>>> print(f"Workers: {len(client.scheduler_info()['workers'])}")
>>> print(f"Total memory: {sum(w['memory_limit'] for w in client.scheduler_info()['workers'].values()) / 1e9:.1f} GB")

Memory optimisation strategies:

>>> # Optimise for different workload types
>>>
>>> # Memory-intensive: fewer workers per node
>>> memory_cluster = marEx.start_distributed_cluster(
...     n_workers=8, workers_per_node=2, node_memory=512
... )
>>>
>>> # CPU-intensive: more workers per node
>>> cpu_cluster = marEx.start_distributed_cluster(
...     n_workers=64, workers_per_node=16, node_memory=256
... )
marEx.helper.checkpoint_to_zarr(data, name='checkpoint', cleanup=False, timedim='time')[source]

Save and reload a Dask-backed xarray object to break graph dependencies.

This function materialises a Dask array/dataset to a temporary file and immediately reloads it, thereby breaking the computational graph. This prevents expensive recomputations when the same data is used multiple times downstream.

Parameters:
  • data (xarray.DataArray or xarray.Dataset) – Dask-backed xarray object to checkpoint

  • name (str, default='checkpoint') – Name prefix for the temporary file (for logging/debugging)

  • cleanup (bool, default=False) – Whether to delete the temporary file after reloading. By default (False), temp files are kept for the session and cleaned up by the OS temp directory manager. Set to True to immediately delete after reload.

  • timedim (str, default='time') – Name of the time dimension for chunking adjustments

Returns:

Reloaded data with broken graph dependencies

Return type:

xarray.DataArray or xarray.Dataset

Examples

>>> import marEx
>>> anomalies = marEx.compute_normalised_anomaly(sst)
>>> anomalies_checkpointed = marEx.helper.checkpoint_to_zarr(
...     anomalies, name="anomalies"
... )
marEx.helper.fix_dask_tuple_array(da)[source]

Fix a dask array that has tuple (i.e. task) references in its chunks. This addresses a longstanding issue/bug when dask arrays are saved to Zarr. Process chunk by chunk to maintain memory efficiency.

Parameters:

da (xarray.DataArray) – DataArray with Dask array backend that may have tuple chunk references

Returns:

DataArray with materialised chunks that can be safely saved to Zarr

Return type:

xarray.DataArray

HPC Cluster Management

marEx.helper.start_distributed_cluster(...)

Start a distributed Dask cluster on a SLURM-based supercomputer.

marEx.helper.start_local_cluster([...])

Start a local Dask cluster.

marEx.helper.configure_dask([scratch_dir, ...])

Configure Dask with appropriate settings for HPC environments.

marEx.helper.get_cluster_info(client)

Get and print cluster connection information.

marEx.helper.fix_dask_tuple_array(da)

Fix a dask array that has tuple (i.e. task) references in its chunks.

Detailed API Documentation

Detection and Preprocessing

marEx.preprocess_data(da, method_anomaly='shifting_baseline', method_extreme='hobday_extreme', threshold_percentile=95, window_year_baseline=15, smooth_days_baseline=21, window_days_hobday=11, window_spatial_hobday=None, std_normalise=False, detrend_orders=None, force_zero_mean=True, reference_period=None, method_percentile='approximate', precision=0.01, max_anomaly=5.0, dask_chunks=None, dimensions=None, coordinates=None, neighbours=None, cell_areas=None, use_temp_checkpoints=False, verbose=None, quiet=None)[source]

Complete preprocessing pipeline for marine extreme event identification.

Supports separate methods for anomaly computation and extreme identification:

Anomaly Methods:

  • ‘detrend_harmonic’: Detrending with harmonics and polynomials – more efficient, but biases statistics

  • ‘shifting_baseline’: Rolling climatology using previous window_year_baseline years – more “correct”, but shortens time series by window_year_baseline years

  • ‘fixed_baseline’: Daily climatology using full time series – does _not_ remove climate trends !

  • ‘detrend_fixed_baseline’: Polynomial detrending followed by fixed daily climatology – keeps full time-series of data, but does not account for trends in the timing of seasonal transitions (which may appear as extremes)

Extreme Methods:

  • ‘global_extreme’: Global-in-time threshold value

  • ‘hobday_extreme’: Local day-of-year specific thresholds with windowing

Parameters:
  • da (xarray.DataArray) – Raw input data

  • method_anomaly (str, default='shifting_baseline') – Anomaly computation method (‘detrend_harmonic’, ‘shifting_baseline’, ‘fixed_baseline’, or ‘detrend_fixed_baseline’).

  • method_extreme (str, default='hobday_extreme') – Extreme identification method (‘global_extreme’ or ‘hobday_extreme’).

  • threshold_percentile (float, default=95) – Percentile threshold for extreme event detection.

  • window_year_baseline (int, default=15) – Number of previous years for rolling climatology (shifting_baseline method only).

  • smooth_days_baseline (int, default=21) – Days for smoothing rolling climatology (shifting_baseline method only).

  • window_days_hobday (int, default=11) – Window size for day-of-year threshold calculation (hobday_extreme method only).

  • window_spatial_hobday (int, default=None) – Spatial window size (2D centred window) for the day-of-year threshold calculation (hobday_extreme method only).

  • std_normalise (bool, default=False) – Whether to standardise anomalies by rolling standard deviation (detrend_harmonic only).

  • detrend_orders (list, default=[1]) – Polynomial orders for detrending (detrend_harmonic method only). Default is 1st order (linear) detrend. [1,2] e.g. would use a linear+quadratic detrending.

  • force_zero_mean (bool, default=True) – Whether to enforce zero mean in detrended anomalies (detrend_harmonic method only).

  • reference_period (tuple of (int, int), optional) – Year range (start_year, end_year) inclusive for computing the daily climatology (fixed_baseline and detrend_fixed_baseline only). If None (default), uses all available years. Anomalies are computed for the full time series regardless. Example: reference_period=(1990, 2020) computes the climatology from 1990-2020 but outputs anomalies for the entire input time range.

  • method_percentile (str, default='approximate') – Method for percentile calculation (‘exact’ or ‘approximate’) for both global_extreme & hobday_extreme methods. N.B.: Using the exact percentile calculation requires both careful/thoughtful chunking & sufficient memory, in consideration of the limitations inherent to distributed parallel I/O & processing.

  • precision (float, default=0.01) – Precision for histogram bins in approximate percentile method.

  • max_anomaly (float, default=5.0) – Maximum anomaly value for histogram binning in the approximate percentile method.

  • dask_chunks (dict, optional) – Chunking specification for distributed computation.

  • dimensions (dict, default={"time": "time", "x": "lon", "y": "lat"}) – Mapping of dimensions to names in the data.

  • coordinates (dict, optional) – Mapping of coordinates to names in the data. Defaults to dimensions mapping.

  • neighbours (xarray.DataArray, optional) – Neighbour connectivity for spatial clustering.

  • cell_areas (xarray.DataArray, optional) – Cell areas for weighted spatial statistics.

  • use_temp_checkpoints (bool, default=False) – Enable checkpointing to temporary zarr stores to break Dask graph dependencies. When True, intermediate results (anomalies, thresholds, extremes) are saved to temporary zarr files and immediately reloaded, preventing expensive recomputations. Recommended for large datasets on HPC systems where the 2D histogram computation is expensive. Temporary files are automatically cleaned up after reloading.

  • verbose (bool, default=None) – Enable verbose logging with detailed progress information. If None, uses current global logging configuration.

  • quiet (bool, default=None) – Enable quiet logging with minimal output (warnings and errors only). If None, uses current global logging configuration. Note: quiet takes precedence over verbose if both are True.

Returns:

Processed dataset with anomalies and extreme event identification

Return type:

xarray.Dataset

Examples

Basic usage with gridded SST data for marine heatwave detection:

>>> import xarray as xr
>>> import marEx
>>>
>>> # Load and chunk SST data
>>> sst = xr.open_dataset('sst_data.nc', chunks={}).sst.chunk({'time': 30})
>>>
>>> # Basic preprocessing with default shifting baseline method
>>> result = marEx.preprocess_data(sst, threshold_percentile=90)
>>> print(result)
<xarray.Dataset>
Dimensions:         (time: 1461, lat: 180, lon: 360)
Data variables:
    dat_anomaly     (time, lat, lon) float32 dask.array<chunksize=(30, 180, 360)>
    mask            (lat, lon) bool dask.array<chunksize=(180, 360)>
    extreme_events  (time, lat, lon) bool dask.array<chunksize=(30, 180, 360)>
    thresholds      (lat, lon) float32 dask.array<chunksize=(180, 360)>
>>> # Check which locations have extreme events
>>> print(f"Total extreme events: {result.extreme_events.sum().compute()}")
Total extreme events: 15847

Using shifting baseline method for more accurate climatology:

>>> # Requires at least 15 years of data by default
>>> result_shifting = marEx.preprocess_data(
...     sst,
...     method_anomaly="shifting_baseline",
...     window_year_baseline=10,  # Use shorter window if needed
...     smooth_days_baseline=31   # Longer smoothing window
... )
>>> # Note: First 10 years will be removed from output

Using Hobday extreme method with day-of-year specific thresholds:

>>> result_hobday = marEx.preprocess_data(
...     sst,
...     method_extreme="hobday_extreme",
...     window_days_hobday=11,  # 11-day window for each day-of-year
...     threshold_percentile=95
... )
>>> print(result_hobday.thresholds.dims)
('dayofyear', 'lat', 'lon')

Previous configuration (marEx v2.0 default) with polynomial detrending and standardisation:

>>> result_advanced = marEx.preprocess_data(
...     sst,
...     method_anomaly="detrend_harmonic",
...     detrend_orders=[1, 2],  # Linear and quadratic trends
...     std_normalise=True,     # Standardise by rolling std
...     force_zero_mean=True,
...     threshold_percentile=95
... )
>>> # Result includes both raw and standardised anomalies
>>> print('dat_stn' in result_advanced)
True

Processing unstructured data:

>>> # For ICON ocean model data
>>> icon_sst = xr.open_dataset('icon_sst.nc', chunks={}).to.chunk({'time': 50})
>>> result_unstructured = marEx.preprocess_data(
...     icon_sst,
...     dimensions={"x": "ncells"},   # Must specify the name of the spatial dimension
...     dask_chunks={"time": 50}
... )

Error handling - insufficient data for shifting baseline:

>>> short_data = sst.isel(time=slice(0, 1000))  # Only ~3 years
>>> try:
...     result = marEx.preprocess_data(
...         short_data,
...         method_anomaly="shifting_baseline",
...         window_year_baseline=15
...     )
... except ValueError as e:
...     print(f"Error: {e}")
Error: Insufficient data for shifting_baseline method. Dataset spans 3 years but window_year_baseline
requires at least 15 years.

Performance considerations with chunking:

>>> # For large datasets, adjust chunking for memory management
>>> large_sst = sst.chunk({"time": 25, "lat": 90, "lon": 180})
>>> result = marEx.preprocess_data(
...     large_sst,
...     dask_chunks={"time": 25},
...     method_percentile="approximate"  # Use approximate method (Default) for long time-series calculations
... )

Integration with tracking workflow:

>>> # Preprocess data then track events
>>> processed = marEx.preprocess_data(sst, threshold_percentile=95)
>>> tracker = marEx.tracker(
...     processed.extreme_events,
...     processed.mask,
...     R_fill=8,
...     area_filter_quartile=0.5
... )
>>> events = tracker.run()
>>> print(f"Identified {events.event.max().compute()} distinct events")

Simple fixed baseline approach:

>>> # Basic daily climatology across all years
>>> result_fixed = marEx.preprocess_data(
...     sst,
...     method_anomaly="fixed_baseline",
...     threshold_percentile=95
... )
>>> # Uses all available data for climatology computation

Combined trend removal and fixed climatology:

>>> # Remove long-term trends then compute daily climatology
>>> result_combined = marEx.preprocess_data(
...     sst,
...     method_anomaly="detrend_fixed_baseline",
...     detrend_orders=[1],  # Linear trend
...     threshold_percentile=95,
...     force_zero_mean=True
... )
>>> # Balances trend removal with simple climatology
marEx.compute_normalised_anomaly(da, method_anomaly='shifting_baseline', dimensions=None, coordinates=None, window_year_baseline=15, smooth_days_baseline=21, std_normalise=False, detrend_orders=None, force_zero_mean=True, reference_period=None, use_temp_checkpoints=False, verbose=None, quiet=None)[source]

Generate normalised anomalies using specified methodology.

Parameters:
  • da (xarray.DataArray) – Input data with dimensions matching the ‘dimensions’ parameter

  • method_anomaly (str, default='shifting_baseline') –

    Anomaly computation method. Options: - ‘detrend_harmonic’: Detrending with harmonics and polynomials (efficient, biased) - ‘shifting_baseline’: Rolling climatology (accurate, shortens time series) - ‘fixed_baseline’: Daily climatology using full time series (keeps long-term trends in the anomaly) - ‘detrend_fixed_baseline’: Polynomial detrending + fixed climatology (does not shorten time series,

    keeps trends in seasonal timing in the anomaly)

  • dimensions (dict, optional) – Mapping of conceptual dimensions to actual dimension names in the data

  • coordinates (dict, optional) – Mapping of conceptual coordinates to actual coordinate names in the data

  • window_year_baseline (int, default=15) – Number of years for rolling climatology (shifting_baseline only)

  • smooth_days_baseline (int, default=21) – Days for smoothing rolling climatology (shifting_baseline only)

  • std_normalise (bool, default=False) – Whether to normalise by 30-day rolling standard deviation (detrend_harmonic only)

  • detrend_orders (list, default=[1]) – Polynomial orders for trend removal (detrend_harmonic and detrend_fixed_baseline only)

  • force_zero_mean (bool, default=True) – Explicitly enforce zero mean in final anomalies (detrend_harmonic and detrend_fixed_baseline only)

  • reference_period (tuple of (int, int), optional) – Year range (start_year, end_year) inclusive for computing the daily climatology (fixed_baseline and detrend_fixed_baseline only). If None (default), uses all available years. Anomalies are computed for the full time series regardless.

  • use_temp_checkpoints (bool)

  • verbose (bool | None)

  • quiet (bool | None)

Returns:

Dataset containing anomalies, mask, and metadata

Return type:

xarray.Dataset

Examples

Basic detrended baseline anomaly computation:

>>> import xarray as xr
>>> import marEx
>>>
>>> # Load chunked SST data
>>> sst = xr.open_dataset('sst_data.nc', chunks={}).sst.chunk({'time': 30})
>>>
>>> # Compute anomalies using shifting baseline (default)
>>> result = marEx.compute_normalised_anomaly(sst)
>>> print(result.data_vars)
Data variables:
    dat_anomaly  (time, lat, lon) float32 dask.array<chunksize=(30, 180, 360)>
    mask         (lat, lon) bool dask.array<chunksize=(180, 360)>
>>> # Check that anomalies have approximately zero mean
>>> print(f"Mean anomaly: {result.dat_anomaly.mean().compute():.6f}")
Mean anomaly: 0.000023

Previous configuration (marEx v2.0 default) of detrended baseline with higher-order polynomials and standardisation. Note: marEx v3.0+ uses shifting_baseline as the default method:

>>> result_advanced = marEx.compute_normalised_anomaly(
...     sst,
...     method_anomaly="detrend_harmonic",
...     detrend_orders=[1, 2, 3],  # Linear, quadratic, cubic trends
...     std_normalise=True,        # Add standardised anomalies
...     force_zero_mean=True
... )
>>> print(result_advanced.data_vars)
Data variables:
    dat_anomaly  (time, lat, lon) float32 dask.array<chunksize=(30, 180, 360)>
    mask         (lat, lon) bool dask.array<chunksize=(180, 360)>
    dat_stn      (time, lat, lon) float32 dask.array<chunksize=(30, 180, 360)>
    STD          (dayofyear, lat, lon) float32 dask.array<chunksize=(366, 180, 360)>
>>> # Standardised anomalies have unit variance
>>> print(f"STD of standardised anomalies: {result_advanced.dat_stn.std().compute():.3f}")

Accurate shifting baseline method for climate-aware anomalies:

>>> result_shifting = marEx.compute_normalised_anomaly(
...     sst,
...     method_anomaly="shifting_baseline",
...     window_year_baseline=10,   # Use 10-year rolling climatology
...     smooth_days_baseline=31    # 31-day smoothing window
... )
>>> # Anomalies computed relative to recent past climatology

Processing unstructured data:

>>> # ICON ocean model with ncells dimension
>>> icon_data = xr.open_dataset('icon_sst.nc', chunks={}).to.chunk({'time': 25})
>>> result_unstructured = marEx.compute_normalised_anomaly(
...     icon_data,
...     dimensions={"time": "time", "x": "ncells"}
...     coordinates={"time": "time", "x": "lon", "y": "lat"},
... )
>>> print(result_unstructured.dims)
Frozen({'time': 1461, 'ncells': 83886})

Comparison of methods - detrended vs shifting baseline:

>>> # Detrended baseline - faster, slight bias
>>> detrended = marEx.compute_normalised_anomaly(
...     sst, method_anomaly="detrend_harmonic"
... )
>>>
>>> # Shifting baseline - slower, more accurate
>>> shifting = marEx.compute_normalised_anomaly(
...     sst, method_anomaly="shifting_baseline",
...     window_year_baseline=15
... )
>>>
>>> # Compare anomaly magnitudes
>>> print(f"Detrended RMS: {detrended.dat_anomaly.std().compute():.3f}")
>>> print(f"Shifting RMS: {shifting.dat_anomaly.std().compute():.3f}")

Fixed baseline climatology:

>>> # Use full time series for daily climatology
>>> result_fixed = marEx.compute_normalised_anomaly(
...     sst,
...     method_anomaly="fixed_baseline"
... )
>>> # Climatology computed from all available years

Fixed baseline with a restricted reference period:

>>> # Compute climatology from 1990-2020 only, but output anomalies for all years
>>> result_ref = marEx.compute_normalised_anomaly(
...     sst,
...     method_anomaly="fixed_baseline",
...     reference_period=(1990, 2020)
... )

Fixed detrended baseline:

>>> # Remove long-term trends then compute fixed climatology
>>> result_fixed_detrended = marEx.compute_normalised_anomaly(
...     sst,
...     method_anomaly="detrend_fixed_baseline",
...     detrend_orders=[1],  # Remove linear trend
...     force_zero_mean=True
... )
>>> # Combines trend removal with fixed climatology
marEx.smoothed_rolling_climatology(da, window_year_baseline=15, smooth_days_baseline=21, dimensions=None, coordinates=None, use_temp_checkpoints=False)[source]

Compute a smoothed rolling climatology using the previous window_year_baseline years of data and reassemble it to match the original data structure. Years without enough previous data will be filled with NaN.

Parameters:
  • da (xarray.DataArray) – Input data with time coordinate

  • window_year_baseline (int, default=15) – Number of years to include in each climatology window

  • smooth_days_baseline (int, default=21) – Number of days for temporal smoothing window

  • dimensions (dict, optional) – Mapping of dimensions to names in the data

  • coordinates (dict, optional) – Mapping of coordinates to names in the data

  • use_temp_checkpoints (bool)

Returns:

Smoothed rolling climatology with same shape as input data

Return type:

xarray.DataArray

Examples

Basic smoothed rolling climatology:

>>> import xarray as xr
>>> import marEx
>>>
>>> # Load SST data
>>> sst = xr.open_dataset('sst_data.nc', chunks={}).sst.chunk({'time': 30})
>>>
>>> # Compute smoothed rolling climatology
>>> smooth_clim = marEx.smoothed_rolling_climatology(
...     sst,
...     window_year_baseline=15,
...     smooth_days_baseline=21
... )
>>> print(smooth_clim.shape)
(7305, 180, 360)

Comparing different smoothing windows:

>>> # Short smoothing - more day-to-day variability
>>> clim_short = marEx.smoothed_rolling_climatology(
...     sst, smooth_days_baseline=7
... )
>>>
>>> # Long smoothing - smoother seasonal cycle
>>> clim_long = marEx.smoothed_rolling_climatology(
...     sst, smooth_days_baseline=61
... )
>>>
>>> # Compare variability
>>> var_short = clim_short.std(dim='time').mean().compute()
>>> var_long = clim_long.std(dim='time').mean().compute()
>>> print(f"Variability: short={var_short:.3f}, long={var_long:.3f}")

Climatology for anomaly computation:

>>> # Compute smoothed climatology then anomalies
>>> climatology = marEx.smoothed_rolling_climatology(sst)
>>> anomalies = sst - climatology
>>>
>>> # Check that anomalies have reasonable properties
>>> print(f"Anomaly mean: {anomalies.mean().compute():.6f}")
>>> print(f"Anomaly std: {anomalies.std().compute():.3f}")

Unstructured data processing:

>>> # ICON ocean data
>>> icon_sst = xr.open_dataset('icon_sst.nc', chunks={}).to.chunk({'time': 25})
>>> icon_smooth_clim = marEx.smoothed_rolling_climatology(
...     icon_sst,
...     dimensions={"time": "time", "x": "ncells"},
...     coordinates={"time": "time", "x": "lon", "y": "lat"},
...     window_year_baseline=10,
...     smooth_days_baseline=31
... )

Effect of smoothing on seasonal cycle:

>>> # Raw rolling climatology (no temporal smoothing)
>>> raw_clim = marEx.rolling_climatology(sst, window_year_baseline=15)
>>>
>>> # Smoothed rolling climatology
>>> smooth_clim = marEx.smoothed_rolling_climatology(
...     sst, window_year_baseline=15, smooth_days_baseline=21
... )
>>>
>>> # Compare seasonal cycle smoothness
>>> # Extract annual cycle for a point
>>> point_raw = raw_clim.isel(lat=90, lon=180).sel(time='2010')
>>> point_smooth = smooth_clim.isel(lat=90, lon=180).sel(time='2010')
>>>
>>> print(f"Raw climatology range: {(point_raw.max() - point_raw.min()).compute():.3f}")
>>> print(f"Smooth climatology range: {(point_smooth.max() - point_smooth.min()).compute():.3f}")

Performance considerations:

>>> # Efficient implementation smooths raw data first, then computes climatology
>>> # This is more memory-efficient than smoothing the climatology
>>> large_sst = sst.chunk({'time': 25, 'lat': 45, 'lon': 90})
>>> efficient_clim = marEx.smoothed_rolling_climatology(large_sst)
marEx.rolling_climatology(da, window_year_baseline=15, dimensions=None, coordinates=None, use_temp_checkpoints=False)[source]

Compute rolling climatology efficiently using flox cohorts. Uses the previous window_year_baseline years of data and reassemble it to match the original data structure. Years without enough previous data will be filled with NaN.

Parameters:
  • da (xarray.DataArray) – Input data with time coordinate

  • window_year_baseline (int, default=15) – Number of years to include in each climatology window

  • dimensions (dict, optional) – Mapping of dimensions to names in the data

  • coordinates (dict, optional) – Mapping of coordinates to names in the data

  • use_temp_checkpoints (bool)

Returns:

Rolling climatology with same shape as input data

Return type:

xarray.DataArray

Examples

Basic rolling climatology computation:

>>> import xarray as xr
>>> import marEx
>>>
>>> # Load 20 years of SST data
>>> sst = xr.open_dataset('sst_data.nc', chunks={}).sst.chunk({'time': 30})
>>>
>>> # Compute 15-year rolling climatology
>>> climatology = marEx.rolling_climatology(sst, window_year_baseline=15)
>>> print(climatology.shape)
(7305, 180, 360)  # Same as input
>>>
>>> # First 15 years will be NaN (insufficient history)
>>> print(f"NaN values in first year: {climatology.isel(time=slice(0, 365)).isnull().all().compute()}")
True

Shorter window for datasets with limited time span:

>>> # For datasets with only 10 years, use shorter window
>>> short_climatology = marEx.rolling_climatology(
...     sst, window_year_baseline=5
... )
>>> # First 5 years will be NaN instead of 15

Processing unstructured data:

>>> # ICON ocean model data
>>> icon_sst = xr.open_dataset('icon_sst.nc', chunks={}).to.chunk({'time': 25})
>>> icon_climatology = marEx.rolling_climatology(
...     icon_sst,
...     dimensions={"time": "time", "x": "ncells"}
...     coordinates={"time": "time", "x": "lon", "y": "lat"}
... )
>>> print(icon_climatology.dims)
Frozen({'time': 7305, 'ncells': 83886})

Comparing with fixed climatology:

>>> # Fixed climatology (traditional approach)
>>> fixed_clim = sst.groupby(sst.time.dt.dayofyear).mean()
>>>
>>> # Rolling climatology (adaptive approach)
>>> rolling_clim = marEx.rolling_climatology(sst)
>>>
>>> # Rolling climatology adapts to climate change
>>> clim_2000 = rolling_clim.sel(time='2000').mean()
>>> clim_2020 = rolling_clim.sel(time='2020').mean()
>>> print(f"Climate change signal: {(clim_2020 - clim_2000).compute():.3f} °C")

Memory considerations for large datasets:

>>> # Ensure appropriate chunking for memory efficiency
>>> large_sst = sst.chunk({'time': 30, 'lat': 45, 'lon': 90})
>>> large_climatology = marEx.rolling_climatology(large_sst)
>>> # Output maintains input chunking structure
marEx.identify_extremes(da, method_extreme='hobday_extreme', threshold_percentile=95, dimensions=None, coordinates=None, window_days_hobday=11, window_spatial_hobday=None, method_percentile='approximate', precision=0.01, max_anomaly=5.0, use_temp_checkpoints=False, verbose=None, quiet=None)[source]

Identify extreme events exceeding a percentile threshold using specified method.

Parameters:
  • da (xarray.DataArray) – DataArray containing anomalies

  • method_extreme (str, default='hobday_extreme') – Method for threshold calculation (‘global_extreme’ or ‘hobday_extreme’)

  • threshold_percentile (float, default=95) – Percentile threshold (e.g., 95 for 95th percentile)

  • dimensions (dict, optional) – Mapping of dimensions to names in the data

  • coordinates (dict, optional) – Mapping of coordinates to names in the data

  • window_days_hobday (int, default=11) – Window for day-of-year threshold (hobday_extreme only)

  • window_spatial_hobday (int, default=None) – Window for day-of-year threshold spatial clustering (hobday_extreme only)

  • method_percentile (str, default='approximate') – Method for percentile computation (‘exact’ or ‘approximate’)

  • precision (float, default=0.01) – Precision for histogram bins in approximate method

  • max_anomaly (float, default=5.0) – Maximum anomaly value for histogram binning

  • use_temp_checkpoints (bool)

  • verbose (bool | None)

  • quiet (bool | None)

Returns:

Tuple of (extremes, thresholds) where extremes is a boolean array identifying extreme events and thresholds contains the threshold values used

Return type:

tuple

Examples

Basic extreme identification with global thresholds:

>>> import xarray as xr
>>> import marEx
>>>
>>> # Load anomaly data (from compute_normalised_anomaly)
>>> anomalies = xr.open_dataset('anomalies.nc', chunks={}).dat_anomaly
>>>
>>> # Identify extreme events using global-in-time 95th percentile
>>> extremes, thresholds = marEx.identify_extremes(
...     anomalies,
...     method_extreme="global_extreme",
...     threshold_percentile=95
... )
>>> print(f"Extreme events shape: {extremes.shape}")
Extreme events shape: (1461, 180, 360)
>>> print(f"Thresholds shape: {thresholds.shape}")
Thresholds shape: (180, 360)
>>> # Count total extreme events
>>> total_extremes = extremes.sum().compute()
>>> print(f"Total extreme events: {total_extremes}")

Using day-of-year specific thresholds (cf. Hobday et al. 2016 method):

>>> # More sophisticated threshold calculation
>>> extremes_hobday, thresholds_hobday = marEx.identify_extremes(
...     anomalies,
...     method_extreme="hobday_extreme",
...     threshold_percentile=95,
...     window_days_hobday=11  # 11-day window around each day-of-year
...     window_spatial_hobday=3  # 3x3 spatial window for clustering percentile calcuation
... )
>>> print(f"Hobday thresholds shape: {thresholds_hobday.shape}")
Hobday thresholds shape: (366, 180, 360)
>>> # Compare seasonal variation in thresholds
>>> summer_threshold = thresholds_hobday.sel(dayofyear=200).mean()
>>> winter_threshold = thresholds_hobday.sel(dayofyear=50).mean()
>>> print(f"Summer vs Winter thresholds: {summer_threshold:.3f} vs {winter_threshold:.3f}")

Comparison of exact vs approximate percentile methods:

>>> # Approximate method (faster, default)
>>> extremes_approx, thresh_approx = marEx.identify_extremes(
...     anomalies, method_percentile="approximate"
... )
>>>
>>> # Exact method (slower & memory intensive)
>>> extremes_exact, thresh_exact = marEx.identify_extremes(
...     anomalies, method_percentile="exact"
... )
>>>
>>> # Compare threshold precision — ~0.005C
>>> threshold_diff = (thresh_exact - thresh_approx).std().compute()
>>> print(f"Threshold difference (exact vs approx): {threshold_diff:.6f}")

Different percentile thresholds for varying event rarity:

>>> # Conservative threshold - very extreme events only
>>> extremes_98, _ = marEx.identify_extremes(
...     anomalies, threshold_percentile=98
... )
>>>
>>> # Moderate threshold - more frequent events
>>> extremes_90, _ = marEx.identify_extremes(
...     anomalies, threshold_percentile=90
... )
>>>
>>> # Compare event frequency
>>> print(f"99th percentile events: {extremes_99.sum().compute()}")
>>> print(f"90th percentile events: {extremes_90.sum().compute()}")

Processing unstructured data:

>>> # ICON ocean model data
>>> icon_anomalies = xr.open_dataset('icon_anomalies.nc', chunks={}).dat_anomaly
>>> extremes_unstructured, thresholds_unstructured = marEx.identify_extremes(
...     icon_anomalies,
...     dimensions={"time": "time", "x": "ncells"},
...     coordinates={"time": "time", "x": "lon", "y": "lat"},
...     threshold_percentile=95
... )
>>> print(f"Unstructured extremes shape: {extremes_unstructured.shape}")

Advanced Hobday method with custom temporal window:

>>> # Longer temporal window for smoother thresholds
>>> extremes_smooth, thresholds_smooth = marEx.identify_extremes(
...     anomalies,
...     method_extreme="hobday_extreme",
...     window_days_hobday=31,  # Longer smoothing window
...     threshold_percentile=95
... )
>>>
>>> # Compare threshold smoothness
>>> std_11day = thresholds_hobday.std(dim='dayofyear').mean().compute()
>>> std_31day = thresholds_smooth.std(dim='dayofyear').mean().compute()
>>> print(f"Threshold variability: 11-day={std_11day:.3f}, 31-day={std_31day:.3f}")

Event Tracking

class marEx.tracker(data_bin, mask, R_fill, area_filter_quartile=None, area_filter_absolute=None, temp_dir=None, T_fill=2, allow_merging=True, nn_partitioning=False, overlap_threshold=0.5, unstructured_grid=False, dimensions=None, coordinates=None, neighbours=None, cell_areas=None, grid_resolution=None, max_iteration=40, checkpoint=None, debug=0, verbose=None, quiet=None, regional_mode=False, coordinate_units=None)[source]

Bases: object

Tracker identifies and tracks arbitrary binary objects in spatial data through time.

The tracker supports both structured (regular grid) and unstructured data, and seamlessly handles splitting & merging of objects. It identifies connected regions in binary data at each time step, and tracks these as evolving events through time.

Main workflow:

  1. Preprocessing: Fill spatiotemporal holes, filter small objects

  2. Object identification: Label connected components at each time

  3. Tracking: Determine object correspondences across time

  4. Optional splitting & merging: Handle complex event evolution

Parameters:
  • data_bin (xarray.DataArray) – Binary field of extreme points to group, label, and track (True = object, False = background) Must represent and underlying dask array.

  • mask (xarray.DataArray) – Binary mask indicating valid regions (True = valid, False = invalid)

  • R_fill (int) – The radius of the kernel used in morphological opening & closing, relating to the largest hole/gap that can be filled. In units of grid cells.

  • area_filter_quartile (float, optional) – The fraction of the smallest objects to discard, i.e. the quantile defining the smallest area object retained. Quantile must be in (0-1) (e.g., 0.25 removes smallest 25%). Mutually exclusive with area_filter_absolute. Default is 0.5 if neither parameter is provided.

  • area_filter_absolute (int, optional) – The minimum area (in grid cells) for an object to be retained. Mutually exclusive with area_filter_quartile. Use this for fixed minimum area thresholds (e.g., 10 cells minimum).

  • temp_dir (str, optional) – Path to temporary directory for storing intermediate results

  • T_fill (int, default=2) – The permissible temporal gap (in days) between objects for tracking continuity to be maintained (must be even)

  • allow_merging (bool, default=True) – Allow objects to split and merge across time. Apply splitting & merging criteria, track merge events, and maintain original identities of merged objects across time. N.B.: False reverts to classical ndmeasure.label with simplar time connectivity, i.e. Scannell et al.

  • nn_partitioning (bool, default=False) – Implement a better partitioning of merged child objects based on closest parent cell. False reverts to using parent centroids to determine partitioning between new child objects, i.e. Di Sun & Bohai Zhang 2023. N.B.: Centroid-based partitioning has major problems with small merging objects suddenly obtaining unrealistically-large (and often disjoint) fractions of the larger object.

  • overlap_threshold (float, default=0.5) – The fraction of the smaller object’s area that must overlap with the larger object’s area to be considered the same event and continue tracking with the same ID.

  • unstructured_grid (bool, default=False) – Whether data is on an unstructured grid

  • dimensions (dict, default={"time": "time", "x": "lon", "y": "lat"}) – Mapping of dimensions to names in the data

  • coordinates (dict, optional) – Coordinate names for unstructured grids. Should contain ‘x’ and ‘y’ keys for x and y coordinates. May also contain ‘time’ if the time coordinate name is different from the dimension name.

  • neighbours (xarray.DataArray, optional) – For unstructured grid, indicates connectivity between cells

  • cell_areas (xarray.DataArray, optional) – For unstructured grid, area of each cell (required). For structured grid, area of each cell (optional). If not provided, defaults to 1.0 for each cell (resulting in cell counts as areas). Note: Overridden by grid_resolution if provided for structured grids.

  • grid_resolution (float, optional) – Grid resolution in degrees for structured grids only (ignored for unstructured grids). When provided, automatically calculates cell areas using spherical geometry. Overrides any provided cell_areas parameter.

  • max_iteration (int, default=40) – Maximum number of iterations for merging/splitting algorithm

  • checkpoint (str, default='None') – Checkpoint strategy (‘save’, ‘load’, or None)

  • debug (int, default=0) – Debug level (0-2)

  • verbose (bool, optional) – Enable verbose logging with detailed progress information. If None, uses current global logging configuration.

  • quiet (bool, optional) – Enable quiet logging with minimal output (warnings and errors only). If None, uses current global logging configuration. Note: quiet takes precedence over verbose if both are True.

  • regional_mode (bool, default=False) – Enable regional mode for non-global coordinate ranges. When True, coordinate_units must be specified.

  • coordinate_units (str, optional) – Coordinate units when regional_mode=True. Must be either ‘degrees’ or ‘radians’.

Examples

Basic tracking of marine heatwave events from preprocessed data:

>>> import xarray as xr
>>> import marEx
>>>
>>> # Load preprocessed extreme events data
>>> processed = xr.open_dataset('extreme_events.nc', chunks={})
>>> extreme_events = processed.extreme_events  # Boolean array
>>> mask = processed.mask  # Ocean/land mask
>>>
>>> # Initialise tracker with basic parameters
>>> tracker = marEx.tracker(
...     extreme_events,
...     mask,
...     R_fill=8,                    # Fill holes up to 8 grid cells
...     area_filter_quartile=0.5     # Remove smallest 50% of objects
...     allow_merging=False          # Basic tracking without splitting/merging
... )
>>>
>>> # Run tracking algorithm
>>> events = tracker.run()
>>> print(f"Identified {events.ID.max().compute()} distinct events")
Identified 1247 distinct events

Using automatic grid area calculation from resolution:

>>> # For regular lat/lon grids, automatically calculate physical areas
>>> grid_tracker = marEx.tracker(
...     extreme_events,
...     mask,
...     R_fill=8,
...     area_filter_quartile=0.5,
...     grid_resolution=0.25  # Grid resolution in degrees
... )
>>> # Cell areas are calculated automatically using spherical geometry
>>> grid_events = grid_tracker.run()

Advanced tracking with merging and splitting enabled:

>>> # More sophisticated tracking with temporal gap filling
>>> advanced_tracker = marEx.tracker(
...     extreme_events,
...     mask,
...     R_fill=12,               # Larger spatial gap filling
...     T_fill=4,                # Fill up to 4-day temporal gaps
...     area_filter_quartile=0.25,  # More aggressive size filtering
...     allow_merging=True,      # Enable split/merge detection
...     overlap_threshold=0.3    # Lower threshold for object linking
... )
>>>
>>> events_advanced, merges_log = advanced_tracker.run(return_merges=True)
>>> print(events_advanced.data_vars)
Data variables:
    event           (time, lat, lon)        int32           dask.array<chunksize=(25, 180, 360)>
    event_centroid  (time, lat, lon)        int32           dask.array<chunksize=(25, 180, 360)>
    ID_field        (time, lat, lon)        int32           dask.array<chunksize=(25, 180, 360)>
    global_ID       (time, ID)              int32           dask.array<chunksize=(25, 1247)>
    area            (time, ID)              float32         dask.array<chunksize=(25, 1247)>
    centroid        (component, time, ID)   float64         dask.array<chunksize=(2, 25, 1247)>
    presence        (time, ID)              bool            dask.array<chunksize=(25, 1247)>
    time_start      (ID)                    datetime64[ns]  dask.array<chunksize=(1247,)>
    time_end        (ID)                    datetime64[ns]  dask.array<chunksize=(1247,)>
    merge_ledger    (time, ID, sibling_ID)  int32           dask.array<chunksize=(25, 1247, 10)>

Processing unstructured ocean model data (ICON):

>>> # Load ICON ocean model data with connectivity
>>> icon_data = xr.open_dataset('icon_extremes.nc', chunks={})
>>> icon_extremes = icon_data.extreme_events  # (time, ncells)
>>> icon_mask = icon_data.mask
>>> neighbours = icon_data.neighbours  # Cell connectivity
>>> cell_areas = icon_data.cell_areas  # Physical areas
>>>
>>> # Track events on unstructured grid
>>> unstructured_tracker = marEx.tracker(
...     icon_extremes,
...     icon_mask,
...     R_fill=5,                                   # 5-neighbor radius for gap filling
...     area_filter_quartile=0.6,                   # Remove 60% of smallest events
...     unstructured_grid=True,                     # Enable unstructured mode
...     dimensions={"x": "ncells"},                 # Must specify the name of the spatial dimension
...     coordinates={"x": "lon", "y": "lat"},       # Spatial coordinate names
...     neighbours=neighbours,                      # Required for unstructured
...     cell_areas=cell_areas                       # Required for area calculations
... )
>>> unstructured_events = unstructured_tracker.run()

Memory management and checkpointing for large datasets:

>>> # Use checkpointing for very large datasets
>>> large_tracker = marEx.tracker(
...     extreme_events,
...     mask,
...     R_fill=8,
...     area_filter_quartile=0.5,
...     temp_dir='/scratch/user/tracking_temp',  # Temporary storage
...     checkpoint='save'             # Save intermediate results
... )
>>> # Processing can be resumed if interrupted
>>> large_events = large_tracker.run()

Comparing different filtering strategies:

>>> # Conservative filtering - keep more events
>>> conservative = marEx.tracker(
...     extreme_events, mask, R_fill=5, area_filter_quartile=0.1
... )
>>> conservative_events = conservative.run()
>>>
>>> # Aggressive filtering - focus on largest events
>>> aggressive = marEx.tracker(
...     extreme_events, mask, R_fill=15, area_filter_quartile=0.8
... )
>>> aggressive_events = aggressive.run()
>>>
>>> print(f"Conservative: {conservative_events.ID.max().compute()} events")
>>> print(f"Aggressive: {aggressive_events.ID.max().compute()} events")

Using absolute area filtering instead of percentile-based:

>>> # Filter objects smaller than 25 grid cells
>>> absolute_tracker = marEx.tracker(
...     extreme_events, mask, R_fill=8, area_filter_absolute=25
... )
>>> absolute_events = absolute_tracker.run()
>>>
>>> # Default behavior (area_filter_quartile=0.5) when no parameters provided
>>> default_tracker = marEx.tracker(extreme_events, mask, R_fill=8)
>>> default_events = default_tracker.run()  # Uses quartile=0.5 filtering

Using physical cell areas for structured grids:

>>> # Load data with irregular grid cell areas
>>> grid_areas = xr.open_dataset('grid_areas.nc').cell_area  # (lat, lon) in m²
>>>
>>> # Track events using physical areas instead of cell counts
>>> physical_tracker = marEx.tracker(
...     extreme_events,
...     mask,
...     R_fill=8,
...     area_filter_quartile=0.5,
...     cell_areas=grid_areas  # Physical areas in m²
... )
>>> events = physical_tracker.run()
>>> # Now events.area contains physical areas in m² instead of cell counts

Integration with full marEx workflow:

>>> # Complete workflow from raw data to tracked events
>>> raw_sst = xr.open_dataset('sst_data.nc', chunks={}).sst.chunk({'time': 30})
>>>
>>> # Step 1: Preprocess to identify extremes
>>> processed = marEx.preprocess_data(raw_sst, threshold_percentile=95)
>>>
>>> # Step 2: Track extreme events
>>> tracker = marEx.tracker(
...     processed.extreme_events,
...     processed.mask,
...     R_fill=8,
...     area_filter_quartile=0.5
... )
>>> tracked_events = tracker.run()

Initialise the tracker with parameters and data.

__init__(data_bin, mask, R_fill, area_filter_quartile=None, area_filter_absolute=None, temp_dir=None, T_fill=2, allow_merging=True, nn_partitioning=False, overlap_threshold=0.5, unstructured_grid=False, dimensions=None, coordinates=None, neighbours=None, cell_areas=None, grid_resolution=None, max_iteration=40, checkpoint=None, debug=0, verbose=None, quiet=None, regional_mode=False, coordinate_units=None)[source]

Initialise the tracker with parameters and data.

Parameters:
Return type:

None

run(return_merges=False, checkpoint=None)[source]

Run the complete object identification and tracking pipeline.

This method executes the full workflow:

  1. Preprocessing: morphological operations and size filtering

  2. Identification and tracking of objects through time

  3. Computing and attaching statistics to the results

Parameters:
  • return_merges (bool, default=False) – If True, return merge events dataset alongside the main events

  • checkpoint (str, optional) – Override the instance checkpoint setting

Returns:

  • events_ds (xarray.Dataset) – Dataset containing tracked events and their properties

  • merges_ds (xarray.Dataset, optional) – Dataset with merge event information (only if return_merges=True)

Return type:

Dataset | Tuple[Dataset, Dataset]

run_preprocess(checkpoint=None)[source]

Preprocess binary data to prepare for tracking.

This performs morphological operations to fill holes/gaps in both space and time, then filters small objects according to the area_filter_quartile or area_filter_absolute.

Parameters:

checkpoint (str, optional) – Checkpoint strategy override

Returns:

  • data_bin_filtered (xarray.DataArray) – Preprocessed binary data

  • object_stats (tuple) – Statistics about the preprocessing

Return type:

Tuple[DataArray, Tuple[float, int, int, float, float, float]]

run_tracking(data_bin_preprocessed)[source]

Track objects through time to identify events.

Parameters:

data_bin_preprocessed (xarray.DataArray) – Preprocessed binary data

Returns:

  • events_ds (xarray.Dataset) – Dataset containing tracked events

  • merges_ds (xarray.Dataset) – Dataset with merge information

  • N_events_final (int) – Final number of unique events

Return type:

Tuple[Dataset, Dataset, int]

run_stats_attributes(events_ds, merges_ds, object_stats, N_events_final)[source]

Add statistics and attributes to the events dataset.

Parameters:
  • events_ds (xarray.Dataset) – Dataset containing tracked events

  • merges_ds (xarray.Dataset) – Dataset with merge information

  • object_stats (tuple) – Preprocessed object statistics

  • N_events_final (int) – Final number of events

Returns:

events_ds – Dataset with added statistics and attributes

Return type:

xarray.Dataset

compute_area(data_bin)[source]

Compute the total area of binary data at each time.

Parameters:

data_bin (xarray.DataArray) – Binary data

Returns:

area – Total area at each time (units: pixels for structured grid, matching cell_area for unstructured)

Return type:

xarray.DataArray

fill_holes(data_bin, R_fill=None)[source]

Fill holes and gaps using morphological operations.

This performs closing (dilation followed by erosion) to fill small gaps, then opening (erosion followed by dilation) to remove small isolated objects.

Parameters:
  • data_bin (xarray.DataArray) – Binary data to process

  • R_fill (int, optional) – Fill radius override

Returns:

data_bin_filled – Binary data with holes/gaps filled

Return type:

xarray.DataArray

fill_time_gaps(data_bin)[source]

Fill temporal gaps between objects.

Performs binary closing (dilation then erosion) along the time dimension to fill small time gaps between objects.

Parameters:

data_bin (xarray.DataArray) – Binary data to process

Returns:

data_bin_filled – Binary data with temporal gaps filled

Return type:

xarray.DataArray

refresh_dask_graph(data_bin)[source]

Clear and reset the Dask graph via save/load cycle.

This is needed to work around a memory leak bug in Dask where “Unmanaged Memory” builds up within loops.

Parameters:

data_bin (xarray.DataArray) – Data to refresh

Returns:

data_new – Data with fresh Dask graph

Return type:

xarray.DataArray

filter_small_objects(data_bin)[source]

Remove objects smaller than a threshold area.

Parameters:

data_bin (xarray.DataArray) – Binary data to filter

Returns:

  • data_bin_filtered (xarray.DataArray) – Binary data with small objects removed

  • area_threshold (float) – Area threshold used for filtering

  • object_areas (xarray.DataArray) – Areas of all objects pre-filtering

  • N_objects_prefiltered (int) – Number of objects before filtering

  • N_objects_filtered (int) – Number of objects after filtering

Return type:

Tuple[DataArray, float, DataArray, int, int]

identify_objects(data_bin, time_connectivity)[source]

Identify connected regions in binary data.

Parameters:
  • data_bin (xarray.DataArray) – Binary data to identify objects in

  • time_connectivity (bool) – Whether to connect objects across time

Returns:

  • object_id_field (xarray.DataArray) – Field of integer IDs for each object

  • None (NoneType) – Placeholder for compatibility with track_objects

  • N_objects (int) – Number of objects identified

Return type:

Tuple[DataArray, None, int]

calculate_centroid(binary_mask, original_centroid=None)[source]

Calculate object centroid, handling edge cases for periodic boundaries.

Parameters:
  • binary_mask (numpy.ndarray) – 2D binary array where True indicates the object (dimensions are (y,x))

  • original_centroid (tuple, optional) – (y_centroid, x_centroid) from regionprops_table

Returns:

(y_centroid, x_centroid)

Return type:

tuple

calculate_object_properties(object_id_field, properties=None)[source]

Calculate properties of objects from ID field.

Parameters:
  • object_id_field (xarray.DataArray) – Field containing object IDs

  • properties (list, optional) – List of properties to calculate (defaults to [‘label’, ‘area’])

Returns:

object_props – Dataset containing calculated properties with ‘ID’ dimension

Return type:

xarray.Dataset

check_overlap_slice(ids_t0, ids_next)[source]

Find overlapping objects between two consecutive time slices.

Parameters:
Returns:

Array of shape (n_overlaps, 3) with [id_t0, id_next, overlap_area]

Return type:

numpy.ndarray

find_overlapping_objects(object_id_field)[source]

Find all overlapping objects across time.

Parameters:

object_id_field (xarray.DataArray) – Field containing object IDs

Returns:

overlap_objects_list_unique_filtered – Array of object ID pairs that overlap across time, with overlap area The object in the first column precedes the second column in time. The third column contains:

  • For structured grid: number of overlapping pixels (int32)

  • For unstructured grid: total overlapping area in m^2 (float32)

Return type:

(N x 3) numpy.ndarray

enforce_overlap_threshold(overlap_objects_list, object_props)[source]

Filter object pairs based on overlap threshold.

Parameters:
  • overlap_objects_list ((N x 3) numpy.ndarray) – Array of object ID pairs with overlap area

  • object_props (xarray.Dataset) – Object properties including area

Returns:

overlap_objects_list_filtered – Filtered array of object ID pairs that meet the overlap threshold

Return type:

(M x 3) numpy.ndarray

consolidate_object_ids(data_t_minus_2, data_t_minus_1, object_props, timestep)[source]

Consolidate object IDs between t-2 and t-1 to ensure consistent tracking.

This identifies objects at t-1 that are actually continuations of objects from t-2 (but got different IDs due to partitioning) and renames them to maintain consistent IDs across timesteps.

Parameters:
  • data_t_minus_2 (xr.DataArray) – Object field at timestep t-2

  • data_t_minus_1 (xr.DataArray) – Object field at timestep t-1 (will be modified)

  • object_props (xr.Dataset) – Object properties dataset (will be modified)

  • timestep (int) – Current timestep number for logging purposes

Returns:

  • data_t_minus_1_consolidated (xr.DataArray) – Updated t-1 field with consolidated IDs

  • object_props_updated (xr.Dataset) – Updated object properties with merged/deleted objects

Return type:

Tuple[DataArray, Dataset]

Notes

  • Uses self.overlap_threshold for determining consolidation eligibility

  • Updates object properties by recalculating for consolidated objects

  • Removes redundant child objects from object_props

compute_id_time_dict(da, child_objects, max_objects, all_objects=True)[source]

Generate lookup table mapping object IDs to their time index.

Parameters:
  • da (xarray.DataArray) – Field of object IDs

  • child_objects (list or array) – Object IDs to include in the dictionary

  • max_objects (int) – Maximum number of objects

  • all_objects (bool, default=True) – Whether to process all objects or just child_objects

Returns:

time_index_map – Dictionary mapping object IDs to time indices

Return type:

dict

track_objects(data_bin)[source]

Track objects through time to form events.

This is the main tracking method that handles splitting and merging of objects.

Parameters:

data_bin (xarray.DataArray) – Preprocessed binary data: Field of globally unique integer IDs of each element in connected regions. ID = 0 indicates no object.

Returns:

  • split_merged_events_ds (xarray.Dataset) – Dataset containing tracked events

  • merge_events (xarray.Dataset) – Dataset with merge information

  • N_events (int) – Final number of events

Return type:

Tuple[Dataset, Dataset, int]

cluster_rename_objects_and_props(object_id_field_unique, object_props, overlap_objects_list, merge_events)[source]

Cluster the object pairs and relabel to determine final event IDs.

Parameters:
  • object_id_field_unique (xarray.DataArray) – Field of unique object IDs. IDs must not be repeated across time.

  • object_props (xarray.Dataset) – Properties of each object that also need to be relabeled.

  • overlap_objects_list ((N x 2) numpy.ndarray) – Array of object ID pairs that indicate which objects are in the same event. The object in the first column precedes the second column in time.

  • merge_events (xarray.Dataset) – Information about merge events

Returns:

split_merged_events_ds – Dataset with relabeled events and their properties. ID = 0 indicates no object.

Return type:

xarray.Dataset

split_and_merge_objects(object_id_field_unique, object_props)[source]

Implement object splitting and merging logic.

This identifies and processes cases where objects split or merge over time, creating new object IDs as needed.

Parameters:
  • object_id_field_unique (xarray.DataArray) – Field of unique object IDs. IDs are required to be monotonically increasing with time.

  • object_props (xarray.Dataset) – Properties of each object

Returns:

(object_id_field, object_props, overlap_objects_list, merge_events)

Return type:

tuple

split_and_merge_objects_parallel(object_id_field_unique, object_props)[source]

Optimised parallel implementation of object splitting and merging.

This version is specifically designed for unstructured grids with more efficient memory handling and better parallelism than the standard split_and_merge_objects method. It processes data in chunks, handles merging events, and efficiently updates object IDs.

Parameters:
Returns:

(object_id_field, object_props, overlap_objects_list, merge_events)

Return type:

tuple

marEx.regional_tracker(data_bin, mask, coordinate_units, R_fill, area_filter_quartile=None, area_filter_absolute=None, **kwargs)[source]

Create a tracker instance configured for regional (non-global) data.

This is a convenience function that automatically sets regional_mode=True and requires explicit specification of coordinate units, since auto-detection may fail for regional coordinate ranges.

Parameters:
  • data_bin (xr.DataArray) – Binary data to identify and track objects in (True = object, False = background)

  • mask (xr.DataArray) – Binary mask indicating valid regions (True = valid, False = invalid)

  • coordinate_units ({'degrees', 'radians'}) – Units of the coordinate system. Must be specified for regional data.

  • R_fill (int or float) – Radius for filling holes/gaps in spatial domain (in grid cells)

  • area_filter_quartile (float, optional) – Quantile (0-1) for filtering smallest objects (e.g., 0.25 removes smallest 25%). Mutually exclusive with area_filter_absolute. Default is 0.5 if neither parameter is provided.

  • area_filter_absolute (int, optional) – The minimum area (in grid cells) for an object to be retained. Mutually exclusive with area_filter_quartile.

  • **kwargs – Additional parameters passed to the tracker class

Returns:

Configured tracker instance with regional_mode=True

Return type:

tracker

Examples

Track events in regional Mediterranean Sea data:

>>> import marEx
>>> # For regional data with degree coordinates
>>> regional_tracker = marEx.regional_tracker(
...     extreme_events,
...     mask,
...     coordinate_units='degrees',
...     R_fill=5,
...     area_filter_quartile=0.3
... )
>>> events = regional_tracker.run()

Track events in regional data with radian coordinates:

>>> # For model output with radian coordinates
>>> regional_tracker = marEx.regional_tracker(
...     extreme_events,
...     mask,
...     coordinate_units='radians',
...     R_fill=8,
...     area_filter_quartile=0.5
... )
>>> events = regional_tracker.run()

Using absolute area filtering in regional mode:

>>> # Keep only features larger than 15 grid cells
>>> absolute_regional = marEx.regional_tracker(
...     extreme_events,
...     mask,
...     coordinate_units='degrees',
...     R_fill=5,
...     area_filter_absolute=15
... )
>>> events = absolute_regional.run()

Visualisation

class marEx.PlotConfig(title=None, var_units='', issym=False, cmap=None, cperc=None, clim=None, show_colorbar=True, grid_lines=True, grid_labels=False, dimensions=None, coordinates=None, norm=None, plot_IDs=False, extend='both', verbose=None, quiet=None, projection=None, framerate=10)[source]

Bases: object

Configuration class for plot parameters

Parameters:
title

Plot title

Type:

str | None

var_units

Variable units for colorbar label

Type:

str

issym

Whether data is symmetric (centers colormap at 0)

Type:

bool

cmap

Colormap name or ListedColormap object

Type:

str | matplotlib.colors.ListedColormap | None

cperc

Percentile range for automatic color limits [min, max]

Type:

List[int]

clim

Manual color limits (vmin, vmax)

Type:

Tuple[float, float] | None

show_colorbar

Whether to display colorbar

Type:

bool

grid_lines

Whether to display grid lines

Type:

bool

grid_labels

Whether to display grid labels

Type:

bool

dimensions

Mapping of conceptual to actual dimension names

Type:

Dict[str, str]

coordinates

Mapping of conceptual to actual coordinate names

Type:

Dict[str, str]

norm

Custom normalization (BoundaryNorm or Normalize)

Type:

matplotlib.colors.BoundaryNorm | matplotlib.colors.Normalize | None

plot_IDs

Whether to plot object IDs with random colors

Type:

bool

extend

Colorbar extension (‘neither’, ‘both’, ‘min’, ‘max’)

Type:

str

verbose

Enable verbose logging

Type:

bool | None

quiet

Enable quiet logging

Type:

bool | None

projection

Cartopy projection for map plots

Type:

Any | None

framerate

Frames per second for animations (default 10)

Type:

int

title: str | None = None
var_units: str = ''
issym: bool = False
cmap: str | ListedColormap | None = None
cperc: List[int] = None
clim: Tuple[float, float] | None = None
show_colorbar: bool = True
grid_lines: bool = True
grid_labels: bool = False
dimensions: Dict[str, str] = None
coordinates: Dict[str, str] = None
norm: BoundaryNorm | Normalize | None = None
plot_IDs: bool = False
extend: str = 'both'
verbose: bool | None = None
quiet: bool | None = None
projection: Any | None = None
framerate: int = 10
__post_init__()[source]

Initialise default values and configure logging.

Return type:

None

__init__(title=None, var_units='', issym=False, cmap=None, cperc=None, clim=None, show_colorbar=True, grid_lines=True, grid_labels=False, dimensions=None, coordinates=None, norm=None, plot_IDs=False, extend='both', verbose=None, quiet=None, projection=None, framerate=10)
Parameters:
Return type:

None

marEx.specify_grid(grid_type=None, fpath_tgrid=None, fpath_ckdtree=None)[source]

Set the global grid specification that will be used by all plotters.

Parameters:
  • grid_type (str | None) – str, either ‘gridded’ or ‘unstructured’. If specified, this will be used as the primary method to determine grid type.

  • fpath_tgrid (str | Path | None) – Path to the triangulation grid file

  • fpath_ckdtree (str | Path | None) – Path to the pre-computed KDTree indices directory

Return type:

None

Exception Hierarchy

exception marEx.MarExError(message, details=None, suggestions=None, error_code=None, context=None)[source]

Bases: Exception

Base exception class for all MarEx-specific errors.

This is the root of the MarEx exception hierarchy and provides common functionality for all marEx exceptions including:

  • Structured error context

  • Exception chaining support

  • Consistent error formatting

Parameters:
  • message (str) – Primary error message describing what went wrong

  • details (str, optional) – Additional technical details about the error

  • suggestions (list of str, optional) – Actionable suggestions for resolving the error

  • error_code (str, optional) – Structured error code for programmatic handling

  • context (dict, optional) – Additional context information (e.g., parameter values, data shapes)

Initialise the Error.

__init__(message, details=None, suggestions=None, error_code=None, context=None)[source]

Initialise the Error.

Parameters:
add_suggestion(suggestion)[source]

Add an additional suggestion for resolving the error.

Parameters:

suggestion (str)

Return type:

None

add_context(key, value)[source]

Add additional context information.

Parameters:
Return type:

None

exception marEx.DataValidationError(message, details=None, suggestions=None, error_code='DATA_VALIDATION', context=None)[source]

Bases: MarExError

Raise exception for input data validation issues.

This exception covers problems with input data structure, format, content, or compatibility with marEx processing requirements.

Common scenarios:

  • Non-Dask arrays when Dask is required

  • Missing required coordinates or dimensions

  • Invalid data types or ranges

  • Incompatible chunking strategies

  • Malformed input datasets

Examples

>>> raise DataValidationError(
...     "Input DataArray must be Dask-backed",
...     details="Found numpy array, but marEx requires chunked Dask arrays",
...     suggestions=["Use da.chunk() to convert to Dask array",
                      "Load data with dask chunking: xr.open_dataset(...).chunk()"],
...     context={"data_type": type(data), "shape": data.shape}
... )

Initialise the Error.

Parameters:
__init__(message, details=None, suggestions=None, error_code='DATA_VALIDATION', context=None)[source]

Initialise the Error.

Parameters:
exception marEx.CoordinateError(message, details=None, suggestions=None, error_code='COORDINATE_ERROR', context=None)[source]

Bases: MarExError

Raise exception for coordinate system problems.

This exception handles issues with geographic coordinates including unit mismatches, invalid ranges, missing coordinate information, and coordinate system inconsistencies.

Common scenarios:

  • Latitude/longitude values outside valid ranges

  • Unit mismatches (degrees vs radians)

  • Missing coordinate dimensions

  • Inconsistent coordinate systems between datasets

  • Auto-detection failures

Initialise the Error.

Parameters:
__init__(message, details=None, suggestions=None, error_code='COORDINATE_ERROR', context=None)[source]

Initialise the Error.

Parameters:
exception marEx.ProcessingError(message, details=None, suggestions=None, error_code='PROCESSING_ERROR', context=None)[source]

Bases: MarExError

Raise exception for computational and algorithmic issues.

This exception covers problems that occur during data processing, including numerical computation errors, algorithm convergence issues, and memory/performance problems.

Common scenarios:

  • Insufficient memory for computation

  • Numerical instability or overflow

  • Algorithm convergence failures

  • Chunking strategy problems

  • Dask computation errors

Initialise the Error.

Parameters:
__init__(message, details=None, suggestions=None, error_code='PROCESSING_ERROR', context=None)[source]

Initialise the Error.

Parameters:
exception marEx.ConfigurationError(message, details=None, suggestions=None, error_code='CONFIGURATION_ERROR', context=None)[source]

Bases: MarExError

Raise exception for parameter and setup issues.

This exception handles problems with function parameters, configuration settings, and setup requirements that prevent proper operation.

Common scenarios:

  • Invalid parameter values or combinations

  • Missing required configuration

  • Incompatible parameter settings

  • Environment setup issues

Examples

>>> raise ConfigurationError(
...     "Invalid threshold percentile value",
...     details="threshold_percentile must be between 0 and 100",
...     suggestions=["Use percentile value between 50-99 for extreme events",
                      "Common values: 90 (moderate), 95 (strong), 99 (severe)"],
...     context={"provided_value": 150, "valid_range": [0, 100]}
... )

Initialise the Error.

Parameters:
__init__(message, details=None, suggestions=None, error_code='CONFIGURATION_ERROR', context=None)[source]

Initialise the Error.

Parameters:
exception marEx.DependencyError(message, details=None, suggestions=None, error_code='DEPENDENCY_ERROR', context=None)[source]

Bases: MarExError

Raise exception for missing or incompatible dependencies.

This exception handles issues with optional or required dependencies that are missing, incompatible, or incorrectly configured.

Common scenarios:

  • Missing optional dependencies (JAX, ffmpeg)

  • Version incompatibilities

  • Import failures

  • System dependency issues

Examples

>>> raise DependencyError(
...     "JAX acceleration not available",
...     details="JAX package not found or incompatible version",
...     suggestions=["Install JAX: pip install marEx[full]",
                      "Check CUDA compatibility for GPU acceleration",
                      "Processing will continue with NumPy backend"],
...     context={"requested_feature": "GPU acceleration", "available": False}
... )

Initialise the Error.

Parameters:
__init__(message, details=None, suggestions=None, error_code='DEPENDENCY_ERROR', context=None)[source]

Initialise the Error.

Parameters:
exception marEx.TrackingError(message, details=None, suggestions=None, error_code='TRACKING_ERROR', context=None)[source]

Bases: MarExError

Raise exception for object tracking and identification issues.

This exception covers problems specific to the tracking module including binary object identification, temporal linking, and merge/split handling.

Common scenarios:

  • Invalid binary input data

  • Tracking parameter conflicts

  • Temporal continuity issues

  • Memory overflow during tracking

  • Checkpoint/resume failures

Examples

>>> raise TrackingError(
...     "Tracking failed due to excessive memory usage",
...     details="Event fragmentation created >100,000 objects per timestep",
...     suggestions=["Increase area_filter_quartile to remove small events",
                      "Apply stronger spatial smoothing before tracking",
                      "Consider processing shorter time periods"],
...     context={"objects_per_timestep": 150000, "memory_limit_gb": 32}
... )

Initialise the Error.

Parameters:
__init__(message, details=None, suggestions=None, error_code='TRACKING_ERROR', context=None)[source]

Initialise the Error.

Parameters:
exception marEx.VisualisationError(message, details=None, suggestions=None, error_code='VISUALISATION_ERROR', context=None)[source]

Bases: MarExError

Raise exception for plotting and visualisation problems.

This exception handles issues with the plotX visualisation system including matplotlib configuration, cartopy projections, and animation generation.

Common scenarios:

  • Missing plotting dependencies

  • Cartopy projection issues

  • Invalid plot configuration

  • Animation encoding failures

  • Grid type detection problems

Examples

>>> raise VisualisationError(
...     "Animation creation failed",
...     details="ffmpeg encoder not found for MP4 generation",
...     suggestions=["Install ffmpeg system package",
                      "Use alternative format: save_format='gif'",
                      "Install ffmpeg via conda: conda install ffmpeg"],
...     context={"requested_format": "mp4", "available_encoders": ["png", "gif"]}
... )

Initialise the Error.

Parameters:
__init__(message, details=None, suggestions=None, error_code='VISUALISATION_ERROR', context=None)[source]

Initialise the Error.

Parameters:
marEx.create_data_validation_error(message, data_info=None, **kwargs)[source]

Create DataValidationError with common data context.

Parameters:
  • message (str) – Error message

  • data_info (dict, optional) – Dictionary with data information (type, shape, dtype, etc.)

  • **kwargs – Additional arguments passed to DataValidationError

Returns:

Configured exception with data context

Return type:

DataValidationError

marEx.create_coordinate_error(message, coordinate_ranges=None, detected_system=None, **kwargs)[source]

Create CoordinateError with coordinate context.

Parameters:
  • message (str) – Error message

  • coordinate_ranges (dict, optional) – Dictionary with coordinate ranges (e.g., {‘lat’: (-90, 90), ‘lon’: (0, 360)})

  • detected_system (str, optional) – Auto-detected coordinate system

  • **kwargs – Additional arguments passed to CoordinateError

Returns:

Configured exception with coordinate context

Return type:

CoordinateError

marEx.create_processing_error(message, computation_info=None, **kwargs)[source]

Create ProcessingError with computation context.

Parameters:
  • message (str) – Error message

  • computation_info (dict, optional) – Dictionary with computation information (memory usage, chunk sizes, etc.)

  • **kwargs – Additional arguments passed to ProcessingError

Returns:

Configured exception with computation context

Return type:

ProcessingError

marEx.wrap_exception(original_exception, message=None, marex_exception_type=None)[source]

Wrap a generic exception in an appropriate MarEx exception.

This function helps maintain backward compatibility while migrating to the new exception hierarchy by wrapping generic exceptions in appropriate MarEx-specific exceptions.

Parameters:
  • original_exception (Exception) – The original exception to wrap

  • message (str, optional) – Custom message (uses original message if not provided)

  • marex_exception_type (type, optional) – Specific MarEx exception type to use

Returns:

Wrapped exception with original as cause

Return type:

MarExError

Logging System

marEx.configure_logging(level=None, format_str=None, date_format=None, log_file=None, console_output=True, disable_external_loggers=True, verbose=None, quiet=None)[source]

Configure logging for the MarEx package.

Parameters:
  • level (int | str | None) – Logging level (DEBUG, INFO, WARNING, ERROR, CRITICAL)

  • format_str (str | None) – Custom log format string

  • date_format (str | None) – Date format for timestamps

  • log_file (str | Path | None) – Optional file path for logging output

  • console_output (bool) – Whether to log to console

  • disable_external_loggers (bool) – Whether to suppress noisy external library logs

  • verbose (bool | None) – Enable verbose mode (detailed logging with DEBUG level)

  • quiet (bool | None) – Enable quiet mode (minimal logging with WARNING level)

Return type:

None

marEx.set_verbose_mode(verbose=True)[source]

Enable or disable verbose logging mode.

Parameters:

verbose (bool) – If True, enables verbose mode with DEBUG level logging

Return type:

None

marEx.set_quiet_mode(quiet=True)[source]

Enable or disable quiet logging mode.

Parameters:

quiet (bool) – If True, enables quiet mode with WARNING level logging

Return type:

None

marEx.set_normal_logging()[source]

Reset logging to normal mode (INFO level).

Return type:

None

marEx.get_verbosity_level()[source]

Get the current verbosity level.

Returns:

‘quiet’, ‘normal’, or ‘verbose’

Return type:

Current verbosity level

marEx.is_verbose_mode()[source]

Check if verbose mode is currently enabled.

Return type:

bool

marEx.is_quiet_mode()[source]

Check if quiet mode is currently enabled.

Return type:

bool

marEx.get_logger(name='marEx')[source]

Get a configured logger instance for MarEx.

Parameters:

name (str) – Logger name, typically the module name

Returns:

Configured logger instance

Return type:

Logger

Dependency Management

marEx.has_dependency(dep_name)[source]

Check if a dependency is available.

Parameters:

dep_name (str)

Return type:

bool

marEx.print_dependency_status()[source]

Print status of all tracked dependencies.

Return type:

None

marEx.get_installation_profile()[source]

Get the current installation profile.

Return type:

str