API Reference
This page provides a comprehensive reference for all public functions, classes, and modules in marEx.
Main Package
MarEx: Marine Extremes Detection and Tracking
A Python package for efficient identification and tracking of marine extremes such as Marine Heatwaves (MHWs).
Core Functionality
detect: Convert raw time series into standardised anomalies
track: Identify and track extreme events through time
Example
>>> import xarray as xr
>>> import marEx
>>> # Load SST data
>>> sst = xr.open_dataset('sst_data.nc').sst
>>> # Preprocess data to identify extreme events
>>> extreme_events_ds = marEx.preprocess_data(sst, threshold_percentile=95)
>>> # Track events through time
>>> events_ds = marEx.tracker(extreme_events_ds.extreme_events, extreme_events_ds.mask,
... R_fill=8, area_filter_quartile=0.5).run()
- marEx.preprocess_data(da, method_anomaly='shifting_baseline', method_extreme='hobday_extreme', threshold_percentile=95, window_year_baseline=15, smooth_days_baseline=21, window_days_hobday=11, window_spatial_hobday=None, std_normalise=False, detrend_orders=None, force_zero_mean=True, reference_period=None, method_percentile='approximate', precision=0.01, max_anomaly=5.0, dask_chunks=None, dimensions=None, coordinates=None, neighbours=None, cell_areas=None, use_temp_checkpoints=False, verbose=None, quiet=None)[source]
Complete preprocessing pipeline for marine extreme event identification.
Supports separate methods for anomaly computation and extreme identification:
Anomaly Methods:
‘detrend_harmonic’: Detrending with harmonics and polynomials – more efficient, but biases statistics
‘shifting_baseline’: Rolling climatology using previous window_year_baseline years – more “correct”, but shortens time series by window_year_baseline years
‘fixed_baseline’: Daily climatology using full time series – does _not_ remove climate trends !
‘detrend_fixed_baseline’: Polynomial detrending followed by fixed daily climatology – keeps full time-series of data, but does not account for trends in the timing of seasonal transitions (which may appear as extremes)
Extreme Methods:
‘global_extreme’: Global-in-time threshold value
‘hobday_extreme’: Local day-of-year specific thresholds with windowing
- Parameters:
da (xarray.DataArray) – Raw input data
method_anomaly (str, default='shifting_baseline') – Anomaly computation method (‘detrend_harmonic’, ‘shifting_baseline’, ‘fixed_baseline’, or ‘detrend_fixed_baseline’).
method_extreme (str, default='hobday_extreme') – Extreme identification method (‘global_extreme’ or ‘hobday_extreme’).
threshold_percentile (float, default=95) – Percentile threshold for extreme event detection.
window_year_baseline (int, default=15) – Number of previous years for rolling climatology (shifting_baseline method only).
smooth_days_baseline (int, default=21) – Days for smoothing rolling climatology (shifting_baseline method only).
window_days_hobday (int, default=11) – Window size for day-of-year threshold calculation (hobday_extreme method only).
window_spatial_hobday (int, default=None) – Spatial window size (2D centred window) for the day-of-year threshold calculation (hobday_extreme method only).
std_normalise (bool, default=False) – Whether to standardise anomalies by rolling standard deviation (detrend_harmonic only).
detrend_orders (list, default=[1]) – Polynomial orders for detrending (detrend_harmonic method only). Default is 1st order (linear) detrend. [1,2] e.g. would use a linear+quadratic detrending.
force_zero_mean (bool, default=True) – Whether to enforce zero mean in detrended anomalies (detrend_harmonic method only).
reference_period (tuple of (int, int), optional) – Year range (start_year, end_year) inclusive for computing the daily climatology (fixed_baseline and detrend_fixed_baseline only). If None (default), uses all available years. Anomalies are computed for the full time series regardless. Example: reference_period=(1990, 2020) computes the climatology from 1990-2020 but outputs anomalies for the entire input time range.
method_percentile (str, default='approximate') – Method for percentile calculation (‘exact’ or ‘approximate’) for both global_extreme & hobday_extreme methods. N.B.: Using the exact percentile calculation requires both careful/thoughtful chunking & sufficient memory, in consideration of the limitations inherent to distributed parallel I/O & processing.
precision (float, default=0.01) – Precision for histogram bins in approximate percentile method.
max_anomaly (float, default=5.0) – Maximum anomaly value for histogram binning in the approximate percentile method.
dask_chunks (dict, optional) – Chunking specification for distributed computation.
dimensions (dict, default={"time": "time", "x": "lon", "y": "lat"}) – Mapping of dimensions to names in the data.
coordinates (dict, optional) – Mapping of coordinates to names in the data. Defaults to dimensions mapping.
neighbours (xarray.DataArray, optional) – Neighbour connectivity for spatial clustering.
cell_areas (xarray.DataArray, optional) – Cell areas for weighted spatial statistics.
use_temp_checkpoints (bool, default=False) – Enable checkpointing to temporary zarr stores to break Dask graph dependencies. When True, intermediate results (anomalies, thresholds, extremes) are saved to temporary zarr files and immediately reloaded, preventing expensive recomputations. Recommended for large datasets on HPC systems where the 2D histogram computation is expensive. Temporary files are automatically cleaned up after reloading.
verbose (bool, default=None) – Enable verbose logging with detailed progress information. If None, uses current global logging configuration.
quiet (bool, default=None) – Enable quiet logging with minimal output (warnings and errors only). If None, uses current global logging configuration. Note: quiet takes precedence over verbose if both are True.
- Returns:
Processed dataset with anomalies and extreme event identification
- Return type:
Examples
Basic usage with gridded SST data for marine heatwave detection:
>>> import xarray as xr >>> import marEx >>> >>> # Load and chunk SST data >>> sst = xr.open_dataset('sst_data.nc', chunks={}).sst.chunk({'time': 30}) >>> >>> # Basic preprocessing with default shifting baseline method >>> result = marEx.preprocess_data(sst, threshold_percentile=90) >>> print(result) <xarray.Dataset> Dimensions: (time: 1461, lat: 180, lon: 360) Data variables: dat_anomaly (time, lat, lon) float32 dask.array<chunksize=(30, 180, 360)> mask (lat, lon) bool dask.array<chunksize=(180, 360)> extreme_events (time, lat, lon) bool dask.array<chunksize=(30, 180, 360)> thresholds (lat, lon) float32 dask.array<chunksize=(180, 360)>
>>> # Check which locations have extreme events >>> print(f"Total extreme events: {result.extreme_events.sum().compute()}") Total extreme events: 15847
Using shifting baseline method for more accurate climatology:
>>> # Requires at least 15 years of data by default >>> result_shifting = marEx.preprocess_data( ... sst, ... method_anomaly="shifting_baseline", ... window_year_baseline=10, # Use shorter window if needed ... smooth_days_baseline=31 # Longer smoothing window ... ) >>> # Note: First 10 years will be removed from output
Using Hobday extreme method with day-of-year specific thresholds:
>>> result_hobday = marEx.preprocess_data( ... sst, ... method_extreme="hobday_extreme", ... window_days_hobday=11, # 11-day window for each day-of-year ... threshold_percentile=95 ... ) >>> print(result_hobday.thresholds.dims) ('dayofyear', 'lat', 'lon')
Previous configuration (marEx v2.0 default) with polynomial detrending and standardisation:
>>> result_advanced = marEx.preprocess_data( ... sst, ... method_anomaly="detrend_harmonic", ... detrend_orders=[1, 2], # Linear and quadratic trends ... std_normalise=True, # Standardise by rolling std ... force_zero_mean=True, ... threshold_percentile=95 ... ) >>> # Result includes both raw and standardised anomalies >>> print('dat_stn' in result_advanced) True
Processing unstructured data:
>>> # For ICON ocean model data >>> icon_sst = xr.open_dataset('icon_sst.nc', chunks={}).to.chunk({'time': 50}) >>> result_unstructured = marEx.preprocess_data( ... icon_sst, ... dimensions={"x": "ncells"}, # Must specify the name of the spatial dimension ... dask_chunks={"time": 50} ... )
Error handling - insufficient data for shifting baseline:
>>> short_data = sst.isel(time=slice(0, 1000)) # Only ~3 years >>> try: ... result = marEx.preprocess_data( ... short_data, ... method_anomaly="shifting_baseline", ... window_year_baseline=15 ... ) ... except ValueError as e: ... print(f"Error: {e}") Error: Insufficient data for shifting_baseline method. Dataset spans 3 years but window_year_baseline requires at least 15 years.
Performance considerations with chunking:
>>> # For large datasets, adjust chunking for memory management >>> large_sst = sst.chunk({"time": 25, "lat": 90, "lon": 180}) >>> result = marEx.preprocess_data( ... large_sst, ... dask_chunks={"time": 25}, ... method_percentile="approximate" # Use approximate method (Default) for long time-series calculations ... )
Integration with tracking workflow:
>>> # Preprocess data then track events >>> processed = marEx.preprocess_data(sst, threshold_percentile=95) >>> tracker = marEx.tracker( ... processed.extreme_events, ... processed.mask, ... R_fill=8, ... area_filter_quartile=0.5 ... ) >>> events = tracker.run() >>> print(f"Identified {events.event.max().compute()} distinct events")
Simple fixed baseline approach:
>>> # Basic daily climatology across all years >>> result_fixed = marEx.preprocess_data( ... sst, ... method_anomaly="fixed_baseline", ... threshold_percentile=95 ... ) >>> # Uses all available data for climatology computation
Combined trend removal and fixed climatology:
>>> # Remove long-term trends then compute daily climatology >>> result_combined = marEx.preprocess_data( ... sst, ... method_anomaly="detrend_fixed_baseline", ... detrend_orders=[1], # Linear trend ... threshold_percentile=95, ... force_zero_mean=True ... ) >>> # Balances trend removal with simple climatology
- marEx.compute_normalised_anomaly(da, method_anomaly='shifting_baseline', dimensions=None, coordinates=None, window_year_baseline=15, smooth_days_baseline=21, std_normalise=False, detrend_orders=None, force_zero_mean=True, reference_period=None, use_temp_checkpoints=False, verbose=None, quiet=None)[source]
Generate normalised anomalies using specified methodology.
- Parameters:
da (xarray.DataArray) – Input data with dimensions matching the ‘dimensions’ parameter
method_anomaly (str, default='shifting_baseline') –
Anomaly computation method. Options: - ‘detrend_harmonic’: Detrending with harmonics and polynomials (efficient, biased) - ‘shifting_baseline’: Rolling climatology (accurate, shortens time series) - ‘fixed_baseline’: Daily climatology using full time series (keeps long-term trends in the anomaly) - ‘detrend_fixed_baseline’: Polynomial detrending + fixed climatology (does not shorten time series,
keeps trends in seasonal timing in the anomaly)
dimensions (dict, optional) – Mapping of conceptual dimensions to actual dimension names in the data
coordinates (dict, optional) – Mapping of conceptual coordinates to actual coordinate names in the data
window_year_baseline (int, default=15) – Number of years for rolling climatology (shifting_baseline only)
smooth_days_baseline (int, default=21) – Days for smoothing rolling climatology (shifting_baseline only)
std_normalise (bool, default=False) – Whether to normalise by 30-day rolling standard deviation (detrend_harmonic only)
detrend_orders (list, default=[1]) – Polynomial orders for trend removal (detrend_harmonic and detrend_fixed_baseline only)
force_zero_mean (bool, default=True) – Explicitly enforce zero mean in final anomalies (detrend_harmonic and detrend_fixed_baseline only)
reference_period (tuple of (int, int), optional) – Year range (start_year, end_year) inclusive for computing the daily climatology (fixed_baseline and detrend_fixed_baseline only). If None (default), uses all available years. Anomalies are computed for the full time series regardless.
use_temp_checkpoints (bool)
verbose (bool | None)
quiet (bool | None)
- Returns:
Dataset containing anomalies, mask, and metadata
- Return type:
Examples
Basic detrended baseline anomaly computation:
>>> import xarray as xr >>> import marEx >>> >>> # Load chunked SST data >>> sst = xr.open_dataset('sst_data.nc', chunks={}).sst.chunk({'time': 30}) >>> >>> # Compute anomalies using shifting baseline (default) >>> result = marEx.compute_normalised_anomaly(sst) >>> print(result.data_vars) Data variables: dat_anomaly (time, lat, lon) float32 dask.array<chunksize=(30, 180, 360)> mask (lat, lon) bool dask.array<chunksize=(180, 360)>
>>> # Check that anomalies have approximately zero mean >>> print(f"Mean anomaly: {result.dat_anomaly.mean().compute():.6f}") Mean anomaly: 0.000023
Previous configuration (marEx v2.0 default) of detrended baseline with higher-order polynomials and standardisation. Note: marEx v3.0+ uses shifting_baseline as the default method:
>>> result_advanced = marEx.compute_normalised_anomaly( ... sst, ... method_anomaly="detrend_harmonic", ... detrend_orders=[1, 2, 3], # Linear, quadratic, cubic trends ... std_normalise=True, # Add standardised anomalies ... force_zero_mean=True ... ) >>> print(result_advanced.data_vars) Data variables: dat_anomaly (time, lat, lon) float32 dask.array<chunksize=(30, 180, 360)> mask (lat, lon) bool dask.array<chunksize=(180, 360)> dat_stn (time, lat, lon) float32 dask.array<chunksize=(30, 180, 360)> STD (dayofyear, lat, lon) float32 dask.array<chunksize=(366, 180, 360)>
>>> # Standardised anomalies have unit variance >>> print(f"STD of standardised anomalies: {result_advanced.dat_stn.std().compute():.3f}")
Accurate shifting baseline method for climate-aware anomalies:
>>> result_shifting = marEx.compute_normalised_anomaly( ... sst, ... method_anomaly="shifting_baseline", ... window_year_baseline=10, # Use 10-year rolling climatology ... smooth_days_baseline=31 # 31-day smoothing window ... ) >>> # Anomalies computed relative to recent past climatology
Processing unstructured data:
>>> # ICON ocean model with ncells dimension >>> icon_data = xr.open_dataset('icon_sst.nc', chunks={}).to.chunk({'time': 25}) >>> result_unstructured = marEx.compute_normalised_anomaly( ... icon_data, ... dimensions={"time": "time", "x": "ncells"} ... coordinates={"time": "time", "x": "lon", "y": "lat"}, ... ) >>> print(result_unstructured.dims) Frozen({'time': 1461, 'ncells': 83886})
Comparison of methods - detrended vs shifting baseline:
>>> # Detrended baseline - faster, slight bias >>> detrended = marEx.compute_normalised_anomaly( ... sst, method_anomaly="detrend_harmonic" ... ) >>> >>> # Shifting baseline - slower, more accurate >>> shifting = marEx.compute_normalised_anomaly( ... sst, method_anomaly="shifting_baseline", ... window_year_baseline=15 ... ) >>> >>> # Compare anomaly magnitudes >>> print(f"Detrended RMS: {detrended.dat_anomaly.std().compute():.3f}") >>> print(f"Shifting RMS: {shifting.dat_anomaly.std().compute():.3f}")
Fixed baseline climatology:
>>> # Use full time series for daily climatology >>> result_fixed = marEx.compute_normalised_anomaly( ... sst, ... method_anomaly="fixed_baseline" ... ) >>> # Climatology computed from all available years
Fixed baseline with a restricted reference period:
>>> # Compute climatology from 1990-2020 only, but output anomalies for all years >>> result_ref = marEx.compute_normalised_anomaly( ... sst, ... method_anomaly="fixed_baseline", ... reference_period=(1990, 2020) ... )
Fixed detrended baseline:
>>> # Remove long-term trends then compute fixed climatology >>> result_fixed_detrended = marEx.compute_normalised_anomaly( ... sst, ... method_anomaly="detrend_fixed_baseline", ... detrend_orders=[1], # Remove linear trend ... force_zero_mean=True ... ) >>> # Combines trend removal with fixed climatology
- marEx.smoothed_rolling_climatology(da, window_year_baseline=15, smooth_days_baseline=21, dimensions=None, coordinates=None, use_temp_checkpoints=False)[source]
Compute a smoothed rolling climatology using the previous window_year_baseline years of data and reassemble it to match the original data structure. Years without enough previous data will be filled with NaN.
- Parameters:
da (xarray.DataArray) – Input data with time coordinate
window_year_baseline (int, default=15) – Number of years to include in each climatology window
smooth_days_baseline (int, default=21) – Number of days for temporal smoothing window
dimensions (dict, optional) – Mapping of dimensions to names in the data
coordinates (dict, optional) – Mapping of coordinates to names in the data
use_temp_checkpoints (bool)
- Returns:
Smoothed rolling climatology with same shape as input data
- Return type:
Examples
Basic smoothed rolling climatology:
>>> import xarray as xr >>> import marEx >>> >>> # Load SST data >>> sst = xr.open_dataset('sst_data.nc', chunks={}).sst.chunk({'time': 30}) >>> >>> # Compute smoothed rolling climatology >>> smooth_clim = marEx.smoothed_rolling_climatology( ... sst, ... window_year_baseline=15, ... smooth_days_baseline=21 ... ) >>> print(smooth_clim.shape) (7305, 180, 360)
Comparing different smoothing windows:
>>> # Short smoothing - more day-to-day variability >>> clim_short = marEx.smoothed_rolling_climatology( ... sst, smooth_days_baseline=7 ... ) >>> >>> # Long smoothing - smoother seasonal cycle >>> clim_long = marEx.smoothed_rolling_climatology( ... sst, smooth_days_baseline=61 ... ) >>> >>> # Compare variability >>> var_short = clim_short.std(dim='time').mean().compute() >>> var_long = clim_long.std(dim='time').mean().compute() >>> print(f"Variability: short={var_short:.3f}, long={var_long:.3f}")
Climatology for anomaly computation:
>>> # Compute smoothed climatology then anomalies >>> climatology = marEx.smoothed_rolling_climatology(sst) >>> anomalies = sst - climatology >>> >>> # Check that anomalies have reasonable properties >>> print(f"Anomaly mean: {anomalies.mean().compute():.6f}") >>> print(f"Anomaly std: {anomalies.std().compute():.3f}")
Unstructured data processing:
>>> # ICON ocean data >>> icon_sst = xr.open_dataset('icon_sst.nc', chunks={}).to.chunk({'time': 25}) >>> icon_smooth_clim = marEx.smoothed_rolling_climatology( ... icon_sst, ... dimensions={"time": "time", "x": "ncells"}, ... coordinates={"time": "time", "x": "lon", "y": "lat"}, ... window_year_baseline=10, ... smooth_days_baseline=31 ... )
Effect of smoothing on seasonal cycle:
>>> # Raw rolling climatology (no temporal smoothing) >>> raw_clim = marEx.rolling_climatology(sst, window_year_baseline=15) >>> >>> # Smoothed rolling climatology >>> smooth_clim = marEx.smoothed_rolling_climatology( ... sst, window_year_baseline=15, smooth_days_baseline=21 ... ) >>> >>> # Compare seasonal cycle smoothness >>> # Extract annual cycle for a point >>> point_raw = raw_clim.isel(lat=90, lon=180).sel(time='2010') >>> point_smooth = smooth_clim.isel(lat=90, lon=180).sel(time='2010') >>> >>> print(f"Raw climatology range: {(point_raw.max() - point_raw.min()).compute():.3f}") >>> print(f"Smooth climatology range: {(point_smooth.max() - point_smooth.min()).compute():.3f}")
Performance considerations:
>>> # Efficient implementation smooths raw data first, then computes climatology >>> # This is more memory-efficient than smoothing the climatology >>> large_sst = sst.chunk({'time': 25, 'lat': 45, 'lon': 90}) >>> efficient_clim = marEx.smoothed_rolling_climatology(large_sst)
- marEx.rolling_climatology(da, window_year_baseline=15, dimensions=None, coordinates=None, use_temp_checkpoints=False)[source]
Compute rolling climatology efficiently using flox cohorts. Uses the previous window_year_baseline years of data and reassemble it to match the original data structure. Years without enough previous data will be filled with NaN.
- Parameters:
da (xarray.DataArray) – Input data with time coordinate
window_year_baseline (int, default=15) – Number of years to include in each climatology window
dimensions (dict, optional) – Mapping of dimensions to names in the data
coordinates (dict, optional) – Mapping of coordinates to names in the data
use_temp_checkpoints (bool)
- Returns:
Rolling climatology with same shape as input data
- Return type:
Examples
Basic rolling climatology computation:
>>> import xarray as xr >>> import marEx >>> >>> # Load 20 years of SST data >>> sst = xr.open_dataset('sst_data.nc', chunks={}).sst.chunk({'time': 30}) >>> >>> # Compute 15-year rolling climatology >>> climatology = marEx.rolling_climatology(sst, window_year_baseline=15) >>> print(climatology.shape) (7305, 180, 360) # Same as input >>> >>> # First 15 years will be NaN (insufficient history) >>> print(f"NaN values in first year: {climatology.isel(time=slice(0, 365)).isnull().all().compute()}") True
Shorter window for datasets with limited time span:
>>> # For datasets with only 10 years, use shorter window >>> short_climatology = marEx.rolling_climatology( ... sst, window_year_baseline=5 ... ) >>> # First 5 years will be NaN instead of 15
Processing unstructured data:
>>> # ICON ocean model data >>> icon_sst = xr.open_dataset('icon_sst.nc', chunks={}).to.chunk({'time': 25}) >>> icon_climatology = marEx.rolling_climatology( ... icon_sst, ... dimensions={"time": "time", "x": "ncells"} ... coordinates={"time": "time", "x": "lon", "y": "lat"} ... ) >>> print(icon_climatology.dims) Frozen({'time': 7305, 'ncells': 83886})
Comparing with fixed climatology:
>>> # Fixed climatology (traditional approach) >>> fixed_clim = sst.groupby(sst.time.dt.dayofyear).mean() >>> >>> # Rolling climatology (adaptive approach) >>> rolling_clim = marEx.rolling_climatology(sst) >>> >>> # Rolling climatology adapts to climate change >>> clim_2000 = rolling_clim.sel(time='2000').mean() >>> clim_2020 = rolling_clim.sel(time='2020').mean() >>> print(f"Climate change signal: {(clim_2020 - clim_2000).compute():.3f} °C")
Memory considerations for large datasets:
>>> # Ensure appropriate chunking for memory efficiency >>> large_sst = sst.chunk({'time': 30, 'lat': 45, 'lon': 90}) >>> large_climatology = marEx.rolling_climatology(large_sst) >>> # Output maintains input chunking structure
- marEx.identify_extremes(da, method_extreme='hobday_extreme', threshold_percentile=95, dimensions=None, coordinates=None, window_days_hobday=11, window_spatial_hobday=None, method_percentile='approximate', precision=0.01, max_anomaly=5.0, use_temp_checkpoints=False, verbose=None, quiet=None)[source]
Identify extreme events exceeding a percentile threshold using specified method.
- Parameters:
da (xarray.DataArray) – DataArray containing anomalies
method_extreme (str, default='hobday_extreme') – Method for threshold calculation (‘global_extreme’ or ‘hobday_extreme’)
threshold_percentile (float, default=95) – Percentile threshold (e.g., 95 for 95th percentile)
dimensions (dict, optional) – Mapping of dimensions to names in the data
coordinates (dict, optional) – Mapping of coordinates to names in the data
window_days_hobday (int, default=11) – Window for day-of-year threshold (hobday_extreme only)
window_spatial_hobday (int, default=None) – Window for day-of-year threshold spatial clustering (hobday_extreme only)
method_percentile (str, default='approximate') – Method for percentile computation (‘exact’ or ‘approximate’)
precision (float, default=0.01) – Precision for histogram bins in approximate method
max_anomaly (float, default=5.0) – Maximum anomaly value for histogram binning
use_temp_checkpoints (bool)
verbose (bool | None)
quiet (bool | None)
- Returns:
Tuple of (extremes, thresholds) where extremes is a boolean array identifying extreme events and thresholds contains the threshold values used
- Return type:
Examples
Basic extreme identification with global thresholds:
>>> import xarray as xr >>> import marEx >>> >>> # Load anomaly data (from compute_normalised_anomaly) >>> anomalies = xr.open_dataset('anomalies.nc', chunks={}).dat_anomaly >>> >>> # Identify extreme events using global-in-time 95th percentile >>> extremes, thresholds = marEx.identify_extremes( ... anomalies, ... method_extreme="global_extreme", ... threshold_percentile=95 ... ) >>> print(f"Extreme events shape: {extremes.shape}") Extreme events shape: (1461, 180, 360) >>> print(f"Thresholds shape: {thresholds.shape}") Thresholds shape: (180, 360)
>>> # Count total extreme events >>> total_extremes = extremes.sum().compute() >>> print(f"Total extreme events: {total_extremes}")
Using day-of-year specific thresholds (cf. Hobday et al. 2016 method):
>>> # More sophisticated threshold calculation >>> extremes_hobday, thresholds_hobday = marEx.identify_extremes( ... anomalies, ... method_extreme="hobday_extreme", ... threshold_percentile=95, ... window_days_hobday=11 # 11-day window around each day-of-year ... window_spatial_hobday=3 # 3x3 spatial window for clustering percentile calcuation ... ) >>> print(f"Hobday thresholds shape: {thresholds_hobday.shape}") Hobday thresholds shape: (366, 180, 360)
>>> # Compare seasonal variation in thresholds >>> summer_threshold = thresholds_hobday.sel(dayofyear=200).mean() >>> winter_threshold = thresholds_hobday.sel(dayofyear=50).mean() >>> print(f"Summer vs Winter thresholds: {summer_threshold:.3f} vs {winter_threshold:.3f}")
Comparison of exact vs approximate percentile methods:
>>> # Approximate method (faster, default) >>> extremes_approx, thresh_approx = marEx.identify_extremes( ... anomalies, method_percentile="approximate" ... ) >>> >>> # Exact method (slower & memory intensive) >>> extremes_exact, thresh_exact = marEx.identify_extremes( ... anomalies, method_percentile="exact" ... ) >>> >>> # Compare threshold precision — ~0.005C >>> threshold_diff = (thresh_exact - thresh_approx).std().compute() >>> print(f"Threshold difference (exact vs approx): {threshold_diff:.6f}")
Different percentile thresholds for varying event rarity:
>>> # Conservative threshold - very extreme events only >>> extremes_98, _ = marEx.identify_extremes( ... anomalies, threshold_percentile=98 ... ) >>> >>> # Moderate threshold - more frequent events >>> extremes_90, _ = marEx.identify_extremes( ... anomalies, threshold_percentile=90 ... ) >>> >>> # Compare event frequency >>> print(f"99th percentile events: {extremes_99.sum().compute()}") >>> print(f"90th percentile events: {extremes_90.sum().compute()}")
Processing unstructured data:
>>> # ICON ocean model data >>> icon_anomalies = xr.open_dataset('icon_anomalies.nc', chunks={}).dat_anomaly >>> extremes_unstructured, thresholds_unstructured = marEx.identify_extremes( ... icon_anomalies, ... dimensions={"time": "time", "x": "ncells"}, ... coordinates={"time": "time", "x": "lon", "y": "lat"}, ... threshold_percentile=95 ... ) >>> print(f"Unstructured extremes shape: {extremes_unstructured.shape}")
Advanced Hobday method with custom temporal window:
>>> # Longer temporal window for smoother thresholds >>> extremes_smooth, thresholds_smooth = marEx.identify_extremes( ... anomalies, ... method_extreme="hobday_extreme", ... window_days_hobday=31, # Longer smoothing window ... threshold_percentile=95 ... ) >>> >>> # Compare threshold smoothness >>> std_11day = thresholds_hobday.std(dim='dayofyear').mean().compute() >>> std_31day = thresholds_smooth.std(dim='dayofyear').mean().compute() >>> print(f"Threshold variability: 11-day={std_11day:.3f}, 31-day={std_31day:.3f}")
- class marEx.tracker(data_bin, mask, R_fill, area_filter_quartile=None, area_filter_absolute=None, temp_dir=None, T_fill=2, allow_merging=True, nn_partitioning=False, overlap_threshold=0.5, unstructured_grid=False, dimensions=None, coordinates=None, neighbours=None, cell_areas=None, grid_resolution=None, max_iteration=40, checkpoint=None, debug=0, verbose=None, quiet=None, regional_mode=False, coordinate_units=None)[source]
Bases:
objectTracker identifies and tracks arbitrary binary objects in spatial data through time.
The tracker supports both structured (regular grid) and unstructured data, and seamlessly handles splitting & merging of objects. It identifies connected regions in binary data at each time step, and tracks these as evolving events through time.
Main workflow:
Preprocessing: Fill spatiotemporal holes, filter small objects
Object identification: Label connected components at each time
Tracking: Determine object correspondences across time
Optional splitting & merging: Handle complex event evolution
- Parameters:
data_bin (xarray.DataArray) – Binary field of extreme points to group, label, and track (True = object, False = background) Must represent and underlying dask array.
mask (xarray.DataArray) – Binary mask indicating valid regions (True = valid, False = invalid)
R_fill (int) – The radius of the kernel used in morphological opening & closing, relating to the largest hole/gap that can be filled. In units of grid cells.
area_filter_quartile (float, optional) – The fraction of the smallest objects to discard, i.e. the quantile defining the smallest area object retained. Quantile must be in (0-1) (e.g., 0.25 removes smallest 25%). Mutually exclusive with area_filter_absolute. Default is 0.5 if neither parameter is provided.
area_filter_absolute (int, optional) – The minimum area (in grid cells) for an object to be retained. Mutually exclusive with area_filter_quartile. Use this for fixed minimum area thresholds (e.g., 10 cells minimum).
temp_dir (str, optional) – Path to temporary directory for storing intermediate results
T_fill (int, default=2) – The permissible temporal gap (in days) between objects for tracking continuity to be maintained (must be even)
allow_merging (bool, default=True) – Allow objects to split and merge across time. Apply splitting & merging criteria, track merge events, and maintain original identities of merged objects across time. N.B.: False reverts to classical ndmeasure.label with simplar time connectivity, i.e. Scannell et al.
nn_partitioning (bool, default=False) – Implement a better partitioning of merged child objects based on closest parent cell. False reverts to using parent centroids to determine partitioning between new child objects, i.e. Di Sun & Bohai Zhang 2023. N.B.: Centroid-based partitioning has major problems with small merging objects suddenly obtaining unrealistically-large (and often disjoint) fractions of the larger object.
overlap_threshold (float, default=0.5) – The fraction of the smaller object’s area that must overlap with the larger object’s area to be considered the same event and continue tracking with the same ID.
unstructured_grid (bool, default=False) – Whether data is on an unstructured grid
dimensions (dict, default={"time": "time", "x": "lon", "y": "lat"}) – Mapping of dimensions to names in the data
coordinates (dict, optional) – Coordinate names for unstructured grids. Should contain ‘x’ and ‘y’ keys for x and y coordinates. May also contain ‘time’ if the time coordinate name is different from the dimension name.
neighbours (xarray.DataArray, optional) – For unstructured grid, indicates connectivity between cells
cell_areas (xarray.DataArray, optional) – For unstructured grid, area of each cell (required). For structured grid, area of each cell (optional). If not provided, defaults to 1.0 for each cell (resulting in cell counts as areas). Note: Overridden by grid_resolution if provided for structured grids.
grid_resolution (float, optional) – Grid resolution in degrees for structured grids only (ignored for unstructured grids). When provided, automatically calculates cell areas using spherical geometry. Overrides any provided cell_areas parameter.
max_iteration (int, default=40) – Maximum number of iterations for merging/splitting algorithm
checkpoint (str, default='None') – Checkpoint strategy (‘save’, ‘load’, or None)
debug (int, default=0) – Debug level (0-2)
verbose (bool, optional) – Enable verbose logging with detailed progress information. If None, uses current global logging configuration.
quiet (bool, optional) – Enable quiet logging with minimal output (warnings and errors only). If None, uses current global logging configuration. Note: quiet takes precedence over verbose if both are True.
regional_mode (bool, default=False) – Enable regional mode for non-global coordinate ranges. When True, coordinate_units must be specified.
coordinate_units (str, optional) – Coordinate units when regional_mode=True. Must be either ‘degrees’ or ‘radians’.
Examples
Basic tracking of marine heatwave events from preprocessed data:
>>> import xarray as xr >>> import marEx >>> >>> # Load preprocessed extreme events data >>> processed = xr.open_dataset('extreme_events.nc', chunks={}) >>> extreme_events = processed.extreme_events # Boolean array >>> mask = processed.mask # Ocean/land mask >>> >>> # Initialise tracker with basic parameters >>> tracker = marEx.tracker( ... extreme_events, ... mask, ... R_fill=8, # Fill holes up to 8 grid cells ... area_filter_quartile=0.5 # Remove smallest 50% of objects ... allow_merging=False # Basic tracking without splitting/merging ... ) >>> >>> # Run tracking algorithm >>> events = tracker.run() >>> print(f"Identified {events.ID.max().compute()} distinct events") Identified 1247 distinct events
Using automatic grid area calculation from resolution:
>>> # For regular lat/lon grids, automatically calculate physical areas >>> grid_tracker = marEx.tracker( ... extreme_events, ... mask, ... R_fill=8, ... area_filter_quartile=0.5, ... grid_resolution=0.25 # Grid resolution in degrees ... ) >>> # Cell areas are calculated automatically using spherical geometry >>> grid_events = grid_tracker.run()
Advanced tracking with merging and splitting enabled:
>>> # More sophisticated tracking with temporal gap filling >>> advanced_tracker = marEx.tracker( ... extreme_events, ... mask, ... R_fill=12, # Larger spatial gap filling ... T_fill=4, # Fill up to 4-day temporal gaps ... area_filter_quartile=0.25, # More aggressive size filtering ... allow_merging=True, # Enable split/merge detection ... overlap_threshold=0.3 # Lower threshold for object linking ... ) >>> >>> events_advanced, merges_log = advanced_tracker.run(return_merges=True) >>> print(events_advanced.data_vars) Data variables: event (time, lat, lon) int32 dask.array<chunksize=(25, 180, 360)> event_centroid (time, lat, lon) int32 dask.array<chunksize=(25, 180, 360)> ID_field (time, lat, lon) int32 dask.array<chunksize=(25, 180, 360)> global_ID (time, ID) int32 dask.array<chunksize=(25, 1247)> area (time, ID) float32 dask.array<chunksize=(25, 1247)> centroid (component, time, ID) float64 dask.array<chunksize=(2, 25, 1247)> presence (time, ID) bool dask.array<chunksize=(25, 1247)> time_start (ID) datetime64[ns] dask.array<chunksize=(1247,)> time_end (ID) datetime64[ns] dask.array<chunksize=(1247,)> merge_ledger (time, ID, sibling_ID) int32 dask.array<chunksize=(25, 1247, 10)>
Processing unstructured ocean model data (ICON):
>>> # Load ICON ocean model data with connectivity >>> icon_data = xr.open_dataset('icon_extremes.nc', chunks={}) >>> icon_extremes = icon_data.extreme_events # (time, ncells) >>> icon_mask = icon_data.mask >>> neighbours = icon_data.neighbours # Cell connectivity >>> cell_areas = icon_data.cell_areas # Physical areas >>> >>> # Track events on unstructured grid >>> unstructured_tracker = marEx.tracker( ... icon_extremes, ... icon_mask, ... R_fill=5, # 5-neighbor radius for gap filling ... area_filter_quartile=0.6, # Remove 60% of smallest events ... unstructured_grid=True, # Enable unstructured mode ... dimensions={"x": "ncells"}, # Must specify the name of the spatial dimension ... coordinates={"x": "lon", "y": "lat"}, # Spatial coordinate names ... neighbours=neighbours, # Required for unstructured ... cell_areas=cell_areas # Required for area calculations ... ) >>> unstructured_events = unstructured_tracker.run()
Memory management and checkpointing for large datasets:
>>> # Use checkpointing for very large datasets >>> large_tracker = marEx.tracker( ... extreme_events, ... mask, ... R_fill=8, ... area_filter_quartile=0.5, ... temp_dir='/scratch/user/tracking_temp', # Temporary storage ... checkpoint='save' # Save intermediate results ... ) >>> # Processing can be resumed if interrupted >>> large_events = large_tracker.run()
Comparing different filtering strategies:
>>> # Conservative filtering - keep more events >>> conservative = marEx.tracker( ... extreme_events, mask, R_fill=5, area_filter_quartile=0.1 ... ) >>> conservative_events = conservative.run() >>> >>> # Aggressive filtering - focus on largest events >>> aggressive = marEx.tracker( ... extreme_events, mask, R_fill=15, area_filter_quartile=0.8 ... ) >>> aggressive_events = aggressive.run() >>> >>> print(f"Conservative: {conservative_events.ID.max().compute()} events") >>> print(f"Aggressive: {aggressive_events.ID.max().compute()} events")
Using absolute area filtering instead of percentile-based:
>>> # Filter objects smaller than 25 grid cells >>> absolute_tracker = marEx.tracker( ... extreme_events, mask, R_fill=8, area_filter_absolute=25 ... ) >>> absolute_events = absolute_tracker.run() >>> >>> # Default behavior (area_filter_quartile=0.5) when no parameters provided >>> default_tracker = marEx.tracker(extreme_events, mask, R_fill=8) >>> default_events = default_tracker.run() # Uses quartile=0.5 filtering
Using physical cell areas for structured grids:
>>> # Load data with irregular grid cell areas >>> grid_areas = xr.open_dataset('grid_areas.nc').cell_area # (lat, lon) in m² >>> >>> # Track events using physical areas instead of cell counts >>> physical_tracker = marEx.tracker( ... extreme_events, ... mask, ... R_fill=8, ... area_filter_quartile=0.5, ... cell_areas=grid_areas # Physical areas in m² ... ) >>> events = physical_tracker.run() >>> # Now events.area contains physical areas in m² instead of cell counts
Integration with full marEx workflow:
>>> # Complete workflow from raw data to tracked events >>> raw_sst = xr.open_dataset('sst_data.nc', chunks={}).sst.chunk({'time': 30}) >>> >>> # Step 1: Preprocess to identify extremes >>> processed = marEx.preprocess_data(raw_sst, threshold_percentile=95) >>> >>> # Step 2: Track extreme events >>> tracker = marEx.tracker( ... processed.extreme_events, ... processed.mask, ... R_fill=8, ... area_filter_quartile=0.5 ... ) >>> tracked_events = tracker.run()
Initialise the tracker with parameters and data.
- __init__(data_bin, mask, R_fill, area_filter_quartile=None, area_filter_absolute=None, temp_dir=None, T_fill=2, allow_merging=True, nn_partitioning=False, overlap_threshold=0.5, unstructured_grid=False, dimensions=None, coordinates=None, neighbours=None, cell_areas=None, grid_resolution=None, max_iteration=40, checkpoint=None, debug=0, verbose=None, quiet=None, regional_mode=False, coordinate_units=None)[source]
Initialise the tracker with parameters and data.
- Parameters:
data_bin (DataArray)
mask (DataArray)
area_filter_quartile (float | None)
area_filter_absolute (int | None)
temp_dir (str | None)
T_fill (int)
allow_merging (bool)
nn_partitioning (bool)
overlap_threshold (float)
unstructured_grid (bool)
neighbours (DataArray | None)
cell_areas (DataArray | None)
grid_resolution (float | None)
max_iteration (int)
checkpoint (Literal['save', 'load', 'None'] | None)
debug (int)
verbose (bool | None)
quiet (bool | None)
regional_mode (bool)
coordinate_units (Literal['degrees', 'radians'] | None)
- Return type:
None
- run(return_merges=False, checkpoint=None)[source]
Run the complete object identification and tracking pipeline.
This method executes the full workflow:
Preprocessing: morphological operations and size filtering
Identification and tracking of objects through time
Computing and attaching statistics to the results
- Parameters:
- Returns:
events_ds (xarray.Dataset) – Dataset containing tracked events and their properties
merges_ds (xarray.Dataset, optional) – Dataset with merge event information (only if return_merges=True)
- Return type:
- run_preprocess(checkpoint=None)[source]
Preprocess binary data to prepare for tracking.
This performs morphological operations to fill holes/gaps in both space and time, then filters small objects according to the area_filter_quartile or area_filter_absolute.
- run_tracking(data_bin_preprocessed)[source]
Track objects through time to identify events.
- Parameters:
data_bin_preprocessed (xarray.DataArray) – Preprocessed binary data
- Returns:
events_ds (xarray.Dataset) – Dataset containing tracked events
merges_ds (xarray.Dataset) – Dataset with merge information
N_events_final (int) – Final number of unique events
- Return type:
- run_stats_attributes(events_ds, merges_ds, object_stats, N_events_final)[source]
Add statistics and attributes to the events dataset.
- Parameters:
events_ds (xarray.Dataset) – Dataset containing tracked events
merges_ds (xarray.Dataset) – Dataset with merge information
object_stats (tuple) – Preprocessed object statistics
N_events_final (int) – Final number of events
- Returns:
events_ds – Dataset with added statistics and attributes
- Return type:
- compute_area(data_bin)[source]
Compute the total area of binary data at each time.
- Parameters:
data_bin (xarray.DataArray) – Binary data
- Returns:
area – Total area at each time (units: pixels for structured grid, matching cell_area for unstructured)
- Return type:
- fill_holes(data_bin, R_fill=None)[source]
Fill holes and gaps using morphological operations.
This performs closing (dilation followed by erosion) to fill small gaps, then opening (erosion followed by dilation) to remove small isolated objects.
- Parameters:
data_bin (xarray.DataArray) – Binary data to process
R_fill (int, optional) – Fill radius override
- Returns:
data_bin_filled – Binary data with holes/gaps filled
- Return type:
- fill_time_gaps(data_bin)[source]
Fill temporal gaps between objects.
Performs binary closing (dilation then erosion) along the time dimension to fill small time gaps between objects.
- Parameters:
data_bin (xarray.DataArray) – Binary data to process
- Returns:
data_bin_filled – Binary data with temporal gaps filled
- Return type:
- refresh_dask_graph(data_bin)[source]
Clear and reset the Dask graph via save/load cycle.
This is needed to work around a memory leak bug in Dask where “Unmanaged Memory” builds up within loops.
- Parameters:
data_bin (xarray.DataArray) – Data to refresh
- Returns:
data_new – Data with fresh Dask graph
- Return type:
- filter_small_objects(data_bin)[source]
Remove objects smaller than a threshold area.
- Parameters:
data_bin (xarray.DataArray) – Binary data to filter
- Returns:
data_bin_filtered (xarray.DataArray) – Binary data with small objects removed
area_threshold (float) – Area threshold used for filtering
object_areas (xarray.DataArray) – Areas of all objects pre-filtering
N_objects_prefiltered (int) – Number of objects before filtering
N_objects_filtered (int) – Number of objects after filtering
- Return type:
- identify_objects(data_bin, time_connectivity)[source]
Identify connected regions in binary data.
- Parameters:
data_bin (xarray.DataArray) – Binary data to identify objects in
time_connectivity (bool) – Whether to connect objects across time
- Returns:
object_id_field (xarray.DataArray) – Field of integer IDs for each object
None (NoneType) – Placeholder for compatibility with track_objects
N_objects (int) – Number of objects identified
- Return type:
- calculate_centroid(binary_mask, original_centroid=None)[source]
Calculate object centroid, handling edge cases for periodic boundaries.
- Parameters:
binary_mask (numpy.ndarray) – 2D binary array where True indicates the object (dimensions are (y,x))
original_centroid (tuple, optional) – (y_centroid, x_centroid) from regionprops_table
- Returns:
(y_centroid, x_centroid)
- Return type:
- calculate_object_properties(object_id_field, properties=None)[source]
Calculate properties of objects from ID field.
- Parameters:
object_id_field (xarray.DataArray) – Field containing object IDs
properties (list, optional) – List of properties to calculate (defaults to [‘label’, ‘area’])
- Returns:
object_props – Dataset containing calculated properties with ‘ID’ dimension
- Return type:
- check_overlap_slice(ids_t0, ids_next)[source]
Find overlapping objects between two consecutive time slices.
- Parameters:
ids_t0 (numpy.ndarray) – Object IDs at current time
ids_next (numpy.ndarray) – Object IDs at next time
- Returns:
Array of shape (n_overlaps, 3) with [id_t0, id_next, overlap_area]
- Return type:
- find_overlapping_objects(object_id_field)[source]
Find all overlapping objects across time.
- Parameters:
object_id_field (xarray.DataArray) – Field containing object IDs
- Returns:
overlap_objects_list_unique_filtered – Array of object ID pairs that overlap across time, with overlap area The object in the first column precedes the second column in time. The third column contains:
For structured grid: number of overlapping pixels (int32)
For unstructured grid: total overlapping area in m^2 (float32)
- Return type:
(N x 3) numpy.ndarray
- enforce_overlap_threshold(overlap_objects_list, object_props)[source]
Filter object pairs based on overlap threshold.
- Parameters:
overlap_objects_list ((N x 3) numpy.ndarray) – Array of object ID pairs with overlap area
object_props (xarray.Dataset) – Object properties including area
- Returns:
overlap_objects_list_filtered – Filtered array of object ID pairs that meet the overlap threshold
- Return type:
(M x 3) numpy.ndarray
- consolidate_object_ids(data_t_minus_2, data_t_minus_1, object_props, timestep)[source]
Consolidate object IDs between t-2 and t-1 to ensure consistent tracking.
This identifies objects at t-1 that are actually continuations of objects from t-2 (but got different IDs due to partitioning) and renames them to maintain consistent IDs across timesteps.
- Parameters:
data_t_minus_2 (xr.DataArray) – Object field at timestep t-2
data_t_minus_1 (xr.DataArray) – Object field at timestep t-1 (will be modified)
object_props (xr.Dataset) – Object properties dataset (will be modified)
timestep (int) – Current timestep number for logging purposes
- Returns:
data_t_minus_1_consolidated (xr.DataArray) – Updated t-1 field with consolidated IDs
object_props_updated (xr.Dataset) – Updated object properties with merged/deleted objects
- Return type:
Notes
Uses self.overlap_threshold for determining consolidation eligibility
Updates object properties by recalculating for consolidated objects
Removes redundant child objects from object_props
- compute_id_time_dict(da, child_objects, max_objects, all_objects=True)[source]
Generate lookup table mapping object IDs to their time index.
- Parameters:
da (xarray.DataArray) – Field of object IDs
child_objects (list or array) – Object IDs to include in the dictionary
max_objects (int) – Maximum number of objects
all_objects (bool, default=True) – Whether to process all objects or just child_objects
- Returns:
time_index_map – Dictionary mapping object IDs to time indices
- Return type:
- track_objects(data_bin)[source]
Track objects through time to form events.
This is the main tracking method that handles splitting and merging of objects.
- Parameters:
data_bin (xarray.DataArray) – Preprocessed binary data: Field of globally unique integer IDs of each element in connected regions. ID = 0 indicates no object.
- Returns:
split_merged_events_ds (xarray.Dataset) – Dataset containing tracked events
merge_events (xarray.Dataset) – Dataset with merge information
N_events (int) – Final number of events
- Return type:
- cluster_rename_objects_and_props(object_id_field_unique, object_props, overlap_objects_list, merge_events)[source]
Cluster the object pairs and relabel to determine final event IDs.
- Parameters:
object_id_field_unique (xarray.DataArray) – Field of unique object IDs. IDs must not be repeated across time.
object_props (xarray.Dataset) – Properties of each object that also need to be relabeled.
overlap_objects_list ((N x 2) numpy.ndarray) – Array of object ID pairs that indicate which objects are in the same event. The object in the first column precedes the second column in time.
merge_events (xarray.Dataset) – Information about merge events
- Returns:
split_merged_events_ds – Dataset with relabeled events and their properties. ID = 0 indicates no object.
- Return type:
- split_and_merge_objects(object_id_field_unique, object_props)[source]
Implement object splitting and merging logic.
This identifies and processes cases where objects split or merge over time, creating new object IDs as needed.
- Parameters:
object_id_field_unique (xarray.DataArray) – Field of unique object IDs. IDs are required to be monotonically increasing with time.
object_props (xarray.Dataset) – Properties of each object
- Returns:
(object_id_field, object_props, overlap_objects_list, merge_events)
- Return type:
- split_and_merge_objects_parallel(object_id_field_unique, object_props)[source]
Optimised parallel implementation of object splitting and merging.
This version is specifically designed for unstructured grids with more efficient memory handling and better parallelism than the standard split_and_merge_objects method. It processes data in chunks, handles merging events, and efficiently updates object IDs.
- Parameters:
object_id_field_unique (xarray.DataArray) – Field of unique object IDs
object_props (xarray.Dataset) – Properties of each object
- Returns:
(object_id_field, object_props, overlap_objects_list, merge_events)
- Return type:
- marEx.regional_tracker(data_bin, mask, coordinate_units, R_fill, area_filter_quartile=None, area_filter_absolute=None, **kwargs)[source]
Create a tracker instance configured for regional (non-global) data.
This is a convenience function that automatically sets regional_mode=True and requires explicit specification of coordinate units, since auto-detection may fail for regional coordinate ranges.
- Parameters:
data_bin (xr.DataArray) – Binary data to identify and track objects in (True = object, False = background)
mask (xr.DataArray) – Binary mask indicating valid regions (True = valid, False = invalid)
coordinate_units ({'degrees', 'radians'}) – Units of the coordinate system. Must be specified for regional data.
R_fill (int or float) – Radius for filling holes/gaps in spatial domain (in grid cells)
area_filter_quartile (float, optional) – Quantile (0-1) for filtering smallest objects (e.g., 0.25 removes smallest 25%). Mutually exclusive with area_filter_absolute. Default is 0.5 if neither parameter is provided.
area_filter_absolute (int, optional) – The minimum area (in grid cells) for an object to be retained. Mutually exclusive with area_filter_quartile.
**kwargs – Additional parameters passed to the tracker class
- Returns:
Configured tracker instance with regional_mode=True
- Return type:
Examples
Track events in regional Mediterranean Sea data:
>>> import marEx >>> # For regional data with degree coordinates >>> regional_tracker = marEx.regional_tracker( ... extreme_events, ... mask, ... coordinate_units='degrees', ... R_fill=5, ... area_filter_quartile=0.3 ... ) >>> events = regional_tracker.run()
Track events in regional data with radian coordinates:
>>> # For model output with radian coordinates >>> regional_tracker = marEx.regional_tracker( ... extreme_events, ... mask, ... coordinate_units='radians', ... R_fill=8, ... area_filter_quartile=0.5 ... ) >>> events = regional_tracker.run()
Using absolute area filtering in regional mode:
>>> # Keep only features larger than 15 grid cells >>> absolute_regional = marEx.regional_tracker( ... extreme_events, ... mask, ... coordinate_units='degrees', ... R_fill=5, ... area_filter_absolute=15 ... ) >>> events = absolute_regional.run()
- marEx.specify_grid(grid_type=None, fpath_tgrid=None, fpath_ckdtree=None)[source]
Set the global grid specification that will be used by all plotters.
- Parameters:
- Return type:
None
- class marEx.PlotConfig(title=None, var_units='', issym=False, cmap=None, cperc=None, clim=None, show_colorbar=True, grid_lines=True, grid_labels=False, dimensions=None, coordinates=None, norm=None, plot_IDs=False, extend='both', verbose=None, quiet=None, projection=None, framerate=10)[source]
Bases:
objectConfiguration class for plot parameters
- Parameters:
- cmap
Colormap name or ListedColormap object
- Type:
str | matplotlib.colors.ListedColormap | None
- norm
Custom normalization (BoundaryNorm or Normalize)
- Type:
matplotlib.colors.BoundaryNorm | matplotlib.colors.Normalize | None
- projection
Cartopy projection for map plots
- Type:
Any | None
- cmap: str | ListedColormap | None = None
- norm: BoundaryNorm | Normalize | None = None
- __init__(title=None, var_units='', issym=False, cmap=None, cperc=None, clim=None, show_colorbar=True, grid_lines=True, grid_labels=False, dimensions=None, coordinates=None, norm=None, plot_IDs=False, extend='both', verbose=None, quiet=None, projection=None, framerate=10)
- Parameters:
- Return type:
None
- exception marEx.MarExError(message, details=None, suggestions=None, error_code=None, context=None)[source]
Bases:
ExceptionBase exception class for all MarEx-specific errors.
This is the root of the MarEx exception hierarchy and provides common functionality for all marEx exceptions including:
Structured error context
Exception chaining support
Consistent error formatting
- Parameters:
message (str) – Primary error message describing what went wrong
details (str, optional) – Additional technical details about the error
suggestions (list of str, optional) – Actionable suggestions for resolving the error
error_code (str, optional) – Structured error code for programmatic handling
context (dict, optional) – Additional context information (e.g., parameter values, data shapes)
Initialise the Error.
- __init__(message, details=None, suggestions=None, error_code=None, context=None)[source]
Initialise the Error.
- exception marEx.DataValidationError(message, details=None, suggestions=None, error_code='DATA_VALIDATION', context=None)[source]
Bases:
MarExErrorRaise exception for input data validation issues.
This exception covers problems with input data structure, format, content, or compatibility with marEx processing requirements.
Common scenarios:
Non-Dask arrays when Dask is required
Missing required coordinates or dimensions
Invalid data types or ranges
Incompatible chunking strategies
Malformed input datasets
Examples
>>> raise DataValidationError( ... "Input DataArray must be Dask-backed", ... details="Found numpy array, but marEx requires chunked Dask arrays", ... suggestions=["Use da.chunk() to convert to Dask array", "Load data with dask chunking: xr.open_dataset(...).chunk()"], ... context={"data_type": type(data), "shape": data.shape} ... )
Initialise the Error.
- Parameters:
- exception marEx.CoordinateError(message, details=None, suggestions=None, error_code='COORDINATE_ERROR', context=None)[source]
Bases:
MarExErrorRaise exception for coordinate system problems.
This exception handles issues with geographic coordinates including unit mismatches, invalid ranges, missing coordinate information, and coordinate system inconsistencies.
Common scenarios:
Latitude/longitude values outside valid ranges
Unit mismatches (degrees vs radians)
Missing coordinate dimensions
Inconsistent coordinate systems between datasets
Auto-detection failures
Initialise the Error.
- Parameters:
- exception marEx.ProcessingError(message, details=None, suggestions=None, error_code='PROCESSING_ERROR', context=None)[source]
Bases:
MarExErrorRaise exception for computational and algorithmic issues.
This exception covers problems that occur during data processing, including numerical computation errors, algorithm convergence issues, and memory/performance problems.
Common scenarios:
Insufficient memory for computation
Numerical instability or overflow
Algorithm convergence failures
Chunking strategy problems
Dask computation errors
Initialise the Error.
- Parameters:
- exception marEx.ConfigurationError(message, details=None, suggestions=None, error_code='CONFIGURATION_ERROR', context=None)[source]
Bases:
MarExErrorRaise exception for parameter and setup issues.
This exception handles problems with function parameters, configuration settings, and setup requirements that prevent proper operation.
Common scenarios:
Invalid parameter values or combinations
Missing required configuration
Incompatible parameter settings
Environment setup issues
Examples
>>> raise ConfigurationError( ... "Invalid threshold percentile value", ... details="threshold_percentile must be between 0 and 100", ... suggestions=["Use percentile value between 50-99 for extreme events", "Common values: 90 (moderate), 95 (strong), 99 (severe)"], ... context={"provided_value": 150, "valid_range": [0, 100]} ... )
Initialise the Error.
- Parameters:
- exception marEx.DependencyError(message, details=None, suggestions=None, error_code='DEPENDENCY_ERROR', context=None)[source]
Bases:
MarExErrorRaise exception for missing or incompatible dependencies.
This exception handles issues with optional or required dependencies that are missing, incompatible, or incorrectly configured.
Common scenarios:
Missing optional dependencies (JAX, ffmpeg)
Version incompatibilities
Import failures
System dependency issues
Examples
>>> raise DependencyError( ... "JAX acceleration not available", ... details="JAX package not found or incompatible version", ... suggestions=["Install JAX: pip install marEx[full]", "Check CUDA compatibility for GPU acceleration", "Processing will continue with NumPy backend"], ... context={"requested_feature": "GPU acceleration", "available": False} ... )
Initialise the Error.
- Parameters:
- exception marEx.TrackingError(message, details=None, suggestions=None, error_code='TRACKING_ERROR', context=None)[source]
Bases:
MarExErrorRaise exception for object tracking and identification issues.
This exception covers problems specific to the tracking module including binary object identification, temporal linking, and merge/split handling.
Common scenarios:
Invalid binary input data
Tracking parameter conflicts
Temporal continuity issues
Memory overflow during tracking
Checkpoint/resume failures
Examples
>>> raise TrackingError( ... "Tracking failed due to excessive memory usage", ... details="Event fragmentation created >100,000 objects per timestep", ... suggestions=["Increase area_filter_quartile to remove small events", "Apply stronger spatial smoothing before tracking", "Consider processing shorter time periods"], ... context={"objects_per_timestep": 150000, "memory_limit_gb": 32} ... )
Initialise the Error.
- Parameters:
- exception marEx.VisualisationError(message, details=None, suggestions=None, error_code='VISUALISATION_ERROR', context=None)[source]
Bases:
MarExErrorRaise exception for plotting and visualisation problems.
This exception handles issues with the plotX visualisation system including matplotlib configuration, cartopy projections, and animation generation.
Common scenarios:
Missing plotting dependencies
Cartopy projection issues
Invalid plot configuration
Animation encoding failures
Grid type detection problems
Examples
>>> raise VisualisationError( ... "Animation creation failed", ... details="ffmpeg encoder not found for MP4 generation", ... suggestions=["Install ffmpeg system package", "Use alternative format: save_format='gif'", "Install ffmpeg via conda: conda install ffmpeg"], ... context={"requested_format": "mp4", "available_encoders": ["png", "gif"]} ... )
Initialise the Error.
- Parameters:
- marEx.create_data_validation_error(message, data_info=None, **kwargs)[source]
Create DataValidationError with common data context.
- Parameters:
- Returns:
Configured exception with data context
- Return type:
- marEx.create_coordinate_error(message, coordinate_ranges=None, detected_system=None, **kwargs)[source]
Create CoordinateError with coordinate context.
- Parameters:
- Returns:
Configured exception with coordinate context
- Return type:
- marEx.create_processing_error(message, computation_info=None, **kwargs)[source]
Create ProcessingError with computation context.
- Parameters:
- Returns:
Configured exception with computation context
- Return type:
- marEx.wrap_exception(original_exception, message=None, marex_exception_type=None)[source]
Wrap a generic exception in an appropriate MarEx exception.
This function helps maintain backward compatibility while migrating to the new exception hierarchy by wrapping generic exceptions in appropriate MarEx-specific exceptions.
- Parameters:
- Returns:
Wrapped exception with original as cause
- Return type:
- marEx.print_dependency_status()[source]
Print status of all tracked dependencies.
- Return type:
None
- marEx.configure_logging(level=None, format_str=None, date_format=None, log_file=None, console_output=True, disable_external_loggers=True, verbose=None, quiet=None)[source]
Configure logging for the MarEx package.
- Parameters:
level (int | str | None) – Logging level (DEBUG, INFO, WARNING, ERROR, CRITICAL)
format_str (str | None) – Custom log format string
date_format (str | None) – Date format for timestamps
log_file (str | Path | None) – Optional file path for logging output
console_output (bool) – Whether to log to console
disable_external_loggers (bool) – Whether to suppress noisy external library logs
verbose (bool | None) – Enable verbose mode (detailed logging with DEBUG level)
quiet (bool | None) – Enable quiet mode (minimal logging with WARNING level)
- Return type:
None
- marEx.set_verbose_mode(verbose=True)[source]
Enable or disable verbose logging mode.
- Parameters:
verbose (bool) – If True, enables verbose mode with DEBUG level logging
- Return type:
None
- marEx.set_quiet_mode(quiet=True)[source]
Enable or disable quiet logging mode.
- Parameters:
quiet (bool) – If True, enables quiet mode with WARNING level logging
- Return type:
None
- marEx.get_verbosity_level()[source]
Get the current verbosity level.
- Returns:
‘quiet’, ‘normal’, or ‘verbose’
- Return type:
Current verbosity level
Core Functions
The main entry points for using marEx are:
|
Complete preprocessing pipeline for marine extreme event identification. |
|
Tracker identifies and tracks arbitrary binary objects in spatial data through time. |
|
Create a tracker instance configured for regional (non-global) data. |
|
Set the global grid specification that will be used by all plotters. |
Data Preprocessing
Functions for data preprocessing, anomaly detection, and extreme event identification:
|
Generate normalised anomalies using specified methodology. |
|
Compute a smoothed rolling climatology using the previous window_year_baseline years of data and reassemble it to match the original data structure. |
|
Compute rolling climatology efficiently using flox cohorts. |
|
Identify extreme events exceeding a percentile threshold using specified method. |
Visualisation
Plotting configuration and utilities:
|
Configuration class for plot parameters |
|
Set the global grid specification that will be used by all plotters. |
Exception Hierarchy
MarEx provides a structured exception hierarchy for precise error handling:
|
Base exception class for all MarEx-specific errors. |
|
Raise exception for input data validation issues. |
|
Raise exception for coordinate system problems. |
|
Raise exception for computational and algorithmic issues. |
|
Raise exception for parameter and setup issues. |
|
Raise exception for missing or incompatible dependencies. |
|
Raise exception for object tracking and identification issues. |
|
Raise exception for plotting and visualisation problems. |
|
Create DataValidationError with common data context. |
|
Create CoordinateError with coordinate context. |
|
Create ProcessingError with computation context. |
|
Wrap a generic exception in an appropriate MarEx exception. |
Logging System
Configurable logging system for development and debugging:
|
Configure logging for the MarEx package. |
|
Enable or disable verbose logging mode. |
|
Enable or disable quiet logging mode. |
Reset logging to normal mode (INFO level). |
|
Get the current verbosity level. |
|
Check if verbose mode is currently enabled. |
|
Check if quiet mode is currently enabled. |
|
|
Get a configured logger instance for MarEx. |
Dependency Management
Utilities for managing optional dependencies:
|
Check if a dependency is available. |
Print status of all tracked dependencies. |
|
Get the current installation profile. |
Modules
Detection Module (marEx.detect)
MarEx-Detect: Marine Extremes Detection Module
Preprocessing toolkit for marine extremes identification from scalar oceanographic data. Converts raw time series into standardised anomalies and identifies extreme events (e.g., Marine Heatwaves using Sea Surface Temperature).
Core capabilities:
Two preprocessing methodologies: Detrended Baseline and Shifting Baseline
Two definitions for extreme events: Global Extreme and Hobday Extreme
Threshold-based extreme event identification
Efficient processing of both structured (gridded) and unstructured data
Compatible data formats:
Structured data: 3D arrays (time, lat, lon)
Unstructured data: 2D arrays (time, cell)
- marEx.detect.preprocess_data(da, method_anomaly='shifting_baseline', method_extreme='hobday_extreme', threshold_percentile=95, window_year_baseline=15, smooth_days_baseline=21, window_days_hobday=11, window_spatial_hobday=None, std_normalise=False, detrend_orders=None, force_zero_mean=True, reference_period=None, method_percentile='approximate', precision=0.01, max_anomaly=5.0, dask_chunks=None, dimensions=None, coordinates=None, neighbours=None, cell_areas=None, use_temp_checkpoints=False, verbose=None, quiet=None)[source]
Complete preprocessing pipeline for marine extreme event identification.
Supports separate methods for anomaly computation and extreme identification:
Anomaly Methods:
‘detrend_harmonic’: Detrending with harmonics and polynomials – more efficient, but biases statistics
‘shifting_baseline’: Rolling climatology using previous window_year_baseline years – more “correct”, but shortens time series by window_year_baseline years
‘fixed_baseline’: Daily climatology using full time series – does _not_ remove climate trends !
‘detrend_fixed_baseline’: Polynomial detrending followed by fixed daily climatology – keeps full time-series of data, but does not account for trends in the timing of seasonal transitions (which may appear as extremes)
Extreme Methods:
‘global_extreme’: Global-in-time threshold value
‘hobday_extreme’: Local day-of-year specific thresholds with windowing
- Parameters:
da (xarray.DataArray) – Raw input data
method_anomaly (str, default='shifting_baseline') – Anomaly computation method (‘detrend_harmonic’, ‘shifting_baseline’, ‘fixed_baseline’, or ‘detrend_fixed_baseline’).
method_extreme (str, default='hobday_extreme') – Extreme identification method (‘global_extreme’ or ‘hobday_extreme’).
threshold_percentile (float, default=95) – Percentile threshold for extreme event detection.
window_year_baseline (int, default=15) – Number of previous years for rolling climatology (shifting_baseline method only).
smooth_days_baseline (int, default=21) – Days for smoothing rolling climatology (shifting_baseline method only).
window_days_hobday (int, default=11) – Window size for day-of-year threshold calculation (hobday_extreme method only).
window_spatial_hobday (int, default=None) – Spatial window size (2D centred window) for the day-of-year threshold calculation (hobday_extreme method only).
std_normalise (bool, default=False) – Whether to standardise anomalies by rolling standard deviation (detrend_harmonic only).
detrend_orders (list, default=[1]) – Polynomial orders for detrending (detrend_harmonic method only). Default is 1st order (linear) detrend. [1,2] e.g. would use a linear+quadratic detrending.
force_zero_mean (bool, default=True) – Whether to enforce zero mean in detrended anomalies (detrend_harmonic method only).
reference_period (tuple of (int, int), optional) – Year range (start_year, end_year) inclusive for computing the daily climatology (fixed_baseline and detrend_fixed_baseline only). If None (default), uses all available years. Anomalies are computed for the full time series regardless. Example: reference_period=(1990, 2020) computes the climatology from 1990-2020 but outputs anomalies for the entire input time range.
method_percentile (str, default='approximate') – Method for percentile calculation (‘exact’ or ‘approximate’) for both global_extreme & hobday_extreme methods. N.B.: Using the exact percentile calculation requires both careful/thoughtful chunking & sufficient memory, in consideration of the limitations inherent to distributed parallel I/O & processing.
precision (float, default=0.01) – Precision for histogram bins in approximate percentile method.
max_anomaly (float, default=5.0) – Maximum anomaly value for histogram binning in the approximate percentile method.
dask_chunks (dict, optional) – Chunking specification for distributed computation.
dimensions (dict, default={"time": "time", "x": "lon", "y": "lat"}) – Mapping of dimensions to names in the data.
coordinates (dict, optional) – Mapping of coordinates to names in the data. Defaults to dimensions mapping.
neighbours (xarray.DataArray, optional) – Neighbour connectivity for spatial clustering.
cell_areas (xarray.DataArray, optional) – Cell areas for weighted spatial statistics.
use_temp_checkpoints (bool, default=False) – Enable checkpointing to temporary zarr stores to break Dask graph dependencies. When True, intermediate results (anomalies, thresholds, extremes) are saved to temporary zarr files and immediately reloaded, preventing expensive recomputations. Recommended for large datasets on HPC systems where the 2D histogram computation is expensive. Temporary files are automatically cleaned up after reloading.
verbose (bool, default=None) – Enable verbose logging with detailed progress information. If None, uses current global logging configuration.
quiet (bool, default=None) – Enable quiet logging with minimal output (warnings and errors only). If None, uses current global logging configuration. Note: quiet takes precedence over verbose if both are True.
- Returns:
Processed dataset with anomalies and extreme event identification
- Return type:
Examples
Basic usage with gridded SST data for marine heatwave detection:
>>> import xarray as xr >>> import marEx >>> >>> # Load and chunk SST data >>> sst = xr.open_dataset('sst_data.nc', chunks={}).sst.chunk({'time': 30}) >>> >>> # Basic preprocessing with default shifting baseline method >>> result = marEx.preprocess_data(sst, threshold_percentile=90) >>> print(result) <xarray.Dataset> Dimensions: (time: 1461, lat: 180, lon: 360) Data variables: dat_anomaly (time, lat, lon) float32 dask.array<chunksize=(30, 180, 360)> mask (lat, lon) bool dask.array<chunksize=(180, 360)> extreme_events (time, lat, lon) bool dask.array<chunksize=(30, 180, 360)> thresholds (lat, lon) float32 dask.array<chunksize=(180, 360)>
>>> # Check which locations have extreme events >>> print(f"Total extreme events: {result.extreme_events.sum().compute()}") Total extreme events: 15847
Using shifting baseline method for more accurate climatology:
>>> # Requires at least 15 years of data by default >>> result_shifting = marEx.preprocess_data( ... sst, ... method_anomaly="shifting_baseline", ... window_year_baseline=10, # Use shorter window if needed ... smooth_days_baseline=31 # Longer smoothing window ... ) >>> # Note: First 10 years will be removed from output
Using Hobday extreme method with day-of-year specific thresholds:
>>> result_hobday = marEx.preprocess_data( ... sst, ... method_extreme="hobday_extreme", ... window_days_hobday=11, # 11-day window for each day-of-year ... threshold_percentile=95 ... ) >>> print(result_hobday.thresholds.dims) ('dayofyear', 'lat', 'lon')
Previous configuration (marEx v2.0 default) with polynomial detrending and standardisation:
>>> result_advanced = marEx.preprocess_data( ... sst, ... method_anomaly="detrend_harmonic", ... detrend_orders=[1, 2], # Linear and quadratic trends ... std_normalise=True, # Standardise by rolling std ... force_zero_mean=True, ... threshold_percentile=95 ... ) >>> # Result includes both raw and standardised anomalies >>> print('dat_stn' in result_advanced) True
Processing unstructured data:
>>> # For ICON ocean model data >>> icon_sst = xr.open_dataset('icon_sst.nc', chunks={}).to.chunk({'time': 50}) >>> result_unstructured = marEx.preprocess_data( ... icon_sst, ... dimensions={"x": "ncells"}, # Must specify the name of the spatial dimension ... dask_chunks={"time": 50} ... )
Error handling - insufficient data for shifting baseline:
>>> short_data = sst.isel(time=slice(0, 1000)) # Only ~3 years >>> try: ... result = marEx.preprocess_data( ... short_data, ... method_anomaly="shifting_baseline", ... window_year_baseline=15 ... ) ... except ValueError as e: ... print(f"Error: {e}") Error: Insufficient data for shifting_baseline method. Dataset spans 3 years but window_year_baseline requires at least 15 years.
Performance considerations with chunking:
>>> # For large datasets, adjust chunking for memory management >>> large_sst = sst.chunk({"time": 25, "lat": 90, "lon": 180}) >>> result = marEx.preprocess_data( ... large_sst, ... dask_chunks={"time": 25}, ... method_percentile="approximate" # Use approximate method (Default) for long time-series calculations ... )
Integration with tracking workflow:
>>> # Preprocess data then track events >>> processed = marEx.preprocess_data(sst, threshold_percentile=95) >>> tracker = marEx.tracker( ... processed.extreme_events, ... processed.mask, ... R_fill=8, ... area_filter_quartile=0.5 ... ) >>> events = tracker.run() >>> print(f"Identified {events.event.max().compute()} distinct events")
Simple fixed baseline approach:
>>> # Basic daily climatology across all years >>> result_fixed = marEx.preprocess_data( ... sst, ... method_anomaly="fixed_baseline", ... threshold_percentile=95 ... ) >>> # Uses all available data for climatology computation
Combined trend removal and fixed climatology:
>>> # Remove long-term trends then compute daily climatology >>> result_combined = marEx.preprocess_data( ... sst, ... method_anomaly="detrend_fixed_baseline", ... detrend_orders=[1], # Linear trend ... threshold_percentile=95, ... force_zero_mean=True ... ) >>> # Balances trend removal with simple climatology
- marEx.detect.compute_normalised_anomaly(da, method_anomaly='shifting_baseline', dimensions=None, coordinates=None, window_year_baseline=15, smooth_days_baseline=21, std_normalise=False, detrend_orders=None, force_zero_mean=True, reference_period=None, use_temp_checkpoints=False, verbose=None, quiet=None)[source]
Generate normalised anomalies using specified methodology.
- Parameters:
da (xarray.DataArray) – Input data with dimensions matching the ‘dimensions’ parameter
method_anomaly (str, default='shifting_baseline') –
Anomaly computation method. Options: - ‘detrend_harmonic’: Detrending with harmonics and polynomials (efficient, biased) - ‘shifting_baseline’: Rolling climatology (accurate, shortens time series) - ‘fixed_baseline’: Daily climatology using full time series (keeps long-term trends in the anomaly) - ‘detrend_fixed_baseline’: Polynomial detrending + fixed climatology (does not shorten time series,
keeps trends in seasonal timing in the anomaly)
dimensions (dict, optional) – Mapping of conceptual dimensions to actual dimension names in the data
coordinates (dict, optional) – Mapping of conceptual coordinates to actual coordinate names in the data
window_year_baseline (int, default=15) – Number of years for rolling climatology (shifting_baseline only)
smooth_days_baseline (int, default=21) – Days for smoothing rolling climatology (shifting_baseline only)
std_normalise (bool, default=False) – Whether to normalise by 30-day rolling standard deviation (detrend_harmonic only)
detrend_orders (list, default=[1]) – Polynomial orders for trend removal (detrend_harmonic and detrend_fixed_baseline only)
force_zero_mean (bool, default=True) – Explicitly enforce zero mean in final anomalies (detrend_harmonic and detrend_fixed_baseline only)
reference_period (tuple of (int, int), optional) – Year range (start_year, end_year) inclusive for computing the daily climatology (fixed_baseline and detrend_fixed_baseline only). If None (default), uses all available years. Anomalies are computed for the full time series regardless.
use_temp_checkpoints (bool)
verbose (bool | None)
quiet (bool | None)
- Returns:
Dataset containing anomalies, mask, and metadata
- Return type:
Examples
Basic detrended baseline anomaly computation:
>>> import xarray as xr >>> import marEx >>> >>> # Load chunked SST data >>> sst = xr.open_dataset('sst_data.nc', chunks={}).sst.chunk({'time': 30}) >>> >>> # Compute anomalies using shifting baseline (default) >>> result = marEx.compute_normalised_anomaly(sst) >>> print(result.data_vars) Data variables: dat_anomaly (time, lat, lon) float32 dask.array<chunksize=(30, 180, 360)> mask (lat, lon) bool dask.array<chunksize=(180, 360)>
>>> # Check that anomalies have approximately zero mean >>> print(f"Mean anomaly: {result.dat_anomaly.mean().compute():.6f}") Mean anomaly: 0.000023
Previous configuration (marEx v2.0 default) of detrended baseline with higher-order polynomials and standardisation. Note: marEx v3.0+ uses shifting_baseline as the default method:
>>> result_advanced = marEx.compute_normalised_anomaly( ... sst, ... method_anomaly="detrend_harmonic", ... detrend_orders=[1, 2, 3], # Linear, quadratic, cubic trends ... std_normalise=True, # Add standardised anomalies ... force_zero_mean=True ... ) >>> print(result_advanced.data_vars) Data variables: dat_anomaly (time, lat, lon) float32 dask.array<chunksize=(30, 180, 360)> mask (lat, lon) bool dask.array<chunksize=(180, 360)> dat_stn (time, lat, lon) float32 dask.array<chunksize=(30, 180, 360)> STD (dayofyear, lat, lon) float32 dask.array<chunksize=(366, 180, 360)>
>>> # Standardised anomalies have unit variance >>> print(f"STD of standardised anomalies: {result_advanced.dat_stn.std().compute():.3f}")
Accurate shifting baseline method for climate-aware anomalies:
>>> result_shifting = marEx.compute_normalised_anomaly( ... sst, ... method_anomaly="shifting_baseline", ... window_year_baseline=10, # Use 10-year rolling climatology ... smooth_days_baseline=31 # 31-day smoothing window ... ) >>> # Anomalies computed relative to recent past climatology
Processing unstructured data:
>>> # ICON ocean model with ncells dimension >>> icon_data = xr.open_dataset('icon_sst.nc', chunks={}).to.chunk({'time': 25}) >>> result_unstructured = marEx.compute_normalised_anomaly( ... icon_data, ... dimensions={"time": "time", "x": "ncells"} ... coordinates={"time": "time", "x": "lon", "y": "lat"}, ... ) >>> print(result_unstructured.dims) Frozen({'time': 1461, 'ncells': 83886})
Comparison of methods - detrended vs shifting baseline:
>>> # Detrended baseline - faster, slight bias >>> detrended = marEx.compute_normalised_anomaly( ... sst, method_anomaly="detrend_harmonic" ... ) >>> >>> # Shifting baseline - slower, more accurate >>> shifting = marEx.compute_normalised_anomaly( ... sst, method_anomaly="shifting_baseline", ... window_year_baseline=15 ... ) >>> >>> # Compare anomaly magnitudes >>> print(f"Detrended RMS: {detrended.dat_anomaly.std().compute():.3f}") >>> print(f"Shifting RMS: {shifting.dat_anomaly.std().compute():.3f}")
Fixed baseline climatology:
>>> # Use full time series for daily climatology >>> result_fixed = marEx.compute_normalised_anomaly( ... sst, ... method_anomaly="fixed_baseline" ... ) >>> # Climatology computed from all available years
Fixed baseline with a restricted reference period:
>>> # Compute climatology from 1990-2020 only, but output anomalies for all years >>> result_ref = marEx.compute_normalised_anomaly( ... sst, ... method_anomaly="fixed_baseline", ... reference_period=(1990, 2020) ... )
Fixed detrended baseline:
>>> # Remove long-term trends then compute fixed climatology >>> result_fixed_detrended = marEx.compute_normalised_anomaly( ... sst, ... method_anomaly="detrend_fixed_baseline", ... detrend_orders=[1], # Remove linear trend ... force_zero_mean=True ... ) >>> # Combines trend removal with fixed climatology
- marEx.detect.identify_extremes(da, method_extreme='hobday_extreme', threshold_percentile=95, dimensions=None, coordinates=None, window_days_hobday=11, window_spatial_hobday=None, method_percentile='approximate', precision=0.01, max_anomaly=5.0, use_temp_checkpoints=False, verbose=None, quiet=None)[source]
Identify extreme events exceeding a percentile threshold using specified method.
- Parameters:
da (xarray.DataArray) – DataArray containing anomalies
method_extreme (str, default='hobday_extreme') – Method for threshold calculation (‘global_extreme’ or ‘hobday_extreme’)
threshold_percentile (float, default=95) – Percentile threshold (e.g., 95 for 95th percentile)
dimensions (dict, optional) – Mapping of dimensions to names in the data
coordinates (dict, optional) – Mapping of coordinates to names in the data
window_days_hobday (int, default=11) – Window for day-of-year threshold (hobday_extreme only)
window_spatial_hobday (int, default=None) – Window for day-of-year threshold spatial clustering (hobday_extreme only)
method_percentile (str, default='approximate') – Method for percentile computation (‘exact’ or ‘approximate’)
precision (float, default=0.01) – Precision for histogram bins in approximate method
max_anomaly (float, default=5.0) – Maximum anomaly value for histogram binning
use_temp_checkpoints (bool)
verbose (bool | None)
quiet (bool | None)
- Returns:
Tuple of (extremes, thresholds) where extremes is a boolean array identifying extreme events and thresholds contains the threshold values used
- Return type:
Examples
Basic extreme identification with global thresholds:
>>> import xarray as xr >>> import marEx >>> >>> # Load anomaly data (from compute_normalised_anomaly) >>> anomalies = xr.open_dataset('anomalies.nc', chunks={}).dat_anomaly >>> >>> # Identify extreme events using global-in-time 95th percentile >>> extremes, thresholds = marEx.identify_extremes( ... anomalies, ... method_extreme="global_extreme", ... threshold_percentile=95 ... ) >>> print(f"Extreme events shape: {extremes.shape}") Extreme events shape: (1461, 180, 360) >>> print(f"Thresholds shape: {thresholds.shape}") Thresholds shape: (180, 360)
>>> # Count total extreme events >>> total_extremes = extremes.sum().compute() >>> print(f"Total extreme events: {total_extremes}")
Using day-of-year specific thresholds (cf. Hobday et al. 2016 method):
>>> # More sophisticated threshold calculation >>> extremes_hobday, thresholds_hobday = marEx.identify_extremes( ... anomalies, ... method_extreme="hobday_extreme", ... threshold_percentile=95, ... window_days_hobday=11 # 11-day window around each day-of-year ... window_spatial_hobday=3 # 3x3 spatial window for clustering percentile calcuation ... ) >>> print(f"Hobday thresholds shape: {thresholds_hobday.shape}") Hobday thresholds shape: (366, 180, 360)
>>> # Compare seasonal variation in thresholds >>> summer_threshold = thresholds_hobday.sel(dayofyear=200).mean() >>> winter_threshold = thresholds_hobday.sel(dayofyear=50).mean() >>> print(f"Summer vs Winter thresholds: {summer_threshold:.3f} vs {winter_threshold:.3f}")
Comparison of exact vs approximate percentile methods:
>>> # Approximate method (faster, default) >>> extremes_approx, thresh_approx = marEx.identify_extremes( ... anomalies, method_percentile="approximate" ... ) >>> >>> # Exact method (slower & memory intensive) >>> extremes_exact, thresh_exact = marEx.identify_extremes( ... anomalies, method_percentile="exact" ... ) >>> >>> # Compare threshold precision — ~0.005C >>> threshold_diff = (thresh_exact - thresh_approx).std().compute() >>> print(f"Threshold difference (exact vs approx): {threshold_diff:.6f}")
Different percentile thresholds for varying event rarity:
>>> # Conservative threshold - very extreme events only >>> extremes_98, _ = marEx.identify_extremes( ... anomalies, threshold_percentile=98 ... ) >>> >>> # Moderate threshold - more frequent events >>> extremes_90, _ = marEx.identify_extremes( ... anomalies, threshold_percentile=90 ... ) >>> >>> # Compare event frequency >>> print(f"99th percentile events: {extremes_99.sum().compute()}") >>> print(f"90th percentile events: {extremes_90.sum().compute()}")
Processing unstructured data:
>>> # ICON ocean model data >>> icon_anomalies = xr.open_dataset('icon_anomalies.nc', chunks={}).dat_anomaly >>> extremes_unstructured, thresholds_unstructured = marEx.identify_extremes( ... icon_anomalies, ... dimensions={"time": "time", "x": "ncells"}, ... coordinates={"time": "time", "x": "lon", "y": "lat"}, ... threshold_percentile=95 ... ) >>> print(f"Unstructured extremes shape: {extremes_unstructured.shape}")
Advanced Hobday method with custom temporal window:
>>> # Longer temporal window for smoother thresholds >>> extremes_smooth, thresholds_smooth = marEx.identify_extremes( ... anomalies, ... method_extreme="hobday_extreme", ... window_days_hobday=31, # Longer smoothing window ... threshold_percentile=95 ... ) >>> >>> # Compare threshold smoothness >>> std_11day = thresholds_hobday.std(dim='dayofyear').mean().compute() >>> std_31day = thresholds_smooth.std(dim='dayofyear').mean().compute() >>> print(f"Threshold variability: 11-day={std_11day:.3f}, 31-day={std_31day:.3f}")
- marEx.detect.rolling_climatology(da, window_year_baseline=15, dimensions=None, coordinates=None, use_temp_checkpoints=False)[source]
Compute rolling climatology efficiently using flox cohorts. Uses the previous window_year_baseline years of data and reassemble it to match the original data structure. Years without enough previous data will be filled with NaN.
- Parameters:
da (xarray.DataArray) – Input data with time coordinate
window_year_baseline (int, default=15) – Number of years to include in each climatology window
dimensions (dict, optional) – Mapping of dimensions to names in the data
coordinates (dict, optional) – Mapping of coordinates to names in the data
use_temp_checkpoints (bool)
- Returns:
Rolling climatology with same shape as input data
- Return type:
Examples
Basic rolling climatology computation:
>>> import xarray as xr >>> import marEx >>> >>> # Load 20 years of SST data >>> sst = xr.open_dataset('sst_data.nc', chunks={}).sst.chunk({'time': 30}) >>> >>> # Compute 15-year rolling climatology >>> climatology = marEx.rolling_climatology(sst, window_year_baseline=15) >>> print(climatology.shape) (7305, 180, 360) # Same as input >>> >>> # First 15 years will be NaN (insufficient history) >>> print(f"NaN values in first year: {climatology.isel(time=slice(0, 365)).isnull().all().compute()}") True
Shorter window for datasets with limited time span:
>>> # For datasets with only 10 years, use shorter window >>> short_climatology = marEx.rolling_climatology( ... sst, window_year_baseline=5 ... ) >>> # First 5 years will be NaN instead of 15
Processing unstructured data:
>>> # ICON ocean model data >>> icon_sst = xr.open_dataset('icon_sst.nc', chunks={}).to.chunk({'time': 25}) >>> icon_climatology = marEx.rolling_climatology( ... icon_sst, ... dimensions={"time": "time", "x": "ncells"} ... coordinates={"time": "time", "x": "lon", "y": "lat"} ... ) >>> print(icon_climatology.dims) Frozen({'time': 7305, 'ncells': 83886})
Comparing with fixed climatology:
>>> # Fixed climatology (traditional approach) >>> fixed_clim = sst.groupby(sst.time.dt.dayofyear).mean() >>> >>> # Rolling climatology (adaptive approach) >>> rolling_clim = marEx.rolling_climatology(sst) >>> >>> # Rolling climatology adapts to climate change >>> clim_2000 = rolling_clim.sel(time='2000').mean() >>> clim_2020 = rolling_clim.sel(time='2020').mean() >>> print(f"Climate change signal: {(clim_2020 - clim_2000).compute():.3f} °C")
Memory considerations for large datasets:
>>> # Ensure appropriate chunking for memory efficiency >>> large_sst = sst.chunk({'time': 30, 'lat': 45, 'lon': 90}) >>> large_climatology = marEx.rolling_climatology(large_sst) >>> # Output maintains input chunking structure
- marEx.detect.smoothed_rolling_climatology(da, window_year_baseline=15, smooth_days_baseline=21, dimensions=None, coordinates=None, use_temp_checkpoints=False)[source]
Compute a smoothed rolling climatology using the previous window_year_baseline years of data and reassemble it to match the original data structure. Years without enough previous data will be filled with NaN.
- Parameters:
da (xarray.DataArray) – Input data with time coordinate
window_year_baseline (int, default=15) – Number of years to include in each climatology window
smooth_days_baseline (int, default=21) – Number of days for temporal smoothing window
dimensions (dict, optional) – Mapping of dimensions to names in the data
coordinates (dict, optional) – Mapping of coordinates to names in the data
use_temp_checkpoints (bool)
- Returns:
Smoothed rolling climatology with same shape as input data
- Return type:
Examples
Basic smoothed rolling climatology:
>>> import xarray as xr >>> import marEx >>> >>> # Load SST data >>> sst = xr.open_dataset('sst_data.nc', chunks={}).sst.chunk({'time': 30}) >>> >>> # Compute smoothed rolling climatology >>> smooth_clim = marEx.smoothed_rolling_climatology( ... sst, ... window_year_baseline=15, ... smooth_days_baseline=21 ... ) >>> print(smooth_clim.shape) (7305, 180, 360)
Comparing different smoothing windows:
>>> # Short smoothing - more day-to-day variability >>> clim_short = marEx.smoothed_rolling_climatology( ... sst, smooth_days_baseline=7 ... ) >>> >>> # Long smoothing - smoother seasonal cycle >>> clim_long = marEx.smoothed_rolling_climatology( ... sst, smooth_days_baseline=61 ... ) >>> >>> # Compare variability >>> var_short = clim_short.std(dim='time').mean().compute() >>> var_long = clim_long.std(dim='time').mean().compute() >>> print(f"Variability: short={var_short:.3f}, long={var_long:.3f}")
Climatology for anomaly computation:
>>> # Compute smoothed climatology then anomalies >>> climatology = marEx.smoothed_rolling_climatology(sst) >>> anomalies = sst - climatology >>> >>> # Check that anomalies have reasonable properties >>> print(f"Anomaly mean: {anomalies.mean().compute():.6f}") >>> print(f"Anomaly std: {anomalies.std().compute():.3f}")
Unstructured data processing:
>>> # ICON ocean data >>> icon_sst = xr.open_dataset('icon_sst.nc', chunks={}).to.chunk({'time': 25}) >>> icon_smooth_clim = marEx.smoothed_rolling_climatology( ... icon_sst, ... dimensions={"time": "time", "x": "ncells"}, ... coordinates={"time": "time", "x": "lon", "y": "lat"}, ... window_year_baseline=10, ... smooth_days_baseline=31 ... )
Effect of smoothing on seasonal cycle:
>>> # Raw rolling climatology (no temporal smoothing) >>> raw_clim = marEx.rolling_climatology(sst, window_year_baseline=15) >>> >>> # Smoothed rolling climatology >>> smooth_clim = marEx.smoothed_rolling_climatology( ... sst, window_year_baseline=15, smooth_days_baseline=21 ... ) >>> >>> # Compare seasonal cycle smoothness >>> # Extract annual cycle for a point >>> point_raw = raw_clim.isel(lat=90, lon=180).sel(time='2010') >>> point_smooth = smooth_clim.isel(lat=90, lon=180).sel(time='2010') >>> >>> print(f"Raw climatology range: {(point_raw.max() - point_raw.min()).compute():.3f}") >>> print(f"Smooth climatology range: {(point_smooth.max() - point_smooth.min()).compute():.3f}")
Performance considerations:
>>> # Efficient implementation smooths raw data first, then computes climatology >>> # This is more memory-efficient than smoothing the climatology >>> large_sst = sst.chunk({'time': 25, 'lat': 45, 'lon': 90}) >>> efficient_clim = marEx.smoothed_rolling_climatology(large_sst)
- marEx.detect.add_decimal_year(da, dim='time', coord=None)[source]
Add decimal year coordinate to DataArray for trend analysis.
- Parameters:
da (xarray.DataArray) – Input data with datetime coordinate
dim (str, optional) – Name of the time dimension
coord (str, optional) – Name of the time coordinate (if different from dimension name)
- Returns:
Input data with added ‘decimal_year’ coordinate
- Return type:
Functions
|
Complete preprocessing pipeline for marine extreme event identification. |
Generate normalised anomalies using specified methodology. |
|
Compute a smoothed rolling climatology using the previous window_year_baseline years of data and reassemble it to match the original data structure. |
|
|
Compute rolling climatology efficiently using flox cohorts. |
|
Identify extreme events exceeding a percentile threshold using specified method. |
Tracking Module (marEx.track)
MarEx-Track: Marine Extreme Event Identification, Tracking, and Splitting/Merging Module
MarEx identifies and tracks extreme events in oceanographic data across time, supporting both structured (regular grid) and unstructured datasets. It can identify discrete objects at single time points and track them as evolving events through time, seamlessly handling splitting and merging.
This package provides algorithms to:
Identify binary objects in spatial data at each time step
Track these objects across time to form coherent events
Handle merging and splitting of objects over time
Calculate and maintain object/event properties through time
Filter by size criteria to focus on significant events
Key terminology:
Object: A connected region in binary data at a single time point
Event: One or more objects tracked through time and identified as the same entity
- class marEx.track.tracker(data_bin, mask, R_fill, area_filter_quartile=None, area_filter_absolute=None, temp_dir=None, T_fill=2, allow_merging=True, nn_partitioning=False, overlap_threshold=0.5, unstructured_grid=False, dimensions=None, coordinates=None, neighbours=None, cell_areas=None, grid_resolution=None, max_iteration=40, checkpoint=None, debug=0, verbose=None, quiet=None, regional_mode=False, coordinate_units=None)[source]
Bases:
objectTracker identifies and tracks arbitrary binary objects in spatial data through time.
The tracker supports both structured (regular grid) and unstructured data, and seamlessly handles splitting & merging of objects. It identifies connected regions in binary data at each time step, and tracks these as evolving events through time.
Main workflow:
Preprocessing: Fill spatiotemporal holes, filter small objects
Object identification: Label connected components at each time
Tracking: Determine object correspondences across time
Optional splitting & merging: Handle complex event evolution
- Parameters:
data_bin (xarray.DataArray) – Binary field of extreme points to group, label, and track (True = object, False = background) Must represent and underlying dask array.
mask (xarray.DataArray) – Binary mask indicating valid regions (True = valid, False = invalid)
R_fill (int) – The radius of the kernel used in morphological opening & closing, relating to the largest hole/gap that can be filled. In units of grid cells.
area_filter_quartile (float, optional) – The fraction of the smallest objects to discard, i.e. the quantile defining the smallest area object retained. Quantile must be in (0-1) (e.g., 0.25 removes smallest 25%). Mutually exclusive with area_filter_absolute. Default is 0.5 if neither parameter is provided.
area_filter_absolute (int, optional) – The minimum area (in grid cells) for an object to be retained. Mutually exclusive with area_filter_quartile. Use this for fixed minimum area thresholds (e.g., 10 cells minimum).
temp_dir (str, optional) – Path to temporary directory for storing intermediate results
T_fill (int, default=2) – The permissible temporal gap (in days) between objects for tracking continuity to be maintained (must be even)
allow_merging (bool, default=True) – Allow objects to split and merge across time. Apply splitting & merging criteria, track merge events, and maintain original identities of merged objects across time. N.B.: False reverts to classical ndmeasure.label with simplar time connectivity, i.e. Scannell et al.
nn_partitioning (bool, default=False) – Implement a better partitioning of merged child objects based on closest parent cell. False reverts to using parent centroids to determine partitioning between new child objects, i.e. Di Sun & Bohai Zhang 2023. N.B.: Centroid-based partitioning has major problems with small merging objects suddenly obtaining unrealistically-large (and often disjoint) fractions of the larger object.
overlap_threshold (float, default=0.5) – The fraction of the smaller object’s area that must overlap with the larger object’s area to be considered the same event and continue tracking with the same ID.
unstructured_grid (bool, default=False) – Whether data is on an unstructured grid
dimensions (dict, default={"time": "time", "x": "lon", "y": "lat"}) – Mapping of dimensions to names in the data
coordinates (dict, optional) – Coordinate names for unstructured grids. Should contain ‘x’ and ‘y’ keys for x and y coordinates. May also contain ‘time’ if the time coordinate name is different from the dimension name.
neighbours (xarray.DataArray, optional) – For unstructured grid, indicates connectivity between cells
cell_areas (xarray.DataArray, optional) – For unstructured grid, area of each cell (required). For structured grid, area of each cell (optional). If not provided, defaults to 1.0 for each cell (resulting in cell counts as areas). Note: Overridden by grid_resolution if provided for structured grids.
grid_resolution (float, optional) – Grid resolution in degrees for structured grids only (ignored for unstructured grids). When provided, automatically calculates cell areas using spherical geometry. Overrides any provided cell_areas parameter.
max_iteration (int, default=40) – Maximum number of iterations for merging/splitting algorithm
checkpoint (str, default='None') – Checkpoint strategy (‘save’, ‘load’, or None)
debug (int, default=0) – Debug level (0-2)
verbose (bool, optional) – Enable verbose logging with detailed progress information. If None, uses current global logging configuration.
quiet (bool, optional) – Enable quiet logging with minimal output (warnings and errors only). If None, uses current global logging configuration. Note: quiet takes precedence over verbose if both are True.
regional_mode (bool, default=False) – Enable regional mode for non-global coordinate ranges. When True, coordinate_units must be specified.
coordinate_units (str, optional) – Coordinate units when regional_mode=True. Must be either ‘degrees’ or ‘radians’.
Examples
Basic tracking of marine heatwave events from preprocessed data:
>>> import xarray as xr >>> import marEx >>> >>> # Load preprocessed extreme events data >>> processed = xr.open_dataset('extreme_events.nc', chunks={}) >>> extreme_events = processed.extreme_events # Boolean array >>> mask = processed.mask # Ocean/land mask >>> >>> # Initialise tracker with basic parameters >>> tracker = marEx.tracker( ... extreme_events, ... mask, ... R_fill=8, # Fill holes up to 8 grid cells ... area_filter_quartile=0.5 # Remove smallest 50% of objects ... allow_merging=False # Basic tracking without splitting/merging ... ) >>> >>> # Run tracking algorithm >>> events = tracker.run() >>> print(f"Identified {events.ID.max().compute()} distinct events") Identified 1247 distinct events
Using automatic grid area calculation from resolution:
>>> # For regular lat/lon grids, automatically calculate physical areas >>> grid_tracker = marEx.tracker( ... extreme_events, ... mask, ... R_fill=8, ... area_filter_quartile=0.5, ... grid_resolution=0.25 # Grid resolution in degrees ... ) >>> # Cell areas are calculated automatically using spherical geometry >>> grid_events = grid_tracker.run()
Advanced tracking with merging and splitting enabled:
>>> # More sophisticated tracking with temporal gap filling >>> advanced_tracker = marEx.tracker( ... extreme_events, ... mask, ... R_fill=12, # Larger spatial gap filling ... T_fill=4, # Fill up to 4-day temporal gaps ... area_filter_quartile=0.25, # More aggressive size filtering ... allow_merging=True, # Enable split/merge detection ... overlap_threshold=0.3 # Lower threshold for object linking ... ) >>> >>> events_advanced, merges_log = advanced_tracker.run(return_merges=True) >>> print(events_advanced.data_vars) Data variables: event (time, lat, lon) int32 dask.array<chunksize=(25, 180, 360)> event_centroid (time, lat, lon) int32 dask.array<chunksize=(25, 180, 360)> ID_field (time, lat, lon) int32 dask.array<chunksize=(25, 180, 360)> global_ID (time, ID) int32 dask.array<chunksize=(25, 1247)> area (time, ID) float32 dask.array<chunksize=(25, 1247)> centroid (component, time, ID) float64 dask.array<chunksize=(2, 25, 1247)> presence (time, ID) bool dask.array<chunksize=(25, 1247)> time_start (ID) datetime64[ns] dask.array<chunksize=(1247,)> time_end (ID) datetime64[ns] dask.array<chunksize=(1247,)> merge_ledger (time, ID, sibling_ID) int32 dask.array<chunksize=(25, 1247, 10)>
Processing unstructured ocean model data (ICON):
>>> # Load ICON ocean model data with connectivity >>> icon_data = xr.open_dataset('icon_extremes.nc', chunks={}) >>> icon_extremes = icon_data.extreme_events # (time, ncells) >>> icon_mask = icon_data.mask >>> neighbours = icon_data.neighbours # Cell connectivity >>> cell_areas = icon_data.cell_areas # Physical areas >>> >>> # Track events on unstructured grid >>> unstructured_tracker = marEx.tracker( ... icon_extremes, ... icon_mask, ... R_fill=5, # 5-neighbor radius for gap filling ... area_filter_quartile=0.6, # Remove 60% of smallest events ... unstructured_grid=True, # Enable unstructured mode ... dimensions={"x": "ncells"}, # Must specify the name of the spatial dimension ... coordinates={"x": "lon", "y": "lat"}, # Spatial coordinate names ... neighbours=neighbours, # Required for unstructured ... cell_areas=cell_areas # Required for area calculations ... ) >>> unstructured_events = unstructured_tracker.run()
Memory management and checkpointing for large datasets:
>>> # Use checkpointing for very large datasets >>> large_tracker = marEx.tracker( ... extreme_events, ... mask, ... R_fill=8, ... area_filter_quartile=0.5, ... temp_dir='/scratch/user/tracking_temp', # Temporary storage ... checkpoint='save' # Save intermediate results ... ) >>> # Processing can be resumed if interrupted >>> large_events = large_tracker.run()
Comparing different filtering strategies:
>>> # Conservative filtering - keep more events >>> conservative = marEx.tracker( ... extreme_events, mask, R_fill=5, area_filter_quartile=0.1 ... ) >>> conservative_events = conservative.run() >>> >>> # Aggressive filtering - focus on largest events >>> aggressive = marEx.tracker( ... extreme_events, mask, R_fill=15, area_filter_quartile=0.8 ... ) >>> aggressive_events = aggressive.run() >>> >>> print(f"Conservative: {conservative_events.ID.max().compute()} events") >>> print(f"Aggressive: {aggressive_events.ID.max().compute()} events")
Using absolute area filtering instead of percentile-based:
>>> # Filter objects smaller than 25 grid cells >>> absolute_tracker = marEx.tracker( ... extreme_events, mask, R_fill=8, area_filter_absolute=25 ... ) >>> absolute_events = absolute_tracker.run() >>> >>> # Default behavior (area_filter_quartile=0.5) when no parameters provided >>> default_tracker = marEx.tracker(extreme_events, mask, R_fill=8) >>> default_events = default_tracker.run() # Uses quartile=0.5 filtering
Using physical cell areas for structured grids:
>>> # Load data with irregular grid cell areas >>> grid_areas = xr.open_dataset('grid_areas.nc').cell_area # (lat, lon) in m² >>> >>> # Track events using physical areas instead of cell counts >>> physical_tracker = marEx.tracker( ... extreme_events, ... mask, ... R_fill=8, ... area_filter_quartile=0.5, ... cell_areas=grid_areas # Physical areas in m² ... ) >>> events = physical_tracker.run() >>> # Now events.area contains physical areas in m² instead of cell counts
Integration with full marEx workflow:
>>> # Complete workflow from raw data to tracked events >>> raw_sst = xr.open_dataset('sst_data.nc', chunks={}).sst.chunk({'time': 30}) >>> >>> # Step 1: Preprocess to identify extremes >>> processed = marEx.preprocess_data(raw_sst, threshold_percentile=95) >>> >>> # Step 2: Track extreme events >>> tracker = marEx.tracker( ... processed.extreme_events, ... processed.mask, ... R_fill=8, ... area_filter_quartile=0.5 ... ) >>> tracked_events = tracker.run()
Initialise the tracker with parameters and data.
- __init__(data_bin, mask, R_fill, area_filter_quartile=None, area_filter_absolute=None, temp_dir=None, T_fill=2, allow_merging=True, nn_partitioning=False, overlap_threshold=0.5, unstructured_grid=False, dimensions=None, coordinates=None, neighbours=None, cell_areas=None, grid_resolution=None, max_iteration=40, checkpoint=None, debug=0, verbose=None, quiet=None, regional_mode=False, coordinate_units=None)[source]
Initialise the tracker with parameters and data.
- Parameters:
data_bin (DataArray)
mask (DataArray)
area_filter_quartile (float | None)
area_filter_absolute (int | None)
temp_dir (str | None)
T_fill (int)
allow_merging (bool)
nn_partitioning (bool)
overlap_threshold (float)
unstructured_grid (bool)
neighbours (DataArray | None)
cell_areas (DataArray | None)
grid_resolution (float | None)
max_iteration (int)
checkpoint (Literal['save', 'load', 'None'] | None)
debug (int)
verbose (bool | None)
quiet (bool | None)
regional_mode (bool)
coordinate_units (Literal['degrees', 'radians'] | None)
- Return type:
None
- run(return_merges=False, checkpoint=None)[source]
Run the complete object identification and tracking pipeline.
This method executes the full workflow:
Preprocessing: morphological operations and size filtering
Identification and tracking of objects through time
Computing and attaching statistics to the results
- Parameters:
- Returns:
events_ds (xarray.Dataset) – Dataset containing tracked events and their properties
merges_ds (xarray.Dataset, optional) – Dataset with merge event information (only if return_merges=True)
- Return type:
- run_preprocess(checkpoint=None)[source]
Preprocess binary data to prepare for tracking.
This performs morphological operations to fill holes/gaps in both space and time, then filters small objects according to the area_filter_quartile or area_filter_absolute.
- run_tracking(data_bin_preprocessed)[source]
Track objects through time to identify events.
- Parameters:
data_bin_preprocessed (xarray.DataArray) – Preprocessed binary data
- Returns:
events_ds (xarray.Dataset) – Dataset containing tracked events
merges_ds (xarray.Dataset) – Dataset with merge information
N_events_final (int) – Final number of unique events
- Return type:
- run_stats_attributes(events_ds, merges_ds, object_stats, N_events_final)[source]
Add statistics and attributes to the events dataset.
- Parameters:
events_ds (xarray.Dataset) – Dataset containing tracked events
merges_ds (xarray.Dataset) – Dataset with merge information
object_stats (tuple) – Preprocessed object statistics
N_events_final (int) – Final number of events
- Returns:
events_ds – Dataset with added statistics and attributes
- Return type:
- compute_area(data_bin)[source]
Compute the total area of binary data at each time.
- Parameters:
data_bin (xarray.DataArray) – Binary data
- Returns:
area – Total area at each time (units: pixels for structured grid, matching cell_area for unstructured)
- Return type:
- fill_holes(data_bin, R_fill=None)[source]
Fill holes and gaps using morphological operations.
This performs closing (dilation followed by erosion) to fill small gaps, then opening (erosion followed by dilation) to remove small isolated objects.
- Parameters:
data_bin (xarray.DataArray) – Binary data to process
R_fill (int, optional) – Fill radius override
- Returns:
data_bin_filled – Binary data with holes/gaps filled
- Return type:
- fill_time_gaps(data_bin)[source]
Fill temporal gaps between objects.
Performs binary closing (dilation then erosion) along the time dimension to fill small time gaps between objects.
- Parameters:
data_bin (xarray.DataArray) – Binary data to process
- Returns:
data_bin_filled – Binary data with temporal gaps filled
- Return type:
- refresh_dask_graph(data_bin)[source]
Clear and reset the Dask graph via save/load cycle.
This is needed to work around a memory leak bug in Dask where “Unmanaged Memory” builds up within loops.
- Parameters:
data_bin (xarray.DataArray) – Data to refresh
- Returns:
data_new – Data with fresh Dask graph
- Return type:
- filter_small_objects(data_bin)[source]
Remove objects smaller than a threshold area.
- Parameters:
data_bin (xarray.DataArray) – Binary data to filter
- Returns:
data_bin_filtered (xarray.DataArray) – Binary data with small objects removed
area_threshold (float) – Area threshold used for filtering
object_areas (xarray.DataArray) – Areas of all objects pre-filtering
N_objects_prefiltered (int) – Number of objects before filtering
N_objects_filtered (int) – Number of objects after filtering
- Return type:
- identify_objects(data_bin, time_connectivity)[source]
Identify connected regions in binary data.
- Parameters:
data_bin (xarray.DataArray) – Binary data to identify objects in
time_connectivity (bool) – Whether to connect objects across time
- Returns:
object_id_field (xarray.DataArray) – Field of integer IDs for each object
None (NoneType) – Placeholder for compatibility with track_objects
N_objects (int) – Number of objects identified
- Return type:
- calculate_centroid(binary_mask, original_centroid=None)[source]
Calculate object centroid, handling edge cases for periodic boundaries.
- Parameters:
binary_mask (numpy.ndarray) – 2D binary array where True indicates the object (dimensions are (y,x))
original_centroid (tuple, optional) – (y_centroid, x_centroid) from regionprops_table
- Returns:
(y_centroid, x_centroid)
- Return type:
- calculate_object_properties(object_id_field, properties=None)[source]
Calculate properties of objects from ID field.
- Parameters:
object_id_field (xarray.DataArray) – Field containing object IDs
properties (list, optional) – List of properties to calculate (defaults to [‘label’, ‘area’])
- Returns:
object_props – Dataset containing calculated properties with ‘ID’ dimension
- Return type:
- check_overlap_slice(ids_t0, ids_next)[source]
Find overlapping objects between two consecutive time slices.
- Parameters:
ids_t0 (numpy.ndarray) – Object IDs at current time
ids_next (numpy.ndarray) – Object IDs at next time
- Returns:
Array of shape (n_overlaps, 3) with [id_t0, id_next, overlap_area]
- Return type:
- find_overlapping_objects(object_id_field)[source]
Find all overlapping objects across time.
- Parameters:
object_id_field (xarray.DataArray) – Field containing object IDs
- Returns:
overlap_objects_list_unique_filtered – Array of object ID pairs that overlap across time, with overlap area The object in the first column precedes the second column in time. The third column contains:
For structured grid: number of overlapping pixels (int32)
For unstructured grid: total overlapping area in m^2 (float32)
- Return type:
(N x 3) numpy.ndarray
- enforce_overlap_threshold(overlap_objects_list, object_props)[source]
Filter object pairs based on overlap threshold.
- Parameters:
overlap_objects_list ((N x 3) numpy.ndarray) – Array of object ID pairs with overlap area
object_props (xarray.Dataset) – Object properties including area
- Returns:
overlap_objects_list_filtered – Filtered array of object ID pairs that meet the overlap threshold
- Return type:
(M x 3) numpy.ndarray
- consolidate_object_ids(data_t_minus_2, data_t_minus_1, object_props, timestep)[source]
Consolidate object IDs between t-2 and t-1 to ensure consistent tracking.
This identifies objects at t-1 that are actually continuations of objects from t-2 (but got different IDs due to partitioning) and renames them to maintain consistent IDs across timesteps.
- Parameters:
data_t_minus_2 (xr.DataArray) – Object field at timestep t-2
data_t_minus_1 (xr.DataArray) – Object field at timestep t-1 (will be modified)
object_props (xr.Dataset) – Object properties dataset (will be modified)
timestep (int) – Current timestep number for logging purposes
- Returns:
data_t_minus_1_consolidated (xr.DataArray) – Updated t-1 field with consolidated IDs
object_props_updated (xr.Dataset) – Updated object properties with merged/deleted objects
- Return type:
Notes
Uses self.overlap_threshold for determining consolidation eligibility
Updates object properties by recalculating for consolidated objects
Removes redundant child objects from object_props
- compute_id_time_dict(da, child_objects, max_objects, all_objects=True)[source]
Generate lookup table mapping object IDs to their time index.
- Parameters:
da (xarray.DataArray) – Field of object IDs
child_objects (list or array) – Object IDs to include in the dictionary
max_objects (int) – Maximum number of objects
all_objects (bool, default=True) – Whether to process all objects or just child_objects
- Returns:
time_index_map – Dictionary mapping object IDs to time indices
- Return type:
- track_objects(data_bin)[source]
Track objects through time to form events.
This is the main tracking method that handles splitting and merging of objects.
- Parameters:
data_bin (xarray.DataArray) – Preprocessed binary data: Field of globally unique integer IDs of each element in connected regions. ID = 0 indicates no object.
- Returns:
split_merged_events_ds (xarray.Dataset) – Dataset containing tracked events
merge_events (xarray.Dataset) – Dataset with merge information
N_events (int) – Final number of events
- Return type:
- cluster_rename_objects_and_props(object_id_field_unique, object_props, overlap_objects_list, merge_events)[source]
Cluster the object pairs and relabel to determine final event IDs.
- Parameters:
object_id_field_unique (xarray.DataArray) – Field of unique object IDs. IDs must not be repeated across time.
object_props (xarray.Dataset) – Properties of each object that also need to be relabeled.
overlap_objects_list ((N x 2) numpy.ndarray) – Array of object ID pairs that indicate which objects are in the same event. The object in the first column precedes the second column in time.
merge_events (xarray.Dataset) – Information about merge events
- Returns:
split_merged_events_ds – Dataset with relabeled events and their properties. ID = 0 indicates no object.
- Return type:
- split_and_merge_objects(object_id_field_unique, object_props)[source]
Implement object splitting and merging logic.
This identifies and processes cases where objects split or merge over time, creating new object IDs as needed.
- Parameters:
object_id_field_unique (xarray.DataArray) – Field of unique object IDs. IDs are required to be monotonically increasing with time.
object_props (xarray.Dataset) – Properties of each object
- Returns:
(object_id_field, object_props, overlap_objects_list, merge_events)
- Return type:
- split_and_merge_objects_parallel(object_id_field_unique, object_props)[source]
Optimised parallel implementation of object splitting and merging.
This version is specifically designed for unstructured grids with more efficient memory handling and better parallelism than the standard split_and_merge_objects method. It processes data in chunks, handles merging events, and efficiently updates object IDs.
- Parameters:
object_id_field_unique (xarray.DataArray) – Field of unique object IDs
object_props (xarray.Dataset) – Properties of each object
- Returns:
(object_id_field, object_props, overlap_objects_list, merge_events)
- Return type:
- marEx.track.wrapped_euclidian_distance_mask_parallel(mask_values, parent_centroids_values, Nx, wrap)[source]
Optimised function for computing wrapped Euclidean distances.
Efficiently calculates distances between points in a binary mask and a set of centroids, accounting for periodic boundaries in the x dimension.
- Parameters:
mask_values (np.ndarray) – 2D boolean array where True indicates points to calculate distances for
parent_centroids_values (np.ndarray) – Array of shape (n_parents, 2) containing (y, x) coordinates of parent centroids
Nx (int) – Size of the x-dimension for periodic boundary wrapping
wrap (bool) – Whether to treat x-dimension as periodic and wrap
- Returns:
distances – Array of shape (n_true_points, n_parents) with minimum distances
- Return type:
np.ndarray
- marEx.track.create_grid_index_arrays(points_y, points_x, grid_size, ny, nx)[source]
Create a grid-based spatial index for efficient point lookup.
This function divides space into a grid and assigns points to grid cells for more efficient spatial queries compared to brute force comparisons.
- marEx.track.wrapped_euclidian_distance_points(y1, x1, y2, x2, nx, half_nx, wrap)[source]
Calculate distance with periodic boundary conditions in x dimension.
- Parameters:
y1 (float) – Coordinates of first point
x1 (float) – Coordinates of first point
y2 (float) – Coordinates of second point
x2 (float) – Coordinates of second point
nx (int) – Size of x dimension
half_nx (float) – Half the size of x dimension
wrap (bool) – Whether to apply periodic boundary conditions in x
- Returns:
Euclidean distance accounting for periodic boundary in x (or not)
- Return type:
- marEx.track.partition_nn_grid(child_mask, parent_masks, child_ids, parent_centroids, Nx, max_distance=20, wrap=True)[source]
Partition a child object based on nearest parent object points.
This implementation uses spatial indexing and highly-threaded processing for efficient distance calculations. The algorithm assigns each point in the child object to the closest parent object.
- Parameters:
child_mask (np.ndarray) – Binary mask of the child object
parent_masks (np.ndarray) – List of binary masks for each parent object
child_ids (np.ndarray) – List of IDs to assign to partitions
parent_centroids (np.ndarray) – Array of shape (n_parents, 2) with parent centroids
Nx (int) – Size of x dimension for periodic boundaries
max_distance (int, default=20) – Maximum search distance
wrap (bool, default=True) – Whether to apply periodic boundary conditions in the x dimension
- Returns:
new_labels – Array containing assigned child_ids for each point
- Return type:
np.ndarray
- marEx.track.partition_nn_unstructured(child_mask, parent_masks, child_ids, parent_centroids, neighbours_int, lat, lon, max_distance=20)[source]
Partition a child object on an unstructured grid based on nearest parent points.
This function implements an efficient algorithm for assigning each cell in a child object to the nearest parent object, using graph traversal and spatial distances. It is optimised for unstructured grids.
- Parameters:
child_mask (np.ndarray) – 1D boolean array where True indicates points in the child object
parent_masks (np.ndarray) – 2D boolean array of shape (n_parents, n_points) where True indicates points in each parent object
child_ids (np.ndarray) – 1D array containing the IDs to assign to each partition of the child object
parent_centroids (np.ndarray) – Array of shape (n_parents, 2) containing (lat, lon) coordinates of parent centroids in degrees
neighbours_int (np.ndarray) – 2D array of shape (3, n_points) containing indices of neighboring cells for each point
lat (np.ndarray) – Latitude/longitude arrays in degrees
lon (np.ndarray) – Latitude/longitude arrays in degrees
max_distance (int, default=20) – Maximum number of edge hops to search for parent points
- Returns:
new_labels – 1D array containing the assigned child_ids for each True point in child_mask
- Return type:
np.ndarray
- marEx.track.partition_nn_unstructured_optimised(child_mask, parent_frontiers, parent_centroids, neighbours_int, lat, lon, max_distance=20)[source]
Memory-optimised nearest neighbor partitioning for unstructured grids.
This version uses more efficient memory management compared to partition_nn_unstructured, making it suitable for very large grids. It uses a compact representation of parent frontiers to reduce memory usage during graph traversal.
- Parameters:
child_mask (np.ndarray) – 1D boolean array indicating which cells belong to the child object
parent_frontiers (np.ndarray) – 1D uint8 array with parent indices (255 for unvisited points)
parent_centroids (np.ndarray) – Array of shape (n_parents, 2) containing (lat, lon) coordinates
neighbours_int (np.ndarray) – 2D array of shape (3, n_points) containing indices of neighboring cells
lat (np.ndarray) – 1D arrays of latitude/longitude in degrees
lon (np.ndarray) – 1D arrays of latitude/longitude in degrees
max_distance (int, default=20) – Maximum number of edge hops to search for parent points
- Returns:
result – 1D array containing the assigned parent indices for points in child_mask
- Return type:
np.ndarray
- marEx.track.partition_centroid_unstructured(child_mask, parent_centroids, child_ids, lat, lon)[source]
Partition a child object based on closest parent centroids on an unstructured grid.
This function assigns each cell in the child object to the parent with the closest centroid, using great circle distances on a spherical grid.
- Parameters:
child_mask (np.ndarray) – 1D boolean array indicating which cells belong to the child object
parent_centroids (np.ndarray) – Array of shape (n_parents, 2) containing (lat, lon) coordinates of parent centroids in degrees
child_ids (np.ndarray) – Array of IDs to assign to each partition of the child object
lat (np.ndarray) – Latitude/longitude arrays in degrees
lon (np.ndarray) – Latitude/longitude arrays in degrees
- Returns:
new_labels – 1D array containing assigned child_ids for cells in child_mask
- Return type:
np.ndarray
- marEx.track.sparse_bool_power(vec, sp_data, indices, indptr, exponent)[source]
Efficient sparse boolean matrix power operation.
This function implements a fast sparse matrix power operation for boolean matrices, avoiding memory leaks present in scipy+Dask implementations. It’s used for efficient morphological operations on unstructured grids.
- Parameters:
vec (np.ndarray) – Boolean vector to multiply
sp_data (np.ndarray) – Sparse matrix in CSR format
indices (np.ndarray) – Sparse matrix in CSR format
indptr (np.ndarray) – Sparse matrix in CSR format
exponent (int) – Number of times to apply the matrix
- Returns:
Result of (sparse_matrix ^ exponent) * vec
- Return type:
np.ndarray
- marEx.track.regional_tracker(data_bin, mask, coordinate_units, R_fill, area_filter_quartile=None, area_filter_absolute=None, **kwargs)[source]
Create a tracker instance configured for regional (non-global) data.
This is a convenience function that automatically sets regional_mode=True and requires explicit specification of coordinate units, since auto-detection may fail for regional coordinate ranges.
- Parameters:
data_bin (xr.DataArray) – Binary data to identify and track objects in (True = object, False = background)
mask (xr.DataArray) – Binary mask indicating valid regions (True = valid, False = invalid)
coordinate_units ({'degrees', 'radians'}) – Units of the coordinate system. Must be specified for regional data.
R_fill (int or float) – Radius for filling holes/gaps in spatial domain (in grid cells)
area_filter_quartile (float, optional) – Quantile (0-1) for filtering smallest objects (e.g., 0.25 removes smallest 25%). Mutually exclusive with area_filter_absolute. Default is 0.5 if neither parameter is provided.
area_filter_absolute (int, optional) – The minimum area (in grid cells) for an object to be retained. Mutually exclusive with area_filter_quartile.
**kwargs – Additional parameters passed to the tracker class
- Returns:
Configured tracker instance with regional_mode=True
- Return type:
Examples
Track events in regional Mediterranean Sea data:
>>> import marEx >>> # For regional data with degree coordinates >>> regional_tracker = marEx.regional_tracker( ... extreme_events, ... mask, ... coordinate_units='degrees', ... R_fill=5, ... area_filter_quartile=0.3 ... ) >>> events = regional_tracker.run()
Track events in regional data with radian coordinates:
>>> # For model output with radian coordinates >>> regional_tracker = marEx.regional_tracker( ... extreme_events, ... mask, ... coordinate_units='radians', ... R_fill=8, ... area_filter_quartile=0.5 ... ) >>> events = regional_tracker.run()
Using absolute area filtering in regional mode:
>>> # Keep only features larger than 15 grid cells >>> absolute_regional = marEx.regional_tracker( ... extreme_events, ... mask, ... coordinate_units='degrees', ... R_fill=5, ... area_filter_absolute=15 ... ) >>> events = absolute_regional.run()
Classes
|
Tracker identifies and tracks arbitrary binary objects in spatial data through time. |
Functions
|
Create a tracker instance configured for regional (non-global) data. |
Plotting Module (marEx.plotX)
MarEx-PlotX: Marine Extremes Visualisation Module
Comprehensive visualisation module for marine extreme events supporting both structured (regular grid) and unstructured oceanographic data. Provides automated grid detection and specialised plotting capabilities optimised for each data structure type.
Core capabilities:
Polymorphic plotting with automatic grid type detection
Single plot generation with customisable styling and projections
Multi-panel plotting for comparative analysis
Animation generation for temporal visualisation
Memory-efficient handling of large oceanographic datasets
Supported data formats:
Structured data: 3D arrays (time, lat, lon) for regular rectangular grids
Unstructured data: 2D arrays (time, cell) for irregular triangular meshes
- class marEx.plotX.PlotXAccessor(xarray_obj)[source]
Bases:
objectXarray accessor for plotX functionality with support for custom dimensions and coordinates.
Initialise the PlotXAccessor.
- Parameters:
xarray_obj (DataArray)
- __call__(dimensions=None, coordinates=None)[source]
Create a plotter instance with optional custom dimensions and coordinates.
- Parameters:
- Returns:
Appropriate plotter instance for the data structure
- Return type:
- single_plot(config, **kwargs)[source]
Create a single plot with default dimension detection.
- Parameters:
config (PlotConfig)
- multi_plot(config, **kwargs)[source]
Create multiple plots with default dimension detection.
- Parameters:
config (PlotConfig)
- animate(config, **kwargs)[source]
Create animation with default dimension detection.
- Parameters:
config (PlotConfig)
- marEx.plotX.specify_grid(grid_type=None, fpath_tgrid=None, fpath_ckdtree=None)[source]
Set the global grid specification that will be used by all plotters.
- Parameters:
- Return type:
None
Configuration Classes
|
Configuration class for plot parameters |
Functions
|
Set the global grid specification that will be used by all plotters. |
Submodules
Helper Module (marEx.helper)
HPC Dask Helper: Utilities for High-Performance Computing with Dask
This module provides utilities for setting up and managing Dask clusters in HPC environments, with specific support for the DKRZ Levante Supercomputer.
- marEx.helper.configure_dask(scratch_dir=None, config=None)[source]
Configure Dask with appropriate settings for HPC environments.
- marEx.helper.get_cluster_info(client)[source]
Get and print cluster connection information.
- Parameters:
client (Client) – Dask client connected to a cluster.
- Returns:
Dictionary containing connection information.
- Return type:
Examples
Basic cluster info retrieval:
>>> import marEx >>> >>> # Start local cluster >>> client = marEx.start_local_cluster(n_workers=2) >>> >>> # Get connection information >>> info = marEx.get_cluster_info(client) Hostname: login01 Forward Port: login01:8787 Dashboard Link: localhost:8787/status >>> >>> print(f"Connect via: {info['dashboard_link']}") Connect via: localhost:8787/status >>> client.close()
SSH tunneling for remote access:
>>> # Start cluster on HPC system >>> client = marEx.start_distributed_cluster( ... n_workers=8, workers_per_node=4, dashboard_address=8889 ... ) >>> >>> # Get tunneling information >>> info = marEx.get_cluster_info(client) Hostname: levante-login01 Forward Port: levante-login01:8889 Dashboard Link: localhost:8889/status >>> >>> # Use this info to set up SSH tunnel: >>> # ssh -L 8889:localhost:8889 levante-login01.dkrz.de >>> # Then access dashboard at localhost:8889/status >>> >>> client.close()
Monitoring cluster status:
>>> client = marEx.start_local_cluster(n_workers=4) >>> info = marEx.get_cluster_info(client) >>> >>> # Access cluster details >>> print(f"Dashboard URL: {client.dashboard_link}") >>> print(f"Cluster address: {client.cluster.scheduler_address}") >>> print(f"Total threads: {client.nthreads()}") >>> >>> # Use port for programmatic access >>> import requests >>> try: ... response = requests.get(f"http://localhost:{info['port']}/info") ... print(f"Scheduler info: {response.status_code}") ... except: ... print("Dashboard not accessible") >>> >>> client.close()
- marEx.helper.start_local_cluster(n_workers=4, threads_per_worker=1, scratch_dir=None, verbose=None, quiet=None, **kwargs)[source]
Start a local Dask cluster.
- Parameters:
n_workers (int, default=4) – Number of worker processes to start.
threads_per_worker (int, default=1) – Number of threads per worker.
scratch_dir (str or Path, optional) – Directory to use for temporary files.
verbose (bool, optional) – Enable verbose logging with detailed cluster startup information. If None, uses current global logging configuration.
quiet (bool, optional) – Enable quiet logging with minimal output (warnings and errors only). If None, uses current global logging configuration. Note: quiet takes precedence over verbose if both are True.
**kwargs – Additional keyword arguments to pass to LocalCluster.
- Returns:
Dask client connected to the local cluster.
- Return type:
Client
Examples
Basic local cluster for development:
>>> import marEx >>> >>> # Start simple local cluster >>> client = marEx.start_local_cluster(n_workers=2, threads_per_worker=1) >>> print(client) <Client: 'tcp://127.0.0.1:xxxxx' processes=2 threads=2, memory=15.7 GB> >>> >>> # Check cluster status >>> print(f"Dashboard: {client.dashboard_link}") Dashboard: http://127.0.0.1:8787/status >>> >>> client.close()
Optimised cluster for CPU-intensive work:
>>> # Use one worker per physical core >>> client = marEx.start_local_cluster( ... n_workers=8, # Number of physical cores ... threads_per_worker=1 # Avoid hyperthreading for compute ... ) >>> >>> # Process data with the cluster >>> import xarray as xr >>> data = xr.open_dataset('large_data.nc').chunk({'time': 25}) >>> result = data.mean().compute() >>> client.close()
Memory-optimised cluster:
>>> # Configure for large datasets >>> client = marEx.start_local_cluster( ... n_workers=4, ... threads_per_worker=2, ... memory_limit='8GB', # Limit memory per worker ... scratch_dir='/tmp/dask' # Fast (e.g. Lustre) local storage ... )
Integration with marEx preprocessing:
>>> # Start cluster then process data >>> client = marEx.start_local_cluster(n_workers=16) >>> >>> # Load and preprocess SST data >>> sst = xr.open_dataset('sst_data.nc').sst.chunk({'time': 30}) >>> processed = marEx.preprocess_data(sst, threshold_percentile=95) >>> >>> # Track events using the cluster >>> tracker = marEx.tracker( ... processed.extreme_events, ... processed.mask, ... R_fill=8, ... area_filter_quartile=0.5 ... ) >>> events = tracker.run() >>> >>> print(f"Processed {len(sst.time)} time steps with {client.nthreads()} threads") >>> client.close()
Custom worker configuration:
>>> # Advanced configuration for specific workloads >>> client = marEx.start_local_cluster( ... n_workers=4, ... threads_per_worker=2, ... processes=True, # Use separate processes (default) ... silence_logs=False, # Keep logs for debugging ... dashboard_address=':8787' # Specific dashboard port ... )
- marEx.helper.start_distributed_cluster(n_workers, workers_per_node, runtime=9, node_memory=256, dashboard_address=8889, queue='compute', scratch_dir=None, account=None, verbose=None, quiet=None, **kwargs)[source]
Start a distributed Dask cluster on a SLURM-based supercomputer.
- Parameters:
n_workers (int) – Total number of workers to request.
workers_per_node (int) – Number of workers per node.
runtime (int, default=9) – Maximum runtime in minutes.
node_memory (int, default=256) – Memory per node in GB (256, 512, or 1024).
dashboard_address (int, default=8889) – Port for the Dask dashboard.
queue (str, default='compute') – SLURM queue to submit jobs to.
scratch_dir (str or Path, optional) – Directory to use for temporary files.
account (str, optional) – SLURM account to charge. Defaults to DKRZ_ACCOUNT.
**kwargs – Additional keyword arguments to pass to SLURMCluster.
verbose (bool | None)
quiet (bool | None)
- Returns:
Dask client connected to the distributed cluster.
- Return type:
Client
Examples
Basic SLURM cluster for medium-scale processing:
>>> import marEx >>> >>> # Start cluster with 16 workers on 4 nodes (4 workers per node) >>> client = marEx.start_distributed_cluster( ... n_workers=16, ... workers_per_node=4, ... runtime=60, # 1 hour ... node_memory=128 # 128GB nodes ... ) >>> print(f"Cluster: {client}") >>> # Access dashboard via SSH tunnel >>> cluster_info = marEx.get_cluster_info(client) >>> client.close()
Processing large marEx workflow on HPC:
>>> # Start cluster for full marEx pipeline >>> client = marEx.start_distributed_cluster( ... n_workers=32, ... workers_per_node=8, ... runtime=60, # 1 hour ... node_memory=256 # 256GB nodes ... ) >>> >>> # Load very large SST dataset >>> import xarray as xr >>> sst = xr.open_zarr('/work/data/large_sst.zarr').chunk({'time': 25}) >>> >>> # Preprocess with distributed computing >>> processed = marEx.preprocess_data( ... sst, ... method_anomaly="shifting_baseline", ... threshold_percentile=95 ... ) >>> >>> # Track events using full cluster >>> tracker = marEx.tracker( ... processed.extreme_events, ... processed.mask, ... R_fill=12, ... area_filter_quartile=0.3 ... ) >>> events = tracker.run() >>> >>> # Save results >>> events.to_zarr('/work/results/tracked_events.zarr') >>> client.close()
Custom SLURM configuration:
>>> # Advanced SLURM configuration >>> client = marEx.start_distributed_cluster( ... n_workers=64, ... workers_per_node=32, ... runtime=20, # 20 minutes ... node_memory=256, ... account='myproject', # Custom account ... queue='debug', # debug queue ... )
Dashboard access and monitoring:
>>> # Start cluster and set up monitoring >>> client = marEx.start_distributed_cluster( ... n_workers=16, ... workers_per_node=4, ... dashboard_address=8890 # Custom dashboard port ... ) >>> >>> # Get connection info for SSH tunneling >>> info = marEx.get_cluster_info(client) >>> print(f"SSH tunnel: ssh -L {info['port']}:localhost:{info['port']} {info['hostname']}") >>> print(f"Dashboard: {info['dashboard_link']}") >>> >>> # Monitor cluster performance >>> print(f"Workers: {len(client.scheduler_info()['workers'])}") >>> print(f"Total memory: {sum(w['memory_limit'] for w in client.scheduler_info()['workers'].values()) / 1e9:.1f} GB")
Memory optimisation strategies:
>>> # Optimise for different workload types >>> >>> # Memory-intensive: fewer workers per node >>> memory_cluster = marEx.start_distributed_cluster( ... n_workers=8, workers_per_node=2, node_memory=512 ... ) >>> >>> # CPU-intensive: more workers per node >>> cpu_cluster = marEx.start_distributed_cluster( ... n_workers=64, workers_per_node=16, node_memory=256 ... )
- marEx.helper.checkpoint_to_zarr(data, name='checkpoint', cleanup=False, timedim='time')[source]
Save and reload a Dask-backed xarray object to break graph dependencies.
This function materialises a Dask array/dataset to a temporary file and immediately reloads it, thereby breaking the computational graph. This prevents expensive recomputations when the same data is used multiple times downstream.
- Parameters:
data (xarray.DataArray or xarray.Dataset) – Dask-backed xarray object to checkpoint
name (str, default='checkpoint') – Name prefix for the temporary file (for logging/debugging)
cleanup (bool, default=False) – Whether to delete the temporary file after reloading. By default (False), temp files are kept for the session and cleaned up by the OS temp directory manager. Set to True to immediately delete after reload.
timedim (str, default='time') – Name of the time dimension for chunking adjustments
- Returns:
Reloaded data with broken graph dependencies
- Return type:
Examples
>>> import marEx >>> anomalies = marEx.compute_normalised_anomaly(sst) >>> anomalies_checkpointed = marEx.helper.checkpoint_to_zarr( ... anomalies, name="anomalies" ... )
- marEx.helper.fix_dask_tuple_array(da)[source]
Fix a dask array that has tuple (i.e. task) references in its chunks. This addresses a longstanding issue/bug when dask arrays are saved to Zarr. Process chunk by chunk to maintain memory efficiency.
- Parameters:
da (xarray.DataArray) – DataArray with Dask array backend that may have tuple chunk references
- Returns:
DataArray with materialised chunks that can be safely saved to Zarr
- Return type:
HPC Cluster Management
Start a distributed Dask cluster on a SLURM-based supercomputer. |
|
Start a local Dask cluster. |
|
|
Configure Dask with appropriate settings for HPC environments. |
|
Get and print cluster connection information. |
Fix a dask array that has tuple (i.e. task) references in its chunks. |
Detailed API Documentation
Detection and Preprocessing
- marEx.preprocess_data(da, method_anomaly='shifting_baseline', method_extreme='hobday_extreme', threshold_percentile=95, window_year_baseline=15, smooth_days_baseline=21, window_days_hobday=11, window_spatial_hobday=None, std_normalise=False, detrend_orders=None, force_zero_mean=True, reference_period=None, method_percentile='approximate', precision=0.01, max_anomaly=5.0, dask_chunks=None, dimensions=None, coordinates=None, neighbours=None, cell_areas=None, use_temp_checkpoints=False, verbose=None, quiet=None)[source]
Complete preprocessing pipeline for marine extreme event identification.
Supports separate methods for anomaly computation and extreme identification:
Anomaly Methods:
‘detrend_harmonic’: Detrending with harmonics and polynomials – more efficient, but biases statistics
‘shifting_baseline’: Rolling climatology using previous window_year_baseline years – more “correct”, but shortens time series by window_year_baseline years
‘fixed_baseline’: Daily climatology using full time series – does _not_ remove climate trends !
‘detrend_fixed_baseline’: Polynomial detrending followed by fixed daily climatology – keeps full time-series of data, but does not account for trends in the timing of seasonal transitions (which may appear as extremes)
Extreme Methods:
‘global_extreme’: Global-in-time threshold value
‘hobday_extreme’: Local day-of-year specific thresholds with windowing
- Parameters:
da (xarray.DataArray) – Raw input data
method_anomaly (str, default='shifting_baseline') – Anomaly computation method (‘detrend_harmonic’, ‘shifting_baseline’, ‘fixed_baseline’, or ‘detrend_fixed_baseline’).
method_extreme (str, default='hobday_extreme') – Extreme identification method (‘global_extreme’ or ‘hobday_extreme’).
threshold_percentile (float, default=95) – Percentile threshold for extreme event detection.
window_year_baseline (int, default=15) – Number of previous years for rolling climatology (shifting_baseline method only).
smooth_days_baseline (int, default=21) – Days for smoothing rolling climatology (shifting_baseline method only).
window_days_hobday (int, default=11) – Window size for day-of-year threshold calculation (hobday_extreme method only).
window_spatial_hobday (int, default=None) – Spatial window size (2D centred window) for the day-of-year threshold calculation (hobday_extreme method only).
std_normalise (bool, default=False) – Whether to standardise anomalies by rolling standard deviation (detrend_harmonic only).
detrend_orders (list, default=[1]) – Polynomial orders for detrending (detrend_harmonic method only). Default is 1st order (linear) detrend. [1,2] e.g. would use a linear+quadratic detrending.
force_zero_mean (bool, default=True) – Whether to enforce zero mean in detrended anomalies (detrend_harmonic method only).
reference_period (tuple of (int, int), optional) – Year range (start_year, end_year) inclusive for computing the daily climatology (fixed_baseline and detrend_fixed_baseline only). If None (default), uses all available years. Anomalies are computed for the full time series regardless. Example: reference_period=(1990, 2020) computes the climatology from 1990-2020 but outputs anomalies for the entire input time range.
method_percentile (str, default='approximate') – Method for percentile calculation (‘exact’ or ‘approximate’) for both global_extreme & hobday_extreme methods. N.B.: Using the exact percentile calculation requires both careful/thoughtful chunking & sufficient memory, in consideration of the limitations inherent to distributed parallel I/O & processing.
precision (float, default=0.01) – Precision for histogram bins in approximate percentile method.
max_anomaly (float, default=5.0) – Maximum anomaly value for histogram binning in the approximate percentile method.
dask_chunks (dict, optional) – Chunking specification for distributed computation.
dimensions (dict, default={"time": "time", "x": "lon", "y": "lat"}) – Mapping of dimensions to names in the data.
coordinates (dict, optional) – Mapping of coordinates to names in the data. Defaults to dimensions mapping.
neighbours (xarray.DataArray, optional) – Neighbour connectivity for spatial clustering.
cell_areas (xarray.DataArray, optional) – Cell areas for weighted spatial statistics.
use_temp_checkpoints (bool, default=False) – Enable checkpointing to temporary zarr stores to break Dask graph dependencies. When True, intermediate results (anomalies, thresholds, extremes) are saved to temporary zarr files and immediately reloaded, preventing expensive recomputations. Recommended for large datasets on HPC systems where the 2D histogram computation is expensive. Temporary files are automatically cleaned up after reloading.
verbose (bool, default=None) – Enable verbose logging with detailed progress information. If None, uses current global logging configuration.
quiet (bool, default=None) – Enable quiet logging with minimal output (warnings and errors only). If None, uses current global logging configuration. Note: quiet takes precedence over verbose if both are True.
- Returns:
Processed dataset with anomalies and extreme event identification
- Return type:
Examples
Basic usage with gridded SST data for marine heatwave detection:
>>> import xarray as xr >>> import marEx >>> >>> # Load and chunk SST data >>> sst = xr.open_dataset('sst_data.nc', chunks={}).sst.chunk({'time': 30}) >>> >>> # Basic preprocessing with default shifting baseline method >>> result = marEx.preprocess_data(sst, threshold_percentile=90) >>> print(result) <xarray.Dataset> Dimensions: (time: 1461, lat: 180, lon: 360) Data variables: dat_anomaly (time, lat, lon) float32 dask.array<chunksize=(30, 180, 360)> mask (lat, lon) bool dask.array<chunksize=(180, 360)> extreme_events (time, lat, lon) bool dask.array<chunksize=(30, 180, 360)> thresholds (lat, lon) float32 dask.array<chunksize=(180, 360)>
>>> # Check which locations have extreme events >>> print(f"Total extreme events: {result.extreme_events.sum().compute()}") Total extreme events: 15847
Using shifting baseline method for more accurate climatology:
>>> # Requires at least 15 years of data by default >>> result_shifting = marEx.preprocess_data( ... sst, ... method_anomaly="shifting_baseline", ... window_year_baseline=10, # Use shorter window if needed ... smooth_days_baseline=31 # Longer smoothing window ... ) >>> # Note: First 10 years will be removed from output
Using Hobday extreme method with day-of-year specific thresholds:
>>> result_hobday = marEx.preprocess_data( ... sst, ... method_extreme="hobday_extreme", ... window_days_hobday=11, # 11-day window for each day-of-year ... threshold_percentile=95 ... ) >>> print(result_hobday.thresholds.dims) ('dayofyear', 'lat', 'lon')
Previous configuration (marEx v2.0 default) with polynomial detrending and standardisation:
>>> result_advanced = marEx.preprocess_data( ... sst, ... method_anomaly="detrend_harmonic", ... detrend_orders=[1, 2], # Linear and quadratic trends ... std_normalise=True, # Standardise by rolling std ... force_zero_mean=True, ... threshold_percentile=95 ... ) >>> # Result includes both raw and standardised anomalies >>> print('dat_stn' in result_advanced) True
Processing unstructured data:
>>> # For ICON ocean model data >>> icon_sst = xr.open_dataset('icon_sst.nc', chunks={}).to.chunk({'time': 50}) >>> result_unstructured = marEx.preprocess_data( ... icon_sst, ... dimensions={"x": "ncells"}, # Must specify the name of the spatial dimension ... dask_chunks={"time": 50} ... )
Error handling - insufficient data for shifting baseline:
>>> short_data = sst.isel(time=slice(0, 1000)) # Only ~3 years >>> try: ... result = marEx.preprocess_data( ... short_data, ... method_anomaly="shifting_baseline", ... window_year_baseline=15 ... ) ... except ValueError as e: ... print(f"Error: {e}") Error: Insufficient data for shifting_baseline method. Dataset spans 3 years but window_year_baseline requires at least 15 years.
Performance considerations with chunking:
>>> # For large datasets, adjust chunking for memory management >>> large_sst = sst.chunk({"time": 25, "lat": 90, "lon": 180}) >>> result = marEx.preprocess_data( ... large_sst, ... dask_chunks={"time": 25}, ... method_percentile="approximate" # Use approximate method (Default) for long time-series calculations ... )
Integration with tracking workflow:
>>> # Preprocess data then track events >>> processed = marEx.preprocess_data(sst, threshold_percentile=95) >>> tracker = marEx.tracker( ... processed.extreme_events, ... processed.mask, ... R_fill=8, ... area_filter_quartile=0.5 ... ) >>> events = tracker.run() >>> print(f"Identified {events.event.max().compute()} distinct events")
Simple fixed baseline approach:
>>> # Basic daily climatology across all years >>> result_fixed = marEx.preprocess_data( ... sst, ... method_anomaly="fixed_baseline", ... threshold_percentile=95 ... ) >>> # Uses all available data for climatology computation
Combined trend removal and fixed climatology:
>>> # Remove long-term trends then compute daily climatology >>> result_combined = marEx.preprocess_data( ... sst, ... method_anomaly="detrend_fixed_baseline", ... detrend_orders=[1], # Linear trend ... threshold_percentile=95, ... force_zero_mean=True ... ) >>> # Balances trend removal with simple climatology
- marEx.compute_normalised_anomaly(da, method_anomaly='shifting_baseline', dimensions=None, coordinates=None, window_year_baseline=15, smooth_days_baseline=21, std_normalise=False, detrend_orders=None, force_zero_mean=True, reference_period=None, use_temp_checkpoints=False, verbose=None, quiet=None)[source]
Generate normalised anomalies using specified methodology.
- Parameters:
da (xarray.DataArray) – Input data with dimensions matching the ‘dimensions’ parameter
method_anomaly (str, default='shifting_baseline') –
Anomaly computation method. Options: - ‘detrend_harmonic’: Detrending with harmonics and polynomials (efficient, biased) - ‘shifting_baseline’: Rolling climatology (accurate, shortens time series) - ‘fixed_baseline’: Daily climatology using full time series (keeps long-term trends in the anomaly) - ‘detrend_fixed_baseline’: Polynomial detrending + fixed climatology (does not shorten time series,
keeps trends in seasonal timing in the anomaly)
dimensions (dict, optional) – Mapping of conceptual dimensions to actual dimension names in the data
coordinates (dict, optional) – Mapping of conceptual coordinates to actual coordinate names in the data
window_year_baseline (int, default=15) – Number of years for rolling climatology (shifting_baseline only)
smooth_days_baseline (int, default=21) – Days for smoothing rolling climatology (shifting_baseline only)
std_normalise (bool, default=False) – Whether to normalise by 30-day rolling standard deviation (detrend_harmonic only)
detrend_orders (list, default=[1]) – Polynomial orders for trend removal (detrend_harmonic and detrend_fixed_baseline only)
force_zero_mean (bool, default=True) – Explicitly enforce zero mean in final anomalies (detrend_harmonic and detrend_fixed_baseline only)
reference_period (tuple of (int, int), optional) – Year range (start_year, end_year) inclusive for computing the daily climatology (fixed_baseline and detrend_fixed_baseline only). If None (default), uses all available years. Anomalies are computed for the full time series regardless.
use_temp_checkpoints (bool)
verbose (bool | None)
quiet (bool | None)
- Returns:
Dataset containing anomalies, mask, and metadata
- Return type:
Examples
Basic detrended baseline anomaly computation:
>>> import xarray as xr >>> import marEx >>> >>> # Load chunked SST data >>> sst = xr.open_dataset('sst_data.nc', chunks={}).sst.chunk({'time': 30}) >>> >>> # Compute anomalies using shifting baseline (default) >>> result = marEx.compute_normalised_anomaly(sst) >>> print(result.data_vars) Data variables: dat_anomaly (time, lat, lon) float32 dask.array<chunksize=(30, 180, 360)> mask (lat, lon) bool dask.array<chunksize=(180, 360)>
>>> # Check that anomalies have approximately zero mean >>> print(f"Mean anomaly: {result.dat_anomaly.mean().compute():.6f}") Mean anomaly: 0.000023
Previous configuration (marEx v2.0 default) of detrended baseline with higher-order polynomials and standardisation. Note: marEx v3.0+ uses shifting_baseline as the default method:
>>> result_advanced = marEx.compute_normalised_anomaly( ... sst, ... method_anomaly="detrend_harmonic", ... detrend_orders=[1, 2, 3], # Linear, quadratic, cubic trends ... std_normalise=True, # Add standardised anomalies ... force_zero_mean=True ... ) >>> print(result_advanced.data_vars) Data variables: dat_anomaly (time, lat, lon) float32 dask.array<chunksize=(30, 180, 360)> mask (lat, lon) bool dask.array<chunksize=(180, 360)> dat_stn (time, lat, lon) float32 dask.array<chunksize=(30, 180, 360)> STD (dayofyear, lat, lon) float32 dask.array<chunksize=(366, 180, 360)>
>>> # Standardised anomalies have unit variance >>> print(f"STD of standardised anomalies: {result_advanced.dat_stn.std().compute():.3f}")
Accurate shifting baseline method for climate-aware anomalies:
>>> result_shifting = marEx.compute_normalised_anomaly( ... sst, ... method_anomaly="shifting_baseline", ... window_year_baseline=10, # Use 10-year rolling climatology ... smooth_days_baseline=31 # 31-day smoothing window ... ) >>> # Anomalies computed relative to recent past climatology
Processing unstructured data:
>>> # ICON ocean model with ncells dimension >>> icon_data = xr.open_dataset('icon_sst.nc', chunks={}).to.chunk({'time': 25}) >>> result_unstructured = marEx.compute_normalised_anomaly( ... icon_data, ... dimensions={"time": "time", "x": "ncells"} ... coordinates={"time": "time", "x": "lon", "y": "lat"}, ... ) >>> print(result_unstructured.dims) Frozen({'time': 1461, 'ncells': 83886})
Comparison of methods - detrended vs shifting baseline:
>>> # Detrended baseline - faster, slight bias >>> detrended = marEx.compute_normalised_anomaly( ... sst, method_anomaly="detrend_harmonic" ... ) >>> >>> # Shifting baseline - slower, more accurate >>> shifting = marEx.compute_normalised_anomaly( ... sst, method_anomaly="shifting_baseline", ... window_year_baseline=15 ... ) >>> >>> # Compare anomaly magnitudes >>> print(f"Detrended RMS: {detrended.dat_anomaly.std().compute():.3f}") >>> print(f"Shifting RMS: {shifting.dat_anomaly.std().compute():.3f}")
Fixed baseline climatology:
>>> # Use full time series for daily climatology >>> result_fixed = marEx.compute_normalised_anomaly( ... sst, ... method_anomaly="fixed_baseline" ... ) >>> # Climatology computed from all available years
Fixed baseline with a restricted reference period:
>>> # Compute climatology from 1990-2020 only, but output anomalies for all years >>> result_ref = marEx.compute_normalised_anomaly( ... sst, ... method_anomaly="fixed_baseline", ... reference_period=(1990, 2020) ... )
Fixed detrended baseline:
>>> # Remove long-term trends then compute fixed climatology >>> result_fixed_detrended = marEx.compute_normalised_anomaly( ... sst, ... method_anomaly="detrend_fixed_baseline", ... detrend_orders=[1], # Remove linear trend ... force_zero_mean=True ... ) >>> # Combines trend removal with fixed climatology
- marEx.smoothed_rolling_climatology(da, window_year_baseline=15, smooth_days_baseline=21, dimensions=None, coordinates=None, use_temp_checkpoints=False)[source]
Compute a smoothed rolling climatology using the previous window_year_baseline years of data and reassemble it to match the original data structure. Years without enough previous data will be filled with NaN.
- Parameters:
da (xarray.DataArray) – Input data with time coordinate
window_year_baseline (int, default=15) – Number of years to include in each climatology window
smooth_days_baseline (int, default=21) – Number of days for temporal smoothing window
dimensions (dict, optional) – Mapping of dimensions to names in the data
coordinates (dict, optional) – Mapping of coordinates to names in the data
use_temp_checkpoints (bool)
- Returns:
Smoothed rolling climatology with same shape as input data
- Return type:
Examples
Basic smoothed rolling climatology:
>>> import xarray as xr >>> import marEx >>> >>> # Load SST data >>> sst = xr.open_dataset('sst_data.nc', chunks={}).sst.chunk({'time': 30}) >>> >>> # Compute smoothed rolling climatology >>> smooth_clim = marEx.smoothed_rolling_climatology( ... sst, ... window_year_baseline=15, ... smooth_days_baseline=21 ... ) >>> print(smooth_clim.shape) (7305, 180, 360)
Comparing different smoothing windows:
>>> # Short smoothing - more day-to-day variability >>> clim_short = marEx.smoothed_rolling_climatology( ... sst, smooth_days_baseline=7 ... ) >>> >>> # Long smoothing - smoother seasonal cycle >>> clim_long = marEx.smoothed_rolling_climatology( ... sst, smooth_days_baseline=61 ... ) >>> >>> # Compare variability >>> var_short = clim_short.std(dim='time').mean().compute() >>> var_long = clim_long.std(dim='time').mean().compute() >>> print(f"Variability: short={var_short:.3f}, long={var_long:.3f}")
Climatology for anomaly computation:
>>> # Compute smoothed climatology then anomalies >>> climatology = marEx.smoothed_rolling_climatology(sst) >>> anomalies = sst - climatology >>> >>> # Check that anomalies have reasonable properties >>> print(f"Anomaly mean: {anomalies.mean().compute():.6f}") >>> print(f"Anomaly std: {anomalies.std().compute():.3f}")
Unstructured data processing:
>>> # ICON ocean data >>> icon_sst = xr.open_dataset('icon_sst.nc', chunks={}).to.chunk({'time': 25}) >>> icon_smooth_clim = marEx.smoothed_rolling_climatology( ... icon_sst, ... dimensions={"time": "time", "x": "ncells"}, ... coordinates={"time": "time", "x": "lon", "y": "lat"}, ... window_year_baseline=10, ... smooth_days_baseline=31 ... )
Effect of smoothing on seasonal cycle:
>>> # Raw rolling climatology (no temporal smoothing) >>> raw_clim = marEx.rolling_climatology(sst, window_year_baseline=15) >>> >>> # Smoothed rolling climatology >>> smooth_clim = marEx.smoothed_rolling_climatology( ... sst, window_year_baseline=15, smooth_days_baseline=21 ... ) >>> >>> # Compare seasonal cycle smoothness >>> # Extract annual cycle for a point >>> point_raw = raw_clim.isel(lat=90, lon=180).sel(time='2010') >>> point_smooth = smooth_clim.isel(lat=90, lon=180).sel(time='2010') >>> >>> print(f"Raw climatology range: {(point_raw.max() - point_raw.min()).compute():.3f}") >>> print(f"Smooth climatology range: {(point_smooth.max() - point_smooth.min()).compute():.3f}")
Performance considerations:
>>> # Efficient implementation smooths raw data first, then computes climatology >>> # This is more memory-efficient than smoothing the climatology >>> large_sst = sst.chunk({'time': 25, 'lat': 45, 'lon': 90}) >>> efficient_clim = marEx.smoothed_rolling_climatology(large_sst)
- marEx.rolling_climatology(da, window_year_baseline=15, dimensions=None, coordinates=None, use_temp_checkpoints=False)[source]
Compute rolling climatology efficiently using flox cohorts. Uses the previous window_year_baseline years of data and reassemble it to match the original data structure. Years without enough previous data will be filled with NaN.
- Parameters:
da (xarray.DataArray) – Input data with time coordinate
window_year_baseline (int, default=15) – Number of years to include in each climatology window
dimensions (dict, optional) – Mapping of dimensions to names in the data
coordinates (dict, optional) – Mapping of coordinates to names in the data
use_temp_checkpoints (bool)
- Returns:
Rolling climatology with same shape as input data
- Return type:
Examples
Basic rolling climatology computation:
>>> import xarray as xr >>> import marEx >>> >>> # Load 20 years of SST data >>> sst = xr.open_dataset('sst_data.nc', chunks={}).sst.chunk({'time': 30}) >>> >>> # Compute 15-year rolling climatology >>> climatology = marEx.rolling_climatology(sst, window_year_baseline=15) >>> print(climatology.shape) (7305, 180, 360) # Same as input >>> >>> # First 15 years will be NaN (insufficient history) >>> print(f"NaN values in first year: {climatology.isel(time=slice(0, 365)).isnull().all().compute()}") True
Shorter window for datasets with limited time span:
>>> # For datasets with only 10 years, use shorter window >>> short_climatology = marEx.rolling_climatology( ... sst, window_year_baseline=5 ... ) >>> # First 5 years will be NaN instead of 15
Processing unstructured data:
>>> # ICON ocean model data >>> icon_sst = xr.open_dataset('icon_sst.nc', chunks={}).to.chunk({'time': 25}) >>> icon_climatology = marEx.rolling_climatology( ... icon_sst, ... dimensions={"time": "time", "x": "ncells"} ... coordinates={"time": "time", "x": "lon", "y": "lat"} ... ) >>> print(icon_climatology.dims) Frozen({'time': 7305, 'ncells': 83886})
Comparing with fixed climatology:
>>> # Fixed climatology (traditional approach) >>> fixed_clim = sst.groupby(sst.time.dt.dayofyear).mean() >>> >>> # Rolling climatology (adaptive approach) >>> rolling_clim = marEx.rolling_climatology(sst) >>> >>> # Rolling climatology adapts to climate change >>> clim_2000 = rolling_clim.sel(time='2000').mean() >>> clim_2020 = rolling_clim.sel(time='2020').mean() >>> print(f"Climate change signal: {(clim_2020 - clim_2000).compute():.3f} °C")
Memory considerations for large datasets:
>>> # Ensure appropriate chunking for memory efficiency >>> large_sst = sst.chunk({'time': 30, 'lat': 45, 'lon': 90}) >>> large_climatology = marEx.rolling_climatology(large_sst) >>> # Output maintains input chunking structure
- marEx.identify_extremes(da, method_extreme='hobday_extreme', threshold_percentile=95, dimensions=None, coordinates=None, window_days_hobday=11, window_spatial_hobday=None, method_percentile='approximate', precision=0.01, max_anomaly=5.0, use_temp_checkpoints=False, verbose=None, quiet=None)[source]
Identify extreme events exceeding a percentile threshold using specified method.
- Parameters:
da (xarray.DataArray) – DataArray containing anomalies
method_extreme (str, default='hobday_extreme') – Method for threshold calculation (‘global_extreme’ or ‘hobday_extreme’)
threshold_percentile (float, default=95) – Percentile threshold (e.g., 95 for 95th percentile)
dimensions (dict, optional) – Mapping of dimensions to names in the data
coordinates (dict, optional) – Mapping of coordinates to names in the data
window_days_hobday (int, default=11) – Window for day-of-year threshold (hobday_extreme only)
window_spatial_hobday (int, default=None) – Window for day-of-year threshold spatial clustering (hobday_extreme only)
method_percentile (str, default='approximate') – Method for percentile computation (‘exact’ or ‘approximate’)
precision (float, default=0.01) – Precision for histogram bins in approximate method
max_anomaly (float, default=5.0) – Maximum anomaly value for histogram binning
use_temp_checkpoints (bool)
verbose (bool | None)
quiet (bool | None)
- Returns:
Tuple of (extremes, thresholds) where extremes is a boolean array identifying extreme events and thresholds contains the threshold values used
- Return type:
Examples
Basic extreme identification with global thresholds:
>>> import xarray as xr >>> import marEx >>> >>> # Load anomaly data (from compute_normalised_anomaly) >>> anomalies = xr.open_dataset('anomalies.nc', chunks={}).dat_anomaly >>> >>> # Identify extreme events using global-in-time 95th percentile >>> extremes, thresholds = marEx.identify_extremes( ... anomalies, ... method_extreme="global_extreme", ... threshold_percentile=95 ... ) >>> print(f"Extreme events shape: {extremes.shape}") Extreme events shape: (1461, 180, 360) >>> print(f"Thresholds shape: {thresholds.shape}") Thresholds shape: (180, 360)
>>> # Count total extreme events >>> total_extremes = extremes.sum().compute() >>> print(f"Total extreme events: {total_extremes}")
Using day-of-year specific thresholds (cf. Hobday et al. 2016 method):
>>> # More sophisticated threshold calculation >>> extremes_hobday, thresholds_hobday = marEx.identify_extremes( ... anomalies, ... method_extreme="hobday_extreme", ... threshold_percentile=95, ... window_days_hobday=11 # 11-day window around each day-of-year ... window_spatial_hobday=3 # 3x3 spatial window for clustering percentile calcuation ... ) >>> print(f"Hobday thresholds shape: {thresholds_hobday.shape}") Hobday thresholds shape: (366, 180, 360)
>>> # Compare seasonal variation in thresholds >>> summer_threshold = thresholds_hobday.sel(dayofyear=200).mean() >>> winter_threshold = thresholds_hobday.sel(dayofyear=50).mean() >>> print(f"Summer vs Winter thresholds: {summer_threshold:.3f} vs {winter_threshold:.3f}")
Comparison of exact vs approximate percentile methods:
>>> # Approximate method (faster, default) >>> extremes_approx, thresh_approx = marEx.identify_extremes( ... anomalies, method_percentile="approximate" ... ) >>> >>> # Exact method (slower & memory intensive) >>> extremes_exact, thresh_exact = marEx.identify_extremes( ... anomalies, method_percentile="exact" ... ) >>> >>> # Compare threshold precision — ~0.005C >>> threshold_diff = (thresh_exact - thresh_approx).std().compute() >>> print(f"Threshold difference (exact vs approx): {threshold_diff:.6f}")
Different percentile thresholds for varying event rarity:
>>> # Conservative threshold - very extreme events only >>> extremes_98, _ = marEx.identify_extremes( ... anomalies, threshold_percentile=98 ... ) >>> >>> # Moderate threshold - more frequent events >>> extremes_90, _ = marEx.identify_extremes( ... anomalies, threshold_percentile=90 ... ) >>> >>> # Compare event frequency >>> print(f"99th percentile events: {extremes_99.sum().compute()}") >>> print(f"90th percentile events: {extremes_90.sum().compute()}")
Processing unstructured data:
>>> # ICON ocean model data >>> icon_anomalies = xr.open_dataset('icon_anomalies.nc', chunks={}).dat_anomaly >>> extremes_unstructured, thresholds_unstructured = marEx.identify_extremes( ... icon_anomalies, ... dimensions={"time": "time", "x": "ncells"}, ... coordinates={"time": "time", "x": "lon", "y": "lat"}, ... threshold_percentile=95 ... ) >>> print(f"Unstructured extremes shape: {extremes_unstructured.shape}")
Advanced Hobday method with custom temporal window:
>>> # Longer temporal window for smoother thresholds >>> extremes_smooth, thresholds_smooth = marEx.identify_extremes( ... anomalies, ... method_extreme="hobday_extreme", ... window_days_hobday=31, # Longer smoothing window ... threshold_percentile=95 ... ) >>> >>> # Compare threshold smoothness >>> std_11day = thresholds_hobday.std(dim='dayofyear').mean().compute() >>> std_31day = thresholds_smooth.std(dim='dayofyear').mean().compute() >>> print(f"Threshold variability: 11-day={std_11day:.3f}, 31-day={std_31day:.3f}")
Event Tracking
- class marEx.tracker(data_bin, mask, R_fill, area_filter_quartile=None, area_filter_absolute=None, temp_dir=None, T_fill=2, allow_merging=True, nn_partitioning=False, overlap_threshold=0.5, unstructured_grid=False, dimensions=None, coordinates=None, neighbours=None, cell_areas=None, grid_resolution=None, max_iteration=40, checkpoint=None, debug=0, verbose=None, quiet=None, regional_mode=False, coordinate_units=None)[source]
Bases:
objectTracker identifies and tracks arbitrary binary objects in spatial data through time.
The tracker supports both structured (regular grid) and unstructured data, and seamlessly handles splitting & merging of objects. It identifies connected regions in binary data at each time step, and tracks these as evolving events through time.
Main workflow:
Preprocessing: Fill spatiotemporal holes, filter small objects
Object identification: Label connected components at each time
Tracking: Determine object correspondences across time
Optional splitting & merging: Handle complex event evolution
- Parameters:
data_bin (xarray.DataArray) – Binary field of extreme points to group, label, and track (True = object, False = background) Must represent and underlying dask array.
mask (xarray.DataArray) – Binary mask indicating valid regions (True = valid, False = invalid)
R_fill (int) – The radius of the kernel used in morphological opening & closing, relating to the largest hole/gap that can be filled. In units of grid cells.
area_filter_quartile (float, optional) – The fraction of the smallest objects to discard, i.e. the quantile defining the smallest area object retained. Quantile must be in (0-1) (e.g., 0.25 removes smallest 25%). Mutually exclusive with area_filter_absolute. Default is 0.5 if neither parameter is provided.
area_filter_absolute (int, optional) – The minimum area (in grid cells) for an object to be retained. Mutually exclusive with area_filter_quartile. Use this for fixed minimum area thresholds (e.g., 10 cells minimum).
temp_dir (str, optional) – Path to temporary directory for storing intermediate results
T_fill (int, default=2) – The permissible temporal gap (in days) between objects for tracking continuity to be maintained (must be even)
allow_merging (bool, default=True) – Allow objects to split and merge across time. Apply splitting & merging criteria, track merge events, and maintain original identities of merged objects across time. N.B.: False reverts to classical ndmeasure.label with simplar time connectivity, i.e. Scannell et al.
nn_partitioning (bool, default=False) – Implement a better partitioning of merged child objects based on closest parent cell. False reverts to using parent centroids to determine partitioning between new child objects, i.e. Di Sun & Bohai Zhang 2023. N.B.: Centroid-based partitioning has major problems with small merging objects suddenly obtaining unrealistically-large (and often disjoint) fractions of the larger object.
overlap_threshold (float, default=0.5) – The fraction of the smaller object’s area that must overlap with the larger object’s area to be considered the same event and continue tracking with the same ID.
unstructured_grid (bool, default=False) – Whether data is on an unstructured grid
dimensions (dict, default={"time": "time", "x": "lon", "y": "lat"}) – Mapping of dimensions to names in the data
coordinates (dict, optional) – Coordinate names for unstructured grids. Should contain ‘x’ and ‘y’ keys for x and y coordinates. May also contain ‘time’ if the time coordinate name is different from the dimension name.
neighbours (xarray.DataArray, optional) – For unstructured grid, indicates connectivity between cells
cell_areas (xarray.DataArray, optional) – For unstructured grid, area of each cell (required). For structured grid, area of each cell (optional). If not provided, defaults to 1.0 for each cell (resulting in cell counts as areas). Note: Overridden by grid_resolution if provided for structured grids.
grid_resolution (float, optional) – Grid resolution in degrees for structured grids only (ignored for unstructured grids). When provided, automatically calculates cell areas using spherical geometry. Overrides any provided cell_areas parameter.
max_iteration (int, default=40) – Maximum number of iterations for merging/splitting algorithm
checkpoint (str, default='None') – Checkpoint strategy (‘save’, ‘load’, or None)
debug (int, default=0) – Debug level (0-2)
verbose (bool, optional) – Enable verbose logging with detailed progress information. If None, uses current global logging configuration.
quiet (bool, optional) – Enable quiet logging with minimal output (warnings and errors only). If None, uses current global logging configuration. Note: quiet takes precedence over verbose if both are True.
regional_mode (bool, default=False) – Enable regional mode for non-global coordinate ranges. When True, coordinate_units must be specified.
coordinate_units (str, optional) – Coordinate units when regional_mode=True. Must be either ‘degrees’ or ‘radians’.
Examples
Basic tracking of marine heatwave events from preprocessed data:
>>> import xarray as xr >>> import marEx >>> >>> # Load preprocessed extreme events data >>> processed = xr.open_dataset('extreme_events.nc', chunks={}) >>> extreme_events = processed.extreme_events # Boolean array >>> mask = processed.mask # Ocean/land mask >>> >>> # Initialise tracker with basic parameters >>> tracker = marEx.tracker( ... extreme_events, ... mask, ... R_fill=8, # Fill holes up to 8 grid cells ... area_filter_quartile=0.5 # Remove smallest 50% of objects ... allow_merging=False # Basic tracking without splitting/merging ... ) >>> >>> # Run tracking algorithm >>> events = tracker.run() >>> print(f"Identified {events.ID.max().compute()} distinct events") Identified 1247 distinct events
Using automatic grid area calculation from resolution:
>>> # For regular lat/lon grids, automatically calculate physical areas >>> grid_tracker = marEx.tracker( ... extreme_events, ... mask, ... R_fill=8, ... area_filter_quartile=0.5, ... grid_resolution=0.25 # Grid resolution in degrees ... ) >>> # Cell areas are calculated automatically using spherical geometry >>> grid_events = grid_tracker.run()
Advanced tracking with merging and splitting enabled:
>>> # More sophisticated tracking with temporal gap filling >>> advanced_tracker = marEx.tracker( ... extreme_events, ... mask, ... R_fill=12, # Larger spatial gap filling ... T_fill=4, # Fill up to 4-day temporal gaps ... area_filter_quartile=0.25, # More aggressive size filtering ... allow_merging=True, # Enable split/merge detection ... overlap_threshold=0.3 # Lower threshold for object linking ... ) >>> >>> events_advanced, merges_log = advanced_tracker.run(return_merges=True) >>> print(events_advanced.data_vars) Data variables: event (time, lat, lon) int32 dask.array<chunksize=(25, 180, 360)> event_centroid (time, lat, lon) int32 dask.array<chunksize=(25, 180, 360)> ID_field (time, lat, lon) int32 dask.array<chunksize=(25, 180, 360)> global_ID (time, ID) int32 dask.array<chunksize=(25, 1247)> area (time, ID) float32 dask.array<chunksize=(25, 1247)> centroid (component, time, ID) float64 dask.array<chunksize=(2, 25, 1247)> presence (time, ID) bool dask.array<chunksize=(25, 1247)> time_start (ID) datetime64[ns] dask.array<chunksize=(1247,)> time_end (ID) datetime64[ns] dask.array<chunksize=(1247,)> merge_ledger (time, ID, sibling_ID) int32 dask.array<chunksize=(25, 1247, 10)>
Processing unstructured ocean model data (ICON):
>>> # Load ICON ocean model data with connectivity >>> icon_data = xr.open_dataset('icon_extremes.nc', chunks={}) >>> icon_extremes = icon_data.extreme_events # (time, ncells) >>> icon_mask = icon_data.mask >>> neighbours = icon_data.neighbours # Cell connectivity >>> cell_areas = icon_data.cell_areas # Physical areas >>> >>> # Track events on unstructured grid >>> unstructured_tracker = marEx.tracker( ... icon_extremes, ... icon_mask, ... R_fill=5, # 5-neighbor radius for gap filling ... area_filter_quartile=0.6, # Remove 60% of smallest events ... unstructured_grid=True, # Enable unstructured mode ... dimensions={"x": "ncells"}, # Must specify the name of the spatial dimension ... coordinates={"x": "lon", "y": "lat"}, # Spatial coordinate names ... neighbours=neighbours, # Required for unstructured ... cell_areas=cell_areas # Required for area calculations ... ) >>> unstructured_events = unstructured_tracker.run()
Memory management and checkpointing for large datasets:
>>> # Use checkpointing for very large datasets >>> large_tracker = marEx.tracker( ... extreme_events, ... mask, ... R_fill=8, ... area_filter_quartile=0.5, ... temp_dir='/scratch/user/tracking_temp', # Temporary storage ... checkpoint='save' # Save intermediate results ... ) >>> # Processing can be resumed if interrupted >>> large_events = large_tracker.run()
Comparing different filtering strategies:
>>> # Conservative filtering - keep more events >>> conservative = marEx.tracker( ... extreme_events, mask, R_fill=5, area_filter_quartile=0.1 ... ) >>> conservative_events = conservative.run() >>> >>> # Aggressive filtering - focus on largest events >>> aggressive = marEx.tracker( ... extreme_events, mask, R_fill=15, area_filter_quartile=0.8 ... ) >>> aggressive_events = aggressive.run() >>> >>> print(f"Conservative: {conservative_events.ID.max().compute()} events") >>> print(f"Aggressive: {aggressive_events.ID.max().compute()} events")
Using absolute area filtering instead of percentile-based:
>>> # Filter objects smaller than 25 grid cells >>> absolute_tracker = marEx.tracker( ... extreme_events, mask, R_fill=8, area_filter_absolute=25 ... ) >>> absolute_events = absolute_tracker.run() >>> >>> # Default behavior (area_filter_quartile=0.5) when no parameters provided >>> default_tracker = marEx.tracker(extreme_events, mask, R_fill=8) >>> default_events = default_tracker.run() # Uses quartile=0.5 filtering
Using physical cell areas for structured grids:
>>> # Load data with irregular grid cell areas >>> grid_areas = xr.open_dataset('grid_areas.nc').cell_area # (lat, lon) in m² >>> >>> # Track events using physical areas instead of cell counts >>> physical_tracker = marEx.tracker( ... extreme_events, ... mask, ... R_fill=8, ... area_filter_quartile=0.5, ... cell_areas=grid_areas # Physical areas in m² ... ) >>> events = physical_tracker.run() >>> # Now events.area contains physical areas in m² instead of cell counts
Integration with full marEx workflow:
>>> # Complete workflow from raw data to tracked events >>> raw_sst = xr.open_dataset('sst_data.nc', chunks={}).sst.chunk({'time': 30}) >>> >>> # Step 1: Preprocess to identify extremes >>> processed = marEx.preprocess_data(raw_sst, threshold_percentile=95) >>> >>> # Step 2: Track extreme events >>> tracker = marEx.tracker( ... processed.extreme_events, ... processed.mask, ... R_fill=8, ... area_filter_quartile=0.5 ... ) >>> tracked_events = tracker.run()
Initialise the tracker with parameters and data.
- __init__(data_bin, mask, R_fill, area_filter_quartile=None, area_filter_absolute=None, temp_dir=None, T_fill=2, allow_merging=True, nn_partitioning=False, overlap_threshold=0.5, unstructured_grid=False, dimensions=None, coordinates=None, neighbours=None, cell_areas=None, grid_resolution=None, max_iteration=40, checkpoint=None, debug=0, verbose=None, quiet=None, regional_mode=False, coordinate_units=None)[source]
Initialise the tracker with parameters and data.
- Parameters:
data_bin (DataArray)
mask (DataArray)
area_filter_quartile (float | None)
area_filter_absolute (int | None)
temp_dir (str | None)
T_fill (int)
allow_merging (bool)
nn_partitioning (bool)
overlap_threshold (float)
unstructured_grid (bool)
neighbours (DataArray | None)
cell_areas (DataArray | None)
grid_resolution (float | None)
max_iteration (int)
checkpoint (Literal['save', 'load', 'None'] | None)
debug (int)
verbose (bool | None)
quiet (bool | None)
regional_mode (bool)
coordinate_units (Literal['degrees', 'radians'] | None)
- Return type:
None
- run(return_merges=False, checkpoint=None)[source]
Run the complete object identification and tracking pipeline.
This method executes the full workflow:
Preprocessing: morphological operations and size filtering
Identification and tracking of objects through time
Computing and attaching statistics to the results
- Parameters:
- Returns:
events_ds (xarray.Dataset) – Dataset containing tracked events and their properties
merges_ds (xarray.Dataset, optional) – Dataset with merge event information (only if return_merges=True)
- Return type:
- run_preprocess(checkpoint=None)[source]
Preprocess binary data to prepare for tracking.
This performs morphological operations to fill holes/gaps in both space and time, then filters small objects according to the area_filter_quartile or area_filter_absolute.
- run_tracking(data_bin_preprocessed)[source]
Track objects through time to identify events.
- Parameters:
data_bin_preprocessed (xarray.DataArray) – Preprocessed binary data
- Returns:
events_ds (xarray.Dataset) – Dataset containing tracked events
merges_ds (xarray.Dataset) – Dataset with merge information
N_events_final (int) – Final number of unique events
- Return type:
- run_stats_attributes(events_ds, merges_ds, object_stats, N_events_final)[source]
Add statistics and attributes to the events dataset.
- Parameters:
events_ds (xarray.Dataset) – Dataset containing tracked events
merges_ds (xarray.Dataset) – Dataset with merge information
object_stats (tuple) – Preprocessed object statistics
N_events_final (int) – Final number of events
- Returns:
events_ds – Dataset with added statistics and attributes
- Return type:
- compute_area(data_bin)[source]
Compute the total area of binary data at each time.
- Parameters:
data_bin (xarray.DataArray) – Binary data
- Returns:
area – Total area at each time (units: pixels for structured grid, matching cell_area for unstructured)
- Return type:
- fill_holes(data_bin, R_fill=None)[source]
Fill holes and gaps using morphological operations.
This performs closing (dilation followed by erosion) to fill small gaps, then opening (erosion followed by dilation) to remove small isolated objects.
- Parameters:
data_bin (xarray.DataArray) – Binary data to process
R_fill (int, optional) – Fill radius override
- Returns:
data_bin_filled – Binary data with holes/gaps filled
- Return type:
- fill_time_gaps(data_bin)[source]
Fill temporal gaps between objects.
Performs binary closing (dilation then erosion) along the time dimension to fill small time gaps between objects.
- Parameters:
data_bin (xarray.DataArray) – Binary data to process
- Returns:
data_bin_filled – Binary data with temporal gaps filled
- Return type:
- refresh_dask_graph(data_bin)[source]
Clear and reset the Dask graph via save/load cycle.
This is needed to work around a memory leak bug in Dask where “Unmanaged Memory” builds up within loops.
- Parameters:
data_bin (xarray.DataArray) – Data to refresh
- Returns:
data_new – Data with fresh Dask graph
- Return type:
- filter_small_objects(data_bin)[source]
Remove objects smaller than a threshold area.
- Parameters:
data_bin (xarray.DataArray) – Binary data to filter
- Returns:
data_bin_filtered (xarray.DataArray) – Binary data with small objects removed
area_threshold (float) – Area threshold used for filtering
object_areas (xarray.DataArray) – Areas of all objects pre-filtering
N_objects_prefiltered (int) – Number of objects before filtering
N_objects_filtered (int) – Number of objects after filtering
- Return type:
- identify_objects(data_bin, time_connectivity)[source]
Identify connected regions in binary data.
- Parameters:
data_bin (xarray.DataArray) – Binary data to identify objects in
time_connectivity (bool) – Whether to connect objects across time
- Returns:
object_id_field (xarray.DataArray) – Field of integer IDs for each object
None (NoneType) – Placeholder for compatibility with track_objects
N_objects (int) – Number of objects identified
- Return type:
- calculate_centroid(binary_mask, original_centroid=None)[source]
Calculate object centroid, handling edge cases for periodic boundaries.
- Parameters:
binary_mask (numpy.ndarray) – 2D binary array where True indicates the object (dimensions are (y,x))
original_centroid (tuple, optional) – (y_centroid, x_centroid) from regionprops_table
- Returns:
(y_centroid, x_centroid)
- Return type:
- calculate_object_properties(object_id_field, properties=None)[source]
Calculate properties of objects from ID field.
- Parameters:
object_id_field (xarray.DataArray) – Field containing object IDs
properties (list, optional) – List of properties to calculate (defaults to [‘label’, ‘area’])
- Returns:
object_props – Dataset containing calculated properties with ‘ID’ dimension
- Return type:
- check_overlap_slice(ids_t0, ids_next)[source]
Find overlapping objects between two consecutive time slices.
- Parameters:
ids_t0 (numpy.ndarray) – Object IDs at current time
ids_next (numpy.ndarray) – Object IDs at next time
- Returns:
Array of shape (n_overlaps, 3) with [id_t0, id_next, overlap_area]
- Return type:
- find_overlapping_objects(object_id_field)[source]
Find all overlapping objects across time.
- Parameters:
object_id_field (xarray.DataArray) – Field containing object IDs
- Returns:
overlap_objects_list_unique_filtered – Array of object ID pairs that overlap across time, with overlap area The object in the first column precedes the second column in time. The third column contains:
For structured grid: number of overlapping pixels (int32)
For unstructured grid: total overlapping area in m^2 (float32)
- Return type:
(N x 3) numpy.ndarray
- enforce_overlap_threshold(overlap_objects_list, object_props)[source]
Filter object pairs based on overlap threshold.
- Parameters:
overlap_objects_list ((N x 3) numpy.ndarray) – Array of object ID pairs with overlap area
object_props (xarray.Dataset) – Object properties including area
- Returns:
overlap_objects_list_filtered – Filtered array of object ID pairs that meet the overlap threshold
- Return type:
(M x 3) numpy.ndarray
- consolidate_object_ids(data_t_minus_2, data_t_minus_1, object_props, timestep)[source]
Consolidate object IDs between t-2 and t-1 to ensure consistent tracking.
This identifies objects at t-1 that are actually continuations of objects from t-2 (but got different IDs due to partitioning) and renames them to maintain consistent IDs across timesteps.
- Parameters:
data_t_minus_2 (xr.DataArray) – Object field at timestep t-2
data_t_minus_1 (xr.DataArray) – Object field at timestep t-1 (will be modified)
object_props (xr.Dataset) – Object properties dataset (will be modified)
timestep (int) – Current timestep number for logging purposes
- Returns:
data_t_minus_1_consolidated (xr.DataArray) – Updated t-1 field with consolidated IDs
object_props_updated (xr.Dataset) – Updated object properties with merged/deleted objects
- Return type:
Notes
Uses self.overlap_threshold for determining consolidation eligibility
Updates object properties by recalculating for consolidated objects
Removes redundant child objects from object_props
- compute_id_time_dict(da, child_objects, max_objects, all_objects=True)[source]
Generate lookup table mapping object IDs to their time index.
- Parameters:
da (xarray.DataArray) – Field of object IDs
child_objects (list or array) – Object IDs to include in the dictionary
max_objects (int) – Maximum number of objects
all_objects (bool, default=True) – Whether to process all objects or just child_objects
- Returns:
time_index_map – Dictionary mapping object IDs to time indices
- Return type:
- track_objects(data_bin)[source]
Track objects through time to form events.
This is the main tracking method that handles splitting and merging of objects.
- Parameters:
data_bin (xarray.DataArray) – Preprocessed binary data: Field of globally unique integer IDs of each element in connected regions. ID = 0 indicates no object.
- Returns:
split_merged_events_ds (xarray.Dataset) – Dataset containing tracked events
merge_events (xarray.Dataset) – Dataset with merge information
N_events (int) – Final number of events
- Return type:
- cluster_rename_objects_and_props(object_id_field_unique, object_props, overlap_objects_list, merge_events)[source]
Cluster the object pairs and relabel to determine final event IDs.
- Parameters:
object_id_field_unique (xarray.DataArray) – Field of unique object IDs. IDs must not be repeated across time.
object_props (xarray.Dataset) – Properties of each object that also need to be relabeled.
overlap_objects_list ((N x 2) numpy.ndarray) – Array of object ID pairs that indicate which objects are in the same event. The object in the first column precedes the second column in time.
merge_events (xarray.Dataset) – Information about merge events
- Returns:
split_merged_events_ds – Dataset with relabeled events and their properties. ID = 0 indicates no object.
- Return type:
- split_and_merge_objects(object_id_field_unique, object_props)[source]
Implement object splitting and merging logic.
This identifies and processes cases where objects split or merge over time, creating new object IDs as needed.
- Parameters:
object_id_field_unique (xarray.DataArray) – Field of unique object IDs. IDs are required to be monotonically increasing with time.
object_props (xarray.Dataset) – Properties of each object
- Returns:
(object_id_field, object_props, overlap_objects_list, merge_events)
- Return type:
- split_and_merge_objects_parallel(object_id_field_unique, object_props)[source]
Optimised parallel implementation of object splitting and merging.
This version is specifically designed for unstructured grids with more efficient memory handling and better parallelism than the standard split_and_merge_objects method. It processes data in chunks, handles merging events, and efficiently updates object IDs.
- Parameters:
object_id_field_unique (xarray.DataArray) – Field of unique object IDs
object_props (xarray.Dataset) – Properties of each object
- Returns:
(object_id_field, object_props, overlap_objects_list, merge_events)
- Return type:
- marEx.regional_tracker(data_bin, mask, coordinate_units, R_fill, area_filter_quartile=None, area_filter_absolute=None, **kwargs)[source]
Create a tracker instance configured for regional (non-global) data.
This is a convenience function that automatically sets regional_mode=True and requires explicit specification of coordinate units, since auto-detection may fail for regional coordinate ranges.
- Parameters:
data_bin (xr.DataArray) – Binary data to identify and track objects in (True = object, False = background)
mask (xr.DataArray) – Binary mask indicating valid regions (True = valid, False = invalid)
coordinate_units ({'degrees', 'radians'}) – Units of the coordinate system. Must be specified for regional data.
R_fill (int or float) – Radius for filling holes/gaps in spatial domain (in grid cells)
area_filter_quartile (float, optional) – Quantile (0-1) for filtering smallest objects (e.g., 0.25 removes smallest 25%). Mutually exclusive with area_filter_absolute. Default is 0.5 if neither parameter is provided.
area_filter_absolute (int, optional) – The minimum area (in grid cells) for an object to be retained. Mutually exclusive with area_filter_quartile.
**kwargs – Additional parameters passed to the tracker class
- Returns:
Configured tracker instance with regional_mode=True
- Return type:
Examples
Track events in regional Mediterranean Sea data:
>>> import marEx >>> # For regional data with degree coordinates >>> regional_tracker = marEx.regional_tracker( ... extreme_events, ... mask, ... coordinate_units='degrees', ... R_fill=5, ... area_filter_quartile=0.3 ... ) >>> events = regional_tracker.run()
Track events in regional data with radian coordinates:
>>> # For model output with radian coordinates >>> regional_tracker = marEx.regional_tracker( ... extreme_events, ... mask, ... coordinate_units='radians', ... R_fill=8, ... area_filter_quartile=0.5 ... ) >>> events = regional_tracker.run()
Using absolute area filtering in regional mode:
>>> # Keep only features larger than 15 grid cells >>> absolute_regional = marEx.regional_tracker( ... extreme_events, ... mask, ... coordinate_units='degrees', ... R_fill=5, ... area_filter_absolute=15 ... ) >>> events = absolute_regional.run()
Visualisation
- class marEx.PlotConfig(title=None, var_units='', issym=False, cmap=None, cperc=None, clim=None, show_colorbar=True, grid_lines=True, grid_labels=False, dimensions=None, coordinates=None, norm=None, plot_IDs=False, extend='both', verbose=None, quiet=None, projection=None, framerate=10)[source]
Bases:
objectConfiguration class for plot parameters
- Parameters:
- cmap
Colormap name or ListedColormap object
- Type:
str | matplotlib.colors.ListedColormap | None
- norm
Custom normalization (BoundaryNorm or Normalize)
- Type:
matplotlib.colors.BoundaryNorm | matplotlib.colors.Normalize | None
- projection
Cartopy projection for map plots
- Type:
Any | None
- cmap: str | ListedColormap | None = None
- norm: BoundaryNorm | Normalize | None = None
- __init__(title=None, var_units='', issym=False, cmap=None, cperc=None, clim=None, show_colorbar=True, grid_lines=True, grid_labels=False, dimensions=None, coordinates=None, norm=None, plot_IDs=False, extend='both', verbose=None, quiet=None, projection=None, framerate=10)
- Parameters:
- Return type:
None
Exception Hierarchy
- exception marEx.MarExError(message, details=None, suggestions=None, error_code=None, context=None)[source]
Bases:
ExceptionBase exception class for all MarEx-specific errors.
This is the root of the MarEx exception hierarchy and provides common functionality for all marEx exceptions including:
Structured error context
Exception chaining support
Consistent error formatting
- Parameters:
message (str) – Primary error message describing what went wrong
details (str, optional) – Additional technical details about the error
suggestions (list of str, optional) – Actionable suggestions for resolving the error
error_code (str, optional) – Structured error code for programmatic handling
context (dict, optional) – Additional context information (e.g., parameter values, data shapes)
Initialise the Error.
- __init__(message, details=None, suggestions=None, error_code=None, context=None)[source]
Initialise the Error.
- exception marEx.DataValidationError(message, details=None, suggestions=None, error_code='DATA_VALIDATION', context=None)[source]
Bases:
MarExErrorRaise exception for input data validation issues.
This exception covers problems with input data structure, format, content, or compatibility with marEx processing requirements.
Common scenarios:
Non-Dask arrays when Dask is required
Missing required coordinates or dimensions
Invalid data types or ranges
Incompatible chunking strategies
Malformed input datasets
Examples
>>> raise DataValidationError( ... "Input DataArray must be Dask-backed", ... details="Found numpy array, but marEx requires chunked Dask arrays", ... suggestions=["Use da.chunk() to convert to Dask array", "Load data with dask chunking: xr.open_dataset(...).chunk()"], ... context={"data_type": type(data), "shape": data.shape} ... )
Initialise the Error.
- Parameters:
- exception marEx.CoordinateError(message, details=None, suggestions=None, error_code='COORDINATE_ERROR', context=None)[source]
Bases:
MarExErrorRaise exception for coordinate system problems.
This exception handles issues with geographic coordinates including unit mismatches, invalid ranges, missing coordinate information, and coordinate system inconsistencies.
Common scenarios:
Latitude/longitude values outside valid ranges
Unit mismatches (degrees vs radians)
Missing coordinate dimensions
Inconsistent coordinate systems between datasets
Auto-detection failures
Initialise the Error.
- Parameters:
- exception marEx.ProcessingError(message, details=None, suggestions=None, error_code='PROCESSING_ERROR', context=None)[source]
Bases:
MarExErrorRaise exception for computational and algorithmic issues.
This exception covers problems that occur during data processing, including numerical computation errors, algorithm convergence issues, and memory/performance problems.
Common scenarios:
Insufficient memory for computation
Numerical instability or overflow
Algorithm convergence failures
Chunking strategy problems
Dask computation errors
Initialise the Error.
- Parameters:
- exception marEx.ConfigurationError(message, details=None, suggestions=None, error_code='CONFIGURATION_ERROR', context=None)[source]
Bases:
MarExErrorRaise exception for parameter and setup issues.
This exception handles problems with function parameters, configuration settings, and setup requirements that prevent proper operation.
Common scenarios:
Invalid parameter values or combinations
Missing required configuration
Incompatible parameter settings
Environment setup issues
Examples
>>> raise ConfigurationError( ... "Invalid threshold percentile value", ... details="threshold_percentile must be between 0 and 100", ... suggestions=["Use percentile value between 50-99 for extreme events", "Common values: 90 (moderate), 95 (strong), 99 (severe)"], ... context={"provided_value": 150, "valid_range": [0, 100]} ... )
Initialise the Error.
- Parameters:
- exception marEx.DependencyError(message, details=None, suggestions=None, error_code='DEPENDENCY_ERROR', context=None)[source]
Bases:
MarExErrorRaise exception for missing or incompatible dependencies.
This exception handles issues with optional or required dependencies that are missing, incompatible, or incorrectly configured.
Common scenarios:
Missing optional dependencies (JAX, ffmpeg)
Version incompatibilities
Import failures
System dependency issues
Examples
>>> raise DependencyError( ... "JAX acceleration not available", ... details="JAX package not found or incompatible version", ... suggestions=["Install JAX: pip install marEx[full]", "Check CUDA compatibility for GPU acceleration", "Processing will continue with NumPy backend"], ... context={"requested_feature": "GPU acceleration", "available": False} ... )
Initialise the Error.
- Parameters:
- exception marEx.TrackingError(message, details=None, suggestions=None, error_code='TRACKING_ERROR', context=None)[source]
Bases:
MarExErrorRaise exception for object tracking and identification issues.
This exception covers problems specific to the tracking module including binary object identification, temporal linking, and merge/split handling.
Common scenarios:
Invalid binary input data
Tracking parameter conflicts
Temporal continuity issues
Memory overflow during tracking
Checkpoint/resume failures
Examples
>>> raise TrackingError( ... "Tracking failed due to excessive memory usage", ... details="Event fragmentation created >100,000 objects per timestep", ... suggestions=["Increase area_filter_quartile to remove small events", "Apply stronger spatial smoothing before tracking", "Consider processing shorter time periods"], ... context={"objects_per_timestep": 150000, "memory_limit_gb": 32} ... )
Initialise the Error.
- Parameters:
- exception marEx.VisualisationError(message, details=None, suggestions=None, error_code='VISUALISATION_ERROR', context=None)[source]
Bases:
MarExErrorRaise exception for plotting and visualisation problems.
This exception handles issues with the plotX visualisation system including matplotlib configuration, cartopy projections, and animation generation.
Common scenarios:
Missing plotting dependencies
Cartopy projection issues
Invalid plot configuration
Animation encoding failures
Grid type detection problems
Examples
>>> raise VisualisationError( ... "Animation creation failed", ... details="ffmpeg encoder not found for MP4 generation", ... suggestions=["Install ffmpeg system package", "Use alternative format: save_format='gif'", "Install ffmpeg via conda: conda install ffmpeg"], ... context={"requested_format": "mp4", "available_encoders": ["png", "gif"]} ... )
Initialise the Error.
- Parameters:
- marEx.create_data_validation_error(message, data_info=None, **kwargs)[source]
Create DataValidationError with common data context.
- Parameters:
- Returns:
Configured exception with data context
- Return type:
- marEx.create_coordinate_error(message, coordinate_ranges=None, detected_system=None, **kwargs)[source]
Create CoordinateError with coordinate context.
- Parameters:
- Returns:
Configured exception with coordinate context
- Return type:
- marEx.create_processing_error(message, computation_info=None, **kwargs)[source]
Create ProcessingError with computation context.
- Parameters:
- Returns:
Configured exception with computation context
- Return type:
- marEx.wrap_exception(original_exception, message=None, marex_exception_type=None)[source]
Wrap a generic exception in an appropriate MarEx exception.
This function helps maintain backward compatibility while migrating to the new exception hierarchy by wrapping generic exceptions in appropriate MarEx-specific exceptions.
- Parameters:
- Returns:
Wrapped exception with original as cause
- Return type:
Logging System
- marEx.configure_logging(level=None, format_str=None, date_format=None, log_file=None, console_output=True, disable_external_loggers=True, verbose=None, quiet=None)[source]
Configure logging for the MarEx package.
- Parameters:
level (int | str | None) – Logging level (DEBUG, INFO, WARNING, ERROR, CRITICAL)
format_str (str | None) – Custom log format string
date_format (str | None) – Date format for timestamps
log_file (str | Path | None) – Optional file path for logging output
console_output (bool) – Whether to log to console
disable_external_loggers (bool) – Whether to suppress noisy external library logs
verbose (bool | None) – Enable verbose mode (detailed logging with DEBUG level)
quiet (bool | None) – Enable quiet mode (minimal logging with WARNING level)
- Return type:
None
- marEx.set_verbose_mode(verbose=True)[source]
Enable or disable verbose logging mode.
- Parameters:
verbose (bool) – If True, enables verbose mode with DEBUG level logging
- Return type:
None
- marEx.set_quiet_mode(quiet=True)[source]
Enable or disable quiet logging mode.
- Parameters:
quiet (bool) – If True, enables quiet mode with WARNING level logging
- Return type:
None