Tracking Module (marEx.track)

The marEx.track module provides sophisticated algorithms for identifying and tracking extreme events through time. It handles complex scenarios including event merging, splitting, and temporal evolution using advanced image processing and graph theory techniques.

Overview

The tracking module implements a comprehensive workflow for converting binary extreme event masks into tracked event objects with unique identifiers, statistical properties, and temporal evolution information. It includes advanced algorithms for handling event lifecycles, including merging and splitting events whilst maintaining unique identities.

Key Features:

  • Binary Object Tracking: Advanced connected component analysis with dask parallelisation

  • Merge/Split Handling: Algorithms for event lifecycle management

  • Morphological Operations: Image processing for event preprocessing and hole filling

  • Statistical Analysis: Comprehensive event property calculation

  • Memory Efficient: Optimised for large datasets with intelligent chunking

  • Grid Agnostic: Works with both structured and unstructured grids

Main Classes

tracker(data_bin, mask, R_fill[, ...])

Tracker identifies and tracks arbitrary binary objects in spatial data through time.

Tracking Algorithm Comparison

The video below demonstrates the critical difference between basic connectivity-based tracking and marEx’s advanced merge/split algorithm:

Left (Basic Method — 3D Connected Components): Uses simple 3D connected component labeling where ANY objects that touch at any point in space-time are permanently merged into the same event. This creates a chain reaction: Event A touches Event B → merged. The merged AB touches Event C → all become one event. Over time, this produces unrealistic basin-spanning “mega-events” that spuriously link dozens of independent physical phenomena. These mega-events are algorithmically-induced artifacts with no coherent physical origin, making their statistics (size, duration, intensity) meaningless for mechanistic analysis.

Right (marEx Advanced Method — Genealogy Tracking): Prevents mega-events using overlap_threshold (requires significant overlap, not just touching) and maintains individual event identities through merge/split events using nearest-neighbor partitioning. Records complete parent/child genealogy in merge_ledger, enabling reconstruction of physically realistic event evolution. Tracked events correspond to actual coherent phenomena, providing mechanistically relevant statistics for scientific analysis.

Detailed Documentation

Tracker Class

class marEx.track.tracker(data_bin, mask, R_fill, area_filter_quartile=None, area_filter_absolute=None, temp_dir=None, T_fill=2, allow_merging=True, nn_partitioning=False, overlap_threshold=0.5, unstructured_grid=False, dimensions=None, coordinates=None, neighbours=None, cell_areas=None, grid_resolution=None, max_iteration=40, checkpoint=None, debug=0, verbose=None, quiet=None, regional_mode=False, coordinate_units=None)[source]

Bases: object

Tracker identifies and tracks arbitrary binary objects in spatial data through time.

The tracker supports both structured (regular grid) and unstructured data, and seamlessly handles splitting & merging of objects. It identifies connected regions in binary data at each time step, and tracks these as evolving events through time.

Main workflow:

  1. Preprocessing: Fill spatiotemporal holes, filter small objects

  2. Object identification: Label connected components at each time

  3. Tracking: Determine object correspondences across time

  4. Optional splitting & merging: Handle complex event evolution

Parameters:
  • data_bin (xarray.DataArray) – Binary field of extreme points to group, label, and track (True = object, False = background) Must represent and underlying dask array.

  • mask (xarray.DataArray) – Binary mask indicating valid regions (True = valid, False = invalid)

  • R_fill (int) – The radius of the kernel used in morphological opening & closing, relating to the largest hole/gap that can be filled. In units of grid cells.

  • area_filter_quartile (float, optional) – The fraction of the smallest objects to discard, i.e. the quantile defining the smallest area object retained. Quantile must be in (0-1) (e.g., 0.25 removes smallest 25%). Mutually exclusive with area_filter_absolute. Default is 0.5 if neither parameter is provided.

  • area_filter_absolute (int, optional) – The minimum area (in grid cells) for an object to be retained. Mutually exclusive with area_filter_quartile. Use this for fixed minimum area thresholds (e.g., 10 cells minimum).

  • temp_dir (str, optional) – Path to temporary directory for storing intermediate results

  • T_fill (int, default=2) – The permissible temporal gap (in days) between objects for tracking continuity to be maintained (must be even)

  • allow_merging (bool, default=True) – Allow objects to split and merge across time. Apply splitting & merging criteria, track merge events, and maintain original identities of merged objects across time. N.B.: False reverts to classical ndmeasure.label with simplar time connectivity, i.e. Scannell et al.

  • nn_partitioning (bool, default=False) – Implement a better partitioning of merged child objects based on closest parent cell. False reverts to using parent centroids to determine partitioning between new child objects, i.e. Di Sun & Bohai Zhang 2023. N.B.: Centroid-based partitioning has major problems with small merging objects suddenly obtaining unrealistically-large (and often disjoint) fractions of the larger object.

  • overlap_threshold (float, default=0.5) – The fraction of the smaller object’s area that must overlap with the larger object’s area to be considered the same event and continue tracking with the same ID.

  • unstructured_grid (bool, default=False) – Whether data is on an unstructured grid

  • dimensions (dict, default={"time": "time", "x": "lon", "y": "lat"}) – Mapping of dimensions to names in the data

  • coordinates (dict, optional) – Coordinate names for unstructured grids. Should contain ‘x’ and ‘y’ keys for x and y coordinates. May also contain ‘time’ if the time coordinate name is different from the dimension name.

  • neighbours (xarray.DataArray, optional) – For unstructured grid, indicates connectivity between cells

  • cell_areas (xarray.DataArray, optional) – For unstructured grid, area of each cell (required). For structured grid, area of each cell (optional). If not provided, defaults to 1.0 for each cell (resulting in cell counts as areas). Note: Overridden by grid_resolution if provided for structured grids.

  • grid_resolution (float, optional) – Grid resolution in degrees for structured grids only (ignored for unstructured grids). When provided, automatically calculates cell areas using spherical geometry. Overrides any provided cell_areas parameter.

  • max_iteration (int, default=40) – Maximum number of iterations for merging/splitting algorithm

  • checkpoint (str, default='None') – Checkpoint strategy (‘save’, ‘load’, or None)

  • debug (int, default=0) – Debug level (0-2)

  • verbose (bool, optional) – Enable verbose logging with detailed progress information. If None, uses current global logging configuration.

  • quiet (bool, optional) – Enable quiet logging with minimal output (warnings and errors only). If None, uses current global logging configuration. Note: quiet takes precedence over verbose if both are True.

  • regional_mode (bool, default=False) – Enable regional mode for non-global coordinate ranges. When True, coordinate_units must be specified.

  • coordinate_units (str, optional) – Coordinate units when regional_mode=True. Must be either ‘degrees’ or ‘radians’.

Examples

Basic tracking of marine heatwave events from preprocessed data:

>>> import xarray as xr
>>> import marEx
>>>
>>> # Load preprocessed extreme events data
>>> processed = xr.open_dataset('extreme_events.nc', chunks={})
>>> extreme_events = processed.extreme_events  # Boolean array
>>> mask = processed.mask  # Ocean/land mask
>>>
>>> # Initialise tracker with basic parameters
>>> tracker = marEx.tracker(
...     extreme_events,
...     mask,
...     R_fill=8,                    # Fill holes up to 8 grid cells
...     area_filter_quartile=0.5     # Remove smallest 50% of objects
...     allow_merging=False          # Basic tracking without splitting/merging
... )
>>>
>>> # Run tracking algorithm
>>> events = tracker.run()
>>> print(f"Identified {events.ID.max().compute()} distinct events")
Identified 1247 distinct events

Using automatic grid area calculation from resolution:

>>> # For regular lat/lon grids, automatically calculate physical areas
>>> grid_tracker = marEx.tracker(
...     extreme_events,
...     mask,
...     R_fill=8,
...     area_filter_quartile=0.5,
...     grid_resolution=0.25  # Grid resolution in degrees
... )
>>> # Cell areas are calculated automatically using spherical geometry
>>> grid_events = grid_tracker.run()

Advanced tracking with merging and splitting enabled:

>>> # More sophisticated tracking with temporal gap filling
>>> advanced_tracker = marEx.tracker(
...     extreme_events,
...     mask,
...     R_fill=12,               # Larger spatial gap filling
...     T_fill=4,                # Fill up to 4-day temporal gaps
...     area_filter_quartile=0.25,  # More aggressive size filtering
...     allow_merging=True,      # Enable split/merge detection
...     overlap_threshold=0.3    # Lower threshold for object linking
... )
>>>
>>> events_advanced, merges_log = advanced_tracker.run(return_merges=True)
>>> print(events_advanced.data_vars)
Data variables:
    event           (time, lat, lon)        int32           dask.array<chunksize=(25, 180, 360)>
    event_centroid  (time, lat, lon)        int32           dask.array<chunksize=(25, 180, 360)>
    ID_field        (time, lat, lon)        int32           dask.array<chunksize=(25, 180, 360)>
    global_ID       (time, ID)              int32           dask.array<chunksize=(25, 1247)>
    area            (time, ID)              float32         dask.array<chunksize=(25, 1247)>
    centroid        (component, time, ID)   float64         dask.array<chunksize=(2, 25, 1247)>
    presence        (time, ID)              bool            dask.array<chunksize=(25, 1247)>
    time_start      (ID)                    datetime64[ns]  dask.array<chunksize=(1247,)>
    time_end        (ID)                    datetime64[ns]  dask.array<chunksize=(1247,)>
    merge_ledger    (time, ID, sibling_ID)  int32           dask.array<chunksize=(25, 1247, 10)>

Processing unstructured ocean model data (ICON):

>>> # Load ICON ocean model data with connectivity
>>> icon_data = xr.open_dataset('icon_extremes.nc', chunks={})
>>> icon_extremes = icon_data.extreme_events  # (time, ncells)
>>> icon_mask = icon_data.mask
>>> neighbours = icon_data.neighbours  # Cell connectivity
>>> cell_areas = icon_data.cell_areas  # Physical areas
>>>
>>> # Track events on unstructured grid
>>> unstructured_tracker = marEx.tracker(
...     icon_extremes,
...     icon_mask,
...     R_fill=5,                                   # 5-neighbor radius for gap filling
...     area_filter_quartile=0.6,                   # Remove 60% of smallest events
...     unstructured_grid=True,                     # Enable unstructured mode
...     dimensions={"x": "ncells"},                 # Must specify the name of the spatial dimension
...     coordinates={"x": "lon", "y": "lat"},       # Spatial coordinate names
...     neighbours=neighbours,                      # Required for unstructured
...     cell_areas=cell_areas                       # Required for area calculations
... )
>>> unstructured_events = unstructured_tracker.run()

Memory management and checkpointing for large datasets:

>>> # Use checkpointing for very large datasets
>>> large_tracker = marEx.tracker(
...     extreme_events,
...     mask,
...     R_fill=8,
...     area_filter_quartile=0.5,
...     temp_dir='/scratch/user/tracking_temp',  # Temporary storage
...     checkpoint='save'             # Save intermediate results
... )
>>> # Processing can be resumed if interrupted
>>> large_events = large_tracker.run()

Comparing different filtering strategies:

>>> # Conservative filtering - keep more events
>>> conservative = marEx.tracker(
...     extreme_events, mask, R_fill=5, area_filter_quartile=0.1
... )
>>> conservative_events = conservative.run()
>>>
>>> # Aggressive filtering - focus on largest events
>>> aggressive = marEx.tracker(
...     extreme_events, mask, R_fill=15, area_filter_quartile=0.8
... )
>>> aggressive_events = aggressive.run()
>>>
>>> print(f"Conservative: {conservative_events.ID.max().compute()} events")
>>> print(f"Aggressive: {aggressive_events.ID.max().compute()} events")

Using absolute area filtering instead of percentile-based:

>>> # Filter objects smaller than 25 grid cells
>>> absolute_tracker = marEx.tracker(
...     extreme_events, mask, R_fill=8, area_filter_absolute=25
... )
>>> absolute_events = absolute_tracker.run()
>>>
>>> # Default behavior (area_filter_quartile=0.5) when no parameters provided
>>> default_tracker = marEx.tracker(extreme_events, mask, R_fill=8)
>>> default_events = default_tracker.run()  # Uses quartile=0.5 filtering

Using physical cell areas for structured grids:

>>> # Load data with irregular grid cell areas
>>> grid_areas = xr.open_dataset('grid_areas.nc').cell_area  # (lat, lon) in m²
>>>
>>> # Track events using physical areas instead of cell counts
>>> physical_tracker = marEx.tracker(
...     extreme_events,
...     mask,
...     R_fill=8,
...     area_filter_quartile=0.5,
...     cell_areas=grid_areas  # Physical areas in m²
... )
>>> events = physical_tracker.run()
>>> # Now events.area contains physical areas in m² instead of cell counts

Integration with full marEx workflow:

>>> # Complete workflow from raw data to tracked events
>>> raw_sst = xr.open_dataset('sst_data.nc', chunks={}).sst.chunk({'time': 30})
>>>
>>> # Step 1: Preprocess to identify extremes
>>> processed = marEx.preprocess_data(raw_sst, threshold_percentile=95)
>>>
>>> # Step 2: Track extreme events
>>> tracker = marEx.tracker(
...     processed.extreme_events,
...     processed.mask,
...     R_fill=8,
...     area_filter_quartile=0.5
... )
>>> tracked_events = tracker.run()

Initialise the tracker with parameters and data.

__init__(data_bin, mask, R_fill, area_filter_quartile=None, area_filter_absolute=None, temp_dir=None, T_fill=2, allow_merging=True, nn_partitioning=False, overlap_threshold=0.5, unstructured_grid=False, dimensions=None, coordinates=None, neighbours=None, cell_areas=None, grid_resolution=None, max_iteration=40, checkpoint=None, debug=0, verbose=None, quiet=None, regional_mode=False, coordinate_units=None)[source]

Initialise the tracker with parameters and data.

Parameters:
Return type:

None

run(return_merges=False, checkpoint=None)[source]

Run the complete object identification and tracking pipeline.

This method executes the full workflow:

  1. Preprocessing: morphological operations and size filtering

  2. Identification and tracking of objects through time

  3. Computing and attaching statistics to the results

Parameters:
  • return_merges (bool, default=False) – If True, return merge events dataset alongside the main events

  • checkpoint (str, optional) – Override the instance checkpoint setting

Returns:

  • events_ds (xarray.Dataset) – Dataset containing tracked events and their properties

  • merges_ds (xarray.Dataset, optional) – Dataset with merge event information (only if return_merges=True)

Return type:

Dataset | Tuple[Dataset, Dataset]

run_preprocess(checkpoint=None)[source]

Preprocess binary data to prepare for tracking.

This performs morphological operations to fill holes/gaps in both space and time, then filters small objects according to the area_filter_quartile or area_filter_absolute.

Parameters:

checkpoint (str, optional) – Checkpoint strategy override

Returns:

  • data_bin_filtered (xarray.DataArray) – Preprocessed binary data

  • object_stats (tuple) – Statistics about the preprocessing

Return type:

Tuple[DataArray, Tuple[float, int, int, float, float, float]]

run_tracking(data_bin_preprocessed)[source]

Track objects through time to identify events.

Parameters:

data_bin_preprocessed (xarray.DataArray) – Preprocessed binary data

Returns:

  • events_ds (xarray.Dataset) – Dataset containing tracked events

  • merges_ds (xarray.Dataset) – Dataset with merge information

  • N_events_final (int) – Final number of unique events

Return type:

Tuple[Dataset, Dataset, int]

run_stats_attributes(events_ds, merges_ds, object_stats, N_events_final)[source]

Add statistics and attributes to the events dataset.

Parameters:
  • events_ds (xarray.Dataset) – Dataset containing tracked events

  • merges_ds (xarray.Dataset) – Dataset with merge information

  • object_stats (tuple) – Preprocessed object statistics

  • N_events_final (int) – Final number of events

Returns:

events_ds – Dataset with added statistics and attributes

Return type:

xarray.Dataset

compute_area(data_bin)[source]

Compute the total area of binary data at each time.

Parameters:

data_bin (xarray.DataArray) – Binary data

Returns:

area – Total area at each time (units: pixels for structured grid, matching cell_area for unstructured)

Return type:

xarray.DataArray

fill_holes(data_bin, R_fill=None)[source]

Fill holes and gaps using morphological operations.

This performs closing (dilation followed by erosion) to fill small gaps, then opening (erosion followed by dilation) to remove small isolated objects.

Parameters:
  • data_bin (xarray.DataArray) – Binary data to process

  • R_fill (int, optional) – Fill radius override

Returns:

data_bin_filled – Binary data with holes/gaps filled

Return type:

xarray.DataArray

fill_time_gaps(data_bin)[source]

Fill temporal gaps between objects.

Performs binary closing (dilation then erosion) along the time dimension to fill small time gaps between objects.

Parameters:

data_bin (xarray.DataArray) – Binary data to process

Returns:

data_bin_filled – Binary data with temporal gaps filled

Return type:

xarray.DataArray

refresh_dask_graph(data_bin)[source]

Clear and reset the Dask graph via save/load cycle.

This is needed to work around a memory leak bug in Dask where “Unmanaged Memory” builds up within loops.

Parameters:

data_bin (xarray.DataArray) – Data to refresh

Returns:

data_new – Data with fresh Dask graph

Return type:

xarray.DataArray

filter_small_objects(data_bin)[source]

Remove objects smaller than a threshold area.

Parameters:

data_bin (xarray.DataArray) – Binary data to filter

Returns:

  • data_bin_filtered (xarray.DataArray) – Binary data with small objects removed

  • area_threshold (float) – Area threshold used for filtering

  • object_areas (xarray.DataArray) – Areas of all objects pre-filtering

  • N_objects_prefiltered (int) – Number of objects before filtering

  • N_objects_filtered (int) – Number of objects after filtering

Return type:

Tuple[DataArray, float, DataArray, int, int]

identify_objects(data_bin, time_connectivity)[source]

Identify connected regions in binary data.

Parameters:
  • data_bin (xarray.DataArray) – Binary data to identify objects in

  • time_connectivity (bool) – Whether to connect objects across time

Returns:

  • object_id_field (xarray.DataArray) – Field of integer IDs for each object

  • None (NoneType) – Placeholder for compatibility with track_objects

  • N_objects (int) – Number of objects identified

Return type:

Tuple[DataArray, None, int]

calculate_centroid(binary_mask, original_centroid=None)[source]

Calculate object centroid, handling edge cases for periodic boundaries.

Parameters:
  • binary_mask (numpy.ndarray) – 2D binary array where True indicates the object (dimensions are (y,x))

  • original_centroid (tuple, optional) – (y_centroid, x_centroid) from regionprops_table

Returns:

(y_centroid, x_centroid)

Return type:

tuple

calculate_object_properties(object_id_field, properties=None)[source]

Calculate properties of objects from ID field.

Parameters:
  • object_id_field (xarray.DataArray) – Field containing object IDs

  • properties (list, optional) – List of properties to calculate (defaults to [‘label’, ‘area’])

Returns:

object_props – Dataset containing calculated properties with ‘ID’ dimension

Return type:

xarray.Dataset

check_overlap_slice(ids_t0, ids_next)[source]

Find overlapping objects between two consecutive time slices.

Parameters:
Returns:

Array of shape (n_overlaps, 3) with [id_t0, id_next, overlap_area]

Return type:

numpy.ndarray

find_overlapping_objects(object_id_field)[source]

Find all overlapping objects across time.

Parameters:

object_id_field (xarray.DataArray) – Field containing object IDs

Returns:

overlap_objects_list_unique_filtered – Array of object ID pairs that overlap across time, with overlap area The object in the first column precedes the second column in time. The third column contains:

  • For structured grid: number of overlapping pixels (int32)

  • For unstructured grid: total overlapping area in m^2 (float32)

Return type:

(N x 3) numpy.ndarray

enforce_overlap_threshold(overlap_objects_list, object_props)[source]

Filter object pairs based on overlap threshold.

Parameters:
  • overlap_objects_list ((N x 3) numpy.ndarray) – Array of object ID pairs with overlap area

  • object_props (xarray.Dataset) – Object properties including area

Returns:

overlap_objects_list_filtered – Filtered array of object ID pairs that meet the overlap threshold

Return type:

(M x 3) numpy.ndarray

consolidate_object_ids(data_t_minus_2, data_t_minus_1, object_props, timestep)[source]

Consolidate object IDs between t-2 and t-1 to ensure consistent tracking.

This identifies objects at t-1 that are actually continuations of objects from t-2 (but got different IDs due to partitioning) and renames them to maintain consistent IDs across timesteps.

Parameters:
  • data_t_minus_2 (xr.DataArray) – Object field at timestep t-2

  • data_t_minus_1 (xr.DataArray) – Object field at timestep t-1 (will be modified)

  • object_props (xr.Dataset) – Object properties dataset (will be modified)

  • timestep (int) – Current timestep number for logging purposes

Returns:

  • data_t_minus_1_consolidated (xr.DataArray) – Updated t-1 field with consolidated IDs

  • object_props_updated (xr.Dataset) – Updated object properties with merged/deleted objects

Return type:

Tuple[DataArray, Dataset]

Notes

  • Uses self.overlap_threshold for determining consolidation eligibility

  • Updates object properties by recalculating for consolidated objects

  • Removes redundant child objects from object_props

compute_id_time_dict(da, child_objects, max_objects, all_objects=True)[source]

Generate lookup table mapping object IDs to their time index.

Parameters:
  • da (xarray.DataArray) – Field of object IDs

  • child_objects (list or array) – Object IDs to include in the dictionary

  • max_objects (int) – Maximum number of objects

  • all_objects (bool, default=True) – Whether to process all objects or just child_objects

Returns:

time_index_map – Dictionary mapping object IDs to time indices

Return type:

dict

track_objects(data_bin)[source]

Track objects through time to form events.

This is the main tracking method that handles splitting and merging of objects.

Parameters:

data_bin (xarray.DataArray) – Preprocessed binary data: Field of globally unique integer IDs of each element in connected regions. ID = 0 indicates no object.

Returns:

  • split_merged_events_ds (xarray.Dataset) – Dataset containing tracked events

  • merge_events (xarray.Dataset) – Dataset with merge information

  • N_events (int) – Final number of events

Return type:

Tuple[Dataset, Dataset, int]

cluster_rename_objects_and_props(object_id_field_unique, object_props, overlap_objects_list, merge_events)[source]

Cluster the object pairs and relabel to determine final event IDs.

Parameters:
  • object_id_field_unique (xarray.DataArray) – Field of unique object IDs. IDs must not be repeated across time.

  • object_props (xarray.Dataset) – Properties of each object that also need to be relabeled.

  • overlap_objects_list ((N x 2) numpy.ndarray) – Array of object ID pairs that indicate which objects are in the same event. The object in the first column precedes the second column in time.

  • merge_events (xarray.Dataset) – Information about merge events

Returns:

split_merged_events_ds – Dataset with relabeled events and their properties. ID = 0 indicates no object.

Return type:

xarray.Dataset

split_and_merge_objects(object_id_field_unique, object_props)[source]

Implement object splitting and merging logic.

This identifies and processes cases where objects split or merge over time, creating new object IDs as needed.

Parameters:
  • object_id_field_unique (xarray.DataArray) – Field of unique object IDs. IDs are required to be monotonically increasing with time.

  • object_props (xarray.Dataset) – Properties of each object

Returns:

(object_id_field, object_props, overlap_objects_list, merge_events)

Return type:

tuple

split_and_merge_objects_parallel(object_id_field_unique, object_props)[source]

Optimised parallel implementation of object splitting and merging.

This version is specifically designed for unstructured grids with more efficient memory handling and better parallelism than the standard split_and_merge_objects method. It processes data in chunks, handles merging events, and efficiently updates object IDs.

Parameters:
Returns:

(object_id_field, object_props, overlap_objects_list, merge_events)

Return type:

tuple

Basic Usage Examples

Simple Event Tracking

import xarray as xr
import marEx

# Load preprocessed extreme events data
extremes_ds = xr.open_dataset('extreme_events.nc')

# Basic tracking with default parameters
event_tracker = marEx.tracker(
    extremes_ds.extreme_events,  # Binary extreme events field
    extremes_ds.mask,            # Land-sea mask
    R_fill=8,                    # Fill holes with radius < 8 cells
    area_filter_absolute=100     # Remove objects smaller than 100 grid cells
)

# Run tracking algorithm
tracked_events = event_tracker.run()

Advanced Tracking Configuration

# Advanced tracking with merge/split handling
event_tracker = marEx.tracker(
    extremes_ds.extreme_events,
    extremes_ds.mask,
    R_fill=8,                    # Fill holes with radius < 8 cells
    area_filter_quartile=0.5,    # Remove smallest 50% of events (alternative to area_filter_absolute)
    T_fill=2,                    # Allow 2-day gaps in tracking
    allow_merging=True,          # Enable merge/split tracking
    overlap_threshold=0.5,       # 50% overlap required for continuity
    nn_partitioning=True,        # Use nearest-neighbor partitioning
    cell_areas=grid_areas        # Optional: physical cell areas (m²)
)

# Run tracking and get merge information
tracked_events, merges_ds = event_tracker.run(return_merges=True)

Unstructured Grid Tracking

# For unstructured grids, specify dimensions
dimensions = {'time': 'time', 'x': 'ncells'}
coordinates = {'time': 'time', 'x': 'lon', 'y': 'lat'}

event_tracker = marEx.tracker(
    extremes_ds.extreme_events,
    extremes_ds.mask,
    R_fill=5,                    # Adjusted for unstructured grid
    area_filter_quartile=0.4,
    dimensions=dimensions,
    coordinates=coordinates
)

tracked_events = event_tracker.run()

Parameter Reference

Required Parameters

data_binxarray.DataArray

Binary data to identify and track objects in (True = event, False = background). Must be Dask-backed and boolean type.

maskxarray.DataArray

Binary mask indicating valid regions (True = valid, False = invalid). Must be boolean type.

R_fillint

Radius for filling holes/gaps in spatial domain (in grid cells). Controls morphological operations to fill small gaps in binary objects.

How R_fill Works:

The R_fill parameter controls a two-step morphological cleaning operation: closing followed by opening. This process makes identified events spatially coherent and removes small, isolated noise.

Physical Rationale:

  1. Closing (Fill Holes): Marine heatwaves are generally coherent spatial phenomena. Small “holes” within a larger event are often artefacts of data gridding or minor fluctuations below the threshold. Closing (dilation → erosion) fills these internal gaps, making the event representation physically realistic.

  2. Opening (Remove Noise): Very small, isolated pixels flagged as events are often statistical noise or sensor errors rather than genuine marine heatwaves. Opening (erosion → dilation) removes these spurious objects, ensuring only spatially significant events are tracked.

Visual Process (using R_fill=1 as example):

    Initial State                  After Closing                  After Opening
┌──────────────────────┐       ┌──────────────────────┐       ┌──────────────────────┐
│   █████              │       │   █████              │       │   █████              │
│  ███████             │       │  ███████             │       │  ███████             │
│ ███ █████            │       │ █████████            │       │ █████████            │
│ ███  ████   █        │       │ █████████   █        │       │ █████████            │
│  ████████            │       │  ████████            │       │  ████████            │
│   ███████            │       │   ███████            │       │   ███████            │
│    █████             │       │    █████             │       │    █████             │
│           █          │       │           █          │       │                      │
└──────────────────────┘       └──────────────────────┘       └──────────────────────┘
│                              │                              │
└─ Main event with             └─ Closing fills internal      └─ Opening removes
   2 small holes inside           holes, making event            isolated noise pixels,
   + 2 isolated noise pixels      more coherent                  leaving clean event

Structuring Element (Kernel):

Operations use a disk-shaped kernel with diameter = 2 * R_fill + 1. For R_fill=2, the kernel diameter is 5 pixels:

┌──────────────┐
│      ██      │     Any hole or isolated object smaller
│   ██ ██ ██   │     than this disk will be filled or removed
│██ ██ ██ ██ ██│
│   ██ ██ ██   │
│      ██      │
└──────────────┘

Example: R_fill=8 uses a (circular) disk kernel with diameter 17 pixels, filling holes and removing objects up to ~8 grid cells in radius.

When to Adjust:

  • Smaller R_fill (e.g., 3-5): Preserves small-scale features; use for coarse data or fine structures

  • Larger R_fill (e.g., 10-15): Creates more coherent objects; use for noisy data or large-scale events

area_filter_quartilefloat

Quantile (0-1) for filtering smallest objects. For example, 0.25 removes smallest 25% of objects, 0.5 removes smallest 50%. Mutually exclusive with area_filter_absolute.

area_filter_absoluteint

Minimum area (in grid cells) for an object to be retained. For example, 25 keeps only events with 25 or more grid cells. Mutually exclusive with area_filter_quartile.

Core Tracking Parameters

T_fillint, default=2

Number of timesteps for filling temporal gaps (must be even). Allows events to be tracked across short temporal interruptions.

How T_fill Works:

T_fill = Maximum temporal gap (in days) for event continuity

Timeline with T_fill=2:

Day 0: ████████  ← Event detected
Day 1:            ← Gap (no detection)
Day 2:            ← Gap (no detection)  } T_fill=2 days
Day 3: ████████  ← Event detected again
       │        │
       └────────┴─ SAME event ID (gap ≤ T_fill)

Timeline with T_fill=0 (no gap filling):

Day 0: ████████  ← Event ID: 1
Day 1:            ← Gap
Day 2: ████████  ← NEW event ID: 2 (no gap tolerance)

Example: T_fill=2 allows 2-day gaps while maintaining the same event ID.

When to Adjust:

  • T_fill=0: No gap filling, strict continuity requirement, or for weekly/monthly sampling

  • T_fill=2-4: Standard gap filling for daily data (recommended)

  • T_fill=6-10: More permissive for noisy data

allow_mergingbool, default=True

Allow objects to split and merge across time.

  • True: Apply splitting & merging criteria, track merge events, and maintain original identities

  • False: Classical connected component labeling with simple time connectivity

nn_partitioningbool, default=True

Use nearest-neighbor cell-based partitioning for merging events (improved algorithm).

  • True: Partition merged child objects based on closest parent cell (recommended)

  • False: Use parent centroids for partitioning (legacy method with known issues)

Why This Matters: When objects merge, their new area must be partitioned and assigned back to the original parent identities. The choice of algorithm for this partitioning is critical.

The centroid method partitions based on proximity to parent centroids. This can be problematic for non-convex or L-shaped features, where the centroid may lie far from the feature’s cells, leading to physically unrealistic partitions.

In contrast, the nn (nearest-neighbor) method assigns each cell to the parent of the closest cell from the previous timestep. This results in a more intuitive and physically-based partition that respects the geometry of the parent features.

Visual Comparison:

Scenario: A large C-shaped feature (A) and a small nearby feature (B) merge into a single large object. We compare how centroid-based vs nearest-neighbor methods partition the merged child back to parent lineages.

1. Initial State (Time t)

Two distinct parent features with centroids marked by ●.

  ┌───┬───┬───┬───┬───┐
5 │ A │ A │ A │ A │ A │
  ├───┼───┼───┼───┼───┤
4 │ A │ A │   │   │   │
  ├───┼───┼───┼───┼───┤
3 │ A │ ● │   │ B ● B │  ← Centroid A at (3, 2), Centroid B at (3, 4.5)
  ├───┼───┼───┼───┼───┤
2 │ A │ A │   │   │   │
  ├───┼───┼───┼───┼───┤
1 │ A │ A │ A │ A │ A │
  └───┴───┴───┴───┴───┘
    1   2   3   4   5

2. Merged State (Time t+1)

The features grow and merge into a single large object marked with ‘#’. Question: how should these 20 cells be attributed to lineages A and B ?

  ┌───┬───┬───┬───┬───┐
5 │   │ # │ # │ # │ # │
  ├───┼───┼───┼───┼───┤
4 │ # │ # │ # │   │   │
  ├───┼───┼───┼───┼───┤
3 │ # │ # │ # │ # │ # │
  ├───┼───┼───┼───┼───┤
2 │ # │ # │ # │   │   │
  ├───┼───┼───┼───┼───┤
1 │   │ # │ # │ # │ # │
  └───┴───┴───┴───┴───┘
    1   2   3   4   5

3. Centroid Partition (nn_partitioning=False) - ❌ PROBLEMATIC

Each cell assigned to nearest parent centroid. The partition line ║ creates a geometric boundary that ignores actual cell topology. Result: B is an odd disjoint object now !

  ┌───┬───┬──╦┬───┬───┐
5 │   │ A │ A║│ B │ B │
  ├───┼───┼──╫┼───┼───┤
4 │ A │ A │ A║│   │   │
  ├───┼───┼──╫┼───┼───┤
3 │ A │ A │ A║│ B │ B │
  ├───┼───┼──╫┼───┼───┤
2 │ A │ A │ A║│   │   │
  ├───┼───┼──╫┼───┼───┤
1 │   │ A │ A║│ B │ B │
  └───┴───┴──╩┴───┴───┘
    1   2   3   4   5

Problem: B is an odd disjoint object with portions unrelated to the original parent. The original C-shaped object means that the centroid misrepresents A’s main body, thus creating unrealistic/nonrepresentative partitions.

**4. Nearest-Neighbour Partition (nn_partitioning=True)

Each cell assigned to parent of nearest parent cell. Natural boundaries respect actual spatial connectivity. Result: B remains contiguous (realistic growth pattern).

  ┌───┬───┬───┬───┬───┐
5 │   │ A │ A │ A │ A │
  ├───┼───┼───┼───┼───┤
4 │ A │ A │ A │   │   │
  ├───┼───┼───┼───┼───┤
3 │ A │ A │ B │ B │ B │
  ├───┼───┼───┼───┼───┤
2 │ A │ A │ A │   │   │
  ├───┼───┼───┼───┼───┤
1 │   │ A │ A │ A │ A │
  └───┴───┴───┴───┴───┘
    1   2   3   4   5

Solution: Growth of the resulting merged object is correctly attributed to A’s lineage based on cell-level proximity, not abstract geometric centroids.

Example: Small Object Problem:

# Centroid method (old, has issues)
tracker_centroid = marEx.tracker(
    extremes, mask,
    R_fill=8,
    allow_merging=True,
    nn_partitioning=False  # Small objects get unrealistic portions
)

# Nearest-neighbor method (new, recommended)
tracker_nn = marEx.tracker(
    extremes, mask,
    R_fill=8,
    allow_merging=True,
    nn_partitioning=True   # Realistic partitioning
)

When to Use:

  • Always use ``nn_partitioning=True`` (default) for accurate merging/splitting

  • Only use nn_partitioning=False for comparison with legacy results or Sun et al. (2023) methodology

overlap_thresholdfloat, default=0.5

Minimum fraction of overlap between objects to consider them the same event. Fraction of the smaller object’s area that must overlap with the larger object’s area.

How overlap_threshold Works - Two-Stage Process:

═══════════════════════════════════════════════════════════════════════════════
STAGE 1: OVERLAP CHECKING (Determines IF objects are related)
═══════════════════════════════════════════════════════════════════════════════

Formula: overlap_fraction = overlap_area / min(area_parent, area_child)
Decision: IF overlap_fraction >= overlap_threshold → Objects are LINKED

This stage applies to ALL transitions (1→1, 1→many, many→1, many→many)
This stage is INDEPENDENT of nn_partitioning choice

───────────────────────────────────────────────────────────────────────────────
Example 1: Threshold MET (overlap_threshold = 0.5)
───────────────────────────────────────────────────────────────────────────────

Time t=0:          Time t=1:          Overlap Check:

┌──────────┐       ┌─────────┐        overlap_area = 60 cells
│  Object  │       │ Object  │        min(100, 80) = 80
│    A     │   →   │    C    │
│   100    │       │   80    │        overlap_fraction = 60/80 = 0.75
│  cells   │       │  cells  │
└──────────┘       └─────────┘        0.75 >= 0.5 ✓ → LINKED (same event)

───────────────────────────────────────────────────────────────────────────────
Example 2: Threshold NOT MET (overlap_threshold = 0.5)
───────────────────────────────────────────────────────────────────────────────

Time t=0:          Time t=1:          Overlap Check:

┌──────────┐       ┌─────────┐        overlap_area = 30 cells
│  Object  │       │ Object  │        min(100, 80) = 80
│    A     │   ╳   │    C    │
│   100    │       │   80    │        overlap_fraction = 30/80 = 0.375
│  cells   │       │  cells  │
└──────────┘       └─────────┘        0.375 < 0.5 ✗ → NOT LINKED

                                      Result: A terminates, C starts as NEW event

═══════════════════════════════════════════════════════════════════════════════
STAGE 2: PARTITIONING (Determines HOW to divide merged areas)
═══════════════════════════════════════════════════════════════════════════════

This stage ONLY applies when a MERGE is detected (many parents → one child),
which also requires the overlap criterion (Stage 1) to be met for each parent.
This stage does NOT apply to splits (one parent → many children)
Choice of nn_partitioning (True/False) affects ONLY this stage

───────────────────────────────────────────────────────────────────────────────
Example 3: MERGE Scenario - Partitioning Applies
───────────────────────────────────────────────────────────────────────────────

Time t=0:                Time t=1:

┌─────────┐              ┌──────────────┐
│ Object A│              │   Object C   │
│   100   │──┐           │     120      │
│  cells  │  │           │    cells     │
└─────────┘  │           └──────────────┘
             │  MERGE
┌─────────┐  │
│ Object B│──┘           STAGE 1 - Overlap Checking:
│   40    │              ────────────────────────────
│  cells  │              A→C: 80/min(100,120) = 80/100 = 0.8 >= 0.5 ✓ LINKED
└─────────┘              B→C: 35/min(40,120)  = 35/40  = 0.875 >= 0.5 ✓ LINKED

                         Both A and B linked to C → MERGE DETECTED!

                         STAGE 2 - Partitioning (triggered by merge):
                         ─────────────────────────────────────────────
                         Question: How to divide C's 120 cells between A and B lineages?

                         if nn_partitioning=True:  Use nearest parent CELL
                         if nn_partitioning=False: Use nearest parent CENTROID

                         (See nn_partitioning documentation for partition methods)

───────────────────────────────────────────────────────────────────────────────
Example 4: SPLIT Scenario - No Partitioning Occurs
───────────────────────────────────────────────────────────────────────────────

Time t=0:          Time t=1:

┌─────────┐        ┌─────────┐
│ Object A│───┬───→│ Object B│
│   100   │   │    │   60    │
│  cells  │   │    │  cells  │
└─────────┘   │    └─────────┘
              │
              │    ┌─────────┐
              └───→│ Object C│
                   │   50    │
                   │  cells  │
                   └─────────┘

STAGE 1 - Overlap Checking:
────────────────────────────
A→B: 55/min(100,60) = 55/60 = 0.917 >= 0.5 ✓ LINKED
A→C: 45/min(100,50) = 45/50 = 0.9   >= 0.5 ✓ LINKED

One parent linked to two children → SPLIT DETECTED!

STAGE 2 - Partitioning:
────────────────────────
N.B.: NO PARTITIONING occurs for splits!
All objects (A, B, C) grouped into same event with same ID
Object B & C continue to evolve disjoint, yet will ultimately be labelled with the same event ID

Key Principles:

  • Overlap checking happens FIRST for all object pairs between consecutive timesteps

  • Partitioning happens SECOND and ONLY when merges are detected (many→one)

  • nn_partitioning choice does NOT affect which objects are linked in Stage 1

  • Splits (one→many) do NOT trigger partitioning - all objects share the same event ID

When to Adjust:

  • Lower threshold (e.g., 0.3): More continuous tracking, may link distant objects

  • Higher threshold (e.g., 0.7): More conservative, stricter continuity requirement

  • Default (0.5): Balanced approach for most applications

Grid Configuration Parameters

dimensionsdict, default={‘time’: ‘time’, ‘x’: ‘lon’, ‘y’: ‘lat’}

Mapping of conceptual dimensions to actual dimension names. For unstructured grids, use: {'time': 'time', 'x': 'ncells'}

coordinatesdict, default={‘time’: ‘time’, ‘x’: ‘lon’, ‘y’: ‘lat’}

Mapping of conceptual coordinates to actual coordinate names. For unstructured grids, use: {'time': 'time', 'x': 'lon', 'y': 'lat'}

grid_resolutionfloat, optional

Grid resolution in degrees for structured grids (ignored for unstructured grids). When provided, automatically calculates physical cell areas using spherical geometry. Overrides any provided cell_areas parameter for structured grids.

# Automatic from grid resolution
tracker = marEx.tracker(
    extremes, mask,
    R_fill=8,
    grid_resolution=0.25  # Automatically calculates spherical areas for 0.25° grid
)

# This is equivalent to manually provide 2D data array of cell areas
tracker = marEx.tracker(
    extremes, mask,
    R_fill=8,
    cell_areas=my_calculated_areas  # Manual calculation required
)

The calculation accounts for latitude-dependent cell areas using spherical geometry, providing accurate physical areas (in km²) for object/event size calculations.

cell_areasxr.DataArray, optional

Physical cell areas for area calculations.

  • For structured grids: Optional. If not provided, defaults to 1.0 for each cell (i.e. cell counts). N.B.: If neither cell_areas nor grid_resolution is provided, areas are in units of cells/pixels. Note: Overridden by grid_resolution if provided.

  • For unstructured grids: Required for physical area calculations.

Output Data Structure

The tracking algorithm returns an xarray Dataset with the following structure:

Main Tracking Dataset

# tracked_events Dataset structure:
xarray.Dataset
Dimensions: (lat, lon, time, ID, component, sibling_ID)
Coordinates:
    lat         (lat)
    lon         (lon)
    time        (time)
    ID          (ID)
Data variables:
    ID_field              (time, lat, lon)        int32       # Event ID field
    global_ID             (time, ID)              int32       # Global ID mapping
    area                  (time, ID)              float32     # Event areas
    centroid              (component, time, ID)   float32     # Event centroids
    presence              (time, ID)              bool        # Event presence
    time_start            (ID)                    datetime64  # Start times
    time_end              (ID)                    datetime64  # End times
    merge_ledger          (time, ID, sibling_ID)  int32       # Merge information

Key Variables:

  • ID_field: Binary field with tracked event IDs

  • global_ID: Unique ID mapping for each event at each time

  • area: Spatial area of each event through time (in units of cell counts, or physical units if cell_areas/grid_resolution provided)

  • centroid: (x,y) centroid coordinates of each event

  • presence: Boolean indicating event presence at each time

  • time_start/time_end: Temporal bounds of each event

  • merge_ledger: Sibling IDs for merging events (-1 = no merge)

Merge Information Dataset

# merges_ds Dataset structure (when return_merges=True):
xarray.Dataset
Dimensions: (merge_ID, parent_idx, child_idx)
Data variables:
    parent_IDs      (merge_ID, parent_idx)  int32       # Parent event IDs
    child_IDs       (merge_ID, child_idx)   int32       # Child event IDs
    overlap_areas   (merge_ID, parent_idx)  int32       # Overlap areas
    merge_time      (merge_ID)              datetime64  # Merge timestamps
    n_parents       (merge_ID)              int8        # Number of parents
    n_children      (merge_ID)              int8        # Number of children

Key Variables:

  • parent_IDs/child_IDs: Original parent and child IDs in merge events

  • overlap_areas: Spatial overlap between parent and child objects

  • merge_time: Timestamp of each merge event

  • n_parents/n_children: Number of objects involved in each merge

Advanced Usage Examples

Analysing Event Properties

# Run tracking
tracked_events, merges_ds = event_tracker.run(return_merges=True)

# Analyse event durations
event_durations = (tracked_events.time_end - tracked_events.time_start).dt.days
long_events = tracked_events.ID.where(event_durations > 10, drop=True)

# Find events present at specific time
time_slice = tracked_events.sel(time='2020-07-15')
active_events = time_slice.ID.where(time_slice.presence, drop=True)

# Calculate maximum event areas
max_areas = tracked_events.area.max(dim='time')
large_events = tracked_events.ID.where(max_areas > threshold, drop=True)

Merge Event Analysis

# Analyse merge events
if merges_ds is not None:
    # Find complex merge events (multiple parents/children)
    complex_merges = merges_ds.where(
        (merges_ds.n_parents > 1) | (merges_ds.n_children > 1),
        drop=True
    )

    # Analyse merge timing
    merge_times = merges_ds.merge_time.values
    seasonal_merges = merges_ds.groupby(merges_ds.merge_time.dt.season).count()

    # Find parent-child relationships
    for merge_id in complex_merges.merge_ID:
        parents = merges_ds.parent_IDs.sel(merge_ID=merge_id).values
        children = merges_ds.child_IDs.sel(merge_ID=merge_id).values
        print(f"Merge {merge_id}: {parents}{children}")

Visualisation Integration

# Plot tracked events
config = marEx.PlotConfig(
    title='Tracked Marine Heatwave Events',
    plot_IDs=True,          # Special handling for event IDs
    cmap='tab20'            # Discrete colormap
)

# Plot single timestep
fig, ax, im = tracked_events.ID_field.isel(time=0).plotX.single_plot(config)

# Create animation
movie_path = tracked_events.ID_field.plotX.animate(
    config,
    plot_dir='./animations',
    file_name='tracked_events'
)

Performance Optimisation

Chunking Strategy

# Optimal chunking for tracking
chunk_size = {'time': 25, 'lat': -1, 'lon': -1}  # Keep spatial dims together
extremes_ds = xr.open_dataset('extremes.nc', chunks=chunk_size)

# For very large datasets
event_tracker = marEx.tracker(
    extremes_ds.extreme_events,
    extremes_ds.mask,
    R_fill=8,
    area_filter_quartile=0.5,
    # Performance optimizations
    dask_chunks=chunk_size
)

Memory Management

# For memory-constrained environments
event_tracker = marEx.tracker(
    extremes_ds.extreme_events,
    extremes_ds.mask,
    R_fill=6,                    # Smaller fill radius
    area_filter_quartile=0.75,   # Remove more small events
    T_fill=1,                    # Reduce temporal gap filling
    allow_merging=False          # Disable merge tracking for speed
)

Algorithm Details

Tracking Workflow

The tracking algorithm follows these key steps:

  1. Binary Object Identification * Connected component labeling on each time slice * Morphological operations (opening/closing) for noise reduction * Size filtering based on area quartiles

  2. Temporal Matching * Overlap-based matching between consecutive time steps * Handles one-to-one, one-to-many, and many-to-one relationships * Gap filling for short-duration interruptions

  3. Event Lifecycle Management * Track initialisation, continuation, merging, and splitting * Unique ID assignment and genealogy tracking * Statistical property calculation

  4. Post-processing * Merge event documentation * Output dataset construction

Merge/Split Algorithm

The advanced merge/split tracking implements:

# Improved merge partitioning logic:
# - Partition child objects based on nearest parent cell
# - Maintain original event identities across merges
# - Track genealogy of splitting and merging events
# - Use overlap threshold to determine event continuity

Error Handling

Common Issues and Solutions

Memory Errors:

# Solution: Reduce chunk sizes and increase filtering
event_tracker = marEx.tracker(
    data_bin, mask,
    R_fill=6,                  # Smaller fill radius
    area_filter_quartile=0.8,  # Remove more small events
    T_fill=1,                  # Reduce temporal gap filling
    dask_chunks={'time': 15}   # Smaller time chunks
)

Performance Issues:

# Solution: Optimise parameters for your data
event_tracker = marEx.tracker(
    data_bin, mask,
    R_fill=4,                  # Reduce morphological operations
    area_filter_quartile=0.75, # Remove more small events
    allow_merging=False        # Disable merge tracking
)

Tracking Quality Issues:

# Solution: Tune overlap and temporal parameters
event_tracker = marEx.tracker(
    data_bin, mask,
    R_fill=8,
    area_filter_quartile=0.5,
    overlap_threshold=0.3,     # Lower overlap requirement
    T_fill=4,                  # Allow longer temporal gaps
    allow_merging=True         # Enable merge tracking
)

Integration with Preprocessing

Complete Workflow Example

import xarray as xr
import marEx

# Step 1: Preprocess data
sst = xr.open_dataset('sst_daily.nc', chunks={'time': 365}).sst

extremes_ds = marEx.preprocess_data(
    sst,
    method_anomaly='shifting_baseline',
    method_extreme='hobday_extreme',
    threshold_percentile=95,
    window_year_baseline=15,
    smooth_days_baseline=21,
    window_days_hobday=11,
    dask_chunks={'time': 25}
)

# Step 2: Track events
event_tracker = marEx.tracker(
    extremes_ds.extreme_events,
    extremes_ds.mask,
    R_fill=8,
    area_filter_quartile=0.5,
    T_fill=2,
    allow_merging=True,
    overlap_threshold=0.5,
    nn_partitioning=True
)

# Step 3: Run tracking
tracked_events, merges_ds = event_tracker.run(return_merges=True)

# Step 4: Save results
tracked_events.to_netcdf('tracked_events.nc')
if merges_ds is not None:
    merges_ds.to_netcdf('merge_events.nc')

See Also