marEx.preprocess_data

marEx.preprocess_data(da, method_anomaly='shifting_baseline', method_extreme='hobday_extreme', threshold_percentile=95, window_year_baseline=15, smooth_days_baseline=21, window_days_hobday=11, window_spatial_hobday=None, std_normalise=False, detrend_orders=None, force_zero_mean=True, reference_period=None, method_percentile='approximate', precision=0.01, max_anomaly=5.0, dask_chunks=None, dimensions=None, coordinates=None, neighbours=None, cell_areas=None, use_temp_checkpoints=False, verbose=None, quiet=None)[source]

Complete preprocessing pipeline for marine extreme event identification.

Supports separate methods for anomaly computation and extreme identification:

Anomaly Methods:

‘detrend_harmonic’: Detrending with harmonics and polynomials – more efficient, but biases statistics
‘shifting_baseline’: Rolling climatology using previous window_year_baseline years – more “correct”, but shortens time series by window_year_baseline years
‘fixed_baseline’: Daily climatology using full time series – does _not_ remove climate trends !
‘detrend_fixed_baseline’: Polynomial detrending followed by fixed daily climatology – keeps full time-series of data, but does not account for trends in the timing of seasonal transitions (which may appear as extremes)

Extreme Methods:

‘global_extreme’: Global-in-time threshold value
‘hobday_extreme’: Local day-of-year specific thresholds with windowing

Parameters:

da (xarray.DataArray) – Raw input data
method_anomaly (str, default='shifting_baseline') – Anomaly computation method (‘detrend_harmonic’, ‘shifting_baseline’, ‘fixed_baseline’, or ‘detrend_fixed_baseline’).
method_extreme (str, default='hobday_extreme') – Extreme identification method (‘global_extreme’ or ‘hobday_extreme’).
threshold_percentile (float, default=95) – Percentile threshold for extreme event detection.
window_year_baseline (int, default=15) – Number of previous years for rolling climatology (shifting_baseline method only).
smooth_days_baseline (int, default=21) – Days for smoothing rolling climatology (shifting_baseline method only).
window_days_hobday (int, default=11) – Window size for day-of-year threshold calculation (hobday_extreme method only).
window_spatial_hobday (int, default=None) – Spatial window size (2D centred window) for the day-of-year threshold calculation (hobday_extreme method only).
std_normalise (bool, default=False) – Whether to standardise anomalies by rolling standard deviation (detrend_harmonic only).
detrend_orders (list, default=[1]) – Polynomial orders for detrending (detrend_harmonic method only). Default is 1st order (linear) detrend. [1,2] e.g. would use a linear+quadratic detrending.
force_zero_mean (bool, default=True) – Whether to enforce zero mean in detrended anomalies (detrend_harmonic method only).
reference_period (tuple of (int, int), optional) – Year range (start_year, end_year) inclusive for computing the daily climatology (fixed_baseline and detrend_fixed_baseline only). If None (default), uses all available years. Anomalies are computed for the full time series regardless. Example: reference_period=(1990, 2020) computes the climatology from 1990-2020 but outputs anomalies for the entire input time range.
method_percentile (str, default='approximate') – Method for percentile calculation (‘exact’ or ‘approximate’) for both global_extreme & hobday_extreme methods. N.B.: Using the exact percentile calculation requires both careful/thoughtful chunking & sufficient memory, in consideration of the limitations inherent to distributed parallel I/O & processing.
precision (float, default=0.01) – Precision for histogram bins in approximate percentile method.
max_anomaly (float, default=5.0) – Maximum anomaly value for histogram binning in the approximate percentile method.
dask_chunks (dict, optional) – Chunking specification for distributed computation.
dimensions (dict, default={"time": "time", "x": "lon", "y": "lat"}) – Mapping of dimensions to names in the data.
coordinates (dict, optional) – Mapping of coordinates to names in the data. Defaults to dimensions mapping.
neighbours (xarray.DataArray, optional) – Neighbour connectivity for spatial clustering.
cell_areas (xarray.DataArray, optional) – Cell areas for weighted spatial statistics.
use_temp_checkpoints (bool, default=False) – Enable checkpointing to temporary zarr stores to break Dask graph dependencies. When True, intermediate results (anomalies, thresholds, extremes) are saved to temporary zarr files and immediately reloaded, preventing expensive recomputations. Recommended for large datasets on HPC systems where the 2D histogram computation is expensive. Temporary files are automatically cleaned up after reloading.
verbose (bool, default=None) – Enable verbose logging with detailed progress information. If None, uses current global logging configuration.
quiet (bool, default=None) – Enable quiet logging with minimal output (warnings and errors only). If None, uses current global logging configuration. Note: quiet takes precedence over verbose if both are True.

Returns:

Processed dataset with anomalies and extreme event identification

Return type:

xarray.Dataset

Examples

Basic usage with gridded SST data for marine heatwave detection:

>>> import xarray as xr
>>> import marEx
>>>
>>> # Load and chunk SST data
>>> sst = xr.open_dataset('sst_data.nc', chunks={}).sst.chunk({'time': 30})
>>>
>>> # Basic preprocessing with default shifting baseline method
>>> result = marEx.preprocess_data(sst, threshold_percentile=90)
>>> print(result)
<xarray.Dataset>
Dimensions:         (time: 1461, lat: 180, lon: 360)
Data variables:
    dat_anomaly     (time, lat, lon) float32 dask.array<chunksize=(30, 180, 360)>
    mask            (lat, lon) bool dask.array<chunksize=(180, 360)>
    extreme_events  (time, lat, lon) bool dask.array<chunksize=(30, 180, 360)>
    thresholds      (lat, lon) float32 dask.array<chunksize=(180, 360)>

>>> # Check which locations have extreme events
>>> print(f"Total extreme events: {result.extreme_events.sum().compute()}")
Total extreme events: 15847

Using shifting baseline method for more accurate climatology:

>>> # Requires at least 15 years of data by default
>>> result_shifting = marEx.preprocess_data(
...     sst,
...     method_anomaly="shifting_baseline",
...     window_year_baseline=10,  # Use shorter window if needed
...     smooth_days_baseline=31   # Longer smoothing window
... )
>>> # Note: First 10 years will be removed from output

Using Hobday extreme method with day-of-year specific thresholds:

>>> result_hobday = marEx.preprocess_data(
...     sst,
...     method_extreme="hobday_extreme",
...     window_days_hobday=11,  # 11-day window for each day-of-year
...     threshold_percentile=95
... )
>>> print(result_hobday.thresholds.dims)
('dayofyear', 'lat', 'lon')

Previous configuration (marEx v2.0 default) with polynomial detrending and standardisation:

>>> result_advanced = marEx.preprocess_data(
...     sst,
...     method_anomaly="detrend_harmonic",
...     detrend_orders=[1, 2],  # Linear and quadratic trends
...     std_normalise=True,     # Standardise by rolling std
...     force_zero_mean=True,
...     threshold_percentile=95
... )
>>> # Result includes both raw and standardised anomalies
>>> print('dat_stn' in result_advanced)
True

Processing unstructured data:

>>> # For ICON ocean model data
>>> icon_sst = xr.open_dataset('icon_sst.nc', chunks={}).to.chunk({'time': 50})
>>> result_unstructured = marEx.preprocess_data(
...     icon_sst,
...     dimensions={"x": "ncells"},   # Must specify the name of the spatial dimension
...     dask_chunks={"time": 50}
... )

Error handling - insufficient data for shifting baseline:

>>> short_data = sst.isel(time=slice(0, 1000))  # Only ~3 years
>>> try:
...     result = marEx.preprocess_data(
...         short_data,
...         method_anomaly="shifting_baseline",
...         window_year_baseline=15
...     )
... except ValueError as e:
...     print(f"Error: {e}")
Error: Insufficient data for shifting_baseline method. Dataset spans 3 years but window_year_baseline
requires at least 15 years.

Performance considerations with chunking:

>>> # For large datasets, adjust chunking for memory management
>>> large_sst = sst.chunk({"time": 25, "lat": 90, "lon": 180})
>>> result = marEx.preprocess_data(
...     large_sst,
...     dask_chunks={"time": 25},
...     method_percentile="approximate"  # Use approximate method (Default) for long time-series calculations
... )

Integration with tracking workflow:

>>> # Preprocess data then track events
>>> processed = marEx.preprocess_data(sst, threshold_percentile=95)
>>> tracker = marEx.tracker(
...     processed.extreme_events,
...     processed.mask,
...     R_fill=8,
...     area_filter_quartile=0.5
... )
>>> events = tracker.run()
>>> print(f"Identified {events.event.max().compute()} distinct events")

Simple fixed baseline approach:

>>> # Basic daily climatology across all years
>>> result_fixed = marEx.preprocess_data(
...     sst,
...     method_anomaly="fixed_baseline",
...     threshold_percentile=95
... )
>>> # Uses all available data for climatology computation

Combined trend removal and fixed climatology:

>>> # Remove long-term trends then compute daily climatology
>>> result_combined = marEx.preprocess_data(
...     sst,
...     method_anomaly="detrend_fixed_baseline",
...     detrend_orders=[1],  # Linear trend
...     threshold_percentile=95,
...     force_zero_mean=True
... )
>>> # Balances trend removal with simple climatology