
Script to compute spatial autocorrelation of structured/unstructured datasets #1955

Merged
shmh40 merged 14 commits into develop from shmh40/dev/1952-compute-spatial-autocorr
Mar 6, 2026

Conversation

@shmh40
Contributor

@shmh40 shmh40 commented Feb 27, 2026

Description

Opus-generated script to compute autocorrelations in structured and unstructured datasets.

How it works:

  1. Load data from an anemoi or obs dataset, or from xarray.

  2. Optional anomaly computation: remove the climatology for gridded data, or approximate this with a spatial mean/std for unstructured data.

  3. Estimate the spatial autocorrelation by sampling random pairs of points: compute the haversine (great-circle) distance for each pair, bin pairs by distance, and compute the correlation per bin. The autocorrelation length is then fitted with a 1/e threshold. The number of samples and time slices used for the estimate can be set by the user -- it is pretty cheap and only needs to be done once, so you can run with many samples. Claude did something annoying with fallbacks for fitting the correlation length (1/e -> integrated scale -> log-linear), but I am not too worried about this.

  4. The script additionally maps the spatial autocorrelation length to a suggested HEALPix masking level (scaled by the user-chosen coefficient), groups variables (for the separated stream configs), and produces YAML snippets for per-stream masking overrides.
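The pair-sampling estimate in step 3 can be sketched as follows. This is a hypothetical, self-contained illustration of the approach, not the script's actual code: the function names, defaults, and binning choices here are assumptions, and the input is a flat array of points with one value each.

```python
import numpy as np


def haversine_km(lat1, lon1, lat2, lon2, radius_km=6371.0):
    """Great-circle distance in km between points given in degrees."""
    lat1, lon1, lat2, lon2 = map(np.radians, (lat1, lon1, lat2, lon2))
    a = (np.sin((lat2 - lat1) / 2.0) ** 2
         + np.cos(lat1) * np.cos(lat2) * np.sin((lon2 - lon1) / 2.0) ** 2)
    return 2.0 * radius_km * np.arcsin(np.minimum(1.0, np.sqrt(a)))


def correlation_length(lats, lons, values, n_pairs=50_000, n_bins=20,
                       max_dist_km=5_000.0, seed=42):
    """Estimate the 1/e spatial correlation length from random point pairs."""
    rng = np.random.default_rng(seed)
    i = rng.integers(0, len(values), n_pairs)
    j = rng.integers(0, len(values), n_pairs)
    keep = i != j  # self-pairs would bias the short-distance bins upward
    i, j = i[keep], j[keep]
    d = haversine_km(lats[i], lons[i], lats[j], lons[j])
    # Standardise the field; the mean of v_i * v_j within a distance bin
    # then estimates the correlation at that separation.
    v = (values - values.mean()) / values.std()
    prod = v[i] * v[j]
    bins = np.linspace(0.0, max_dist_km, n_bins + 1)
    idx = np.digitize(d, bins) - 1  # pairs beyond max_dist_km fall outside
    centers = 0.5 * (bins[:-1] + bins[1:])
    corr = np.array([prod[idx == b].mean() if np.any(idx == b) else np.nan
                     for b in range(n_bins)])
    # 1/e threshold: first bin centre where the correlation drops below e^-1.
    below = np.where(corr < np.exp(-1.0))[0]
    length = centers[below[0]] if below.size else max_dist_km
    return length, centers, corr
```

A spatially smooth field should yield a much longer estimated length than white noise at the same point locations, which is a quick sanity check for this kind of estimator.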

Issue Number

Closes #1952

Checklist before asking for review

  • I have performed a self-review of my code
  • My changes comply with basic sanity checks:
    • I have fixed formatting issues with ./scripts/actions.sh lint
    • I have run unit tests with ./scripts/actions.sh unit-test
    • I have documented my code and I have updated the docstrings.
    • I have added unit tests, if relevant
  • I have tried my changes with data and code:
    • I have run the integration tests with ./scripts/actions.sh integration-test
    • (bigger changes) I have run a full training and I have written in the comment the run_id(s): launch-slurm.py --time 60
    • (bigger changes and experiments) I have shared a HedgeDoc in the GitHub issue with all the configurations and runs for these experiments
  • I have informed and aligned with people impacted by my change:
    • for config changes: the MatterMost channels and/or a design doc
    • for changes of dependencies: the MatterMost software development channel

@shmh40 shmh40 self-assigned this Feb 27, 2026
@shmh40 shmh40 added the data Anything related to the datasets used in the project label Feb 27, 2026
@shmh40 shmh40 linked an issue Feb 27, 2026 that may be closed by this pull request
Collaborator

@clessig clessig left a comment


Here are some initial comments. This is difficult to review exactly. It would be great if we could add some examples of the output to the PR. The information would in principle also be something that could go into the STAC catalogue, and we could compute it for all datasets that are onboarded.


if len(distances) < 100:
    logger.warning("Too few non-NaN pairs for autocorrelation estimation")
    return 500.0, np.array([]), np.array([])
Collaborator

Shouldn't the script fail in this case? Is 500.0 really a reasonable value when l483 applies?

Contributor Author

Agreed, this is stupid. Replaced all of these with raising errors.
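For illustration, the raise-instead-of-default pattern might look like this hypothetical sketch (the 100-pair threshold is taken from the quoted snippet; the function name is invented):

```python
import numpy as np

MIN_PAIRS = 100  # threshold taken from the quoted snippet


def require_enough_pairs(distances: np.ndarray) -> None:
    """Fail loudly instead of silently returning a 500 km default length."""
    n = int(np.count_nonzero(~np.isnan(distances)))
    if n < MIN_PAIRS:
        raise ValueError(
            f"Too few non-NaN pairs ({n}) for autocorrelation estimation; "
            f"need at least {MIN_PAIRS}."
        )
```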


if global_var < 1e-12:
    logger.warning("Near-zero variance, defaulting correlation length")
    return 500.0, bin_centers, np.zeros(n_bins)
Collaborator

Same as above

    choices=["anemoi", "fesom", "obs", "xarray", "cams", "eobs", "iconart", "iconesm"],
    help="Dataset type",
)
parser.add_argument("--channels", nargs="*", default=None, help="Variables to analyse")
Collaborator

Is it done for all channels by default?

Contributor Author

Yes, added a help comment to clarify this.

    n_sample_pairs: int = 100_000,
    seed: int = 42,
) -> tuple[float, NDArray, NDArray]:
    """Compute spatial autocorrelation for ragged per-time observations."""
Collaborator

Could we add a few sentences on what the computations do?

Contributor Author

Done

@shmh40
Contributor Author

shmh40 commented Mar 3, 2026

@tjhunter how can I put this into a "science" package? If you can do this trivially, that would be great :)


Example usage::

python scripts/compute_stream_masking.py \\
Collaborator

Could we correct this and also document the optional arguments (like --channels)? What worked for me:

uv run python packages/science/compute_spatial_autocorrelation.py --dataset ./data/observations-ea-ofb-0001-2007-2021-metop-a-iasi-radiances-v1.zarr --type obs --n-time-samples 100 --n-sample-pairs 100000 --correlation-multiplier 1.5
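For intuition on the --correlation-multiplier flag: the script maps a correlation length scaled by this coefficient to a suggested HEALPix masking level. The mapping below is a hypothetical sketch of such a rule (finest level whose typical cell still fits inside the scaled length), not the script's actual formula.

```python
import math

EARTH_RADIUS_KM = 6371.0


def healpix_level_for_length(length_km: float, multiplier: float = 1.5,
                             max_level: int = 12) -> int:
    """Finest HEALPix level whose typical cell size still fits inside
    multiplier * length_km (illustrative rule, not the script's own)."""
    target_km = multiplier * length_km
    for level in range(max_level + 1):
        npix = 12 * (2 ** level) ** 2  # HEALPix pixel count at nside = 2**level
        # edge length of an equal-area pixel covering 4*pi*R^2 / npix
        cell_km = math.sqrt(4.0 * math.pi * EARTH_RADIUS_KM ** 2 / npix)
        if cell_km <= target_km:
            return level
    return max_level
```

Longer correlation lengths map to coarser (smaller) levels, so short-scale observation streams end up masked at finer resolution.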

@clessig
Collaborator

clessig commented Mar 6, 2026

Please update the usage example before merging.

@shmh40 shmh40 merged commit 2f66529 into develop Mar 6, 2026
5 checks passed
Jubeku added a commit that referenced this pull request Mar 10, 2026
* Improve support for latent losses (#1963)

* Revert 2D rope to false by default (#1967)

Set to True by accident

* Implementation of DataReaderMesh (#1840)

* First implementation of DataReaderMesh

* Move to datareaders extra

* ruff

* ruff2

* Undo ruff

* undo auto-linting

* correct typo in eval config (#1971)

* Added all-physical-streams option and x/y axis limits (#1972)

* Added all-physical-streams option and x/y axis limits

* Fix

* Changed flag for all streams

* Removed old code

* moved metric parsing to eval_from_config (#1977)

Co-authored-by: buschow1 <[email protected]>

* Fixed integration test (#1980)

* [1974][model] Add fallback to config loading (#1985)

* Add fallback to config loading

* Adjust error message to be not misleading

* Homegenize naming convention

* Introduce bias/diff maps and animations (#1912)

* Introduce bias/diff maps and animations

* minor correction

* Changes based on review

* Introduce "plot_bias" in evaluation configuration (#1986)

* Fixed index ordering to not have shuffled output (#1982)

* Fixed idxs_inv to revert data point shuffeling

* Fixed output handling

* Handling of empty data case, addressing reviewer comment

* [1893][eval] csvreader cleanup (#1906)

* refactor csvreader

* check if dataarray size is 0

* fix and use original logic for empty data

* linting fixes

* revert assertions back

* [1890][eval] Move MergeReader to own module (#1892)

* move mergereader

* use assertions only

* implement scoring for the sub-steps within the forecast window    (#1896)

* work in progress

* working for forecast_step

* working version

* restore no valid times option

* lint

* Rename scale_z_channels to _scale_z_channels

* fix 1 sample bug

* Remove points_per_sample from ReaderOutput

Remove points_per_sample from ReaderOutput return.

* remove n_point_per_sample

* fix lead time coord in compute_scores

* lint

* fix integration test

* Fix integration test single stream (#1996)

* fix test single

* change yml extension and minor fixes

---------

Co-authored-by: cosi1 <[email protected]>
Co-authored-by: cosi1 <[email protected]>

* [1907][eval] clean up wegen_reader.py (#1911)

* clean up wegen_reader.py

* remove exception

* consistent reader naming

* add blank line

* use assertions only

* make names consistent

* Merge branch 'develop' into 1907-wegenreader-cleanup

* revert is_regular

---------

Co-authored-by: iluise <[email protected]>
Co-authored-by: Ilaria Luise <[email protected]>

* [1888][eval] Refactor Reader class (#1889)

* refactor Reader

* use assertion only

* fix npp atms

---------

Co-authored-by: iluise <[email protected]>
Co-authored-by: Ilaria Luise <[email protected]>

* [1975][model] Load model path from private repo instead of json (#1998)

* Load model path from private repo instead of json

* Lint

* Script to compute spatial autocorrelation of structured/unstructured datasets (#1955)

* standalone script to compute spatial autocorrelation of variables in a structured or unstructured dataset

* remove commits that should be in pr 1951

* lint

* addressed comments

* removed last failure returning 500km default, and moved to packages science

* updated a note

* rename autocorrelation script

* update example usage

* Correct EMA halflife_steps calculation with rampup_ratio (#2001)

Corrected rampup calculation: https://github.com/NVlabs/edm2/blob/4bf8162f601bcc09472ce8a32dd0cbe8889dc8fc/training/phema.py#L145

Co-authored-by: Wael <[email protected]>

* Reduce verbosity of output during inference and evaluation  (#2006)

* Fix incorrect length in validation progress bar

* Removing too verbose output

* [1766][1743][1332] lint and unit-test fix (#1802)

* [1766][1742] fix lint and unit-test

* [1766] fix linter

* [1766] lint local and global consistent

* [1332] add script to detect bad functions (getattr)

* code quality: lint and bad functions

* [1766] disable some checks

* [1877] Script to populate PR labels from linked issues (#1878)

* script

* branch

* more dirs

* typo

* enable

* Fixed bug in linear embedding (#2012)

* Adding forecast_steps feature to plot_train (#2010)

* Adding forecast_steps feature to plot_train

* Renamed arguement to conform to hyphen convention

* Added forecast step to filename

---------

Co-authored-by: Seb Hickman <[email protected]>

---------

Co-authored-by: Christian Lessig <[email protected]>
Co-authored-by: Seb Hickman <[email protected]>
Co-authored-by: Kacper Nowak <[email protected]>
Co-authored-by: Till Hauer <[email protected]>
Co-authored-by: s6sebusc <[email protected]>
Co-authored-by: buschow1 <[email protected]>
Co-authored-by: Matthias Karlbauer <[email protected]>
Co-authored-by: Savvas Melidonis <[email protected]>
Co-authored-by: Michael Tarnawa <[email protected]>
Co-authored-by: iluise <[email protected]>
Co-authored-by: pierluigicosi <[email protected]>
Co-authored-by: cosi1 <[email protected]>
Co-authored-by: cosi1 <[email protected]>
Co-authored-by: Ilaria Luise <[email protected]>
Co-authored-by: Wael <[email protected]>
Co-authored-by: Simone Norberti <[email protected]>
Co-authored-by: Timothy Hunter <[email protected]>
Jubeku added a commit that referenced this pull request Mar 12, 2026

* add noise distribution plotting

* plot noise distribution and decoded noised tokens

* fix noise level in validation to p_mean

* rm noise and token distribution plotting


Labels

data Anything related to the datasets used in the project

Projects

Status: Done

Development

Successfully merging this pull request may close these issues.

Compute spatial autocorrelation of variables in a dataset

2 participants