
Script to compute spatial autocorrelation of structured/unstructured datasets #1955

Merged
shmh40 merged 14 commits into develop from shmh40/dev/1952-compute-spatial-autocorr
Mar 6, 2026

Conversation

@shmh40
Contributor

@shmh40 shmh40 commented Feb 27, 2026

Description

Opus-generated script to compute autocorrelations in structured and unstructured datasets.

How it works:

  1. Load data from an anemoi or obs dataset, or from xarray.

  2. Optional anomaly computation: remove the climatology for gridded data, or approximate this with a spatial mean/std for unstructured data.

  3. Estimate the spatial autocorrelation by sampling random pairs of points: compute the haversine (great-circle) distance for each pair, bin pairs by distance, and compute the correlation per bin. The autocorrelation length is then fitted with a 1/e threshold. The number of samples and time slices used for the estimate can be set by the user -- it is pretty cheap and only needs to be done once, so you can run with many samples. Claude did something annoying with fallbacks for fitting the correlation length (1/e -> integrated scale -> log-linear), but I am not too worried about this.

  4. The script additionally maps the spatial autocorrelation length to a suggested HEALPix masking level (scaled by the user-chosen coefficient), groups variables (for the separated stream configs), and produces YAML snippets for per-stream masking overrides.
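The pair-sampling estimate in step 3 can be sketched as follows. This is a hypothetical, self-contained illustration of the approach, not the script's actual code: the function names, defaults, and binning choices here are assumptions, and the input is a flat array of points with one value each.

```python
import numpy as np


def haversine_km(lat1, lon1, lat2, lon2, radius_km=6371.0):
    """Great-circle distance in km between points given in degrees."""
    lat1, lon1, lat2, lon2 = map(np.radians, (lat1, lon1, lat2, lon2))
    a = (np.sin((lat2 - lat1) / 2.0) ** 2
         + np.cos(lat1) * np.cos(lat2) * np.sin((lon2 - lon1) / 2.0) ** 2)
    return 2.0 * radius_km * np.arcsin(np.minimum(1.0, np.sqrt(a)))


def correlation_length(lats, lons, values, n_pairs=50_000, n_bins=20,
                       max_dist_km=5_000.0, seed=42):
    """Estimate the 1/e spatial correlation length from random point pairs."""
    rng = np.random.default_rng(seed)
    i = rng.integers(0, len(values), n_pairs)
    j = rng.integers(0, len(values), n_pairs)
    keep = i != j  # self-pairs would bias the short-distance bins upward
    i, j = i[keep], j[keep]
    d = haversine_km(lats[i], lons[i], lats[j], lons[j])
    # Standardise the field; the mean of v_i * v_j within a distance bin
    # then estimates the correlation at that separation.
    v = (values - values.mean()) / values.std()
    prod = v[i] * v[j]
    bins = np.linspace(0.0, max_dist_km, n_bins + 1)
    idx = np.digitize(d, bins) - 1  # pairs beyond max_dist_km fall outside
    centers = 0.5 * (bins[:-1] + bins[1:])
    corr = np.array([prod[idx == b].mean() if np.any(idx == b) else np.nan
                     for b in range(n_bins)])
    # 1/e threshold: first bin centre where the correlation drops below e^-1.
    below = np.where(corr < np.exp(-1.0))[0]
    length = centers[below[0]] if below.size else max_dist_km
    return length, centers, corr
```

A spatially smooth field should yield a much longer estimated length than white noise at the same point locations, which is a quick sanity check for this kind of estimator.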

Issue Number

Closes #1952

Checklist before asking for review

  • I have performed a self-review of my code
  • My changes comply with basic sanity checks:
    • I have fixed formatting issues with ./scripts/actions.sh lint
    • I have run unit tests with ./scripts/actions.sh unit-test
    • I have documented my code and I have updated the docstrings.
    • I have added unit tests, if relevant
  • I have tried my changes with data and code:
    • I have run the integration tests with ./scripts/actions.sh integration-test
    • (bigger changes) I have run a full training and I have written in the comment the run_id(s): launch-slurm.py --time 60
    • (bigger changes and experiments) I have shared a HedgeDoc in the GitHub issue with all the configurations and runs for these experiments
  • I have informed and aligned with people impacted by my change:
    • for config changes: the MatterMost channels and/or a design doc
    • for changes of dependencies: the MatterMost software development channel

@shmh40 shmh40 self-assigned this Feb 27, 2026
@shmh40 shmh40 added the data Anything related to the datasets used in the project label Feb 27, 2026
@shmh40 shmh40 linked an issue Feb 27, 2026 that may be closed by this pull request
Collaborator

@clessig clessig left a comment


Here are some initial comments. This is difficult to review exactly. It would be great if we could add some examples of the output to the PR. The information would in principle also be something that could go into the STAC catalogue, and we could compute it for all datasets that are onboarded.


if len(distances) < 100:
    logger.warning("Too few non-NaN pairs for autocorrelation estimation")
    return 500.0, np.array([]), np.array([])
Collaborator

Shouldn't the script fail in this case? Is 500.0 really a reasonable value when l483 applies?

Contributor Author

Agreed, this is stupid. Replaced all of these with raising errors.
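For illustration, the raise-instead-of-default pattern might look like this hypothetical sketch (the 100-pair threshold is taken from the quoted snippet; the function name is invented):

```python
import numpy as np

MIN_PAIRS = 100  # threshold taken from the quoted snippet


def require_enough_pairs(distances: np.ndarray) -> None:
    """Fail loudly instead of silently returning a 500 km default length."""
    n = int(np.count_nonzero(~np.isnan(distances)))
    if n < MIN_PAIRS:
        raise ValueError(
            f"Too few non-NaN pairs ({n}) for autocorrelation estimation; "
            f"need at least {MIN_PAIRS}."
        )
```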


if global_var < 1e-12:
    logger.warning("Near-zero variance, defaulting correlation length")
    return 500.0, bin_centers, np.zeros(n_bins)
Collaborator

Same as above

    choices=["anemoi", "fesom", "obs", "xarray", "cams", "eobs", "iconart", "iconesm"],
    help="Dataset type",
)
parser.add_argument("--channels", nargs="*", default=None, help="Variables to analyse")
Collaborator

Is it done for all channels by default?

Contributor Author

Yes, added a help comment to clarify this.

    n_sample_pairs: int = 100_000,
    seed: int = 42,
) -> tuple[float, NDArray, NDArray]:
    """Compute spatial autocorrelation for ragged per-time observations."""
Collaborator

Could we add a few sentences on what the computations do?

Contributor Author

Done

@shmh40
Contributor Author

shmh40 commented Mar 3, 2026

@tjhunter how can I put this into a "science" package? If you can do this trivially, that would be great :)


Example usage::

python scripts/compute_stream_masking.py \\
Collaborator

Could we correct this and also document the optional arguments (like --channels)? What worked for me:

uv run python packages/science/compute_spatial_autocorrelation.py --dataset ./data/observations-ea-ofb-0001-2007-2021-metop-a-iasi-radiances-v1.zarr --type obs --n-time-samples 100 --n-sample-pairs 100000 --correlation-multiplier 1.5
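For intuition on the --correlation-multiplier flag: the script maps a correlation length scaled by this coefficient to a suggested HEALPix masking level. The mapping below is a hypothetical sketch of such a rule (finest level whose typical cell still fits inside the scaled length), not the script's actual formula.

```python
import math

EARTH_RADIUS_KM = 6371.0


def healpix_level_for_length(length_km: float, multiplier: float = 1.5,
                             max_level: int = 12) -> int:
    """Finest HEALPix level whose typical cell size still fits inside
    multiplier * length_km (illustrative rule, not the script's own)."""
    target_km = multiplier * length_km
    for level in range(max_level + 1):
        npix = 12 * (2 ** level) ** 2  # HEALPix pixel count at nside = 2**level
        # edge length of an equal-area pixel covering 4*pi*R^2 / npix
        cell_km = math.sqrt(4.0 * math.pi * EARTH_RADIUS_KM ** 2 / npix)
        if cell_km <= target_km:
            return level
    return max_level
```

Longer correlation lengths map to coarser (smaller) levels, so short-scale observation streams end up masked at finer resolution.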

@clessig
Collaborator

clessig commented Mar 6, 2026

Please update the usage example before merging.

@shmh40 shmh40 merged commit 2f66529 into develop Mar 6, 2026
5 checks passed
Jubeku added a commit that referenced this pull request Mar 10, 2026
* Improve support for latent losses (#1963)

* Revert 2D rope to false by default (#1967)

Set to True by accident

* Implementation of DataReaderMesh (#1840)

* First implementation of DataReaderMesh

* Move to datareaders extra

* ruff

* ruff2

* Undo ruff

* undo auto-linting

* correct typo in eval config (#1971)

* Added all-physical-streams option and x/y axis limits (#1972)

* Added all-physical-streams option and x/y axis limits

* Fix

* Changed flag for all streams

* Removed old code

* moved metric parsing to eval_from_config (#1977)

Co-authored-by: buschow1 <[email protected]>

* Fixed integration test (#1980)

* [1974][model] Add fallback to config loading (#1985)

* Add fallback to config loading

* Adjust error message to be not misleading

* Homegenize naming convention

* Introduce bias/diff maps and animations (#1912)

* Introduce bias/diff maps and animations

* minor correction

* Changes based on review

* Introduce "plot_bias" in evaluation configuration (#1986)

* Fixed index ordering to not have shuffled output (#1982)

* Fixed idxs_inv to revert data point shuffeling

* Fixed output handling

* Handling of empty data case, addressing reviewer comment

* [1893][eval] csvreader cleanup (#1906)

* refactor csvreader

* check if dataarray size is 0

* fix and use original logic for empty data

* linting fixes

* revert assertions back

* [1890][eval] Move MergeReader to own module (#1892)

* move mergereader

* use assertions only

* implement scoring for the sub-steps within the forecast window    (#1896)

* work in progress

* working for forecast_step

* working version

* restore no valid times option

* lint

* Rename scale_z_channels to _scale_z_channels

* fix 1 sample bug

* Remove points_per_sample from ReaderOutput

Remove points_per_sample from ReaderOutput return.

* remove n_point_per_sample

* fix lead time coord in compute_scores

* lint

* fix integration test

* Fix integration test single stream (#1996)

* fix test single

* change yml extension and minor fixes

---------

Co-authored-by: cosi1 <[email protected]>
Co-authored-by: cosi1 <[email protected]>

* [1907][eval] clean up wegen_reader.py (#1911)

* clean up wegen_reader.py

* remove exception

* consistent reader naming

* add blank line

* use assertions only

* make names consistent

* Merge branch 'develop' into 1907-wegenreader-cleanup

* revert is_regular

---------

Co-authored-by: iluise <[email protected]>
Co-authored-by: Ilaria Luise <[email protected]>

* [1888][eval] Refactor Reader class (#1889)

* refactor Reader

* use assertion only

* fix npp atms

---------

Co-authored-by: iluise <[email protected]>
Co-authored-by: Ilaria Luise <[email protected]>

* [1975][model] Load model path from private repo instead of json (#1998)

* Load model path from private repo instead of json

* Lint

* Script to compute spatial autocorrelation of structured/unstructured datasets (#1955)

* standalone script to compute spatial autocorrelation of variables in a structured or unstructured dataset

* remove commits that should be in pr 1951

* lint

* addressed comments

* removed last failure returning 500km default, and moved to packages science

* updated a note

* rename autocorrelation script

* update example usage

* Correct EMA halflife_steps calculation with rampup_ratio (#2001)

Corrected rampup calculation: https://github.com/NVlabs/edm2/blob/4bf8162f601bcc09472ce8a32dd0cbe8889dc8fc/training/phema.py#L145

Co-authored-by: Wael <[email protected]>

* Reduce verbosity of output during inference and evaluation  (#2006)

* Fix incorrect length in validation progress bar

* Removing too verbose output

* [1766][1743][1332] lint and unit-test fix (#1802)

* [1766][1742] fix lint and unit-test

* [1766] fix linter

* [1766] lint local and global consistent

* [1332] add script to detect bad functions (getattr)

* code quality: lint and bad functions

* [1766] disable some checks

* [1877] Script to populate PR labels from linked issues (#1878)

* script

* branch

* more dirs

* typo

* enable

* Fixed bug in linear embedding (#2012)

* Adding forecast_steps feature to plot_train (#2010)

* Adding forecast_steps feature to plot_train

* Renamed arguement to conform to hyphen convention

* Added forecast step to filename

---------

Co-authored-by: Seb Hickman <[email protected]>

---------

Co-authored-by: Christian Lessig <[email protected]>
Co-authored-by: Seb Hickman <[email protected]>
Co-authored-by: Kacper Nowak <[email protected]>
Co-authored-by: Till Hauer <[email protected]>
Co-authored-by: s6sebusc <[email protected]>
Co-authored-by: buschow1 <[email protected]>
Co-authored-by: Matthias Karlbauer <[email protected]>
Co-authored-by: Savvas Melidonis <[email protected]>
Co-authored-by: Michael Tarnawa <[email protected]>
Co-authored-by: iluise <[email protected]>
Co-authored-by: pierluigicosi <[email protected]>
Co-authored-by: cosi1 <[email protected]>
Co-authored-by: cosi1 <[email protected]>
Co-authored-by: Ilaria Luise <[email protected]>
Co-authored-by: Wael <[email protected]>
Co-authored-by: Simone Norberti <[email protected]>
Co-authored-by: Timothy Hunter <[email protected]>
Jubeku added a commit that referenced this pull request Mar 12, 2026

* add noise distribution plotting

* plot noise distribution and decoded noised tokens

* fix noise level in validation to p_mean

* rm noise and token distribution plotting


Labels

data Anything related to the datasets used in the project

Projects

Status: Done

Development

Successfully merging this pull request may close these issues.

Compute spatial autocorrelation of variables in a dataset

2 participants