Script to compute spatial autocorrelation of structured/unstructured datasets #1955
Conversation
clessig left a comment
Here are some initial comments. It is difficult to review exactly; it would be great if we could add some examples of the output to the PR. The information could in principle also go into the STAC catalogue, and we could compute it for all datasets that are onboarded.
    if len(distances) < 100:
        logger.warning("Too few non-NaN pairs for autocorrelation estimation")
        return 500.0, np.array([]), np.array([])
Shouldn't the script fail in this case? Is 500.0 really a reasonable value when the condition at l483 applies?
Agreed, this is stupid. Replaced all of these with raising errors.
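For reference, the raise-based version of such a guard could look like the sketch below. It fails loudly instead of silently returning a 500 km default; the `MIN_PAIRS` name, the helper name, and the exact exception type are assumptions for illustration, not necessarily the merged code:

```python
import numpy as np

MIN_PAIRS = 100  # threshold taken from the snippet above; constant name is hypothetical


def valid_distances(distances: np.ndarray) -> np.ndarray:
    """Drop NaN pairs and raise instead of returning a 500 km default."""
    distances = distances[~np.isnan(distances)]
    if len(distances) < MIN_PAIRS:
        raise ValueError(
            f"Too few non-NaN pairs ({len(distances)}) for autocorrelation estimation"
        )
    return distances
```

This way a dataset that cannot support the estimate aborts the run rather than propagating a made-up correlation length into downstream masking levels.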
    if global_var < 1e-12:
        logger.warning("Near-zero variance, defaulting correlation length")
        return 500.0, bin_centers, np.zeros(n_bins)

    choices=["anemoi", "fesom", "obs", "xarray", "cams", "eobs", "iconart", "iconesm"],
    help="Dataset type",
    )
    parser.add_argument("--channels", nargs="*", default=None, help="Variables to analyse")
Is it done for all channels by default?
There was a problem hiding this comment.
Yes, added a help comment to clarify this.
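For illustration, the `default=None` convention with `nargs="*"` is commonly resolved as below; `resolve_channels` is a hypothetical helper, not a function from the script:

```python
import argparse

parser = argparse.ArgumentParser()
parser.add_argument(
    "--channels",
    nargs="*",
    default=None,
    help="Variables to analyse (default: all channels in the dataset)",
)


def resolve_channels(requested, available):
    """None means the --channels flag was not passed: analyse every channel."""
    return list(available) if requested is None else list(requested)
```

With this pattern an empty `--channels` (flag given, no values) yields an empty list, while omitting the flag entirely selects all available channels.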
    n_sample_pairs: int = 100_000,
    seed: int = 42,
    ) -> tuple[float, NDArray, NDArray]:
        """Compute spatial autocorrelation for ragged per-time observations."""
Could we add a few sentences on what the computations do?
@tjhunter how can I put this into a "science" package? If you can do this trivially, that would be great :)

    Example usage::

        python scripts/compute_stream_masking.py \\
Could we correct this and also document the optional arguments (like channels)? What worked for me:
uv run python packages/science/compute_spatial_autocorrelation.py --dataset ./data/observations-ea-ofb-0001-2007-2021-metop-a-iasi-radiances-v1.zarr --type obs --n-time-samples 100 --n-sample-pairs 100000 --correlation-multiplier 1.5
Please update the usage example before merging.
* Improve support for latent losses (#1963)
* Revert 2D rope to false by default (#1967)
Set to True by accident
* Implementation of DataReaderMesh (#1840)
* First implementation of DataReaderMesh
* Move to datareaders extra
* ruff
* ruff2
* Undo ruff
* undo auto-linting
* correct typo in eval config (#1971)
* Added all-physical-streams option and x/y axis limits (#1972)
* Added all-physical-streams option and x/y axis limits
* Fix
* Changed flag for all streams
* Removed old code
* moved metric parsing to eval_from_config (#1977)
Co-authored-by: buschow1 <[email protected]>
* Fixed integration test (#1980)
* [1974][model] Add fallback to config loading (#1985)
* Add fallback to config loading
* Adjust error message to be not misleading
* Homegenize naming convention
* Introduce bias/diff maps and animations (#1912)
* Introduce bias/diff maps and animations
* minor correction
* Changes based on review
* Introduce "plot_bias" in evaluation configuration (#1986)
* Fixed index ordering to not have shuffled output (#1982)
* Fixed idxs_inv to revert data point shuffeling
* Fixed output handling
* Handling of empty data case, addressing reviewer comment
* [1893][eval] csvreader cleanup (#1906)
* refactor csvreader
* check if dataarray size is 0
* fix and use original logic for empty data
* linting fixes
* revert assertions back
* [1890][eval] Move MergeReader to own module (#1892)
* move mergereader
* use assertions only
* implement scoring for the sub-steps within the forecast window (#1896)
* work in progress
* working for forecast_step
* working version
* restore no valid times option
* lint
* Rename scale_z_channels to _scale_z_channels
* fix 1 sample bug
* Remove points_per_sample from ReaderOutput
Remove points_per_sample from ReaderOutput return.
* remove n_point_per_sample
* fix lead time coord in compute_scores
* lint
* fix integration test
* Fix integration test single stream (#1996)
* fix test single
* change yml extension and minor fixes
Co-authored-by: cosi1 <[email protected]>
Co-authored-by: cosi1 <[email protected]>
* [1907][eval] clean up wegen_reader.py (#1911)
* clean up wegen_reader.py
* remove exception
* consistent reader naming
* add blank line
* use assertions only
* make names consistent
* Merge branch 'develop' into 1907-wegenreader-cleanup
* revert is_regular
Co-authored-by: iluise <[email protected]>
Co-authored-by: Ilaria Luise <[email protected]>
* [1888][eval] Refactor Reader class (#1889)
* refactor Reader
* use assertion only
* fix npp atms
Co-authored-by: iluise <[email protected]>
Co-authored-by: Ilaria Luise <[email protected]>
* [1975][model] Load model path from private repo instead of json (#1998)
* Load model path from private repo instead of json
* Lint
* Script to compute spatial autocorrelation of structured/unstructured datasets (#1955)
* standalone script to compute spatial autocorrelation of variables in a structured or unstructured dataset
* remove commits that should be in pr 1951
* lint
* addressed comments
* removed last failure returning 500km default, and moved to packages science
* updated a note
* rename autocorrelation script
* update example usage
* Correct EMA halflife_steps calculation with rampup_ratio (#2001)
Corrected rampup calculation: https://github.com/NVlabs/edm2/blob/4bf8162f601bcc09472ce8a32dd0cbe8889dc8fc/training/phema.py#L145
Co-authored-by: Wael <[email protected]>
* Reduce verbosity of output during inference and evaluation (#2006)
* Fix incorrect length in validation progress bar
* Removing too verbose output
* [1766][1743][1332] lint and unit-test fix (#1802)
* [1766][1742] fix lint and unit-test
* [1766] fix linter
* [1766] lint local and global consistent
* [1332] add script to detect bad functions (getattr)
* code quality: lint and bad functions
* [1766] disable some checks
* [1877] Script to populate PR labels from linked issues (#1878)
* script
* branch
* more dirs
* typo
* enable
* Fixed bug in linear embedding (#2012)
* Adding forecast_steps feature to plot_train (#2010)
* Adding forecast_steps feature to plot_train
* Renamed arguement to conform to hyphen convention
* Added forecast step to filename
Co-authored-by: Seb Hickman <[email protected]>
Co-authored-by: Christian Lessig <[email protected]>
Co-authored-by: Kacper Nowak <[email protected]>
Co-authored-by: Till Hauer <[email protected]>
Co-authored-by: s6sebusc <[email protected]>
Co-authored-by: buschow1 <[email protected]>
Co-authored-by: Matthias Karlbauer <[email protected]>
Co-authored-by: Savvas Melidonis <[email protected]>
Co-authored-by: Michael Tarnawa <[email protected]>
Co-authored-by: iluise <[email protected]>
Co-authored-by: pierluigicosi <[email protected]>
Co-authored-by: cosi1 <[email protected]>
Co-authored-by: cosi1 <[email protected]>
Co-authored-by: Ilaria Luise <[email protected]>
Co-authored-by: Wael <[email protected]>
Co-authored-by: Simone Norberti <[email protected]>
Co-authored-by: Timothy Hunter <[email protected]>
Description
Opus-generated script to compute spatial autocorrelations in structured and unstructured datasets.
How it works:
Load data from an anemoi or obs dataset, or via xarray.
Optional anomaly computation: remove the climatology for gridded data, or approximate this with the spatial mean/std for unstructured data.
Estimate the spatial autocorrelation by sampling random pairs of points, computing the haversine (great-circle) distance between them, binning the pairs by distance, and computing the correlation per bin. The autocorrelation length is then fitted with a 1/e threshold. The number of sample pairs and time slices used for the estimate can be set by the user; the computation is cheap and only needs to be done once, so you can run with many samples. Claude did something annoying with fallbacks for fitting the correlation length (1/e threshold, then integrated scale, then log-linear), but I am not too worried about this.
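The pair-sampling estimate described above can be sketched as follows. This is a simplified, single-field illustration, not the script itself: the function names, the uniform binning, and the plain first-crossing of 1/e (without the integrated-scale or log-linear fallbacks) are assumptions.

```python
import numpy as np

EARTH_RADIUS_KM = 6371.0


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between points given in degrees, returned in km."""
    lat1, lon1, lat2, lon2 = map(np.radians, (lat1, lon1, lat2, lon2))
    a = (np.sin((lat2 - lat1) / 2.0) ** 2
         + np.cos(lat1) * np.cos(lat2) * np.sin((lon2 - lon1) / 2.0) ** 2)
    return 2.0 * EARTH_RADIUS_KM * np.arcsin(np.sqrt(a))


def correlation_length(lats, lons, values, n_pairs=100_000, n_bins=50, seed=42):
    """Estimate the 1/e spatial correlation length from random point pairs."""
    rng = np.random.default_rng(seed)
    i = rng.integers(0, len(values), n_pairs)
    j = rng.integers(0, len(values), n_pairs)
    d = haversine_km(lats[i], lons[i], lats[j], lons[j])
    anom = values - values.mean()
    var = anom.var()
    if var < 1e-12:
        raise ValueError("Near-zero variance: correlation length undefined")
    prod = anom[i] * anom[j]  # per-pair covariance contribution
    edges = np.linspace(0.0, d.max(), n_bins + 1)
    idx = np.digitize(d, edges) - 1
    corr = np.array([prod[idx == b].mean() / var if np.any(idx == b) else np.nan
                     for b in range(n_bins)])
    centers = 0.5 * (edges[:-1] + edges[1:])
    below = np.flatnonzero(corr < 1.0 / np.e)  # first bin below the 1/e threshold
    if below.size == 0:
        raise ValueError("Correlation never drops below 1/e in the sampled range")
    return centers[below[0]], centers, corr
```

Because each pair only contributes a product of anomalies to one distance bin, the cost is linear in the number of sampled pairs, which is why it is cheap enough to run with large `n_pairs`.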
The script additionally maps the spatial autocorrelation length to a suggested healpix masking level (scaled by a user-chosen coefficient), groups variables (for placement in the separate stream configs), and produces yaml snippets for per-stream masking overrides.
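The length-to-level mapping can be illustrated with the sketch below. It assumes the pixel size is approximated as the square root of the per-pixel area and that the suggested level is the finest one whose pixel size still exceeds multiplier × correlation length; the script's exact rule may differ.

```python
import math

EARTH_RADIUS_KM = 6371.0


def healpix_pixel_size_km(level: int) -> float:
    """Approximate pixel size at a healpix level (nside = 2**level),
    taken as the square root of the per-pixel area."""
    nside = 2 ** level
    npix = 12 * nside ** 2  # standard healpix pixel count
    return EARTH_RADIUS_KM * math.sqrt(4.0 * math.pi / npix)


def suggest_masking_level(corr_length_km: float, multiplier: float = 1.5,
                          max_level: int = 12) -> int:
    """Finest level whose pixel size still covers multiplier * correlation length."""
    target_km = multiplier * corr_length_km
    for level in range(max_level, -1, -1):  # scan from finest to coarsest
        if healpix_pixel_size_km(level) >= target_km:
            return level
    return 0  # correlation length exceeds even the coarsest pixel
```

Each level halves the pixel size, so a variable with a short correlation length is assigned a finer masking level than a smooth, large-scale one.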
Issue Number
Closes #1952
Checklist before asking for review
./scripts/actions.sh lint
./scripts/actions.sh unit-test
./scripts/actions.sh integration-test
launch-slurm.py --time 60