Enable statistics recomputation when concat pre-existing anemoi-datasets #495

@yoel-zerah

Description

Currently, anemoi-datasets doesn't recompute statistics when combining multiple datasets with concat (see https://anemoi.readthedocs.io/projects/datasets/en/latest/using/statistics.html#statistics and https://anemoi.readthedocs.io/projects/datasets/en/latest/using/combining.html#concat).

If possible, I would be interested in an option to recompute these statistics.

In particular, I ran into an issue that I believe is related to how statistics are handled.
I have several anemoi-datasets (subsets of CERRA) already on disk that I would like to assemble into a single dataset, covering the following periods:

  • 1985 - 1989
  • 1990 - 1999
  • 2000 - 2009
  • 2010 - 2019
  • 2020
  • 2021
  • 2022
  • 2025

These datasets were generated from MARS using the following recipe (with XXXX and YYYY replaced by the start and end years):

description: |
  Copernicus European Regional Reanalysis

name: cerra-rr-an-oper-se-al-ec-mars-5p5km-XXXX-YYYY-3h-v1

dates:
  end: YYYY-12-31T18:00:00
  frequency: 3h
  start: XXXX-01-01T00:00:00

mars_common: &mars_common
  class: rr
  expver: prod
  origin: se-al-ec
  stream: oper

accum_base: &accum_base
  <<: *mars_common
  levtype: sfc
  type: fc
  param: ["tp","ssrd","strd"]

input:
  join:
  - mars:
      <<: *mars_common
      levtype: sfc
      type: an
      param: [10si, 10wdir, 2t, 2r, msl, sp, tcc, tciwv, sr, orog, lsm]

  - mars:
      # Maximum 10 metre wind gust since previous post-processing (10fg): 49
      # Surface long-wave (thermal) radiation downwards (strd): 175
      # Surface net long-wave (thermal) radiation (str): 177
      # Surface net short-wave (solar) radiation (ssr): 176
      # Surface short-wave (solar) radiation downwards (ssrd): 169
      # Maximum temperature at 2 metres since previous post-processing (mx2t): 201
      # Minimum temperature at 2 metres since previous post-processing (mn2t): 202
      <<: *mars_common
      levtype: sfc
      type: fc
      step: 3
      param: [201, 202, 49]
  - mars:
      <<: *mars_common
      levtype: hl
      type: an
      levelist: 100
      param: [ws, wdir]
  - constants:
      template: ${input.join.0.mars}
      param:
      - cos_latitude
      - cos_longitude
      - sin_latitude
      - sin_longitude
      - cos_julian_day
      - cos_local_time
      - sin_julian_day
      - sin_local_time
      - insolation
  # Precipitation
  - concat:
    # VALID TIME: 21Z - Forecast: 12Z - step (9 - 6)
    - dates:
        start: XXXX-01-01T21:00:00
        end: YYYY-12-31T21:00:00
        frequency: 24h
      accumulations:
        <<: *accum_base
        time: [12]
        accumulation_period: [6, 9]
    # VALID TIME: 00Z - Forecast: 12Z previous day - step (12 - 9)
    - dates:
        start: XXXX-01-01T00:00:00
        end: YYYY-12-31T00:00:00
        frequency: 24h
      accumulations:
        <<: *accum_base
        time: [12]
        accumulation_period: [9, 12]
    # VALID TIME: 03Z - Forecast: 12Z previous day - step (15 - 12)
    - dates:
        start: XXXX-01-01T03:00:00
        end: YYYY-12-31T03:00:00
        frequency: 24h
      accumulations:
        <<: *accum_base
        time: [12]
        accumulation_period: [12, 15]
    # VALID TIME: 06Z - Forecast: 12Z previous day - step (18 - 15)
    - dates:
        start: XXXX-01-01T06:00:00
        end: YYYY-12-31T06:00:00
        frequency: 24h
      accumulations:
        <<: *accum_base
        time: [12]
        accumulation_period: [15, 18]
    # VALID TIME: 09Z - Forecast: 00Z - step (9 - 6)
    - dates:
        start: XXXX-01-01T09:00:00
        end: YYYY-12-31T09:00:00
        frequency: 24h
      accumulations:
        <<: *accum_base
        time: [0]
        accumulation_period: [6, 9]
    # VALID TIME: 12Z - Forecast: 00Z - step (12 - 9)
    - dates:
        start: XXXX-01-01T12:00:00
        end: YYYY-12-31T12:00:00
        frequency: 24h
      accumulations:
        <<: *accum_base
        time: [0]
        accumulation_period: [9, 12]
    # VALID TIME: 15Z - Forecast: 00Z - step (15 - 12)
    - dates:
        start: XXXX-01-01T15:00:00
        end: YYYY-12-31T15:00:00
        frequency: 24h
      accumulations:
        <<: *accum_base
        time: [0]
        accumulation_period: [12, 15]
    # VALID TIME: 18Z - Forecast: 00Z - step (18 - 15)
    - dates:
        start: XXXX-01-01T18:00:00
        end: YYYY-12-31T18:00:00
        frequency: 24h
      accumulations:
        <<: *accum_base
        time: [0]
        accumulation_period: [15, 18]

These datasets have no missing dates.
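
Continuity across dataset boundaries can be checked separately from continuity within each dataset. The following is a minimal sketch, assuming each dataset starts and ends as in the `dates` block of the recipe above; `recipe_span` and `boundary_gaps` are hypothetical helper names and not part of anemoi-datasets.

```python
from datetime import datetime, timedelta


def recipe_span(first_year, last_year):
    """(start, end) of one dataset, mirroring the recipe's `dates` block (assumption)."""
    return datetime(first_year, 1, 1, 0), datetime(last_year, 12, 31, 18)


def boundary_gaps(spans, frequency):
    """Return (previous_end, next_start, step) for every boundary whose step
    differs from the expected frequency."""
    gaps = []
    for (_, prev_end), (next_start, _) in zip(spans, spans[1:]):
        step = next_start - prev_end
        if step != frequency:
            gaps.append((prev_end, next_start, step))
    return gaps


# Three of the periods listed above, used purely as an illustration.
spans = [recipe_span(1985, 1989), recipe_span(1990, 1999), recipe_span(2000, 2009)]
for prev_end, next_start, step in boundary_gaps(spans, timedelta(hours=3)):
    print(f"{prev_end.isoformat()} -> {next_start.isoformat()}: step of {step}")
```

Any boundary reported here has a step other than 3 hours between the last date of one dataset and the first date of the next, which may be worth ruling out before looking at the statistics themselves.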

But when combining them with concat, I get the following error when running anemoi-datasets finalise:

--- Finalising dataset ---
2025-12-03 16:01:24 INFO 🎬 Task finalise((),{}) starting
Traceback (most recent call last):
  File "anemoi-datasets", line 10, in <module>
    sys.exit(main())
             ^^^^^^
  File "anemoi/datasets/__main__.py", line 33, in main
    cli_main(__version__, __doc__, COMMANDS)
  File "anemoi/utils/cli.py", line 266, in cli_main
    cmd.run(args)
  File "anemoi/datasets/commands/finalise.py", line 59, in run
    task(step, options)
  File "anemoi/datasets/commands/create.py", line 53, in task
    result = c.run()
             ^^^^^^^
  File "anemoi/datasets/create/__init__.py", line 1579, in run
    t.run()
  File "anemoi/datasets/create/__init__.py", line 1519, in run
    stats = self.tmp_statistics.get_aggregated(dates, variables, self.allow_nans)
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "anemoi/datasets/create/statistics/__init__.py", line 397, in get_aggregated
    aggregator = StatAggregator(self, *args, **kwargs)
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "anemoi/datasets/create/statistics/__init__.py", line 452, in __init__
    self._read()
  File "anemoi/datasets/create/statistics/__init__.py", line 507, in _read
    assert d in found, f"Statistics for date {d} not precomputed."
           ^^^^^^^^^^
AssertionError: Statistics for date 1989-12-01T00:00:00 not precomputed.

I interpret this error as anemoi attempting to aggregate the statistics but being unable to, because the statistics were precomputed for each anemoi-dataset separately and not over the full date range.
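
To illustrate what recomputing statistics for a concatenation could look like, here is a minimal sketch that merges per-dataset summary statistics (count, mean, sum of squared deviations, minimum, maximum) without rereading the data, using the standard pairwise update for mean and variance. This is not how anemoi-datasets implements its statistics; the `Stats` class and the `stats_of` and `merge` functions are hypothetical names chosen for this example.

```python
import math
from dataclasses import dataclass


@dataclass
class Stats:
    count: int
    mean: float
    m2: float  # sum of squared deviations from the mean
    minimum: float
    maximum: float

    @property
    def stdev(self):
        # Population standard deviation.
        return math.sqrt(self.m2 / self.count)


def stats_of(values):
    """Compute the statistics of one dataset directly from its values."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values)
    return Stats(n, mean, m2, min(values), max(values))


def merge(a, b):
    """Combine the statistics of two disjoint datasets."""
    n = a.count + b.count
    delta = b.mean - a.mean
    mean = a.mean + delta * b.count / n
    m2 = a.m2 + b.m2 + delta**2 * a.count * b.count / n
    return Stats(n, mean, m2, min(a.minimum, b.minimum), max(a.maximum, b.maximum))


first = [1.0, 2.0, 3.0, 4.0]
second = [10.0, 20.0]
merged = merge(stats_of(first), stats_of(second))
direct = stats_of(first + second)
print(merged.mean, direct.mean)
print(merged.stdev, direct.stdev)
```

The merged values agree with the ones computed over the concatenated values up to floating-point rounding, so in principle only the per-dataset counts and moments are needed to obtain statistics for the full date range.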

Metadata

Labels: enhancement (New feature or request)

Status: To be triaged
