scan_grib unable to process spectra data #540

@gpickering-WEL


Hi there! I've been processing 2D wave spectra GRIB files using kerchunk and have found that it correctly identifies neither the coordinates nor the dimensions of the data. The variable d2fd has the dimensions (directionNumber, frequencyNumber, latitude, longitude), i.e. a 2D spectrum per lat/lon point. xarray and cfgrib read the data correctly, but kerchunk does not. See below:

cfgrib

xr.load_dataset("ec_spectra/T1P12180000010106001", engine="cfgrib")

<xarray.Dataset> Size: 3MB
Dimensions:          (directionNumber: 36, frequencyNumber: 29, latitude: 21,
                      longitude: 31)
Coordinates:
    number           int64 8B 0
    time             datetime64[ns] 8B 2024-12-18
    step             timedelta64[ns] 8B 14 days 06:00:00
    meanSea          float64 8B 0.0
  * directionNumber  (directionNumber) int64 288B 1 2 3 4 5 6 ... 32 33 34 35 36
  * frequencyNumber  (frequencyNumber) int64 232B 1 2 3 4 5 6 ... 25 26 27 28 29
  * latitude         (latitude) float64 168B -38.0 -38.1 -38.2 ... -39.9 -40.0
  * longitude        (longitude) float64 248B 141.0 141.1 141.2 ... 143.9 144.0
    valid_time       datetime64[ns] 8B 2025-01-01T06:00:00
Data variables:
    d2fd             (directionNumber, frequencyNumber, latitude, longitude) float32 3MB ...
Attributes:
    GRIB_edition:            1
    GRIB_centre:             ecmf
    GRIB_centreDescription:  European Centre for Medium-Range Weather Forecasts
    GRIB_subCentre:          0
    Conventions:             CF-1.7
    institution:             European Centre for Medium-Range Weather Forecasts
    history:                 2025-01-23T14:57 GRIB to CDM+CF via cfgrib-0.9.1...

kerchunk

import json

import fsspec
import xarray as xr
from kerchunk.grib2 import scan_grib

single_ref = scan_grib("./ec_spectra/T1P12180000010100001")
print(f"Length of single_ref: {len(single_ref)}")
with open("single_ref.json", "w") as f:
    json.dump(single_ref[0], f)

fs = fsspec.filesystem("reference", fo="single_ref.json")
xr.open_dataset(fs.get_mapper(""), engine="zarr")

Length of single_ref: 1
/home/gpickering/projects/welvops-Py/.venv/lib/python3.10/site-packages/xarray/backends/api.py:571: RuntimeWarning: Failed to open Zarr store with consolidated metadata, but successfully read with non-consolidated metadata. This is typically much slower for opening a dataset. To silence this warning, consider:
1. Consolidating metadata in this existing store with zarr.consolidate_metadata().
2. Explicitly setting consolidated=False, to avoid trying to read consolidate metadata, or
3. Explicitly setting consolidated=True, to raise an error in this case instead of falling back to try reading non-consolidated metadata.
  backend_ds = backend.open_dataset(
<xarray.Dataset> Size: 6kB
Dimensions:     (latitude: 21, longitude: 31)
Coordinates:
  * latitude    (latitude) float64 168B -38.0 -38.1 -38.2 ... -39.8 -39.9 -40.0
  * longitude   (longitude) float64 248B 141.0 141.1 141.2 ... 143.8 143.9 144.0
    meanSea     float64 8B ...
    number      int64 8B ...
    step        timedelta64[ns] 8B ...
    time        datetime64[ns] 8B ...
    valid_time  datetime64[ns] 8B ...
Data variables:
    d2fd        (latitude, longitude) float64 5kB ...
Attributes:
    GRIB_centre:             ecmf
    GRIB_centreDescription:  European Centre for Medium-Range Weather Forecasts
    GRIB_edition:            1
    GRIB_subCentre:          0
    institution:             European Centre for Medium-Range Weather Forecasts

I've been trying to understand the problem but unfortunately haven't reached a solution. Here are a few things I've noticed, though:

  • cfgrib handles the spectra keys (directionNumber and frequencyNumber) specially. kerchunk uses cfgrib.dataset.COORD_ATTRS in a few places to loop over possible coordinates, but that mapping contains neither directionNumber nor frequencyNumber.
  • _split_file doesn't break the GRIB file into its component messages (I think it's meant to?). The example file I've been testing has 1044 messages. I copied the ECMWF example on reading messages from a byte stream to create a new _split_file, but that wasn't a complete fix.
$ grib_ls ./notebooks/ec_spectra/T1P12180000010100001 | tail -n 10
1            ecmf         meanSea      0            20241218     336          fc           2dfd         grid_simple  regular_ll  
1            ecmf         meanSea      0            20241218     336          fc           2dfd         grid_simple  regular_ll  
1            ecmf         meanSea      0            20241218     336          fc           2dfd         grid_simple  regular_ll  
1            ecmf         meanSea      0            20241218     336          fc           2dfd         grid_simple  regular_ll  
1            ecmf         meanSea      0            20241218     336          fc           2dfd         grid_simple  regular_ll  
1            ecmf         meanSea      0            20241218     336          fc           2dfd         grid_simple  regular_ll  
1            ecmf         meanSea      0            20241218     336          fc           2dfd         grid_simple  regular_ll  
1044 of 1044 messages in ./notebooks/ec_spectra/T1P12180000010100001

1044 of 1044 total messages in 1 files
import io

from eccodes import codes_get, codes_new_from_message, codes_release


def _split_file2(f: io.FileIO, skip=0):
    # Read the whole file and walk it message by message, letting
    # eccodes report each message's total length in bytes.
    data = f.read()
    offset = 0
    part = 0

    while offset < len(data):
        h = codes_new_from_message(data[offset:])
        total_length = codes_get(h, "totalLength")
        yield offset, total_length, data[offset : offset + total_length]
        codes_release(h)
        offset += total_length
        part += 1
        if skip and part >= skip:
            break
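As a point of comparison, the total length can also be read straight from each message's indicator section without creating an eccodes handle per message. This is a pure-Python sketch of that idea, not kerchunk's actual implementation; it assumes well-formed files and does not handle the legacy length encoding used by edition-1 messages larger than 8 MB:

```python
def split_grib_messages(data: bytes):
    """Yield (offset, total_length, message_bytes) for each GRIB message.

    Reads the total length directly from the indicator section:
    octets 5-7 (3-byte big-endian) for edition 1, octets 9-16
    (8-byte big-endian) for edition 2.
    """
    offset = 0
    while True:
        # Messages may be preceded by padding or headers; find the magic.
        start = data.find(b"GRIB", offset)
        if start == -1:
            return
        edition = data[start + 7]
        if edition == 1:
            total_length = int.from_bytes(data[start + 4 : start + 7], "big")
        elif edition == 2:
            total_length = int.from_bytes(data[start + 8 : start + 16], "big")
        else:
            raise ValueError(f"unsupported GRIB edition {edition}")
        yield start, total_length, data[start : start + total_length]
        offset = start + total_length
```

This avoids a per-message eccodes round trip, which matters for a file with 1044 messages.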

Any suggestions would be greatly appreciated!
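Regarding the first point above, one low-effort experiment might be to register the spectra coordinates in cfgrib.dataset.COORD_ATTRS before calling scan_grib, so kerchunk's loops over that mapping can see them. This is speculative and untested; the attribute values below are my own guesses (the coordinate names are real GRIB keys, but cfgrib does not document entries for them):

```python
# Hypothetical attrs for the two spectra coordinates; the real entries in
# cfgrib.dataset.COORD_ATTRS carry units/long_name-style metadata, so these
# mimic that shape.
SPECTRA_COORD_ATTRS = {
    "directionNumber": {"units": "1", "long_name": "wave direction index"},
    "frequencyNumber": {"units": "1", "long_name": "wave frequency index"},
}


def register_spectra_coords():
    """Merge the spectra coordinates into cfgrib's coordinate table.

    Call before scan_grib. Untested sketch: whether kerchunk then picks up
    the extra dimensions is exactly what this issue is asking about.
    """
    import cfgrib.dataset  # deferred import; only needed when patching

    cfgrib.dataset.COORD_ATTRS.update(SPECTRA_COORD_ATTRS)
```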
