-
Notifications
You must be signed in to change notification settings - Fork 95
Open
Description
Hi there! I've been processing 2D wave spectra grib files using kerchunk and have found that it doesn't correctly identify the coordinates, nor the dimensions of the data . The variable d2fd has the dimensions (directionNumber, frequencyNumber, latitude, longitude), so a 2D array per lat/lon point. Xarray and cfgrib read the data format correctly, but kerchunk does not. See the below,
cfgrib
xr.load_dataset("ec_spectra/T1P12180000010106001", engine="cfgrib")
<xarray.Dataset> Size: 3MB
Dimensions: (directionNumber: 36, frequencyNumber: 29, latitude: 21,
longitude: 31)
Coordinates:
number int64 8B 0
time datetime64[ns] 8B 2024-12-18
step timedelta64[ns] 8B 14 days 06:00:00
meanSea float64 8B 0.0
* directionNumber (directionNumber) int64 288B 1 2 3 4 5 6 ... 32 33 34 35 36
* frequencyNumber (frequencyNumber) int64 232B 1 2 3 4 5 6 ... 25 26 27 28 29
* latitude (latitude) float64 168B -38.0 -38.1 -38.2 ... -39.9 -40.0
* longitude (longitude) float64 248B 141.0 141.1 141.2 ... 143.9 144.0
valid_time datetime64[ns] 8B 2025-01-01T06:00:00
Data variables:
d2fd (directionNumber, frequencyNumber, latitude, longitude) float32 3MB ...
Attributes:
GRIB_edition: 1
GRIB_centre: ecmf
GRIB_centreDescription: European Centre for Medium-Range Weather Forecasts
GRIB_subCentre: 0
Conventions: CF-1.7
institution: European Centre for Medium-Range Weather Forecasts
history: 2025-01-23T14:57 GRIB to CDM+CF via cfgrib-0.9.1...kerchunk
single_ref = scan_grib("./ec_spectra/T1P12180000010100001")
print(f"Lenght of single_ref: {len(single_ref)}")
with open("single_ref.json", "w") as f:
json.dump(single_ref[0], f)
fs = fsspec.filesystem("reference", fo="single_ref.json")
xr.open_dataset(fs.get_mapper(""), engine="zarr")
Lenght of single_ref: 1
/home/gpickering/projects/welvops-Py/.venv/lib/python3.10/site-packages/xarray/backends/api.py:571: RuntimeWarning: Failed to open Zarr store with consolidated metadata, but successfully read with non-consolidated metadata. This is typically much slower for opening a dataset. To silence this warning, consider:
1. Consolidating metadata in this existing store with zarr.consolidate_metadata().
2. Explicitly setting consolidated=False, to avoid trying to read consolidate metadata, or
3. Explicitly setting consolidated=True, to raise an error in this case instead of falling back to try reading non-consolidated metadata.
backend_ds = backend.open_dataset(
<xarray.Dataset> Size: 6kB
Dimensions: (latitude: 21, longitude: 31)
Coordinates:
* latitude (latitude) float64 168B -38.0 -38.1 -38.2 ... -39.8 -39.9 -40.0
* longitude (longitude) float64 248B 141.0 141.1 141.2 ... 143.8 143.9 144.0
meanSea float64 8B ...
number int64 8B ...
step timedelta64[ns] 8B ...
time datetime64[ns] 8B ...
valid_time datetime64[ns] 8B ...
Data variables:
d2fd (latitude, longitude) float64 5kB ...
Attributes:
GRIB_centre: ecmf
GRIB_centreDescription: European Centre for Medium-Range Weather Forecasts
GRIB_edition: 1
GRIB_subCentre: 0
institution: European Centre for Medium-Range Weather ForecastsI've been trying to understand the problem but unfortunately haven't reached a solution, here's a few things I've noticed though.
- cfgrib has a specific variable for spectra keys (directionNumber and frequencyNumber). Kerchunk uses cfgrib.dataset.COORD_ATTRS in a few places to loop over possible coordinates. This doesn't contain either directionNumber, nor frequencyNumber.
- _split_file doesn't break the grib file into its component messages (I think it's meant to?). The example file I've been testing has 1044 messages. I copied the ECMWF example on reading messages from a byte stream to create a new _split_file but that wasn't a complete fix.
$ grib_ls ./notebooks/ec_spectra/T1P12180000010100001 | tail -n 10
1 ecmf meanSea 0 20241218 336 fc 2dfd grid_simple regular_ll
1 ecmf meanSea 0 20241218 336 fc 2dfd grid_simple regular_ll
1 ecmf meanSea 0 20241218 336 fc 2dfd grid_simple regular_ll
1 ecmf meanSea 0 20241218 336 fc 2dfd grid_simple regular_ll
1 ecmf meanSea 0 20241218 336 fc 2dfd grid_simple regular_ll
1 ecmf meanSea 0 20241218 336 fc 2dfd grid_simple regular_ll
1 ecmf meanSea 0 20241218 336 fc 2dfd grid_simple regular_ll
1044 of 1044 messages in ./notebooks/ec_spectra/T1P12180000010100001
1044 of 1044 total messages in 1 filesdef _split_file2(f: io.FileIO, skip=0):
data = f.read()
offset = 0
part = 0
while offset < len(data):
h = codes_new_from_message(data[offset:])
total_length = codes_get(h, "totalLength")
yield offset, total_length, data[offset : offset + total_length]
codes_release(h)
offset += total_length
part += 1
if skip and part >= skip:
breakAny suggestions would be greatly appreciated!
Metadata
Metadata
Assignees
Labels
No labels