Corrupted Data from pangeo-cmip6 dataset #738

@danbrowne-coder

Description

I am trying to download the dataset at "gs://cmip6/CMIP6/ScenarioMIP/CSIRO-ARCCSS/ACCESS-CM2/ssp585/r1i1p1f1/day/pr/gn/v20210317/". I found the path with this script:

    import intake

    cat_url = "https://storage.googleapis.com/cmip6/pangeo-cmip6.json"
    cat = intake.open_esm_datastore(cat_url)
    # Narrow the catalog to daily precipitation for one model, member, and scenario.
    cat_subset2 = cat.search(
        source_id="ACCESS-CM2",
        member_id=["r1i1p1f1"],
        experiment_id=["ssp585"],
        variable_id=["pr"],
        table_id=["day"],
    )
    # aggregate=False requests one dataset per asset, with no aggregation.
    dset_dict = cat_subset2.to_dataset_dict(
        xarray_open_kwargs={"use_cftime": False, "decode_times": True, "consolidated": True},
        aggregate=False,
        storage_options={"token": "anon"},
    )

However, to_dataset_dict fails and raises an ESMDataSourceError. I suspect the path is being corrupted: an extra version suffix (.20210317) appears to be appended to it, so the key can't be found. See the stack trace:

Traceback (most recent call last):
...
  File "/home/dan/.cache/pypoetry/virtualenvs/seas-DLXvwKbf-py3.9/lib/python3.9/site-packages/intake_esm/source.py", line 208, in _get_schema
    self._open_dataset()
  File "/home/dan/.cache/pypoetry/virtualenvs/seas-DLXvwKbf-py3.9/lib/python3.9/site-packages/intake_esm/source.py", line 264, in _open_dataset
    raise ESMDataSourceError(
intake_esm.source.ESMDataSourceError: Failed to load dataset with key='ScenarioMIP.CSIRO-ARCCSS.ACCESS-CM2.ssp585.r1i1p1f1.day.pr.gn.gs://cmip6/CMIP6/ScenarioMIP/CSIRO-ARCCSS/ACCESS-CM2/ssp585/r1i1p1f1/day/pr/gn/v20210317/.20210317'
                 You can use `cat['ScenarioMIP.CSIRO-ARCCSS.ACCESS-CM2.ssp585.r1i1p1f1.day.pr.gn.gs://cmip6/CMIP6/ScenarioMIP/CSIRO-ARCCSS/ACCESS-CM2/ssp585/r1i1p1f1/day/pr/gn/v20210317/.20210317'].df` to inspect the assets/files for this key.
   

Following the suggestion in the error message does not work:
cat['ScenarioMIP.CSIRO-ARCCSS.ACCESS-CM2.ssp585.r1i1p1f1.day.pr.gn.gs://cmip6/CMIP6/ScenarioMIP/CSIRO-ARCCSS/ACCESS-CM2/ssp585/r1i1p1f1/day/pr/gn/v20210317/.20210317'].df

But this does work:
cat['ScenarioMIP.CSIRO-ARCCSS.ACCESS-CM2.ssp585.day.gn'].df
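One possible reading of the long key, offered as a guess rather than confirmed intake-esm behaviour: it looks like the values of a single catalog row joined with ".", in which case the trailing .20210317 would be a separate version field rather than a suffix appended to the path. The sketch below only shows that joining those values reproduces the key from the traceback. The column names are assumptions; only the values are taken from the error message.

```python
# Hypothetical column names; the values are copied from the key in the traceback.
row = {
    "activity_id": "ScenarioMIP",
    "institution_id": "CSIRO-ARCCSS",
    "source_id": "ACCESS-CM2",
    "experiment_id": "ssp585",
    "member_id": "r1i1p1f1",
    "table_id": "day",
    "variable_id": "pr",
    "grid_label": "gn",
    "zstore": (
        "gs://cmip6/CMIP6/ScenarioMIP/CSIRO-ARCCSS/ACCESS-CM2/"
        "ssp585/r1i1p1f1/day/pr/gn/v20210317/"
    ),
    "version": "20210317",
}

# Join every value with "." in column order (dicts preserve insertion order).
key = ".".join(row.values())
print(key)
```

If that reading is right, the path inside the key is intact, and the shorter key that does work would simply be built from fewer columns.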

What I Did

So I used the path directly to open the Zarr store, and found what looks like an issue with this data source. Could the time index be corrupted? The time units are "days since 2251-01-01".
So does the dataset run from 2251 to 2300? How can I access the rest of the dataset?

import intake
import intake_esm
import requests
import aiohttp
import xarray as xr
import dask
import gcsfs


# Open the store directly; decode_times=False keeps the raw integer time axis.
pr_ssp585 = xr.open_zarr(
    "gs://cmip6/CMIP6/ScenarioMIP/CSIRO-ARCCSS/ACCESS-CM2/ssp585/r1i1p1f1/day/pr/gn/v20210317/",
    consolidated=True,
    use_cftime=False,
    decode_times=False,
)


pr_ssp585

<xarray.Dataset>
Dimensions:    (lat: 144, bnds: 2, lon: 192, time: 18262)
Coordinates:
  * lat        (lat) float64 -89.38 -88.12 -86.88 -85.62 ... 86.88 88.12 89.38
    lat_bnds   (lat, bnds) float64 dask.array<chunksize=(144, 2), meta=np.ndarray>
  * lon        (lon) float64 0.9375 2.812 4.688 6.562 ... 355.3 357.2 359.1
    lon_bnds   (lon, bnds) float64 dask.array<chunksize=(192, 2), meta=np.ndarray>
  * time       (time) int64 0 1 2 3 4 5 ... 18256 18257 18258 18259 18260 18261
    time_bnds  (time, bnds) float64 dask.array<chunksize=(9131, 2), meta=np.ndarray>
Dimensions without coordinates: bnds
Data variables:
    pr         (time, lat, lon) float32 dask.array<chunksize=(495, 144, 192), meta=np.ndarray>
Attributes: (12/50)
    Conventions:            CF-1.7 CMIP-6.2
    activity_id:            ScenarioMIP
    branch_method:          standard
    branch_time_in_child:   60265.0
    branch_time_in_parent:  60265.0
    cmor_version:           3.4.0
    ...                     ...
    title:                  ACCESS-CM2 output prepared for CMIP6
    tracking_id:            hdl:21.14100/d3a15390-8afe-4503-9669-da9b50bd9c99
    variable_id:            pr
    variant_label:          r1i1p1f1
    version:                v20210317
    version_id:             v20210317


pr_ssp585["time"]

<xarray.DataArray 'time' (time: 18262)>
array([    0,     1,     2, ..., 18259, 18260, 18261])
Coordinates:
  * time     (time) int64 0 1 2 3 4 5 6 ... 18256 18257 18258 18259 18260 18261
Attributes:
    axis:           T
    bounds:         time_bnds
    calendar:       proleptic_gregorian
    long_name:      time
    standard_name:  time
    units:          days since 2251-01-01 12:00:00.000000
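As a sanity check on the time axis, here is a minimal sketch using only the standard library. It converts the first and last integer steps to dates from the units, calendar, and length shown above. It also compares them with the largest date a 64-bit nanosecond timestamp can represent, on the assumption (not confirmed here) that decoding with use_cftime=False targets that type.

```python
from datetime import datetime, timedelta

# Values copied from the `time` coordinate and its attributes shown above.
origin = datetime(2251, 1, 1, 12)  # units: days since 2251-01-01 12:00:00
n_steps = 18262                    # length of the time dimension

# Python's datetime uses the proleptic Gregorian calendar, which matches the
# `calendar` attribute, so plain timedelta arithmetic is valid here.
first = origin
last = origin + timedelta(days=n_steps - 1)

# Largest instant a signed 64-bit count of nanoseconds since 1970-01-01 can
# hold, truncated to microseconds because datetime has no nanosecond field.
ns_limit = datetime(1970, 1, 1) + timedelta(microseconds=(2**63 - 1) // 1000)

print("first step:", first)
print("last step: ", last)
print("64-bit nanosecond limit:", ns_limit)
print("time axis exceeds that limit:", last > ns_limit)
```

By this arithmetic the 18262 daily steps cover 2251-01-01 through 2300-12-31, so the index is internally consistent rather than obviously corrupted. The end of the axis lies beyond the 64-bit nanosecond limit in 2262, which may be related to the failure when decode_times=True is combined with use_cftime=False.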

Version information: output of intake_esm.show_versions()

INSTALLED VERSIONS

cftime: 1.6.4.post1
dask: 2023.12.1
fastprogress: 1.0.3
fsspec: 2025.7.0
gcsfs: 2025.7.0
intake: 0.6.8
intake_esm: 2023.11.10
netCDF4: 1.7.2
pandas: 2.3.2
requests: 2.32.5
s3fs: None
xarray: 2023.12.0
zarr: 2.18.2
