Currently, whenever we do something like
```python
import pyfive

f = pyfive.File(file_path)
for name in f.keys():
    ds = f[name]
    # ... stuff ...
```
we are reading all the metadata and b-tree for each dataset. This is not optimal for use-cases like #134 (particularly on object store data).
This is fine in many cases, but not all: in particular, CMIP data (multi-GB files) with multiple coordinate variables on object store takes an eternity (or at least it feels like it), even when nicely chunked.
We made a big deal about the utility of this approach, since it means we can close the file at this point and use the `ds` variable later for read operations (which can open and close the file cheaply if all they do is open/close at the filesystem level, even on object stores, provided we are within an s3fs filesystem context).
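For concreteness, a hedged sketch of that pattern on object store data (the bucket, path, variable name, and anonymous-access settings are made up for illustration; it assumes `pyfive.File` accepts a file-like object opened through s3fs):

```python
import s3fs
import pyfive

# Illustrative endpoint and path only.
fs = s3fs.S3FileSystem(anon=True)

with fs.open("some-bucket/cmip/example_file.nc", "rb") as s3file:
    f = pyfive.File(s3file)
    # Instantiating every dataset here reads all the metadata and b-trees up front.
    datasets = {name: f[name] for name in f.keys()}

# The file object is now closed, but (per the behaviour described above) the
# dataset handles can still be used for reads later, reopening the file cheaply
# as long as we remain within an s3fs filesystem context.
some_values = datasets["tas"][0:10]
```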
It seems like we could and should row back a bit. I think there are effectively two options (each with variants):
- We don't load the b-tree until data is first read (but then we lose the benefit of having closed the file, which we think is helpful in a complex parallel environment); see the sketch after this list.
- We introduce a new (or modify an existing) API somewhere to lazily load dataset metadata without instantiating the b-trees.
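To make the first option concrete, here is a minimal sketch of deferring the b-tree walk behind a lazily evaluated property. The class and the callables it takes are hypothetical, not existing pyfive internals:

```python
from functools import cached_property

class LazyDataset:
    """Sketch of a dataset handle that records how to find its chunk index
    but defers the expensive b-tree walk until data is first requested."""

    def __init__(self, open_file, locate_btree, read_data):
        # open_file:    callable returning an open file-like object (e.g. via s3fs)
        # locate_btree: callable(fh) -> chunk index / b-tree (expensive on object store)
        # read_data:    callable(fh, btree, key) -> array of values
        self._open_file = open_file
        self._locate_btree = locate_btree
        self._read_data = read_data

    @cached_property
    def _btree(self):
        # Paid only once, on the first data access, not at f[name] time.
        with self._open_file() as fh:
            return self._locate_btree(fh)

    def __getitem__(self, key):
        # Reopen the file just long enough to read the requested chunks.
        with self._open_file() as fh:
            return self._read_data(fh, self._btree, key)
```

The same mechanism could equally sit behind a new lazy-access API (the second option); the key point either way is that the chunk index is only fetched when data is actually requested.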