Over-aggressive optimisation of b-tree reading #135

@bnlawrence

Description

Currently, whenever we do something like

import pyfive

f = pyfive.File(file_path)
for name in f.keys():
    ds = f[name]
    # ... stuff ...

we read all the metadata and the b-tree for every dataset. This is not optimal for use cases like #134 (particularly on object-store data).

This is fine in many cases, but not all: in particular, CMIP data (multi-GB files with multiple coordinate variables) on object store takes an eternity (well, it feels like it), even when nicely chunked.

We made a big deal about the utility of having done this, because it means we can close the file at this point and use the ds variable later for read operations; those reads can open and close the file cheaply if all they are doing is opening and closing at the filesystem level, even on object stores, provided we are within an s3fs filesystem context.
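
For context, the workflow this enables looks roughly like the following. This is only a sketch based on the description above, assuming the h5py-style close() and indexing that pyfive mirrors; the variable name 'tas' is illustrative.

import pyfive

# Open the file, pull out the dataset objects, then close the file.
f = pyfive.File(file_path)
datasets = {name: f[name] for name in f.keys()}
f.close()

# Later, possibly in a parallel worker: because the metadata and
# b-trees were read eagerly, the dataset objects can still drive
# reads, re-opening the underlying file (or s3fs object) cheaply.
data = datasets['tas'][...]  # 'tas' is an illustrative variable name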

It seems like we could and should row back a bit. I think there are effectively two options (with variants of each):

  1. We don't load the b-tree until the first time data is read (but then we lose the benefit of having closed the file, which we think is helpful in a complex parallel environment).
  2. We introduce a new API (or modify an existing one) somewhere to lazily load dataset metadata without instantiating the b-trees (see the sketch after this list).
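
As a rough illustration of option 2, something like the following could keep dataset construction cheap and only walk the chunk b-tree on first data access. This is only a sketch: LazyBTreeDataset, read_header and read_btree are hypothetical names standing in for pyfive internals, not existing API.

from functools import cached_property

class LazyBTreeDataset:
    """Sketch: eager object-header metadata, lazy chunk b-tree."""

    def __init__(self, source, read_header, read_btree):
        self._source = source
        self._read_btree = read_btree
        # Cheap: shape/dtype/attributes come from the object header alone.
        self.meta = read_header(source)

    @cached_property
    def _btree(self):
        # Expensive: only evaluated once, on the first data read.
        return self._read_btree(self._source)

    def read(self, selection):
        # Touching self._btree is what finally triggers the b-tree walk;
        # pure metadata queries (self.meta) never pay this cost.
        chunk_index = self._btree
        # ... locate and fetch the requested chunks using chunk_index ...
        return chunk_index, selection

With this shape of API, iterating over f.keys() stays cheap and only datasets that are actually read pay the b-tree cost; the trade-off, as noted in option 1, is that the underlying file or object store must still be reachable at first-read time.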
