Over-aggressive optimisation of b-tree reading #135

@bnlawrence

Description

Currently, whenever we do something like

import pyfive

f = pyfive.File(file_path)
for name in f.keys():
    ds = f[name]
    # ... stuff ...

we read all the metadata and the b-tree for every dataset. This is not optimal for use cases like #134 (particularly on object-store data).

This is fine in many cases, but not all: in particular, CMIP data (multi-GB files with multiple coordinate variables) on object store takes an eternity (well, it feels like it), even when nicely chunked.

We made a big deal about the utility of having done this, because it means we can close the file at this point and use the ds variable later for read operations; those reads can open and close the file cheaply if all they are doing is opening and closing at the filesystem level, even on object stores, provided we are within an s3fs filesystem context.
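
For context, the workflow this enables looks roughly like the following. This is only a sketch based on the description above, assuming the h5py-style close() and indexing that pyfive mirrors; the variable name 'tas' is illustrative.

import pyfive

# Open the file, pull out the dataset objects, then close the file.
f = pyfive.File(file_path)
datasets = {name: f[name] for name in f.keys()}
f.close()

# Later, possibly in a parallel worker: because the metadata and
# b-trees were read eagerly, the dataset objects can still drive
# reads, re-opening the underlying file (or s3fs object) cheaply.
data = datasets['tas'][...]  # 'tas' is an illustrative variable name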

It seems like we could and should row back a bit. I think there are effectively two options (with variants of each):

  1. We don't load the b-tree until the first time data is read (but then we lose the benefit of having closed the file, which we think is helpful in a complex parallel environment).
  2. We introduce a new API (or modify an existing one) somewhere to lazily load dataset metadata without instantiating the b-trees (see the sketch after this list).
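
As a rough illustration of option 2, something like the following could keep dataset construction cheap and only walk the chunk b-tree on first data access. This is only a sketch: LazyBTreeDataset, read_header and read_btree are hypothetical names standing in for pyfive internals, not existing API.

from functools import cached_property

class LazyBTreeDataset:
    """Sketch: eager object-header metadata, lazy chunk b-tree."""

    def __init__(self, source, read_header, read_btree):
        self._source = source
        self._read_btree = read_btree
        # Cheap: shape/dtype/attributes come from the object header alone.
        self.meta = read_header(source)

    @cached_property
    def _btree(self):
        # Expensive: only evaluated once, on the first data read.
        return self._read_btree(self._source)

    def read(self, selection):
        # Touching self._btree is what finally triggers the b-tree walk;
        # pure metadata queries (self.meta) never pay this cost.
        chunk_index = self._btree
        # ... locate and fetch the requested chunks using chunk_index ...
        return chunk_index, selection

With this shape of API, iterating over f.keys() stays cheap and only datasets that are actually read pay the b-tree cost; the trade-off, as noted in option 1, is that the underlying file or object store must still be reachable at first-read time.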
