Loading Data from HDF files #3113

Open
@FeryET

Description

Is your feature request related to a problem? Please describe.
I often come across large HDF datasets, and there is currently no straightforward way to load them into a dataset.

Describe the solution you'd like
I would love to see a from_h5 method that accepts a user-implemented interface describing how items are extracted from the file (for cases where a file holds multiple HDF datasets containing elements such as arrays and metadata).
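
To make the idea concrete, here is a rough sketch of what such a user-supplied extractor could look like. This is not an existing `datasets` API: the names `iter_h5_examples`, `extract_fn`, `length_key`, and `extract_image_and_label`, as well as the file layout with `images` and `labels` datasets, are all hypothetical and only serve as illustration.

```python
from typing import Any, Callable, Dict, Iterator

import h5py

# Hypothetical type for the user-implemented "interface" from the request:
# given an open file and a row index, return one example as a dict.
ExtractFn = Callable[[h5py.File, int], Dict[str, Any]]


def iter_h5_examples(
    path: str, extract_fn: ExtractFn, length_key: str
) -> Iterator[Dict[str, Any]]:
    """Yield one example per row without loading the whole file into memory."""
    with h5py.File(path, "r") as f:
        # The first dimension of the dataset named by length_key sets the row count.
        n_rows = f[length_key].shape[0]
        for i in range(n_rows):
            yield extract_fn(f, i)


def extract_image_and_label(f: h5py.File, i: int) -> Dict[str, Any]:
    # Assumed layout: two aligned HDF5 datasets named "images" and "labels".
    return {"image": f["images"][i], "label": int(f["labels"][i])}
```

A generator like this could in principle be passed to `datasets.Dataset.from_generator`; whether that performs well for very large files is untested here.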

Describe alternatives you've considered
Currently I load HDF files manually with h5py and implement the PyTorch Dataset interface. For small h5 files, I load them into a pandas DataFrame and use the from_pandas function in the datasets package, but this is not feasible for large datasets.
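
For reference, the manual workaround looks roughly like the sketch below. It is a minimal version under assumptions: the class name `H5MapDataset` and the `data_key` / `label_key` parameters are made up for illustration, and the file is assumed to hold two aligned HDF5 datasets. The class only implements `__len__` and `__getitem__`, so it does not import torch; subclassing `torch.utils.data.Dataset` would be a small change.

```python
import h5py


class H5MapDataset:
    """Map-style dataset that reads rows lazily from an HDF5 file."""

    def __init__(self, path: str, data_key: str, label_key: str):
        # data_key / label_key: assumed names of two aligned HDF5 datasets.
        self.path = path
        self.data_key = data_key
        self.label_key = label_key
        # The read handle is opened on first access rather than here, so the
        # object stays picklable before any item has been read.
        self._file = None
        with h5py.File(path, "r") as f:
            self._length = f[data_key].shape[0]

    def __len__(self) -> int:
        return self._length

    def __getitem__(self, idx: int):
        if self._file is None:
            self._file = h5py.File(self.path, "r")
        f = self._file
        # Indexing an h5py dataset reads only the requested row from disk.
        return f[self.data_key][idx], f[self.label_key][idx]

    def close(self) -> None:
        if self._file is not None:
            self._file.close()
            self._file = None
```

This works, but every project ends up rewriting the same boilerplate, which is the motivation for the request.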

Additional context
HDF files are widespread across many domains and are one of the go-to formats for many researchers, scientists, and engineers who work with numerical data. Given that the use cases for datasets have outgrown NLP, it would make a lot of sense to focus on things like supporting HDF files.

Labels: enhancement (New feature or request), good second issue (Issues a bit more difficult than "Good First" issues)
