Description
Is your feature request related to a problem? Please describe.
I often work with large HDF5 datasets, and currently there is no straightforward way to feed them into a dataset.
Describe the solution you'd like
I would love to see a `from_h5` method that takes a user-implemented interface describing how items are extracted from the file (for cases where multiple HDF5 datasets hold elements such as arrays, metadata, etc.).
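To make the idea concrete, here is a rough sketch of what such an interface could look like. Note that `from_h5`, `extract_fn`, and the file layout below are all hypothetical, invented for illustration; nothing like this exists in `datasets` today. The toy implementation just yields one example dict per index, with the user deciding how an item is assembled from the file's datasets:

```python
import h5py
import numpy as np

# Hypothetical sketch of the proposed API. `from_h5` and its
# `extract_fn` argument are NOT part of `datasets`; this only
# illustrates the kind of user-supplied extraction interface meant above.
def from_h5(path, extract_fn):
    """Yield one example dict per index, as built by the user's extractor."""
    with h5py.File(path, "r") as f:
        n = min(len(f[k]) for k in f.keys())  # shortest dataset bounds the length
        for i in range(n):
            yield extract_fn(f, i)

# Build a small example file with two HDF5 datasets (arrays + labels).
with h5py.File("example.h5", "w") as f:
    f.create_dataset("signals", data=np.arange(12, dtype=np.float32).reshape(4, 3))
    f.create_dataset("labels", data=np.array([0, 1, 0, 1]))

# The user decides how an item is assembled from the file's datasets.
def extract(f, i):
    return {"signal": f["signals"][i], "label": int(f["labels"][i])}

examples = list(from_h5("example.h5", extract))
```

The key point is the second argument: the library would handle iteration and memory management, while the user only specifies how one item is pulled out of the (possibly several) HDF5 datasets in the file.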
Describe alternatives you've considered
Currently I manually load HDF5 files using h5py
and implement the PyTorch dataset interface. For small HDF5 files I load them into a pandas DataFrame and use the `from_pandas`
function in the `datasets`
package, but for big datasets this is not feasible.
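For reference, the workaround looks roughly like this: a map-style class implementing the `__len__`/`__getitem__` protocol that PyTorch's `Dataset` expects, reading lazily from the file via h5py so large files are never pulled into memory at once. The class name, keys, and file layout below are illustrative assumptions, not anything from the library:

```python
import h5py
import numpy as np

# Sketch of the manual workaround described above: a map-style dataset
# (the __len__/__getitem__ protocol PyTorch's Dataset uses) that reads
# slices lazily from an HDF5 file instead of loading it whole.
class H5Dataset:
    def __init__(self, path, data_key, label_key):
        self.data_key = data_key
        self.label_key = label_key
        # h5py reads slices from disk on access, so this stays cheap.
        self._file = h5py.File(path, "r")

    def __len__(self):
        return len(self._file[self.data_key])

    def __getitem__(self, i):
        return (self._file[self.data_key][i],
                int(self._file[self.label_key][i]))

# Build a tiny example file so the sketch runs end to end.
with h5py.File("toy.h5", "w") as f:
    f.create_dataset("x", data=np.arange(6, dtype=np.float32).reshape(3, 2))
    f.create_dataset("y", data=np.array([1, 0, 1]))

ds = H5Dataset("toy.h5", "x", "y")
```

This works, but every user ends up re-writing the same boilerplate, which is what a built-in `from_h5` would avoid.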
Additional context
HDF5 files are widespread across domains and are one of the go-to formats for many researchers, scientists, and engineers who work with numerical data. Given that `datasets`
' use cases have outgrown NLP, it would make a lot of sense to focus on things like supporting HDF5 files.