Add GenericHDF5Reader #6356

Open
wants to merge 71 commits into base: master

71 commits
ee4fffd
Release 3.7.1
nikicc Nov 17, 2017
badfd66
Merge release-3.8.0
lanzagar Dec 1, 2017
6903e6f
Merge tag '3.9.0' into stable
lanzagar Jan 19, 2018
bb7876f
Merge tag '3.9.1' into stable
lanzagar Feb 2, 2018
7f4cdcb
Merge tag '3.10.0' into stable
lanzagar Feb 19, 2018
91e16e8
Merge tag '3.11.0' into stable
PrimozGodec Mar 7, 2018
cdd2135
Merge tag '3.12.0' into stable
lanzagar Apr 6, 2018
2e90567
Merge tag '3.13.0' into stable
PrimozGodec Apr 17, 2018
d5d4311
Merge branch 'stable' of github.com:biolab/orange3 into stable
PrimozGodec Apr 17, 2018
04b6175
seting right version for stable-3.13.0
PrimozGodec Apr 17, 2018
3a37226
Merge tag '3.14.0' into stable
lanzagar Jul 4, 2018
51fef0d
Merge tag '3.15.0' into stable
lanzagar Aug 6, 2018
a67f953
Merge tag '3.16.0' into stable
PrimozGodec Sep 14, 2018
6991a89
Merge tag '3.17.0' into stable
lanzagar Oct 26, 2018
3b357d8
Merge tag '3.18.0' into stable
lanzagar Nov 13, 2018
a47b602
Merge tag '3.19.0' into stable
lanzagar Dec 11, 2018
7b9a169
Merge tag '3.20.0' into stable
lanzagar Feb 1, 2019
53fd850
Merge tag '3.20.1' into stable
lanzagar Feb 12, 2019
befd767
Merge tag '3.21.0' into stable
markotoplak May 20, 2019
c258a62
Merge tag '3.22.0' into stable
PrimozGodec Jun 26, 2019
ab72369
Merge tag '3.23.0' into stable
markotoplak Sep 5, 2019
084ddf4
Bump version to 3.23.1
lanzagar Oct 3, 2019
9dda0b4
Update requirements
lanzagar Oct 3, 2019
4d225c4
Release 3.23.1
lanzagar Oct 3, 2019
ebfbd37
Merge tag '3.23.1' into stable
lanzagar Oct 3, 2019
a31d8f0
Merge tag '3.24.0' into stable
markotoplak Dec 20, 2019
5506df1
Merge tag '3.24.1' into stable
markotoplak Jan 17, 2020
4d8b2bf
Merge tag '3.25.0' into stable
markotoplak Apr 10, 2020
3cc5bfd
Merge tag '3.25.1' into stable
markotoplak May 22, 2020
4d120cf
Merge tag '3.26.0' into stable
PrimozGodec Jun 12, 2020
b1b8c61
Merge tag '3.27.0' into stable
PrimozGodec Oct 9, 2020
16d2bcd
Merge tag '3.27.1' into stable
markotoplak Oct 23, 2020
722eabc
Merge tag '3.28.0' into stable
markotoplak Mar 5, 2021
0f1b063
Merge tag '3.29.0' into stable
PrimozGodec May 28, 2021
c52d118
Merge tag '3.29.1' into stable
PrimozGodec May 31, 2021
325c056
Merge tag '3.29.2' into stable
PrimozGodec Jun 8, 2021
653175d
Merge tag '3.29.3' into stable
markotoplak Jun 9, 2021
eaa7571
Merge tag '3.30.0' into stable
markotoplak Sep 22, 2021
7073993
Merge branch 'release-3.30.1' into stable
markotoplak Sep 24, 2021
1122662
Merge tag '3.30.2' into stable
markotoplak Oct 27, 2021
eb4fe1d
Document plural form of input variables for Python Script Data widget
stellarpower Nov 15, 2021
7adf05a
Merge pull request #5694 from stellarpower/patch-2
ajdapretnar Nov 19, 2021
9de6c69
Merge tag '3.31.0' into stable
markotoplak Dec 17, 2021
e889f28
Merge tag '3.31.1' into stable
markotoplak Jan 7, 2022
7d399e7
Merge tag '3.32.0' into stable
markotoplak Apr 1, 2022
1d35646
Merge tag '3.33.0' into stable
markotoplak Sep 30, 2022
bcda58a
Modified imports, _try_load() function. Added _hdf5_tree_model()
Dec 1, 2022
4583669
Merge tag '3.34.0' into stable
lanzagar Dec 5, 2022
3076350
First non-functional version
Dec 12, 2022
0e3b4a1
Merge tag '3.34.1' into stable
markotoplak Dec 13, 2022
8bf2027
First functional version
Dec 16, 2022
de1fb2c
Optimized version with documentation
Jan 5, 2023
7e10bb2
Final version polished
Jan 16, 2023
7d85a0f
Add GenericHDF5Reader
Dec 1, 2022
15b2545
Merge branch 'alba' into devel
gjover Feb 28, 2023
41f650d
Merge tag '3.35.0' into stable
PrimozGodec May 5, 2023
fe9d4cf
Merge branch 'stable' into devel
gjover Aug 28, 2023
07eadf7
Merge tag '3.36.0' into stable
PrimozGodec Sep 8, 2023
a309db8
Merge tag '3.36.1' into stable
PrimozGodec Sep 22, 2023
8d59ef1
Merge tag '3.36.2' into stable
PrimozGodec Oct 31, 2023
a7acbfc
Merge branch 'alba' into stable
gjover Jan 15, 2024
7e195d2
Merge branch 'stable' into devel
gjover Jan 15, 2024
dd89972
Add hdf5 requirement
gjover Jan 17, 2024
8cfd71b
Merge branch 'devel'
gjover Jan 17, 2024
12524e6
Fix GenericHDF5Reader exception on no data.
gjover Jan 22, 2024
0e70a1a
Add h5py requirement in pyproject.toml
gjover Jan 22, 2024
3f8ce6d
Merge branch 'devel'
gjover Jan 22, 2024
ec6546d
Merge branch 'master' into devel
gjover Jan 25, 2024
5c957f4
Replace option box by sheet mechanism
gjover Jan 29, 2024
97214ae
Merge branch 'master' into devel
gjover Jan 29, 2024
8520bb9
Merge branch 'devel'
gjover Jan 29, 2024
113 changes: 110 additions & 3 deletions Orange/data/io.py
@@ -14,15 +14,17 @@
from os import path, remove
from tempfile import NamedTemporaryFile
from urllib.parse import urlparse, urlsplit, urlunsplit, \
unquote as urlunquote, quote
unquote as urlunquote
from urllib.request import urlopen, Request
from pathlib import Path

import numpy as np
import pandas as pd

import xlrd
import xlsxwriter
import openpyxl
import h5py

from Orange.data import _io, Table, Domain, ContinuousVariable, update_origin
from Orange.data import Compression, open_compressed, detect_encoding, \
@@ -31,7 +33,6 @@

from Orange.util import flatten


# Support values longer than 128K (i.e. text contents features)
csv.field_size_limit(100*1024*1024)

@@ -164,7 +165,14 @@ def read(self):
skipinitialspace=True,
)
data = self.data_table(reader)
data.name = path.splitext(path.split(self.filename)[-1])[0]

# TODO: Name can be set unconditionally when/if
# self.filename will always be a string with the file name.
# Currently, some tests pass StringIO instead of
# the file name to a reader.
if isinstance(self.filename, str):
data.name = path.splitext(
path.split(self.filename)[-1])[0]
if error and isinstance(error, UnicodeDecodeError):
pos, endpos = error.args[2], error.args[3]
warning = ('Skipped invalid byte(s) in position '
@@ -511,3 +519,102 @@ def _suggest_filename(self, content_disposition):
matches = re.findall(r"filename\*?=(?:\"|.{0,10}?'[^']*')([^\"]+)",
content_disposition or '')
return urlunquote(matches[-1]) if matches else default_name


class GenericHDF5Reader(FileFormat):
"""
Reader for generic HDF5 files (.hdf5, .h5, .nxs).

Parameters
----------
filename : str
Path to the HDF5 file; every dataset it contains is indexed on construction.

Methods
-------
read():
Returns the selected dataset as an Orange.data.Table object
"""
EXTENSIONS = ('.hdf5', '.h5', '.nxs',)
DESCRIPTION = 'Hierarchical Data Format files'
SUPPORT_COMPRESSED = False
SUPPORT_SPARSE_DATA = False

def __init__(self, filename):
super().__init__(filename=filename)

self.h5_file = h5py.File(filename)

self.datasets = {}
self._load_group("/", self.h5_file)

@property
def sheets(self) -> List:
"""List of datasets in the file.

Returns
-------
List of dataset paths
"""
return list(self.datasets.keys())

def select_sheet(self, sheet):
"""Select dataset to be read

Parameters
----------
sheet : str
dataset path
"""
if sheet is None:
sheet = self.sheets[0]
self.sheet = sheet

def read(self):
"""Convert the dataset selected through select_sheet() into an
Orange Table object.

Returns
-------
table (Orange.Table object):
Contains the values of the chosen dataset in the hdf5 file, one row
per value, preceded by the value's index along each dimension.
"""

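if self.sheet is None:
# No dataset selected yet: fall back to the first one in the file,
# otherwise the lookup below would fail with KeyError(None)
self.select_sheet(None)
name = self.sheet.split('/')[-1]

data = self.datasets[self.sheet]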

# Default column names, one per dimension of the dataset; they can be
# changed manually in the widget itself
columns = [str(i) for i in range(len(data.shape))]

dataset = np.array(data)

# Indices are created to keep track of the position of each value in the
# original dataset
index = pd.MultiIndex.from_product([range(s) for s in dataset.shape], names=columns)
dataset = dataset.flatten()

# Combine the values and their indices into a 2D table, one row per value
df = pd.DataFrame({name: dataset}, index=index).reset_index()

attrs = [ContinuousVariable(str(val)) for val in range(0, len(df.columns))]
table = Table.from_numpy(domain=Domain(attributes=attrs), X=df.values)

return table

def _load_group(self, root, group):
"""Recursively collect the datasets stored in the .hdf5 file
into self.datasets, keyed by their full path.

Given a root path, iterates over all children of the group and
either stores a child (dataset) or descends into it (group).
"""
for name, obj in group.items():
path = root + name
if isinstance(obj, h5py.Group):
self._load_group(path + "/", group[name])
elif isinstance(obj, h5py.Dataset):
self.datasets[path] = obj
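
The layout that `read()` produces (one row per value, preceded by that value's index along each dimension) can be illustrated without NumPy, pandas or h5py. The following is only a minimal sketch under stated assumptions: it swaps `pd.MultiIndex.from_product` and `flatten()` for `itertools.product` over a nested list, and the names `to_long_rows` and `example` are hypothetical and not part of this PR.

```python
from itertools import product


def to_long_rows(values, shape):
    """Return one row per element: (i0, i1, ..., value), in row-major order.

    `values` is a nested list whose nesting matches `shape`.
    Hypothetical helper: a dependency-free stand-in for the
    MultiIndex.from_product(...) + flatten() + reset_index() steps in read().
    """
    rows = []
    # product() varies the last index fastest, i.e. row-major (C) order
    for idx in product(*(range(n) for n in shape)):
        item = values
        for i in idx:  # walk down one axis at a time
            item = item[i]
        rows.append((*idx, float(item)))
    return rows


# Hypothetical 2 x 3 dataset standing in for an HDF5 dataset
example = [[10, 11, 12],
           [20, 21, 22]]
rows = to_long_rows(example, (2, 3))
for row in rows:
    print(row)
```

A dataset of shape (2, 3) therefore becomes 6 rows with 3 columns: two index columns and one value column. This is why `read()` creates `len(df.columns)` continuous attributes rather than one per original dimension.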
17 changes: 14 additions & 3 deletions Orange/widgets/data/owfile.py
@@ -15,7 +15,8 @@
from orangewidget.workflow.drophandler import SingleUrlDropHandler

from Orange.data.table import Table, get_sample_datasets_dir
from Orange.data.io import FileFormat, UrlReader, class_from_qualified_name
from Orange.data.io import FileFormat, UrlReader, \
class_from_qualified_name, GenericHDF5Reader
from Orange.data.io_base import MissingReaderException
from Orange.util import log_warnings
from Orange.widgets import widget, gui
@@ -46,7 +47,7 @@ def add_origin(examples, filename):
"""
Adds attribute with file location to each string variable
Used for relative filenames stored in string variables (e.g. pictures)
TODO: we should consider a cleaner solution (special variable type, ...)
ToDO: we should consider a cleaner solution (special variable type, ...)
"""
if not filename:
return
@@ -268,6 +269,14 @@ def package(w):
box.layout().addWidget(self.reader_combo)
layout.addWidget(box, 0, 1)

# Options box for file types that need additional settings
# before the Orange Table can be loaded
self.options_box = gui.widgetBox(self.controlArea,
orientation=QGridLayout().setSpacing(4),
box="Options")
# Hide the box until needed
self.options_box.hide()

box = gui.vBox(self.controlArea, "Info")
self.infolabel = gui.widgetLabel(box, 'No data loaded.')

@@ -282,6 +291,7 @@ def package(w):
autoDefault=False
)
gui.rubber(box)

self.apply_button = gui.button(
box, self, "Apply", callback=self.apply_domain_edit)
self.apply_button.setEnabled(False)
@@ -452,7 +462,7 @@ def mark_problematic_reader():
self.data = data
self.openContext(data.domain)
self.apply_domain_edit() # sends data
return None
return None

def _get_reader(self) -> FileFormat:
if self.source == self.LOCAL_FILE:
@@ -483,6 +493,7 @@ def _get_reader(self) -> FileFormat:
url = self.url_combo.currentText().strip()
return UrlReader(url)


def _update_sheet_combo(self):
if len(self.reader.sheets) < 2:
self.sheet_box.hide()
2 changes: 1 addition & 1 deletion doc/visual-programming/source/widgets/data/pythonscript.md
@@ -17,7 +17,7 @@ Extends functionalities through Python scripting.
- Classifier (Orange.classification.Learner): classifier retrieved from ``out_classifier`` variable
- Object: Python object retrieved from ``out_object`` variable

**Python Script** widget can be used to run a python script in the input, when a suitable functionality is not implemented in an existing widget. The script has ``in_data``, ``in_distance``, ``in_learner``, ``in_classifier`` and ``in_object`` variables (from input signals) in its local namespace. If a signal is not connected or it did not yet receive any data, those variables contain ``None``.
**Python Script** widget can be used to run a python script in the input, when a suitable functionality is not implemented in an existing widget. The script has ``in_data``, ``in_distance``, ``in_learner``, ``in_classifier`` and ``in_object`` variables (from input signals) in its local namespace. If a signal is not connected or it did not yet receive any data, those variables contain ``None``. For the case when multiple inputs are connected to the widget, the lists ``in_datas``, ``in_distances``, ``in_learners``, ``in_classifiers`` and ``in_objects`` may be used instead.

After the script is executed variables from the script’s local namespace are extracted and used as outputs of the widget. The widget can be further connected to other widgets for visualizing the output.
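
A script that should work with one or several connected inputs could look roughly like the sketch below. It is an illustration only: the first two assignments are stand-ins for the names the widget injects (they would be omitted inside the widget), and `collect_inputs` is a hypothetical helper. The documentation above does not state what ``in_data`` holds when several inputs are connected, so the sketch checks the list first.

```python
# Stand-ins for the names the widget injects; assumption made only so
# that this sketch runs on its own outside the Python Script widget.
in_data = None
in_datas = [[1, 2, 3], [4, 5]]  # hypothetical: two connected inputs


def collect_inputs(single, many):
    """Return all connected inputs as one list, whichever form is populated."""
    if many:
        return list(many)
    return [] if single is None else [single]


inputs = collect_inputs(in_data, in_datas)

# Variables named out_* are picked up as widget outputs after the script runs
out_object = [len(d) for d in inputs]
print(out_object)
```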

1 change: 1 addition & 0 deletions pyproject.toml
@@ -6,6 +6,7 @@ requires = [
"setuptools>=51.0",
"sphinx",
"wheel",
"h5py",
]

build-backend = "setuptools.build_meta"
2 changes: 2 additions & 0 deletions requirements-core.txt
@@ -26,3 +26,5 @@ xgboost>=1.7.4
xlrd>=1.2.0
# Writing Excel Files
xlsxwriter
# HDF5 binary data format
h5py