Add topostats file helper class #945

SylviaWhittle · 2024-10-13T10:16:20Z

This PR would add a small helper class in topostats.io to assist users that want to explore & retrieve data contained in .topostats files, as we have had feedback from the experimentalists that navigating the .hdf5 file structure is prohibitively complex / difficult to do manually.

Previously, to load the file in a notebook, one had to:

from pathlib import Path

import h5py

from topostats.io import hdf5_to_dict


file = Path("./path/to/file.topostats")
with h5py.File(file, "r") as f:
    data_dict = hdf5_to_dict(f, "/")

# Then try to manually navigate the dictionary to find the specific item wanted
data = data_dict["ordered_trace_heights"]["0"]
# get the keys wrong
>>> KeyError
# manually print keys at each level, akin to doing lots of ls, cd
print(data_dict.keys())
data = data_dict["grain_trace_data"]
print(data.keys())
.
.
.

The TopoFileHelper class adds some methods to help with this:

pretty_print_structure() will print the entire structure (but not messy dictionaries of arrays!):

[./tests/resources/file.topostats]
├ filename
│   └ minicircle
├ grain_masks
│   └ above
│       └ Numpy array, shape: (1024, 1024), dtype: int64
├ grain_trace_data
│   └ above
│       ├ cropped_images
│       │   └ 21 keys with numpy arrays as values
│       ├ ordered_trace_cumulative_distances
│       │   └ 21 keys with numpy arrays as values
│       ├ ordered_trace_heights
│       │   └ 21 keys with numpy arrays as values
│       ├ ordered_traces
│       │   └ 21 keys with numpy arrays as values
│       └ splined_traces
│           └ 21 keys with numpy arrays as values
├ image
│   └ Numpy array, shape: (1024, 1024), dtype: float64
├ image_original
│   └ Numpy array, shape: (1024, 1024), dtype: float64
├ img_path
│   └ /Users/sylvi/Documents/TopoStats/tests/resources/minicircle
├ pixel_to_nm_scaling
│   └ 0.4940029296875
└ topostats_file_version
    └ 0.2
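
For readers curious how a tree like the one above can be produced, here is a minimal sketch that walks a nested dictionary recursively. It is not the PR's implementation: format_tree and summarise are hypothetical names, and plain lists stand in for numpy arrays so the snippet has no dependencies.

```python
def summarise(value):
    """Describe a leaf value briefly instead of printing it in full."""
    if isinstance(value, list):
        return f"list, length: {len(value)}"
    return str(value)


def format_tree(data, prefix=""):
    """Return one line per key, indented with box-drawing characters."""
    lines = []
    items = list(data.items())
    for index, (key, value) in enumerate(items):
        last = index == len(items) - 1
        lines.append(f"{prefix}{'└' if last else '├'} {key}")
        # Children of the last key get blank padding, others a vertical bar.
        child_prefix = prefix + ("    " if last else "│   ")
        if isinstance(value, dict):
            lines.extend(format_tree(value, child_prefix))
        else:
            lines.append(f"{child_prefix}└ {summarise(value)}")
    return lines


# Hypothetical stand-in for the dictionary loaded from a .topostats file.
example = {
    "filename": "minicircle",
    "grain_trace_data": {"above": {"ordered_traces": {"0": [1.0, 2.0], "1": [3.0]}}},
    "pixel_to_nm_scaling": 0.494,
}
tree_lines = format_tree(example)
print("\n".join(tree_lines))
```

Collapsing large groups of arrays into a single summary line, as the output above does, would be an extra branch in summarise.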

find_data() performs a strict search for the given list of keys and, if no match is found, falls back to a partial search for matches the user may have intended. E.g.:

topofilehelper.find_data(["ordered_trace_heights", "0"])

 [ Searching for ['ordered_trace_heights', '0'] in ./tests/resources/file.topostats ]
 | [search] No direct match found.
 | [search] Searching for partial matches.
 | [search] !! [ 1 Partial matches found] !!
 | [search] └ grain_trace_data/above/ordered_trace_heights/0
 └ [End of search]
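
The strict-then-partial search described above could be sketched roughly as follows. This is an illustration only: iter_paths and find_paths are hypothetical names, and "partial" is assumed here to mean "the path ends with the requested keys", which may differ from what the PR actually does.

```python
def iter_paths(data, prefix=()):
    """Yield every key path in a nested dictionary as a tuple of keys."""
    for key, value in data.items():
        path = prefix + (key,)
        yield path
        if isinstance(value, dict):
            yield from iter_paths(value, path)


def find_paths(data, keys):
    """Return (exact, partial) matches for a list of keys."""
    keys = tuple(keys)
    if not keys:
        return [], []
    paths = list(iter_paths(data))
    # Strict search: the full path from the root must equal the keys.
    exact = ["/".join(path) for path in paths if path == keys]
    if exact:
        return exact, []
    # Partial search (assumed rule): the path ends with the requested keys.
    partial = ["/".join(path) for path in paths if path[-len(keys):] == keys]
    return exact, partial


# Hypothetical stand-in for the loaded file contents.
data = {"grain_trace_data": {"above": {"ordered_trace_heights": {"0": [1.2, 1.4]}}}}
exact, partial = find_paths(data, ["ordered_trace_heights", "0"])
print(exact, partial)
```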

get_data() simply retrieves data when provided with a key string separated by "/"s:

ordered_trace_heights = topofilehelper.get_data("grain_trace_data/above/ordered_trace_heights/0")
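
Under the hood, a lookup like this amounts to following each key in turn through the nested dictionary. A minimal sketch, assuming the data is already loaded as plain dictionaries (get_by_path is a hypothetical name, not the PR's method):

```python
from functools import reduce


def get_by_path(data, path, sep="/"):
    """Follow a separator-delimited key string through nested dictionaries."""
    keys = [key for key in path.split(sep) if key]
    # Each step indexes one level deeper; a wrong key raises KeyError.
    return reduce(lambda node, key: node[key], keys, data)


# Hypothetical stand-in for the loaded file contents.
data = {"grain_trace_data": {"above": {"ordered_trace_heights": {"0": [1.2, 1.4]}}}}
heights = get_by_path(data, "grain_trace_data/above/ordered_trace_heights/0")
print(heights)
```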

data_info() Prints a little information about the value at a specific key:

topofilehelper.data_info("grain_trace_data/above/ordered_trace_heights/0")
topofilehelper.data_info("grain_trace_data/above/ordered_trace_heights")

Data at grain_trace_data/above/ordered_trace_heights/0 is a numpy array with shape: (95,), dtype: float64
Data at grain_trace_data/above/ordered_trace_heights is a dictionary with 21 keys of types {<class 'str'>} and values of types {<class 'numpy.ndarray'>}
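
A summary of this kind can be built from type() and len() alone. The sketch below is an assumption about the general approach rather than the PR's code; describe is a hypothetical name and lists stand in for numpy arrays.

```python
def describe(value):
    """Return a one-line description of a value without printing its contents."""
    if isinstance(value, dict):
        key_types = sorted({type(key).__name__ for key in value})
        value_types = sorted({type(item).__name__ for item in value.values()})
        return (
            f"a dictionary with {len(value)} keys of types {key_types} "
            f"and values of types {value_types}"
        )
    if isinstance(value, (list, tuple)):
        return f"a {type(value).__name__} of length {len(value)}"
    return f"a {type(value).__name__} with value {value!r}"


info = describe({"0": [1.0, 2.0], "1": [3.0]})
print(f"Data is {info}")
```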

No tests yet, would want feedback first

ns-rse · 2024-10-15T10:31:29Z

Not ignoring this, have had a scan through and it looks good but focusing on the various outstanding issues with the better tracing merger.

SylviaWhittle · 2024-10-15T10:32:56Z

> Not ignoring this, have had a scan through and it looks good but focusing on the various outstanding issues with the better tracing merger.

All good, not wanting this to take time away from more important stuff, it can wait and I want user opinions first too :)

MaxGamill-Sheffield · 2024-10-16T08:42:56Z

Can we work into this a notebook on opening and extracting data from the file too, to address the comments from the workshop day?

SylviaWhittle · 2024-10-16T10:37:59Z

> Can we work into this a notebook on opening and extracting data from the file too, to address the comments from the workshop day?

ye

SylviaWhittle · 2024-10-16T15:04:31Z

Added a notebook showing how to use the class in ./notebooks/

ns-rse · 2024-10-16T22:00:54Z

topostats/io.py

+    Examples
+    --------
+    Creating a helper object.
+    ```python


Not sure how this renders when Sphinx's autoapi parses it to generate the API docs for the web page, as docstrings are reStructuredText. Might be worth using the .. code-block:: approach; see @MaxGamill-Sheffield's solution in commit b953634.

SylviaWhittle · 2024-10-17

Very good point, thank you. I'll do that.
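
For reference, a docstring using the .. code-block:: directive instead of a Markdown fence might look like the sketch below. The function is a hypothetical placeholder that exists only to carry the docstring; the call inside the example is the one quoted earlier in this PR.

```python
def get_data_documented(location: str) -> str:
    """
    Return the location unchanged (placeholder body, for illustration only).

    Examples
    --------
    Retrieving data from a helper object.

    .. code-block:: python

        ordered_trace_heights = topofilehelper.get_data(
            "grain_trace_data/above/ordered_trace_heights/0"
        )
    """
    return location


print(get_data_documented.__doc__)
```

The blank line after the directive and the extra indentation of the example are what reStructuredText uses to delimit the block.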

ns-rse · 2024-10-17T14:45:41Z

I think the tests under Python 3.9 are failing because under that version we need the following import...

from __future__ import annotations
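
To illustrate why the import helps: with it, annotations are stored as strings and never evaluated, so the str | None pipe syntax does not raise a TypeError on Python 3.9. In the sketch below the module source is compiled from a string only so the snippet can be pasted anywhere; in a real module the import is simply the first statement. The load function is hypothetical.

```python
import textwrap

# Source of a tiny stand-in module; the __future__ import must come first.
MODULE_SOURCE = textwrap.dedent(
    """
    from __future__ import annotations


    def load(path: str | None = None) -> dict | None:
        return None
    """
)

namespace = {}
exec(compile(MODULE_SOURCE, "<demo module>", "exec"), namespace)

# The annotations are kept as plain strings rather than evaluated objects.
stored = namespace["load"].__annotations__
print(stored)
```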

ns-rse · 2024-10-17T14:51:43Z

notebooks/topostats_file_helper_example.ipynb

I think the Notebook would be easier to read if the comments were moved to Markdown sections to delineate the code examples. Otherwise we may as well just include a code block in docs/advanced/topostats_file_helper.md, which would be a simpler solution to documentation: it's a web page people can go to, so they wouldn't need to activate a virtual environment and then start Jupyter. Code chunks could be copied and pasted (I'd have to work out how to enable a button to support that though).

ns-rse · 2024-10-30T14:49:25Z

Just came across h5glance after the Skan developer mentioned it on Mastodon. I wonder if using this would be a simpler solution, it sounds as though it might work within Jupyter Notebooks too.

SylviaWhittle · 2024-11-24T16:41:40Z

> Just came across h5glance after the Skan developer mentioned it on Mastodon. I wonder if using this would be a simpler solution, it sounds as though it might work within Jupyter Notebooks too.

Good find!

It does work wonderfully in notebooks and it's even interactive! You can click on the items to expand / hide sub-fields.

However this is just for looking at hdf5 files and not retrieving data from them (which would still require the with h5py.File('testfile.hdf5', 'r') as f: <load stuff manually>)

I propose that I replace the code I wrote to display the contents of the file with h5glance but keep the methods I wrote to do data retrieval?

ns-rse · 2024-11-25T08:44:47Z

Sounds like a plan.

From memory these are thin wrappers around accessing dictionary items directly, and I feel that it's substituting learning how to work with dictionaries directly with learning how to use the wrappers. I'm of the opinion that the more general skill (working directly with dictionaries) has broader benefits to users in the long term.

⚖️

SylviaWhittle · 2024-12-03T15:24:51Z

There is an issue with h5glance: it must be called from within the notebook. It cannot be called from a wrapper in a standard .py file. If this happens then this is the result:

I tried this:

    def pretty_print_structure(self) -> None:
        """
        Print the structure of the data in the data dictionary.

        The structure is printed with the keys indented to show the hierarchy of the data.
        """
        LOGGER.info(f"running h5glance")
        H5Glance(self.topofile)

and this:

    def pretty_print_structure(self) -> None:
        """
        Print the structure of the data in the data dictionary.

        The structure is printed with the keys indented to show the hierarchy of the data.
        """
        LOGGER.info(f"running h5glance")
        result = H5Glance(self.topofile)
        print(result)

and neither produces the nested output needed.

If users want the interactive h5glance notebook UI I screenshotted earlier, they must call it explicitly in a notebook, h5glance.H5Glance("./file.h5").

So either they must remember h5glance as a separate tool to TopoFileHelper or we could provide my worse implementation as a built-in alternative in case they don't remember to use h5glance?

ns-rse · 2024-12-03T17:47:32Z

I guess it depends how people are using the .topostats files? Are Notebooks widely used within the group and outside of it to explore .topostats files?

I'm not a great fan of re-inventing the 🛞 and if a tool exists I'll tend to advocate for its use over making something new which adds to our codebase and the overhead of maintenance.

As a general rule though, and perhaps I'm missing something, hdf5 files once loaded are essentially dictionaries. A wrapper to make exploring dictionaries easier is perhaps useful, but it still requires learning how to use the wrapper. I'd personally advocate for helping people learn how to explore dictionaries as it gives them a transferable skill that can be used in lots of other scenarios.

I think I recall writing such when I wrote original Notebooks (see for example the line after from topostats.io import read_yaml in this notebook). We could point people to tutorials on how to use .keys() and .values() and iteration over dictionaries.
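
As a rough illustration of the dictionary skills mentioned here (.keys(), .items() and iteration), the sketch below explores a small nested dictionary. The data is a hypothetical stand-in for what hdf5_to_dict returns, with lists in place of numpy arrays.

```python
# Hypothetical stand-in for a loaded .topostats file.
data_dict = {
    "filename": "minicircle",
    "pixel_to_nm_scaling": 0.494,
    "grain_trace_data": {
        "above": {"ordered_trace_heights": {"0": [1.2, 1.4, 1.1], "1": [0.9, 1.0]}}
    },
}

# See what is available at the top level, much like running ls.
top_level = list(data_dict.keys())
print(top_level)

# Iterate over keys and values together to see what each entry holds.
for key, value in data_dict.items():
    print(f"{key}: {type(value).__name__}")

# Step down one level at a time, then loop over the per-grain entries.
heights = data_dict["grain_trace_data"]["above"]["ordered_trace_heights"]
lengths = {grain: len(trace) for grain, trace in heights.items()}
print(lengths)
```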

The h5glance README points to using the h5py package when wanting to work with such files within code but then that is what you are using here.

Having re-read the Examples you've written, it looks like they demonstrate how to use the helper to view files in a Notebook, which people could do with h5glance, and it's one less thing to maintain within TopoStats ⚖️

MaxGamill-Sheffield · 2025-01-08T10:55:19Z

Plan from the TopoStats code clean 08/01/25 is:

- add as a module / in a module
- remove key finder function in favour of h5glance
- take notebook from here and:
  - add docs and a section on h5glance
  - add docs and a section on the func that pulls the values
  - remove the helical periodicity stuff

ns-rse · 2025-01-08T12:28:37Z

Sorry to miss code clean, was embroiled in some bioinformatics work and didn't notice the time.

With regards to Notebook this might be an ideal opportunity to migrate to the newer marimo which among other things has a major advantage of updating all dependent cells when an earlier one is re-run.

For more on the problems marimo solves see the faq.

I doubt there will be much of an overhead in migrating since it's still a notebook running the cells, so both Markdown and code cells could be copied over. There is a section on migrating from Jupyter.

ns-rse · 2025-02-04T11:29:37Z

I'm just looking through using nanosurf and whilst they don't have their
code repository shared online anywhere once installed we can browse the source. There are a couple of PDFs installed
under ~/.<path_to_venv>/<venv_name>/lib/python3.<version>/site-packages/nanosurf/doc/...

NHF_Reader\ Overview.pdf
Nanosurf_Python_Library_Overview.pdf

...which are slides.

The source for .nhf and .nid file readers can be found at
~/.<path_to_venv>/<venv_name>/lib/python3.<version>/site-packages/nanosurf/util/{nhf,nid}_reader.py and each defines a
NHFFileReader and NIDFileReader class.

The NHFFileReader has self.print_file_structure() which pretty prints the class structure (i.e. print(self)), see
lines 1077-1080.

Given the grains refactoring and use of @dataset decorators I think this could be the way to go with making it easy to
navigate the .topostats file format.
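
To make the idea concrete, here is a minimal sketch of a class that can print its own structure. It assumes the decorators mentioned are along the lines of Python's standard dataclasses; GrainTrace, TopoFile and structure_lines are hypothetical names, not TopoStats or nanosurf APIs.

```python
from dataclasses import dataclass, field, fields, is_dataclass


@dataclass
class GrainTrace:
    """Hypothetical container for per-grain trace data."""

    ordered_trace_heights: list = field(default_factory=list)


@dataclass
class TopoFile:
    """Hypothetical top-level container for a loaded file."""

    filename: str
    pixel_to_nm_scaling: float
    traces: GrainTrace = field(default_factory=GrainTrace)


def structure_lines(obj, indent=0):
    """List each field, recursing into nested dataclass instances."""
    lines = []
    pad = "  " * indent
    for item in fields(obj):
        value = getattr(obj, item.name)
        if is_dataclass(value):
            lines.append(f"{pad}{item.name}")
            lines.extend(structure_lines(value, indent + 1))
        else:
            lines.append(f"{pad}{item.name}: {type(value).__name__}")
    return lines


structure = structure_lines(TopoFile("minicircle", 0.494))
print("\n".join(structure))
```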

Noting here for future discussion.

ns-rse · 2025-03-21T10:31:20Z

In light of the proposed restructuring (see #1102), would it be reasonable to close this and address loading of .topostats in the documentation once we know what that looks like with the classes?

SylviaWhittle added 7 commits October 13, 2024 10:39

Add topostats file helper

bed84a2

Add find data function

a327e20

Add data info function

3cd66ff

Add get data function

a9c3f82

Add pretty-print-structure function

0c806ee

Add partial search function

d90ef91

Add class documentation with examples

1677ecd

ns-rse added the user experience label Oct 15, 2024

Add example notebook for loading topostats file data

831f1c8

ns-rse reviewed Oct 16, 2024


Add tests for TopoFileHelper class (WIP)

89b6475

ns-rse reviewed Oct 17, 2024


Fix | io.py failing tests | import futures.annotations to allow pipe

56acb1f

Add: markdown sections to helper notebook

986870a

SylviaWhittle and others added 3 commits January 22, 2025 10:51

Fix code blocks in documentation

ba9e394

[pre-commit.ci] Fixing issues with pre-commit

adb6eb8

Remove superfluous functions and use H5Glance instead

87074d1

SylviaWhittle added 2 commits January 22, 2025 11:04

Resolve merge linting conflict

53b07fc

Move topostats file helper to its own module

279bc00

SylviaWhittle force-pushed the SylviaWhittle/topo-file-helper branch from e06ed14 to 279bc00 Compare January 22, 2025 11:19

[pre-commit.ci] Fixing issues with pre-commit

b4ade4a
