Add topostats file helper class #945

Draft
wants to merge 9 commits into
base: main
from

Conversation

SylviaWhittle
Collaborator

This PR would add a small helper class in topostats.io to help users who want to explore and retrieve data contained in .topostats files. We have had feedback from the experimentalists that navigating the .hdf5 file structure manually is prohibitively complex.

Previously, to load the file in a notebook, one had to:

from pathlib import Path

import h5py

from topostats.io import hdf5_to_dict


file = Path("./path/to/file.topostats")
with h5py.File(file, "r") as f:
    data_dict = hdf5_to_dict(f, "/")

# Then try to manually navigate the dictionary to find the specific item wanted
data = data_dict["ordered_trace_heights"]["0"]
# get the keys wrong and the lookup fails
>>> KeyError
# manually print keys at each level, akin to doing lots of ls, cd
print(data_dict.keys())
data = data_dict["grain_trace_data"]
print(data.keys())
.
.
.

The TopoFileHelper class adds some methods to help with this:

  • pretty_print_structure() will print the entire structure (but not messy dictionaries of arrays!):
[./tests/resources/file.topostats]
├ filename
│   └ minicircle
├ grain_masks
│   └ above
│       └ Numpy array, shape: (1024, 1024), dtype: int64
├ grain_trace_data
│   └ above
│       ├ cropped_images
│       │   └ 21 keys with numpy arrays as values
│       ├ ordered_trace_cumulative_distances
│       │   └ 21 keys with numpy arrays as values
│       ├ ordered_trace_heights
│       │   └ 21 keys with numpy arrays as values
│       ├ ordered_traces
│       │   └ 21 keys with numpy arrays as values
│       └ splined_traces
│           └ 21 keys with numpy arrays as values
├ image
│   └ Numpy array, shape: (1024, 1024), dtype: float64
├ image_original
│   └ Numpy array, shape: (1024, 1024), dtype: float64
├ img_path
│   └ /Users/sylvi/Documents/TopoStats/tests/resources/minicircle
├ pixel_to_nm_scaling
│   └ 0.4940029296875
└ topostats_file_version
    └ 0.2
  • find_data() will perform a strict search for the keys (given as a list) and, if no match is found, fall back to a partial search for matches the user may have intended. E.g.:
topofilehelper.find_data(["ordered_trace_heights", "0"])
 [ Searching for ['ordered_trace_heights', '0'] in ./tests/resources/file.topostats ]
 | [search] No direct match found.
 | [search] Searching for partial matches.
 | [search] !! [ 1 Partial matches found] !!
 | [search] └ grain_trace_data/above/ordered_trace_heights/0
 └ [End of search]
  • get_data() simply retrieves data when provided with a key string separated by "/"s:
ordered_trace_heights = topofilehelper.get_data("grain_trace_data/above/ordered_trace_heights/0")
  • data_info() prints a little information about the value at a specific key:
topofilehelper.data_info("grain_trace_data/above/ordered_trace_heights/0")
topofilehelper.data_info("grain_trace_data/above/ordered_trace_heights")
Data at grain_trace_data/above/ordered_trace_heights/0 is a numpy array with shape: (95,), dtype: float64
Data at grain_trace_data/above/ordered_trace_heights is a dictionary with 21 keys of types {<class 'str'>} and values of types {<class 'numpy.ndarray'>}
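
For readers who want a feel for the strict-then-partial search and the "/"-separated lookup described above, here is a minimal sketch on a plain nested dictionary. It is not the TopoFileHelper implementation: the function names (iter_paths, get_by_path, find_paths) are hypothetical, and treating a "partial match" as a path that ends with the requested keys is an assumption made for illustration.

```python
from typing import Any, Iterator


def iter_paths(data: dict, prefix: tuple = ()) -> Iterator[tuple]:
    """Yield every key path in a nested dictionary as a tuple of keys."""
    for key, value in data.items():
        path = prefix + (str(key),)
        yield path
        if isinstance(value, dict):
            yield from iter_paths(value, path)


def get_by_path(data: dict, location: str) -> Any:
    """Follow a "/"-separated key string down through nested dictionaries."""
    current = data
    for key in location.strip("/").split("/"):
        current = current[key]  # raises KeyError if a key is missing
    return current


def find_paths(data: dict, keys: list) -> list:
    """Return exact matches for the key sequence, else paths ending with it.

    Assumption: a "partial match" is any path whose final keys equal `keys`.
    """
    target = tuple(keys)
    all_paths = list(iter_paths(data))
    exact = ["/".join(p) for p in all_paths if p == target]
    if exact:
        return exact
    return ["/".join(p) for p in all_paths if p[-len(target):] == target]


# Toy stand-in for a loaded .topostats dictionary (made-up values).
example = {
    "pixel_to_nm_scaling": 0.5,
    "grain_trace_data": {
        "above": {"ordered_trace_heights": {"0": [1.0, 2.0], "1": [3.0]}}
    },
}

matches = find_paths(example, ["ordered_trace_heights", "0"])
print(matches)
print(get_by_path(example, matches[0]))
```

Returning the matches as "/"-joined strings means a search result can be passed straight to the lookup function, which mirrors how the find_data() output above feeds into get_data().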

No tests yet, would want feedback first

@ns-rse
Collaborator

ns-rse commented Oct 15, 2024

Not ignoring this, have had a scan through and it looks good but focusing on the various outstanding issues with the better tracing merger.

@SylviaWhittle
Collaborator Author

> Not ignoring this, have had a scan through and it looks good but focusing on the various outstanding issues with the better tracing merger.

All good, not wanting this to take time away from more important stuff, it can wait and I want user opinions first too :)

@MaxGamill-Sheffield
Collaborator

Can we work into this a notebook on opening and extracting data from the file too, to address the comments from the workshop day?

@SylviaWhittle
Collaborator Author

> Can we work into this a notebook on opening and extracting data from the file too, to address the comments from the workshop day?

ye

@SylviaWhittle
Collaborator Author

Added a notebook showing how to use the class in ./notebooks/

Examples
--------
Creating a helper object.
```python
Collaborator

Not sure how this renders when Sphinx's Autoapi-doc parses it to generate the API docs for the webpage, as docstrings are reStructuredText. Might be worth using the .. code-block:: approach, see @MaxGamill-Sheffield's solution in commit b953634.
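
As a rough illustration of the .. code-block:: approach, a docstring along these lines should be read by Sphinx as reStructuredText rather than Markdown. This is only a sketch: the function example_function is hypothetical, and the TopoFileHelper constructor call shown inside the docstring is an assumption, since the thread does not show its signature.

```python
def example_function(path: str) -> str:
    """
    Return the given path unchanged (placeholder body for illustration).

    Examples
    --------
    Creating a helper object.

    .. code-block:: python

        from topostats.io import TopoFileHelper

        # Assumed constructor call, not confirmed by the PR thread.
        helper = TopoFileHelper("path/to/file.topostats")
    """
    return path


# The directive is ordinary text inside the docstring until Sphinx parses it.
print(example_function.__doc__)
```

The blank line after the directive and the extra indentation of the example are what mark it as a literal block in reStructuredText.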

Collaborator Author

Very good point thank you, I'll do that

@ns-rse
Collaborator

ns-rse commented Oct 17, 2024

I think the tests under Python 3.9 are failing because under that version we need the following import...

from __future__ import annotations
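
One plausible reason, assuming the new class uses newer annotation syntax such as `str | None` (the thread does not say which annotations are involved): on Python 3.9 that expression is evaluated when the function is defined and raises a TypeError, whereas with the future import annotations are stored as strings and never evaluated. The sketch below applies the same compiler flag through compile() so it can sit in one runnable snippet; the function get_data shown here is a made-up stand-in, not the PR's method.

```python
import __future__

# Hypothetical source using the "X | Y" union syntax in annotations.
SOURCE = '''
def get_data(location: str | None = None) -> dict | None:
    return None
'''

# Compiling with this flag has the same effect as putting
# "from __future__ import annotations" at the top of the module.
code = compile(
    SOURCE,
    "<sketch>",
    "exec",
    flags=__future__.annotations.compiler_flag,
)

namespace = {}
exec(code, namespace)

# With postponed evaluation the annotations are kept as plain strings.
annotations = namespace["get_data"].__annotations__
print(annotations)
```

In a real module the import simply goes on the first line of code, before any other imports.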

Collaborator

I think the notebook would be easier to read if the comments were moved to Markdown sections to delineate the code examples. Otherwise we may as well just include a code block in docs/advanced/topostats_file_helper.md, which would be a simpler solution to documentation: it's a web page people can go to, so they wouldn't need to activate a virtual environment and then start Jupyter. Code chunks could be copied and pasted (I'd have to work out how to enable a button to support that though).

@ns-rse
Collaborator

ns-rse commented Oct 30, 2024

Just came across h5glance after the Skan developer mentioned it on Mastodon. I wonder if using this would be a simpler solution, it sounds as though it might work within Jupyter Notebooks too.
