
TopoStats classes internally and for writing/reading HDF5#1151

Closed
ns-rse wants to merge 35 commits into main from ns-rse/1102-switching-to-TopoStats-class

Conversation


@ns-rse ns-rse commented May 7, 2025

Closes #1102
Closes #1143

This PR (draft for now) is the logical extension of the GrainCrops GrainCropDirections and ImageGrainCrops introduced by @SylviaWhittle in #1022 and switches to using the TopoStats class @ns-rse introduced in #1145 for handling images and the derived datasets (arrays) such that the unit of interest is individual grains.

It is at the moment far from complete, as the checklist below shows, but because of the large amount of changes and reorganisation I was keen to share it in stages. The tests for each commit pass (thanks pytest-testmon 😀 ) but until all steps are complete the integration tests (`tests_processing.py` and `tests_run_modules.py`) won't pass; I'm working on them as I go through each class.

It's perhaps worth reading the commit messages for the individual commits for a little more information on the reorganisation that has been done so far.

  • LoadScans
  • Filters
  • Grains
  • GrainStats
  • DisorderedTracing
  • NodeStats
  • OrderedTracing
  • Splining
  • Curvature

Of note...

Shared Methods

Some methods from Grains were used as `@staticmethod`s from GrainCrops and have been moved out to utils. I've set up some skeleton tests for these but they fail (I get 2x5 arrays back when I would have expected 5x5 arrays from flattening 5x5x3 arrays). Couldn't see any existing tests for these.

Documentation

I intend to document the class structures and in turn the HDF5 format these are written to.

Syrupy

I've closed #1143 as the key test which used `.pkl` files and required manual updating has been addressed. I think we could still switch all tests to use syrupy (see #1152).

AFMReader

These changes will also require modifications to AFMReader, and perhaps moving LoadScans over, but I'm wary of introducing a circular dependency and have already discussed this with @SylviaWhittle. We felt that perhaps AFMReader should only load files and return dictionaries. Reconstructing these to TopoStats / ImageGrainCrops / GrainCropsDirection / GrainCrop should be the domain of TopoStats. I don't think this should be a problem for the Napari plug-in as it could import and use whatever it needs from either.
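The division of labour described above could be sketched roughly as follows. The class definition and key names here are hypothetical stand-ins, not the real TopoStats/AFMReader API:

```python
from dataclasses import dataclass
from typing import Any


@dataclass
class TopoStatsSketch:
    """Hypothetical stand-in for ``topostats.classes.TopoStats``."""

    filename: str
    pixel_to_nm_scaling: float
    image: Any = None


def dict_to_topostats_sketch(data: dict) -> TopoStatsSketch:
    # AFMReader stays general and returns a plain dictionary;
    # reconstructing the class is the domain of TopoStats.
    return TopoStatsSketch(
        filename=data["filename"],
        pixel_to_nm_scaling=data["pixel_to_nm_scaling"],
        image=data.get("image"),
    )


ts = dict_to_topostats_sketch({"filename": "minicircle", "pixel_to_nm_scaling": 0.5})
```

Under this split the Napari plug-in (or any other consumer) depends on AFMReader only for dictionaries and on TopoStats only for the classes, so no circular import arises.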


Before submitting a Pull Request please check the following.

  • Existing tests pass.
  • Documentation has been updated and builds. Remember to update as required...
  • Pre-commit checks pass.
  • New functions/methods have typehints and docstrings.
  • New functions/methods have tests which check the intended behaviour is correct.

@ns-rse ns-rse added the Filters, Grains, IO and refactor labels on May 7, 2025
@ns-rse ns-rse force-pushed the ns-rse/1102-switching-to-TopoStats-class branch from 1501c0d to 8e33faa on May 8, 2025 11:03
@ns-rse ns-rse mentioned this pull request May 9, 2025
@ns-rse ns-rse force-pushed the ns-rse/1102-switching-to-TopoStats-class branch from f8aa5f0 to 00c3bee on May 9, 2025 12:45

@SylviaWhittle SylviaWhittle left a comment


This is looking very good!

I've checked this out locally and just followed the threads of where data goes, and it looks very nice.

One thing: maybe move things out of / don't add more to utils.py, given that IIRC we eventually want to eliminate it? Perhaps a grain_handling.py or something?

I'll have a further look next week with Laura but looking great so far 👍

@SylviaWhittle

Oops meant to merely comment, not approve, sorry!

@ns-rse ns-rse force-pushed the ns-rse/1102-switching-to-TopoStats-class branch 2 times, most recently from 734e0e1 to 5047af4 on May 15, 2025 09:40
@ns-rse ns-rse force-pushed the ns-rse/1102-switching-to-TopoStats-class branch 2 times, most recently from b20878d to 89110ac on June 3, 2025 15:36
@ns-rse ns-rse force-pushed the ns-rse/1102-switching-to-TopoStats-class branch 2 times, most recently from be7360a to 6285d74 on June 17, 2025 13:33
@ns-rse ns-rse force-pushed the ns-rse/1102-switching-to-TopoStats-class branch from 6285d74 to a61203b on June 18, 2025 14:13
@ns-rse ns-rse force-pushed the ns-rse/1102-switching-to-TopoStats-class branch from c32a889 to acfa3c3 on September 17, 2025 09:58

ns-rse commented Sep 24, 2025

Now have run_nodestats working with TopoStats objects; haven't yet fixed all the unit tests against this, but pushing on with ordered_tracing, splining and curvature and will address all failing unit tests once these are done.

@ns-rse ns-rse force-pushed the ns-rse/1102-switching-to-TopoStats-class branch from 944ccb0 to 8508ba1 on September 24, 2025 14:28
@ns-rse ns-rse force-pushed the ns-rse/1102-switching-to-TopoStats-class branch 3 times, most recently from 7dc9535 to 1ea0e3e on October 8, 2025 12:52
- [X] `LoadScans`
- [ ] `Filters`
- [ ] `Grains`
- [ ] `GrainStats`
- [ ] `DisorderedTracing`
- [ ] `NodeStats`
- [ ] `OrderedTracing`
- [ ] `Splining`
Switches `Filters()` class over to using `TopoStats` class objects as input. Tests directly on `Filters()` are updated,
but integration tests (i.e. of how this impacts on `run_modules.py` and `processing.py`) have _not_ been included in
this commit as they also require updating the other classes (`Grains` / `DisorderedTracing` / `NodeStats` / `OrderedTracing` / `Splining`).
The `Grains` class now works with `TopoStats` classes. However, because `GrainCrops` was used in `TopoStats` and this work meant `TopoStats` was used by `Grains`, we introduced a circular dependency which Python, reasonably, complains about. The solution has been to move the class definitions to their own module, `topostats.classes`, but that wasn't without some issues since there are static methods of the `Grains` class that were used _within_ `GrainCrop`. For now these have been moved to the `utils` module and I've started writing tests for them (as they didn't appear to have any).

As a consequence this commit has a lot of things moving around which _will_ make it a pain to review, but hopefully this will be worth it.

For now the whole test suite does _not_ pass because the integration tests, where the pipeline is run end-to-end, fail. No attempt has been made to correct this yet because ultimately we would like to simply update the `TopoStats` objects and pass them around, and that will only be addressed once each processing step/class has been refactored to work with these.

Subsequent modules should be a little easier to refactor now that the circular dependencies have been broken.
Switches `GrainStats` to take the `TopoStats` object as an argument, extract the `ImageGrainCrops.GrainCropDirection.crops` (be that `above` or `below`) and calculate the statistics from the returned dictionary.

Tests are updated and pass for this module alone; integration tests still fail and will be addressed after all modules are updated.
Required because loading `.topostats` objects from HDF5 via AFMReader returns dictionaries. This is OK and I think for now
we should not change it, as it keeps AFMReader very general and of use to others; but internally, as we switch
to `TopoStats` classes for all the processing, each entry point that loads a `.topostats` file requires a `TopoStats`
object, so we _have_ to convert these on loading.
- `padding` should be `int()` but was being read as `np.float64()`
- mistakenly always tried to set `crop["skeleton"]` even if it's not present (in which case it should be `None`).
Add `_to_dict()` methods to each of the following classes...

- `MatchedBranch`
- `Molecule`
- `Node`
- `OrderedTrace`

...and ensures these are written to HDF5.

Adds dummy objects to `tests/conftest.py` and tests the methods work via `tests/test_classes.py`.

Currently the types of many of these are _wrong_ because I don't know what they actually represent; that doesn't really matter for the testing though, which uses dictionary comprehension and handles any type.

The key point is that the `GrainCrop.grain_crop_to_dict()` method now works with all of the additional attributes, so we can write
the full `TopoStats` object to HDF5. This is required for ongoing test development of the remaining `OrderedTrace`,
`Splining` and `Curvature` work, so we can write intermediary `.topostats` objects which we can load for tests (instead of
running the whole processing pipeline from the start).

This is, however, also **vital** to the additional entry-points (aka "swiss-army knife") work, so we can write `.topostats`
objects with all of the data up to a given point and load it in the future (the previous commit e731084 added the
necessary `dict_to_topostats()` function for converting the HDF5-based dictionaries to `TopoStats` objects).
@ns-rse ns-rse force-pushed the ns-rse/1102-switching-to-TopoStats-class branch from fead0b4 to 907a502 on October 13, 2025 13:10
Successes...

- Don't attempt to order traces that do not have a disordered trace
- `OrderedTrace` class with attributes and methods
- `MatchedBranch` class

Very messy at the moment, some thoughts...

- noticing a number of places where vectorisation could be used instead of loops and some nesting that seems redundant.
- Dictionaries aren't currently mapped to the classes and their structure, many attributes are themselves dictionaries.
- 2025-10-09 - Currently need to get ordered_branches passing around correctly, they are meant to be attributes of
`MatchedBranch`.
- `tests/resources/tracing/ordered_tracing/catenane_post_nodestats.topostats` is currently 304.4 MB, which is too big;
  need to do something about this. It has been renamed for now to `catenane_post_nodestats_20251013.topostats` because
  of a conflict when rebasing.
@ns-rse ns-rse force-pushed the ns-rse/1102-switching-to-TopoStats-class branch from e9b940c to 54e4b94 on October 13, 2025 17:15
Closes #1220 (and possibly others but I can't find them at the moment!)

TopoStats' modular design, which is being improved in the current refactoring, means that it should be easy to extend the
analysis pipelines by developing other packages such as [AFMSlicer](https://github.com/AFM-SPM/AFMSlicer), where work is
under way.

One of the things that will be important is to allow developers of such packages, and in turn users, to generate sample
configuration files which they can change as they desire.

Rather than have the same code duplicated across packages we can use the `io.write_config_with_comments()` function from
TopoStats to load a `<pkg_name>/default_config.yaml` from a package and write that to disk which is what this Pull
Request achieves.

I've included an early version of `docs/advanced/extending.md` to document how to develop extension packages. It _will_
change dramatically as this takes shape, since this is new territory for me, but I felt it important to document what I'm
doing now so that I can expand and improve on it as things change and lessons are learnt.

**NB** This branch will deliberately target `ns-rse/1102-switching-to-TopoStats-class` as that will be the basis on
which other packages are built.
Successes...

- Don't attempt to order traces that do not have a disordered trace
- `OrderedTrace` class with attributes and methods
- `MatchedBranch` class

Very messy at the moment, some thoughts...

- noticing a number of places where vectorisation could be used instead of loops and some nesting that seems
  redundant. This won't be addressed in this PR but should be addressed in the future
- Dictionaries aren't currently mapped to the classes and their structure, many attributes are themselves dictionaries.
- 2025-10-09 - Currently need to get ordered_branches passing around correctly, they are meant to be attributes of
  `MatchedBranch`.
- `tests/resources/tracing/ordered_tracing/catenane_post_nodestats.topostats` is currently 304.4 MB, which is too big;
  need to do something about this. It has been renamed for now to `catenane_post_nodestats_20251013.topostats` because
  of a conflict when rebasing. Working on making it so we can pickle objects (have added `__getstate__` and
  `__setstate__` to all classes, see next commit).
@ns-rse ns-rse force-pushed the ns-rse/1102-switching-to-TopoStats-class branch from 54e4b94 to 23ca66c on October 14, 2025 16:31
- adds `thresholds` and `threshold_method` properties to `GrainCrop` class
- adds `config` and `full_mask_tensor` properties to `TopoStats` class
- updates tests in light of these changes
- corrects a minor typo in `default_config.yaml`

The main thing this adds, though, is `__getstate__`/`__setstate__` methods for each of the classes. The reason for
doing so is that classes with `@property` objects associated with them can't be pickled, and so they need
explicit conversion to dictionaries.

See...

- [here](https://stackoverflow.com/a/1939384/1444043)
- [Handling stateful objects](https://docs.python.org/3/library/pickle.html#pickle-state)

Unfortunately this still fails...

```
from pathlib import Path
import pickle as pkl
from topostats.classes import TopoStats

OUTDIR = Path.cwd()
OUTFILE = OUTDIR / "empty.topostats"
empty_topostats = TopoStats(img_path = None)
with OUTFILE.open(mode="wb") as f:
    pkl.dump(empty_topostats, f)
TypeError                                 Traceback (most recent call last)
Cell In[905], line 2
      1 with OUTFILE.open(mode="wb") as f:
----> 2     pkl.dump(empty_topostats, f)

TypeError: cannot pickle 'property' object

empty_topostats.__getstate__()
{'_image_grain_crops': <property at 0x7fb40c81e0c0>,
 '_filename': <property at 0x7fb40c81d170>,
 '_pixel_to_nm_scaling': <property at 0x7fb40c81f880>,
 '_img_path': PosixPath('/home/neil/work/git/hub/AFM-SPM/TopoStats/tmp'),
 '_image': <property at 0x7fb39ce731a0>,
 '_image_original': <property at 0x7fb39ce71e40>,
 '_full_mask_tensor': <property at 0x7fb39ce72980>,
 '_topostats_version': <property at 0x7fb39ce71d00>,
 '_config': <property at 0x7fb39ce72020>}
```

Everything is _still_ a `property`.

This dummy example works fine though...

```
from dataclasses import dataclass


@dataclass
class dummy:
    var1: int | None = None
    var2: float | None = None
    var3: str | None = None
    var4: list[int] | None = None
    var5: dict[str, str] | None = None

    def __getstate__(self):
        # return {"_var1": self._var1,
        #         "_var2": self._var2,
        #         "_var3": self._var3,
        #         "_var4": self._var4,
        #         "_var5": self._var5,}
        state = self.__dict__.copy()
        return state

    def __setstate__(self, state):
        # self._var1 = state["_var1"]
        # self._var2 = state["_var2"]
        # self._var3 = state["_var3"]
        # self._var4 = state["_var4"]
        # self._var5 = state["_var5"]
        self.__dict__.update(state)

    @property
    def var1(self) -> int:
        """
        Getter for the ``var1`` attribute.

        Returns
        -------
        int
            Returns the value of ``var1``.
        """
        return self._var1

    @var1.setter
    def var1(self, value: int) -> None:
        """
        Setter for the ``var1`` attribute.

        Parameters
        ----------
        value : int
            Value to set for ``var1``.
        """
        self._var1 = value

    @property
    def var2(self) -> float:
        """
        Getter for the ``var2`` attribute.

        Returns
        -------
        float
            Returns the value of ``var2``.
        """
        return self._var2

    @var2.setter
    def var2(self, value: float) -> None:
        """
        Setter for the ``var2`` attribute.

        Parameters
        ----------
        value : float
            Value to set for ``var2``.
        """
        self._var2 = value

    @property
    def var3(self) -> str:
        """
        Getter for the ``var3`` attribute.

        Returns
        -------
        str
            Returns the value of ``var3``.
        """
        return self._var3

    @var3.setter
    def var3(self, value: str) -> None:
        """
        Setter for the ``var3`` attribute.

        Parameters
        ----------
        value : str
            Value to set for ``var3``.
        """
        self._var3 = value

    @property
    def var4(self) -> list[int]:
        """
        Getter for the ``var4`` attribute.

        Returns
        -------
        list[int]
            Returns the value of ``var4``.
        """
        return self._var4

    @var4.setter
    def var4(self, value: list[int]) -> None:
        """
        Setter for the ``var4`` attribute.

        Parameters
        ----------
        value : list[int]
            Value to set for ``var4``.
        """
        self._var4 = value

    @property
    def var5(self) -> dict[str,str]:
        """
        Getter for the ``var5`` attribute.

        Returns
        -------
        dict[str,str]
            Returns the value of ``var5``.
        """
        return self._var5

    @var5.setter
    def var5(self, value: dict[str,str]) -> None:
        """
        Setter for the ``var5`` attribute.

        Parameters
        ----------
        value : dict[str,str]
            Value to set for ``var5``.
        """
        self._var5 = value

OUTFILE = OUTDIR / "empty.dummy"
empty_dummy = dummy()
with OUTFILE.open(mode="wb") as f:
    pkl.dump(empty_dummy, f)
```

...no error, and I don't understand where I/we have gone wrong?!

I'm somewhat inclined to move away from `@dataclass` and using `@property` to provide the setter/getter design
pattern, and instead use plain classes with attributes.
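For what it's worth, one known sharp edge when mixing `@dataclass` with a same-named `@property` is that the property object defined in the class body becomes the dataclass field's default value, so any field left at its default ends up storing the `property` instance itself, which matches the `<property at 0x…>` values in the `__getstate__()` output above. Whether this is the whole story here is unclear, but a minimal reproduction (hypothetical class) looks like this:

```python
import pickle
from dataclasses import dataclass
from typing import Optional


@dataclass
class Broken:
    """Minimal reproduction: a dataclass field shadowed by a same-named property."""

    x: Optional[int] = None

    @property
    def x(self) -> Optional[int]:
        return self._x

    @x.setter
    def x(self, value) -> None:
        self._x = value


b = Broken()
# The class body rebinds ``x`` to the property object, which the dataclass
# machinery then treats as the field's default; the generated __init__ passes
# it through the setter, storing the property itself in ``_x``.
print(type(b._x))  # <class 'property'>
try:
    pickle.dumps(b)
except TypeError as err:
    print(err)  # cannot pickle 'property' object
```

Passing a value explicitly (e.g. `Broken(x=5)`) stores the real value, which would explain why `_img_path` above survived while the defaulted fields did not.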
feature: write YAML configuration files from other packages
Moves to [Pydantic dataclasses](https://docs.pydantic.dev/latest/concepts/dataclasses/) for stricter data validation.

This means we can pickle `TopoStats` objects, which is useful because in the test suite we don't want to run the whole
pipeline when we want to test e.g. `Nodestats`. As a consequence we now have pickles which are loaded as pytest
fixtures (from `tests/conftest.py`) rather than lines of code within tests themselves that save and modify
`.npy`/`.pkl` files.

There are therefore three sets of pickles...

- `minicircle_small`
- `catenanes`
- `rep_int`

...at different stages...

- `_post_grainstats`
- `_post_disordered_tracing`
- `_post_nodestats`

...and we will develop additional fixtures for...

- `_post_ordered_tracing`
- `_post_curvature`
- `_post_splining` (optional, not required at the moment as no subsequent processing is done after this)

A slight disconnect might arise from how these pickles were created: at the moment it is code in a `.py` file on @ns-rse's
computer. @ns-rse will look at adding this as an additional script in the repository, but as more work is required it's
not included at the moment.

This now allows me to finish off refactoring and writing the integration test for `ordered-tracing`.
feature(classes): Pydantic classes and pickling
@ns-rse ns-rse force-pushed the ns-rse/1102-switching-to-TopoStats-class branch from 46fba2d to e057771 on October 29, 2025 15:05
I started rebasing but didn't fancy the hell of going through repeated merge conflicts, so opted to merge instead. To do
so I first created a branch from `ns-rse/1102-switching-to-TopoStats-class` called `ns-rse/1102-test-merging-main`,
switched to it and... tested merging `main`. I resolved all the conflicts only once. I then renamed branches locally (but
not on `origin`)...

- `ns-rse/1102-switching-to-TopoStats-class` > `ns-rse/1102-switching-to-TopoStats-class-2025-10-29`
- `ns-rse/1102-test-merging-main` > `ns-rse/1102-switching-to-TopoStats-class`

This meant I could then push the local `ns-rse/1102-switching-to-TopoStats-class` which had `main` merged in to
`origin`. I actually think this is what GitHub does in the background when you resolve merge conflicts but am not 100%
sure.

I would do this again, and I now see why some people advocate for `git merge` over `git rebase`. The only thing I'd do
differently is to simply make a backup branch of the one I want to merge `main` into, so that I don't have to
bother with renaming (but still have a backup in case everything goes tits-up!).
@ns-rse ns-rse force-pushed the ns-rse/1102-switching-to-TopoStats-class branch from e057771 to 11deabd on October 29, 2025 15:22

ns-rse commented Nov 4, 2025

I'm revisiting Nodestats and have some questions about the various nested dictionaries that I was wondering if anyone
might know the answers to (@Laura Wiggins, I think you worked on this most with Max, so maybe you have some insight).

Matched and Unmatched Branches

  1. There are matched and unmatched branches. At the moment I've created a class MatchedBranch to replace the former
    but have left the latter as a dictionary, unmatched_branches. As far as I can tell these have some overlapping
    attributes/values: they both have angles, but MatchedBranch (née matched_branches) also includes
    ordered_coords, heights and fwhm. For simplicity I think it would make sense to have a single Branch class and not
    populate the additional attributes where the branch is not matched (and perhaps have a bool attribute to indicate
    whether it is matched or not).

  2. The MatchedBranch class mirrors the structure of the existing matched_branch, which included
    matched_branch["fwhm"]. Reading the typehint and docstring of nodeStats.analyse_node_branches(), the fwhm is
    "full-width half maximum of the branches" and is meant to be npt.NDArray[np.number], but looking at the tests it's
    actually a dictionary with the following (which comes from tests/resources/tracing/nodestats/catenane_node_0_matched_branches_analyse_node_branches.pkl):

```python
matched_branch["fwhm"] = {
    'fwhm': np.float64(40.314950876985606),
    'half_maxs': [np.float64(-18.432397834711107),
                  np.float64(21.882553042274495),
                  np.float64(2.821882638708631e-09)],
    'peaks': [np.int64(93),
              np.float64(0.0),
              np.float64(3.819342748549839e-09)]
}
```

My question here stems from the fact that the top-level key fwhm doesn't describe what the dictionary contains, so
I'm wondering if the nesting here can simply be removed so that the actual fwhm (the number 40.314950876985606 in
the above example) becomes an attribute of a [Matched]Branch, along with half_maxs and peaks? Obviously these would
be empty/None attributes in branches that aren't matched?


ns-rse commented Nov 4, 2025

Answers after chatting with @SylviaWhittle

  1. Keep MatchedBranches and have UnmatchedBranches as a separate class.

  2. Everything nested within fwhm is required (i.e. fwhm / half_maxs / peaks) and it is fine to bump these up to
    be attributes of MatchedBranches as long as there is clear nomenclature that all three pertain to "full-width
    half-maximum".

Also discussed...

Having classes with attributes isn't too dissimilar to dictionaries with key/value pairs, but with the added benefit
that the types are clearly defined (no more 🦆 typing in the code base). Because the code base is growing it is
becoming harder for any one person to keep it all in their head and reason through all the steps, so people will be
working on one small feature that has to fit into a bigger picture. Ensuring datatypes are consistent as they are passed
around/through the analysis pipeline makes development considerably easier, as errors manifest themselves earlier in the
development process and can be corrected (I've found this to be the case so far, with some types being wrong in the tests
as I've eased Pydantic dataclass validation into the current branch I'm working on).
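Bumping the nested `fwhm` values up to clearly named attributes, as agreed above, might look something like the following sketch. The class and attribute names (in particular the `fwhm_` prefix) are illustrative only, not the real `MatchedBranch` definition:

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class MatchedBranchSketch:
    """Hypothetical illustration; not the real MatchedBranch class."""

    fwhm: Optional[float] = None
    fwhm_half_maxs: Optional[list] = None
    fwhm_peaks: Optional[list] = None


def flatten_fwhm(nested: dict) -> MatchedBranchSketch:
    # Lift the values out of the nested ``fwhm`` dictionary, keeping an
    # explicit ``fwhm_`` prefix so that all three attributes clearly
    # pertain to the full-width half-maximum.
    fwhm_dict = nested["fwhm"]
    return MatchedBranchSketch(
        fwhm=fwhm_dict["fwhm"],
        fwhm_half_maxs=fwhm_dict["half_maxs"],
        fwhm_peaks=fwhm_dict["peaks"],
    )


branch = flatten_fwhm({"fwhm": {"fwhm": 40.31, "half_maxs": [-18.4, 21.9], "peaks": [93, 0.0]}})
```

For unmatched branches the three attributes would simply remain `None`.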


ns-rse commented Nov 7, 2025

I have now added config as an attribute to the TopoStats class, which means the configuration loaded when files are
read follows the TopoStats object around as it passes between the processing classes
(i.e. Filters/Grains/GrainStats/DisorderedTracing/NodeStats/OrderedTracing/Splining/Curvature).
This makes life a lot easier as we don't have to pull apart the config dictionary (loaded from
topostats/default_config.yaml or a user-provided option) to pass into each of the processing classes; it sits adjacent
to the image data itself.
We will have to remove the options and populate the class attributes from the TopoStats.config attribute.

For example we currently have...

```python
class Filters:
    def __init__(
        self,
        topostats_object: TopoStats,
        row_alignment_quantile: float = 0.5,
        threshold_method: str = "otsu",
        otsu_threshold_multiplier: float = 1.7,
        threshold_std_dev: dict | None = None,
        threshold_absolute: dict | None = None,
        gaussian_size: float | None = None,
        gaussian_mode: str = "nearest",
        remove_scars: dict | None = None,
    ):
        self.topostats_object = topostats_object
        self.image = topostats_object.image_original
        self.filename = topostats_object.filename
        self.pixel_to_nm_scaling = topostats_object.pixel_to_nm_scaling
        self.gaussian_size = gaussian_size
        self.gaussian_mode = gaussian_mode
        self.row_alignment_quantile = row_alignment_quantile
        self.threshold_method = threshold_method
        self.otsu_threshold_multiplier = otsu_threshold_multiplier
        # Convert to lists since the thresholding function expects lists of thresholds but
        # we don't want to use more than one value for the filters step.
        if threshold_std_dev is None:
            self.threshold_std_dev = {"above": [1.0], "below": [1.0]}
        else:
            self.threshold_std_dev = {
                "above": [threshold_std_dev["above"]],
                "below": [threshold_std_dev["below"]],
            }
        if threshold_absolute is None:
            self.threshold_absolute = {"above": [1.0], "below": [10.0]}
        else:
            self.threshold_absolute = {
                "above": [threshold_absolute["above"]],
                "below": [threshold_absolute["below"]],
            }
        self.remove_scars_config = remove_scars
        self.images = {
            "pixels": self.image,
            "initial_median_flatten": None,
            "initial_tilt_removal": None,
            "initial_quadratic_removal": None,
            "initial_scar_removal": None,
            "initial_zero_average_background": None,
            "masked_median_flatten": None,
            "masked_tilt_removal": None,
            "masked_quadratic_removal": None,
            "secondary_scar_removal": None,
            "scar_mask": None,
            "mask": None,
            "final_zero_average_background": None,
            "gaussian_filtered": None,
        }
        self.thresholds = None
        self.medians = {"rows": None, "cols": None}
        self.results = {
            "diff": None,
            "median_row_height": None,
            "x_gradient": None,
            "y_gradient": None,
            "threshold": None,
        }
```

This would become...

EDIT 2025-11-10 : On reflection I think it could be foolish to remove the class options up-front. Instead we should populate them from the topostats_object.config attribute if they are not provided.

```python
class Filters:
    def __init__(
        self,
        topostats_object: TopoStats,
        row_alignment_quantile: float | None = None,
        threshold_method: str | None = None,
        otsu_threshold_multiplier: float | None = None,
        threshold_std_dev: dict | None = None,
        threshold_absolute: dict | None = None,
        gaussian_size: float | None = None,
        gaussian_mode: str | None = None,
        remove_scars: dict | None = None,
    ):
        self.topostats_object = topostats_object
        filter_config = self.topostats_object.config["filter"]
        self.image = topostats_object.image_original
        self.filename = topostats_object.filename
        self.pixel_to_nm_scaling = topostats_object.pixel_to_nm_scaling
        self.gaussian_size = filter_config["gaussian_size"] if gaussian_size is None else gaussian_size
        self.gaussian_mode = filter_config["gaussian_mode"] if gaussian_mode is None else gaussian_mode
        self.row_alignment_quantile = (
            filter_config["row_alignment_quantile"] if row_alignment_quantile is None else row_alignment_quantile
        )
        self.threshold_method = filter_config["threshold_method"] if threshold_method is None else threshold_method
        self.otsu_threshold_multiplier = (
            filter_config["otsu_threshold_multiplier"] if otsu_threshold_multiplier is None else otsu_threshold_multiplier
        )
        # Convert to lists since the thresholding function expects lists of thresholds but
        # we don't want to use more than one value for the filters step.
        if threshold_std_dev is None:
            self.threshold_std_dev = {
                "above": [filter_config["threshold_std_dev"]["above"]],
                "below": [filter_config["threshold_std_dev"]["below"]],
            }
        else:
            self.threshold_std_dev = threshold_std_dev
        if threshold_absolute is None:
            self.threshold_absolute = {
                "above": [filter_config["threshold_absolute"]["above"]],
                "below": [filter_config["threshold_absolute"]["below"]],
            }
        else:
            self.threshold_absolute = threshold_absolute
        self.remove_scars_config = filter_config["remove_scars"] if remove_scars is None else remove_scars
        self.images = {
            "pixels": self.image,
            "initial_median_flatten": None,
            "initial_tilt_removal": None,
            "initial_quadratic_removal": None,
            "initial_scar_removal": None,
            "initial_zero_average_background": None,
            "masked_median_flatten": None,
            "masked_tilt_removal": None,
            "masked_quadratic_removal": None,
            "secondary_scar_removal": None,
            "scar_mask": None,
            "mask": None,
            "final_zero_average_background": None,
            "gaussian_filtered": None,
        }
        self.thresholds = None
        self.medians = {"rows": None, "cols": None}
        self.results = {
            "diff": None,
            "median_row_height": None,
            "x_gradient": None,
            "y_gradient": None,
            "threshold": None,
        }
```


ns-rse commented Dec 10, 2025

Copying this from the Slack chat so it's recorded...

2025-12-05

I've been making some decent progress on my refactoring this week and have been working on making sure all the plots are correctly generated at each stage.

However, I've got some questions about the plots for disordered tracing and would appreciate some feedback/thoughts.

Nodestats

On the main branch I run processing against good ol' minicircle.spm and enable plotting just for nodes. This is
performed by code in processing.py starting on line 593, where "whole image plots" are made, of which there are...
  • minicircle_above_nodes.png
  • 25-convolved_skeleton.png
  • 26-node_centres.png

(screenshots attached)

Astute readers will notice that these are essentially identical images and do not have nodes, convolved skeletons nor node centres plotted on them, suggesting we have problems (on the main branch) with this plotting.

Questions

  • Does anyone use these whole image plots from NodeStats?
  • Does anyone use whole image plots from other stages or is it individual grain plots that people actually look at?

I would expect the individual plots of grains to be of greater interest, as the spatial relationship of grains within an image is not something I have yet heard people express interest in (perhaps a minor exception is excluding grains that touch the edge of images, but that is because they can't be properly analysed).

Ordered traces also suffer from the same affliction: the whole image plot is missing the skeletons of the backbone.

Proposed Solution

Rather than spend time trying to fix this I am inclined to remove the generation of these plots at this stage in the refactoring and say adieu to them:

  1. It reduces the amount of code, particularly redundant code that isn't required.
  2. These are intermediary files; at the end we do still get whole image plots of splined traces produced correctly (and of curvature).

Would this cause problems for anyone?

2025-12-10

Further, I've discovered that the "whole image" plots made on the main branch during ordered tracing suffer from a similar problem to the above, although at first it isn't apparent.

These are the three plots produced with...

plotting:
  image_set:
    - ordered_tracing
[three whole-image screenshots from ordered tracing]

...and if you zoom in on the grains it is just possible to make out a skeleton.

Thus for the time being I'll make sure these are generated, but I still wonder what their actual utility is given they are barely legible; I suspect the cropped plots are the ones that are more useful.

It is also worth noting that the Mask label at the top right of the plots is incorrect.

@ns-rse (Collaborator, Author) commented Jan 5, 2026

Update on this.

topostats process now runs to completion using minicircle.spm, however there are a number of issues that need addressing.

Grain traces are incorrect from Nodestats onwards

The "full image trace" does not match previous examples. Easiest to see in the smoothed traces plots

[side-by-side screenshots: main (left) vs ns-rse/1102-remove-unnecessary-classes (right)]

NB: ns-rse/1102-remove-unnecessary-classes is forked from the branch this PR is based on and is the current work in progress.

I've looked closely at the traces for disordered tracing and these are identical between the two branches for each grain, e.g.

Disordered Tracing

Grain 5

[four rows of side-by-side screenshots: main (left) vs ns-rse/1102-remove-unnecessary-classes (right)]

Grain 7

[four rows of side-by-side screenshots: main (left) vs ns-rse/1102-remove-unnecessary-classes (right)]

Thus I don't think the bug has crept in at the Disordered Tracing stage, the next stage is...

Node Statistics

In ns-rse/1102-remove-unnecessary-classes node statistics/plots are missing for some grains: 3, 5, 7, 9 and 15. Grain 5 can't be compared as there is no data (no doubt related to the observed bug), but for Grain 7 we have...

Grain 7

[four rows of side-by-side screenshots: main (left) vs ns-rse/1102-remove-unnecessary-classes (right)]

This shows that there is a nick in the "Zoom of Node Skeleton" plot and suggests a point at which to investigate the differences. It happens to occur at the point where a branch was pruned, so this may have something to do with the problem.

Grain 8

We see the same problem at two points in Grain 8, and this carries through to the smoothed plots at the start of this comment.

[side-by-side screenshots: main (left) vs ns-rse/1102-remove-unnecessary-classes (right)]

@ns-rse (Collaborator, Author) commented Jan 6, 2026

Progress: I found a note from my refactoring of nodestats about instantiating the nodeStats class.

Original (main)

In the original code on main we see that when we instantiate the class the parameter is simply named skeleton.

    def __init__(
        self,
        filename: str,
        image: npt.NDArray,
        mask: npt.NDArray,
        smoothed_mask: npt.NDArray,
        skeleton: npt.NDArray,
        pixel_to_nm_scaling: np.float64,
        n_grain: int,
        node_joining_length: float,
        node_extend_dist: float,
        branch_pairing_length: float,
        pair_odd_branches: bool,
    ) -> None:
        """
        Initialise the nodeStats class.

        Parameters
        ----------
        filename : str
            The name of the file being processed. For logging purposes.
        image : npt.NDArray
            The array of pixels.
        mask : npt.NDArray
            The binary segmentation mask.
        smoothed_mask : npt.NDArray
            A smoothed version of the binary segmentation mask.
        skeleton : npt.NDArray
            A binary single-pixel wide mask of objects in the 'image'.
        pixel_to_nm_scaling : float
            The pixel to nm scaling factor.
        n_grain : int
            The grain number.
        node_joining_length : float
            The length over which to join skeletal intersections to be counted as one crossing.
        node_extend_dist : float
            The distance under which to join odd-branched node regions.
        branch_pairing_length : float
            The length from the crossing point to pair and trace, obtaining FWHM's.
        pair_odd_branches : bool
            Whether to try and pair odd-branched nodes.
        """
        self.filename = filename
        self.image = image
        self.mask = mask
        self.smoothed_mask = smoothed_mask  # only used to average traces
        self.skeleton = skeleton
        self.pixel_to_nm_scaling = pixel_to_nm_scaling
        self.n_grain = n_grain
        self.node_joining_length = node_joining_length
        self.node_extend_dist = node_extend_dist / self.pixel_to_nm_scaling
        self.branch_pairing_length = branch_pairing_length
        self.pair_odd_branches = pair_odd_branches
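Note that the constructor converts node_extend_dist from nanometres into pixels by dividing by the scaling factor; a minimal sketch of just that conversion (the numeric values here are illustrative, not taken from minicircle.spm):

```python
# Convert a distance in nanometres into pixels, mirroring the
# `node_extend_dist / self.pixel_to_nm_scaling` line in the constructor.
def nm_to_pixels(distance_nm: float, pixel_to_nm_scaling: float) -> float:
    """Return a distance expressed in pixels given nm-per-pixel scaling."""
    return distance_nm / pixel_to_nm_scaling


# Illustrative values: 14 nm at 0.5 nm per pixel is 28 pixels.
assert nm_to_pixels(14.0, 0.5) == 28.0
```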

Refactor (ns-rse/1102-remove-unnecessary-classes)

Whilst refactoring, though, I made the following note, as neither the variable name nor the doc-string indicates that it should be a pruned skeleton.

        try:
            # @ns-rse 2025-11-10 : Should this perhaps be "pruned_skeleton"?
            self.skeleton = grain_crop.disordered_trace.images["skeleton"]

Switching to using pruned_skeleton recovers a number of grains.
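A minimal sketch of the fix, selecting the pruned skeleton from the disordered-trace images rather than the raw one (the dict keys mirror the snippet above, but the helper function itself is hypothetical, not TopoStats API):

```python
import numpy as np


def select_trace_skeleton(images: dict) -> np.ndarray:
    """Prefer the pruned skeleton for node analysis, falling back to the raw one."""
    if "pruned_skeleton" in images:
        return images["pruned_skeleton"]
    return images["skeleton"]


# Toy example: pruning removes a one-pixel spur, so the pruned skeleton is smaller.
skeleton = np.array([[0, 1, 0], [1, 1, 1], [0, 1, 0]])
pruned = np.array([[0, 1, 0], [1, 1, 1], [0, 0, 0]])
images = {"skeleton": skeleton, "pruned_skeleton": pruned}
assert select_trace_skeleton(images).sum() == 4  # pruned, not the 5-pixel raw skeleton
```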

[side-by-side screenshots: main (left) vs ns-rse/1102-remove-unnecessary-classes (right)]

This is a good example of why meaningful names are so important across large code-bases.

We no longer have the breaks in the "Zoom of Node Skeleton" plots either...

[two rows of side-by-side screenshots: main (left) vs ns-rse/1102-remove-unnecessary-classes (right)]

There are still problems with the following five grains (NB indexing starts at zero, as used internally by Python)...

  • 4
  • 11
  • 12
  • 13
  • 17

@ns-rse (Collaborator, Author) commented Jan 6, 2026

Progress: the same issue has affected ordered tracing.

main

The two classes that perform ordered tracing, OrderedTraceNodestats and OrderedTraceTopostats, both only have a skeleton parameter, although unlike NodeStats the doc-strings here do state that it should be pruned, which I missed, and I have mistakenly been passing the initial skeleton in.

Correcting this gets us almost there: grains 11, 12, 13 and 17 have been recovered, but there are still some problems. Grain 5 now isn't traced/splined correctly, and there are small gaps which either shouldn't be there or are in the wrong place in grains 3 and 9.

[side-by-side screenshots: main (left) vs ns-rse/1102-remove-unnecessary-classes (right)]

Grain 5

From ordered tracing we can see that the pruned skeleton that has been passed in has had the wrong section pruned: it's the major loop that is removed rather than the smaller branch.

[screenshot]

Grain 3

The ordered trace doesn't have a break in it.

[screenshot]

I wonder if this is perhaps introduced during splining; in the old version it's on the right of the junction and in the new it's on the left.

Grain 9

Again no break in the ordered trace.

[screenshot]

I'll keep digging as I'm slowly getting closer 🤞

@ns-rse (Collaborator, Author) commented Jan 6, 2026

The pruned skeleton going into OrderedTraceNodestats() is the same one that comes out of disordered tracing (based on summing the array)...

[Tue, 06 Jan 2026 17:14:00] [INFO    ] [topostats] [minicircle] : Plotting disordered traces for grain 5
Disordered Tracing (plot_name='smoothed_mask')
image_value.sum()=np.int64(2368)

Disordered Tracing (plot_name='skeleton')
image_value.sum()=np.int64(141)

Disordered Tracing (plot_name='pruned_skeleton')
image_value.sum()=np.uint64(114)    <<<<< This is the key value

Disordered Tracing (plot_name='branch_types')
image_value.sum()=np.float64(419.0)

Disordered Tracing (plot_name='branch_indexes')
image_value.sum()=np.float64(264.0)

...

[Tue, 06 Jan 2026 17:14:21] [INFO    ] [topostats] [minicircle] : Grain 5 present in NodeStats. Tracing via Nodestats.

grain_crop.disordered_trace.images["pruned_skeleton"].sum()=np.uint64(114)  <<<< Same skeleton being passed in

...

After ordered tracing though the arrays are wrong, as they contain only six pixels (or, for the ordered trace, the sum of 1..6 = 21).

[Tue, 06 Jan 2026 17:14:29] [INFO    ] [topostats] [minicircle] : Plotting ordered traces for grain 6
Ordered Tracing (plot_name='all_molecules')
image_value.sum()=np.uint64(6)

Ordered Tracing (plot_name='ordered_traces')
image_value.sum()=np.float64(21.0)

Ordered Tracing (plot_name='over_under')
image_value.sum()=np.uint64(6)

Ordered Tracing (plot_name='trace_segments')
image_value.sum()=np.float64(6.0)

ToDo: Find out why this is the case and why just for this molecule, as others are handled correctly.
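The sum-based sanity checking used in the log above can be automated; a sketch (the function name and threshold are my own, not TopoStats API) that flags traces which have collapsed to a handful of pixels:

```python
import numpy as np


def check_trace_pixels(name: str, trace: np.ndarray, minimum: int = 20) -> bool:
    """Warn when a binary trace array has suspiciously few set pixels.

    A 114-pixel pruned skeleton collapsing to 6 pixels after ordered
    tracing, as in the log above, would trip this check.
    """
    n_pixels = int(trace.sum())
    if n_pixels < minimum:
        print(f"{name}: only {n_pixels} pixels, expected >= {minimum}")
        return False
    return True


pruned_skeleton = np.ones((10, 12), dtype=int)  # 120 pixels, plausible
all_molecules = np.zeros((10, 12), dtype=int)
all_molecules[5, 3:9] = 1  # only six pixels survive ordered tracing
assert check_trace_pixels("pruned_skeleton", pruned_skeleton)
assert not check_trace_pixels("all_molecules", all_molecules)
```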

@ns-rse (Collaborator, Author) commented Jan 8, 2026

For grain 5 the arrays going into ordered tracing are different between the current work and main. It's possibly something to do with removal of junction points.

This particular grain/molecule has a bigger problem associated with it though, inasmuch as the longer arm at the top is pruned off instead of the small side loop. This leaves a trace with two junctions, which I think is also causing some problems, but addressing those is beyond the scope of my current efforts.

Will return to this next week.

@ns-rse (Collaborator, Author) commented Jan 12, 2026

For grain 5 the arrays going into ordered tracing are different between the current work and main. It's possibly something to do with removal of junction points.

I've found the grain difference! It's a single pixel at a junction.

[side-by-side screenshots: main (left) vs ns-rse/1102-remove-unnecessary-classes (right)]

Now I just need to work out where this skeleton with the pixels at junctions removed resides in the restructured data.

It's not grain_crop.nodes[0].node_area_skeleton because that has this pixel labelled with 2...

[screenshot]

I think this arises because of some confusion about what skeleton is used where.

In main we currently have, in Nodestats on lines 226-233, the convolved skeleton (which marks junctions and end-points) being reset so that anything that isn't 0 (i.e. skeleton, junctions and ends) is set back to 1....

        self.conv_skelly = convolve_skeleton(self.skeleton)
        if len(self.conv_skelly[self.conv_skelly == 3]) != 0:  # check if any nodes
            LOGGER.debug(f"[{self.filename}] : Nodestats - {self.n_grain} contains crossings.")
            # convolve to see crossing and end points
            # self.conv_skelly = self.tidy_branches(self.conv_skelly, self.image)
            # reset skeleton var as tidy branches may have modified it
            self.skeleton = np.where(self.conv_skelly != 0, 1, 0)
            self.image_dict["grain"]["grain_skeleton"] = self.skeleton
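The reset can be seen on a toy array: convolve_skeleton marks plain skeleton pixels 1, end-points 2 and junctions 3, and the np.where collapses everything non-zero back into a binary skeleton. A sketch of just that collapse (the convolution itself is TopoStats-specific and stubbed out here):

```python
import numpy as np

# Toy convolved skeleton: 0 = background, 1 = skeleton, 2 = end-point, 3 = junction.
conv_skelly = np.array(
    [
        [0, 2, 0, 0],
        [0, 1, 0, 0],
        [2, 3, 1, 2],
        [0, 0, 0, 0],
    ]
)

# Check for any nodes (crossings), as in the `if` above.
has_crossings = (conv_skelly == 3).any()

# Reset: anything non-zero (skeleton, ends, junctions) becomes 1 again.
skeleton = np.where(conv_skelly != 0, 1, 0)

assert has_crossings
assert skeleton.sum() == 6 and set(np.unique(skeleton)) == {0, 1}
```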

In ns-rse/1102-remove-unnecessary-classes these lines remain but instead the value is assigned to self.grain_crop.skeleton (although that is actually over-writing the original skeleton and should instead be assigned to something in the Node class).

If we use this skeleton then we recover this grain, and all others look OK too.

[screenshot]

Thus I need to save both the convolved skeleton and the "reset" convolved skeleton as an attribute of the Node class.

Names are so important!
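A sketch of the proposed fix, keeping both skeleton variants on the node object instead of over-writing grain_crop.skeleton (the class and attribute names here are illustrative, not the final TopoStats API):

```python
from dataclasses import dataclass

import numpy as np


@dataclass
class NodeSkeletons:
    """Illustrative container keeping both skeleton variants side by side."""

    convolved_skeleton: np.ndarray  # values 1/2/3 marking skeleton/ends/junctions
    reset_skeleton: np.ndarray  # binary, with ends and junctions merged back in

    @classmethod
    def from_convolved(cls, conv: np.ndarray) -> "NodeSkeletons":
        return cls(convolved_skeleton=conv, reset_skeleton=np.where(conv != 0, 1, 0))


conv = np.array([[0, 3, 0], [1, 1, 2]])
node = NodeSkeletons.from_convolved(conv)
assert node.reset_skeleton.sum() == 4  # all four non-zero pixels kept as 1s
```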

@ns-rse (Collaborator, Author) commented Feb 3, 2026

Closing, superseded by #1284

@ns-rse ns-rse closed this Feb 3, 2026
@ns-rse ns-rse deleted the ns-rse/1102-switching-to-TopoStats-class branch February 3, 2026 14:45

Labels

Filters (issues pertaining to the Filters class), Grains (issues pertaining to the Grains class), IO (input and output), refactor (refactoring of code)

Successfully merging this pull request may close these issues: "Move some regression tests to use syrupy" and "[feature] : Save ImageGrainCrops in .topostats files".