@kmuehlbauer
Collaborator

@kmuehlbauer kmuehlbauer commented Oct 13, 2025

  • update type handling
  • full COMPOUND support (incl. nested, complex compounds and compounds with references)
  • other simplifications and deduplication

Description

The h5types-as-a-tuple approach didn't work out in #119. I got confused trying to implement what #119 needs on top of it. Thanks @lm41 for the hacking session together over the weekend. It wouldn't have been possible without you ❤️.

So, this adds a comprehensive set of H5Type and derived classes. This makes life much easier and code more readable. Although this adds a bit of overhead compared with the basic tuple system, it should not substantially affect performance.

Closes #119, overrides #120

Before you get started

Checklist

  • This pull request has a descriptive title and labels
  • This pull request has a minimal description (most was discussed in the issue, but a two-liner description is still desirable)
  • Unit tests have been added (if codecov test fails)
  • Any changed dependencies have been added or removed correctly (if need be)
  • If you are working on the documentation, please ensure the current build passes
  • All tests pass

…ndling, full COMPOUND support (incl. nested, complex compounds and compounds with references)
@kmuehlbauer kmuehlbauer mentioned this pull request Oct 13, 2025
7 tasks
@codecov

codecov bot commented Oct 13, 2025

Codecov Report

❌ Patch coverage is 87.66667% with 37 lines in your changes missing coverage. Please review.
✅ Project coverage is 74.66%. Comparing base (bcb93ed) to head (42b208a).
⚠️ Report is 9 commits behind head on main.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| pyfive/h5d.py | 77.41% | 7 Missing and 7 partials ⚠️ |
| pyfive/p5t.py | 89.68% | 8 Missing and 5 partials ⚠️ |
| pyfive/dataobjects.py | 90.00% | 1 Missing and 2 partials ⚠️ |
| pyfive/datatype_msg.py | 90.00% | 1 Missing and 2 partials ⚠️ |
| pyfive/h5t.py | 88.88% | 1 Missing and 1 partial ⚠️ |
| pyfive/misc_low_level.py | 92.59% | 0 Missing and 2 partials ⚠️ |

❌ Your patch status has failed because the patch coverage (87.66%) is below the target coverage (100.00%). You can increase the patch coverage or adjust the target coverage.

Additional details and impacted files
@@            Coverage Diff             @@
##             main     #122      +/-   ##
==========================================
+ Coverage   74.13%   74.66%   +0.53%     
==========================================
  Files          11       12       +1     
  Lines        2606     2712     +106     
  Branches      408      407       -1     
==========================================
+ Hits         1932     2025      +93     
- Misses        566      576      +10     
- Partials      108      111       +3     


@kmuehlbauer
Collaborator Author

FYI: My other h5netcdf branch https://github.com/kmuehlbauer/h5netcdf/tree/pyfive works nicely with this branch locally.

@kmuehlbauer
Collaborator Author

@bnlawrence @bmaranville @valeriupredoi and others, I'd like to hear your opinions on the enhancements in this PR.

This PR should tackle the shortcomings of #120 and resolve #119.

The created classes do not have a corresponding h5py class. They are merely a shim between HDF5 and the pyfive outward facing layers. I did this in the first place to be able to see how and where the tuple-types are handled.

As those classes are not an outward-facing API and are only an internal implementation detail, this could suffice for now. If need be, we can mark them as private with a leading underscore and/or move them out of pyfive.h5t.

One possible path of refactoring would be to implement these types completely as the pyfive counterparts of h5py.h5t.TypeID. This is already done for all types as pyfive.h5t.TypeID, with special cases for TypeEnumID and TypeCompoundID (for compatibility with h5netcdf). Some of the internal information is only kept via the numpy dtype. So there is no harm in keeping the proposed approach and seeing how it goes.

A quick introduction on the PR:

  • Implementation of H5Type, H5IntegerType, H5FloatType, H5String, H5OpaqueType, H5CompoundType, H5ReferenceType, H5EnumType, H5Sequence (H5T_VLEN)
  • Implementation of derived types and needed subtypes H5FixedStringType, H5VlenStringType, H5CompoundField
  • All checks for branching are currently done as isinstance checks (we can use __slots__ to get close to the cost of the current tuple-type check)
  • The numpy dtype is a property of that class and is created when the property is accessed, so it stays lazy until needed (this simplifies the code in several places)
  • No dual use of dtype (tuple vs array-protocol typestring), but we might think about renaming the dtype used for the new types, so that dtype always refers to a numpy dtype.
  • Datatype message type determination is simplified and returns an instance of a new type or a nested representation (e.g. for VLEN/Sequence, Compound, Enum)
  • Complex types are handled as compounds; the checks are placed in the new type class, but we can still move this out into a class of its own
  • If storage type and numpy dtype do not fit (e.g. for references) there is currently a special check which translates the original data into the object type representation. This is especially problematic for nested data. There is room for improvement in that regard.
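
To make the list above more concrete, here is a minimal sketch of the pattern it describes: small type classes with `__slots__`, a numpy dtype that is only built on first access, and branching via isinstance. All class and function names below (`H5TypeSketch`, `IntegerSketch`, `FloatSketch`, `CompoundSketch`, `describe`) are hypothetical and chosen for illustration; this is not the code added in this PR.

```python
import numpy as np


class H5TypeSketch:
    """Hypothetical base class; each subclass describes one kind of type."""
    __slots__ = ("_dtype",)

    def __init__(self):
        self._dtype = None  # numpy dtype is built lazily on first access

    def _build_dtype(self):
        raise NotImplementedError

    @property
    def dtype(self):
        # Build the numpy dtype once, then reuse the cached object.
        if self._dtype is None:
            self._dtype = self._build_dtype()
        return self._dtype


class IntegerSketch(H5TypeSketch):
    __slots__ = ("size", "signed", "byteorder")

    def __init__(self, size, signed=True, byteorder="<"):
        super().__init__()
        self.size, self.signed, self.byteorder = size, signed, byteorder

    def _build_dtype(self):
        kind = "i" if self.signed else "u"
        return np.dtype(f"{self.byteorder}{kind}{self.size}")


class FloatSketch(H5TypeSketch):
    __slots__ = ("size", "byteorder")

    def __init__(self, size, byteorder="<"):
        super().__init__()
        self.size, self.byteorder = size, byteorder

    def _build_dtype(self):
        return np.dtype(f"{self.byteorder}f{self.size}")


class CompoundSketch(H5TypeSketch):
    __slots__ = ("fields",)

    def __init__(self, fields):
        super().__init__()
        self.fields = fields  # list of (name, H5TypeSketch) pairs

    def _build_dtype(self):
        # Nested compounds work because each member builds its own dtype.
        return np.dtype([(name, member.dtype) for name, member in self.fields])


def describe(ptype):
    """Branch on the type class with isinstance instead of tuple inspection."""
    if isinstance(ptype, CompoundSketch):
        inner = ", ".join(f"{n}: {describe(t)}" for n, t in ptype.fields)
        return f"compound({inner})"
    if isinstance(ptype, IntegerSketch):
        return f"{'int' if ptype.signed else 'uint'}{ptype.size * 8}"
    if isinstance(ptype, FloatSketch):
        return f"float{ptype.size * 8}"
    raise TypeError(f"unsupported type: {type(ptype).__name__}")


point = CompoundSketch([("x", FloatSketch(8)), ("id", IntegerSketch(4))])
```

In this sketch `point.dtype` is only computed when something asks for it, and `describe(point)` never needs the numpy dtype at all.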

I've added other alignments and code deduplication as I saw fit.

One remark about the added typing: my IDE automatically adds these kinds of type annotations and I left them in. How should this be tackled? I'm inclined to either remove the typing or add it to all touched code. What's your opinion on that?

@kmuehlbauer
Collaborator Author

Trying to test in Python 3.14, too.

@kmuehlbauer kmuehlbauer reopened this Oct 15, 2025
@valeriupredoi
Collaborator

> Trying to test in Python 3.14, too.

Works well 😃

@valeriupredoi valeriupredoi added this to the 1.0 milestone Oct 17, 2025
@bnlawrence
Collaborator

I've had a look at this, and like moving away from tuples!

My only concern is one I had even before this enhancement, and one that you've got in your list of things, and that's that we are overusing the attribute name dtype.

Effectively we have two types of dtypes and we have to work out which one we are dealing with in any context almost by inspection. The name doesn't help us. That makes reading and understanding the pull request hard, but that hardness is not the fault of the pull request itself; even looking at the code requires holding a lot in our mental stacks. E.g.

def _attr_value(self, dtype, buf, count, offset):
    """ Retrieve an HDF5 attribute value from a buffer. """
    # numpy storage dtype
    _dtype = dtype.dtype

The comment at least tells us what the local _dtype is for but down at lines

new_dtype = dtype_replace_refs_with_object(dtype.dtype)
new_array = np.empty(value.shape, dtype=new_dtype)
new_array[:] = value
value = _decode_array(value, new_array)

We are not really sure what new_dtype actually is any more.
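
For readers unfamiliar with this step, the sketch below shows one way a numpy dtype could be rebuilt so that selected leaf fields become object fields, including inside nested compounds. It is a guess at the general idea only: `swap_leaves_for_object` and `looks_like_reference` are hypothetical names, the assumption that references are stored as 8-byte void leaves is made up for the example, and none of this is the body of `dtype_replace_refs_with_object` in pyfive.

```python
import numpy as np


def swap_leaves_for_object(dtype, is_reference):
    """Rebuild `dtype`, turning every leaf flagged by `is_reference` into object."""
    if dtype.fields is None:
        # Leaf (non-structured) dtype: swap it or keep it unchanged.
        return np.dtype(object) if is_reference(dtype) else dtype
    # Structured dtype: recurse into each field, preserving field order.
    return np.dtype(
        [(name, swap_leaves_for_object(dtype.fields[name][0], is_reference))
         for name in dtype.names]
    )


def looks_like_reference(leaf):
    # Assumption for this sketch only: references are 8-byte void leaves.
    return leaf == np.dtype("V8")


# A nested storage layout with one reference-like leaf inside a sub-compound.
storage = np.dtype([("inner", [("ref", "V8"), ("x", "<f8")]), ("n", "<i4")])
converted = swap_leaves_for_object(storage, looks_like_reference)

# An array of the converted dtype can then hold Python objects in "ref".
target = np.empty(3, dtype=converted)
```

The sketch does not preserve field offsets or padding and ignores subarray fields, so it only illustrates the recursion, not a complete conversion.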

In most of the pyfive code, dtype is our dtype (was tuple, now an H5T object, which like I say, I like), except like the last snippet where we need the actual type for numpy, and when we expose the dtype to users where again, it's a numpy thing.

Given that you're touching every usage of our internal tuple dtype, could we change the name in some way so we don't have this confusion? It feels like this is our best chance to do it and given your list of bullets above, it seems like you do think it, so I reckon it's worth the little refactor to do it now.

(In terms of the IDE adding types, I'd be tempted to remove it for now, if we're going to go there - with linter changes - we can do it all at the same time.)

@bnlawrence
Collaborator

I am inclined not to worry too much about our h5t internals differing from those in h5py, but history suggests I am bad at choosing when not to worry ... one way to handle that would be to put all our types in p5t.py.

@kmuehlbauer
Collaborator Author

kmuehlbauer commented Oct 22, 2025

Thanks @bnlawrence for looking at this. I agree with all your reasoning.

> Given that you're touching every usage of our internal tuple dtype, could we change the name in some way so we don't have this confusion? It feels like this is our best chance to do it and given your list of bullets above, it seems like you do think it, so I reckon it's worth the little refactor to do it now.

So I would suggest calling the types P5Type, P5FloatType etc. and putting them in p5t.py. For every current .dtype which used to be a tuple or array-protocol typestring I'd go for .ptype. Everything which is .dtype will then be a numpy dtype.
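
A minimal sketch of that naming convention, assuming hypothetical stand-in classes (`P5FloatSketch` and `DatasetSketch` are illustrative names, not pyfive classes): the internal type description lives under `.ptype`, and anything called `.dtype` is a numpy dtype.

```python
import numpy as np


class P5FloatSketch:
    """Hypothetical stand-in for an internal float type description."""

    def __init__(self, size, byteorder="<"):
        self.size = size
        self.byteorder = byteorder

    @property
    def dtype(self):
        # On the internal type, .dtype is always a numpy dtype.
        return np.dtype(f"{self.byteorder}f{self.size}")


class DatasetSketch:
    """Hypothetical holder showing the .ptype / .dtype split."""

    def __init__(self, ptype):
        # Internal type description (what used to be the tuple "dtype").
        self.ptype = ptype

    @property
    def dtype(self):
        # User-facing attribute: the numpy dtype derived from the ptype.
        return self.ptype.dtype


ds = DatasetSketch(P5FloatSketch(4))
```

With this split, code that branches on the kind of type inspects `ds.ptype`, while code that allocates arrays uses `ds.dtype`, so the name alone says which one is in hand.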

I'll also remove the typing for now.

@kmuehlbauer
Copy link
Collaborator Author

@bnlawrence I've followed your suggestion on renaming/refactoring. Just from looking at the result, it was the right decision.

Collaborator

@bnlawrence bnlawrence left a comment


This is great. I'd be happy to merge this!

@kmuehlbauer
Collaborator Author

Thanks again @bnlawrence for considering this and for the review suggestions. Looking forward to seeing this released.

@valeriupredoi
Collaborator

valeriupredoi commented Oct 27, 2025

great many thanks @kmuehlbauer and @bnlawrence 🍻

@valeriupredoi valeriupredoi merged commit fe911ef into NCAS-CMS:main Oct 27, 2025
6 of 7 checks passed
@kmuehlbauer kmuehlbauer deleted the new-types branch October 27, 2025 11:41
@valeriupredoi
Collaborator

valeriupredoi commented Oct 27, 2025

@kmuehlbauer I noticed the checksum test for the Fletcher32 filter is flaky - it had just failed the checksum test before I was silly enough to rerun it, then the rerun passed - do you expect any flakiness? 🍺 EDIT: only the Python 3.14 test failed, and it passed on rerun

@kmuehlbauer
Collaborator Author

Hmm, I don't expect it to be flaky. This would need a bit closer inspection. I haven't seen this locally.

@valeriupredoi
Collaborator

> Hmm, I don't expect it to be flaky. This would need a bit closer inspection. I haven't seen this locally.

thanks, Kai! No need to inspect - I reran the tests some 25-30 times, no fail - must have been something wrong with the storage of that particular run. As we are 🍺


Development

Successfully merging this pull request may close these issues.

Issue with COMPOUND including REFERENCE

3 participants