introduce new H5 types to replace current type-tuples #122

kmuehlbauer · 2025-10-13T13:48:09Z

update type handling
full COMPOUND support (incl. nested, complex compounds and compounds with references)
other simplifications and deduplication

Description

The h5types-as-a-tuple approach didn't work out in #119. I got confused trying to implement what #119 needs. Thanks @lm41 for the hacking session together over the weekend. It wouldn't have been possible without you ❤️.

So, this adds a comprehensive set of H5Type and derived classes. This makes life much easier and code more readable. Although this adds a bit of overhead compared with the basic tuple system, it should not substantially affect performance.

Closes #119, overrides #120

Before you get started

☝ Create an issue to discuss what you are going to do

Checklist

This pull request has a descriptive title and labels
This pull request has a minimal description (most was discussed in the issue, but a two-liner description is still desirable)
Unit tests have been added (if codecov test fails)
Any changed dependencies have been added or removed correctly (if need be)
If you are working on the documentation, please ensure the current build passes
All tests pass

codecov · 2025-10-13T13:51:48Z

Codecov Report

❌ Patch coverage is 87.66667% with 37 lines in your changes missing coverage. Please review.
✅ Project coverage is 74.66%. Comparing base (bcb93ed) to head (42b208a).
⚠️ Report is 9 commits behind head on main.

Files with missing lines	Patch %	Lines
pyfive/h5d.py	77.41%	7 Missing and 7 partials ⚠️
pyfive/p5t.py	89.68%	8 Missing and 5 partials ⚠️
pyfive/dataobjects.py	90.00%	1 Missing and 2 partials ⚠️
pyfive/datatype_msg.py	90.00%	1 Missing and 2 partials ⚠️
pyfive/h5t.py	88.88%	1 Missing and 1 partial ⚠️
pyfive/misc_low_level.py	92.59%	0 Missing and 2 partials ⚠️

❌ Your patch status has failed because the patch coverage (87.66%) is below the target coverage (100.00%). You can increase the patch coverage or adjust the target coverage.

Additional details and impacted files

@@            Coverage Diff             @@
##             main     #122      +/-   ##
==========================================
+ Coverage   74.13%   74.66%   +0.53%     
==========================================
  Files          11       12       +1     
  Lines        2606     2712     +106     
  Branches      408      407       -1     
==========================================
+ Hits         1932     2025      +93     
- Misses        566      576      +10     
- Partials      108      111       +3

kmuehlbauer · 2025-10-13T14:14:54Z

FYI: My other h5netcdf branch https://github.com/kmuehlbauer/h5netcdf/tree/pyfive works nicely with this branch locally.

kmuehlbauer · 2025-10-15T06:29:47Z

@bnlawrence @bmaranville @valeriupredoi and others, I'd like to hear your opinions on the enhancements in this PR.

This PR should tackle the shortcomings of #120 and resolve #119.

The created classes do not have a corresponding h5py class. They are merely a shim between HDF5 and the pyfive outward facing layers. I did this in the first place to be able to see how and where the tuple-types are handled.

As those classes are not outward-facing API but only an internal implementation detail, this could suffice for now. If need be, we can prefix them with an underscore and/or move them out of pyfive.h5t.

One possible path of refactoring would be to completely implement these types as pyfive counterparts of h5py.h5t.TypeID. This is already done for all types as pyfive.h5t.TypeID and special cased for TypeEnumID and TypeCompoundID (for compatibility with h5netcdf). Some of the internal information is only kept via the numpy dtype. So there is no harm in keeping the proposed approach and seeing how it goes.

A quick introduction on the PR:

Implementation of H5Type, H5IntegerType, H5FloatType, H5String, H5OpaqueType, H5CompoundType, H5ReferenceType, H5EnumType, H5Sequence (H5T_VLEN)
Implementation of derived types and needed subtypes H5FixedStringType, H5VlenStringType, H5CompoundField
All checks for branching are currently done as isinstance checks (we can use __slots__ to get close to the cost of the current tuple-type check)
The numpy dtype is a property of each class and is created when that property is accessed, so it stays lazy until needed (this simplifies the code in several places)
No dual use of dtype (tuple vs. array-protocol typestring), but we might think about renaming the dtype attribute used for the new types, so that dtype always refers to a numpy dtype.
Datatype message type determination is simplified and returns an instance of a new type or a nested representation (e.g. for VLEN/Sequence, Compound, Enum)
Complex types are handled as compounds; the checks are placed in the new type class, but we can still move this out into its own class
If storage type and numpy dtype do not fit (e.g. for references), there is currently a special check which translates the original data into the object type representation. This is especially problematic for nested data. There is room for improvement in that regard.
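
To make the isinstance branching and the lazy numpy dtype property from the list above more concrete, here is a minimal sketch. The class names (SketchType, SketchIntegerType, SketchCompoundType) and the describe helper are hypothetical and are not the classes added in this PR; the sketch only assumes that numpy is available.

```python
import numpy as np


class SketchType:
    """Hypothetical base class (not pyfive's actual type class)."""

    __slots__ = ("_dtype",)

    def __init__(self):
        self._dtype = None  # the numpy dtype is not built yet

    def _build_dtype(self):
        raise NotImplementedError

    @property
    def dtype(self):
        # Build the numpy dtype on first access and reuse it afterwards.
        if self._dtype is None:
            self._dtype = self._build_dtype()
        return self._dtype


class SketchIntegerType(SketchType):
    __slots__ = ("size", "signed", "byteorder")

    def __init__(self, size, signed=True, byteorder="<"):
        super().__init__()
        self.size = size
        self.signed = signed
        self.byteorder = byteorder

    def _build_dtype(self):
        kind = "i" if self.signed else "u"
        return np.dtype(f"{self.byteorder}{kind}{self.size}")


class SketchCompoundType(SketchType):
    __slots__ = ("fields",)

    def __init__(self, fields):
        super().__init__()
        self.fields = fields  # list of (name, byte offset, SketchType)

    def _build_dtype(self):
        return np.dtype({
            "names": [name for name, _, _ in self.fields],
            "formats": [ftype.dtype for _, _, ftype in self.fields],
            "offsets": [offset for _, offset, _ in self.fields],
        })


def describe(t):
    # Branch on the class instead of inspecting the contents of a tuple.
    if isinstance(t, SketchCompoundType):
        inner = ", ".join(describe(ftype) for _, _, ftype in t.fields)
        return f"compound({inner})"
    if isinstance(t, SketchIntegerType):
        prefix = "int" if t.signed else "uint"
        return f"{prefix}{t.size * 8}"
    raise TypeError(f"unsupported type: {t!r}")


point = SketchCompoundType([
    ("a", 0, SketchIntegerType(4)),
    ("b", 4, SketchIntegerType(2, signed=False)),
])
built_before_access = point._dtype is not None
np_dtype = point.dtype  # the numpy dtype is created here
label = describe(point)
```

In this sketch nothing touches numpy until dtype is read, and nested compounds fall out of the recursion in _build_dtype and describe.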

I've added other alignments and code deduplication as I saw fit.

One remark about the added typing: my IDE automatically adds these type annotations and I left them in. How should this be tackled? I'm inclined to either remove the typing or add it to all touched code. What's your opinion on that?

kmuehlbauer · 2025-10-15T16:17:07Z

Trying to test in Python 3.14, too.

valeriupredoi · 2025-10-15T16:45:34Z

> Trying to test in Python 3.14, too.

Works well 😃

bnlawrence · 2025-10-22T11:00:17Z

I've had a look at this, and like moving away from tuples!

My only concern is one I had even before this enhancement, and one that you've got in your list of things, and that's that we are overusing the attribute name dtype.

Effectively we have two kinds of dtype, and in any context we have to work out which one we are dealing with almost by inspection. The name doesn't help us. That makes reading and understanding the pull request hard, but that difficulty is not the fault of the pull request itself; even just looking at the code puts a lot on our mental stacks. E.g.

pyfive/pyfive/dataobjects.py

Lines 255 to 259 in 45e1a98

    
    def _attr_value(self, dtype, buf, count, offset):
        """ Retrieve an HDF5 attribute value from a buffer. """
        # numpy storage dtype
        _dtype = dtype.dtype

The comment at least tells us what the local _dtype is for, but down at lines

pyfive/pyfive/dataobjects.py

Lines 289 to 292 in 45e1a98

    
    new_dtype = dtype_replace_refs_with_object(dtype.dtype)
    new_array = np.empty(value.shape, dtype=new_dtype)
    new_array[:] = value
    value = _decode_array(value, new_array)

We are not really sure what new_dtype actually is any more.
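
For readers unfamiliar with this step, the following sketch illustrates the general idea of swapping reference fields for object fields in a numpy structured dtype. It is not pyfive's dtype_replace_refs_with_object: the function refs_to_object_dtype, its ref_fields parameter, and the field names are all hypothetical, and real code would have to detect reference fields from the type description instead of taking their names as an argument.

```python
import numpy as np


def refs_to_object_dtype(storage_dtype, ref_fields):
    """Return a structured dtype whose named reference fields become object.

    Hypothetical helper for illustration only. The result is a packed
    in-memory dtype; on-disk offsets are not preserved.
    """
    if storage_dtype.names is None:
        return storage_dtype
    new_fields = []
    for name in storage_dtype.names:
        field_dtype = storage_dtype.fields[name][0]
        if name in ref_fields:
            new_fields.append((name, np.dtype(object)))
        elif field_dtype.names is not None:
            # Recurse into nested compounds.
            new_fields.append((name, refs_to_object_dtype(field_dtype, ref_fields)))
        else:
            new_fields.append((name, field_dtype))
    return np.dtype(new_fields)


# Storage layout: an integer id plus an 8-byte address standing in for a reference.
storage = np.dtype([("id", "<i4"), ("ref", "<u8")])
raw = np.array([(1, 4096)], dtype=storage)

new_dtype = refs_to_object_dtype(storage, {"ref"})
out = np.empty(raw.shape, dtype=new_dtype)
for name in storage.names:
    out[name] = raw[name]  # copy field by field into the object-capable array
```

Here new_dtype is a numpy dtype throughout, which is exactly the distinction the attribute name alone does not convey.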

In most of the pyfive code, dtype is our own dtype (formerly a tuple, now an H5T object, which, as I say, I like), except in places like the last snippet, where we need the actual type for numpy, and where we expose the dtype to users, where again it's a numpy thing.

Given that you're touching every usage of our internal tuple dtype, could we change the name in some way so we don't have this confusion? It feels like this is our best chance to do it, and given your list of bullets above, it seems you think so too, so I reckon it's worth the little refactor to do it now.

(In terms of the IDE adding types, I'd be tempted to remove them for now; if we're going to go there, with linter changes, we can do it all at the same time.)

bnlawrence · 2025-10-22T11:04:28Z

I am inclined not to worry too much about our h5t internals differing from those in h5py, but history suggests I am bad at choosing when not to worry ... one way to handle that would be to put all our types in p5t.py.

kmuehlbauer · 2025-10-22T12:58:22Z

Thanks @bnlawrence for looking at this. I agree with all your reasoning.

> Given that you're touching every usage of our internal tuple dtype, could we change the name in some way so we don't have this confusion? It feels like this is our best chance to do it, and given your list of bullets above, it seems you think so too, so I reckon it's worth the little refactor to do it now.

So I would suggest calling the types P5Type, P5FloatType etc. and putting them in p5t.py. For every current .dtype that used to be a tuple or an array-protocol typestring, I'd go for .ptype. Then everything that is .dtype will be a numpy dtype.
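
A small sketch of how that naming split could look in practice. FloatTypeSketch and DatasetSketch are hypothetical stand-ins, not the P5Type classes or dataset objects in pyfive; the only point is that ptype carries the internal type description while dtype is always a numpy dtype.

```python
import numpy as np


class FloatTypeSketch:
    """Hypothetical internal float type description (not pyfive's P5FloatType)."""

    def __init__(self, size, byteorder="<"):
        self.size = size
        self.byteorder = byteorder

    @property
    def dtype(self):
        # The numpy view of this internal type.
        return np.dtype(f"{self.byteorder}f{self.size}")


class DatasetSketch:
    """Hypothetical object that exposes both names side by side."""

    def __init__(self, ptype):
        self.ptype = ptype  # internal type object

    @property
    def dtype(self):
        return self.ptype.dtype  # always a numpy dtype


ds = DatasetSketch(FloatTypeSketch(8))
data = np.zeros(3, dtype=ds.dtype)  # .dtype can go straight into numpy calls
```

With this convention a reader can tell from the attribute name alone which kind of type object is in hand.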

I'll also remove the typing for now.

kmuehlbauer · 2025-10-22T14:04:12Z

@bnlawrence I've worked along your suggestion on renaming/refactoring. Just from looking at it, it was the right decision.

bnlawrence

This is great. I'd be happy to merge this!

kmuehlbauer · 2025-10-27T11:05:37Z

Thanks again @bnlawrence for considering this and for the review suggestions. Looking forward to seeing this released.

valeriupredoi · 2025-10-27T11:10:53Z

great many thanks @kmuehlbauer and @bnlawrence 🍻

valeriupredoi · 2025-10-27T11:44:39Z

@kmuehlbauer I noticed the checksum test for the fletcher32 filter is flaky: it had just failed before I was silly enough to rerun it, and then the rerun passed. Do you expect any flakiness? 🍺 EDIT: only the Python 3.14 test failed, and it passed on rerun

kmuehlbauer · 2025-10-27T12:07:55Z

Hmm, I don't expect it to be flaky. This would need a bit closer inspection. I haven't seen this locally.

valeriupredoi · 2025-10-27T12:10:21Z

> Hmm, I don't expect it to be flaky. This would need a bit closer inspection. I haven't seen this locally.

thanks, Kai! No need to inspect - I reran the tests some 25-30 times, no fail - must have been something wrong with the storage of that particular run. As we are 🍺

introduce new H5 types to replace current type-tuples, update type ha…

56e83ff

…ndling, full COMPOUND support (incl. nested, complex compounds and compounds with references)

kmuehlbauer mentioned this pull request Oct 13, 2025

full feature COMPOUND type #120

Closed

7 tasks

kmuehlbauer added 2 commits October 14, 2025 10:11

correctly specify vlen string dtype

743778e

pin netCDF4 to <= 1.7.2

5ded339

kmuehlbauer added enhancement bug labels Oct 15, 2025

Merge branch 'main' into new-types

45e1a98

kmuehlbauer closed this Oct 15, 2025

kmuehlbauer reopened this Oct 15, 2025

valeriupredoi added this to the 1.0 milestone Oct 17, 2025

kmuehlbauer added 2 commits October 22, 2025 15:54

Merge remote-tracking branch 'origin/main' into new-types-2

76fefbd

use ptype for new P5Type, use dtype for numpy dtype, remove typing

bc127a8

kmuehlbauer commented Oct 22, 2025

pyfive/dataobjects.py Outdated

kmuehlbauer commented Oct 22, 2025

pyfive/h5d.py

remove underscore in _dtype, remove commented code

c2e950d

Merge branch 'main' into new-types

42b208a

bnlawrence approved these changes Oct 27, 2025

valeriupredoi approved these changes Oct 27, 2025

valeriupredoi merged commit fe911ef into NCAS-CMS:main Oct 27, 2025
6 of 7 checks passed

kmuehlbauer deleted the new-types branch October 27, 2025 11:41
