
Conversation

@victor-zou (Contributor)

FIX #573

Adds a fix and tests. A RuntimeError will be raised if chunks have a heterogeneous filter_mask.

kerchunk/hdf.py (Outdated)
    if filter_mask is None:
        filter_mask = blob.filter_mask
    elif filter_mask != blob.filter_mask:
        raise RuntimeError(f"Dataset {dset.name} has heterogeneous `filter_mask` - "
Member:

Erroring here is reasonable, and I am happy to include it.
BUT, is it not possible, in the case that there are some blobs but not all, to get the bytes representation for all the chunks and/or store the whole array in one materialised chunk? That's not ideal, but it makes the workflow possible.

Contributor Author:

> Erroring here is reasonable, and I am happy to include it. BUT, is it not possible, in the case that there are some blobs but not all, to get the bytes representation for all the chunks and/or store the whole array in one materialised chunk? That's not ideal, but it makes the workflow possible.

Yes, it is possible. Should we raise an error only when the size of the whole data is over a certain threshold (defaulting to the inline threshold, while adding another custom parameter)?

A user may set malicious chunking, like a chunk size of 10000 for data of length 10001. In this case, the trailing chunk of length 1 will probably not be compressed, and the whole array will be too big to inline.
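For reference, here is a rough sketch (illustrative only, not part of this PR; the file and dataset names are made up) of how the per-chunk filter_mask can be inspected with h5py's low-level API:

    import h5py
    import numpy as np

    # create a gzip-compressed dataset whose trailing chunk holds a single element
    with h5py.File("example.h5", "w") as f:
        f.create_dataset("x", data=np.arange(10001), chunks=(10000,), compression="gzip")

    # filter_mask != 0 for a chunk means some filter in the pipeline was skipped
    with h5py.File("example.h5", "r") as f:
        dsid = f["x"].id
        for i in range(dsid.get_num_chunks()):
            info = dsid.get_chunk_info(i)
            print(info.chunk_offset, info.filter_mask, info.size)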

Member:

Yes, I think that all makes sense. There's not too much we can do about the pathological case, but I doubt anyone will be malicious in this space!

@victor-zou (Contributor Author)

@martindurant I have a question on data inlining. In hdf.py, around line 500:

                if data is not None:
                    try:
                        za[:] = data
                    except (ValueError, TypeError):
                        self.store_dict[f"{za.path}/0"] = kwargs["filters"][0].encode(
                            data
                        )
                    return
  1. f"{za.path}/0", will it work if it is not one-dimensional?
  2. = kwargs["filters"][0].encode(, what if the dataset as 0 or more filters?

@martindurant (Member)

  1. You are right, you will need something like f"{za.path}/{'.'.join('0' * ndim)}"
  2. I think we only tested for simple filters like gzip; indeed it could be uncompressed/unencoded (len(filters) == 0) or have multiple steps.

@victor-zou (Contributor Author)

>   1. You are right, you will need something like f"{za.path}/{'.'.join('0' * ndim)}"
>   2. I think we only tested for simple filters like gzip; indeed it could be uncompressed/unencoded (len(filters) == 0) or have multiple steps.

Two more things:

  1. It saves raw bytes here, whereas it saves base64-encoded data below. Since, at the end of the day, the buffer will be encoded for JSON anyway, can I remove the base64-encoding part and leave it to that JSON-encoding step, making the implementation more consistent?
                        if (
                            self.inline
                            and isinstance(v, dict)
                            and v["size"] < self.inline
                        ):
                            self.input_file.seek(v["offset"])
                            data = self.input_file.read(v["size"])
                            try:
                                # easiest way to test if data is ascii
                                data.decode("ascii")
                            except UnicodeDecodeError:
                                data = b"base64:" + base64.b64encode(data)

                            self.store_dict[key] = data
  2. It seems that in this part, the data is inlined no matter its size?

@martindurant (Member)

Yeah, there is a separate pass that coerces things to JSON types at the end, when doing the write. Probably they should all be bytes at this point.

So, inlining will happen per chunk based on size; but if there are types that need inlining, like the issue you are fixing, then there is no choice.
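Very roughly, that final pass does something like this (a hypothetical helper for illustration, not kerchunk's actual function):

    import base64

    def _json_ready(value):
        # bytes become JSON-safe strings: plain ascii where possible,
        # otherwise base64 with the "base64:" prefix used for inlined chunks
        if isinstance(value, bytes):
            try:
                return value.decode("ascii")
            except UnicodeDecodeError:
                return "base64:" + base64.b64encode(value).decode()
        return value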

@victor-zou (Contributor Author)

@martindurant Now, the code will try to inline unsupported features, including unsupported filters and heterogeneous filter_mask. For a heterogeneous filter_mask, only the chunks whose filter_mask differs are inlined.

More test h5 files and test cases have been added.

The following tests in test_hdf.py do not pass:

  1. those that need to read from s3 - are those links still alive?
  2. string-related tests, except 'leave', all complain with the error below - has this code been adapted for zarr 3?
ValueError: Zarr data type resolution from object failed. Attempted to resolve a zarr data type from a numpy "Object" data type, which is ambiguous, as multiple zarr data types can be represented by the numpy "Object" data type. In this case you should construct your array by providing a specific Zarr data type. For a list of Zarr data types that are compatible with the numpy "Object" data type, see https://github.com/zarr-developers/zarr-python/issues/3117

@martindurant (Member)

For 1., I think these are still working, see #576. That PR also did a little cleanup regarding the HDF filters, and now the only issue is with codecs defined in kerchunk itself. Specifically, zarr checks, in the case of strings, whether one of a hard-coded set of filters is in use and, if not, errors. Of course, we have our own encoding, so I don't think there's anything we can do from our end. Maybe this should be raised as a new issue against zarr. The only real alternative would be to either embed or drop all strings...

@victor-zou (Contributor Author)

> For 1., I think these are still working, see #576. That PR also did a little cleanup regarding the HDF filters, and now the only issue is with codecs defined in kerchunk itself. Specifically, zarr checks, in the case of strings, whether one of a hard-coded set of filters is in use and, if not, errors. Of course, we have our own encoding, so I don't think there's anything we can do from our end. Maybe this should be raised as a new issue against zarr. The only real alternative would be to either embed or drop all strings...

Yeah, thank you for the reply; it seems the fixes in #576 solve the ValueError.
Coming back to this PR, any comments on the recent update?

@martindurant (Member)

There's quite a lot there! As far as I can tell, it all looks exceedingly good and in order. Are you happy with the current state?

@victor-zou (Contributor Author)

Tests / build (312) (pull_request): Cancelled after 2m

@martindurant It complains about missing hdf5plugin, which, though not imported explicitly, is needed by the routine that reads small chunks with unsupported compression such as lz4.

hdf.py's dependency on it is optional, so, in that file, I added a try/except around the import.
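Roughly like this (a sketch of the guard, not the exact lines):

    try:
        import hdf5plugin  # importing registers extra HDF5 filters (lz4, zstd, ...)
    except ImportError:
        hdf5plugin = None  # optional dependency; handled at read time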

And for test_hdf.py, is there any way to add the dependency on hdf5plugin only for this test file instead of in the requires.txt? Or should I also add a try/except in test_hdf.py, catching the bizarre OSError raised by h5py for those lz4-compressed files when hdf5plugin is not imported?

@martindurant (Member)

Yes, I think importorskip is appropriate in the test, but add it to the ci env file anyway. In the code itself, we should be able to cope with the module missing as you say.
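e.g. at the top of test_hdf.py, something like (sketch only):

    import pytest

    # skips the whole test module when the optional plugin is absent
    hdf5plugin = pytest.importorskip("hdf5plugin")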

@martindurant (Member)

ping @victor-zou - do you have what you need to finish here?

@victor-zou (Contributor Author)

> ping @victor-zou - do you have what you need to finish here?

Thanks @martindurant. What I need is all finished. Is there anything I need to do before the PR gets merged, such as modifying the yaml in the ci directory?
Even with hdf5plugin installed, the following string-related tests still do not work; do I need to comment them out?

FAILED tests/test_hdf.py::test_string_embed - OSError: pytest: reading from stdin while output is captured! Consider using -s.
FAILED tests/test_hdf.py::test_string_pathlib - KeyError: 'vlen_str/0'
FAILED tests/test_hdf.py::test_string_null - KeyError: 'vlen_str'
FAILED tests/test_hdf.py::test_string_decode - FileNotFoundError: vlen_str/.zarray
FAILED tests/test_hdf.py::test_compound_string_null - KeyError: 'vlen_str'
FAILED tests/test_hdf.py::test_compound_string_encode - ValueError: Zarr data type resolution from object failed. Attempted to resolve a zarr data type from a numpy "Object" data type, which is ambiguous, as multiple zarr data types can be represented by the numpy "Object" data type. In this case you should construct your array by p

@martindurant (Member)

Is there a way to detect cases where hdf5plugin would have solved an issue, i.e. where we are seeing an exception only because it is absent?

@martindurant (Member)

(and yes, I think we will have to live with the broken string/compound test errors for the time being, but don't remove/comment them)

@victor-zou (Contributor Author)

> Is there a way to detect cases where hdf5plugin would have solved an issue, i.e. where we are seeing an exception only because it is absent?

hdf5plugin is needed when reading content directly, like dset[:], which might need the filters registered by hdf5plugin. This only happens when meeting features not previously supported by kerchunk.hdf.

As there are numerous kinds of HDF5 plugins, it is not certain that an h5py read will succeed even with hdf5plugin present; nor is it certain that a read failure results from the absence of hdf5plugin. How about, when a read fails and hdf5plugin is missing, issuing a warning and re-raising the exception?
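Something like this is what I have in mind (a sketch only, with an illustrative function name; it assumes the optional-import guard above, so that hdf5plugin is None when the package is missing):

    import warnings

    def _read_direct(dset):
        try:
            return dset[:]
        except Exception:
            if hdf5plugin is None:
                warnings.warn(
                    "Reading this dataset failed and hdf5plugin is not installed; "
                    "a missing HDF5 filter plugin may be the cause."
                )
            raise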

@martindurant (Member)

> How about, when a read fails and hdf5plugin is missing, issuing a warning and re-raising the exception?

Yes, that sounds good.

@martindurant (Member)

@victor-zou , please check if eddb0ff is what you had in mind.

@victor-zou (Contributor Author)

> @victor-zou , please check if eddb0ff is what you had in mind.

Thank you for the implementation; I made a few changes:

  1. Catch all exceptions instead of only ImportError. Missing filters registered by hdf5plugin probably throw OSError (maybe from the errno set in the C implementation?), but Google results also mention ValueError and RuntimeError, so I just catch them all.
  2. Since all exceptions are caught, I limit where read_warn is used, to reduce false warnings: it is only used when meeting unsupported features. For dset[:] calls that do not need registered filters, I changed them back. In the same spirit, I renamed it to _read_unsupported_direct.
  3. Add a use case in _storage_info that reads chunks with an inconsistent filter_mask.

kerchunk/hdf.py (Outdated)
    # bizarre error message in my test without `hdf5plugin` imported.
    # Just simply catch all exceptions, as we will rethrow it anyway.
    except Exception as e:
        if hdf5plugin:
Member:

The warning should be when hdf5plugin is None, no?

Contributor Author:

Yeah, sorry, I had "hdf5plugin is None" in mind but missed the rest in the code.

@martindurant (Member)

I think everything is ready?

martindurant changed the title from "fix(hdf): chucks may skip filters when small" to "fix(hdf): chunks may skip filters when small" on Oct 2, 2025
@victor-zou (Contributor Author)

> I think everything is ready?

@martindurant Yeah, everything is ready. Also, changing "needed" to "extra" does make the warning message more accurate.

martindurant merged commit 80e0701 into fsspec:main on Oct 2, 2025
2 of 4 checks passed
@martindurant (Member)

+1

Successfully merging this pull request may close these issues.

Bug: hdf5 may skip filters for small chunks and store their raw binary
