
Conversation

@victor-zou (Contributor)

FIX #573

Adds a fix and tests. A RuntimeError will be raised if chunks have a heterogeneous filter_mask.

kerchunk/hdf.py (Outdated)
    if filter_mask is None:
        filter_mask = blob.filter_mask
    elif filter_mask != blob.filter_mask:
        raise RuntimeError(f"Dataset {dset.name} has heterogeneous `filter_mask` - "
Member:

Erroring here is reasonable, and I am happy to include it.
BUT, is it not possible, in the case that there are some blobs but not all, to get the bytes representation for all the chunks and/or store the whole array in one materialised chunk? That's not ideal, but it makes the workflow possible.

Contributor Author:

> Erroring here is reasonable, and I am happy to include it. BUT, is it not possible, in the case that there are some blobs but not all, to get the bytes representation for all the chunks and/or store the whole array in one materialised chunk? That's not ideal, but it makes the workflow possible.

Yes, it is possible. Should we raise an error only when the size of the whole data is over a certain threshold (defaulting to the inline threshold, while adding another custom parameter)?

A user may set malicious chunking, like a chunk size of 10000 for data of length 10001. In this case, the trailing chunk of length 1 will probably not be compressed, and the whole array will be too big to inline.
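For reference, here is a rough sketch (illustrative only, not part of this PR; the file and dataset names are made up) of how the per-chunk filter_mask can be inspected with h5py's low-level API:

    import h5py
    import numpy as np

    # create a gzip-compressed dataset whose trailing chunk holds a single element
    with h5py.File("example.h5", "w") as f:
        f.create_dataset("x", data=np.arange(10001), chunks=(10000,), compression="gzip")

    # filter_mask != 0 for a chunk means some filter in the pipeline was skipped
    with h5py.File("example.h5", "r") as f:
        dsid = f["x"].id
        for i in range(dsid.get_num_chunks()):
            info = dsid.get_chunk_info(i)
            print(info.chunk_offset, info.filter_mask, info.size)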

Member:

Yes, I think that all makes sense. There's not too much we can do about the pathological case, but I doubt anyone will be malicious in this space!

@victor-zou (Contributor Author)

@martindurant I have a question on data inlining. In hdf.py, around line 500:

                if data is not None:
                    try:
                        za[:] = data
                    except (ValueError, TypeError):
                        self.store_dict[f"{za.path}/0"] = kwargs["filters"][0].encode(
                            data
                        )
                    return
  1. f"{za.path}/0", will it work if it is not one-dimensional?
  2. = kwargs["filters"][0].encode(, what if the dataset as 0 or more filters?

@martindurant (Member)

  1. You are right, you will need something like f"{za.path}/{'.'.join('0' * ndim)}"
  2. I think we only tested for simple filters like gzip; indeed it could be uncompressed/unencoded (len(filters) == 0) or have multiple steps.

@victor-zou (Contributor Author)

>   1. You are right, you will need something like f"{za.path}/{'.'.join('0' * ndim)}"
>   2. I think we only tested for simple filters like gzip; indeed it could be uncompressed/unencoded (len(filters) == 0) or have multiple steps.

Two more things:

  1. It saves raw bytes here, whereas it saves base64-encoded data below. Since, at the end of the day, the buffer will be encoded for JSON anyway, can I remove the base64-encoding part and leave it to that JSON-encoding step, making the implementation more consistent?
                        if (
                            self.inline
                            and isinstance(v, dict)
                            and v["size"] < self.inline
                        ):
                            self.input_file.seek(v["offset"])
                            data = self.input_file.read(v["size"])
                            try:
                                # easiest way to test if data is ascii
                                data.decode("ascii")
                            except UnicodeDecodeError:
                                data = b"base64:" + base64.b64encode(data)

                            self.store_dict[key] = data
  2. It seems that in this part, the data is inlined no matter its size?

@martindurant (Member)

Yeah, there is a separate pass that coerces things to JSON types at the end, when doing the write. Probably they should all be bytes at this point.

So, inlining will happen per chunk based on size; but if there are types that need inlining, like the issue you are fixing, then there is no choice.
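Very roughly, that final pass does something like this (a hypothetical helper for illustration, not kerchunk's actual function):

    import base64

    def _json_ready(value):
        # bytes become JSON-safe strings: plain ascii where possible,
        # otherwise base64 with the "base64:" prefix used for inlined chunks
        if isinstance(value, bytes):
            try:
                return value.decode("ascii")
            except UnicodeDecodeError:
                return "base64:" + base64.b64encode(value).decode()
        return value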

@victor-zou (Contributor Author)

@martindurant Now, the code will try to inline unsupported features, including unsupported filters and heterogeneous filter_mask. For a heterogeneous filter_mask, only the chunks whose filter_mask differs are inlined.

More test h5 files and test cases have been added.

The following tests in test_hdf.py do not pass:

  1. those that need to read from s3 - are those links still alive?
  2. string-related tests, except 'leave', all complain with the error below - has this code been adapted for zarr 3?
ValueError: Zarr data type resolution from object failed. Attempted to resolve a zarr data type from a numpy "Object" data type, which is ambiguous, as multiple zarr data types can be represented by the numpy "Object" data type. In this case you should construct your array by providing a specific Zarr data type. For a list of Zarr data types that are compatible with the numpy "Object" data type, see https://github.com/zarr-developers/zarr-python/issues/3117

@martindurant (Member)

For 1., I think these are still working, see #576. That PR also did a little cleanup regarding the HDF filters, and now the only issue is with codecs defined in kerchunk itself. Specifically, zarr checks, in the case of strings, whether one of a hard-coded set of filters is in use and, if not, errors. Of course, we have our own encoding, so I don't think there's anything we can do from our end. Maybe this should be raised as a new issue against zarr. The only real alternative would be to either embed or drop all strings...

@victor-zou (Contributor Author)

> For 1., I think these are still working, see #576. That PR also did a little cleanup regarding the HDF filters, and now the only issue is with codecs defined in kerchunk itself. Specifically, zarr checks, in the case of strings, whether one of a hard-coded set of filters is in use and, if not, errors. Of course, we have our own encoding, so I don't think there's anything we can do from our end. Maybe this should be raised as a new issue against zarr. The only real alternative would be to either embed or drop all strings...

Yeah, thank you for the reply; it seems the fixes in #576 solve the ValueError.
Coming back to this PR, any comments on the recent update?

@martindurant (Member)

There's quite a lot there! As far as I can tell, it all looks exceedingly good and in order. Are you happy with the current state?

@victor-zou (Contributor Author)

Tests / build (312) (pull_request): Cancelled after 2m

@martindurant It complains about missing hdf5plugin, which, though not imported explicitly, is needed by the routine that reads small chunks with unsupported compression such as lz4.

hdf.py's dependency on it is optional, so, in that file, I added a try/except around the import.
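Roughly like this (a sketch of the guard, not the exact lines):

    try:
        import hdf5plugin  # importing registers extra HDF5 filters (lz4, zstd, ...)
    except ImportError:
        hdf5plugin = None  # optional dependency; handled at read time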

And for test_hdf.py, is there any way to add the dependency on hdf5plugin only for this test file instead of in the requires.txt? Or should I also add a try/except in test_hdf.py, catching the bizarre OSError raised by h5py for those lz4-compressed files when hdf5plugin is not imported?

@martindurant (Member)

Yes, I think importorskip is appropriate in the test, but add it to the ci env file anyway. In the code itself, we should be able to cope with the module missing as you say.
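e.g. at the top of test_hdf.py, something like (sketch only):

    import pytest

    # skips the whole test module when the optional plugin is absent
    hdf5plugin = pytest.importorskip("hdf5plugin")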

@martindurant (Member)

ping @victor-zou - do you have what you need to finish here?

@victor-zou (Contributor Author)

> ping @victor-zou - do you have what you need to finish here?

Thanks @martindurant. What I need is all finished. Is there anything I need to do before the PR gets merged, such as modifying the yaml in the ci directory?
Even with hdf5plugin installed, the following string-related tests still do not work; do I need to comment them out?

FAILED tests/test_hdf.py::test_string_embed - OSError: pytest: reading from stdin while output is captured! Consider using -s.
FAILED tests/test_hdf.py::test_string_pathlib - KeyError: 'vlen_str/0'
FAILED tests/test_hdf.py::test_string_null - KeyError: 'vlen_str'
FAILED tests/test_hdf.py::test_string_decode - FileNotFoundError: vlen_str/.zarray
FAILED tests/test_hdf.py::test_compound_string_null - KeyError: 'vlen_str'
FAILED tests/test_hdf.py::test_compound_string_encode - ValueError: Zarr data type resolution from object failed. Attempted to resolve a zarr data type from a numpy "Object" data type, which is ambiguous, as multiple zarr data types can be represented by the numpy "Object" data type. In this case you should construct your array by p

@martindurant (Member)

Is there a way to detect cases where hdf5plugin would have solved an issue, i.e. where we are seeing an exception only because it is absent?

@martindurant (Member)

(and yes, I think we will have to live with the broken string/compound test errors for the time being, but don't remove/comment them)

@victor-zou (Contributor Author)

> Is there a way to detect cases where hdf5plugin would have solved an issue, i.e. where we are seeing an exception only because it is absent?

hdf5plugin is needed when reading content directly, like dset[:], which might need the filters registered by hdf5plugin. This only happens when meeting features not previously supported by kerchunk.hdf.

As there are numerous kinds of HDF5 plugins, it is not certain that an h5py read will succeed even with hdf5plugin present; nor is it certain that a read failure results from the absence of hdf5plugin. How about, when a read fails and hdf5plugin is missing, issuing a warning and re-raising the exception?
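Something like this is what I have in mind (a sketch only, with an illustrative function name; it assumes the optional-import guard above, so that hdf5plugin is None when the package is missing):

    import warnings

    def _read_direct(dset):
        try:
            return dset[:]
        except Exception:
            if hdf5plugin is None:
                warnings.warn(
                    "Reading this dataset failed and hdf5plugin is not installed; "
                    "a missing HDF5 filter plugin may be the cause."
                )
            raise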

@martindurant (Member)

> How about, when a read fails and hdf5plugin is missing, issuing a warning and re-raising the exception?

Yes, that sounds good.

@martindurant (Member)

@victor-zou , please check if eddb0ff is what you had in mind.

@victor-zou (Contributor Author)

> @victor-zou , please check if eddb0ff is what you had in mind.

Thank you for the implementation; I made a few changes:

  1. Catch all exceptions instead of only ImportError. Missing filters registered by hdf5plugin probably throw OSError (maybe from the errno set in the C implementation?), but Google results also mention ValueError and RuntimeError, so I just catch them all.
  2. Since all exceptions are caught, I limit where read_warn is used, to reduce false warnings: it is only used when meeting unsupported features. For dset[:] calls that do not need registered filters, I changed them back. In the same spirit, I renamed it to _read_unsupported_direct.
  3. Add a use case in _storage_info that reads chunks with an inconsistent filter_mask.

kerchunk/hdf.py (Outdated)
    # bizarre error message in my test without `hdf5plugin` imported.
    # Just simply catch all exceptions, as we will rethrow it anyway.
    except Exception as e:
        if hdf5plugin:
Member:

The warning should be when hdf5plugin is None, no?

Contributor Author:

Yeah, sorry, I had "hdf5plugin is None" in mind but missed the rest in the code.

@martindurant (Member)

I think everything is ready?

martindurant changed the title from "fix(hdf): chucks may skip filters when small" to "fix(hdf): chunks may skip filters when small" on Oct 2, 2025
@victor-zou (Contributor Author)

> I think everything is ready?

@martindurant Yeah, everything is ready. Also, changing "needed" to "extra" does make the warning message more accurate.

martindurant merged commit 80e0701 into fsspec:main on Oct 2, 2025
2 of 4 checks passed
@martindurant (Member)

+1

Successfully merging this pull request may close these issues.

Bug: hdf5 may skip filters for small chunks and store their raw binary
