fix(hdf): chunks may skip filters when small #575
Conversation
kerchunk/hdf.py
Outdated
```python
if filter_mask is None:
    filter_mask = blob.filter_mask
elif filter_mask != blob.filter_mask:
    raise RuntimeError(f"Dataset {dset.name} has heterogeneous `filter_mask` - "
```
Erroring here is reasonable, and I am happy to include it.
BUT, is it not possible, in the case that there are some blobs but not all, to get the bytes representation for all the chunks and/or store the whole array in one materialised chunk? That's not ideal, but it makes the workflow possible.
> Erroring here is reasonable, and I am happy to include it. BUT, is it not possible, in the case that there are some blobs but not all, to get the bytes representation for all the chunks and/or store the whole array in one materialised chunk? That's not ideal, but it makes the workflow possible.
Yes, it is possible. Raise an error only when the size of the whole data is over a certain threshold (defaulting to the inline threshold, while adding another custom parameter)?

A user may set malicious chunking, e.g. a chunk size of 10000 for data of length 10001. In this case, the trailing chunk of length 1 will probably not be compressed, and the whole array will be too big to be inlined.
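The thresholded fallback being proposed can be sketched as follows. This is a minimal illustration, not kerchunk's actual API: the function name, the `inline_threshold` parameter, and passing the array directly are all assumptions for the sake of the example.

```python
import numpy as np

def materialise_or_raise(arr, filter_masks, inline_threshold=500):
    """Hypothetical sketch: if a dataset's chunks have heterogeneous
    filter_mask, materialise the whole array as one inline chunk when it
    is small enough, and only raise beyond the threshold."""
    if len(set(filter_masks)) > 1:  # heterogeneous filter_mask
        data = np.asarray(arr)
        if data.nbytes > inline_threshold:
            raise RuntimeError(
                f"heterogeneous filter_mask and {data.nbytes} bytes "
                f"exceeds inline threshold {inline_threshold}"
            )
        return data.tobytes()  # store as a single materialised chunk
    return None  # homogeneous: reference chunks as usual
```

In the pathological case above (one tiny uncompressed trailing chunk on a large dataset), this would raise rather than inline the whole array.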
Yes, I think that all makes sense. There's not too much we can do about the pathological case, but I doubt anyone will be malicious in this space!
@martindurant I have a question on data inlining. In hdf.py, around line 500:

```python
if data is not None:
    try:
        za[:] = data
    except (ValueError, TypeError):
        self.store_dict[f"{za.path}/0"] = kwargs["filters"][0].encode(
            data
        )
    return
```
Two more things:

```python
if (
    self.inline
    and isinstance(v, dict)
    and v["size"] < self.inline
):
    self.input_file.seek(v["offset"])
    data = self.input_file.read(v["size"])
    try:
        # easiest way to test if data is ascii
        data.decode("ascii")
    except UnicodeDecodeError:
        data = b"base64:" + base64.b64encode(data)
    self.store_dict[key] = data
```
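For context, a reader of such inline values reverses this check: plain ASCII bytes are used as-is, while binary data carries a `base64:` prefix. A minimal sketch (the function name is illustrative, not from kerchunk):

```python
import base64

def decode_inline(value: bytes) -> bytes:
    # Inline chunks are stored either as plain ASCII bytes or, when not
    # ASCII-safe, base64-encoded with a "base64:" prefix.
    if value.startswith(b"base64:"):
        return base64.b64decode(value[len(b"base64:"):])
    return value
```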
Yeah, there is a separate pass that coerces things to JSON types at the end, when doing the write. Probably they should all be bytes at this point. So, inlining will happen per chunk based on size; but if there are types that need inlining, like the issue you are fixing, then there is no choice.
@martindurant Now, the code will try to inline unsupported features, including unsupported filters and heterogeneous `filter_mask`. More test h5 files and test cases are added. The following tests in test_hdf5.py do not pass:
For 1., I think these are still working, see #576. That PR also did a little cleanup regarding the HDF filters, and now the only issue is with codecs defined in kerchunk itself. Specifically, zarr checks, in the case of strings, whether one of a hard-coded set of filters is in use and, if not, errors. Of course, we have our own encoding, so I don't think there's anything we can do from our end. Maybe this should be raised as a new issue on zarr. The only real alternative would be to either embed or drop all strings...

Yeah, thank you for the reply; it seems the fixes in #576 solve the "ValueError".
There's quite a lot there! As far as I can tell, it all looks exceedingly good and in order. Are you happy with the current state?
Tests / build (312) (pull_request): Cancelled after 2m

@martindurant It complains about a missing "hdf5plugin", which, though not used explicitly, is needed by the routine that reads small chunks with unsupported compression like "lz4". hdf.py's dependency on it is optional, so, in that file, I added some try/except import lines. As for test_hdf.py, is there any way to add the dependency on hdf5plugin only for this test file instead of in requires.txt? Or should I add some try/except in test_hdf.py as well? For those lz4-compressed files, I catch the bizarre "OSError" raised by h5py when hdf5plugin is not imported.
Yes, I think importorskip is appropriate in the test, but add it to the ci env file anyway. In the code itself, we should be able to cope with the module missing as you say. |
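The optional-import pattern being discussed can be sketched like this; importing hdf5plugin registers additional HDF5 filters (e.g. lz4) with h5py as a side effect, and the fall-back binding to `None` lets later code check whether it is available:

```python
# Soft dependency: hdf5plugin registers extra HDF5 compression filters
# with h5py as an import side effect. If absent, carry on; code can
# check `hdf5plugin is None` later to tailor error messages.
try:
    import hdf5plugin  # noqa: F401
except ImportError:
    hdf5plugin = None
```

In the test file, a module-level `pytest.importorskip("hdf5plugin")` skips those tests cleanly when the package is absent, as suggested above.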
ping @victor-zou - do you have what you need to finish here?
Thanks @martindurant. What I need is all finished. Anything I need to do before the PR gets merged?

```
FAILED tests/test_hdf.py::test_string_embed - OSError: pytest: reading from stdin while output is captured! Consider using
```
Is there a way to detect where hdf5plugin would have solved an issue, when we see an exception without it?
(and yes, I think we will have to live with the broken string/compound test errors for the time being, but don't remove/comment them) |
hdf5plugin is needed when reading chunk content directly. As there are numerous kinds of h5 plugins, it is not certain that an h5py read will succeed just because hdf5plugin is present; nor is it certain that a read failure results from the absence of hdf5plugin. How about, when a read fails and hdf5plugin is missing, issuing a warning and re-throwing the exception?
Yes, that sounds good. |
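The agreed warn-and-rethrow behaviour can be sketched as below, assuming `hdf5plugin` is either the module object or `None` depending on whether an optional import succeeded; the function name and the `read` callable are illustrative, not kerchunk's API:

```python
import warnings

def read_chunk_or_hint(read, hdf5plugin=None):
    # `read` performs the raw chunk read; `hdf5plugin` is the module
    # object if installed, else None (from an optional import).
    try:
        return read()
    except Exception:
        if hdf5plugin is None:
            warnings.warn(
                "Reading a chunk failed and hdf5plugin is not installed; "
                "the chunk may use a compression filter (e.g. lz4) that "
                "hdf5plugin provides"
            )
        raise  # re-throw the original exception unchanged
```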
@victor-zou , please check if eddb0ff is what you had in mind. |
Thank you for the impl, I made a little change:
kerchunk/hdf.py
Outdated
```python
# bizarre error message in my test without `hdf5plugin` imported.
# Just simply catch all exceptions, as we will rethrow it anyway.
except Exception as e:
    if hdf5plugin:
```
The warning should be when hdf5plugin is None, no?
Yeah, sorry, I had "hdf5plugin is None" in mind but missed it in the code.
I think everything is ready?
@martindurant Yeah, everything is ready. Also, changing "needed" to "extra" does make the warning message more accurate.
+1 |
FIX #573
Add fix and tests.
A `RuntimeError` will be raised if chunks have a heterogeneous `filter_mask`.