Implement hex decoding of JSON strings to binary arrays #8737
The `writer::encoder::BinaryEncoder` encodes binary arrays as hex-encoded JSON strings. This commit adds support for decoding these strings again.
    fn decode_hex_string(hex_string: &str) -> Result<Vec<u8>, ArrowError> {
        let mut decoded = Vec::with_capacity(hex_string.len() / 2);
        for substr in hex_string.as_bytes().chunks(2) {
            let str = std::str::from_utf8(substr).map_err(|e| {
I am pretty sure this code could be made much faster with a custom lookup table rather than using `u8::from_str_radix` etc.
That being said, that would be a nice thing to improve in a future PR.
Probably, also because we can make stronger assumptions than the requirements of from_str_radix.
I don't have time to look into this right now though, so maybe we can leave this for a future PR.
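For reference, the lookup-table idea discussed above could look roughly like the sketch below. It is not code from this PR: the names `HEX_LUT`, `build_lut`, and `decode_hex_lut` are hypothetical, the error type is a plain `String` rather than `ArrowError` so the snippet stays self-contained, and rejecting odd-length input is an assumption about the desired behavior.

```rust
/// Sentinel stored in the table for bytes that are not hex digits.
const INVALID: u8 = 0xFF;

/// 256-entry table mapping an ASCII byte to its 4-bit value (hypothetical name).
const HEX_LUT: [u8; 256] = build_lut();

/// Builds the table at compile time.
const fn build_lut() -> [u8; 256] {
    let mut table = [INVALID; 256];
    let mut i = 0;
    while i < 256 {
        let b = i as u8;
        table[i] = match b {
            b'0'..=b'9' => b - b'0',
            b'a'..=b'f' => b - b'a' + 10,
            b'A'..=b'F' => b - b'A' + 10,
            _ => INVALID,
        };
        i += 1;
    }
    table
}

/// Decodes a hex string with two table lookups per output byte.
/// Assumption: odd-length input is treated as an error.
pub fn decode_hex_lut(hex: &str) -> Result<Vec<u8>, String> {
    let bytes = hex.as_bytes();
    if bytes.len() % 2 != 0 {
        return Err(format!("odd-length hex string ({} bytes)", bytes.len()));
    }
    let mut out = Vec::with_capacity(bytes.len() / 2);
    for pair in bytes.chunks_exact(2) {
        let hi = HEX_LUT[pair[0] as usize];
        let lo = HEX_LUT[pair[1] as usize];
        if hi == INVALID || lo == INVALID {
            return Err(format!("invalid hex digit in {:?}", hex));
        }
        out.push((hi << 4) | lo);
    }
    Ok(out)
}
```

Working directly on bytes avoids the per-pair UTF-8 validation and integer parsing in the current loop. Whether it is actually faster in practice would need a benchmark.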
Which issue does this PR close?
`arrow-json` supports encoding binary arrays, but not decoding #8736
Rationale for this change
See linked issue.
What changes are included in this PR?
Add JSON decoders for binary array variants that act as counterparts to #5622. This way, it becomes possible to do a full round-trip encoding/decoding of binary arrays.
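To illustrate the round-trip property in isolation, here is a minimal standard-library sketch. It does not use the `arrow-json` API: `encode_hex`, `decode_hex`, and `roundtrip` are hypothetical helpers, lowercase hex output is an assumption, and `None` stands in for a JSON `null`.

```rust
use std::fmt::Write;

/// Encodes bytes as a hex string (lowercase is an assumption of this sketch).
fn encode_hex(bytes: &[u8]) -> String {
    let mut s = String::with_capacity(bytes.len() * 2);
    for b in bytes {
        write!(s, "{b:02x}").expect("writing to a String cannot fail");
    }
    s
}

/// Decodes a hex string back into bytes.
fn decode_hex(hex: &str) -> Result<Vec<u8>, String> {
    // Check the digits up front: `from_str_radix` also accepts a leading '+',
    // which must not be treated as valid hex here.
    if hex.len() % 2 != 0 || !hex.bytes().all(|b| b.is_ascii_hexdigit()) {
        return Err(format!("invalid hex string: {hex:?}"));
    }
    (0..hex.len())
        .step_by(2)
        .map(|i| u8::from_str_radix(&hex[i..i + 2], 16).map_err(|e| e.to_string()))
        .collect()
}

/// Encodes each optional value and decodes it again; `None` models a null.
fn roundtrip(values: &[Option<Vec<u8>>]) -> Result<Vec<Option<Vec<u8>>>, String> {
    values
        .iter()
        .map(|v| match v {
            Some(bytes) => decode_hex(&encode_hex(bytes)).map(Some),
            None => Ok(None),
        })
        .collect()
}
```

For any input, `roundtrip` is expected to return a value equal to that input, which is the same property the roundtrip test in this PR checks for the Arrow array types.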
Are these changes tested?
I added a roundtrip test based on `test_writer_binary`. It verifies that encoding and then decoding leads to the original input again. It covers `Binary`, `LargeBinary`, `FixedSizeBinary`, and `BinaryView` arrays, all with and without explicit nulls.
Are there any user-facing changes?
Yes, encoding and decoding binary arrays to/from JSON is now fully supported, given the right schema.
One limitation is that schema inference cannot detect binary arrays, because they look like normal JSON strings after encoding. However, the same is already true for other Arrow types; for example, it is not possible to differentiate integer bit widths.
I updated the docs accordingly.