@phil-opp phil-opp commented Oct 29, 2025

Which issue does this PR close?

Rationale for this change

See linked issue.

What changes are included in this PR?

Add JSON decoders for the binary array variants, as counterparts to the encoders added in #5622. This makes a full encode/decode round trip of binary arrays possible.

Are these changes tested?

I added a round-trip test based on test_writer_binary. It verifies that encoding and then decoding yields the original input. It covers Binary, LargeBinary, FixedSizeBinary, and BinaryView arrays, each with and without explicit nulls.
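
As a rough illustration of the property the test checks (this is not the actual test code), here is a std-only sketch that hex-encodes optional byte values and decodes them again. The helper names `encode_hex`, `decode_hex`, and `roundtrip` are hypothetical, lowercase hex output is an assumption, and `String` stands in for the crate's real error type.

```rust
use std::fmt::Write;

/// Hex-encode bytes as two digits per byte (lowercase is an assumption here).
fn encode_hex(bytes: &[u8]) -> String {
    let mut out = String::with_capacity(bytes.len() * 2);
    for b in bytes {
        write!(out, "{:02x}", b).expect("writing to a String cannot fail");
    }
    out
}

/// Decode a hex string into bytes. Odd lengths and non-hex characters are
/// rejected up front, because `u8::from_str_radix` alone would accept
/// inputs such as "+f".
fn decode_hex(s: &str) -> Result<Vec<u8>, String> {
    if s.len() % 2 != 0 || !s.bytes().all(|b| b.is_ascii_hexdigit()) {
        return Err(format!("not a valid hex string: {:?}", s));
    }
    s.as_bytes()
        .chunks(2)
        .map(|pair| {
            let digits = std::str::from_utf8(pair).map_err(|e| e.to_string())?;
            u8::from_str_radix(digits, 16).map_err(|e| e.to_string())
        })
        .collect()
}

/// Encode every non-null value, then decode it again. `None` models a null
/// slot and passes through unchanged.
fn roundtrip(values: &[Option<Vec<u8>>]) -> Result<Vec<Option<Vec<u8>>>, String> {
    values
        .iter()
        .map(|v| {
            let encoded: Option<String> = v.as_deref().map(encode_hex);
            encoded.as_deref().map(decode_hex).transpose()
        })
        .collect()
}
```

Calling `roundtrip` on a slice that mixes `Some(...)` and `None` entries should return a vector equal to the input, which is the same invariant the real test asserts for each array type.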

Are there any user-facing changes?

Yes, encoding and decoding binary arrays to/from JSON is now fully supported, given the right schema.

One limitation is that schema inference cannot detect binary arrays, because after encoding they look like ordinary JSON strings. However, the same limitation already applies to other Arrow types: for example, integer bit widths cannot be distinguished either.
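
To make the ambiguity concrete, here is a small std-only sketch. The helpers `json_string` and `json_binary` are hypothetical stand-ins for the writer (they skip JSON escaping and assume lowercase hex), not arrow-json functions.

```rust
/// Render an ASCII string as a JSON string token.
/// No escaping is done, which is enough for this illustration only.
fn json_string(s: &str) -> String {
    format!("\"{}\"", s)
}

/// Render bytes as a hex-encoded JSON string token.
fn json_binary(bytes: &[u8]) -> String {
    let hex: String = bytes.iter().map(|b| format!("{:02x}", b)).collect();
    format!("\"{}\"", hex)
}
```

Under these assumptions, the two bytes `[0xca, 0xfe]` and the four-character string `cafe` produce the identical token `"cafe"`, so a reader that only sees the JSON cannot tell which type was written without a schema.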

I updated the docs accordingly.

The `writer::encoder::BinaryEncoder` encodes binary arrays as hex-encoded JSON strings. This commit adds support for decoding these strings again.

@github-actions bot added the arrow (Changes to the arrow crate) label on Oct 29, 2025

@alamb alamb left a comment

Thank you @phil-opp -- this is a very nice contribution

This PR was a pleasure to review, as it is well documented, well tested, and well coded.

I left a few suggestions to potentially improve the code, but I don't think any of them are required.

cc @hiltontj (who added #5622)

fn decode_hex_string(hex_string: &str) -> Result<Vec<u8>, ArrowError> {
    let mut decoded = Vec::with_capacity(hex_string.len() / 2);
    for substr in hex_string.as_bytes().chunks(2) {
        let str = std::str::from_utf8(substr).map_err(|e| {

I am pretty sure this code could be made much faster with a custom lookup table rather than using u8::from_str_radix etc.

That being said, that would be a nice thing to improve in a future PR

Contributor Author

Probably, in part because we can make stronger assumptions about the input than from_str_radix can.

I don't have time to look into this right now though, so maybe we can leave this for a future PR.
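
For reference, the lookup-table idea discussed in this thread might look roughly like the sketch below. It is not code from this PR: the names `HEX_LUT`, `INVALID`, and `decode_hex_lut` are hypothetical, `String` stands in for `ArrowError`, and whether it is actually faster than `u8::from_str_radix` would need benchmarking.

```rust
/// Sentinel for bytes that are not ASCII hex digits.
const INVALID: u8 = 0xff;

/// 256-entry table mapping an ASCII byte to its hex value (0..=15),
/// built at compile time.
const HEX_LUT: [u8; 256] = {
    let mut table = [INVALID; 256];
    let mut i = 0;
    while i < 10 {
        table[b'0' as usize + i] = i as u8;
        i += 1;
    }
    let mut j = 0;
    while j < 6 {
        table[b'a' as usize + j] = 10 + j as u8;
        table[b'A' as usize + j] = 10 + j as u8;
        j += 1;
    }
    table
};

/// Decode a hex string by looking up each digit in `HEX_LUT`.
/// Odd-length input and any non-hex byte (including '+') are errors.
fn decode_hex_lut(hex: &str) -> Result<Vec<u8>, String> {
    let bytes = hex.as_bytes();
    if bytes.len() % 2 != 0 {
        return Err(format!("hex string has odd length {}", bytes.len()));
    }
    let mut out = Vec::with_capacity(bytes.len() / 2);
    for pair in bytes.chunks_exact(2) {
        let hi = HEX_LUT[pair[0] as usize];
        let lo = HEX_LUT[pair[1] as usize];
        if hi == INVALID || lo == INVALID {
            return Err(format!("invalid hex digit in {:?}", hex));
        }
        // Each lookup result fits in 4 bits, so the shift cannot lose data.
        out.push((hi << 4) | lo);
    }
    Ok(out)
}
```

The table works on raw bytes, so the sketch needs no per-chunk UTF-8 validation and no sign handling, which is one example of the stronger assumptions mentioned above.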

Development

Successfully merging this pull request may close these issues.

arrow-json supports encoding binary arrays, but not decoding
