Add `IpcSchemaEncoder`, deprecate ipc schema functions, Fix IPC not respecting not preserving dict ID #6444

brancz · 2024-09-23T17:06:35Z

Which issue does this PR close?

Rationale for this change

It fixes a bug.

What changes are included in this PR?

This decouples dictionary IDs that end up in IPC from the schema further
because the dictionary tracker always first gathers the dict ID for each
field whether it is pre-defined and preserved or not.

Then when actually writing the IPC bytes the dictionary ID is always
taken from the dictionary tracker as opposed to falling back to the
Field of the Schema.
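To make the "tracker is the single source of dictionary IDs" idea concrete, here is a minimal sketch using only the standard library. `ToyTracker`, `set_dict_id`, and `ids_to_write` are hypothetical stand-ins invented for illustration; they are not the arrow-ipc types or signatures. The sketch assumes the tracker records one ID per dictionary field, either the field's declared ID (preserved) or the next sequential one (not preserved), and that the writer only ever consults the tracker.

```rust
/// Toy stand-in for a dictionary tracker (hypothetical, not the arrow-ipc type).
struct ToyTracker {
    /// When true, keep the ID declared on the field; otherwise assign a new one.
    preserve_dict_id: bool,
    /// IDs in the order they were gathered.
    dict_ids: Vec<i64>,
}

impl ToyTracker {
    fn new(preserve_dict_id: bool) -> Self {
        Self { preserve_dict_id, dict_ids: Vec::new() }
    }

    /// Record the ID for one dictionary field and return it.
    /// This runs for every field, whether or not IDs are preserved.
    fn set_dict_id(&mut self, declared_id: i64) -> i64 {
        let id = if self.preserve_dict_id {
            declared_id
        } else {
            // Next sequential ID: 0 for the first field, then last + 1.
            self.dict_ids.last().map_or(0, |last| last + 1)
        };
        self.dict_ids.push(id);
        id
    }
}

/// The writer takes IDs from the tracker only, never from the field itself.
fn ids_to_write(tracker: &ToyTracker) -> &[i64] {
    &tracker.dict_ids
}
```

With two fields that both declare ID 10, a preserving tracker records `10, 10`, while a non-preserving tracker records `0, 1`; in both cases the writer sees exactly what the tracker gathered.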

On top of that, when dictionary IDs are not preserved, they are assigned
depth first. However, when writing the IPC bytes, they were previously
read from the dictionary tracker in the order in which the schema is
traversed, which caused dictionaries to be serialized in IPC in the
wrong order.
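The ordering problem can be illustrated with a toy model. Everything below is a hypothetical sketch (`ToyField`, `assign_depth_first`, `write_by_position`, `write_by_lookup` are invented names), assuming "depth first" means nested dictionaries receive their IDs before their parent, while schema traversal visits the parent first. Pairing the two sequences by position then attaches the wrong ID to each field; looking the ID up per field does not.

```rust
/// Toy schema field (hypothetical, not the arrow `Field` type).
struct ToyField {
    name: &'static str,
    is_dict: bool,
    children: Vec<ToyField>,
}

/// Assign sequential IDs depth first: children before their parent.
fn assign_depth_first(field: &ToyField, ids: &mut Vec<(&'static str, i64)>) {
    for child in &field.children {
        assign_depth_first(child, ids);
    }
    if field.is_dict {
        let id = ids.len() as i64;
        ids.push((field.name, id));
    }
}

/// Collect dictionary fields in schema traversal order: parent before children.
fn schema_order(field: &ToyField, names: &mut Vec<&'static str>) {
    if field.is_dict {
        names.push(field.name);
    }
    for child in &field.children {
        schema_order(child, names);
    }
}

/// Models the bug: consume the assigned IDs positionally while walking the schema.
fn write_by_position(root: &ToyField) -> Vec<(&'static str, i64)> {
    let mut ids = Vec::new();
    assign_depth_first(root, &mut ids);
    let mut names = Vec::new();
    schema_order(root, &mut names);
    names.into_iter().zip(ids.iter().map(|(_, id)| *id)).collect()
}

/// Models the fix: for each field, use the ID that was assigned to that field.
fn write_by_lookup(root: &ToyField) -> Vec<(&'static str, i64)> {
    let mut ids = Vec::new();
    assign_depth_first(root, &mut ids);
    let mut names = Vec::new();
    schema_order(root, &mut names);
    names
        .into_iter()
        .map(|name| {
            let id = ids
                .iter()
                .find(|(n, _)| *n == name)
                .map(|(_, id)| *id)
                .expect("every dictionary field was assigned an ID");
            (name, id)
        })
        .collect()
}
```

For a dictionary field `outer` containing a dictionary field `inner`, assignment yields `inner = 0, outer = 1`. The positional variant writes `outer` with ID 0, which is the mismatch; the lookup variant writes `outer` with ID 1.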

Are there any user-facing changes?

No API changes, just bug fixes.

@alamb @tustvold

brancz · 2024-09-23T17:37:51Z

Sorry for all the force pushing, all linters and formatting should be happy now.

thinkharderdev

very nice. thanks!

alamb

Thank you @brancz and @thinkharderdev

I think as written this PR can't be merged until we have the next major release (e.g. in December 2024): https://github.com/apache/arrow-rs/blob/master/CONTRIBUTING.md#breaking-changes

However, I have a proposal for getting it in sooner

Specifically, I suggest we

1. Create a new struct for serializing schemas (perhaps `IpcSchemaConverter`?) that does the right thing with respect to schema dictionary IDs
2. Deprecate (but don't change the signatures of) `schema_to_fb` and `schema_to_fb_offset`
3. Update code in this repo to use the new struct

This would have the following benefits:

- It would avoid a breaking API change and get the fix in sooner
- It would be a nicer API in general than having to call several independent functions
- It would be easier to change / extend in the future
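The shape of this proposal can be sketched with standard-library Rust only. All names here (`DictField`, `ToyTracker`, `ToySchemaEncoder`, `toy_schema_to_ids`) are hypothetical and chosen for illustration; they do not reproduce the arrow-ipc API. The sketch assumes the old free function keeps its signature, is marked `#[deprecated]`, and delegates to a new builder-style struct that can optionally be given a tracker.

```rust
/// One dictionary-encoded field with the ID declared on the schema (toy type).
pub struct DictField {
    pub name: &'static str,
    pub declared_id: i64,
}

/// Toy tracker that hands out sequential IDs (hypothetical).
#[derive(Default)]
pub struct ToyTracker {
    next_id: i64,
}

impl ToyTracker {
    fn assign(&mut self) -> i64 {
        let id = self.next_id;
        self.next_id += 1;
        id
    }
}

/// New entry point: a struct that can grow options without breaking callers.
pub struct ToySchemaEncoder<'a> {
    tracker: Option<&'a mut ToyTracker>,
}

impl<'a> ToySchemaEncoder<'a> {
    pub fn new() -> Self {
        Self { tracker: None }
    }

    /// Opt in to tracker-assigned IDs.
    pub fn with_tracker(mut self, tracker: &'a mut ToyTracker) -> Self {
        self.tracker = Some(tracker);
        self
    }

    /// Return (field name, dictionary ID) pairs as they would be written.
    pub fn encode(&mut self, fields: &[DictField]) -> Vec<(&'static str, i64)> {
        let mut out = Vec::with_capacity(fields.len());
        for field in fields {
            let id = match self.tracker.as_mut() {
                Some(tracker) => tracker.assign(),
                // Without a tracker, keep the old behavior: use the declared ID.
                None => field.declared_id,
            };
            out.push((field.name, id));
        }
        out
    }
}

/// Old entry point: same signature as before, now a thin deprecated wrapper.
#[deprecated(note = "use ToySchemaEncoder instead")]
pub fn toy_schema_to_ids(fields: &[DictField]) -> Vec<(&'static str, i64)> {
    ToySchemaEncoder::new().encode(fields)
}
```

Existing callers of the deprecated function keep compiling (with a warning) and get the old behavior, while code that constructs the encoder with a tracker gets the corrected IDs. Future options become new builder methods rather than new function parameters.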

alamb · 2024-09-24T15:14:47Z

arrow-ipc/src/convert.rs

@@ -62,12 +66,13 @@ pub fn metadata_to_fb<'a>(

 pub fn schema_to_fb_offset<'a>(


This is also a public API: https://docs.rs/arrow-ipc/latest/arrow_ipc/convert/fn.schema_to_fb_offset.html

alamb · 2024-09-24T15:15:28Z

arrow-ipc/src/convert.rs

 use crate::{size_prefixed_root_as_message, KeyValue, Message, CONTINUATION_MARKER};
 use DataType::*;

 /// Serialize a schema in IPC format
-pub fn schema_to_fb(schema: &Schema) -> FlatBufferBuilder {
+pub fn schema_to_fb<'a>(


Unfortunately, this appears to be a public API https://docs.rs/arrow-ipc/latest/arrow_ipc/convert/fn.schema_to_fb.html

And thus if we make this change it is a breaking API change

brancz · 2024-09-24T15:30:29Z

Makes sense! I wasn't paying attention to that and just blindly assumed that because it was such low level functionality that it was not a public API. I'll rework it the way you suggested!

brancz · 2024-09-24T17:54:45Z

Done! I don't love the schema_to_bytes_with_dictionary_tracker function name, but I think this is the best we can do without changing the API, and then for the release where we can change the API we remove schema_to_bytes_with_dictionary_tracker and change the signature of schema_to_bytes.

This decouples dictionary IDs that end up in IPC from the schema further because the dictionary tracker always first gathers the dict ID for each field whether it is pre-defined and preserved or not. Then when actually writing the IPC bytes the dictionary ID is always taken from the dictionary tracker as opposed to falling back to the `Field` of the `Schema`.

When dictionary IDs are not preserved, then they are assigned depth first, however, when reading them from the dictionary tracker to write the IPC bytes, they were previously read from the dictionary tracker in the order that the schema is traversed (first come first serve), which caused an incorrect order of dictionaries serialized in IPC.

brancz · 2024-09-24T19:02:52Z

How can I run the integration tests locally? I'm a bit confused why they are now failing when they were previously passing.

alamb · 2024-09-24T19:27:35Z

CI failure is unrelated: #6448

alamb

Thank you (again) @brancz -- this looks really nice

I spent some time playing with the docs and API and have a suggestion for refinement: polarsignals#1

Let me know what you think

Refine IpcSchemaEncoder API and docs

brancz · 2024-09-25T08:27:36Z

Those changes are perfect, thank you, merged them!

brancz · 2024-09-25T11:06:10Z

There were a few more lints to fix. I think this is now ready!

alamb

Thanks again @brancz and @thinkharderdev

brancz · 2024-09-25T16:13:25Z

Thanks so much for all the help and reviews @alamb!

alamb · 2024-10-02T18:29:43Z

We updated this PR so it was not an API change so removing the label

github-actions bot added the parquet (Changes to the parquet crate), arrow (Changes to the arrow crate), and arrow-flight (Changes to the arrow-flight crate) labels Sep 23, 2024

brancz force-pushed the ipc-set-dict-id branch 2 times, most recently from 416f753 to 6cff848 Compare September 23, 2024 17:17

brancz mentioned this pull request Sep 23, 2024

IPC not respecting not preserving dict ID #6443

Closed

brancz force-pushed the ipc-set-dict-id branch from 6cff848 to 9adaa86 Compare September 23, 2024 17:23

arrow-ipc: Add test for non preserving dict ID behavior with same ID

4edafeb

brancz force-pushed the ipc-set-dict-id branch from 9adaa86 to 09bfe61 Compare September 23, 2024 17:30

thinkharderdev approved these changes Sep 23, 2024

alamb added the api-change (Changes to the arrow API) label Sep 24, 2024

alamb reviewed Sep 24, 2024

alamb added the next-major-release (the PR has API changes and it waiting on the next major version) label Sep 24, 2024

brancz force-pushed the ipc-set-dict-id branch 3 times, most recently from 9fe8971 to bd52cd6 Compare September 24, 2024 17:51

brancz added 2 commits September 24, 2024 19:57

brancz force-pushed the ipc-set-dict-id branch from bd52cd6 to 200d1f5 Compare September 24, 2024 17:57

Refine IpcSchemaEncoder API and docs

9f2380c

alamb mentioned this pull request Sep 24, 2024

Refine IpcSchemaEncoder API and docs polarsignals/arrow-rs#1

Merged

reduce repeated code

6bc335e

alamb reviewed Sep 24, 2024

Merge pull request #1 from alamb/alamb/improve_docs_comments

4d46fcb

Refine IpcSchemaEncoder API and docs

Fix lints

41ca359

alamb changed the title ~~Fix IPC not respecting not preserving dict ID~~ Add IpcSchemaEncoder, deprecate ipc schema functions, Fix IPC not respecting not preserving dict ID Sep 25, 2024

alamb approved these changes Sep 25, 2024

alamb merged commit 62825b2 into apache:master Sep 25, 2024
28 of 29 checks passed

brancz deleted the ipc-set-dict-id branch September 25, 2024 16:12

alamb mentioned this pull request Oct 2, 2024

CI integration test failing: Archery test With other arrows #6448

Closed

alamb removed the api-change (Changes to the arrow API) label Oct 2, 2024

rmn-boiko mentioned this pull request Oct 15, 2024

Update arrow crate version with breaking changes kamu-data/kamu-cli#905

Open
