Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add IpcSchemaEncoder, deprecate ipc schema functions, Fix IPC not respecting not preserving dict ID #6444

Merged
merged 7 commits into from
Sep 25, 2024

Conversation

brancz
Copy link
Contributor

@brancz brancz commented Sep 23, 2024

Which issue does this PR close?

Closes #6443

Rationale for this change

It fixes a bug.

What changes are included in this PR?

This decouples dictionary IDs that end up in IPC from the schema further
because the dictionary tracker always first gathers the dict ID for each
field whether it is pre-defined and preserved or not.

Then when actually writing the IPC bytes the dictionary ID is always
taken from the dictionary tracker as opposed to falling back to the
Field of the Schema.

On top of that, when dictionary IDs are not preserved, then they are assigned depth
first, however, when reading them from the dictionary tracker to write
the IPC bytes, they were previously read from the dictionary tracker in
the order that the schema is traversed, which
caused an incorrect order of dictionaries serialized in IPC.

Are there any user-facing changes?

No API changes, just bug fixes.

@alamb @tustvold

@github-actions github-actions bot added parquet Changes to the parquet crate arrow Changes to the arrow crate arrow-flight Changes to the arrow-flight crate labels Sep 23, 2024
@brancz brancz force-pushed the ipc-set-dict-id branch 2 times, most recently from 416f753 to 6cff848 Compare September 23, 2024 17:17
@brancz
Copy link
Contributor Author

brancz commented Sep 23, 2024

Sorry for all the force pushing, all linters and formatting should be happy now.

Copy link
Contributor

@thinkharderdev thinkharderdev left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

very nice. thanks!

@alamb alamb added the api-change Changes to the arrow API label Sep 24, 2024
Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you @brancz and @thinkharderdev

I think as written this PR can't be merged until we have the next major release (e.g. in December 2024): https://github.com/apache/arrow-rs/blob/master/CONTRIBUTING.md#breaking-changes

However, I have a proposal for getting it in sooner

Specifically, I suggest we

  1. Create a new struct for serializing schemas (perhaps IpcSchemaConverter?) that does the right thing with respect to schema dictionary ids
  2. deprecate (but don't change the signatures of schema_to_fb and schema_to_fb_offset)
  3. Update code in this repo to use the new struct

This would have the following benefits:

  1. It would avoid a breaking API change and get the fix in sooner
  2. It would be a nicer API in general than having to call several independent functions
  3. It woudl be easier to change / extend in the future

@@ -62,12 +66,13 @@ pub fn metadata_to_fb<'a>(

pub fn schema_to_fb_offset<'a>(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

use crate::{size_prefixed_root_as_message, KeyValue, Message, CONTINUATION_MARKER};
use DataType::*;

/// Serialize a schema in IPC format
pub fn schema_to_fb(schema: &Schema) -> FlatBufferBuilder {
pub fn schema_to_fb<'a>(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unfortunately, this appears to be a public API https://docs.rs/arrow-ipc/latest/arrow_ipc/convert/fn.schema_to_fb.html

And thus if we make this change it is a breaking API change

@alamb alamb added the next-major-release the PR has API changes and it waiting on the next major version label Sep 24, 2024
@brancz
Copy link
Contributor Author

brancz commented Sep 24, 2024

Makes sense! I wasn't paying attention to that and just blindly assumed that because it was such low level functionality that it was not a public API. I'll rework it the way you suggested!

@brancz brancz force-pushed the ipc-set-dict-id branch 3 times, most recently from 9fe8971 to bd52cd6 Compare September 24, 2024 17:51
@brancz
Copy link
Contributor Author

brancz commented Sep 24, 2024

Done! I don't love the schema_to_bytes_with_dictionary_tracker function name, but I think this is the best we can do without changing the API, and then for the release where we can change the API we remove schema_to_bytes_with_dictionary_tracker and change the signature of schema_to_bytes.

This decouples dictionary IDs that end up in IPC from the schema further
because the dictionary tracker always first gathers the dict ID for each
field whether it is pre-defined and preserved or not.

Then when actually writing the IPC bytes the dictionary ID is always
taken from the dictionary tracker as opposed to falling back to the
`Field` of the `Schema`.
When dictionary IDs are not preserved, then they are assigned depth
first, however, when reading them from the dictionary tracker to write
the IPC bytes, they were previously read from the dictionary tracker in
the order that the schema is traversed (first come first serve), which
caused an incorrect order of dictionaries serialized in IPC.
@brancz
Copy link
Contributor Author

brancz commented Sep 24, 2024

How can I run the integration tests locally? I'm a bit confused why they are now failing when they were previously passing.

@alamb
Copy link
Contributor

alamb commented Sep 24, 2024

CI failure is unrelated: #6448

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you (again) @brancz -- this looks really nice

I spent some time playing with the docs and API and have a suggestion for refinement: polarsignals#1

Let me know what you think

Refine IpcSchemaEncoder API and docs
@brancz
Copy link
Contributor Author

brancz commented Sep 25, 2024

Those changes are perfect, thank you, merged them!

@brancz
Copy link
Contributor Author

brancz commented Sep 25, 2024

There were a few more lints to fix. I think this is now ready!

@alamb alamb changed the title Fix IPC not respecting not preserving dict ID Add IpcSchemaEncoder, deprecate ipc schema functions, Fix IPC not respecting not preserving dict ID Sep 25, 2024
Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks again @brancz and @thinkharderdev

@alamb alamb merged commit 62825b2 into apache:master Sep 25, 2024
28 of 29 checks passed
@brancz brancz deleted the ipc-set-dict-id branch September 25, 2024 16:12
@brancz
Copy link
Contributor Author

brancz commented Sep 25, 2024

Thanks so much for all the help and reviews @alamb!

@alamb alamb removed the api-change Changes to the arrow API label Oct 2, 2024
@alamb
Copy link
Contributor

alamb commented Oct 2, 2024

We updated this PR so it was not an API change so removing the label

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
arrow Changes to the arrow crate arrow-flight Changes to the arrow-flight crate next-major-release the PR has API changes and it waiting on the next major version parquet Changes to the parquet crate
Projects
None yet
Development

Successfully merging this pull request may close these issues.

IPC not respecting not preserving dict ID
3 participants