Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Prevent FlightData overflowing max size limit whenever possible. #6690

Open
wants to merge 16 commits into
base: master
Choose a base branch
from

Conversation

itsjunetime
Copy link
Contributor

Which issue does this PR close?

Closes #3478

What changes are included in this PR?

This reworks the encoding/writing step of flight data messages to ensure it never overflows the given limit whenever possible (specifically, it's impossible when we can't even fit a single row + header within the limit - there are still no mechanisms for splitting a single row of data between multiple messages).

It does this by first constructing a fake IPC header, then getting that header's encoded length, and then subtracting that length from the provided max size. Because the header's size stays the same with the same schema (the only thing that changes is the value within the 'length of data' fields), we don't need to continually recalculate it.

There are more tests I'd like to add before merging this, I was just hoping to get this filed first so that I could get feedback in case any behavior seemed seriously off.

Rationale for these changes

Since we are dynamically checking array data sizes to see if they can fit within the alloted size, this ensures that they will never go over if possible. Of course, as I said before, they will still go over if necessary, but I've rewritten the tests to check this behavior (if the tests sense an overage, but decode it to see that only one row was written, they allow it as there is no other way to get the data across).

Are there any user-facing changes?

Yes, there are API additions. They are documented. As far as I can tell, this shouldn't require a breaking release, but I haven't run anything like cargo-semver-checks on it to actually verify.

@github-actions github-actions bot added arrow Changes to the arrow crate arrow-flight Changes to the arrow-flight crate labels Nov 5, 2024
};

match array_data.data_type() {
DataType::Dictionary(_, _) => Ok(()),
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is something that I wasn't able to figure out about this encoding process - it seems we don't write the child_data for dictionaries into the encoded message, but that's where all the values of the dictionary are. Without this, we only have the keys written. Does anyone know why this is?

@itsjunetime
Copy link
Contributor Author

It looks like CI failed due to some network flakiness - I'm going to close and reopen to try it again

@itsjunetime itsjunetime closed this Nov 8, 2024
@itsjunetime itsjunetime reopened this Nov 8, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
arrow Changes to the arrow crate arrow-flight Changes to the arrow-flight crate
Projects
None yet
1 participant