Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add ParquetError::NeedMoreData mark ParquetError as non_exhaustive #6630

Open
wants to merge 4 commits into
base: master
Choose a base branch
from

Conversation

etseidl
Copy link
Contributor

@etseidl etseidl commented Oct 25, 2024

Which issue does this PR close?

Part of #6447.

Rationale for this change

ParquetMetaDataReader currently uses ParquetError::IndexOutOfBound to indicate the need for more data. This is not the intended use for IndexOutOfBound.

What changes are included in this PR?

Adds a new enum value NeedMoreData.

Are there any user-facing changes?

Yes, this changes the ParquetError API as well as ParquetMetaDataReader.

@github-actions github-actions bot added the parquet Changes to the parquet crate label Oct 25, 2024
@tustvold tustvold added api-change Changes to the arrow API next-major-release the PR has API changes and it waiting on the next major version labels Oct 25, 2024
Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you @etseidl ❤️ -- would it be possible to try and make the error messaegs more specific too (e.g. what ranges were needed, rather than just a single number 🤔 )

@@ -119,7 +119,7 @@ impl<F: MetadataFetch> MetadataLoader<F> {
return Err(ParquetError::EOF(format!(
"file size of {} is less than footer + metadata {}",
file_size,
length + 8
length + FOOTER_SIZE
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💯

@@ -47,6 +47,8 @@ pub enum ParquetError {
IndexOutOfBound(usize, usize),
/// An external error variant
External(Box<dyn Error + Send + Sync>),
/// Returned when a function needs more data to complete properly.
NeedMoreData(usize),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we also mark ParquetError non exhaustive at the same time (so future additions won't be breaking API changes)?

Like this:

#[non_exhaustive]

@@ -47,6 +47,8 @@ pub enum ParquetError {
IndexOutOfBound(usize, usize),
/// An external error variant
External(Box<dyn Error + Send + Sync>),
/// Returned when a function needs more data to complete properly.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you also add some additional comments on how to interpret the argument. Is it the number of bytes that is needed?

Maybe we can return a Range or something more specific about what is needed if it isn't always clear

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In an earlier POC I did have two separate errors, one with a range, but we thought that a bit redundant at the time. I suppose we could add an Option<Range<usize>> to NeedMoreData to handle the case of insufficient buffer to read the page indexes. There we know the exact range we need. Right now the needed value is set to the number of bytes read from the tail of the file. If try_parse fails when reading the page indexes, then we really only need the page index range...there's no need to read the footer all over again. But then the error handling becomes more complex...if the range in NeedMoreData is None, then read the required bytes from the tail and call try_parse again; if range is not None, then read the range and call read_page_indexes. Ofc a user is free to ignore the range if they want to keep the error handling simple at the cost of more I/O (but that would likely be negligible since those bytes should still be in buffer cache).

For now I'll try to improve the documentation. Let me know if you want me to add the optional Range to NeedMoreData.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

More docs sounds good 👍

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you @etseidl -- this looks good to me

/// Ok(_) => break,
/// Err(ParquetError::NeedMoreData(needed)) => {
/// // Read the needed number of bytes from the end of the file
/// bytes = get_bytes(&file, len - needed..len);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

@alamb alamb changed the title Add ParquetError::NeedMoreData Add ParquetError::NeedMoreData mark ParquetError as non_exhaustive Oct 29, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
api-change Changes to the arrow API next-major-release the PR has API changes and it waiting on the next major version parquet Changes to the parquet crate
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants