Merged
2 changes: 1 addition & 1 deletion parquet/src/arrow/async_reader/metadata.rs
@@ -119,7 +119,7 @@ impl<F: MetadataFetch> MetadataLoader<F> {
return Err(ParquetError::EOF(format!(
"file size of {} is less than footer + metadata {}",
file_size,
-                length + 8
+                length + FOOTER_SIZE

Contributor commented:
💯

)));
}

3 changes: 3 additions & 0 deletions parquet/src/errors.rs
@@ -47,6 +47,8 @@ pub enum ParquetError {
IndexOutOfBound(usize, usize),
/// An external error variant
External(Box<dyn Error + Send + Sync>),
+    /// Returned when a function needs more data to complete properly.

Contributor commented:
Could you also add some additional comments on how to interpret the argument? Is it the number of bytes that is needed?

Maybe we can return a Range or something more specific about what is needed, if it isn't always clear.

Contributor (PR author) replied:
In an earlier POC I did have two separate errors, one with a range, but we thought that a bit redundant at the time. I suppose we could add an Option<Range<usize>> to NeedMoreData to handle the case of an insufficient buffer when reading the page indexes; there we know the exact range we need. Right now the needed value is set to the number of bytes that need to be read from the tail of the file. If try_parse fails when reading the page indexes, then we really only need the page index range; there's no need to read the footer all over again.

But then the error handling becomes more complex: if the range in NeedMoreData is None, read the required bytes from the tail and call try_parse again; if the range is not None, read that range and call read_page_indexes. Of course a user is free to ignore the range if they want to keep the error handling simple at the cost of more I/O (but that would likely be negligible, since those bytes should still be in the buffer cache).

For now I'll try to improve the documentation. Let me know if you want me to add the optional Range to NeedMoreData.
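
To make the trade-off concrete, here is a minimal sketch of the shape being discussed, assuming a hypothetical hint carried alongside the plain byte count (none of these names exist in the PR; they are illustrative only):

use std::ops::Range;

// Hypothetical refinement discussed above, not part of this PR: the error
// would optionally carry the exact byte range that is missing (the page-index
// case) instead of only a byte count measured from the tail of the file.
enum NeedMoreDataHint {
    // Re-read this many bytes from the end of the file and retry the full parse.
    Suffix(usize),
    // Only this exact range is missing; fetch it and rerun the page-index step,
    // skipping the footer parse.
    Exact(Range<usize>),
}

// Translate a hint into the byte range the caller should fetch next.
fn bytes_to_fetch(hint: &NeedMoreDataHint, file_size: usize) -> Range<usize> {
    match hint {
        NeedMoreDataHint::Suffix(needed) => file_size - needed..file_size,
        NeedMoreDataHint::Exact(range) => range.clone(),
    }
}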

Contributor replied:
More docs sounds good 👍

+    NeedMoreData(usize),

Contributor commented:
Can we also mark ParquetError as non-exhaustive at the same time (so future additions won't be breaking API changes)? Like this:

#[non_exhaustive]
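
For reference, a minimal sketch of what the suggestion would look like on the enum itself (variant list abridged; this attribute is not part of this PR's diff):

#[non_exhaustive]
pub enum ParquetError {
    /// General Parquet error.
    General(String),
    /// Returned when a function needs more data to complete properly.
    NeedMoreData(usize),
    // ...remaining variants unchanged; with #[non_exhaustive], new variants can
    // be added later without a breaking API change for downstream crates.
}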

@xudong963 (Member) replied on Feb 26, 2025:
@alamb If we add #[non_exhaustive] to ParquetError, then when downstream code matches on ParquetError, it has to add a `_` arm (see https://doc.rust-lang.org/reference/attributes/type_system.html#r-attributes.type-system.non_exhaustive.match), for example:

match parquet_error {
  ... => {},
  _ => {} // we have to add the `_` arm
}

I'm worried that if we add a new error type to ParquetError, downstream code won't notice it because of the `_` arm.

}

impl std::fmt::Display for ParquetError {
@@ -63,6 +65,7 @@ impl std::fmt::Display for ParquetError {
write!(fmt, "Index {index} out of bound: {bound}")
}
ParquetError::External(e) => write!(fmt, "External: {e}"),
+            ParquetError::NeedMoreData(needed) => write!(fmt, "NeedMoreData: {needed}"),
}
}
}
36 changes: 11 additions & 25 deletions parquet/src/file/metadata/reader.rs
@@ -178,7 +178,7 @@ impl ParquetMetaDataReader {
///
/// # Errors
///
- /// This function will return [`ParquetError::IndexOutOfBound`] in the event `reader` does not
+ /// This function will return [`ParquetError::NeedMoreData`] in the event `reader` does not
/// provide enough data to fully parse the metadata (see example below).
///
/// Other errors returned include [`ParquetError::General`] and [`ParquetError::EOF`].
@@ -196,7 +196,7 @@ impl ParquetMetaDataReader {
/// let mut reader = ParquetMetaDataReader::new().with_page_indexes(true);
/// match reader.try_parse_sized(&bytes, len) {
/// Ok(_) => (),
- ///     Err(ParquetError::IndexOutOfBound(needed, _)) => {
+ ///     Err(ParquetError::NeedMoreData(needed)) => {
/// let bytes = get_bytes(&file, len - needed..len);
/// reader.try_parse_sized(&bytes, len).unwrap();
/// }
@@ -207,12 +207,7 @@ impl ParquetMetaDataReader {
pub fn try_parse_sized<R: ChunkReader>(&mut self, reader: &R, file_size: usize) -> Result<()> {
self.metadata = match self.parse_metadata(reader) {
Ok(metadata) => Some(metadata),
- // FIXME: throughout this module ParquetError::IndexOutOfBound is used to indicate the
- // need for more data. This is not it's intended use. The plan is to add a NeedMoreData
- // value to the enum, but this would be a breaking change. This will be done as
- // 54.0.0 draws nearer.
- // https://github.com/apache/arrow-rs/issues/6447
- Err(ParquetError::IndexOutOfBound(needed, _)) => {
+ Err(ParquetError::NeedMoreData(needed)) => {
// If reader is the same length as `file_size` then presumably there is no more to
// read, so return an EOF error.
if file_size == reader.len() as usize || needed > file_size {
Expand All @@ -223,7 +218,7 @@ impl ParquetMetaDataReader {
));
} else {
// Ask for a larger buffer
- return Err(ParquetError::IndexOutOfBound(needed, file_size));
+ return Err(ParquetError::NeedMoreData(needed));
}
}
Err(e) => return Err(e),
@@ -285,10 +280,7 @@ impl ParquetMetaDataReader {
));
} else {
// Ask for a larger buffer
- return Err(ParquetError::IndexOutOfBound(
-     file_size - range.start,
-     file_size,
- ));
+ return Err(ParquetError::NeedMoreData(file_size - range.start));
}
}

@@ -484,10 +476,7 @@ impl ParquetMetaDataReader {
// check file is large enough to hold footer
let file_size = chunk_reader.len();
if file_size < (FOOTER_SIZE as u64) {
- return Err(ParquetError::IndexOutOfBound(
-     FOOTER_SIZE,
-     file_size as usize,
- ));
+ return Err(ParquetError::NeedMoreData(FOOTER_SIZE));
}

let mut footer = [0_u8; 8];
@@ -500,10 +489,7 @@ impl ParquetMetaDataReader {
self.metadata_size = Some(footer_metadata_len);

if footer_metadata_len > file_size as usize {
- return Err(ParquetError::IndexOutOfBound(
-     footer_metadata_len,
-     file_size as usize,
- ));
+ return Err(ParquetError::NeedMoreData(footer_metadata_len));
}

let start = file_size - footer_metadata_len as u64;
@@ -682,7 +668,7 @@ mod tests {
let err = ParquetMetaDataReader::new()
.parse_metadata(&test_file)
.unwrap_err();
- assert!(matches!(err, ParquetError::IndexOutOfBound(8, _)));
+ assert!(matches!(err, ParquetError::NeedMoreData(8)));
}

#[test]
@@ -701,7 +687,7 @@
let err = ParquetMetaDataReader::new()
.parse_metadata(&test_file)
.unwrap_err();
- assert!(matches!(err, ParquetError::IndexOutOfBound(263, _)));
+ assert!(matches!(err, ParquetError::NeedMoreData(263)));
}

#[test]
@@ -794,7 +780,7 @@ mod tests {
// should fail
match reader.try_parse_sized(&bytes, len).unwrap_err() {
// expected error, try again with provided bounds
- ParquetError::IndexOutOfBound(needed, _) => {
+ ParquetError::NeedMoreData(needed) => {
let bytes = bytes_for_range(len - needed..len);
reader.try_parse_sized(&bytes, len).unwrap();
let metadata = reader.finish().unwrap();
@@ -818,7 +804,7 @@
// should fail
match reader.try_parse_sized(&bytes, len).unwrap_err() {
// expected error, try again with provided bounds
- ParquetError::IndexOutOfBound(needed, _) => {
+ ParquetError::NeedMoreData(needed) => {
let bytes = bytes_for_range(len - needed..len);
reader.try_parse_sized(&bytes, len).unwrap();
reader.finish().unwrap();
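
Putting the reader.rs changes together, a caller that cannot cheaply read the whole file can drive try_parse_sized with a grow-on-demand suffix. A rough sketch of that loop follows; load_metadata, fetch_suffix, and the initial 64 KiB guess are illustrative assumptions (with the bytes crate providing the buffer type), not APIs introduced by this PR:

use parquet::errors::{ParquetError, Result};
use parquet::file::metadata::{ParquetMetaData, ParquetMetaDataReader};

// `fetch_suffix(n)` stands in for however the caller reads the last `n` bytes
// of the file (an object-store range request, a pread, etc.).
fn load_metadata(
    file_size: usize,
    fetch_suffix: impl Fn(usize) -> bytes::Bytes,
) -> Result<ParquetMetaData> {
    let mut reader = ParquetMetaDataReader::new().with_page_indexes(true);
    // Initial guess that often covers the footer, metadata, and page indexes.
    let mut suffix_len = (64 * 1024).min(file_size);
    loop {
        let bytes = fetch_suffix(suffix_len);
        match reader.try_parse_sized(&bytes, file_size) {
            Ok(()) => return reader.finish(),
            // New in this PR: ask again with at least `needed` bytes from the tail.
            Err(ParquetError::NeedMoreData(needed)) => suffix_len = needed,
            Err(e) => return Err(e),
        }
    }
}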