Add support for file row numbers in Parquet readers #7307
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,154 @@ | ||
| // Licensed to the Apache Software Foundation (ASF) under one | ||
| // or more contributor license agreements. See the NOTICE file | ||
| // distributed with this work for additional information | ||
| // regarding copyright ownership. The ASF licenses this file | ||
| // to you under the Apache License, Version 2.0 (the | ||
| // "License"); you may not use this file except in compliance | ||
| // with the License. You may obtain a copy of the License at | ||
| // | ||
| // http://www.apache.org/licenses/LICENSE-2.0 | ||
| // | ||
| // Unless required by applicable law or agreed to in writing, | ||
| // software distributed under the License is distributed on an | ||
| // "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY | ||
| // KIND, either express or implied. See the License for the | ||
| // specific language governing permissions and limitations | ||
| // under the License. | ||
|
|
||
| use crate::arrow::array_reader::ArrayReader; | ||
| use crate::errors::{ParquetError, Result}; | ||
| use crate::file::metadata::RowGroupMetaData; | ||
| use arrow_array::{ArrayRef, Int64Array}; | ||
| use arrow_schema::DataType; | ||
| use std::any::Any; | ||
| use std::collections::VecDeque; | ||
| use std::sync::Arc; | ||
|
|
||
| pub(crate) struct RowNumberReader { | ||
| row_numbers: Vec<i64>, | ||
| row_groups: RowGroupSizeIterator, | ||
| } | ||
|
|
||
| impl RowNumberReader { | ||
| pub(crate) fn try_new<I>(row_groups: impl IntoIterator<Item = I>) -> Result<Self> | ||
| where | ||
| I: TryInto<RowGroupSize, Error = ParquetError>, | ||
| { | ||
| let row_groups = RowGroupSizeIterator::try_new(row_groups)?; | ||
| Ok(Self { | ||
| row_numbers: Vec::new(), | ||
| row_groups, | ||
| }) | ||
| } | ||
| } | ||
|
|
||
| impl ArrayReader for RowNumberReader { | ||
| fn as_any(&self) -> &dyn Any { | ||
| self | ||
| } | ||
|
|
||
| fn get_data_type(&self) -> &DataType { | ||
| &DataType::Int64 | ||
| } | ||
|
|
||
| fn read_records(&mut self, batch_size: usize) -> Result<usize> { | ||
| let read = self | ||
| .row_groups | ||
| .read_records(batch_size, &mut self.row_numbers); | ||
| Ok(read) | ||
| } | ||
|
|
||
| fn consume_batch(&mut self) -> Result<ArrayRef> { | ||
| Ok(Arc::new(Int64Array::from_iter(self.row_numbers.drain(..)))) | ||
| } | ||
|
|
||
| fn skip_records(&mut self, num_records: usize) -> Result<usize> { | ||
| let skipped = self.row_groups.skip_records(num_records); | ||
| Ok(skipped) | ||
| } | ||
|
|
||
| fn get_def_levels(&self) -> Option<&[i16]> { | ||
| None | ||
| } | ||
|
|
||
| fn get_rep_levels(&self) -> Option<&[i16]> { | ||
| None | ||
| } | ||
| } | ||
|
|
||
| struct RowGroupSizeIterator { | ||
| row_groups: VecDeque<RowGroupSize>, | ||
| } | ||
|
|
||
| impl RowGroupSizeIterator { | ||
| fn try_new<I>(row_groups: impl IntoIterator<Item = I>) -> Result<Self> | ||
| where | ||
| I: TryInto<RowGroupSize, Error = ParquetError>, | ||
|
||
| { | ||
| Ok(Self { | ||
| row_groups: VecDeque::from( | ||
| row_groups | ||
| .into_iter() | ||
| .map(TryInto::try_into) | ||
| .collect::<Result<Vec<_>>>()?, | ||
| ), | ||
| }) | ||
|
||
| } | ||
| } | ||
|
|
||
| impl RowGroupSizeIterator { | ||
| fn read_records(&mut self, mut batch_size: usize, row_numbers: &mut Vec<i64>) -> usize { | ||
| let mut read = 0; | ||
| while batch_size > 0 { | ||
| let Some(front) = self.row_groups.front_mut() else { | ||
| return read as usize; | ||
| }; | ||
| let to_read = std::cmp::min(front.num_rows, batch_size as i64); | ||
|
||
| row_numbers.extend(front.first_row_number..front.first_row_number + to_read); | ||
| front.num_rows -= to_read; | ||
| front.first_row_number += to_read; | ||
| if front.num_rows == 0 { | ||
| self.row_groups.pop_front(); | ||
| } | ||
| batch_size -= to_read as usize; | ||
| read += to_read; | ||
| } | ||
| read as usize | ||
| } | ||
|
|
||
| fn skip_records(&mut self, mut num_records: usize) -> usize { | ||
| let mut skipped = 0; | ||
| while num_records > 0 { | ||
| let Some(front) = self.row_groups.front_mut() else { | ||
| return skipped as usize; | ||
| }; | ||
| let to_skip = std::cmp::min(front.num_rows, num_records as i64); | ||
| front.num_rows -= to_skip; | ||
| front.first_row_number += to_skip; | ||
| if front.num_rows == 0 { | ||
| self.row_groups.pop_front(); | ||
| } | ||
| skipped += to_skip; | ||
| num_records -= to_skip as usize; | ||
| } | ||
| skipped as usize | ||
| } | ||
| } | ||
|
|
||
| pub(crate) struct RowGroupSize { | ||
| first_row_number: i64, | ||
| num_rows: i64, | ||
| } | ||
|
|
||
| impl TryFrom<&RowGroupMetaData> for RowGroupSize { | ||
| type Error = ParquetError; | ||
|
|
||
| fn try_from(rg: &RowGroupMetaData) -> Result<Self, Self::Error> { | ||
| Ok(Self { | ||
| first_row_number: rg | ||
| .first_row_number() | ||
| .ok_or(ParquetError::RowGroupMetaDataMissingRowNumber)?, | ||
| num_rows: rg.num_rows(), | ||
| }) | ||
| } | ||
| } | ||
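The read/skip bookkeeping in the diff above can be tried in isolation. The following is a minimal sketch, not the PR's code: `RowNumbers`, `new`, and `advance` are hypothetical names introduced here, and the `(first_row_number, num_rows)` pairs stand in for values that the PR takes from `RowGroupMetaData`. It assumes row counts are non-negative.

```rust
use std::collections::VecDeque;

/// Remaining rows of one row group: its current first row number in the
/// file and how many rows are still unread.
struct RowGroupSize {
    first_row_number: i64,
    num_rows: i64,
}

/// Hypothetical stand-in for the PR's `RowGroupSizeIterator`.
struct RowNumbers {
    row_groups: VecDeque<RowGroupSize>,
}

impl RowNumbers {
    /// Builds the queue from `(first_row_number, num_rows)` pairs.
    fn new(groups: &[(i64, i64)]) -> Self {
        Self {
            row_groups: groups
                .iter()
                .map(|&(first_row_number, num_rows)| RowGroupSize {
                    first_row_number,
                    num_rows,
                })
                .collect(),
        }
    }

    /// Consumes up to `n` rows from the front of the queue. Row numbers are
    /// appended to `out` when it is `Some`; with `None` the rows are skipped.
    /// Returns the number of rows actually consumed.
    fn advance(&mut self, mut n: usize, mut out: Option<&mut Vec<i64>>) -> usize {
        let mut done = 0;
        while n > 0 {
            let Some(front) = self.row_groups.front_mut() else {
                break; // no row groups left
            };
            // Clamp instead of casting so a huge `n` cannot wrap negative.
            let wanted = i64::try_from(n).unwrap_or(i64::MAX);
            let take = front.num_rows.min(wanted);
            if let Some(out) = out.as_deref_mut() {
                out.extend(front.first_row_number..front.first_row_number + take);
            }
            front.first_row_number += take;
            front.num_rows -= take;
            if front.num_rows == 0 {
                self.row_groups.pop_front();
            }
            n -= take as usize;
            done += take as usize;
        }
        done
    }

    /// Reads up to `n` row numbers into `out`.
    fn read_records(&mut self, n: usize, out: &mut Vec<i64>) -> usize {
        self.advance(n, Some(out))
    }

    /// Skips up to `n` rows without materializing their row numbers.
    fn skip_records(&mut self, n: usize) -> usize {
        self.advance(n, None)
    }
}
```

With two row groups described as `(0, 3)` and `(100, 2)`, reading 2 rows yields row numbers 0 and 1, skipping 2 rows then crosses the row-group boundary (rows 2 and 100), and a further read of 5 returns only row 101 because the queue is exhausted. Sharing one `advance` loop between read and skip is a design choice of this sketch; the diff keeps the two loops separate.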
Thrilled to see this!
Please let me know if I can help in any way. I can make it my top priority to work on this, as we need to make use of it in the next few weeks.
Our use-case is to leverage this from iceberg-rust, which uses ParquetRecordBatchStreamBuilder. The API seems to work for that, but I understand from other comments that it may not be the most desirable one - happy to help either with research/proposal or with the implementation of the chosen option.
Hey @vustef! I'd be very happy if you want to help get row number support into the Parquet reader, either with this PR or through other alternatives. If you want to pick up this PR I can give you commit rights to the branch? Sadly, I don't have capacity to work on this PR at the moment.
Hey @jkylling, yes, please do that if you can, happy to continue where you left off.
I'd also need some guidance from @scovich and @alamb on the preferred path forward. And potentially help from @etseidl if I hit a wall with merging metadata changes that happened in the meantime (but more on that once I try it out).