-
Notifications
You must be signed in to change notification settings - Fork 1.1k
Add support for file row numbers in Parquet readers #7307
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Draft
jkylling
wants to merge
17
commits into
apache:main
Choose a base branch
from
jkylling:feature/parquet-reader-row-numbers
base: main
Could not load branches
Branch not found: {{ refName }}
Loading
Could not load tags
Nothing to show
Loading
Are you sure you want to change the base?
Some commits from the old base branch may be removed from the timeline,
and old review comments may become outdated.
+540
−46
Draft
Changes from all commits
Commits
Show all changes
17 commits
Select commit
Hold shift + click to select a range
f93d36e
Add support for file row numbers in Parquet readers
jkylling e485c0b
Add Apache license header to row_number.rs
jkylling 2a62009
Run cargo format
jkylling fb5126f
Change with_row_number_column to take impl Into<String>
jkylling 5350728
Change Option<String> -> Option<&str> in build_array_reader
jkylling 188f350
Replace ParquetError::RowGroupMetaDataMissingRowNumber with General
jkylling 37a9d83
Split test_create_array_reader test into two
jkylling 41e38fe
first_row_number -> first_row_index
jkylling 1a1e6b6
Simplify RowNumberReader with iterators
jkylling bcad87f
Merge remote-tracking branch 'origin/main' into feature/parquet-reade…
vustef 89c1fd1
add parquet-testing change from the merge
vustef b0d53d0
Fix test_arrow_reader_all_columns
vustef 094ae81
Fix first_row_number
vustef a5858df
Rename to first_row_index consistently, remove Option.
vustef 5e7d9a1
revert parquet-testing update
vustef 54c22c6
Fix baselines in file::metadata::tests::test_memory_size
vustef f05d470
Fix encryption metadata and async tests. Those features and default f…
vustef File filter
Filter by extension
Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
There are no files selected for viewing
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,82 @@ | ||
| // Licensed to the Apache Software Foundation (ASF) under one | ||
| // or more contributor license agreements. See the NOTICE file | ||
| // distributed with this work for additional information | ||
| // regarding copyright ownership. The ASF licenses this file | ||
| // to you under the Apache License, Version 2.0 (the | ||
| // "License"); you may not use this file except in compliance | ||
| // with the License. You may obtain a copy of the License at | ||
| // | ||
| // http://www.apache.org/licenses/LICENSE-2.0 | ||
| // | ||
| // Unless required by applicable law or agreed to in writing, | ||
| // software distributed under the License is distributed on an | ||
| // "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY | ||
| // KIND, either express or implied. See the License for the | ||
| // specific language governing permissions and limitations | ||
| // under the License. | ||
|
|
||
| use crate::arrow::array_reader::ArrayReader; | ||
| use crate::errors::{ParquetError, Result}; | ||
| use crate::file::metadata::RowGroupMetaData; | ||
| use arrow_array::{ArrayRef, Int64Array}; | ||
| use arrow_schema::DataType; | ||
| use std::any::Any; | ||
| use std::sync::Arc; | ||
|
|
||
| pub(crate) struct RowNumberReader { | ||
| buffered_row_numbers: Vec<i64>, | ||
| remaining_row_numbers: std::iter::Flatten<std::vec::IntoIter<std::ops::Range<i64>>>, | ||
| } | ||
|
|
||
| impl RowNumberReader { | ||
| pub(crate) fn try_new<'a>( | ||
| row_groups: impl Iterator<Item = &'a RowGroupMetaData>, | ||
| ) -> Result<Self> { | ||
| let ranges = row_groups | ||
| .map(|rg| { | ||
| let first_row_index = rg.first_row_index(); | ||
| Ok(first_row_index..first_row_index + rg.num_rows()) | ||
| }) | ||
| .collect::<Result<Vec<_>>>()?; | ||
| Ok(Self { | ||
| buffered_row_numbers: Vec::new(), | ||
| remaining_row_numbers: ranges.into_iter().flatten(), | ||
| }) | ||
| } | ||
| } | ||
|
|
||
| impl ArrayReader for RowNumberReader { | ||
| fn read_records(&mut self, batch_size: usize) -> Result<usize> { | ||
| let starting_len = self.buffered_row_numbers.len(); | ||
| self.buffered_row_numbers | ||
| .extend((&mut self.remaining_row_numbers).take(batch_size)); | ||
| Ok(self.buffered_row_numbers.len() - starting_len) | ||
| } | ||
|
|
||
| fn skip_records(&mut self, num_records: usize) -> Result<usize> { | ||
| // TODO: Use advance_by when it stabilizes to improve performance | ||
| Ok((&mut self.remaining_row_numbers).take(num_records).count()) | ||
| } | ||
|
|
||
| fn as_any(&self) -> &dyn Any { | ||
| self | ||
| } | ||
|
|
||
| fn get_data_type(&self) -> &DataType { | ||
| &DataType::Int64 | ||
| } | ||
|
|
||
| fn consume_batch(&mut self) -> Result<ArrayRef> { | ||
| Ok(Arc::new(Int64Array::from_iter( | ||
| self.buffered_row_numbers.drain(..), | ||
| ))) | ||
| } | ||
|
|
||
| fn get_def_levels(&self) -> Option<&[i16]> { | ||
| None | ||
| } | ||
|
|
||
| fn get_rep_levels(&self) -> Option<&[i16]> { | ||
| None | ||
| } | ||
| } |
Oops, something went wrong.
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thrilled to see this!
Please let me know if I can help in any way. I can make it my top priority to work on this, as we need to make use of it in the next few weeks.
Our use-case is to leverage this from iceberg-rust, which uses
ParquetRecordBatchStreamBuilder. The API seems to work for that, but I understand from other comments that it may not be the most desirable one - happy to help either with research/proposal or with the implementation of the chosen option.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hey @vustef! I'd be very happy if you want to help get row number support into the Parquet reader, either with this PR or through other alternatives. If you want to pick up this PR I can give you commit rights to the branch? Sadly, I don't have capacity to work on this PR at the moment.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hey @jkylling, yes, please do that if you can, happy to continue where you left.
I'd also need some guidance from @scovich and @alamb on the preferred path forward. And potentially help from @etseidl if I hit a wall with merging metadata changes that happened in the meanwhile (but more on that once I try it out).