Optionally write version 2 page headers in Parquet writer #13751
/ok to test
/ok to test
  ParquetFieldInt32(3, p->compressed_page_size),
  ParquetFieldStruct(5, p->data_page_header),
- ParquetFieldStruct(7, p->dictionary_page_header));
+ ParquetFieldStruct(7, p->dictionary_page_header),
+ ParquetFieldStruct(8, p->data_page_header_v2));
Where is number 6?
6 would be the unimplemented IndexPageHeader. It's a placeholder type in the Parquet spec.
And before you ask, 4 is an optional CRC that no one adds either. 😉
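To illustrate why the gaps at 4 and 6 are harmless: in the Thrift compact protocol, a field header stores the delta from the previously written field ID, so skipped optional fields cost nothing on the wire. The snippet below is a minimal standalone sketch, not the cuDF writer. The helper names `field_header` and `v2_page_header_field_bytes` are hypothetical; the field IDs follow the Parquet `PageHeader` definition and the type codes follow the Thrift compact protocol.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Thrift compact-protocol type codes (subset used here).
constexpr uint8_t TYPE_I32    = 5;
constexpr uint8_t TYPE_STRUCT = 12;

// Hypothetical helper: short-form field header, valid when the delta from the
// previous field ID is in [1, 15]. High nibble = ID delta, low nibble = type.
inline uint8_t field_header(int prev_id, int field_id, uint8_t type)
{
  int const delta = field_id - prev_id;
  assert(delta >= 1 && delta <= 15);
  return static_cast<uint8_t>((delta << 4) | type);
}

// Hypothetical helper: only the field-header bytes (values omitted) for a V2
// data page's PageHeader: fields 1, 2, 3 (i32) and 8 (data_page_header_v2).
inline std::vector<uint8_t> v2_page_header_field_bytes()
{
  std::vector<uint8_t> out;
  int prev = 0;
  for (int id : {1, 2, 3}) {
    out.push_back(field_header(prev, id, TYPE_I32));
    prev = id;
  }
  // Fields 4 (crc) through 7 are simply absent; the delta jumps from 3 to 8.
  out.push_back(field_header(prev, 8, TYPE_STRUCT));
  return out;
}
```

With these assumptions, the header for field 8 written after field 3 is a single byte with delta 5 in the high nibble, the same size as a header for adjacent field IDs.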
…o feature/write_v2_header
/ok to test
/ok to test
@@ -275,6 +276,18 @@ bool CompactProtocolReader::read(DictionaryPageHeader* d)
   return function_builder(this, op);
 }

+bool CompactProtocolReader::read(DataPageHeaderV2* d)
How come this wasn't needed in #11778?
Because the header decoder is device code, there's a duplicate Thrift decoder buried in page_hdr.cu.
We've got little bits of thrift decoding sprinkled all over the place :(
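For readers unfamiliar with what each of those decoders has to repeat: the compact protocol stores i32 fields (such as the level byte lengths in `DataPageHeaderV2`) as zigzag-encoded varints. The following is a rough host-side sketch of that primitive only. The function names `read_varint` and `zigzag_decode` are hypothetical and are not taken from page_hdr.cu or CompactProtocolReader.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical helper: read an unsigned LEB128 varint starting at buf[pos].
// Advances pos past the consumed bytes; stops at the end of the buffer.
inline uint32_t read_varint(uint8_t const* buf, std::size_t len, std::size_t& pos)
{
  assert(buf != nullptr);
  uint32_t value = 0;
  int shift      = 0;
  while (pos < len && shift < 35) {
    uint8_t const byte = buf[pos++];
    value |= static_cast<uint32_t>(byte & 0x7f) << shift;
    if ((byte & 0x80) == 0) { break; }  // high bit clear: last byte
    shift += 7;
  }
  return value;
}

// Hypothetical helper: undo zigzag encoding (0, -1, 1, -2, ... <- 0, 1, 2, 3, ...).
inline int32_t zigzag_decode(uint32_t v)
{
  return static_cast<int32_t>((v >> 1) ^ (0u - (v & 1u)));
}
```

A field value is then obtained as `zigzag_decode(read_varint(buf, len, pos))`; a device-side version would need the same logic marked for device execution, which is one reason the decoding ends up duplicated.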
Looks good, just one minor suggestion.
cpp/src/io/parquet/page_enc.cu
comp_in[blockIdx.x] = {base, actual_data_size};
comp_out[blockIdx.x] = {s->page.compressed_data + s->page.max_hdr_size, 0};  // size is unused
// V2 does not compress rep and def level data
size_t const offset = s->page.def_lvl_bytes + s->page.rep_lvl_bytes;
Took me a bit to connect the dots in this block. Maybe a rename?
- size_t const offset = s->page.def_lvl_bytes + s->page.rep_lvl_bytes;
+ size_t const skip_comp_size = s->page.def_lvl_bytes + s->page.rep_lvl_bytes;
Sounds good. Oh, the memcpy just below... I'm thinking that should be parallelized, but I keep forgetting. Thoughts? The level data can get pretty large (100s of bytes?).
Or I can just address this in the next PR.
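To make the block under discussion easier to follow: a V2 data page body is laid out as repetition levels, then definition levels, then values, and only the values are compressed. The sketch below is a simplified host-side model of that split, not the kernel in page_enc.cu. The types `page_layout` and `byte_range` and the functions `compression_input` and `assemble_v2_page` are hypothetical, and the compressed bytes are passed in rather than produced by a real codec.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical description of an encoded V2 page body:
// [rep levels][def levels][values].
struct page_layout {
  std::size_t rep_lvl_bytes;
  std::size_t def_lvl_bytes;
  std::size_t total_bytes;
};

struct byte_range {
  std::size_t offset;
  std::size_t size;
};

// The compressor only sees the bytes after the level data.
inline byte_range compression_input(page_layout const& p)
{
  std::size_t const skip_comp_size = p.def_lvl_bytes + p.rep_lvl_bytes;
  assert(skip_comp_size <= p.total_bytes);
  return {skip_comp_size, p.total_bytes - skip_comp_size};
}

// Copy the level bytes verbatim (the memcpy mentioned above), then append the
// already-compressed values.
inline std::vector<uint8_t> assemble_v2_page(std::vector<uint8_t> const& page,
                                             page_layout const& p,
                                             std::vector<uint8_t> const& compressed_values)
{
  byte_range const in = compression_input(p);
  assert(in.offset <= page.size());
  auto const levels_end = page.begin() + static_cast<std::ptrdiff_t>(in.offset);
  std::vector<uint8_t> out(page.begin(), levels_end);
  out.insert(out.end(), compressed_values.begin(), compressed_values.end());
  return out;
}
```

In this model the serial copy touches only `skip_comp_size` bytes, so whether parallelizing it pays off depends on how large the level data gets in practice.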
/ok to test
Co-authored-by: Nghia Truong <[email protected]>
/ok to test
/merge
Description
Part of #13501. This adds the ability to write V2 page headers to the Parquet writer.