-
Notifications
You must be signed in to change notification settings - Fork 902
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add DELTA_BINARY_PACKED encoder for Parquet writer #14100
Add DELTA_BINARY_PACKED encoder for Parquet writer #14100
Conversation
…code_delta_binary
/ok to test |
cpp/src/io/parquet/delta_enc.cuh
Outdated
constexpr int values_per_mini_block = block_size / num_mini_blocks; | ||
constexpr int buffer_size = 2 * block_size; | ||
|
||
// extra sanity checks to enforce compliance with the parquet specification |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: It is cleaner to have any sentence (including comments) begins with a capitalized letter and ends with a period. Sorry if I'm nitpicking too much.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No worries...hitting caps is an extra keystroke and I'm lazy 😅
cpp/src/io/parquet/delta_enc.cuh
Outdated
uint8_t* _dst; // sink to dump encoded values to | ||
size_type _current_idx; // index of first value in buffer | ||
uint32_t _num_values; // total number of values to encode | ||
size_type _values_in_buffer; // current number of values stored in _buffer | ||
T* _buffer; // buffer to store values to be encoded | ||
uint8_t _mb_bits[delta::num_mini_blocks]; // bitwidth for each mini-block |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you group pointers and values please?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM. I just left a few comments about coding style.
The tests here seem to do round trip write/read. Can we have file written somewhere else that can be tested here? I'm not sure if we can do that in C++ code or have to do it in python? |
So the inverse of |
FWIW, we do have reader tests for this encoding that does not depend on the cudf writer. I guess it does not prove that reader only reads correctly encoded data, so writer could write something only cudf can read. Might be best to add the V2 API to Python... |
Agree. The test can be added here or either in a separate PR. |
/ok to test |
/merge |
During the review of #14100 there was a suggestion to add a test of writing using cudf and then reading the resulting file back with pyarrow. This PR adds the necessary python bindings to perform this test. NOTE: there is currently an issue with encoding 32-bit values where the deltas exceed 32-bits. parquet-mr and arrow truncate the deltas for the INT32 physical type and allow values to overflow, whereas cudf currently uses 64-bit deltas, which avoids the overflow, but can result in requiring 33-bits when encoding. The current cudf behavior is allowed by the specification (and in fact is readable by parquet-mr), but using the extra bit is not in the Parquet spirit of least output file size. This will be addressed in follow-on work. Authors: - Ed Seidl (https://github.com/etseidl) - Vukasin Milovanovic (https://github.com/vuule) Approvers: - Vukasin Milovanovic (https://github.com/vuule) - MithunR (https://github.com/mythrocks) - GALI PREM SAGAR (https://github.com/galipremsagar) URL: #14316
Description
Part of #13501. Adds ability to fall back on DELTA_BINARY_PACKED encoding when V2 page headers are selected and dictionary encoding is not possible.
Checklist