Add python tests for Parquet DELTA_BINARY_PACKED encoder #14316
Conversation
/ok to test
a rounding concern
Co-authored-by: Vukasin Milovanovic <[email protected]>
/ok to test
@galipremsagar Would you please take a look at this testing improvement?
Not a cuIO dev, but this looks good to me.
A minor nitpick regarding the naming of a function, but that's unrelated to this change.
@@ -191,6 +199,8 @@ cdef extern from "cudf/io/parquet.hpp" namespace "cudf::io" nogil:
        void set_row_group_size_rows(size_type val) except +
        void set_max_page_size_bytes(size_t val) except +
        void set_max_page_size_rows(size_type val) except +
        void enable_write_v2_headers(bool val) except +
I realize it isn't the fault of this current PR, but one does wish `enable_write_v2_headers` were named `set_write_v2_headers`.
We use `enable_` for bool options, so this should be consistent (for better or for worse, apparently).
@@ -6370,6 +6370,8 @@ def to_parquet(
    max_page_size_rows=None,
    storage_options=None,
    return_metadata=False,
    use_dictionary=True,
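For context, a minimal sketch of how the new keyword might be exercised from Python (only `use_dictionary` is visible in this hunk; `header_version` is assumed as its companion option from the PR context):

```python
import cudf

df = cudf.DataFrame({"a": list(range(100))})

# Disabling dictionary encoding and requesting V2 page headers should
# let the writer fall back to DELTA_BINARY_PACKED for integer columns.
df.to_parquet(
    "delta.parquet",
    use_dictionary=False,   # new option shown in the diff above
    header_version="2.0",   # assumed companion option from this PR
)
```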
Can we document these two parameters here?
`cudf/python/cudf/cudf/utils/ioutils.py`, line 222 in 16051a7: `_docstring_to_parquet = """`
done
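For illustration, a hedged sketch of what the two new `_docstring_to_parquet` entries might look like (wording assumed, not necessarily the merged text):

```python
# Hypothetical excerpt of the _docstring_to_parquet text in ioutils.py;
# the merged wording may differ.
"""
use_dictionary : bool, default True
    When False, prevent the use of dictionary encoding for Parquet page
    data.
header_version : {'1.0', '2.0'}, default '1.0'
    Controls whether version 1.0 or version 2.0 page headers are written.
    Version 2.0 headers are required for delta encodings such as
    DELTA_BINARY_PACKED.
"""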
Thanks @galipremsagar. I would have never thought to look there for the docstring 😅
First time hearing about it as well 🤷‍♂️ (just don't `git blame` `ioutils.py` :P)
Co-authored-by: GALI PREM SAGAR <[email protected]>
/ok to test
/ok to test
/merge
Description
During the review of #14100, it was suggested to add a test that writes a file with cudf and then reads it back with pyarrow. This PR adds the necessary Python bindings to perform this test.
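A minimal sketch of the kind of round-trip check this enables (function and parameter names here are illustrative, not the PR's actual test code):

```python
import cudf
import pandas as pd
import pyarrow.parquet as pq


def test_delta_binary_packed_roundtrip(tmp_path):
    # Integer data that should be DELTA_BINARY_PACKED-encoded once
    # dictionary encoding is off and V2 page headers are requested.
    expected = pd.DataFrame({"x": range(1000)})
    path = str(tmp_path / "delta.parquet")

    cudf.DataFrame.from_pandas(expected).to_parquet(
        path,
        use_dictionary=False,
        header_version="2.0",
    )

    # Read back with pyarrow to check cross-implementation compatibility.
    actual = pq.read_table(path).to_pandas()
    pd.testing.assert_frame_equal(actual, expected)
```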
NOTE: there is currently an issue with encoding 32-bit values where the deltas exceed 32 bits. parquet-mr and arrow truncate the deltas for the INT32 physical type and allow values to overflow, whereas cudf currently uses 64-bit deltas, which avoids the overflow but can require 33 bits per delta when encoding. The current cudf behavior is allowed by the specification (and is in fact readable by parquet-mr), but using the extra bit goes against the Parquet spirit of minimizing output file size. This will be addressed in follow-on work.
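To make the note concrete, a small arithmetic sketch (plain Python, not cudf's encoder) of why truncated 32-bit deltas still round-trip:

```python
INT32_MIN, INT32_MAX = -2**31, 2**31 - 1


def to_int32(x):
    """Reinterpret the low 32 bits of x as a signed 32-bit value."""
    x &= 0xFFFFFFFF
    return x - 2**32 if x >= 2**31 else x


# The delta between these two consecutive values needs 33 bits:
true_delta = INT32_MAX - INT32_MIN      # 4294967295 == 2**32 - 1

# parquet-mr and arrow keep only the low 32 bits (here, it wraps to -1):
truncated_delta = to_int32(true_delta)  # -1

# Modular 32-bit addition still recovers the original value:
assert to_int32(INT32_MIN + truncated_delta) == INT32_MAX
```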
Checklist