[FEA] Support V2 encodings in Parquet reader and writer #13501
Labels
2 - In Progress
Currently a work in progress
cuIO
cuIO issue
feature request
New feature or request
improvement
Improvement / enhancement to an existing function
libcudf
Affects libcudf (C++/CUDA) code.
Spark
Functionality that helps Spark RAPIDS
Milestone
Comments
GregoryKimball added the feature request, 2 - In Progress, libcudf, cuIO, Spark, and improvement labels on Jun 2, 2023
Closed
rapids-bot bot pushed a commit that referenced this issue on Aug 15, 2023
Part of #13501. This adds the ability to write V2 page headers to the Parquet writer. Authors: - Ed Seidl (https://github.com/etseidl) - Vukasin Milovanovic (https://github.com/vuule) Approvers: - Vukasin Milovanovic (https://github.com/vuule) - Nghia Truong (https://github.com/ttnghia) URL: #13751
rapids-bot bot pushed a commit that referenced this issue on Aug 17, 2023
While working on #13707 it was noticed that RLE encoding of booleans had been implemented and then disabled (see [this comment](#13707 (comment)) for details). This PR re-enables RLE encoding for booleans, but only when V2 headers are being used. Part of #13501. Authors: - Ed Seidl (https://github.com/etseidl) Approvers: - Bradley Dice (https://github.com/bdice) - Vukasin Milovanovic (https://github.com/vuule) URL: #13886
rapids-bot bot pushed a commit that referenced this issue on Aug 23, 2023
Part of #13501. This adds support for decoding Parquet pages that are DELTA_BINARY_PACKED. In addition to adding delta support, this PR incorporates changes introduced in #13622, such as using a mask to determine which decoding kernels to run, and adding parameters to the `page_state_buffers_s` struct to reduce the amount of shared memory used. Authors: - Ed Seidl (https://github.com/etseidl) - Vukasin Milovanovic (https://github.com/vuule) Approvers: - Vukasin Milovanovic (https://github.com/vuule) - https://github.com/nvdbaranec - Bradley Dice (https://github.com/bdice) - GALI PREM SAGAR (https://github.com/galipremsagar) URL: #13637
This was referenced Sep 13, 2023
@GregoryKimball should there be an entry for adding python bindings for the V2 options?
rapids-bot bot pushed a commit that referenced this issue on Oct 20, 2023
Part of #13501. Adds ability to fall back on DELTA_BINARY_PACKED encoding when V2 page headers are selected and dictionary encoding is not possible. Authors: - Ed Seidl (https://github.com/etseidl) - Yunsong Wang (https://github.com/PointKernel) - Vukasin Milovanovic (https://github.com/vuule) Approvers: - Vukasin Milovanovic (https://github.com/vuule) - Nghia Truong (https://github.com/ttnghia) URL: #14100
rapids-bot bot pushed a commit that referenced this issue on Nov 16, 2023
Part of #13501. Adds ability to decode DELTA_BYTE_ARRAY encoded pages. Authors: - Ed Seidl (https://github.com/etseidl) - Vukasin Milovanovic (https://github.com/vuule) Approvers: - Vukasin Milovanovic (https://github.com/vuule) - https://github.com/nvdbaranec - GALI PREM SAGAR (https://github.com/galipremsagar) URL: #14101
rapids-bot bot pushed a commit that referenced this issue on Dec 20, 2023
Part of #13501. This adds the ability to read and write Parquet pages with DELTA_LENGTH_BYTE_ARRAY encoding. Authors: - Ed Seidl (https://github.com/etseidl) Approvers: - Vukasin Milovanovic (https://github.com/vuule) - Michael Wang (https://github.com/isVoid) - Nghia Truong (https://github.com/ttnghia) URL: #14590
GregoryKimball changed the title from "[FEA] Support DELTA encoding in Parquet reader and writer" to "[FEA] Support V2 encodings in Parquet reader and writer" on Mar 6, 2024
rapids-bot bot pushed a commit that referenced this issue on Mar 8, 2024
Re-submission of #14938. Final (delta) piece of #13501. Adds the ability to encode Parquet pages as DELTA_BYTE_ARRAY. Python testing will be added as a follow-on when per-column encoding selection is added to the python API (ref this [comment](#15081 (comment))). Authors: - Ed Seidl (https://github.com/etseidl) - Vukasin Milovanovic (https://github.com/vuule) - Yunsong Wang (https://github.com/PointKernel) Approvers: - Vukasin Milovanovic (https://github.com/vuule) - Yunsong Wang (https://github.com/PointKernel) URL: #15239
rapids-bot bot pushed a commit that referenced this issue on Apr 24, 2024
Closes #15226. Part of #13501. Adds support for reading and writing `BYTE_STREAM_SPLIT` encoded Parquet data. Includes a "microkernel" version like those introduced by #15159. Authors: - Ed Seidl (https://github.com/etseidl) - Vukasin Milovanovic (https://github.com/vuule) Approvers: - Muhammad Haseeb (https://github.com/mhaseeb123) - Vukasin Milovanovic (https://github.com/vuule) URL: #15311
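For readers unfamiliar with `BYTE_STREAM_SPLIT`, here is a rough CPU-side sketch of the idea in Python. It is not the libcudf kernel from the PR above: the function names are hypothetical, it assumes 4-byte little-endian floats, and it only illustrates how the k-th byte of every value is gathered into its own stream so that a later compressor sees more regular data.

```python
import struct
from typing import List


def byte_stream_split_encode(values: List[float]) -> bytes:
    """Scatter the bytes of each float32 into 4 streams, then concatenate them."""
    width = 4  # assumption: 32-bit floats
    raw = struct.pack(f"<{len(values)}f", *values)
    # Stream k holds byte k of every value, in value order.
    return b"".join(raw[k::width] for k in range(width))


def byte_stream_split_decode(data: bytes, width: int = 4) -> List[float]:
    """Inverse of the encoder: re-interleave the streams into whole values."""
    n = len(data) // width
    streams = [data[k * n:(k + 1) * n] for k in range(width)]
    raw = bytes(streams[k][i] for i in range(n) for k in range(width))
    return list(struct.unpack(f"<{n}f", raw))


values = [1.0, 2.5, -3.25]  # exactly representable in float32
encoded = byte_stream_split_encode(values)
decoded = byte_stream_split_decode(encoded)
```

The encoding itself does not shrink the data; the output is the same size as the input and only the byte order changes.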
rapids-bot bot pushed a commit that referenced this issue on May 22, 2024
…PI (#15613) Several recent PRs (#15081, #15411, #15600) added the ability to control some aspects of Parquet file writing on a per-column basis. During discussion of #15081 it was [suggested](#15081 (comment)) that these options be exposed by cuDF-python in a manner similar to pyarrow. This PR adds the ability to control per-column encoding, compression, binary output, and fixed-length data width, using fully qualified Parquet column names. For example, given a cuDF table with an integer column 'a', and a `list<int32>` column 'b', the fully qualified column names would be 'a' and 'b.list.element'. Addresses "Add cuDF-python API support for specifying encodings" task in #13501. Authors: - Ed Seidl (https://github.com/etseidl) - Vukasin Milovanovic (https://github.com/vuule) - GALI PREM SAGAR (https://github.com/galipremsagar) Approvers: - Muhammad Haseeb (https://github.com/mhaseeb123) - GALI PREM SAGAR (https://github.com/galipremsagar) - Vyas Ramasubramani (https://github.com/vyasr) URL: #15613
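To make the fully qualified naming in the commit message above concrete, here is a small Python sketch that derives leaf column names from a toy schema description. The helper names and the schema representation (a string for a primitive, a one-element list for a list column, a dict for a struct) are hypothetical and not part of cuDF. The `.list.element` suffix follows the example in the commit message; the struct handling is an assumption based on the usual dotted Parquet path convention.

```python
from typing import Dict, List, Union

# Hypothetical toy schema type: "int64" (primitive), ["int32"] (list), {...} (struct).
Schema = Union[str, list, dict]


def _walk(typ: Schema, path: str) -> List[str]:
    """Return the fully qualified names of all leaf columns under `path`."""
    if isinstance(typ, list):
        # List columns add the `list.element` levels, as in 'b.list.element'.
        return _walk(typ[0], f"{path}.list.element")
    if isinstance(typ, dict):
        # Assumption: struct children are addressed as 'parent.child'.
        out: List[str] = []
        for child, child_typ in typ.items():
            out.extend(_walk(child_typ, f"{path}.{child}"))
        return out
    return [path]


def leaf_paths(schema: Dict[str, Schema]) -> List[str]:
    """Fully qualified leaf names for every top-level column in the schema."""
    paths: List[str] = []
    for name, typ in schema.items():
        paths.extend(_walk(typ, name))
    return paths


paths = leaf_paths({"a": "int64", "b": ["int32"]})
# A per-column encoding request could then be keyed by these names.
encodings = {"a": "DELTA_BINARY_PACKED", "b.list.element": "DELTA_BINARY_PACKED"}
```

The `encodings` mapping only shows the shape such a request might take; the exact keyword accepted by the cuDF-python writer is defined in #15613.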
Congratulations @etseidl! Everyone, please stay tuned for a technical blog on this topic! 😄
Parquet V1 format supports three types of page encodings: PLAIN, DICTIONARY, and RLE (run-length encoded) (reference from Spark Jira). The newer and evolving Parquet V2 specification adds support for several additional encodings, including DELTA_BINARY_PACKED for `INT32` and `INT64` types, DELTA_BYTE_ARRAY for `strings` logical type, and DELTA_LENGTH_BYTE_ARRAY for `strings` logical type.
In the Parquet reader and writer, libcudf should support V2 metadata as well as the three variants of DELTA encoding.
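As a rough illustration of why delta encoding helps for integer columns, here is a simplified Python sketch. It is not the DELTA_BINARY_PACKED wire format: the real encoding groups deltas into blocks and miniblocks and bit-packs them, all of which is omitted here. The function names are hypothetical. The sketch only shows the core idea of storing a first value, a minimum delta, and small non-negative residuals.

```python
from typing import List, Tuple


def delta_encode(values: List[int]) -> Tuple[int, int, List[int]]:
    """Return (first_value, min_delta, residuals) for a non-empty list of ints."""
    if not values:
        raise ValueError("values must be non-empty")
    deltas = [b - a for a, b in zip(values, values[1:])]
    # Subtracting the minimum delta makes every residual non-negative,
    # so the residuals can be stored in few bits when the data is regular.
    min_delta = min(deltas) if deltas else 0
    residuals = [d - min_delta for d in deltas]
    return values[0], min_delta, residuals


def delta_decode(first_value: int, min_delta: int, residuals: List[int]) -> List[int]:
    """Rebuild the original values by accumulating the deltas."""
    out = [first_value]
    for r in residuals:
        out.append(out[-1] + min_delta + r)
    return out


timestamps = [1000, 1003, 1006, 1010, 1013]
encoded = delta_encode(timestamps)
restored = delta_decode(*encoded)
```

For the sample input the residuals are `[0, 0, 1, 0]`, which need one bit each, whereas the original values need at least ten bits each.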