[FEA] Support V2 encodings in Parquet reader and writer #13501

GregoryKimball · 2023-06-02T03:38:40Z

The Parquet V1 format supports three types of page encodings: PLAIN, DICTIONARY, and RLE (run-length encoding) (reference from Spark Jira). The newer and still-evolving Parquet V2 specification adds several encodings, including DELTA_BINARY_PACKED for INT32 and INT64 types, and DELTA_BYTE_ARRAY and DELTA_LENGTH_BYTE_ARRAY for the string logical type.
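To make the idea behind DELTA_BINARY_PACKED concrete, here is a minimal Python sketch of delta encoding with a subtracted minimum delta. It is not the on-disk format: the real encoding splits values into blocks and miniblocks and stores header fields as variable-length integers, while this sketch treats the whole input as one block. The function names `delta_encode` and `delta_decode` are hypothetical and are not part of libcudf.

```python
def delta_encode(values):
    """Return (first_value, min_delta, offsets, bit_width) for a list of ints.

    Simplified: one block covering the whole input, no miniblocks, no varints.
    """
    if not values:
        raise ValueError("need at least one value")
    # Differences between consecutive values.
    deltas = [b - a for a, b in zip(values, values[1:])]
    if not deltas:
        return values[0], 0, [], 0
    # Subtracting the minimum delta makes every stored offset non-negative.
    min_delta = min(deltas)
    offsets = [d - min_delta for d in deltas]
    # Number of bits needed to bit-pack the largest offset.
    bit_width = max(offsets).bit_length()
    return values[0], min_delta, offsets, bit_width


def delta_decode(first_value, min_delta, offsets):
    """Invert delta_encode by accumulating the restored deltas."""
    out = [first_value]
    for off in offsets:
        out.append(out[-1] + min_delta + off)
    return out


timestamps = [1000, 1003, 1006, 1010, 1013]
first, min_delta, offsets, bit_width = delta_encode(timestamps)
restored = delta_decode(first, min_delta, offsets)
```

For slowly changing integers such as timestamps or row IDs, the offsets need only a few bits each, which is where the space saving comes from.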

In the parquet reader and writer, libcudf should support V2 metadata as well as the three variants of DELTA encoding.

| Feature | Status | Notes |
| --- | --- | --- |
| Add V2 reader support | ✅ #11778 | |
| Multi-warp decode of Dremel data streams | ✅ #13203 | |
| Use efficient strings column factory in decoder | ✅ #13302 | |
| Implement DELTA_BINARY_PACKED decoding | ✅ #13637 | see #12948 for reference |
| Implement DELTA_BYTE_ARRAY decoding | ✅ #14101 | see #12948 for reference |
| Add V2 writer support | ✅ #13751 | |
| Implement DELTA_BINARY_PACKED encoding | ✅ #14100 | |
| Add python bindings for V2 header and options | ✅ #14316 | |
| Implement DELTA_BYTE_ARRAY encoding | ✅ #15239 | some outdated reviews in #14938 |
| Implement DELTA_LENGTH_BYTE_ARRAY encoding and decoding for unsorted data | ✅ #14590 | |
| Add C++ API support for specifying encodings | ✅ #15081 | |
| Add cuDF-python API support for specifying encodings | ✅ #15613 | |
| Add BYTE_STREAM_SPLIT encoding and decoding | ✅ #15311 | see issue #15226 and parquet reference |

Part of #13501. This adds the ability to write V2 page headers to the Parquet writer. Authors: - Ed Seidl (https://github.com/etseidl) - Vukasin Milovanovic (https://github.com/vuule) Approvers: - Vukasin Milovanovic (https://github.com/vuule) - Nghia Truong (https://github.com/ttnghia) URL: #13751

While working on #13707 it was noticed that RLE encoding of booleans had been implemented and then disabled (see [this comment](#13707 (comment)) for details). This PR re-enables RLE encoding for booleans, but only when V2 headers are being used. Part of #13501. Authors: - Ed Seidl (https://github.com/etseidl) Approvers: - Bradley Dice (https://github.com/bdice) - Vukasin Milovanovic (https://github.com/vuule) URL: #13886

Part of #13501. This adds support for decoding Parquet pages that are DELTA_BINARY_PACKED. In addition to adding delta support, this PR incorporates changes introduced in #13622, such as using a mask to determine which decoding kernels to run, and adding parameters to the `page_state_buffers_s` struct to reduce the amount of shared memory used. Authors: - Ed Seidl (https://github.com/etseidl) - Vukasin Milovanovic (https://github.com/vuule) Approvers: - Vukasin Milovanovic (https://github.com/vuule) - https://github.com/nvdbaranec - Bradley Dice (https://github.com/bdice) - GALI PREM SAGAR (https://github.com/galipremsagar) URL: #13637

etseidl · 2023-09-14T00:07:10Z

@GregoryKimball should there be an entry for adding python bindings for the V2 options?

Part of #13501. Adds ability to fall back on DELTA_BINARY_PACKED encoding when V2 page headers are selected and dictionary encoding is not possible. Authors: - Ed Seidl (https://github.com/etseidl) - Yunsong Wang (https://github.com/PointKernel) - Vukasin Milovanovic (https://github.com/vuule) Approvers: - Vukasin Milovanovic (https://github.com/vuule) - Nghia Truong (https://github.com/ttnghia) URL: #14100

Part of #13501. Adds ability to decode DELTA_BYTE_ARRAY encoded pages. Authors: - Ed Seidl (https://github.com/etseidl) - Vukasin Milovanovic (https://github.com/vuule) Approvers: - Vukasin Milovanovic (https://github.com/vuule) - https://github.com/nvdbaranec - GALI PREM SAGAR (https://github.com/galipremsagar) URL: #14101
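For readers unfamiliar with DELTA_BYTE_ARRAY, the encoding is incremental (prefix) encoding: each value is stored as the length of the prefix it shares with the previous value, plus the remaining suffix. In the Parquet format the prefix lengths are then DELTA_BINARY_PACKED and the suffixes are DELTA_LENGTH_BYTE_ARRAY encoded; the sketch below stops before that step. The helper names are hypothetical and this is not how the libcudf kernels are written.

```python
def common_prefix_len(a: bytes, b: bytes) -> int:
    """Count the leading bytes shared by a and b."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def incremental_encode(values):
    """Split each value into (shared prefix length, suffix) relative to its predecessor."""
    prefix_lengths, suffixes = [], []
    prev = b""
    for v in values:
        n = common_prefix_len(prev, v)
        prefix_lengths.append(n)
        suffixes.append(v[n:])
        prev = v
    return prefix_lengths, suffixes


def incremental_decode(prefix_lengths, suffixes):
    """Rebuild each value from the previous value's prefix and the stored suffix."""
    out = []
    prev = b""
    for n, s in zip(prefix_lengths, suffixes):
        prev = prev[:n] + s
        out.append(prev)
    return out


words = [b"apple", b"applesauce", b"apply", b"banana"]
prefix_lengths, suffixes = incremental_encode(words)
decoded = incremental_decode(prefix_lengths, suffixes)
```

Because each value depends on the one before it, decoding is inherently sequential within a page, which is part of what makes a GPU decoder for this encoding non-trivial.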

Part of #13501. This adds the ability to read and write Parquet pages with DELTA_LENGTH_BYTE_ARRAY encoding. Authors: - Ed Seidl (https://github.com/etseidl) Approvers: - Vukasin Milovanovic (https://github.com/vuule) - Michael Wang (https://github.com/isVoid) - Nghia Truong (https://github.com/ttnghia) URL: #14590
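As a rough illustration of DELTA_LENGTH_BYTE_ARRAY, the encoding stores all value lengths first and then the concatenated bytes of every value. In the real format the lengths are DELTA_BINARY_PACKED; this sketch leaves them as a plain Python list. The function names are hypothetical.

```python
def split_lengths_and_data(values):
    """Separate a list of byte strings into a lengths list and one data buffer."""
    lengths = [len(v) for v in values]
    data = b"".join(values)
    return lengths, data


def join_lengths_and_data(lengths, data):
    """Rebuild the original byte strings by slicing the buffer at each length."""
    out = []
    pos = 0
    for n in lengths:
        out.append(data[pos:pos + n])
        pos += n
    return out


strings = [b"Hello", b"World", b"Foobar", b"ABCDEF"]
lengths, data = split_lengths_and_data(strings)
rebuilt = join_lengths_and_data(lengths, data)
```

Keeping the lengths apart from the character data means the lengths compress well on their own and the string bytes sit in one contiguous buffer.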

Re-submission of #14938. Final (delta) piece of #13501. Adds the ability to encode Parquet pages as DELTA_BYTE_ARRAY. Python testing will be added as a follow-on when per-column encoding selection is added to the Python API (ref this [comment](#15081 (comment))). Authors: - Ed Seidl (https://github.com/etseidl) - Vukasin Milovanovic (https://github.com/vuule) - Yunsong Wang (https://github.com/PointKernel) Approvers: - Vukasin Milovanovic (https://github.com/vuule) - Yunsong Wang (https://github.com/PointKernel) URL: #15239

Closes #15226. Part of #13501. Adds support for reading and writing `BYTE_STREAM_SPLIT` encoded Parquet data. Includes a "microkernel" version like those introduced by #15159. Authors: - Ed Seidl (https://github.com/etseidl) - Vukasin Milovanovic (https://github.com/vuule) Approvers: - Muhammad Haseeb (https://github.com/mhaseeb123) - Vukasin Milovanovic (https://github.com/vuule) URL: #15311
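A minimal CPU sketch of what `BYTE_STREAM_SPLIT` does to the data, assuming little-endian 32-bit floats: byte 0 of every value goes into the first stream, byte 1 into the second, and so on, and the streams are concatenated. No bytes are added or removed; the benefit comes from a downstream compressor seeing similar bytes grouped together. The function names are hypothetical and unrelated to the libcudf "microkernel" implementation.

```python
import struct


def byte_stream_split(values, fmt="<f"):
    """Scatter the bytes of fixed-width values into one stream per byte position."""
    width = struct.calcsize(fmt)
    raw = b"".join(struct.pack(fmt, v) for v in values)
    # raw[i::width] picks byte i of every value.
    return b"".join(raw[i::width] for i in range(width))


def byte_stream_join(encoded, fmt="<f"):
    """Gather the byte streams back into the original fixed-width values."""
    width = struct.calcsize(fmt)
    count = len(encoded) // width
    raw = bytearray(len(encoded))
    for i in range(width):
        # Stream i holds byte i of each value, in value order.
        raw[i::width] = encoded[i * count:(i + 1) * count]
    return [struct.unpack_from(fmt, raw, k * width)[0] for k in range(count)]


floats = [1.0, 2.0, 0.5]
encoded = byte_stream_split(floats)
roundtrip = byte_stream_join(encoded)
```

With these three inputs, the first two streams are all zero bytes, which shows why the rearranged data tends to compress better than the interleaved original.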

…PI (#15613) Several recent PRs (#15081, #15411, #15600) added the ability to control some aspects of Parquet file writing on a per-column basis. During discussion of #15081 it was [suggested](#15081 (comment)) that these options be exposed by cuDF-python in a manner similar to pyarrow. This PR adds the ability to control per-column encoding, compression, binary output, and fixed-length data width, using fully qualified Parquet column names. For example, given a cuDF table with an integer column 'a', and a `list<int32>` column 'b', the fully qualified column names would be 'a' and 'b.list.element'. Addresses "Add cuDF-python API support for specifying encodings" task in #13501. Authors: - Ed Seidl (https://github.com/etseidl) - Vukasin Milovanovic (https://github.com/vuule) - GALI PREM SAGAR (https://github.com/galipremsagar) Approvers: - Muhammad Haseeb (https://github.com/mhaseeb123) - GALI PREM SAGAR (https://github.com/galipremsagar) - Vyas Ramasubramani (https://github.com/vyasr) URL: #15613
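The fully qualified naming in the example above can be sketched as follows. This is an illustration only: the nested-dict schema description and the `qualified_leaf_names` helper are assumptions made for this sketch and are not part of cuDF or pyarrow. It covers only primitive and list columns, following the `list.element` convention quoted in the PR description.

```python
def qualified_leaf_names(schema):
    """Return the fully qualified Parquet leaf name for each top-level column.

    `schema` maps column names to either a primitive type name (a str) or
    {"list": <element schema>} for list columns (hypothetical description format).
    """
    def walk(path, node):
        if isinstance(node, str):
            # Primitive leaf: the accumulated path is the qualified name.
            return ".".join(path)
        if isinstance(node, dict) and set(node) == {"list"}:
            # List columns add the "list.element" levels to the path.
            return walk(path + ["list", "element"], node["list"])
        raise TypeError(f"unsupported schema node: {node!r}")

    return {name: walk([name], node) for name, node in schema.items()}


table_schema = {
    "a": "int64",
    "b": {"list": "int32"},
}
names = qualified_leaf_names(table_schema)
```

Here `names["a"]` is `'a'` and `names["b"]` is `'b.list.element'`, matching the two names given in the PR description.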

GregoryKimball · 2024-06-11T20:25:03Z

Congratulations @etseidl! Everyone, please stay tuned for a technical blog on this topic! 😄

GregoryKimball added the feature request, 2 - In Progress, libcudf, cuIO, Spark, and improvement labels Jun 2, 2023

GregoryKimball added this to the Parquet continuous improvement milestone Jun 2, 2023

etseidl mentioned this issue Jun 2, 2023

[WIP] Add support for DELTA_BINARY_PACKED and DELTA_BYTE_ARRAY encodings to Parquet reader #12948

Closed

3 tasks

etseidl mentioned this issue Jun 29, 2023

[FEA] Add DELTA_BINARY_PACKED decoding support to Parquet reader #13637

Merged

3 tasks

etseidl mentioned this issue Jul 25, 2023

Optionally write version 2 page headers in Parquet writer #13751

Merged

3 tasks

revans2 mentioned this issue Jul 26, 2023

[BUG] GPU Parquet output for TIMESTAMP_MICROS is misinteterpreted by fastparquet as nanos NVIDIA/spark-rapids#8778

Closed

etseidl mentioned this issue Aug 15, 2023

Enable RLE boolean encoding for v2 Parquet files #13886

Merged

3 tasks

This was referenced Sep 13, 2023

Add DELTA_BINARY_PACKED encoder for Parquet writer #14100

Merged

Add decoder for DELTA_BYTE_ARRAY to Parquet reader #14101

Merged

GregoryKimball mentioned this issue Nov 13, 2023

Limit DELTA_BINARY_PACKED encoder to the same number of bits as the physical type being encoded #14392

Merged

3 tasks

etseidl mentioned this issue Dec 6, 2023

Add DELTA_LENGTH_BYTE_ARRAY encoder and decoder for Parquet #14590

Merged

3 tasks

etseidl mentioned this issue Jan 30, 2024

Encode DELTA_BYTE_ARRAY in Parquet writer #14938

Closed

3 tasks

GregoryKimball mentioned this issue Mar 1, 2024

[FEA] Confirm correct and efficient dictionary processing in libcudf #15199

Open

etseidl mentioned this issue Mar 6, 2024

Add DELTA_BYTE_ARRAY encoder for Parquet #15239

Merged

3 tasks

GregoryKimball changed the title ~~[FEA] Support DELTA encoding in Parquet reader and writer~~ [FEA] Support V2 encodings in Parquet reader and writer Mar 6, 2024

etseidl mentioned this issue Mar 14, 2024

Add BYTE_STREAM_SPLIT support to Parquet #15311

Merged

3 tasks

etseidl mentioned this issue May 10, 2024

Expose some Parquet per-column configuration options via the python API #15613

Merged

3 tasks

GregoryKimball closed this as completed Jun 11, 2024
