
Forward-merge branch-24.02 to branch-24.04 #14875

Merged: 1 commit into branch-24.04 from branch-24.02 on Jan 25, 2024

Conversation

GPUtester (Collaborator)

Forward-merge triggered by push to branch-24.02 that creates a PR to keep branch-24.04 up-to-date. If this PR is unable to be immediately merged due to conflicts, it will remain open for the team to manually merge.

closes #14270

Implementation of sub-rowgroup reading of Parquet files. This PR implements an additional layer on top of the existing chunking system. Currently, the reader takes two parameters: `input_pass_read_limit`, which specifies a limit on temporary memory usage when reading and decompressing file data; and `output_pass_read_limit`, which specifies a limit on how large an output chunk (a table) can be.
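
For context, here is a minimal sketch of how a caller drives the reader with both limits. It uses the public `cudf::io::chunked_parquet_reader` interface; the limit names in the paragraph above describe internal parameters, so treat the exact argument names and order here as assumptions rather than a verbatim quote of this PR's API.

```cpp
// Hedged sketch: drive the chunked Parquet reader with an output-chunk
// size limit and an input (read/decompress) memory limit.
#include <cudf/io/parquet.hpp>

#include <string>

void read_in_chunks(std::string const& path)
{
  auto const options =
    cudf::io::parquet_reader_options::builder(cudf::io::source_info{path}).build();

  std::size_t const output_limit = 512UL * 1024 * 1024;   // cap on each output table
  std::size_t const input_limit  = 1024UL * 1024 * 1024;  // cap on temp read/decompress memory
  cudf::io::chunked_parquet_reader reader(output_limit, input_limit, options);

  while (reader.has_next()) {
    auto chunk = reader.read_chunk();  // a cudf::io::table_with_metadata
    // ... consume chunk.tbl ...
  }
}
```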

Currently, when the user specifies a limit via `input_pass_read_limit`, the reader performs multiple `passes` over the file at row-group granularity. That is, it controls how many row groups it reads at once to conform to the specified limit.

However, there are cases where row-group granularity is not fine enough, so this PR introduces `subpasses` below the top-level `passes`. It works as follows:

- We read a set of input chunks based on the `input_pass_read_limit` but we do not decompress them immediately. This constitutes a `pass`.
- Within each pass of compressed data, we progressively decompress batches of pages as `subpasses`.
- Within each `subpass` we apply the output limit to produce `chunks`.

So the overall structure of the reader is:  (read) `pass` -> (decompress) `subpass` -> (decode) `chunk`
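
As a rough illustration of that nesting (hypothetical stand-in code only; the real logic lives in `handle_chunking` and friends and operates on actual file state):

```cpp
// Toy model of the pass -> subpass -> chunk nesting. Every name and count
// below is a made-up placeholder, not the real reader implementation.
#include <cstddef>
#include <cstdio>

int main()
{
  std::size_t const num_passes    = 2;  // row-group sets fitting input_pass_read_limit
  std::size_t const num_subpasses = 3;  // decompressed page batches per pass
  std::size_t const num_chunks    = 2;  // output tables per subpass (output limit)

  for (std::size_t p = 0; p < num_passes; ++p) {        // (read) pass
    for (std::size_t s = 0; s < num_subpasses; ++s) {   // (decompress) subpass
      for (std::size_t c = 0; c < num_chunks; ++c) {    // (decode) chunk
        std::printf("pass %zu / subpass %zu -> chunk %zu\n", p, s, c);
      }
    }
  }
  return 0;
}
```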

Major sections of code changes:

- Previously, the incoming page data in the file was unsorted. To handle this, we later produced a `page_index` that could be applied to the page array to put it in schema-sorted order. This was getting very unwieldy, so the pages are now sorted up front and the `page_index` array has gone away.

- There are now two sets of pages to be aware of in the code. Within each `pass_intermediate_data` there is the set of all pages within the current set of loaded row groups. Within the `subpass_intermediate_data` struct there is a separate array of pages representing the current batch of decompressed data we are processing. To reduce confusion, I changed a good amount of code to always reference each array through its associated struct, i.e., `pass.pages` or `subpass.pages` (a simplified sketch follows this list). In addition, I removed the `page_info` from `ColumnChunkDesc` to help prevent the kernels from getting confused; `ColumnChunkDesc` now only has a `dict_page` field, which is constant across all subpasses.

- The primary entry point for the chunking mechanism is `handle_chunking`. Here we iterate through passes, subpasses, and output chunks; successive subpasses are computed and preprocessed through here.

- The volume of diffs you'll see in `reader_impl_chunking.cu` is a little deceptive. A lot of this is just functions (or pieces of functions) that have been moved over from either `reader_impl_preprocess.cu` or `reader_impl_helpers.cpp`. The most relevant actual changes are in: `handle_chunking`, `compute_input_passes`, `compute_next_subpass`, and `compute_chunks_for_subpass`.
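
For reference, here is a highly simplified, hypothetical view of the two page arrays and the trimmed-down `ColumnChunkDesc` described in the list above (the real structs in `reader_impl_chunking` carry far more state and use device-accessible containers rather than `std::vector`):

```cpp
// Simplified sketch only; field sets and container types are assumptions.
#include <vector>

struct PageInfo {};  // stand-in for the real page descriptor

struct pass_intermediate_data {
  std::vector<PageInfo> pages;  // all pages in the currently loaded row groups (compressed)
};

struct subpass_intermediate_data {
  std::vector<PageInfo> pages;  // only the pages decompressed for the current subpass
};

struct ColumnChunkDesc {
  PageInfo const* dict_page = nullptr;  // dictionary page; constant across all subpasses
  // note: the old per-chunk page_info array is gone -- kernels now reference
  // pass.pages or subpass.pages explicitly
};
```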

Note on tests:   I renamed `parquet_chunked_reader_tests.cpp` to `parquet_chunked_reader_test.cu` as I needed to use thrust. The only actual changes in the file are the addition of the `ParquetChunkedReaderInputLimitConstrainedTest` and `ParquetChunkedReaderInputLimitTest` test suites at the bottom.

Authors:
  - https://github.com/nvdbaranec
  - Nghia Truong (https://github.com/ttnghia)

Approvers:
  - Vyas Ramasubramani (https://github.com/vyasr)
  - Nghia Truong (https://github.com/ttnghia)
  - Vukasin Milovanovic (https://github.com/vuule)

URL: #14360
GPUtester requested a review from a team as a code owner (Jan 25, 2024, 00:51)
GPUtester merged commit f5118c2 into branch-24.04 on Jan 25, 2024
22 checks passed
github-actions bot added labels: libcudf (Affects libcudf (C++/CUDA) code.), CMake (CMake build issue) on Jan 25, 2024
GPUtester (Collaborator, Author)

SUCCESS - forward-merge complete.
