
Forward-merge branch-24.02 to branch-24.04 #14875

Merged: 1 commit into branch-24.04 from branch-24.02 on Jan 25, 2024

Conversation

GPUtester (Collaborator)

Forward-merge triggered by push to branch-24.02 that creates a PR to keep branch-24.04 up-to-date. If this PR is unable to be immediately merged due to conflicts, it will remain open for the team to manually merge.

closes #14270

Implementation of sub-rowgroup reading of Parquet files. This PR implements an additional layer on top of the existing chunking system. Currently, the reader takes two parameters: `input_pass_read_limit`, which specifies a limit on temporary memory usage when reading and decompressing file data; and `output_pass_read_limit`, which specifies a limit on how large an output chunk (a table) can be.
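
For context, here is a minimal sketch of how a caller drives the reader with both limits. It uses the public `cudf::io::chunked_parquet_reader` interface; the limit names in the paragraph above describe internal parameters, so treat the exact argument names and order here as assumptions rather than a verbatim quote of this PR's API.

```cpp
// Hedged sketch: drive the chunked Parquet reader with an output-chunk
// size limit and an input (read/decompress) memory limit.
#include <cudf/io/parquet.hpp>

#include <string>

void read_in_chunks(std::string const& path)
{
  auto const options =
    cudf::io::parquet_reader_options::builder(cudf::io::source_info{path}).build();

  std::size_t const output_limit = 512UL * 1024 * 1024;   // cap on each output table
  std::size_t const input_limit  = 1024UL * 1024 * 1024;  // cap on temp read/decompress memory
  cudf::io::chunked_parquet_reader reader(output_limit, input_limit, options);

  while (reader.has_next()) {
    auto chunk = reader.read_chunk();  // a cudf::io::table_with_metadata
    // ... consume chunk.tbl ...
  }
}
```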

Currently, when the user specifies a limit via `input_pass_read_limit`, the reader performs multiple `passes` over the file at row-group granularity. That is, it controls how many row groups it reads at once to conform to the specified limit.

However, there are cases where row-group granularity is not fine enough, so this PR introduces `subpasses` below the top-level `passes`. It works as follows:

- We read a set of input chunks based on the `input_pass_read_limit` but we do not decompress them immediately. This constitutes a `pass`.
- Within each pass of compressed data, we progressively decompress batches of pages as `subpasses`.
- Within each `subpass` we apply the output limit to produce `chunks`.

So the overall structure of the reader is:  (read) `pass` -> (decompress) `subpass` -> (decode) `chunk`
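
As a rough illustration of that nesting (hypothetical stand-in code only; the real logic lives in `handle_chunking` and friends and operates on actual file state):

```cpp
// Toy model of the pass -> subpass -> chunk nesting. Every name and count
// below is a made-up placeholder, not the real reader implementation.
#include <cstddef>
#include <cstdio>

int main()
{
  std::size_t const num_passes    = 2;  // row-group sets fitting input_pass_read_limit
  std::size_t const num_subpasses = 3;  // decompressed page batches per pass
  std::size_t const num_chunks    = 2;  // output tables per subpass (output limit)

  for (std::size_t p = 0; p < num_passes; ++p) {        // (read) pass
    for (std::size_t s = 0; s < num_subpasses; ++s) {   // (decompress) subpass
      for (std::size_t c = 0; c < num_chunks; ++c) {    // (decode) chunk
        std::printf("pass %zu / subpass %zu -> chunk %zu\n", p, s, c);
      }
    }
  }
  return 0;
}
```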

Major sections of code changes:

- Previously, the incoming page data in the file was unsorted. To handle this, we later produced a `page_index` that could be applied to the page array to put it in schema-sorted order. This was getting very unwieldy, so the pages are now sorted up front and the `page_index` array has gone away.

- There are now two sets of pages to be aware of in the code. Within each `pass_intermediate_data` there is the set of all pages within the current set of loaded row groups. Within the `subpass_intermediate_data` struct there is a separate array of pages representing the current batch of decompressed data we are processing. To reduce confusion, I changed a good amount of code to always reference each array through its associated struct, i.e., `pass.pages` or `subpass.pages` (a simplified sketch follows this list). In addition, I removed the `page_info` from `ColumnChunkDesc` to help prevent the kernels from getting confused; `ColumnChunkDesc` now only has a `dict_page` field, which is constant across all subpasses.

- The primary entry point for the chunking mechanism is `handle_chunking`. Here we iterate through passes, subpasses, and output chunks; successive subpasses are computed and preprocessed through here.

- The volume of diffs you'll see in `reader_impl_chunking.cu` is a little deceptive. A lot of this is just functions (or pieces of functions) that have been moved over from either `reader_impl_preprocess.cu` or `reader_impl_helpers.cpp`. The most relevant actual changes are in: `handle_chunking`, `compute_input_passes`, `compute_next_subpass`, and `compute_chunks_for_subpass`.
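
For reference, here is a highly simplified, hypothetical view of the two page arrays and the trimmed-down `ColumnChunkDesc` described in the list above (the real structs in `reader_impl_chunking` carry far more state and use device-accessible containers rather than `std::vector`):

```cpp
// Simplified sketch only; field sets and container types are assumptions.
#include <vector>

struct PageInfo {};  // stand-in for the real page descriptor

struct pass_intermediate_data {
  std::vector<PageInfo> pages;  // all pages in the currently loaded row groups (compressed)
};

struct subpass_intermediate_data {
  std::vector<PageInfo> pages;  // only the pages decompressed for the current subpass
};

struct ColumnChunkDesc {
  PageInfo const* dict_page = nullptr;  // dictionary page; constant across all subpasses
  // note: the old per-chunk page_info array is gone -- kernels now reference
  // pass.pages or subpass.pages explicitly
};
```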

Note on tests:   I renamed `parquet_chunked_reader_tests.cpp` to `parquet_chunked_reader_test.cu` as I needed to use thrust. The only actual changes in the file are the addition of the `ParquetChunkedReaderInputLimitConstrainedTest` and `ParquetChunkedReaderInputLimitTest` test suites at the bottom.

Authors:
  - https://github.com/nvdbaranec
  - Nghia Truong (https://github.com/ttnghia)

Approvers:
  - Vyas Ramasubramani (https://github.com/vyasr)
  - Nghia Truong (https://github.com/ttnghia)
  - Vukasin Milovanovic (https://github.com/vuule)

URL: #14360
GPUtester requested a review from a team as a code owner (Jan 25, 2024, 00:51)
GPUtester merged commit f5118c2 into branch-24.04 on Jan 25, 2024
22 checks passed
github-actions bot added labels: libcudf (Affects libcudf (C++/CUDA) code.), CMake (CMake build issue) on Jan 25, 2024
GPUtester (Collaborator, Author)

SUCCESS - forward-merge complete.
