
[DOC] Improving Clarity and Consistency in the set_row_groups Doc in libcudf #17772

Open
JigaoLuo opened this issue Jan 21, 2025 · 0 comments
Labels
doc Documentation

Report incorrect documentation

Location of incorrect documentation

/**
* @brief Sets vector of individual row groups to read.
*
* @param row_groups Vector of row groups to read
*/
void set_row_groups(std::vector<std::vector<size_type>> row_groups);

https://docs.rapids.ai/api/cudf/legacy/libcudf_docs/api_docs/io_readers/#_CPPv4N4cudf2io22parquet_reader_options14set_row_groupsENSt6vectorINSt6vectorI9size_typeEEEE

Describe the problems or issues found in the documentation

The row_groups parameter of set_row_groups, typed std::vector<std::vector<size_type>>, is confusing at first glance. After some experimentation, I realized that each inner std::vector<size_type> corresponds to a single input source, so the outer vector spans multiple input sources.
Unfortunately, this is not clear from the documentation alone. By comparison, the Python API documentation for the equivalent parameter (quoted further down) is much more intuitive and easier to understand.
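
To make the nested layout concrete, here is a minimal, self-contained sketch that does not use libcudf at all. The helper describe_row_groups and the file names are hypothetical, and it assumes the inner vectors are given in the same order as the input sources, which is what the experimentation above suggests but the current documentation does not state.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Stand-in for cudf::size_type (a 32-bit signed integer in libcudf).
using size_type = std::int32_t;

// Hypothetical helper, not part of libcudf: renders which row groups would be
// read from which source, given the nested layout described in this issue.
std::string describe_row_groups(std::vector<std::string> const& sources,
                                std::vector<std::vector<size_type>> const& row_groups)
{
  // Assumption: one inner vector per input source, in source order.
  assert(sources.size() == row_groups.size());

  std::string out;
  for (std::size_t i = 0; i < sources.size(); ++i) {
    out += sources[i] + ":";
    for (size_type rg : row_groups[i]) { out += " " + std::to_string(rg); }
    out += "\n";
  }
  return out;
}
```

Under that assumption, reading row groups 0 and 2 from a first file and row group 1 from a second file would mean passing {{0, 2}, {1}} as the row_groups argument.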

Additionally, set_columns accepts only a single std::vector, which makes its interface inconsistent with set_row_groups and adds to the confusion. Clearer documentation and a more consistent API design would greatly improve usability.

Steps taken to verify documentation is incorrect

The cudf Python API documentation (https://docs.rapids.ai/api/cudf/legacy/user_guide/api_docs/api/cudf.read_parquet/#cudf.read_parquet) explains the equivalent parameter much more clearly:

row_groups : int, or list, or a list of lists, default None
    If not None, specifies, for each input file, which row groups to read. If reading multiple inputs, a list of lists should be passed, one list for each input.

Suggested fix for documentation

The libcudf documentation for set_row_groups should describe the parameter the same way the cudf Python documentation does.
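
As a starting point for discussion, one possible wording is sketched below on a stand-in class. The class reader_options_sketch and its getter are hypothetical and exist only so the snippet compiles on its own. The doc comment is a suggestion adapted from the Python text, not the maintainers' wording, and the "same order as the sources" detail is an assumption that would need confirming.

```cpp
#include <cassert>
#include <cstdint>
#include <utility>
#include <vector>

using size_type = std::int32_t;  // stand-in for cudf::size_type

// Hypothetical stand-in for the options class; only the doc comment matters.
class reader_options_sketch {
 public:
  /**
   * @brief Sets, for each input source, the row groups to read.
   *
   * @param row_groups One vector of row group indices per input source
   * (assumed to follow the order of the sources). When reading multiple
   * inputs, pass one inner vector for each input.
   */
  void set_row_groups(std::vector<std::vector<size_type>> row_groups)
  {
    _row_groups = std::move(row_groups);
  }

  // Returns the stored per-source row group selections.
  std::vector<std::vector<size_type>> const& get_row_groups() const { return _row_groups; }

 private:
  std::vector<std::vector<size_type>> _row_groups;
};
```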

@JigaoLuo JigaoLuo added the doc Documentation label Jan 21, 2025
@JigaoLuo JigaoLuo changed the title [DOC] [DOC] Improving Clarity and Consistency in the set_row_groups Doc in libcudf Jan 21, 2025