ENH: Expose split_part to Python API via pylibcudf #21068
You need to fix the style errors reported by the pre-commit.ci runner.
pre-commit.ci autofix
Something has gone wrong with your commit. Surely you do not mean to erase all of the .sh files.
40c99a0 to cbab72f
pre-commit.ci autofix
Also, could you add tests for when the delimiter is empty, which indicates splitting on whitespace?
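As a rough sketch of what such a test could check, the snippet below builds expected values for the empty-delimiter case using Python's argument-less `str.split()` as the oracle. The helper name `expected_whitespace_part` and the sample inputs are hypothetical, and returning `None` for null rows and out-of-range indexes is an assumption about the intended behavior, not something confirmed in this thread.

```python
from typing import List, Optional


def expected_whitespace_part(s: Optional[str], index: int) -> Optional[str]:
    """Hypothetical oracle: token at `index` when splitting on whitespace."""
    if s is None:
        # Assumption: null rows stay null.
        return None
    tokens = s.split()  # no-argument split collapses runs of whitespace
    # Assumption: an out-of-range index yields a null result.
    return tokens[index] if 0 <= index < len(tokens) else None


# Sample inputs covering leading whitespace, mixed whitespace, empty and null rows.
CASES: List[Optional[str]] = ["a b c", "  leading", "tabs\tand\nnewlines", "", None]
expected = [expected_whitespace_part(s, 1) for s in CASES]
```

In an actual test, `expected` would be compared against the result of the new string method called with an empty delimiter on the same inputs.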
Co-authored-by: David Wendt <[email protected]>
pre-commit.ci autofix
from cudf.testing._utils import assert_eq


def test_split_part():
It seems these should be moved to cudf/python/cudf/cudf/tests/series/accessors/test_str.py.
This file should be deleted if everything has been moved to test_str.py.
Co-authored-by: David Wendt <[email protected]>
pre-commit.ci autofix
# SPDX-FileCopyrightText: Copyright (c) 2026, NVIDIA CORPORATION & AFFILIATES.
# SPDX-License-Identifier: Apache-2.0
I think this source should be moved to https://github.com/rapidsai/cudf/blob/main/python/pylibcudf/tests/test_string_split_split.py
Co-authored-by: David Wendt <[email protected]>
Co-authored-by: David Wendt <[email protected]>
pre-commit.ci autofix
Co-authored-by: David Wendt <[email protected]>
/ok to test f3cb63f
Co-authored-by: David Wendt <[email protected]>
Co-authored-by: David Wendt <[email protected]>
/ok to test 7a75d80
if delimiter is None:
    delimiter = ""
return self._return_or_inplace(
    self._column.split_part(delimiter, index)
Based on the delimiter annotation, you'll need to call plc.Scalar.from_py(delimiter) before passing it to split_part.
Description
This PR implements the Python bindings and API for cudf::strings::split_part, as discussed in Issue #21042.
This allows users to extract a specific token from a split string without materializing the entire list, which can significantly improve performance for ETL and log-parsing workloads.
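To illustrate the idea of extracting one token without building the full token list, here is a minimal pure-Python sketch for a single string. The function name `split_part_reference` is hypothetical and this is not the libcudf implementation. Returning `None` for null input, negative indexes, and out-of-range indexes is an assumption made for the sketch.

```python
from typing import Optional


def split_part_reference(s: Optional[str], delimiter: str, index: int) -> Optional[str]:
    """Return the token at `index`, scanning left to right without building a list.

    Assumptions (not confirmed by the PR): null input, negative indexes,
    and out-of-range indexes all yield None.
    """
    if s is None or index < 0:
        return None

    if delimiter == "":
        # Empty delimiter: tokens are runs of non-whitespace characters.
        i, n = 0, len(s)
        for k in range(index + 1):
            while i < n and s[i].isspace():
                i += 1  # skip the whitespace before the next token
            if i == n:
                return None  # ran out of tokens before reaching `index`
            j = i
            while j < n and not s[j].isspace():
                j += 1  # advance to the end of the current token
            if k == index:
                return s[i:j]
            i = j
        return None

    # Non-empty delimiter: skip `index` delimiters, then read up to the next one.
    start = 0
    for _ in range(index):
        pos = s.find(delimiter, start)
        if pos == -1:
            return None  # fewer than index + 1 tokens
        start = pos + len(delimiter)
    end = s.find(delimiter, start)
    return s[start:] if end == -1 else s[start:end]
```

For example, `split_part_reference("2024-01-15 ERROR disk full", " ", 1)` stops scanning as soon as the second token has been read, which is the property that makes this approach attractive for log parsing.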
Changes
- Bindings in split.pxd and split.pyx.
- Added the .str.split_part(delimiter, index) method to StringMethods in core/column/string.py.
- Tests in tests/test_string.py.
Checklist
Note to Reviewers
I am developing in a constrained environment (Google Colab) and could not run the full local build/test suite due to environment limitations. I am opening this as a Draft PR to rely on the CI/CD pipeline for compilation verification and testing.