
@Umang-projects

Description

This PR implements the Python bindings and API for cudf::strings::split_part, as discussed in Issue #21042.

This allows users to extract a specific token from a split string without materializing the entire list of tokens, which should improve performance for ETL and log-parsing workloads.
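To make the intended semantics concrete, here is a minimal pure-Python model of the operation. It is a sketch, not the libcudf implementation: `split_part_reference` is a hypothetical helper, and returning `None` for an out-of-range index or a missing input is an assumption about the desired behavior. Treating an empty delimiter as "split on whitespace" follows the review discussion in this PR.

```python
from typing import List, Optional


def split_part_reference(
    values: List[Optional[str]], delimiter: str, index: int
) -> List[Optional[str]]:
    """Pure-Python model of extracting one token per string (hypothetical helper)."""
    result: List[Optional[str]] = []
    for value in values:
        if value is None:
            # Assumption: a missing input stays missing.
            result.append(None)
            continue
        # An empty delimiter means "split on runs of whitespace".
        tokens = value.split() if delimiter == "" else value.split(delimiter)
        # Assumption: an out-of-range index yields a missing value.
        result.append(tokens[index] if 0 <= index < len(tokens) else None)
    return result


logs = ["2026-01-16 INFO start", "2026-01-16 WARN retry", None]
levels = split_part_reference(logs, " ", 1)
print(levels)
```

The model builds the full token list for each row; avoiding that intermediate list is what the GPU implementation is for.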

Changes

  • pylibcudf: Added Cython bindings in split.pxd and split.pyx.
  • cuDF API: Added .str.split_part(delimiter, index) method to StringMethods in core/column/string.py.
  • Tests: Added unit tests in tests/test_string.py.

Checklist

  • I am familiar with the Contributing Guidelines.
  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.

Note to Reviewers

I am developing in a constrained environment (Google Colab) and could not run the full local build/test suite due to environment limitations. I am opening this as a Draft PR to rely on the CI/CD pipeline for compilation verification and testing.

@Umang-projects Umang-projects requested a review from a team as a code owner January 16, 2026 11:31

copy-pr-bot bot commented Jan 16, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@github-actions github-actions bot added the labels Python (affects Python cuDF API) and pylibcudf (issues specific to the pylibcudf package) Jan 16, 2026
@GPUtester GPUtester moved this to In Progress in cuDF Python Jan 16, 2026
@davidwendt davidwendt added the labels feature request (new feature or request) and non-breaking (non-breaking change) Jan 16, 2026
@davidwendt
Contributor

You need to fix the style errors identified by the pre-commit.ci runner here.

@Umang-projects Umang-projects requested review from a team as code owners January 16, 2026 17:33
@github-actions github-actions bot added the labels libcudf (affects libcudf (C++/CUDA) code), Java (affects Java cuDF API), and cudf.pandas (issues specific to cudf.pandas) Jan 16, 2026
@Umang-projects
Author

pre-commit.ci autofix

@davidwendt
Contributor

Something has gone wrong with your commit. Surely you do not mean to erase all of the .sh files.
It looks like you need to rebase the changes onto the main branch.

@Umang-projects
Author

pre-commit.ci autofix

@davidwendt davidwendt removed request for a team January 18, 2026 22:57
@davidwendt
Contributor

Also, could you add tests for when the delimiter is empty? An empty delimiter indicates a split on whitespace.
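One way to derive expected values for such a test is Python's own no-argument `str.split()`, which treats any run of whitespace as a single separator and ignores leading and trailing whitespace. The sketch below only builds the expected results. The inputs and the helper `expected_token` are hypothetical, and it assumes that the GPU result should match Python's whitespace rules and that an out-of-range index produces a missing value.

```python
from typing import List, Optional

# Hypothetical inputs covering whitespace edge cases.
inputs = ["a b c", "  leading", "tabs\tand\nnewlines", "", "single"]
index = 1


def expected_token(value: str, index: int) -> Optional[str]:
    """Return the index-th whitespace-separated token, or None if absent."""
    tokens = value.split()  # no-argument split collapses whitespace runs
    return tokens[index] if 0 <= index < len(tokens) else None


expected: List[Optional[str]] = [expected_token(s, index) for s in inputs]
print(expected)
```

In an actual test, `expected` would be compared against the result of `.str.split_part("", index)` on a series built from the same inputs.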

@Umang-projects
Author

pre-commit.ci autofix

from cudf.testing._utils import assert_eq


def test_split_part():
Contributor

@davidwendt davidwendt Jan 20, 2026


It seems these should be moved to cudf/python/cudf/cudf/tests/series/accessors/test_str.py.

Contributor


This file should be deleted if everything has been moved to test_str.py.

@Umang-projects
Author

pre-commit.ci autofix


# SPDX-FileCopyrightText: Copyright (c) 2026, NVIDIA CORPORATION & AFFILIATES.
# SPDX-License-Identifier: Apache-2.0

# SPDX-License-Identifier: Apache-2.0
Contributor


@Umang-projects
Author

pre-commit.ci autofix

@davidwendt
Contributor

/ok to test f3cb63f

@davidwendt
Contributor

/ok to test 7a75d80

        if delimiter is None:
            delimiter = ""
        return self._return_or_inplace(
            self._column.split_part(delimiter, index)
Contributor


Based on the delimiter annotation, you'll need to call plc.Scalar.from_py(delimiter) before passing it to split_part.
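The shape of that change might look like the sketch below. It is illustrative only: `prepare_delimiter` is a hypothetical helper, and the converter is passed in as a parameter so the example runs without pylibcudf or a GPU. In the real method, the converter would be `plc.Scalar.from_py`, and the wrapped value would then be handed to `split_part`.

```python
from typing import Any, Callable, Optional


def prepare_delimiter(
    delimiter: Optional[str], to_scalar: Callable[[str], Any]
) -> Any:
    """Normalize the delimiter, then wrap it with the supplied converter."""
    if delimiter is None:
        # None is treated like the empty string (split on whitespace).
        delimiter = ""
    if not isinstance(delimiter, str):
        raise TypeError(
            f"delimiter must be a str or None, got {type(delimiter).__name__}"
        )
    return to_scalar(delimiter)


# Stand-in converter so the sketch runs anywhere; real code would pass
# plc.Scalar.from_py here instead.
wrapped = prepare_delimiter(None, lambda s: ("scalar", s))
print(wrapped)
```

Normalizing before wrapping keeps the `None` handling in one place, so the lower-level call always receives a scalar rather than a raw Python string.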


Labels

  • cudf.pandas: Issues specific to cudf.pandas
  • feature request: New feature or request
  • Java: Affects Java cuDF API.
  • libcudf: Affects libcudf (C++/CUDA) code.
  • non-breaking: Non-breaking change
  • pylibcudf: Issues specific to the pylibcudf package
  • Python: Affects Python cuDF API.

Projects

Status: In Progress


3 participants