Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

cudf-polars string slicing #16082

Merged

Conversation

brandon-b-miller
Copy link
Contributor

This PR plumbs the libcudf/pylibcudf slice_strings function through to cudf-polars. Depends on #15988

@brandon-b-miller brandon-b-miller added feature request New feature or request Python Affects Python cuDF API. non-breaking Non-breaking change cudf.polars Issues specific to cudf.polars labels Jun 25, 2024
@brandon-b-miller brandon-b-miller self-assigned this Jun 25, 2024
@brandon-b-miller brandon-b-miller requested a review from a team as a code owner June 25, 2024 14:40
@github-actions github-actions bot added CMake CMake build issue pylibcudf Issues specific to the pylibcudf package labels Jun 25, 2024

# libcudf slices via [start,stop).
# polars slices with offset + length where start == offset
# stop = start + length. Do this math on the host
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
# stop = start + length. Do this math on the host
# stop = start + length. Do this maths on the host

;)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you please check if the polars logic for slicing strings is the same as "dataframe" slicing, as implemented in

start, length = zlice
if start < 0:
start += self.num_rows
# Polars implementation wraps negative start by num_rows, then
# adds length to start to get the end, then clamps both to
# [0, num_rows)
end = start + length
start = max(min(start, self.num_rows), 0)
end = max(min(end, self.num_rows), 0)

python/cudf_polars/cudf_polars/dsl/expr.py Outdated Show resolved Hide resolved
Comment on lines 724 to 731
stop = Literal(
expr_length.dtype,
pa.scalar(
expr_start.value.as_py() + expr_length.value.as_py(),
type=pa.int32(),
),
).evaluate(df, context=context, mapping=mapping)
start = expr_start.evaluate(df, context=context, mapping=mapping)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since you know you're making scalars, you can just do plc.interop.from_arrow on the pyarrow scalar. for the one you computed.

@github-actions github-actions bot removed the CMake CMake build issue label Jun 26, 2024
@github-actions github-actions bot added the pylibcudf Issues specific to the pylibcudf package label Jul 1, 2024
@github-actions github-actions bot removed the pylibcudf Issues specific to the pylibcudf package label Jul 1, 2024
@wence-
Copy link
Contributor

wence- commented Jul 1, 2024

(please merge trunk so that the cudf_polars test suite runs on this PR)

@brandon-b-miller
Copy link
Contributor Author

(please merge trunk so that the cudf_polars test suite runs on this PR)

I see some test failures here, but they seem unrelated to these changes - is there a blocking PR needed thats needed to pass things here?

polars.exceptions.ComputeError: 'cuda' conversion failed: TypeError: argument 'node': 'list' object cannot be interpreted as an integer

@vyasr
Copy link
Contributor

vyasr commented Jul 1, 2024

Possibly needs #16149?

@vyasr
Copy link
Contributor

vyasr commented Jul 1, 2024

Yeah I'm seeing all of the same failures in a completely unrelated PR #15904

@wence-
Copy link
Contributor

wence- commented Jul 1, 2024

Yeah I'm seeing all of the same failures in a completely unrelated PR #15904

This was why I didn't want these tests to block everything. But I guess that touches pylibcudf

Copy link
Contributor

@wence- wence- left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looking good! Some very minor nits

python/cudf_polars/cudf_polars/dsl/expr.py Outdated Show resolved Hide resolved
python/cudf_polars/cudf_polars/dsl/expr.py Outdated Show resolved Hide resolved
(-2, 2),
(-100, 3),
(0, 0),
(0, 1000),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we add a case where the computed start is negative and the computed stop is positive to ensure that does the right thing? e.g. suppose you have (-3, 4), I think that will produce a start stop pair of [-3, 1) which will be empty for long strings, whereas in polars it would slice from strlen - 3 to the end of the string.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added some logic for this.



@pytest.mark.parametrize(
"offset,length", [(1, 3), (0, 3), (0, 0), (-3, 1), (-100, 5), (1, 1), (100, 100)]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why are these parametrizations different to those with columns above? As there, can we particularly add a case where the computed start is negative but the computed stop is non-negative (e.g. (-3, 4) or (-2, 2)? I think those will not do the right thing.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Merged these

Copy link
Contributor

@wence- wence- left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One fencepost error, and then I think we're good to go!

Copy link
Contributor

@wence- wence- left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks Brandon!

@wence-
Copy link
Contributor

wence- commented Jul 3, 2024

/merge

@rapids-bot rapids-bot bot merged commit 3aedeea into rapidsai:branch-24.08 Jul 3, 2024
77 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
cudf.polars Issues specific to cudf.polars feature request New feature or request non-breaking Non-breaking change Python Affects Python cuDF API.
Projects
Status: Done
Archived in project
Development

Successfully merging this pull request may close these issues.

5 participants