cudf-polars
string slicing
#16082
# libcudf slices via [start,stop).
# polars slices with offset + length where start == offset
# stop = start + length. Do this math on the host
- # stop = start + length. Do this math on the host
+ # stop = start + length. Do this maths on the host
;)
Can you please check if the polars logic for slicing strings is the same as "dataframe" slicing, as implemented in cudf/python/cudf_polars/cudf_polars/containers/dataframe.py, lines 212 to 220 in ac0f79a:
start, length = zlice
if start < 0:
    start += self.num_rows
# Polars implementation wraps negative start by num_rows, then
# adds length to start to get the end, then clamps both to
# [0, num_rows)
end = start + length
start = max(min(start, self.num_rows), 0)
end = max(min(end, self.num_rows), 0)
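The wrap-then-clamp rule quoted above can be sketched as a standalone function. This is a minimal pure-Python sketch, not the cudf_polars implementation: the function name `normalize_slice` and its signature are hypothetical, and `num_rows` stands in for `self.num_rows`.

```python
from __future__ import annotations


def normalize_slice(offset: int, length: int, num_rows: int) -> tuple[int, int]:
    """Turn a polars-style (offset, length) pair into clamped (start, end) bounds.

    Hypothetical helper that mirrors the quoted dataframe.py lines.
    """
    start = offset
    if start < 0:
        # A negative offset counts from the end.
        start += num_rows
    # Compute the end before clamping so the length is measured from the
    # wrapped start.
    end = start + length
    start = max(min(start, num_rows), 0)
    end = max(min(end, num_rows), 0)
    return start, end


print(normalize_slice(-3, 4, 10))    # (7, 10)
print(normalize_slice(-100, 3, 10))  # (0, 0)
print(normalize_slice(0, 1000, 10))  # (0, 10)
```

Computing `end` from the wrapped but not yet clamped `start` is what makes a far-negative offset such as -100 produce an empty slice rather than a slice of the first `length` rows.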
stop = Literal(
    expr_length.dtype,
    pa.scalar(
        expr_start.value.as_py() + expr_length.value.as_py(),
        type=pa.int32(),
    ),
).evaluate(df, context=context, mapping=mapping)
start = expr_start.evaluate(df, context=context, mapping=mapping)
Since you know you're making scalars, you can just call plc.interop.from_arrow on the pyarrow scalar for the one you computed.
(please merge trunk so that the cudf_polars test suite runs on this PR)
I see some test failures here, but they seem unrelated to these changes. Is there a blocking PR that needs to land before things pass here?
Possibly needs #16149?
Yeah, I'm seeing all of the same failures in a completely unrelated PR, #15904.
This was why I didn't want these tests to block everything. But I guess that touches pylibcudf.
Looking good! Some very minor nits
(-2, 2),
(-100, 3),
(0, 0),
(0, 1000),
Can we add a case where the computed start is negative and the computed stop is positive, to ensure that does the right thing? E.g. suppose you have (-3, 4): I think that will produce a start/stop pair of [-3, 1), which will be empty for long strings, whereas in polars it would slice from strlen - 3 to the end of the string.
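A small host-side sketch makes the concern concrete. Python string slicing is used here as a stand-in for a [start, stop) slice that accepts negative positions; that equivalence is an assumption for illustration, not a statement about libcudf. Both function names are hypothetical, and `wrapped_slice` applies the wrap-then-clamp rule from the dataframe slicing code quoted earlier in the review, assuming strings follow the same rule.

```python
def naive_slice(s: str, offset: int, length: int) -> str:
    # stop = start + length, with no wrapping of a negative start.
    # Assumption: Python's s[start:stop] models the [start, stop) slice.
    return s[offset : offset + length]


def wrapped_slice(s: str, offset: int, length: int) -> str:
    # Wrap a negative offset by the string length, add the length to get
    # the end, then clamp both bounds.
    n = len(s)
    start = offset + n if offset < 0 else offset
    end = start + length
    start = max(min(start, n), 0)
    end = max(min(end, n), 0)
    return s[start:end]


s = "abcdefgh"
print(repr(naive_slice(s, -3, 4)))    # '' because the pair is [-3, 1)
print(repr(wrapped_slice(s, -3, 4)))  # 'fgh', from strlen - 3 to the end
```

For pairs where start and stop have the same sign, such as (1, 3) or (-3, 1), the two functions agree, which is why this case is easy to miss.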
I added some logic for this.
@pytest.mark.parametrize(
    "offset,length", [(1, 3), (0, 3), (0, 0), (-3, 1), (-100, 5), (1, 1), (100, 100)]
Why are these parametrizations different to those with columns above? As there, can we particularly add a case where the computed start is negative but the computed stop is non-negative (e.g. (-3, 4) or (-2, 2))? I think those will not do the right thing.
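One way to see why the extra cases matter is to check which (offset, length) pairs can tell the two behaviours apart. The sketch below is illustrative only: it uses Python slicing as an assumed model of a [start, stop) slice, a single sample string, and hypothetical helper names. The first seven pairs are the parametrization quoted above; the last two are the suggested additions.

```python
def naive(s: str, offset: int, length: int) -> str:
    # stop = start + length with no handling of a negative start.
    return s[offset : offset + length]


def wrapped(s: str, offset: int, length: int) -> str:
    # Wrap a negative offset, compute the end, then clamp both bounds.
    n = len(s)
    start = offset + n if offset < 0 else offset
    end = start + length
    return s[max(min(start, n), 0) : max(min(end, n), 0)]


pairs = [(1, 3), (0, 3), (0, 0), (-3, 1), (-100, 5), (1, 1), (100, 100)]
pairs += [(-3, 4), (-2, 2)]  # the additions suggested in the review

sample = "abcdefgh"
distinguishing = [p for p in pairs if naive(sample, *p) != wrapped(sample, *p)]
print(distinguishing)  # [(-3, 4), (-2, 2)]
```

On this sample string, none of the original seven pairs separates the two behaviours, so a test suite built only from them would pass either way.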
Merged these
One fencepost error, and then I think we're good to go!
Co-authored-by: Lawrence Mitchell
Thanks Brandon!
/merge
This PR plumbs the libcudf/pylibcudf slice_strings function through to cudf-polars. Depends on #15988