Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Prevent pylibcudf serialization in cudf-polars #17449

Open
wants to merge 14 commits into
base: branch-25.02
Choose a base branch
from

Conversation

pentschev
Copy link
Member

Description

Reorganize cudf-polars implementation to prevent the need to serialize pylibcudf objects.

Checklist

  • I am familiar with the Contributing Guidelines.
  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.

@pentschev pentschev added feature request New feature or request 2 - In Progress Currently a work in progress non-breaking Non-breaking change cudf.polars Issues specific to cudf.polars labels Nov 26, 2024
@pentschev pentschev requested a review from a team as a code owner November 26, 2024 15:46
@github-actions github-actions bot added the Python Affects Python cuDF API. label Nov 26, 2024
Comment on lines 34 to 71
class Operator(IntEnum):
"""Internal and picklable representation of pylibcudf's `BinaryOperator`."""

ADD = auto()
ATAN2 = auto()
BITWISE_AND = auto()
BITWISE_OR = auto()
BITWISE_XOR = auto()
DIV = auto()
EQUAL = auto()
FLOOR_DIV = auto()
GENERIC_BINARY = auto()
GREATER = auto()
GREATER_EQUAL = auto()
INT_POW = auto()
INVALID_BINARY = auto()
LESS = auto()
LESS_EQUAL = auto()
LOGICAL_AND = auto()
LOGICAL_OR = auto()
LOG_BASE = auto()
MOD = auto()
MUL = auto()
NOT_EQUAL = auto()
NULL_EQUALS = auto()
NULL_LOGICAL_AND = auto()
NULL_LOGICAL_OR = auto()
NULL_MAX = auto()
NULL_MIN = auto()
NULL_NOT_EQUALS = auto()
PMOD = auto()
POW = auto()
PYMOD = auto()
SHIFT_LEFT = auto()
SHIFT_RIGHT = auto()
SHIFT_RIGHT_UNSIGNED = auto()
SUB = auto()
TRUE_DIV = auto()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What?

import pylibcudf as plc
import pickle
assert pickle.loads(pickle.dumps(plc.binaryop.BinaryOperator.ADD)) == plc.binaryop.BinaryOperator.ADD

Works fine.

Comment on lines -77 to 80
self._regex_program = plc.strings.regex_program.RegexProgram.create(
plc.strings.regex_program.RegexProgram.create(
pattern,
flags=plc.strings.regex_flags.RegexFlags.DEFAULT,
)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I kind of hate this, because we have to do this twice now, once for control-flow here, and then once to actually use the program.

@pentschev
Copy link
Member Author

pentschev commented Nov 29, 2024

The CI errors look related to this PR but I don't think they are in fact. I can locally reproduce them only with upstream Dask/Distributed, but not with dask=2024.11.2 distributed=2024.11.2 dask-expr=1.1.19, it is also reproducible with branch-25.02 without changes from this PR.

EDIT: Commented in wrong PR, this was meant for #17469 .

@@ -947,6 +944,7 @@ def do_evaluate(
# TODO: uniquify
requests = []
replacements: list[expr.Expr] = []
agg_infos = [req.collect_agg(depth=0) for req in agg_requests]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is definitely a good way to avoid AggInfo serialization. The disadvantage is that we don't raise an error from collect_agg when the GroupBy object is created at translation time.

Comment on lines 890 to -892
options,
self.agg_infos,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I also found that options can be an unserializable Polars objects (though the only thing we actually use in do_evaluate is options.slice).

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right, do you see any tests we could rely immediately upon for this case?

f"Unreachable, supported agg {name=} has no implementation"
) # pragma: no cover
self.op = op
self.request = None
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we need to somehow validate that the name is supported within __init__ so that we catch a problem at translation time.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@@ -124,6 +77,46 @@ def __init__(
"linear": plc.types.Interpolation.LINEAR,
}

def _fill_request(self):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe we can just define request as a property (or cached property), and move this same logic there?

@property
def request(self):
   ...
``

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good idea. Done in faf42d0 .

@pentschev
Copy link
Member Author

After discussing with @wence- , he suggested the appropriate way to deal with GroupbyOptions would be to create a wrapper class that allows us to serialization the options. With the current test set, we're now down to only 43 failing tests with Dask cluster:

43 failed, 3780 passed, 2 skipped, 86 xfailed in 138.63s (0:02:18)

Comment on lines +847 to +848
self.dynamic = polars_groupby_options.dynamic
self.rolling = polars_groupby_options.rolling
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Eventually these will also need translated, but for now we can dodge it because they are always None.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Indeed, I couldn't find any examples in our tests to aid that, so for now I left it as above to make things simpler.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
2 - In Progress Currently a work in progress cudf.polars Issues specific to cudf.polars feature request New feature or request non-breaking Non-breaking change Python Affects Python cuDF API.
Projects
Status: In Progress
Development

Successfully merging this pull request may close these issues.

3 participants