[FEA] Serialization of libcudf classes and exposing implementation details #17630

pentschev · 2024-12-19T15:39:28Z

Is your feature request related to a problem? Please describe.
For multi-gpu polars, we will require serializing certain data in Python to be passed between Dask workers, for example aggregations. In #17469 I've proposed a way to do that, however, that proposal requires certain implementation details from aggreation.hpp,
more specifically classes derived from aggregation, such as std_var_aggregation. @vyasr has pointed out to the fact that those details are not exposed to pylibcudf and would be best if it continues like that.

Describe the solution you'd like
The solution proposed in #17469 seems to be the lowest hanging fruit, but as described above may not be considered optimal for several reasons.

Describe alternatives you've considered
Exposing attributes of the classes publicly may be an alternative, but that would incur in a different set of potential issues.

I'm not familiar with most of the design and options available in libcudf, so it's likely core developers will see other potentially better alternatives.

The text was updated successfully, but these errors were encountered:

rjzamora · 2025-01-09T20:45:50Z

@pentschev - Is the challenge specific to to Aggregation expressions in cudf-polars (due to problems serializing pylibcudf.aggregation objects)?

Perhaps we just need to serialize the "name" of these requests when we serialize an Agg expression and just re-construct the request and op attributes during deserialization. @wence- - Does this sound reasonable?

pentschev · 2025-01-09T21:24:28Z

Is the challenge specific to to Aggregation expressions in cudf-polars (due to problems serializing pylibcudf.aggregation objects)?

Yes.

Perhaps we just need to serialize the "name" of these requests when we serialize an Agg expression and just re-construct the request and op attributes during deserialization. @wence- - Does this sound reasonable?

No, that will not suffice, besides their names they have different underlying attributes that need to be serialized as well, see here the attributes I had to serialize for each different type.

rjzamora · 2025-01-09T21:33:28Z

I think we may be talking about different things. I'm saying that a new cudf-polars Agg object is constructed with a name argument (there are no pylibcudf.aggregation-based arguments). Therefore, I don't think we really need to serialize pylibcudf.aggregation object in order to serialize Agg objects. No? Maybe we still need to serialize aggregation objects in other places?

pentschev · 2025-01-10T21:28:18Z

Thanks @rjzamora it turns you are (most likely) right. I was able to do that now and the aggregation tests pass:

$ python -m pytest --cache-clear tests --executor dask-experimental --dask-cluster -k test_agg
=========================================================================== test session starts ============================================================================
platform linux -- Python 3.12.8, pytest-7.4.4, pluggy-1.5.0
benchmark: 5.1.0 (defaults: timer=time.perf_counter disable_gc=False min_rounds=5 min_time=0.000005 max_time=1.0 calibration_precision=10 warmup=False warmup_iterations=100000)
configfile: pyproject.toml
plugins: anyio-4.8.0, hypothesis-6.123.13, benchmark-5.1.0, cases-3.8.6, cov-6.0.0, xdist-3.6.1
collected 3911 items / 3365 deselected / 546 selected

tests/expressions/test_agg.py ...................................................................................................................................... [ 24%]
.................................................................................................................................................................... [ 54%]
.................................................................................................................................................................... [ 84%]
.............................................................x.......ss..........xx.                                                                                 [100%]

======================================================= 541 passed, 2 skipped, 3365 deselected, 3 xfailed in 24.52s ========================================================

However, why it did work now is unclear to me, I was already attempting to follow the same path in this commit from November, either I overlooked some detail or something else changed in cudf-polars that ultimately helped with this after merging latest changes and doing a few more changes.

With that said, I think it's possible we indeed will not need to serialize aggregations, at least not the ones that are currently supported. I also have to check all other tests to see if there isn't any aggregation operation in tests other than test_agg.py that may still be failing.

@nirandaperera FYI

pentschev added the feature request New feature or request label Dec 19, 2024

pentschev mentioned this issue Dec 19, 2024

Support Aggregation serialization in pylibcudf #17469

Open

3 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[FEA] Serialization of libcudf classes and exposing implementation details #17630

[FEA] Serialization of libcudf classes and exposing implementation details #17630

pentschev commented Dec 19, 2024

rjzamora commented Jan 9, 2025

pentschev commented Jan 9, 2025

rjzamora commented Jan 9, 2025 •

edited

Loading

pentschev commented Jan 10, 2025

[FEA] Serialization of libcudf classes and exposing implementation details #17630

[FEA] Serialization of libcudf classes and exposing implementation details #17630

Comments

pentschev commented Dec 19, 2024

rjzamora commented Jan 9, 2025

pentschev commented Jan 9, 2025

rjzamora commented Jan 9, 2025 • edited Loading

pentschev commented Jan 10, 2025

rjzamora commented Jan 9, 2025 •

edited

Loading