Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[FEA] Serialization of libcudf classes and exposing implementation details #17630

Open
pentschev opened this issue Dec 19, 2024 · 4 comments
Open
Labels
feature request New feature or request

Comments

@pentschev
Copy link
Member

Is your feature request related to a problem? Please describe.
For multi-gpu polars, we will require serializing certain data in Python to be passed between Dask workers, for example aggregations. In #17469 I've proposed a way to do that, however, that proposal requires certain implementation details from aggreation.hpp,
more specifically classes derived from aggregation, such as std_var_aggregation. @vyasr has pointed out to the fact that those details are not exposed to pylibcudf and would be best if it continues like that.

Describe the solution you'd like
The solution proposed in #17469 seems to be the lowest hanging fruit, but as described above may not be considered optimal for several reasons.

Describe alternatives you've considered
Exposing attributes of the classes publicly may be an alternative, but that would incur in a different set of potential issues.

I'm not familiar with most of the design and options available in libcudf, so it's likely core developers will see other potentially better alternatives.

@rjzamora
Copy link
Member

rjzamora commented Jan 9, 2025

@pentschev - Is the challenge specific to to Aggregation expressions in cudf-polars (due to problems serializing pylibcudf.aggregation objects)?

Perhaps we just need to serialize the "name" of these requests when we serialize an Agg expression and just re-construct the request and op attributes during deserialization. @wence- - Does this sound reasonable?

@pentschev
Copy link
Member Author

  • Is the challenge specific to to Aggregation expressions in cudf-polars (due to problems serializing pylibcudf.aggregation objects)?

Yes.

Perhaps we just need to serialize the "name" of these requests when we serialize an Agg expression and just re-construct the request and op attributes during deserialization. @wence- - Does this sound reasonable?

No, that will not suffice, besides their names they have different underlying attributes that need to be serialized as well, see here the attributes I had to serialize for each different type.

@rjzamora
Copy link
Member

rjzamora commented Jan 9, 2025

I think we may be talking about different things. I'm saying that a new cudf-polars Agg object is constructed with a name argument (there are no pylibcudf.aggregation-based arguments). Therefore, I don't think we really need to serialize pylibcudf.aggregation object in order to serialize Agg objects. No? Maybe we still need to serialize aggregation objects in other places?

@pentschev
Copy link
Member Author

Thanks @rjzamora it turns you are (most likely) right. I was able to do that now and the aggregation tests pass:

$ python -m pytest --cache-clear tests --executor dask-experimental --dask-cluster -k test_agg
=========================================================================== test session starts ============================================================================
platform linux -- Python 3.12.8, pytest-7.4.4, pluggy-1.5.0
benchmark: 5.1.0 (defaults: timer=time.perf_counter disable_gc=False min_rounds=5 min_time=0.000005 max_time=1.0 calibration_precision=10 warmup=False warmup_iterations=100000)
configfile: pyproject.toml
plugins: anyio-4.8.0, hypothesis-6.123.13, benchmark-5.1.0, cases-3.8.6, cov-6.0.0, xdist-3.6.1
collected 3911 items / 3365 deselected / 546 selected

tests/expressions/test_agg.py ...................................................................................................................................... [ 24%]
.................................................................................................................................................................... [ 54%]
.................................................................................................................................................................... [ 84%]
.............................................................x.......ss..........xx.                                                                                 [100%]

======================================================= 541 passed, 2 skipped, 3365 deselected, 3 xfailed in 24.52s ========================================================

However, why it did work now is unclear to me, I was already attempting to follow the same path in this commit from November, either I overlooked some detail or something else changed in cudf-polars that ultimately helped with this after merging latest changes and doing a few more changes.

With that said, I think it's possible we indeed will not need to serialize aggregations, at least not the ones that are currently supported. I also have to check all other tests to see if there isn't any aggregation operation in tests other than test_agg.py that may still be failing.

@nirandaperera FYI

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
feature request New feature or request
Projects
None yet
Development

No branches or pull requests

2 participants