Implement distributed sorting for cudf_polars
#18912
base: branch-25.08
Conversation
Moved to 25.08
    ir.by,
    ir.order,
    ir.null_order,
    ir.config_options,
Just a note that you can pass in `rec.state["config_options"]` here. This way you don't need to add the `config_options` attribute to `Sort`. We only need to add that attribute if the IR object needs to access the configs within `generate_ir_tasks`.
    null_order + [plc.types.NullOrder.AFTER] * 2,
)
global_split_points = plc.Column.from_arrow(
    pa.array(
I think there's an ongoing effort to remove the hard requirement on pyarrow. I'm not sure what the alternative is, but perhaps `plc.Column.from_iterable_of_py`?
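For reference, a minimal sketch of the substitution being suggested; the exact `from_iterable_of_py` signature is an assumption here and may differ in pylibcudf:

```python
import pylibcudf as plc

values = [0, 10, 20]  # e.g. global split-point values

# Current approach from the diff above, which requires pyarrow:
# import pyarrow as pa
# col = plc.Column.from_arrow(pa.array(values))

# Suggested pyarrow-free alternative (assuming it accepts a plain Python iterable):
col = plc.Column.from_iterable_of_py(values)
```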
# the partition id and the local row number of the final split values
*split_values, split_part_id, split_local_row = split_values.columns()
split_values = plc.Table(split_values)
# Now we find the first and last row in the local table corresponding to the split value
Do you think this is common and expensive enough to merit a function that computes the lower and upper bounds in a single pass?
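For context, a pure-Python toy of the lower/upper-bound pair this comment refers to, with `bisect` standing in for the device-side searches; a fused helper would return both cuts in one pass over the sorted data:

```python
import bisect

local_sorted = [1, 3, 3, 3, 7, 9]  # locally sorted key column
split_value = 3

lo = bisect.bisect_left(local_sorted, split_value)   # first row >= split value -> 1
hi = bisect.bisect_right(local_sorted, split_value)  # first row >  split value -> 4
print(lo, hi)
```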
Took a first pass. Thanks for working on this @seberg!
shuffle_method = ir.config_options.executor.shuffle_method

by = [ne.value.name for ne in ir.by if isinstance(ne.value, Col)]
if len(by) != len(ir.by):
- if len(by) != len(ir.by):
+ if len(by) != len(ir.by):  # pragma: no cover
    raise NotImplementedError("Sorting columns must be column names.")

sort_boundaries_name, graph = _sort_boundaries_graph(
    get_key_name(ir.children[0]),
Nit: Minor preference to do `child, = ir.children` earlier in this function, and then refer to `child` instead of `ir.children[0]` throughout.
except (ImportError, ValueError) as err:
    # ImportError: rapidsmpf is not installed
    # ValueError: rapidsmpf couldn't find a distributed client
    if shuffle_method == "rapidsmpf":
Probably need a `# pragma: no cover` here?
    row id columns). The columns are already in the order of `by`.
    """
    df = df.select(by)
    candidates = [i * df.num_rows // num_partitions for i in range(num_partitions)]
Maybe `candidates = range(0, df.num_rows, df.num_rows // num_partitions)`?
I suppose I did that because it gets fewer rounding errors towards the end. But the worst case for that is probably the last partition being too big by `num_partitions / 2`.
(Kept it for now, but happy to change.)
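To make the rounding argument concrete, a small illustration (not from the PR) with numbers that don't divide evenly:

```python
num_rows, num_partitions = 103, 10

# PR version: exactly num_partitions candidates, with the remainder rows spread out.
pr_candidates = [i * num_rows // num_partitions for i in range(num_partitions)]
assert pr_candidates == [0, 10, 20, 30, 41, 51, 61, 72, 82, 92]

# Suggested range() version: the truncating step does not spread the remainder,
# and here it even yields one candidate too many (11 instead of 10).
alt_candidates = list(range(0, num_rows, num_rows // num_partitions))
assert len(alt_candidates) == 11
```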
    Shuffling is performed by extracting sort boundary candidates from all partitions,
    sharing them all-to-all and then exchanging data accordingly.
    The sorting information is required to be passed in identically
    to the initial sort.
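A single-process, pure-Python toy of the scheme this docstring describes (boundary candidates taken from each locally sorted partition, pooled into global split points, then rows routed by partition); this is only an illustration, not the PR's implementation:

```python
import bisect

# Each input partition is locally sorted.
partitions = [[1, 4, 9, 12], [2, 2, 7, 30], [0, 5, 6, 8]]
num_out = len(partitions)

# 1) Every partition contributes evenly spaced boundary candidates.
candidates = sorted(
    part[i * len(part) // num_out] for part in partitions for i in range(num_out)
)
# 2) Global split points are picked from the pooled candidates (the all-to-all step).
splits = [candidates[i * len(candidates) // num_out] for i in range(1, num_out)]

# 3) Each partition routes its rows to output partitions based on the split points.
out = [[] for _ in range(num_out)]
for part in partitions:
    cuts = [0, *(bisect.bisect_left(part, s) for s in splits), len(part)]
    for dest in range(num_out):
        out[dest].extend(part[cuts[dest] : cuts[dest + 1]])

# Sorting each output partition locally now yields the globally sorted data.
assert sorted(v for p in partitions for v in p) == [v for p in out for v in sorted(p)]
```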
Can we somehow establish as clearly as possible that the child of this IR node must have locally sorted partitions? I think you have stated this both implicitly and explicitly in other places. I suspect it would be good information to have here as well.
    The reason for much of the complexity is to get the result sizes as
    precise as possible even when e.g. all values are equal.
    In other words, this goes through extra effort to split the data at the
    precise boundaries (which includes part_id and local_row_number).
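A tiny pure-Python illustration of why the split values carry the partition id and local row number: with duplicate keys, searching on the tuple still gives an exact cut (the data here is made up):

```python
import bisect

part_id = 1
local = [(5, part_id, i) for i in range(8)]  # all key values equal to 5
split_point = (5, part_id, 3)                # split chosen at partition 1, local row 3

cut = bisect.bisect_left(local, split_point)
assert cut == 3  # exact split despite every key value being equal
```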
Okay - I was initially confused by the fact that this function wasn't already applied to the concatenated result in `_sort_boundaries_graph` (before being broadcast to the shuffler insertion).
If I understand correctly, this is the motivation. We may be doing redundant work for every input partition, but this allows us to handle pathological data distributions. How much of an overhead do you think this is for uniformly-distributed data (where local adjustments are unnecessary)?
The results of `_sort_boundaries_graph` (I'll rename it to `_split_boundary_candidates_graph`, I think) are the values at which we wish to split.
To find the indices at which to split, we also need the locally sorted data, so that is a clear second step (and the one that adds the complexity, so it is not avoidable).
Let me pull out the sort+extract part and move it into `_sort_boundaries_graph`. I suppose right now dask is likely pulling it all into one worker anyway and then distributing, so we might as well sort+extract there.
Plus, I thought a bit more about the case where we have one partition per worker/GPU: repeating that work locally seems fine there, but if we have many partitions per worker/GPU it may be nice not to.
> How much of an overhead do you think this is for uniformly-distributed data (where local adjustments are unnecessary)?
Ok, the above comment was about something else. Doing this double work to find the right split points seemed to add maybe 10% overhead in the worst case (data already sorted, so no exchange happens and the sort is fast) -- not an exact science right now.
If that seems undesirable, maybe the solution is to rewrite it by extracting the points and doing an equality check instead, since that avoids the second full binary search.
This implements distributed sorting for cudf-polars. It should work, but it is missing new sorting tests and I am also not quite sure that the local tests exercised the `rapidsmpf` paths properly.

Right now, it is structured around introducing a `ShuffleSorted` IR, but maybe it would be nicer to merge the steps further.

It does some smaller refactors to the shuffling to re-use some code there (but does not actually move too much sorting-related logic into that file).
Main missing things:

- We use `concat` in both paths (`rapidsmpf` and not) for merging the exchanged chunks. If the sort isn't stable, then this should use `plc...merge`; this will require changing `rapidsmpf` a bit more (see the toy sketch after this list).
- `zlice` handling (top/bottom limits). For small values there is probably little point, but for large result slices this needs to be threaded in.
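As a toy sketch of the concat-vs-merge point in the first item above (pure Python, with `heapq.merge` standing in for whatever device-side merge would be used; this is not the PR's code):

```python
import heapq

# Each exchanged chunk arrives locally sorted.
chunks = [[1, 4, 9], [2, 2, 7], [0, 5, 6]]

merged = list(heapq.merge(*chunks))            # one pass, result stays sorted
concatenated = [v for c in chunks for v in c]  # plain concat is not globally sorted

assert merged == sorted(concatenated)
```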