@rjzamora (Member) commented Jan 15, 2026

Description

  • Changes `GroupBy` and `Unique` behavior to preserve partition count by default
    • Hints or statistics are needed to reduce the partition count
  • Adds a new "selectivity-hints" configuration option
    • Users can specify a mapping between specific operations (by _ir_repr label) and their "selectivity".
    • Used to reduce the output partition count for labeled GroupBy and Union
    • Used to introduce a Repartition after other labeled nodes
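
To make the hint mechanism concrete, here is a minimal sketch in plain Python. It is not the cudf-polars implementation: the function name `output_partition_count`, the example label string, and the rule that the partition count scales in proportion to the selectivity are all assumptions made for illustration. It only shows the default (preserve the partition count) versus the hinted case (reduce it).

```python
import math


def output_partition_count(label, input_partitions, selectivity_hints):
    """Hypothetical helper: choose an output partition count for one node.

    Assumes a selectivity is the fraction of input rows the node is
    expected to emit, and that partitions shrink proportionally.
    """
    selectivity = selectivity_hints.get(label)
    if selectivity is None:
        # No hint for this node: preserve the partition count by default.
        return input_partitions
    # Scale by the hinted selectivity, clamped to [1, input_partitions].
    scaled = math.ceil(input_partitions * selectivity)
    return max(1, min(input_partitions, scaled))


# Hypothetical label; real labels would come from the node's _ir_repr.
hints = {"GROUPBY example-label": 0.25}

print(output_partition_count("GROUPBY example-label", 64, hints))
print(output_partition_count("GROUPBY unlabeled", 64, hints))
```

Under these assumptions the labeled node goes from 64 partitions to 16, while the unlabeled node keeps all 64.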

Notes:

  • New pdsh hints were informed by "profiling" the row count processed by each IR node in an sf300 run

Performance:

  • There are probably still opportunities for improvement, but this branch is giving good 8xH100 sf3000 numbers for TPC-H (especially q9!):

  Iteration Summary
=======================================
...
=======================================
query: 9
path: /raid/rapidsmpf/data/tpch/scale-3000
scale_factor: 3000
executor: streaming
stream_policy: None
runtime: rapidsmpf
cluster: distributed
blocksize: 2000000000
shuffle_method: None
broadcast_join_limit: 3
stats_planning: False
native_parquet: False
n_workers: 8
threads: 1
rmm_async: True
rapidsmpf_oom_protection: False
spill_device: 0.5
rapidsmpf_spill: False
iterations: 2
---------------------------------------
min time : 6.4565
max time : 6.5054
mean time: 6.4809
=======================================
...
=======================================
Total mean time across all queries: 189.6582 seconds

Checklist

  • I am familiar with the Contributing Guidelines.
  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.

@rjzamora rjzamora self-assigned this Jan 15, 2026
@rjzamora rjzamora added the "2 - In Progress" label Jan 15, 2026
@rjzamora rjzamora added the "Performance", "improvement", "non-breaking", and "cudf-polars" labels Jan 15, 2026
copy-pr-bot bot commented Jan 15, 2026

Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.


@github-actions github-actions bot added the "Python" label Jan 15, 2026
@GPUtester GPUtester moved this to In Progress in cuDF Python Jan 15, 2026
@rjzamora (Member Author) commented

/ok to test

@rjzamora rjzamora changed the title [WIP][DEMO] Use "selectivity-hints" for query planning [WIP] Use "selectivity-hints" for query planning Jan 15, 2026