
[Story] Integrate cuDF-Polars with the rapidsmpf streaming engine #20238

@rjzamora

Description


Goal & Background

This work depends on the features being tracked in rapidsai/rapidsmpf#461

cudf-polars currently uses a static task-graph representation to enable "streaming" and multi-GPU execution; in this issue, we refer to the existing engine as the task engine. The goal of this work is to enable a distinct, alternative rapidsmpf streaming engine in cudf-polars.

Companion Doc

Criteria

For this initial POC to be considered a success, it must deliver highly competitive performance for both single- and multi-GPU execution of the PDS-H benchmark suite.

Action Items

Translation

In order to leverage the rapidsmpf streaming engine in cudf-polars, we must be able to translate the Polars logical plan (i.e. the IR graph) into a valid streaming network. This fundamental work can be divided into several distinct steps:

  1. Basic configuration: Update ConfigOptions so that the user can opt into the "rapidsmpf" streaming engine. The new options do not actually need to be "supported" in this PR; we just want to iron out a decent UX for configuring the new engine.
  2. Basic single-GPU translation: Implement single-GPU translation with static-partitioning assumptions. In other words, we reuse most of the existing IR-lowering logic in cudf-polars, but generate a streaming network instead of a task graph for final execution.
  3. Dynamic partitioning: Introduce new (smart) nodes to align input data and partition output data dynamically. This step may require additional features in rapidsmpf to pass metadata through the streaming network and optionally shuffle/all-reduce data at runtime.
  4. Basic multi-GPU translation: Implement multi-GPU translation by injecting the necessary global communication primitives (or ensure the "smart" Join/GroupBy/HConcat nodes handle multiple ranks).
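For step 1, the opt-in UX might look something like the following sketch. All names here (the `StreamingConfig` class, the `runtime` field, the `"rapidsmpf"` value) are illustrative assumptions, not the final cudf-polars `ConfigOptions` API; the point is only that an unknown runtime should fail loudly at configuration time.

```python
# Hypothetical sketch of the step-1 opt-in UX (names are illustrative,
# not the final cudf-polars API): a small config object that validates
# the requested streaming runtime before any lowering begins.
from dataclasses import dataclass, field

VALID_RUNTIMES = {"tasks", "rapidsmpf"}


@dataclass
class StreamingConfig:
    """Illustrative stand-in for the relevant slice of ConfigOptions."""

    runtime: str = "tasks"
    executor_options: dict = field(default_factory=dict)

    def __post_init__(self):
        if self.runtime not in VALID_RUNTIMES:
            raise ValueError(
                f"unknown streaming runtime {self.runtime!r}; "
                f"expected one of {sorted(VALID_RUNTIMES)}"
            )


# Opting into the experimental engine:
config = StreamingConfig(runtime="rapidsmpf")
```

Validating eagerly keeps the failure mode simple: a bad engine name surfaces immediately rather than partway through plan translation.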

IO/stream Parallelism

In order to achieve competitive performance, it will be necessary to leverage multiple IO threads and multiple CUDA streams. Ideal features include:

  • A multi-threaded C++ read-parquet node in rapidsmpf
  • Multi-stream support in IR.do_evaluate (mostly orthogonal to the other cudf-polars changes in this story).
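The shape of the multi-threaded read idea can be sketched in plain Python. The real node would be implemented in C++ inside rapidsmpf; here `read_chunk` is a placeholder for a parquet row-group read, and the file name is made up.

```python
# Minimal sketch of IO parallelism: read row-group-sized chunks
# concurrently on a thread pool so IO overlaps with downstream work.
# `read_chunk` is a placeholder for a real parquet row-group read;
# the actual node would live in C++ inside rapidsmpf.
from concurrent.futures import ThreadPoolExecutor


def read_chunk(path, row_group):
    # Placeholder: a real implementation would decode one row group here.
    return (path, row_group, f"data[{path}:{row_group}]")


def read_parallel(path, n_row_groups, n_threads=4):
    with ThreadPoolExecutor(max_workers=n_threads) as pool:
        futures = [
            pool.submit(read_chunk, path, rg) for rg in range(n_row_groups)
        ]
        # Collect in submission order; a streaming node would instead
        # push each completed chunk downstream as soon as it is ready.
        return [f.result() for f in futures]


chunks = read_parallel("lineitem.parquet", n_row_groups=8)
```

A streaming-native version would forward completed chunks immediately rather than gathering them, which is what makes the multi-stream `IR.do_evaluate` work largely orthogonal to the rest of this story.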

Stability

We expect dynamic partitioning to help with memory stability, but many shuffle-heavy ETL pipelines will certainly require larger-than-GPU-memory computing. The only way to support these workflows (which are common in practice) is to implement robust end-to-end spilling.
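A toy model of the spilling idea is sketched below. The assumptions are loud: a fixed "device" byte budget, host memory as the spill target, and LRU eviction. The real implementation would spill device buffers to host/disk through RMM and rapidsmpf, not a Python dict.

```python
# Toy sketch of end-to-end spilling (assumptions: a fixed device byte
# budget, host memory as the spill target, LRU eviction). The real
# implementation would manage device buffers via RMM/rapidsmpf.
from collections import OrderedDict


class SpillableStore:
    def __init__(self, device_budget):
        self.device_budget = device_budget
        self.device = OrderedDict()  # LRU order: oldest entry first
        self.host = {}
        self.used = 0

    def put(self, key, buf):
        size = len(buf)
        # Evict least-recently-used device buffers to host until we fit.
        while self.used + size > self.device_budget and self.device:
            old_key, old_buf = self.device.popitem(last=False)
            self.host[old_key] = old_buf
            self.used -= len(old_buf)
        self.device[key] = buf
        self.used += size

    def get(self, key):
        if key in self.device:
            self.device.move_to_end(key)  # mark as recently used
            return self.device[key]
        buf = self.host.pop(key)  # unspill back to device on access
        self.put(key, buf)
        return buf


store = SpillableStore(device_budget=8)
store.put("a", b"xxxx")
store.put("b", b"yyyy")
store.put("c", b"zzzz")  # exceeds the budget, so "a" spills to host
```

The hard part in practice is not the bookkeeping but making every operator in the network tolerate its inputs being spilled, which is why the spilling must be end-to-end.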

Metadata


Labels: cudf-polars (Issues specific to cudf-polars), feature request (New feature or request)

Status: Todo
