
[Story] Integrate cuDF-Polars with the rapidsmpf streaming engine #20238

@rjzamora

Description


Goal & Background

This work depends on the features being tracked in rapidsai/rapidsmpf#461

cudf-polars currently uses a static task-graph representation to enable "streaming" and multi-GPU execution; in this issue, we refer to the existing engine as the task engine. The goal of this work is to enable a distinct, alternative rapidsmpf streaming engine in cudf-polars.

Companion Doc

Criteria

For this initial POC to be considered a success, it must deliver highly competitive performance for both single- and multi-GPU execution of the PDS-H benchmark suite.

Action Items

Translation

In order to leverage the rapidsmpf streaming engine in cudf-polars, we must be able to translate the Polars logical plan (i.e. the IR graph) into a valid streaming network. This fundamental work can be divided into several distinct steps:

  1. Basic configuration: Update ConfigOptions so that the user can opt into the "rapidsmpf" streaming engine. The new options do not actually need to be "supported" in this PR; we just want to iron out a decent UX for configuring the new engine.
  2. Basic single-GPU translation: Implement single-GPU translation with static-partitioning assumptions. In other words, we reuse most of the existing IR-lowering logic in cudf-polars, but generate a streaming network instead of a task graph for final execution.
  3. Dynamic partitioning: Introduce new (smart) nodes to align input data and partition output data dynamically. This step may require additional features in rapidsmpf to pass metadata through the streaming network and optionally shuffle/all-reduce data at runtime.
  4. Basic multi-GPU translation: Implement multi-GPU translation by injecting the necessary global communication primitives (or ensure the "smart" Join/GroupBy/HConcat nodes handle multiple ranks).
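For step 1, the opt-in UX might look something like the following sketch. All names here (the `StreamingConfig` class, the `runtime` field, the `"rapidsmpf"` value) are illustrative assumptions, not the final cudf-polars `ConfigOptions` API; the point is only that an unknown runtime should fail loudly at configuration time.

```python
# Hypothetical sketch of the step-1 opt-in UX (names are illustrative,
# not the final cudf-polars API): a small config object that validates
# the requested streaming runtime before any lowering begins.
from dataclasses import dataclass, field

VALID_RUNTIMES = {"tasks", "rapidsmpf"}


@dataclass
class StreamingConfig:
    """Illustrative stand-in for the relevant slice of ConfigOptions."""

    runtime: str = "tasks"
    executor_options: dict = field(default_factory=dict)

    def __post_init__(self):
        if self.runtime not in VALID_RUNTIMES:
            raise ValueError(
                f"unknown streaming runtime {self.runtime!r}; "
                f"expected one of {sorted(VALID_RUNTIMES)}"
            )


# Opting into the experimental engine:
config = StreamingConfig(runtime="rapidsmpf")
```

Validating eagerly keeps the failure mode simple: a bad engine name surfaces immediately rather than partway through plan translation.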

IO/stream Parallelism

In order to achieve competitive performance, it will be necessary to leverage multiple IO threads and multiple CUDA streams. Ideal features include:

  • A multi-threaded C++ read-parquet node in rapidsmpf
  • Multi-stream support in IR.do_evaluate (mostly orthogonal to the other cudf-polars changes in this story).
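The shape of the multi-threaded read idea can be sketched in plain Python. The real node would be implemented in C++ inside rapidsmpf; here `read_chunk` is a placeholder for a parquet row-group read, and the file name is made up.

```python
# Minimal sketch of IO parallelism: read row-group-sized chunks
# concurrently on a thread pool so IO overlaps with downstream work.
# `read_chunk` is a placeholder for a real parquet row-group read;
# the actual node would live in C++ inside rapidsmpf.
from concurrent.futures import ThreadPoolExecutor


def read_chunk(path, row_group):
    # Placeholder: a real implementation would decode one row group here.
    return (path, row_group, f"data[{path}:{row_group}]")


def read_parallel(path, n_row_groups, n_threads=4):
    with ThreadPoolExecutor(max_workers=n_threads) as pool:
        futures = [
            pool.submit(read_chunk, path, rg) for rg in range(n_row_groups)
        ]
        # Collect in submission order; a streaming node would instead
        # push each completed chunk downstream as soon as it is ready.
        return [f.result() for f in futures]


chunks = read_parallel("lineitem.parquet", n_row_groups=8)
```

A streaming-native version would forward completed chunks immediately rather than gathering them, which is what makes the multi-stream `IR.do_evaluate` work largely orthogonal to the rest of this story.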

Stability

We expect dynamic partitioning to help with memory stability, but many shuffle-heavy ETL pipelines will certainly require larger-than-GPU-memory computing. The only way to support these workflows (which are common in practice) is to implement robust end-to-end spilling.
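A toy model of the spilling idea is sketched below. The assumptions are loud: a fixed "device" byte budget, host memory as the spill target, and LRU eviction. The real implementation would spill device buffers to host/disk through RMM and rapidsmpf, not a Python dict.

```python
# Toy sketch of end-to-end spilling (assumptions: a fixed device byte
# budget, host memory as the spill target, LRU eviction). The real
# implementation would manage device buffers via RMM/rapidsmpf.
from collections import OrderedDict


class SpillableStore:
    def __init__(self, device_budget):
        self.device_budget = device_budget
        self.device = OrderedDict()  # LRU order: oldest entry first
        self.host = {}
        self.used = 0

    def put(self, key, buf):
        size = len(buf)
        # Evict least-recently-used device buffers to host until we fit.
        while self.used + size > self.device_budget and self.device:
            old_key, old_buf = self.device.popitem(last=False)
            self.host[old_key] = old_buf
            self.used -= len(old_buf)
        self.device[key] = buf
        self.used += size

    def get(self, key):
        if key in self.device:
            self.device.move_to_end(key)  # mark as recently used
            return self.device[key]
        buf = self.host.pop(key)  # unspill back to device on access
        self.put(key, buf)
        return buf


store = SpillableStore(device_budget=8)
store.put("a", b"xxxx")
store.put("b", b"yyyy")
store.put("c", b"zzzz")  # exceeds the budget, so "a" spills to host
```

The hard part in practice is not the bookkeeping but making every operator in the network tolerate its inputs being spilled, which is why the spilling must be end-to-end.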

Metadata


Labels: cudf-polars (Issues specific to cudf-polars), feature request (New feature or request)

Status: Todo
