[Data] Added distinct function #53460

soffer-anyscale · 2025-05-31T20:32:33Z

Why are these changes needed?

Distinct is one of the most-needed functions to improve Ray Data user experience. There currently isn't a way to dedupe data out-of-the-box with Ray Data, which is one of the most common ETL functions.

Checks

I've signed off every commit(by using the -s flag, i.e., git commit -s) in this PR.
I've run scripts/format.sh to lint the changes in this PR.
I've included any doc changes needed for https://docs.ray.io/en/master/.
- I've added any new APIs to the API Reference. For example, if I added a
  method in Tune, I've added it in doc/source/tune/api/ under the
  corresponding .rst file.
I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
Testing Strategy
- Unit tests
- Release tests
- This PR is not tested :(

Note

Adds Dataset.distinct(keys=None) to remove duplicate rows (optionally by subset of columns), with unit tests and Bazel test target.

Data API:
- Dataset.distinct(keys=None): New public API to drop duplicate rows; supports dedup across all columns or a specified subset; validates inputs; implemented via groupby(...).map_groups() with Arrow batches; includes docstring examples.
Tests/Build:
- Add python/ray/data/tests/test_distinct.py covering global, subset, edge cases, and error handling.
- Register py_test(name="test_distinct") in python/ray/data/BUILD.bazel.

^{Written by Cursor Bugbot for commit 030df62. This will update automatically on new commits. Configure here.}

Signed-off-by: soffer-anyscale <[email protected]>

python/ray/data/dataset.py

Signed-off-by: soffer-anyscale <[email protected]>

python/ray/data/dataset.py

Signed-off-by: Hao Chen <[email protected]>

Signed-off-by: soffer-anyscale <[email protected]>

python/ray/data/dataset.py

alexeykudinkin

Reqeust

python/ray/data/dataset.py

Signed-off-by: soffer-anyscale <[email protected]>

…to data_distinct Signed-off-by: soffer-anyscale <[email protected]>

Signed-off-by: soffer-anyscale <[email protected]>

- Fix r[keys] to iterate over keys: tuple(r[k] for k in keys) - Fix empty keys case to use all columns instead of constant (0,) - Prevents incorrect list indexing and improper deduplication Signed-off-by: soffer-anyscale <[email protected]>

- Remove local 'import pyarrow as pa' at line 6085 (to_arrow_refs method) - Remove local 'import pyarrow as pa' at line 6738 (Schema.types property) - These imports are redundant as pa is already imported at module level (line 26) Signed-off-by: soffer-anyscale <[email protected]>

- Remove unreliable hasattr check in _key_fn fallback - TableRow is always a Mapping with .values() method - Previous fallback tuple(r) would iterate keys not values - Ensures consistent key extraction for all-column deduplication Signed-off-by: soffer-anyscale <[email protected]>

Applied auto-fixes from pre-commit hooks: 1. Fixed end-of-file issues (added missing newlines): - python/ray/data/_internal/planner/exchange/distinct_task_spec.py - python/ray/data/_internal/planner/distinct.py 2. Applied black formatting: - python/ray/data/_internal/planner/exchange/distinct_task_spec.py All pre-commit checks now passing: - ruff: no errors - black: formatted - pydoclint: all docstrings valid - buildifier: passed - trailing whitespace: clean - end of files: clean - Ray import order: valid Signed-off-by: soffer-anyscale <[email protected]>

python/ray/data/tests/test_distinct.py

@richardliaw

- Add ray_start_regular_shared fixture to all 9 test functions for proper cluster cleanup - Fix test_distinct_across_blocks to use sample_data fixture instead of inline table - Add ray.data.tests.conftest import for consistency with other Ray Data tests Addresses PR feedback from @richardliaw Signed-off-by: soffer-anyscale <[email protected]>

python/ray/data/tests/test_distinct.py

python/ray/data/_internal/planner/exchange/distinct_task_spec.py

python/ray/data/_internal/planner/distinct.py

python/ray/data/_internal/table_block.py

@richardliaw

- Remove sample_data parameter from test_distinct_all_unique (unused) - Add test_distinct_with_multiple_blocks_range: tests with 10 blocks and union - Add test_distinct_with_varying_block_sizes: tests blocks of different sizes - Add test_distinct_subset_with_multiple_blocks: tests keys parameter with multiple blocks - Add test_distinct_large_dataset_multiple_blocks: tests larger dataset (150 items, 10 blocks) Addresses PR feedback from @richardliaw for more multi-block test coverage Signed-off-by: soffer-anyscale <[email protected]>

@richardliaw

This is a major architectural change from sort-based to hash-based distinct: - Add HashDistinctOperator: extends HashShufflingOperatorBase - Add DistinctAggregation: implements StatefulShuffleAggregation - Uses hash-based partitioning (same key hash -> same partition) - Deduplicates within partitions using dict-based lookup - Update plan_all_to_all_op to use hash shuffle when shuffle_strategy=HASH_SHUFFLE Benefits of hash-based approach: - Simpler implementation (no merge-sort complexity) - Better performance for deduplication workloads - Aligns with modern Ray Data infrastructure - Eliminates bugs in sort-based empty block handling Addresses PR feedback from @richardliaw to use AggregateFnV2 and hash shuffles instead of older sort-based TaskSpec infrastructure Signed-off-by: soffer-anyscale <[email protected]>

- Get all column names from schema when distinct keys parameter is None - Pass actual column names to hash partitioning (cannot be empty) - Simplify DistinctAggregation logic since key_columns is always provided - Add assertion to ensure key_columns is non-empty in HashDistinctOperator This ensures hash partitioning works correctly for both: - .distinct() - deduplicates on all columns - .distinct(keys=[...]) - deduplicates on specified columns Signed-off-by: soffer-anyscale <[email protected]>

- Remove trailing whitespace from docstrings - Fix import ordering (stdlib before third-party) - Remove unused imports (GiB, MiB) Signed-off-by: soffer-anyscale <[email protected]>

Code Quality Improvements: - Add comprehensive module-level docstring explaining hash-based algorithm - Enhance class and method docstrings with detailed explanations - Add inline comments explaining memory estimation logic - Document O(1) dict lookup performance characteristics - Improve code readability with better variable names and structure Bug Fixes (addressing cursor[bot] feedback): - Fix empty blocks handling in DistinctTaskSpec.reduce (IndexError) - Fix empty blocks handling in _merge_sorted_blocks_and_keep_first (IndexError) - Add empty blocks checks before accessing blocks[0] or normalized_blocks[0] These fixes prevent crashes when reduce tasks receive no input blocks during distributed shuffle operations. Signed-off-by: soffer-anyscale <[email protected]>

Remove obsolete sort-based distinct implementation: - Delete distinct.py (sort-based planner function) - Delete distinct_task_spec.py (sort-based TaskSpec) - Remove _merge_sorted_blocks_and_keep_first method (only used by sort-based) - Remove fallback logic from plan_all_to_all_op.py Hash-based distinct is now the only implementation. This simplifies the codebase and removes unnecessary complexity. Hash shuffle is the default strategy and is more efficient for deduplication workloads. Benefits: - Cleaner codebase with single implementation path - No confusing fallback logic - Reduced maintenance burden - Aligns with Ray Data's modern infrastructure Signed-off-by: soffer-anyscale <[email protected]>

python/ray/data/_internal/planner/plan_all_to_all_op.py

Remove trailing whitespace from blank lines in docstrings to pass ruff checks Signed-off-by: soffer-anyscale <[email protected]>

Add proper error handling for schema inference failures in _plan_hash_shuffle_distinct. Now raises clear ValueError if schema cannot be inferred or is empty, preventing None from being passed to HashDistinctOperator. Signed-off-by: soffer-anyscale <[email protected]>

python/ray/data/_internal/planner/plan_all_to_all_op.py

python/ray/data/dataset.py

Remove expensive count() call that forced full dataset materialization for empty check. Distinguish between schema inference failure (None) and empty column list when validating dataset schema, providing clearer error messages. Signed-off-by: soffer-anyscale <[email protected]>

richardliaw · 2025-10-13T22:08:17Z

python/ray/data/_internal/execution/operators/distinct_aggregation.py

+    def finalize(self, partition_id: int) -> Block:
+        """Finalize deduplication for a partition and return the result block.
+
+        Converts the deduplicated rows stored in the dictionary back into a
+        PyArrow table block.
+
+        Args:
+            partition_id: Partition ID to finalize.
+
+        Returns:
+            Deduplicated block for this partition as a PyArrow table.
+        """
+        seen_keys = self._partition_data[partition_id]
+
+        # Build output block from deduplicated rows
+        builder = ArrowBlockBuilder()
+        for row in seen_keys.values():
+            builder.add(row)
+
+        return builder.build()
+
+    def clear(self, partition_id: int):
+        """Clear accumulated state for a partition.
+
+        Args:
+            partition_id: Partition ID to clear.
+        """
+        self._partition_data.pop(partition_id, None)


I don't think you need this; just use https://docs.ray.io/en/latest/data/api/doc/ray.data.aggregate.AggregateFnV2.html

Reformat assert statement and remove trailing newline to pass ruff checks. Signed-off-by: soffer-anyscale <[email protected]>

cursor · 2025-10-13T22:23:42Z

python/ray/data/_internal/logical/operators/all_to_all_operator.py

+
+    def infer_schema(self) -> Optional["Schema"]:
+        """The output schema is the same as the input schema."""
+        return self._input_dependencies[0].infer_schema()


Bug: Missing Metadata Inference and Incorrect Progress Bar

The Distinct logical operator is missing its infer_metadata method, unlike other AbstractAllToAll subclasses, which can cause issues during query planning. Also, its progress bar incorrectly shows "Sort Sample" for a hash-based shuffle, which might be confusing to users.

cursor · 2025-10-13T22:23:42Z

python/ray/data/_internal/execution/operators/distinct_aggregation.py

+
+            # Only keep first occurrence of each key
+            if key not in seen_keys:
+                seen_keys[key] = row


Bug: DistinctAggregation Key Columns Initialization Bug

The DistinctAggregation class's __init__ method accepts key_columns=None and its docstring suggests this means using all columns for uniqueness. However, the accept method directly iterates self._key_columns, which would raise a TypeError if key_columns were None. This creates a mismatch between the DistinctAggregation's public interface and its internal implementation, even though the HashDistinctOperator currently ensures key_columns is always provided.

initial commit, added distinct function

bfca63b

Signed-off-by: soffer-anyscale <[email protected]>

soffer-anyscale requested a review from a team as a code owner May 31, 2025 20:32

soffer-anyscale and others added 7 commits May 31, 2025 13:51

updated lint issues with the unit test

49c8796

Signed-off-by: soffer-anyscale <[email protected]>

fixed unit test

66e85d4

Signed-off-by: soffer-anyscale <[email protected]>

updated drop function

47f352d

Signed-off-by: soffer-anyscale <[email protected]>

added fix for blank datasets

aa2d2fa

Signed-off-by: soffer-anyscale <[email protected]>

Merge branch 'master' into data_distinct

f0e65e6

updated distinct function for subset and keep

281631b

Signed-off-by: soffer-anyscale <[email protected]>

updated distinct docstring

a338f01

Signed-off-by: soffer-anyscale <[email protected]>

soffer-anyscale requested review from alexeykudinkin and srinathk10 June 4, 2025 16:38

raulchen approved these changes Jun 4, 2025

View reviewed changes

python/ray/data/dataset.py Outdated Show resolved Hide resolved

updated subset -> keys and improved typing hits

7f082ad

Signed-off-by: soffer-anyscale <[email protected]>

raulchen reviewed Jun 4, 2025

View reviewed changes

python/ray/data/dataset.py Outdated Show resolved Hide resolved

Update python/ray/data/dataset.py

f3a4342

Signed-off-by: Hao Chen <[email protected]>

raulchen enabled auto-merge (squash) June 4, 2025 21:41

github-actions bot added the go add ONLY when ready to merge, run all tests label Jun 4, 2025

soffer-anyscale added 2 commits June 4, 2025 15:05

updated docstring

df53072

Signed-off-by: soffer-anyscale <[email protected]>

updated docstring

99231df

Signed-off-by: soffer-anyscale <[email protected]>

github-actions bot disabled auto-merge June 4, 2025 22:07

Merge branch 'master' into data_distinct

abfe38f

srinathk10 reviewed Jun 5, 2025

View reviewed changes

python/ray/data/dataset.py Outdated Show resolved Hide resolved

alexeykudinkin reviewed Jun 5, 2025

View reviewed changes

python/ray/data/dataset.py Outdated Show resolved Hide resolved

python/ray/data/dataset.py Outdated Show resolved Hide resolved

python/ray/data/dataset.py Outdated Show resolved Hide resolved

soffer-anyscale added 7 commits June 10, 2025 10:16

removed pandas req, used pyarrow instead

b9f195a

Signed-off-by: soffer-anyscale <[email protected]>

Merge branch 'data_distinct' of https://github.com/ray-project/ray in…

ae34210

…to data_distinct Signed-off-by: soffer-anyscale <[email protected]>

updated typing

35c2069

Signed-off-by: soffer-anyscale <[email protected]>

removed subset arg

b31b8b3

Signed-off-by: soffer-anyscale <[email protected]>

updated docstring

211a4f7

Signed-off-by: soffer-anyscale <[email protected]>

updated distinct function

3146f51

Signed-off-by: soffer-anyscale <[email protected]>

updated imports

c872592

Signed-off-by: soffer-anyscale <[email protected]>

This comment was marked as outdated.

Sign in to view

soffer-anyscale added 2 commits October 10, 2025 22:36

This comment was marked as outdated.

Sign in to view

soffer-anyscale added 2 commits October 10, 2025 23:08

richardliaw reviewed Oct 13, 2025

View reviewed changes

python/ray/data/tests/test_distinct.py Outdated Show resolved Hide resolved

richardliaw reviewed Oct 13, 2025

View reviewed changes

python/ray/data/tests/test_distinct.py Outdated Show resolved Hide resolved

This comment was marked as outdated.

Sign in to view

richardliaw reviewed Oct 13, 2025

View reviewed changes

python/ray/data/tests/test_distinct.py Outdated Show resolved Hide resolved

richardliaw reviewed Oct 13, 2025

View reviewed changes

python/ray/data/tests/test_distinct.py Show resolved Hide resolved

richardliaw reviewed Oct 13, 2025

View reviewed changes

python/ray/data/_internal/planner/exchange/distinct_task_spec.py Outdated Show resolved Hide resolved

richardliaw reviewed Oct 13, 2025

View reviewed changes

python/ray/data/_internal/planner/distinct.py Outdated Show resolved Hide resolved

richardliaw reviewed Oct 13, 2025

View reviewed changes

python/ray/data/_internal/table_block.py Outdated Show resolved Hide resolved

This comment was marked as outdated.

Sign in to view

soffer-anyscale added 5 commits October 13, 2025 15:31

Fix ruff linting issues

1571d13

- Remove trailing whitespace from docstrings - Fix import ordering (stdlib before third-party) - Remove unused imports (GiB, MiB) Signed-off-by: soffer-anyscale <[email protected]>

cursor bot reviewed Oct 13, 2025

View reviewed changes

python/ray/data/_internal/planner/plan_all_to_all_op.py Show resolved Hide resolved

soffer-anyscale added 2 commits October 13, 2025 15:47

Fix trailing whitespace in docstrings

9602397

Remove trailing whitespace from blank lines in docstrings to pass ruff checks Signed-off-by: soffer-anyscale <[email protected]>

cursor bot reviewed Oct 13, 2025

View reviewed changes

python/ray/data/_internal/planner/plan_all_to_all_op.py Show resolved Hide resolved

python/ray/data/dataset.py Outdated Show resolved Hide resolved

richardliaw reviewed Oct 13, 2025

View reviewed changes

Fix lint issues in distinct_aggregation

09503c0

Reformat assert statement and remove trailing newline to pass ruff checks. Signed-off-by: soffer-anyscale <[email protected]>

cursor bot reviewed Oct 13, 2025

View reviewed changes

[Data] Added distinct function #53460

Are you sure you want to change the base?

[Data] Added distinct function #53460

Conversation

soffer-anyscale commented May 31, 2025 • edited by cursor bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Why are these changes needed?

Checks

Uh oh!

Uh oh!

Uh oh!

Uh oh!

alexeykudinkin left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

This comment was marked as outdated.

Uh oh!

This comment was marked as outdated.

Uh oh!

Uh oh!

Uh oh!

This comment was marked as outdated.

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

This comment was marked as outdated.

Uh oh!

Uh oh!

Uh oh!

Uh oh!

richardliaw Oct 13, 2025

Choose a reason for hiding this comment

Uh oh!

cursor bot Oct 13, 2025

Choose a reason for hiding this comment

Bug: Missing Metadata Inference and Incorrect Progress Bar

Uh oh!

cursor bot Oct 13, 2025

Choose a reason for hiding this comment

Bug: DistinctAggregation Key Columns Initialization Bug

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

5 participants

soffer-anyscale commented May 31, 2025 •

edited by cursor bot

Loading