
Conversation

@soffer-anyscale (Contributor) commented May 31, 2025

Why are these changes needed?

Distinct is one of the most-requested functions for improving the Ray Data user experience. There is currently no out-of-the-box way to deduplicate data with Ray Data, even though deduplication is one of the most common ETL operations.
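
For context, a minimal usage sketch of the API proposed here (distinct() is not part of released Ray Data; semantics follow this PR's description):

    import ray

    ds = ray.data.from_items(
        [{"a": 1, "b": "x"}, {"a": 1, "b": "x"}, {"a": 1, "b": "y"}]
    )

    # Proposed in this PR: drop rows that are duplicated across all columns.
    print(ds.distinct().take_all())

    # Or deduplicate on a subset of columns, keeping one row per distinct "a".
    print(ds.distinct(keys=["a"]).take_all())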

Checks

  • I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
    • I've added any new APIs to the API Reference. For example, if I added a
      method in Tune, I've added it in doc/source/tune/api/ under the
      corresponding .rst file.
  • I've made sure the tests are passing. Note that there might be a few flaky tests; see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

Note

Adds Dataset.distinct(keys=None) to remove duplicate rows (optionally by subset of columns), with unit tests and Bazel test target.

  • Data API:
    • Dataset.distinct(keys=None): New public API to drop duplicate rows; supports dedup across all columns or a specified subset; validates inputs; implemented via groupby(...).map_groups() with Arrow batches (see the sketch after this note); includes docstring examples.
  • Tests/Build:
    • Add python/ray/data/tests/test_distinct.py covering global, subset, edge cases, and error handling.
    • Register py_test(name="test_distinct") in python/ray/data/BUILD.bazel.

Written by Cursor Bugbot for commit 030df62. This will update automatically on new commits.
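
A minimal sketch of the groupby(...).map_groups() pattern mentioned above, deduplicating on one column with PyArrow batches (illustrative only; the PR's actual implementation may differ in details):

    import pyarrow as pa
    import ray

    ds = ray.data.from_items(
        [{"a": 1, "b": "x"}, {"a": 1, "b": "x"}, {"a": 2, "b": "y"}]
    )

    def keep_first(group: pa.Table) -> pa.Table:
        # All rows in a group share the same "a" value; keep only the first row.
        return group.slice(0, 1)

    deduped = ds.groupby("a").map_groups(keep_first, batch_format="pyarrow")
    print(deduped.take_all())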

@soffer-anyscale requested a review from a team as a code owner May 31, 2025 20:32
@raulchen enabled auto-merge (squash) June 4, 2025 21:41
@github-actions bot added the "go" label (add ONLY when ready to merge, run all tests) Jun 4, 2025
Signed-off-by: soffer-anyscale <[email protected]>
@github-actions bot disabled auto-merge June 4, 2025 22:07
@alexeykudinkin (Contributor) left a comment:

Request

Signed-off-by: soffer-anyscale <[email protected]>

- Fix r[keys] to iterate over keys: tuple(r[k] for k in keys)
- Fix empty keys case to use all columns instead of constant (0,)
- Prevents incorrect list indexing and improper deduplication

Signed-off-by: soffer-anyscale <[email protected]>
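
A sketch of the corrected key extraction described in the commit above (r is a mapping-like row and keys is the user-supplied column subset; names are illustrative):

    def _row_key(r, keys):
        # Before the fix, r[keys] indexed the row with the whole list, which breaks
        # list-backed rows and produces incorrect deduplication keys.
        return tuple(r[k] for k in keys)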
- Remove local 'import pyarrow as pa' at line 6085 (to_arrow_refs method)
- Remove local 'import pyarrow as pa' at line 6738 (Schema.types property)
- These imports are redundant as pa is already imported at module level (line 26)

Signed-off-by: soffer-anyscale <[email protected]>

- Remove unreliable hasattr check in _key_fn fallback
- TableRow is always a Mapping with .values() method
- Previous fallback tuple(r) would iterate keys not values
- Ensures consistent key extraction for all-column deduplication

Signed-off-by: soffer-anyscale <[email protected]>
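
For the all-columns case, the commit above relies on TableRow being a Mapping, so the key comes from the row's values rather than tuple(r), which would iterate the column names (sketch, not the exact code):

    def _row_key_all_columns(r):
        # r is Mapping-like; .values() yields the column values in order.
        return tuple(r.values())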
Applied auto-fixes from pre-commit hooks:

1. Fixed end-of-file issues (added missing newlines):
   - python/ray/data/_internal/planner/exchange/distinct_task_spec.py
   - python/ray/data/_internal/planner/distinct.py

2. Applied black formatting:
   - python/ray/data/_internal/planner/exchange/distinct_task_spec.py

All pre-commit checks now passing:
- ruff: no errors
- black: formatted
- pydoclint: all docstrings valid
- buildifier: passed
- trailing whitespace: clean
- end of files: clean
- Ray import order: valid

Signed-off-by: soffer-anyscale <[email protected]>
- Add ray_start_regular_shared fixture to all 9 test functions for proper cluster cleanup
- Fix test_distinct_across_blocks to use sample_data fixture instead of inline table
- Add ray.data.tests.conftest import for consistency with other Ray Data tests

Addresses PR feedback from @richardliaw

Signed-off-by: soffer-anyscale <[email protected]>
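
A hedged sketch of the fixture pattern described above; the test name is illustrative, ray_start_regular_shared comes from the Ray test conftest, and distinct() assumes this PR's API:

    import ray

    def test_distinct_basic(ray_start_regular_shared):  # fixture from Ray's test conftest
        ds = ray.data.from_items([{"id": 1}, {"id": 1}, {"id": 2}])
        assert sorted(r["id"] for r in ds.distinct().take_all()) == [1, 2]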

- Remove sample_data parameter from test_distinct_all_unique (unused)
- Add test_distinct_with_multiple_blocks_range: tests with 10 blocks and union
- Add test_distinct_with_varying_block_sizes: tests blocks of different sizes
- Add test_distinct_subset_with_multiple_blocks: tests keys parameter with multiple blocks
- Add test_distinct_large_dataset_multiple_blocks: tests larger dataset (150 items, 10 blocks)

Addresses PR feedback from @richardliaw for more multi-block test coverage

Signed-off-by: soffer-anyscale <[email protected]>
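
Illustrative multi-block coverage in the spirit of the tests listed above (the actual bodies in test_distinct.py may differ):

    import ray

    def test_distinct_across_many_blocks(ray_start_regular_shared):
        # 150 rows with 15 distinct ids, spread across 10 blocks.
        ds = ray.data.from_items([{"id": i % 15} for i in range(150)]).repartition(10)
        assert len(ds.distinct().take_all()) == 15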

This is a major architectural change from sort-based to hash-based distinct:

- Add HashDistinctOperator: extends HashShufflingOperatorBase
- Add DistinctAggregation: implements StatefulShuffleAggregation
  - Uses hash-based partitioning (same key hash -> same partition)
  - Deduplicates within partitions using dict-based lookup
- Update plan_all_to_all_op to use hash shuffle when shuffle_strategy=HASH_SHUFFLE

Benefits of hash-based approach:
- Simpler implementation (no merge-sort complexity)
- Better performance for deduplication workloads
- Aligns with modern Ray Data infrastructure
- Eliminates bugs in sort-based empty block handling

Addresses PR feedback from @richardliaw to use AggregateFnV2 and hash shuffles
instead of older sort-based TaskSpec infrastructure

Signed-off-by: soffer-anyscale <[email protected]>
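
A standalone sketch of the dict-based, per-partition deduplication described above. The shape loosely mirrors the accept/finalize lifecycle, but it is not Ray's actual StatefulShuffleAggregation or HashShufflingOperatorBase interface:

    from typing import Dict, List, Tuple

    import pyarrow as pa


    class DedupAggregationSketch:
        def __init__(self, key_columns: List[str]):
            self._key_columns = key_columns
            # partition_id -> {key tuple -> first row seen with that key}
            self._partition_data: Dict[int, Dict[Tuple, dict]] = {}

        def accept(self, partition_id: int, block: pa.Table) -> None:
            seen = self._partition_data.setdefault(partition_id, {})
            for row in block.to_pylist():
                key = tuple(row[c] for c in self._key_columns)
                # O(1) dict lookup: keep only the first occurrence of each key.
                if key not in seen:
                    seen[key] = row

        def finalize(self, partition_id: int) -> pa.Table:
            rows = list(self._partition_data.get(partition_id, {}).values())
            return pa.Table.from_pylist(rows)


    # Rows with equal key values hash to the same partition, so per-partition
    # dedup yields a globally distinct result.
    agg = DedupAggregationSketch(["a"])
    agg.accept(0, pa.table({"a": [1, 1, 2], "b": ["x", "y", "z"]}))
    print(agg.finalize(0).to_pylist())  # [{'a': 1, 'b': 'x'}, {'a': 2, 'b': 'z'}]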
- Get all column names from schema when distinct keys parameter is None
- Pass actual column names to hash partitioning (cannot be empty)
- Simplify DistinctAggregation logic since key_columns is always provided
- Add assertion to ensure key_columns is non-empty in HashDistinctOperator

This ensures hash partitioning works correctly for both:
- .distinct() - deduplicates on all columns
- .distinct(keys=[...]) - deduplicates on specified columns

Signed-off-by: soffer-anyscale <[email protected]>
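
A sketch of the keys-resolution logic described above, in which keys=None falls back to every column so hash partitioning always has non-empty key columns (helper name is hypothetical):

    from typing import List, Optional


    def resolve_key_columns(keys: Optional[List[str]], schema_names: List[str]) -> List[str]:
        if keys:
            missing = [k for k in keys if k not in schema_names]
            if missing:
                raise ValueError(f"Columns not found in schema: {missing}")
            return list(keys)
        # keys=None: deduplicate on all columns; partitioning cannot take an empty list.
        assert schema_names, "hash-partitioned distinct requires a non-empty schema"
        return list(schema_names)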
- Remove trailing whitespace from docstrings
- Fix import ordering (stdlib before third-party)
- Remove unused imports (GiB, MiB)

Signed-off-by: soffer-anyscale <[email protected]>
Code Quality Improvements:
- Add comprehensive module-level docstring explaining hash-based algorithm
- Enhance class and method docstrings with detailed explanations
- Add inline comments explaining memory estimation logic
- Document O(1) dict lookup performance characteristics
- Improve code readability with better variable names and structure

Bug Fixes (addressing cursor[bot] feedback):
- Fix empty blocks handling in DistinctTaskSpec.reduce (IndexError)
- Fix empty blocks handling in _merge_sorted_blocks_and_keep_first (IndexError)
- Add empty blocks checks before accessing blocks[0] or normalized_blocks[0]

These fixes prevent crashes when reduce tasks receive no input blocks during
distributed shuffle operations.

Signed-off-by: soffer-anyscale <[email protected]>
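
The empty-input guard described above, sketched as a hypothetical helper (the actual fix lives in DistinctTaskSpec.reduce and _merge_sorted_blocks_and_keep_first):

    from typing import List

    import pyarrow as pa


    def first_block_schema(blocks: List[pa.Table]) -> pa.Schema:
        # A reduce task can legitimately receive zero blocks during a distributed
        # shuffle; guard before touching blocks[0] to avoid an IndexError.
        if not blocks:
            return pa.schema([])
        return blocks[0].schema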
Remove obsolete sort-based distinct implementation:
- Delete distinct.py (sort-based planner function)
- Delete distinct_task_spec.py (sort-based TaskSpec)
- Remove _merge_sorted_blocks_and_keep_first method (only used by sort-based)
- Remove fallback logic from plan_all_to_all_op.py

Hash-based distinct is now the only implementation. This simplifies the codebase
and removes unnecessary complexity. Hash shuffle is the default strategy and is
more efficient for deduplication workloads.

Benefits:
- Cleaner codebase with single implementation path
- No confusing fallback logic
- Reduced maintenance burden
- Aligns with Ray Data's modern infrastructure

Signed-off-by: soffer-anyscale <[email protected]>
Remove trailing whitespace from blank lines in docstrings to pass ruff checks

Signed-off-by: soffer-anyscale <[email protected]>
Add proper error handling for schema inference failures in _plan_hash_shuffle_distinct.
Now raises clear ValueError if schema cannot be inferred or is empty, preventing
None from being passed to HashDistinctOperator.

Signed-off-by: soffer-anyscale <[email protected]>
Remove expensive count() call that forced full dataset materialization for
empty check. Distinguish between schema inference failure (None) and empty
column list when validating dataset schema, providing clearer error messages.

Signed-off-by: soffer-anyscale <[email protected]>
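
A hedged sketch of the validation path described in the two commits above (helper name is hypothetical; the real logic lives in _plan_hash_shuffle_distinct):

    from typing import List, Optional


    def validate_schema_columns(column_names: Optional[List[str]]) -> List[str]:
        if column_names is None:
            # Schema inference failed; no expensive count() is needed to detect this.
            raise ValueError(
                "Cannot infer a schema for distinct(); specify keys or materialize first."
            )
        if not column_names:
            raise ValueError("Dataset schema has no columns to deduplicate on.")
        return column_names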
Comment on lines +93 to +120
    def finalize(self, partition_id: int) -> Block:
        """Finalize deduplication for a partition and return the result block.

        Converts the deduplicated rows stored in the dictionary back into a
        PyArrow table block.

        Args:
            partition_id: Partition ID to finalize.

        Returns:
            Deduplicated block for this partition as a PyArrow table.
        """
        seen_keys = self._partition_data[partition_id]

        # Build output block from deduplicated rows
        builder = ArrowBlockBuilder()
        for row in seen_keys.values():
            builder.add(row)

        return builder.build()

    def clear(self, partition_id: int):
        """Clear accumulated state for a partition.

        Args:
            partition_id: Partition ID to clear.
        """
        self._partition_data.pop(partition_id, None)

Reformat assert statement and remove trailing newline to pass ruff checks.

Signed-off-by: soffer-anyscale <[email protected]>

    def infer_schema(self) -> Optional["Schema"]:
        """The output schema is the same as the input schema."""
        return self._input_dependencies[0].infer_schema()

Bug: Missing Metadata Inference and Incorrect Progress Bar

The Distinct logical operator is missing its infer_metadata method, unlike other AbstractAllToAll subclasses, which can cause issues during query planning. Also, its progress bar incorrectly shows "Sort Sample" for a hash-based shuffle, which might be confusing to users.



        # Only keep first occurrence of each key
        if key not in seen_keys:
            seen_keys[key] = row

Bug: DistinctAggregation Key Columns Initialization Bug

The DistinctAggregation class's __init__ method accepts key_columns=None and its docstring suggests this means using all columns for uniqueness. However, the accept method directly iterates self._key_columns, which would raise a TypeError if key_columns were None. This creates a mismatch between the DistinctAggregation's public interface and its internal implementation, even though the HashDistinctOperator currently ensures key_columns is always provided.
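
One way to close the interface gap described above, sketched (illustrative; not the PR's actual code): normalize key_columns in __init__ so accept() never iterates None:

    from typing import List, Optional


    class DistinctAggregationInitSketch:
        def __init__(
            self,
            key_columns: Optional[List[str]] = None,
            all_columns: Optional[List[str]] = None,
        ):
            # Treat None/empty as "use every column", matching the docstring's promise.
            resolved = list(key_columns) if key_columns else list(all_columns or [])
            if not resolved:
                raise ValueError("key_columns could not be resolved to a non-empty list")
            self._key_columns = resolved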



Labels

  • data (Ray Data-related issues)
  • go (add ONLY when ready to merge, run all tests)
  • unstale (A PR that has been marked unstale. It will not get marked stale again if this label is on it.)


5 participants