[Data] Added distinct function #53460
base: master
Conversation
Request
Signed-off-by: soffer-anyscale <[email protected]>
…to data_distinct Signed-off-by: soffer-anyscale <[email protected]>
- Fix r[keys] to iterate over keys: tuple(r[k] for k in keys)
- Fix empty keys case to use all columns instead of constant (0,)
- Prevents incorrect list indexing and improper deduplication

Signed-off-by: soffer-anyscale <[email protected]>
- Remove local 'import pyarrow as pa' at line 6085 (to_arrow_refs method)
- Remove local 'import pyarrow as pa' at line 6738 (Schema.types property)
- These imports are redundant as pa is already imported at module level (line 26)

Signed-off-by: soffer-anyscale <[email protected]>
- Remove unreliable hasattr check in _key_fn fallback
- TableRow is always a Mapping with .values() method
- Previous fallback tuple(r) would iterate keys not values
- Ensures consistent key extraction for all-column deduplication

Signed-off-by: soffer-anyscale <[email protected]>
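Taken together, the two key-extraction fixes above amount to something like the following helper (an illustrative sketch, not the PR's actual code; the name _make_key_fn is made up here):

from collections.abc import Mapping
from typing import Callable, List, Optional


def _make_key_fn(keys: Optional[List[str]]) -> Callable[[Mapping], tuple]:
    """Return a function mapping a row (a Mapping) to its deduplication key."""
    if keys:
        # Subset of columns: collect the value of each key column.
        return lambda r: tuple(r[k] for k in keys)
    # All columns: rows are Mappings, so .values() yields the column values
    # (the old fallback tuple(r) would have iterated column names instead).
    return lambda r: tuple(r.values())


# Two rows that agree on "a" but differ on "b":
row1, row2 = {"a": 1, "b": 10}, {"a": 1, "b": 20}
assert _make_key_fn(["a"])(row1) == _make_key_fn(["a"])(row2)
assert _make_key_fn(None)(row1) != _make_key_fn(None)(row2)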
Applied auto-fixes from pre-commit hooks:
1. Fixed end-of-file issues (added missing newlines):
   - python/ray/data/_internal/planner/exchange/distinct_task_spec.py
   - python/ray/data/_internal/planner/distinct.py
2. Applied black formatting:
   - python/ray/data/_internal/planner/exchange/distinct_task_spec.py

All pre-commit checks now passing:
- ruff: no errors
- black: formatted
- pydoclint: all docstrings valid
- buildifier: passed
- trailing whitespace: clean
- end of files: clean
- Ray import order: valid

Signed-off-by: soffer-anyscale <[email protected]>
- Add ray_start_regular_shared fixture to all 9 test functions for proper cluster cleanup
- Fix test_distinct_across_blocks to use sample_data fixture instead of inline table
- Add ray.data.tests.conftest import for consistency with other Ray Data tests

Addresses PR feedback from @richardliaw

Signed-off-by: soffer-anyscale <[email protected]>
- Remove sample_data parameter from test_distinct_all_unique (unused)
- Add test_distinct_with_multiple_blocks_range: tests with 10 blocks and union
- Add test_distinct_with_varying_block_sizes: tests blocks of different sizes
- Add test_distinct_subset_with_multiple_blocks: tests keys parameter with multiple blocks
- Add test_distinct_large_dataset_multiple_blocks: tests larger dataset (150 items, 10 blocks)

Addresses PR feedback from @richardliaw for more multi-block test coverage

Signed-off-by: soffer-anyscale <[email protected]>
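As a rough illustration of the multi-block coverage described above, one such test might look like this (a sketch only; it assumes the ray_start_regular_shared fixture referenced in the commits above, the distinct() API added by this PR, and that ray.data.range() produces an "id" column):

import ray


def test_distinct_across_multiple_blocks(ray_start_regular_shared):
    # Union two identical ranges so every id appears twice and the data
    # spans more than one block.
    ds = ray.data.range(50).union(ray.data.range(50))
    assert sorted(r["id"] for r in ds.distinct().take_all()) == list(range(50))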
This is a major architectural change from sort-based to hash-based distinct:
- Add HashDistinctOperator: extends HashShufflingOperatorBase
- Add DistinctAggregation: implements StatefulShuffleAggregation
- Uses hash-based partitioning (same key hash -> same partition)
- Deduplicates within partitions using dict-based lookup
- Update plan_all_to_all_op to use hash shuffle when shuffle_strategy=HASH_SHUFFLE

Benefits of hash-based approach:
- Simpler implementation (no merge-sort complexity)
- Better performance for deduplication workloads
- Aligns with modern Ray Data infrastructure
- Eliminates bugs in sort-based empty block handling

Addresses PR feedback from @richardliaw to use AggregateFnV2 and hash shuffles instead of older sort-based TaskSpec infrastructure

Signed-off-by: soffer-anyscale <[email protected]>
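Stripped of the Ray operator plumbing, the hash-based scheme this commit describes boils down to the following (illustrative plain-Python sketch; the helper names and partition count are assumptions, not the PR's code):

def partition_for(key: tuple, num_partitions: int) -> int:
    """Rows with equal keys hash to the same partition, so all duplicates meet."""
    return hash(key) % num_partitions


def key_fn(row: dict) -> tuple:
    """Dedup on all columns: the key is the tuple of column values."""
    return tuple(row.values())


def dedup_partition(rows, key_fn):
    """Within one partition, keep only the first occurrence of each key."""
    seen = {}
    for row in rows:
        key = key_fn(row)
        if key not in seen:  # O(1) dict lookup per row
            seen[key] = row
    return list(seen.values())


rows = [{"a": 1}, {"a": 2}, {"a": 1}]
parts = {0: [], 1: []}
for row in rows:
    parts[partition_for(key_fn(row), 2)].append(row)
deduped = [r for p in parts.values() for r in dedup_partition(p, key_fn)]
assert sorted(r["a"] for r in deduped) == [1, 2]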
- Get all column names from schema when distinct keys parameter is None
- Pass actual column names to hash partitioning (cannot be empty)
- Simplify DistinctAggregation logic since key_columns is always provided
- Add assertion to ensure key_columns is non-empty in HashDistinctOperator

This ensures hash partitioning works correctly for both:
- .distinct() - deduplicates on all columns
- .distinct(keys=[...]) - deduplicates on specified columns

Signed-off-by: soffer-anyscale <[email protected]>
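The keys-resolution step this commit describes could be shaped roughly like the following (a sketch assuming the public Dataset.schema()/Schema.names API; the function name is made up):

def resolve_key_columns(dataset, keys=None):
    """Default to all column names when distinct() is called without keys."""
    if keys:
        return list(keys)
    schema = dataset.schema()
    key_columns = list(schema.names)
    # Hash partitioning needs at least one key column to shuffle on.
    assert key_columns, "dataset must have at least one column to deduplicate on"
    return key_columns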
- Remove trailing whitespace from docstrings
- Fix import ordering (stdlib before third-party)
- Remove unused imports (GiB, MiB)

Signed-off-by: soffer-anyscale <[email protected]>
Code Quality Improvements:
- Add comprehensive module-level docstring explaining hash-based algorithm
- Enhance class and method docstrings with detailed explanations
- Add inline comments explaining memory estimation logic
- Document O(1) dict lookup performance characteristics
- Improve code readability with better variable names and structure

Bug Fixes (addressing cursor[bot] feedback):
- Fix empty blocks handling in DistinctTaskSpec.reduce (IndexError)
- Fix empty blocks handling in _merge_sorted_blocks_and_keep_first (IndexError)
- Add empty blocks checks before accessing blocks[0] or normalized_blocks[0]

These fixes prevent crashes when reduce tasks receive no input blocks during distributed shuffle operations.

Signed-off-by: soffer-anyscale <[email protected]>
Remove obsolete sort-based distinct implementation:
- Delete distinct.py (sort-based planner function)
- Delete distinct_task_spec.py (sort-based TaskSpec)
- Remove _merge_sorted_blocks_and_keep_first method (only used by sort-based)
- Remove fallback logic from plan_all_to_all_op.py

Hash-based distinct is now the only implementation. This simplifies the codebase and removes unnecessary complexity. Hash shuffle is the default strategy and is more efficient for deduplication workloads.

Benefits:
- Cleaner codebase with single implementation path
- No confusing fallback logic
- Reduced maintenance burden
- Aligns with Ray Data's modern infrastructure

Signed-off-by: soffer-anyscale <[email protected]>
Remove trailing whitespace from blank lines in docstrings to pass ruff checks.

Signed-off-by: soffer-anyscale <[email protected]>
Add proper error handling for schema inference failures in _plan_hash_shuffle_distinct. Now raises a clear ValueError if the schema cannot be inferred or is empty, preventing None from being passed to HashDistinctOperator.

Signed-off-by: soffer-anyscale <[email protected]>
Remove expensive count() call that forced full dataset materialization for the empty check. Distinguish between schema inference failure (None) and an empty column list when validating the dataset schema, providing clearer error messages.

Signed-off-by: soffer-anyscale <[email protected]>
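The validation described here might be shaped like this (a sketch; the function name and exact error wording are assumptions):

def validate_schema_for_distinct(dataset):
    """Fail fast with a clear message instead of materializing the dataset."""
    schema = dataset.schema()
    if schema is None:
        # Schema inference failed entirely; don't pass None downstream.
        raise ValueError("Cannot infer schema for distinct(); schema is unknown.")
    if not schema.names:
        # Schema is known but the dataset has no columns to deduplicate on.
        raise ValueError("Cannot apply distinct() to a dataset with no columns.")
    return schema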
def finalize(self, partition_id: int) -> Block:
    """Finalize deduplication for a partition and return the result block.

    Converts the deduplicated rows stored in the dictionary back into a
    PyArrow table block.

    Args:
        partition_id: Partition ID to finalize.

    Returns:
        Deduplicated block for this partition as a PyArrow table.
    """
    seen_keys = self._partition_data[partition_id]

    # Build output block from deduplicated rows
    builder = ArrowBlockBuilder()
    for row in seen_keys.values():
        builder.add(row)

    return builder.build()


def clear(self, partition_id: int):
    """Clear accumulated state for a partition.

    Args:
        partition_id: Partition ID to clear.
    """
    self._partition_data.pop(partition_id, None)
I don't think you need this; just use https://docs.ray.io/en/latest/data/api/doc/ray.data.aggregate.AggregateFnV2.html
Reformat assert statement and remove trailing newline to pass ruff checks.

Signed-off-by: soffer-anyscale <[email protected]>
def infer_schema(self) -> Optional["Schema"]:
    """The output schema is the same as the input schema."""
    return self._input_dependencies[0].infer_schema()
Bug: Missing Metadata Inference and Incorrect Progress Bar

The Distinct logical operator is missing its infer_metadata method, unlike other AbstractAllToAll subclasses, which can cause issues during query planning. Also, its progress bar incorrectly shows "Sort Sample" for a hash-based shuffle, which might be confusing to users.
# Only keep first occurrence of each key
if key not in seen_keys:
    seen_keys[key] = row
Bug: DistinctAggregation Key Columns Initialization Bug

The DistinctAggregation class's __init__ method accepts key_columns=None, and its docstring suggests this means using all columns for uniqueness. However, the accept method directly iterates self._key_columns, which would raise a TypeError if key_columns were None. This creates a mismatch between DistinctAggregation's public interface and its internal implementation, even though the HashDistinctOperator currently ensures key_columns is always provided.
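One way to close the gap the bot points out is to normalize key_columns when the aggregation is constructed, so the accept path never sees None. A hypothetical sketch (class and parameter names here are illustrative, not the PR's code):

class DistinctAggregationSketch:
    """Illustrative only: resolve key_columns up front, matching the docstring."""

    def __init__(self, key_columns=None, all_columns=None):
        # key_columns=None is documented to mean "use all columns"; resolve it
        # here (all_columns is a hypothetical argument carrying the schema's
        # column names) so accept() can iterate self._key_columns unconditionally.
        resolved = list(key_columns) if key_columns else list(all_columns or [])
        if not resolved:
            raise ValueError("DistinctAggregation needs at least one key column")
        self._key_columns = resolved
        self._partition_data = {}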
Why are these changes needed?
Distinct is one of the most-needed functions for improving the Ray Data user experience. Deduplication is one of the most common ETL operations, yet there is currently no out-of-the-box way to dedupe data with Ray Data.
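For context, the API proposed in this PR is intended to be used like so (example follows the PR description; output row ordering is not guaranteed):

import ray

ds = ray.data.from_items(
    [
        {"user": "a", "country": "US"},
        {"user": "a", "country": "US"},  # exact duplicate row
        {"user": "b", "country": "CA"},
    ]
)

# Deduplicate across all columns.
print(ds.distinct().take_all())

# Deduplicate on a subset of columns only.
print(ds.distinct(keys=["country"]).take_all())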
Checks
- I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
- I've run scripts/format.sh to lint the changes in this PR.
- If I've added a method in Tune, I've added it in doc/source/tune/api/ under the corresponding .rst file.

Note
Adds Dataset.distinct(keys=None) to remove duplicate rows (optionally by subset of columns), with unit tests and Bazel test target.

- Dataset.distinct(keys=None): new public API to drop duplicate rows; supports dedup across all columns or a specified subset; validates inputs; implemented via groupby(...).map_groups() with Arrow batches (see the sketch below); includes docstring examples.
- python/ray/data/tests/test_distinct.py covering global, subset, edge cases, and error handling.
- py_test(name="test_distinct") in python/ray/data/BUILD.bazel.

Written by Cursor Bugbot for commit 030df62. This will update automatically on new commits.
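The groupby(...).map_groups() route mentioned in the note can be approximated in user code roughly as follows (a sketch only, not the PR's internal implementation; it assumes map_groups with batch_format="pyarrow" hands each group to the function as a pyarrow.Table):

import pyarrow as pa
import ray

ds = ray.data.from_items(
    [{"a": 1, "b": "x"}, {"a": 1, "b": "x"}, {"a": 2, "b": "y"}]
)


def keep_first(group: pa.Table) -> pa.Table:
    # Every row in a group shares the same key values; keep one representative.
    return group.slice(0, 1)


deduped = ds.groupby(["a", "b"]).map_groups(keep_first, batch_format="pyarrow")
print(deduped.take_all())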