
Conversation

@demoncoder-crypto

Why are these changes needed?

Currently, applying a .limit(n) operation to a Ray Dataset reads significantly more data than required, especially with file-based sources, because the limit is only applied after the read tasks have already loaded data blocks into memory. This PR implements the first stage of "limit pushdown" to address that inefficiency.
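To make the behavior concrete, here is a minimal reproduction of the pattern this PR optimizes (the dataset path is hypothetical):

```python
import ray

# Hypothetical path. Without pushdown, every read task materializes full
# blocks before the limit is applied; with pushdown, only enough Parquet
# fragments to cover 10 rows need to be read at all.
ds = ray.data.read_parquet("s3://example-bucket/large_dataset/")
first_rows = ds.limit(10).take_all()
```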

Specifically, these changes:

  1. Modify the LimitPushdownRule logical optimization pass to push the limit value into the Read logical operator when applicable, instead of just moving the Limit operator past it.
  2. Add a _limit attribute to the Read logical operator.
  3. Update the Datasource interface (get_read_tasks) to accept an optional limit parameter (see the sketch after this list).
  4. Modify the plan_read_op function to pass the limit from the logical Read operator down to the datasource.get_read_tasks call.
  5. Implement limit handling within the ParquetDatasource (a rough sketch of these pieces follows this list):
    • get_read_tasks now selects only as many Parquet fragments as are needed to satisfy the limit, based on per-file row-count metadata.
    • The underlying read_fragments function now accepts and enforces a row limit, slicing the final Arrow batch if needed so the output matches the limit exactly.
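
As referenced in items 3 and 5 above, the sketch below illustrates the interface change and the two Parquet-side pieces. All names and signatures here are assumptions for illustration, not the exact code in this PR; the real implementation lives in Datasource.get_read_tasks, ParquetDatasource, and read_fragments:

```python
from typing import Iterable, Iterator, List

import pyarrow as pa
import pyarrow.dataset as pa_ds

# Item 3 (sketch): the Datasource interface gains an optional limit, roughly
#   def get_read_tasks(self, parallelism: int, limit: Optional[int] = None)
#       -> List[ReadTask]: ...


def select_fragments_for_limit(
    fragments: List[pa_ds.ParquetFileFragment], limit: int
) -> List[pa_ds.ParquetFileFragment]:
    """Pick the smallest prefix of fragments whose row-count metadata
    covers `limit` rows, so the remaining fragments are never opened."""
    selected: List[pa_ds.ParquetFileFragment] = []
    rows_so_far = 0
    for fragment in fragments:
        selected.append(fragment)
        # Parquet footers record per-file row counts, so this decision
        # requires no data pages to be read.
        rows_so_far += fragment.metadata.num_rows
        if rows_so_far >= limit:
            break
    return selected


def enforce_row_limit(
    batches: Iterable[pa.RecordBatch], limit: int
) -> Iterator[pa.RecordBatch]:
    """Yield batches until exactly `limit` rows are produced, slicing
    the final batch if it overshoots."""
    remaining = limit
    for batch in batches:
        if batch.num_rows >= remaining:
            yield batch.slice(0, remaining)
            return
        remaining -= batch.num_rows
        yield batch
```

Note that the two pieces only give an exact result together: footer metadata says how many rows a file holds, but the last selected fragment usually overshoots the limit, so the batch-level slice is still required.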

These changes make .limit() operations on Parquet datasets much more efficient by avoiding unnecessary data loading. Follow-up work is needed to extend the pushdown to other datasources.

Related issue number

Closes #51966 ([Data] Implement proper limit pushdown)

Checks

Please verify these checks before merging.

  • I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR. (You need to ensure this)
  • I've run scripts/format.sh to lint the changes in this PR. (You need to run this)
  • I've included any doc changes needed for https://docs.ray.io/en/master/. (Likely no user-facing API changes yet, but double-check whether any internal behavior documentation needs updates)
    • I've added any new APIs to the API Reference. For example, if I added a
      method in Tune, I've added it in doc/source/tune/api/ under the
      corresponding .rst file. (N/A for this change)
  • I've made sure the tests are passing. Note that there might be a few flaky tests; see the recent failures at https://flakey-tests.ray.io/ (You need to run tests. Consider unskipping and adapting test_limit_pushdown, or adding new tests specific to Parquet limit pushdown)
  • Testing Strategy
    • Unit tests (This should be the primary testing strategy - ensure existing relevant tests pass and add new ones for Parquet limit behavior)
    • Release tests
    • This PR is not tested :(

@demoncoder-crypto demoncoder-crypto requested a review from a team as a code owner April 5, 2025 16:45
@jcotant1 jcotant1 added the data (Ray Data-related issues) label Apr 7, 2025
@hainesmichaelc hainesmichaelc added the community-contribution (Contributed by the community) label Apr 7, 2025
@gvspraveen
Contributor

Thank you for making this contribution 🙇. We will take a look at this next week and respond here.

@github-actions

github-actions bot commented Jun 8, 2025

This pull request has been automatically marked as stale because it has not had any activity for 14 days. It will be closed in another 14 days if no further activity occurs.
Thank you for your contributions.

You can always ask for help on our discussion forum or Ray's public slack channel.

If you'd like to keep this open, just leave any comment, and the stale label will be removed.

@github-actions github-actions bot added the stale label (The issue is stale. It will be closed within 7 days unless there is further conversation) Jun 8, 2025
@github-actions

This pull request has been automatically closed because there has been no more activity in the 14 days since being marked stale.

Please feel free to reopen or open a new pull request if you'd still like this to be addressed.

Again, you can always ask for help on our discussion forum or Ray's public slack channel.

Thanks again for your contribution!

@github-actions github-actions bot closed this Jun 23, 2025
