[Data] Implement proper limit pushdown #51966 #52018
Closed
Why are these changes needed?
Currently, applying a `.limit(n)` operation on a Ray Dataset reads significantly more data than required, especially with file-based sources. The limit is only applied after data blocks are loaded into memory by the read tasks. This PR implements the first stage of "limit pushdown" to address this inefficiency.

Specifically, these changes:
- Add a `LimitPushdownRule` logical optimization pass to push the limit value into the `Read` logical operator when applicable, instead of just moving the `Limit` operator past it (see the first sketch below).
- Add a `_limit` attribute to the `Read` logical operator.
- Update the `Datasource` interface (`get_read_tasks`) to accept an optional `limit` parameter.
- Update the `plan_read_op` function to pass the limit from the logical `Read` operator down to the `datasource.get_read_tasks` call.
- For `ParquetDatasource` (see the Parquet sketch below):
  - `get_read_tasks` now selects only the necessary Parquet fragments based on row count metadata to satisfy the limit.
  - The `read_fragments` function now accepts and enforces a row limit, slicing the final Arrow batch if needed to exactly match the limit.
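A minimal sketch of the logical-plan side of the change. The classes below are illustrative stand-ins, not Ray Data's actual internals; only the names `Read._limit`, `get_read_tasks(..., limit=...)`, and `plan_read_op` come from this PR's description, and the rule body is simplified to a single operator pair:

```python
from typing import List, Optional


class Datasource:
    """Stand-in for the Datasource interface with the new optional `limit`."""

    def get_read_tasks(self, parallelism: int, limit: Optional[int] = None) -> List:
        raise NotImplementedError


class Read:
    """Stand-in for the Read logical operator, now carrying a `_limit`."""

    def __init__(self, datasource: Datasource, limit: Optional[int] = None):
        self.datasource = datasource
        self._limit = limit


class Limit:
    """Stand-in for the Limit logical operator."""

    def __init__(self, input_op, limit: int):
        self.input_op = input_op
        self.limit = limit


def limit_pushdown(op):
    """Core idea of the LimitPushdownRule: when a Limit sits directly on a
    Read, fold its value into the Read instead of merely reordering ops."""
    if isinstance(op, Limit) and isinstance(op.input_op, Read):
        read = op.input_op
        # Keep the smaller limit if one was already pushed down.
        limit = op.limit if read._limit is None else min(read._limit, op.limit)
        return Read(read.datasource, limit=limit)
    return op


def plan_read_op(read_op: Read, parallelism: int):
    """Physical planning now forwards the logical limit to the datasource."""
    return read_op.datasource.get_read_tasks(parallelism, limit=read_op._limit)
```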
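And a sketch of the Parquet-side behavior in plain PyArrow. The helper names here (`select_fragments_for_limit`, `read_fragments_with_limit`) are hypothetical; the PR implements the equivalent logic inside `ParquetDatasource.get_read_tasks` and `read_fragments`:

```python
import pyarrow as pa
import pyarrow.dataset as ds


def select_fragments_for_limit(fragments, limit: int):
    """Pick just enough leading fragments to cover `limit` rows.

    For Parquet fragments, count_rows() is typically answered from file
    metadata, so this selection step reads no row data.
    """
    chosen, covered = [], 0
    for frag in fragments:
        chosen.append(frag)
        covered += frag.count_rows()
        if covered >= limit:
            break
    return chosen


def read_fragments_with_limit(fragments, limit: int) -> pa.Table:
    """Read fragments in order, slicing the final table to exactly `limit` rows."""
    tables, remaining = [], limit
    for frag in fragments:
        table = frag.to_table()
        if table.num_rows > remaining:
            table = table.slice(0, remaining)  # trim the last batch
        tables.append(table)
        remaining -= table.num_rows
        if remaining <= 0:
            break
    return pa.concat_tables(tables)


# Example wiring (path is hypothetical):
# dataset = ds.dataset("data/", format="parquet")
# frags = select_fragments_for_limit(dataset.get_fragments(), limit=100)
# table = read_fragments_with_limit(frags, limit=100)
```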
This aims to make `.limit()` operations on Parquet datasets much more efficient by avoiding unnecessary data loading. Further work will be needed to implement this for other datasources.
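As a user-facing example (the path below is illustrative), a small limit on a large Parquet dataset should now plan only as many read tasks as the limit requires:

```python
import ray

# Before this change, read tasks loaded far more blocks than needed and the
# limit was applied in memory; with pushdown, only enough Parquet fragments
# to cover 100 rows should be selected and read.
ds = ray.data.read_parquet("s3://bucket/large_dataset/")  # illustrative path
rows = ds.limit(100).take(100)
```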
Related issue number

Closes #51966
Checks
Please verify these checks before merging.
- I've signed off every commit (`git commit -s`) in this PR. (You need to ensure this)
- I've run `scripts/format.sh` to lint the changes in this PR. (You need to run this)
- If I added a method in Tune, I've added it in `doc/source/tune/api/` under the corresponding `.rst` file. (N/A for this change)
- Relevant tests are added or updated (e.g., extending `test_limit_pushdown` or adding new tests specific to Parquet limit pushdown).