[Data] Implement proper limit pushdown #51966 #52018
Closed
Why are these changes needed?
Currently, applying a `.limit(n)` operation on a Ray Dataset reads significantly more data than required, especially with file-based sources. The limit is only applied after data blocks are loaded into memory by the read tasks. This PR implements the first stage of "limit pushdown" to address this inefficiency.

Specifically, these changes:
- Add a `LimitPushdownRule` logical optimization pass to push the limit value into the `Read` logical operator when applicable, instead of just moving the `Limit` operator past it (see the first sketch below).
- Add a `_limit` attribute to the `Read` logical operator.
- Update the `Datasource` interface (`get_read_tasks`) to accept an optional `limit` parameter.
- Update the `plan_read_op` function to pass the limit from the logical `Read` operator down to the `datasource.get_read_tasks` call.
- For `ParquetDatasource` (see the Parquet sketch below):
  - `get_read_tasks` now selects only the necessary Parquet fragments based on row count metadata to satisfy the limit.
  - The `read_fragments` function now accepts and enforces a row limit, slicing the final Arrow batch if needed to exactly match the limit.
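A minimal sketch of the logical-plan side of the change. The classes below are illustrative stand-ins, not Ray Data's actual internals; only the names `Read._limit`, `get_read_tasks(..., limit=...)`, and `plan_read_op` come from this PR's description, and the rule body is simplified to a single operator pair:

```python
from typing import List, Optional


class Datasource:
    """Stand-in for the Datasource interface with the new optional `limit`."""

    def get_read_tasks(self, parallelism: int, limit: Optional[int] = None) -> List:
        raise NotImplementedError


class Read:
    """Stand-in for the Read logical operator, now carrying a `_limit`."""

    def __init__(self, datasource: Datasource, limit: Optional[int] = None):
        self.datasource = datasource
        self._limit = limit


class Limit:
    """Stand-in for the Limit logical operator."""

    def __init__(self, input_op, limit: int):
        self.input_op = input_op
        self.limit = limit


def limit_pushdown(op):
    """Core idea of the LimitPushdownRule: when a Limit sits directly on a
    Read, fold its value into the Read instead of merely reordering ops."""
    if isinstance(op, Limit) and isinstance(op.input_op, Read):
        read = op.input_op
        # Keep the smaller limit if one was already pushed down.
        limit = op.limit if read._limit is None else min(read._limit, op.limit)
        return Read(read.datasource, limit=limit)
    return op


def plan_read_op(read_op: Read, parallelism: int):
    """Physical planning now forwards the logical limit to the datasource."""
    return read_op.datasource.get_read_tasks(parallelism, limit=read_op._limit)
```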
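And a sketch of the Parquet-side behavior in plain PyArrow. The helper names here (`select_fragments_for_limit`, `read_fragments_with_limit`) are hypothetical; the PR implements the equivalent logic inside `ParquetDatasource.get_read_tasks` and `read_fragments`:

```python
import pyarrow as pa
import pyarrow.dataset as ds


def select_fragments_for_limit(fragments, limit: int):
    """Pick just enough leading fragments to cover `limit` rows.

    For Parquet fragments, count_rows() is typically answered from file
    metadata, so this selection step reads no row data.
    """
    chosen, covered = [], 0
    for frag in fragments:
        chosen.append(frag)
        covered += frag.count_rows()
        if covered >= limit:
            break
    return chosen


def read_fragments_with_limit(fragments, limit: int) -> pa.Table:
    """Read fragments in order, slicing the final table to exactly `limit` rows."""
    tables, remaining = [], limit
    for frag in fragments:
        table = frag.to_table()
        if table.num_rows > remaining:
            table = table.slice(0, remaining)  # trim the last batch
        tables.append(table)
        remaining -= table.num_rows
        if remaining <= 0:
            break
    return pa.concat_tables(tables)


# Example wiring (path is hypothetical):
# dataset = ds.dataset("data/", format="parquet")
# frags = select_fragments_for_limit(dataset.get_fragments(), limit=100)
# table = read_fragments_with_limit(frags, limit=100)
```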
This aims to make `.limit()` operations on Parquet datasets much more efficient by avoiding unnecessary data loading. Further work will be needed to implement this for other datasources.
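As a user-facing example (the path below is illustrative), a small limit on a large Parquet dataset should now plan only as many read tasks as the limit requires:

```python
import ray

# Before this change, read tasks loaded far more blocks than needed and the
# limit was applied in memory; with pushdown, only enough Parquet fragments
# to cover 100 rows should be selected and read.
ds = ray.data.read_parquet("s3://bucket/large_dataset/")  # illustrative path
rows = ds.limit(100).take(100)
```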
Related issue number

Closes #51966
Checks
Please verify these checks before merging.
- I've signed off every commit (`git commit -s`) in this PR. (You need to ensure this)
- I've run `scripts/format.sh` to lint the changes in this PR. (You need to run this)
- If I added a method in Tune, I've added it in `doc/source/tune/api/` under the corresponding `.rst` file. (N/A for this change)
- Relevant tests are added or updated (e.g., extending `test_limit_pushdown` or adding new tests specific to Parquet limit pushdown).