-
Notifications
You must be signed in to change notification settings - Fork 2.9k
OpenAPI: Add min-rows-requested field to PlanTableScanRequest #14565
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
OpenAPI: Add min-rows-requested field to PlanTableScanRequest #14565
Conversation
open-api/rest-catalog-open-api.yaml
Outdated
| $ref: '#/components/schemas/Expression' | ||
| min-rows-requested: | ||
| description: | ||
| The minimum number of rows requested for the scan |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
[doubt] how is the server supposed to fulfill this request, when there are equality deletes present ?
Additional questions -
- may be we should be explicit that its an hint ?
- what would server do if its certain table doesn't have minimum number of row, 400 bad request ?
- we might need to define iceberg scan api like we do for projection and filter so that E2E plumbing works
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, this is intended as a "hint" or "indicator" from the client to help the server not have to return more than is necessary. It is not required for the server to return that many results (as the result of the scan may not have that many rows, that's why it's named "min-rows-requested" and not "min-rows-required"). Also, it's not a max limit either and the server can return more than the requested number.
I'm open to different name and better description for this
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What about The minimum number of rows requested for the scan. This is used as a hint to the server to not have to return more rows than necessary. It is not required for the server to return that many rows since the scan may not have that many rows. The server can also return more rows than requested
This should make it easier to understand that this is a hint and what the expectations are
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think @nastra description is pretty reasonable, +1. This is fundamentally a hint to the server, and I would almost certainly not want to fail if a table doesn't have enough records.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1 from my end too ! it addresses my questions above
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks, I updated the description.
| filter: Optional[Expression] = Field( | ||
| None, description='Expression used to filter the table data' | ||
| ) | ||
| min_rows_requested: Optional[int] = Field( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I find the name and description a bit confusing. For me, it is obvious that it might not return exactly the number of rows that are requested. For example, the table might be empty. I would much rather go with something that's analogue to SQL:
| min_rows_requested: Optional[int] = Field( | |
| limit: Optional[int] = Field( |
Of course, it might still return more rows since the Parquet file contains more rows.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I guess I look at things the other way since I feel like limit connotes a strict upper-bound (that is if there are sufficient matching rows) as how it means in SQL. Another idea: limit-hint?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree with @Fokko , what about soft_limit?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
just to jump in on bike shedding how about max_rows_required I think it lacks the ambiguity if someone isn't familiar with SQL and I think my issue with and hopefully makes it clearer that this is in fact an upper bound?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
just to pile on. target-scan-task-number
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is it actually data row count or scan task count? I thought it is later. but if it is former, this config can be target-row-count.
This is similar to the table properties for write.target-file-size-bytes or read.split.target-size.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if we are talking about fine grained control, maybe target-plan-size-bytes
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is it actually data row count or scan task count? I thought it is later. but if it is former, this config can be target-row-count.
I'll let @nastra confirm but I think it is a row count (as the original name suggests). target-row-count seems more ambiguous to me in terms of semantics.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is for rows and we're looking for a lower bound, hence why we're having the min in the name. We're open to better naming suggestions, but using something with limit in the name rather implies an upper bound, which wouldn't be correct here
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@nastra lower bound may not always be valid. e.g. maybe the table just don't have enough number of rows to satisfy the lower bound.
target is probably a good name for the desired plan size. Wondering why not using bytes (instead of rows) to express the target, similar to read.split.target-size?
Anyway, I would suggest target-plan-size-rows or target-plan-size-bytes.
Proposal to add to the spec a way for the client to give the server a hint/indicator for the amount of rows being requested so that the server does not need to return more results than is necessary. Just to be clear, this is not meant to mean the server must return X number of rows or that the server should return at most X of number of rows.
ML Discussion: https://lists.apache.org/thread/m51fxlsbt5yk219ypk2dhj07tlk3407b