fix: pre-warm listing file statistics cache during listing table creation #18971
…tion Signed-off-by: bharath-techie <[email protected]>
// Pre-warm statistics cache if collect_statistics is enabled
if session_state.config().collect_statistics() {
    let _ = table.list_files_for_scan(state, &[], None).await?;
- Is it okay to reuse this method to pre-warm, given that it does a couple more things after collecting the statistics?
- Also, is having no limit fine? list_file_statistics_cache doesn't seem to have any size limit, unlike the metadata cache.
cc: @alamb
> 2. Also is no limit fine ?

I think it should have a limit.
Maybe the pre-warming should also be done in the background.
If there are many files, doing it during table creation may slow things down.
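To make the "it should have a limit" suggestion concrete, here is a minimal sketch of a statistics cache capped by entry count. Everything in it is an assumption for illustration: `BoundedStatsCache`, `FileStats`, and the FIFO eviction policy are hypothetical and are not DataFusion's cache types or API. A real limit might be expressed in bytes and use LRU eviction instead.

```rust
use std::collections::{HashMap, VecDeque};

/// Hypothetical per-file statistics; a real entry would hold much more.
#[derive(Debug, Clone, PartialEq)]
pub struct FileStats {
    pub num_rows: usize,
}

/// A statistics cache capped at `capacity` entries that evicts the oldest
/// inserted path first (FIFO). Illustrative only, not DataFusion's cache.
pub struct BoundedStatsCache {
    capacity: usize,
    entries: HashMap<String, FileStats>,
    insertion_order: VecDeque<String>,
}

impl BoundedStatsCache {
    pub fn new(capacity: usize) -> Self {
        Self {
            capacity,
            entries: HashMap::new(),
            insertion_order: VecDeque::new(),
        }
    }

    /// Insert or overwrite statistics for `path`, evicting old entries
    /// whenever the cache would exceed its capacity.
    pub fn put(&mut self, path: &str, stats: FileStats) {
        if self.capacity == 0 {
            return; // a zero-capacity cache stores nothing
        }
        // Overwriting an existing key keeps its original eviction position.
        if self.entries.insert(path.to_string(), stats).is_none() {
            self.insertion_order.push_back(path.to_string());
        }
        while self.entries.len() > self.capacity {
            match self.insertion_order.pop_front() {
                Some(oldest) => {
                    self.entries.remove(&oldest);
                }
                None => break,
            }
        }
    }

    pub fn get(&self, path: &str) -> Option<&FileStats> {
        self.entries.get(path)
    }

    pub fn len(&self) -> usize {
        self.entries.len()
    }
}
```

With a capacity of 2, inserting statistics for three files leaves only the two most recently inserted paths cached, so pre-warming a very large table cannot grow the cache without bound.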
Thanks @martin-g for reviewing.
Agreed on having a limit.
But wouldn't doing it in the background result in inconsistent behavior? The documentation for this option says:

> Should DataFusion collect statistics when first creating a table. Has no effect after the table is created. Applies to the default ListingTableProvider in DataFusion. Defaults to true.

Based on the above documentation, wouldn't a user expect the statistics to be collected when the table is created, and any query after that to be optimized using them?
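The consistency concern can be illustrated with a small sketch that contrasts eager and background pre-warming. It uses plain `std` threads rather than an async runtime, and all names (`StatsCache`, `collect_into`, `create_table_eager`, `create_table_background`) are hypothetical stand-ins, not DataFusion code. The `start` channel exists only to make the timing deterministic for the demonstration.

```rust
use std::collections::HashMap;
use std::sync::{mpsc, Arc, Mutex};
use std::thread;

/// Hypothetical shared cache mapping file path to a row count.
pub type StatsCache = Arc<Mutex<HashMap<String, usize>>>;

/// Hypothetical stand-in for collecting statistics for each file.
pub fn collect_into(cache: &StatsCache, files: &[String]) {
    let mut guard = cache.lock().unwrap();
    for file in files {
        guard.insert(file.clone(), 100); // pretend row count
    }
}

/// Eager: statistics are cached before table creation returns.
pub fn create_table_eager(files: &[String]) -> StatsCache {
    let cache: StatsCache = Arc::new(Mutex::new(HashMap::new()));
    collect_into(&cache, files);
    cache
}

/// Background: creation returns immediately and a worker fills the cache
/// later. The worker waits on `start` so the demo can control the timing.
pub fn create_table_background(
    files: Vec<String>,
    start: mpsc::Receiver<()>,
) -> (StatsCache, thread::JoinHandle<()>) {
    let cache: StatsCache = Arc::new(Mutex::new(HashMap::new()));
    let worker_cache = Arc::clone(&cache);
    let handle = thread::spawn(move || {
        // Proceed once signalled (or once the sender is dropped).
        let _ = start.recv();
        collect_into(&worker_cache, &files);
    });
    (cache, handle)
}
```

In the eager variant, a query issued right after creation always sees cached statistics. In the background variant, a query that arrives before the worker finishes sees an empty cache, which is the inconsistency raised above.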
// Pre-warm statistics cache if collect_statistics is enabled
if session_state.config().collect_statistics() {
    let _ = table.list_files_for_scan(state, &[], None).await?;
Should errors in the pre-warming be propagated?
Maybe handle or ignore failures locally instead?
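The two options in this question can be sketched side by side. This is a synchronous, self-contained illustration under stated assumptions: `PrewarmError`, `prewarm_statistics`, `create_table_strict`, and `create_table_lenient` are hypothetical names, and the real call in the diff is async and returns a DataFusion result type.

```rust
use std::fmt;

/// Hypothetical error type standing in for the real error.
#[derive(Debug)]
pub struct PrewarmError(pub String);

impl fmt::Display for PrewarmError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        write!(f, "pre-warm failed: {}", self.0)
    }
}

impl std::error::Error for PrewarmError {}

/// Hypothetical stand-in for the pre-warming call; fails when `fail` is true.
pub fn prewarm_statistics(fail: bool) -> Result<usize, PrewarmError> {
    if fail {
        Err(PrewarmError("object store unavailable".to_string()))
    } else {
        Ok(3) // pretend statistics for three files were cached
    }
}

/// Option A: propagate the failure with `?`, so table creation fails too.
pub fn create_table_strict(fail: bool) -> Result<&'static str, PrewarmError> {
    prewarm_statistics(fail)?;
    Ok("table created")
}

/// Option B: treat pre-warming as best effort and handle failure locally.
pub fn create_table_lenient(fail: bool) -> Result<&'static str, PrewarmError> {
    if let Err(e) = prewarm_statistics(fail) {
        // A real implementation would likely use its logging facade here.
        eprintln!("ignoring pre-warm failure: {e}");
    }
    Ok("table created")
}
```

Option A matches the `?` in the diff: any failure while listing files or reading statistics aborts table creation. Option B keeps table creation working and simply leaves the cache cold, at the cost of hiding the failure from the caller.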
Pre-warm the listing file statistics cache during the create listing table flow, as suggested in #18952.
Reused `list_files_for_scan` to pre-warm.
Which issue does this PR close?
Rationale for this change
What changes are included in this PR?
Are these changes tested?
Yes, unit tested.
Are there any user-facing changes?