Conversation


@bharath-techie bharath-techie commented Nov 27, 2025

Pre-warm the listing file statistics cache during the create-listing-table flow, as suggested in #18952.
This reuses list_files_for_scan to do the pre-warming.

Which issue does this PR close?

Rationale for this change

What changes are included in this PR?

Are these changes tested?

Yes, covered by unit tests.

Are there any user-facing changes?

@github-actions (bot) added the core label (Core DataFusion crate) on Nov 27, 2025

// Pre-warm statistics cache if collect_statistics is enabled
if session_state.config().collect_statistics() {
    let _ = table.list_files_for_scan(state, &[], None).await?;
Author

  1. Is it okay to reuse this method for pre-warming, given that it does a couple more things after collecting the statistics?
  2. Also, is having no limit fine? list_file_statistics_cache doesn't seem to have any size limit, unlike the metadata cache.

cc: @alamb

Member

2. Also is no limit fine ?

I think it should have a limit.
It might also be better to do this in the background:
if there are many files, pre-warming may slow down table creation.
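
To make the "should have a limit" suggestion concrete, here is a minimal sketch of a statistics cache capped at a fixed number of entries. Everything in it is hypothetical: `FileStats`, `BoundedStatsCache`, and the FIFO eviction policy are illustrative stand-ins, not DataFusion's types or the policy this PR would have to adopt.

```rust
use std::collections::{HashMap, VecDeque};

/// Hypothetical per-file statistics (not a DataFusion type).
#[derive(Debug, Clone, PartialEq)]
struct FileStats {
    num_rows: usize,
}

/// Hypothetical statistics cache capped at `max_entries`,
/// evicting the oldest inserted file first (FIFO).
struct BoundedStatsCache {
    max_entries: usize,
    order: VecDeque<String>,
    entries: HashMap<String, FileStats>,
}

impl BoundedStatsCache {
    fn new(max_entries: usize) -> Self {
        Self {
            max_entries,
            order: VecDeque::new(),
            entries: HashMap::new(),
        }
    }

    /// Inserts or updates statistics for `path`, evicting the oldest
    /// entry when the cap is exceeded.
    fn put(&mut self, path: &str, stats: FileStats) {
        if self.max_entries == 0 {
            return; // a zero-sized cache stores nothing
        }
        // Only new keys affect the eviction order; updates keep their slot.
        if self.entries.insert(path.to_string(), stats).is_none() {
            self.order.push_back(path.to_string());
            if self.order.len() > self.max_entries {
                if let Some(oldest) = self.order.pop_front() {
                    self.entries.remove(&oldest);
                }
            }
        }
    }

    fn get(&self, path: &str) -> Option<&FileStats> {
        self.entries.get(path)
    }

    fn len(&self) -> usize {
        self.entries.len()
    }
}
```

With a cap in place, pre-warming a table with many files can no longer grow the cache without bound; the trade-off is that files beyond the cap fall back to collecting statistics at query time.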

Author

@bharath-techie bharath-techie Nov 28, 2025

Thanks @martin-g for reviewing.

Agreed on having a limit.

But wouldn't doing it in the background result in inconsistent behavior? The documentation for collect_statistics says:

Should DataFusion collect statistics when first creating a table. Has no effect after the table is created. Applies to the default ListingTableProvider in DataFusion. Defaults to true.

Based on the documentation above, wouldn't a user expect the statistics to be collected when the table is created, and any query after that to be optimized accordingly?
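
One way to reconcile the two positions, sketched under assumptions: start the pre-warm in the background at table creation, but have the first read wait for it to finish, so queries never observe a half-warmed cache. The `Table` type, `spawn_prewarm`, and the use of `std::thread` below are all hypothetical and chosen only to keep the sketch self-contained; DataFusion itself is async, and nothing here reflects its actual API.

```rust
use std::collections::HashMap;
use std::sync::{Arc, Mutex};
use std::thread::{self, JoinHandle};

/// Hypothetical shared cache: file path -> row count.
type StatsCache = Arc<Mutex<HashMap<String, usize>>>;

/// Starts pre-warming on a background thread and returns its handle.
fn spawn_prewarm(cache: StatsCache, files: Vec<(String, usize)>) -> JoinHandle<()> {
    thread::spawn(move || {
        for (path, num_rows) in files {
            cache.lock().unwrap().insert(path, num_rows);
        }
    })
}

/// Hypothetical table that remembers a pending pre-warm task.
struct Table {
    cache: StatsCache,
    pending: Option<JoinHandle<()>>,
}

impl Table {
    /// Table creation returns immediately; pre-warming continues in the background.
    fn create(files: Vec<(String, usize)>) -> Self {
        let cache = StatsCache::default();
        let pending = Some(spawn_prewarm(Arc::clone(&cache), files));
        Table { cache, pending }
    }

    /// The first read waits for the pre-warm to finish, so every query
    /// sees the same fully populated cache.
    fn cached_row_count(&mut self, path: &str) -> Option<usize> {
        if let Some(handle) = self.pending.take() {
            handle.join().expect("pre-warm thread panicked");
        }
        self.cache.lock().unwrap().get(path).copied()
    }
}
```

This keeps table creation fast while preserving the documented expectation that statistics are available to any query issued after creation; the cost simply moves to the first query if the pre-warm has not finished by then.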


// Pre-warm statistics cache if collect_statistics is enabled
if session_state.config().collect_statistics() {
    let _ = table.list_files_for_scan(state, &[], None).await?;
Member

Should errors during pre-warming be propagated?
Maybe handle or ignore failures locally instead?
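
As an illustration of handling failures locally, the sketch below treats pre-warming as best-effort: an error is logged and table creation still succeeds, instead of being propagated with `?`. The names `prewarm_best_effort` and `create_table` are hypothetical, and the closure stands in for the `list_files_for_scan` call; this is not the PR's code.

```rust
use std::fmt::Display;

/// Runs a best-effort pre-warm step. Returns `true` on success;
/// on failure the error is logged and swallowed rather than propagated.
fn prewarm_best_effort<T, E: Display>(prewarm: impl FnOnce() -> Result<T, E>) -> bool {
    match prewarm() {
        Ok(_) => true,
        Err(e) => {
            eprintln!("statistics cache pre-warm failed, continuing without it: {e}");
            false
        }
    }
}

/// Hypothetical stand-in for the table creation flow: a failed pre-warm
/// does not fail table creation.
fn create_table(
    collect_statistics: bool,
    prewarm: impl FnOnce() -> Result<usize, String>,
) -> Result<&'static str, String> {
    if collect_statistics {
        // Deliberately no `?` here: pre-warming is an optimization only.
        let _warmed = prewarm_best_effort(prewarm);
    }
    Ok("table created")
}
```

The trade-off is that a swallowed error can hide a real problem (for example, an unreachable object store) until the first query, so logging the failure matters.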


Development

Successfully merging this pull request may close these issues.

Bug: statistics not collected automatically upon creation of ListingTable

2 participants