
[branch-48] Set the default value of datafusion.execution.collect_statistics to true #16447 #16659

Open: 2 commits to merge into base branch-48


@blaginin blaginin commented Jul 2, 2025

Backport #16447 by @AdamGS to #16486

… `true` (apache#16447)

* fix sqllogicaltests
* Add upgrade note

(cherry picked from commit 2d7ae09)
@blaginin blaginin self-assigned this Jul 2, 2025
@github-actions github-actions bot added the documentation, core, sqllogictest, and common labels Jul 2, 2025
@blaginin blaginin changed the title from "Backport #16447 to df48" to "[branch-48] Backport #16447 to df48" Jul 2, 2025
@blaginin blaginin changed the title from "[branch-48] Backport #16447 to df48" to "[branch-48] Backport #16447 to df48: Set the default value of datafusion.execution.collect_statistics to true #16447" Jul 2, 2025
@alamb alamb changed the title from "[branch-48] Backport #16447 to df48: Set the default value of datafusion.execution.collect_statistics to true #16447" to "[branch-48] Set the default value of datafusion.execution.collect_statistics to true #16447" Jul 2, 2025

alamb commented Jul 2, 2025

This is a somewhat subtle issue, so I will try to summarize:

In DataFusion 47 and earlier,

  1. Calling SessionContext::register_parquet collected statistics for the table at create time (slower to create the table, but potentially faster queries)
  2. Calling CREATE EXTERNAL TABLE did not collect statistics (faster to create the table, but potentially slower queries)

There are more details about this on the ticket from @davisp here:

In DataFusion 48.0.0:

  1. Make SessionContext::register_parquet obey collect_statistics config #16080 changed the behavior so that both SessionContext::register_parquet and CREATE EXTERNAL TABLE DID NOT collect statistics.

However, this meant that users who were relying on statistics, such as @AdamGS, saw queries get slower (see #16444).

Thus this PR proposes changing DataFusion 48.0.1 so that:

  1. Both SessionContext::register_parquet and CREATE EXTERNAL TABLE WILL collect statistics.

Note that this is consistent with the behavior on the latest main (what will be released in DataFusion 49.0.0):

Since we have already made this change on main, the thinking is that by changing 48.0.1 we'll avoid people fully migrating to the "no default statistics" behavior only to have to change back again in 49.
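
For anyone who prefers the faster table creation of 48.0.0, the setting can still be turned off explicitly. The following is a minimal sketch, not taken from the PR itself: the table name `t` and the file path are hypothetical placeholders, and it assumes the option is set in the same session before the table is created.

```sql
-- Opt out of statistics collection before creating the table
-- (restores the 48.0.0 behavior: faster create, potentially slower queries).
SET datafusion.execution.collect_statistics = false;

-- Hypothetical table name and path, for illustration only.
CREATE EXTERNAL TABLE t
STORED AS PARQUET
LOCATION 'path/to/data.parquet';
```

Setting the option back to `true` (the proposed default) before `CREATE EXTERNAL TABLE` should give the statistics-collecting behavior described above.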

@alamb alamb left a comment

Thank you @blaginin
