@yyanyy yyanyy commented Jan 20, 2026

Which Delta project/connector is this regarding?

  • Spark
  • Standalone
  • Flink
  • Kernel
  • Other (fill in here)

Description

Update the SparkScan estimated size in bytes to account for column pruning. This behavior change follows the interface comment in SupportsReportStatistics, where this stats API is implemented: once column pruning is pushed down, a DSv2 scan returns only the columns that remain after pruning, so the reported size should reflect those columns.

For the estimation calculation, it follows the exact same behavior as in SizeInBytesOnlyStatsPlanVisitor.visitUnaryNode from Spark.

  • I looked into Delta features such as materializePartitionColumns, which allows partition columns to be written to the data files and may in theory affect the estimated size. However, this feature can be turned on and off arbitrarily, so a table can contain some files with materialized partition columns and others without. Given that the existing estimation based on schema size is arguably already inaccurate, I decided to keep the logic as it is today for complete feature parity. This could be a future improvement area.
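
For reviewers unfamiliar with the Spark side, the scaling described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the code in this PR: the class and method names are hypothetical, the column widths are made up, and the fixed 8-byte row overhead is assumed to match what Spark's `EstimationUtils.getSizePerRow` uses. The idea is to scale the total size by the ratio of the pruned row size to the full row size, and to never report zero.

```java
import java.math.BigInteger;

// Hypothetical helper, for illustration only; not part of Delta or Spark.
final class PrunedSizeEstimate {
  // Generic per-row overhead (assumed to mirror Spark's EstimationUtils.getSizePerRow).
  static final long ROW_OVERHEAD_BYTES = 8L;

  // Estimated row size: fixed overhead plus the default size of each column.
  static long rowSize(long... columnSizes) {
    long size = ROW_OVERHEAD_BYTES;
    for (long columnSize : columnSizes) {
      size += columnSize;
    }
    return size;
  }

  // Scale the unpruned size by prunedRowSize / fullRowSize, assuming the row count
  // is unchanged by pruning. The result is floored at 1 so that downstream
  // products of sizes (e.g. for joins) never collapse to zero.
  static BigInteger estimate(BigInteger totalBytes, long fullRowSize, long prunedRowSize) {
    BigInteger scaled = totalBytes
        .multiply(BigInteger.valueOf(prunedRowSize))
        .divide(BigInteger.valueOf(fullRowSize));
    return scaled.signum() == 0 ? BigInteger.ONE : scaled;
  }

  public static void main(String[] args) {
    long full = rowSize(4, 8, 20); // three columns: 8 + 4 + 8 + 20 = 40
    long pruned = rowSize(4);      // one column kept: 8 + 4 = 12
    // 1000 * 12 / 40 = 300
    System.out.println(estimate(BigInteger.valueOf(1000), full, pruned));
  }
}
```

Because the row overhead is always included, the divisor is never zero, even when every column is pruned away.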

How was this patch tested?

Unit tests. Also tested this against local Spark code with corresponding changes and extra logging, and verified that in CostBasedJoinReorder the size estimate is exactly the same as before:

v1
61329:[JoinReorderWithGroups] Input: Project
61330:[JoinReorderWithGroups]   sizeInBytes=98028
61331:[JoinReorderWithGroups]   output.size=1
61332:[JoinReorderWithGroups]   canBroadcastBySize=true
61333:[JoinReorderWithGroups]   Leaf: LogicalRelation, sizeInBytes=2418043, output.size=28
61334:[JoinReorderWithGroups] Input: Project
61335:[JoinReorderWithGroups]   sizeInBytes=949
61336:[JoinReorderWithGroups]   output.size=2
61337:[JoinReorderWithGroups]   canBroadcastBySize=true
61338:[JoinReorderWithGroups]   Leaf: LogicalRelation, sizeInBytes=11636, output.size=26
---
v2
29166:[JoinReorderWithGroups] Input: Project
29167:[JoinReorderWithGroups]   sizeInBytes=98028
29168:[JoinReorderWithGroups]   output.size=1
29169:[JoinReorderWithGroups]   canBroadcastBySize=true
29170:[JoinReorderWithGroups]   Leaf: DataSourceV2ScanRelation, sizeInBytes=130705, output.size=2
29171:[JoinReorderWithGroups] Input: Filter
29172:[JoinReorderWithGroups]   sizeInBytes=949
29173:[JoinReorderWithGroups]   output.size=2
29174:[JoinReorderWithGroups]   canBroadcastBySize=true
29175:[JoinReorderWithGroups]   Leaf: DataSourceV2ScanRelation, sizeInBytes=949, output.size=2

Does this PR introduce any user-facing changes?

no

/**
* Computes the estimated size in bytes accounting for column projection.
*
* <p>This mirrors what {@code SizeInBytesOnlyStatsPlanVisitor.visitUnaryNode} (from Spark code)