[Spark] Update SparkScan estimated size in bytes to take column pruning into account #5894
Which Delta project/connector is this regarding?

Spark
Description
Update SparkScan's estimated size in bytes to take column pruning into account. This behavior change follows the interface comment on SupportsReportStatistics, the interface through which this stats API is implemented, since with DSv2 the scan returns only the pruned columns when column pruning is pushed down.
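For context, here is a minimal sketch (not the actual Delta/SparkScan code) of the contract described above: a DSv2 scan that implements SupportsReportStatistics already exposes only the pruned columns through readSchema(), so the statistics it reports should describe those columns rather than the full table schema. Class and parameter names are illustrative.

```scala
import java.util.OptionalLong

import org.apache.spark.sql.connector.read.{Statistics, SupportsReportStatistics}
import org.apache.spark.sql.types.StructType

// Illustrative only: a scan whose reported statistics reflect the pruned read schema.
class ColumnPrunedScan(prunedSchema: StructType, estimatedPrunedSizeInBytes: Long)
  extends SupportsReportStatistics {

  // After column pruning is pushed down, DSv2 exposes only the remaining columns here.
  override def readSchema(): StructType = prunedSchema

  // Per the SupportsReportStatistics comment, the reported stats should match readSchema(),
  // i.e. the size of the pruned columns, not the full table schema.
  override def estimateStatistics(): Statistics = new Statistics {
    override def sizeInBytes(): OptionalLong = OptionalLong.of(estimatedPrunedSizeInBytes)
    override def numRows(): OptionalLong = OptionalLong.empty()
  }
}
```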
For the estimation calculation, it follows exactly the same behavior as SizeInBytesOnlyStatsPlanVisitor.visitUnaryNode in Spark.

materializePartitionColumns allows writing partition columns into the data files, which may in theory affect the estimated size; however, this feature can be turned on and off arbitrarily, so a table may contain some files with partition columns materialized in the data files and others without. Given that the existing schema-size-based estimation is arguably already not accurate, I decided to keep the logic as it is today for complete feature parity, but this could be a future improvement area.
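As a rough sketch of the ratio-based scaling that visitUnaryNode applies (function name, rounding choice, and numbers are illustrative, not the exact SparkScan code): the raw size is scaled by the pruned schema's default row size over the full schema's default row size.

```scala
import org.apache.spark.sql.types.{LongType, StringType, StructField, StructType}

object PrunedSizeEstimate {
  // Scale the raw scan size by the ratio of pruned-to-full default row sizes,
  // mirroring the schema-size-based estimation in SizeInBytesOnlyStatsPlanVisitor.
  def estimate(rawSizeInBytes: Long, fullSchema: StructType, prunedSchema: StructType): Long = {
    val fullRowSize = math.max(fullSchema.defaultSize, 1) // guard against an empty schema
    val prunedRowSize = prunedSchema.defaultSize
    math.ceil(rawSizeInBytes.toDouble * prunedRowSize / fullRowSize).toLong
  }

  def main(args: Array[String]): Unit = {
    val full = StructType(Seq(
      StructField("id", LongType),        // defaultSize = 8
      StructField("payload", StringType), // defaultSize = 20
      StructField("tag", StringType)))    // defaultSize = 20
    val pruned = StructType(Seq(StructField("id", LongType)))
    // A 48 MB scan reading only `id` is estimated at 48 MB * 8 / 48 = 8 MB.
    println(estimate(48L * 1024 * 1024, full, pruned)) // 8388608
  }
}
```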
How was this patch tested?

Unit tests. Also tested this against local Spark code with corresponding changes and logging, and verified that in CostBasedJoinReorder the size estimate is exactly the same as before:

Does this PR introduce any user-facing changes?
No.