[Spark] Update SparkScan estimated size in bytes to take column pruning into account #5894
Which Delta project/connector is this regarding?

Spark
Description
Update SparkScan's estimated size in bytes to take column pruning into account. This behavior change follows the interface comment on SupportsReportStatistics, the interface through which this stats API is implemented, since with DSv2 the scan returns only the pruned columns when column pruning is pushed down.
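For context, here is a minimal sketch (not the actual Delta/SparkScan code) of the contract described above: a DSv2 scan that implements SupportsReportStatistics already exposes only the pruned columns through readSchema(), so the statistics it reports should describe those columns rather than the full table schema. Class and parameter names are illustrative.

```scala
import java.util.OptionalLong

import org.apache.spark.sql.connector.read.{Statistics, SupportsReportStatistics}
import org.apache.spark.sql.types.StructType

// Illustrative only: a scan whose reported statistics reflect the pruned read schema.
class ColumnPrunedScan(prunedSchema: StructType, estimatedPrunedSizeInBytes: Long)
  extends SupportsReportStatistics {

  // After column pruning is pushed down, DSv2 exposes only the remaining columns here.
  override def readSchema(): StructType = prunedSchema

  // Per the SupportsReportStatistics comment, the reported stats should match readSchema(),
  // i.e. the size of the pruned columns, not the full table schema.
  override def estimateStatistics(): Statistics = new Statistics {
    override def sizeInBytes(): OptionalLong = OptionalLong.of(estimatedPrunedSizeInBytes)
    override def numRows(): OptionalLong = OptionalLong.empty()
  }
}
```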
For the estimation calculation, it follows exactly the same behavior as SizeInBytesOnlyStatsPlanVisitor.visitUnaryNode in Spark.

materializePartitionColumns allows writing partition columns into the data files, which may in theory affect the estimated size; however, this feature can be turned on and off arbitrarily, so a table may contain some files with partition columns materialized in the data files and others without. Given that the existing schema-size-based estimation is arguably already not accurate, I decided to keep the logic as it is today for complete feature parity, but this could be a future improvement area.
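As a rough sketch of the ratio-based scaling that visitUnaryNode applies (function name, rounding choice, and numbers are illustrative, not the exact SparkScan code): the raw size is scaled by the pruned schema's default row size over the full schema's default row size.

```scala
import org.apache.spark.sql.types.{LongType, StringType, StructField, StructType}

object PrunedSizeEstimate {
  // Scale the raw scan size by the ratio of pruned-to-full default row sizes,
  // mirroring the schema-size-based estimation in SizeInBytesOnlyStatsPlanVisitor.
  def estimate(rawSizeInBytes: Long, fullSchema: StructType, prunedSchema: StructType): Long = {
    val fullRowSize = math.max(fullSchema.defaultSize, 1) // guard against an empty schema
    val prunedRowSize = prunedSchema.defaultSize
    math.ceil(rawSizeInBytes.toDouble * prunedRowSize / fullRowSize).toLong
  }

  def main(args: Array[String]): Unit = {
    val full = StructType(Seq(
      StructField("id", LongType),        // defaultSize = 8
      StructField("payload", StringType), // defaultSize = 20
      StructField("tag", StringType)))    // defaultSize = 20
    val pruned = StructType(Seq(StructField("id", LongType)))
    // A 48 MB scan reading only `id` is estimated at 48 MB * 8 / 48 = 8 MB.
    println(estimate(48L * 1024 * 1024, full, pruned)) // 8388608
  }
}
```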
How was this patch tested?

Unit tests. Also tested this against local Spark code with corresponding changes and logging, and verified that in CostBasedJoinReorder the size estimate is exactly the same as before:

Does this PR introduce any user-facing changes?
No.