feat(datafusion): implement the partitioning node for DataFusion to define the partitioning #1620
Conversation
/// If no suitable hash columns are found (e.g., unpartitioned, non-bucketed table),
/// falls back to round-robin batch partitioning for even load distribution.
fn determine_partitioning_strategy(
This is interesting to see. At first I thought of just two cases:
- If it's a partitioned table, we should just hash partition.
- If it's not partitioned, we should just use round-robin partitioning.
However, this reminds me of another case: range-only partitioning, e.g. where the only partitions are on columns like date or time. I think in this case we should also use round-robin partitioning, since most data is concentrated in just a few partitions.
Also, I don't think we should take write.distribution-mode into account for now. The examples you use are for Spark, but they are not applicable to DataFusion.
However, this reminds me of another case: range-only partitioning, e.g. where the only partitions are on columns like date or time. I think in this case we should also use round-robin partitioning, since most data is concentrated in just a few partitions.
Hmm, you are right. Range partitions concentrate data in the most recent partitions, making hash partitioning counterproductive (consider a date column with a temporal partition).
Since DataFusion doesn't provide a range partitioning variant, the fallback is round-robin rather than hashing.
Briefly:
- Hash partition: Only on bucket columns (partition spec + sort order)
- Round-robin: Everything else (unpartitioned, range, identity, temporal transforms)
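To make the bucket-only rule concrete, here is a minimal sketch of that decision. The helper name pick_partitioning and the column_index lookup closure are hypothetical, it only inspects the partition spec (not the sort order), and it keys the hash columns off the partition-field names rather than resolving source_id against the table schema, so it merely mirrors the role of determine_partitioning_strategy:

```rust
use std::sync::Arc;

use datafusion::physical_expr::expressions::Column;
use datafusion::physical_expr::PhysicalExpr;
use datafusion::physical_plan::Partitioning;
use iceberg::spec::{PartitionSpec, Transform};

/// Sketch: hash-partition only on bucket transforms; everything else
/// (unpartitioned, identity, range/temporal transforms such as day/hour)
/// falls back to round-robin batches.
fn pick_partitioning(
    spec: &PartitionSpec,
    // Hypothetical lookup from a column name to its index in the input schema.
    column_index: impl Fn(&str) -> Option<usize>,
    target_partitions: usize,
) -> Partitioning {
    let bucket_exprs: Vec<Arc<dyn PhysicalExpr>> = spec
        .fields()
        .iter()
        .filter(|f| matches!(f.transform, Transform::Bucket(_)))
        .filter_map(|f| {
            column_index(f.name.as_str())
                .map(|idx| Arc::new(Column::new(&f.name, idx)) as Arc<dyn PhysicalExpr>)
        })
        .collect();

    if bucket_exprs.is_empty() {
        // No bucket columns: spread batches evenly instead of hashing.
        Partitioning::RoundRobinBatch(target_partitions)
    } else {
        Partitioning::Hash(bucket_exprs, target_partitions)
    }
}
```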
Also, I don't think we should take write.distribution-mode into account for now. The examples you use are for Spark, but they are not applicable to DataFusion.
Oh, good point, I misunderstood this. I thought it was an iceberg-rust table property.
let hash_column_names: Vec<&str> = partition_spec
Why do we need this part? The physical plan is generated by IcebergTableProvider, so we can be sure that the _partition column will be inserted.
Since the physical plan is generated by IcebergTableProvider, which always calls project_with_partition() for partitioned tables, we can guarantee that the _partition column will be present, so this part is unnecessary if we stick to the DataFusion plan creation. Let me rework this: I will instead require the _partition column to be available.
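As a minimal sketch of that reworked direction (the helper name is hypothetical; it assumes the reserved _partition column name and inspects only the input Arrow schema), the node could simply fail planning when the column added by project_with_partition() is missing:

```rust
use datafusion::arrow::datatypes::SchemaRef;
use datafusion::error::{DataFusionError, Result};

/// Sketch: require the `_partition` column produced upstream by
/// `project_with_partition()` instead of re-deriving partition values here.
fn partition_column_index(input_schema: &SchemaRef) -> Result<usize> {
    input_schema.index_of("_partition").map_err(|e| {
        DataFusionError::Plan(format!(
            "expected a `_partition` column in the repartition input schema: {e}"
        ))
    })
}
```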
/// Returns whether repartitioning is actually needed by comparing input and desired partitioning
fn needs_repartitioning(input: &Arc<dyn ExecutionPlan>, desired: &Partitioning) -> bool {
I'm confused about why we need this part. For a partitioned table, the input is expected to be a ProjectionExec, and we should always add the partitioning. For an unpartitioned table, we should simply return RoundRobinBatch.
The benefit of this check is to avoid redundant repartitioning when the input already has the correct partitioning, when the node is built multiple times, or during optimization passes.
Since DataFusion normally builds the plan once (no manual calls), it's okay to remove it.
Tell me what you think; I'm okay with removing this part while integrating the other changes.
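For reference, a sketch of what the discussed check could look like, assuming a DataFusion version where output_partitioning() is provided by the ExecutionPlanProperties trait. It only compares variants and partition counts; a complete check would also compare the hash expressions:

```rust
use std::sync::Arc;

use datafusion::physical_plan::{ExecutionPlan, ExecutionPlanProperties, Partitioning};

/// Sketch: skip inserting a repartition node when the input already exposes
/// the desired partitioning shape.
fn needs_repartitioning(input: &Arc<dyn ExecutionPlan>, desired: &Partitioning) -> bool {
    match (input.output_partitioning(), desired) {
        // Same round-robin fan-out: nothing to do.
        (Partitioning::RoundRobinBatch(a), Partitioning::RoundRobinBatch(b)) => a != b,
        // Hash vs. hash: only the partition counts are compared in this sketch.
        (Partitioning::Hash(_, a), Partitioning::Hash(_, b)) => a != b,
        // Any other combination: repartition to be safe.
        _ => true,
    }
}
```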
…the best partition strategy for Iceberg for writing
- Implement hash partitioning for partitioned/bucketed tables
- Use round-robin partitioning for unpartitioned tables
- Support range distribution mode approximation via sort columns
…robin for range partitions Signed-off-by: Florian Valeye <[email protected]>
…rtitioning strategy Signed-off-by: Florian Valeye <[email protected]>
…e unused parameters
Signed-off-by: Florian Valeye <[email protected]>
…itioning strategy logic Signed-off-by: Florian Valeye <[email protected]>
…taFusion plan and requiring the _partition column Signed-off-by: Florian Valeye <[email protected]>
Which issue does this PR close?
What changes are included in this PR?
Implement a physical execution repartition node that determines the relevant DataFusion partitioning strategy based on the Iceberg table schema and metadata.
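As an illustration of how the chosen strategy could be wired into the write path (a sketch under stated assumptions, not necessarily how repartition.rs implements it), the selected Partitioning can be applied by wrapping the projected input in DataFusion's RepartitionExec:

```rust
use std::sync::Arc;

use datafusion::error::Result;
use datafusion::physical_plan::repartition::RepartitionExec;
use datafusion::physical_plan::{ExecutionPlan, Partitioning};

/// Sketch: wrap the projected input with the partitioning selected from the
/// Iceberg table metadata before handing the plan to the writer.
fn repartition_for_write(
    input: Arc<dyn ExecutionPlan>,
    partitioning: Partitioning,
) -> Result<Arc<dyn ExecutionPlan>> {
    Ok(Arc::new(RepartitionExec::try_new(input, partitioning)?))
}
```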
Minor change: I created a new schema_ref() helper method.
Are these changes tested?
Yes, with unit tests.