Skip to content

Conversation

@fvaleye
Copy link
Contributor

@fvaleye fvaleye commented Aug 21, 2025

Which issue does this PR close?

What changes are included in this PR?

Implement a physical execution repartition node that determines the relevant DataFusion partitioning strategy based on the Iceberg table schema and metadata.

  • Implement hash partitioning for partitioned/bucketed tables
  • Use round-robin partitioning for unpartitioned tables

Minor change: I created a new schema_ref() helper method.

Are these changes tested?

Yes, with unit tests

@fvaleye fvaleye changed the title feat(datafusion): implement repartition node for DataFusion with feat(datafusion): implement the partitioning node for DataFusion to define the partitioning Aug 21, 2025
///
/// If no suitable hash columns are found (e.g., unpartitioned, non-bucketed table),
/// falls back to round-robin batch partitioning for even load distribution.
fn determine_partitioning_strategy(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is interesting to see. At first I just thought two cases:

  1. If it's partitioned table, we should just hash partition.
  2. If it's not partitioned, we should just use round robin partition.

However, this reminds me another case: range only partition, e.g. we only has partitions like date, time. I think in this case we should also use round robin partition since in this case most data are focused in several partitions.

Also I don't think we should take into account write.distribution-mode for now. The example you use are for spark, but not applicable for datafusion.

Copy link
Contributor Author

@fvaleye fvaleye Aug 25, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

However, this reminds me another case: range only partition, e.g. we only has partitions like date, time. I think in this case we should also use round robin partition since in this case most data are focused in several partitions.

Hum. You are right. The range partitions concentrate data in recent partitions, making hash partitioning counterproductive (considering a date with a temporal partition).
Since DataFusion doesn't provide Range, the fallback is round-robin and not hashing.

Briefly:

  • Hash partition: Only on bucket columns (partition spec + sort order)
  • Round-robin: Everything else (unpartitioned, range, identity, temporal transforms)

Also I don't think we should take into account write.distribution-mode for now. The example you use are for spark, but not applicable for datafusion.

Oh, good point, I misunderstood this. I thought it was an iceberg-rust table property.

@fvaleye fvaleye force-pushed the feature/implement-repartition-node-for-insert-into-datafusion branch from e8eb255 to 48ffd26 Compare August 29, 2025 12:10
@fvaleye fvaleye force-pushed the feature/implement-repartition-node-for-insert-into-datafusion branch 2 times, most recently from 77ca951 to 1a8d65d Compare September 1, 2025 18:28
@fvaleye fvaleye force-pushed the feature/implement-repartition-node-for-insert-into-datafusion branch from 0ca0994 to d3670f1 Compare October 27, 2025 08:57
@fvaleye fvaleye force-pushed the feature/implement-repartition-node-for-insert-into-datafusion branch 2 times, most recently from 820a827 to bede4ed Compare October 27, 2025 10:39
}
}

let hash_column_names: Vec<&str> = partition_spec
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why we need this part? The physical plan is generated by IcebergTableProvider, so we could ensure that the _partition column will be inserted.

Copy link
Contributor Author

@fvaleye fvaleye Oct 28, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since the physical plan is generated by IcebergTableProvider, which always calls project_with_partition() for partitioned tables... we can guarantee that the _partition column will be present... so it is unnecessary if we stick to the DataFusion plan creation! Let me rework this. I will add the need for the column _partition to be available.

}

/// Returns whether repartitioning is actually needed by comparing input and desired partitioning
fn needs_repartitioning(input: &Arc<dyn ExecutionPlan>, desired: &Partitioning) -> bool {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm confused why we need this part. For partitioned table, the input is expected to be a ProjectExec, and we should always add the partitioning. For unpartitioned table, we should simply return RoundRobinBatch

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The benefit of this check is to avoid redundant repartitioning when the input already has correct partitioning, when it's called multiple times, or during optimization passes.
Since DataFusion normally builds the plan once (no manual calls), it's okay to remove it.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Tell me what you think; I'm okay with removing this part of the integration of the other changes.

@fvaleye fvaleye force-pushed the feature/implement-repartition-node-for-insert-into-datafusion branch 2 times, most recently from d95b86b to ccc1b67 Compare October 28, 2025 14:56
…the best partition strategy for Iceberg for writing

- Implement hash partitioning for partitioned/bucketed tables
- Use round-robin partitioning for unpartitioned tables
- Support range distribution mode approximation via sort columns
…robin for range partitions

Signed-off-by: Florian Valeye <[email protected]>
…itioning strategy logic

Signed-off-by: Florian Valeye <[email protected]>
@fvaleye fvaleye force-pushed the feature/implement-repartition-node-for-insert-into-datafusion branch from ccc1b67 to 3547194 Compare October 28, 2025 15:05
…taFusion plan and requiring the _partition column

Signed-off-by: Florian Valeye <[email protected]>
@fvaleye fvaleye force-pushed the feature/implement-repartition-node-for-insert-into-datafusion branch from 3547194 to 479a0f6 Compare October 28, 2025 15:09
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Implement Repartition Node: Decide when the partitioning mode for the best parallelism

2 participants