Description
Apache Iceberg version
1.5.0
Query engine
Spark
Please describe the bug 🐞
Problem
When copying Iceberg tables with position delete files using SparkActions.copyTable(), the operation fails with FileAlreadyExistsException during parallel processing:
org.apache.hadoop.fs.FileAlreadyExistsException: /staging/00001-deletes.parquet already exists
This issue is reproducible when:
- Multiple manifests reference the same position delete file (e.g., after manifest compaction)
- Different position delete files have the same filename in different directories
Root Cause
The stagingPath() method only used the filename to generate staging paths:
// OLD CODE
private static String stagingPath(String originalPath, String stagingLocation) {
  return stagingLocation + fileName(originalPath); // Only filename!
}
Collision scenarios:
dir1/00001-deletes.parquet → staging/00001-deletes.parquet
dir2/00001-deletes.parquet → staging/00001-deletes.parquet ❌ COLLISION
When Spark processes manifests in parallel (mapPartitions at line 784), multiple tasks simultaneously try to write to the same staging path, causing the exception.
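The collision can be reproduced outside Spark with a minimal, self-contained sketch. Everything in it is illustrative: the class `StagingPathSketch`, the `fileName` helper body, and the `hashedStagingPath` alternative are assumptions made for this example, not the actual Iceberg code or the fix that was merged. The sketch only shows that a filename-only mapping sends both inputs to one staging path, and that mixing the parent directory into the name is one possible way to keep them apart.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

public class StagingPathSketch {

  // Assumed behavior of the fileName() helper: everything after the last '/'.
  public static String fileName(String path) {
    return path.substring(path.lastIndexOf('/') + 1);
  }

  // Mirrors the filename-only mapping quoted in the issue.
  public static String filenameOnlyStagingPath(String originalPath, String stagingLocation) {
    return stagingLocation + fileName(originalPath);
  }

  // Hypothetical alternative: prefix the filename with a hash of the parent directory,
  // so files with the same name in different directories get different staging paths.
  public static String hashedStagingPath(String originalPath, String stagingLocation) {
    int slash = originalPath.lastIndexOf('/');
    String parent = slash < 0 ? "" : originalPath.substring(0, slash);
    return stagingLocation + shortHash(parent) + "-" + fileName(originalPath);
  }

  // First 8 bytes of SHA-256 as hex; length chosen arbitrarily for the sketch.
  private static String shortHash(String value) {
    try {
      MessageDigest md = MessageDigest.getInstance("SHA-256");
      byte[] digest = md.digest(value.getBytes(StandardCharsets.UTF_8));
      StringBuilder hex = new StringBuilder();
      for (int i = 0; i < 8; i++) {
        hex.append(String.format("%02x", digest[i]));
      }
      return hex.toString();
    } catch (NoSuchAlgorithmException e) {
      // SHA-256 is required on every Java platform, so this should not happen.
      throw new IllegalStateException(e);
    }
  }

  public static void main(String[] args) {
    String staging = "staging/";
    String first = "dir1/00001-deletes.parquet";
    String second = "dir2/00001-deletes.parquet";

    // Filename-only mapping: both inputs resolve to the same staging path.
    System.out.println(filenameOnlyStagingPath(first, staging));
    System.out.println(filenameOnlyStagingPath(second, staging));

    // Directory-aware mapping: the two inputs resolve to different staging paths.
    System.out.println(hashedStagingPath(first, staging));
    System.out.println(hashedStagingPath(second, staging));
  }
}
```

A directory-aware name only addresses the second scenario (same filename in different directories). When several manifests reference the very same delete file, the original paths are identical, so any deterministic mapping still yields one staging path; that case would presumably need the delete files to be de-duplicated before rewriting, or the write to tolerate an already existing target.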
Willingness to contribute
- I can contribute a fix for this bug independently
- I would be willing to contribute a fix for this bug with guidance from the Iceberg community
- I cannot contribute a fix for this bug at this time