Corrupted table because manifest list file of an active snapshot is removed when commit is empty #14583

@priyankar-stripe

Description

Apache Iceberg version

1.8.1

Query engine

Flink

Please describe the bug 🐞

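In https://github.com/apache/iceberg/pull/10523/files, we changed the cleanup logic to stop fetching the latest snapshot from the metastore and instead maintain an in-memory snapshot instance for cleanup operations. When a commit that actually succeeded is retried and the retry turns out to be empty, this cleanup can remove the manifest list file of the active snapshot.
Specifically, this is the sequence we saw: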

  1. Initial Commit Attempt: Flink attempts to commit snapshot <snapshot_id> to the metastore. The commit succeeds on the metastore side, but Flink receives a transient network error and incorrectly treats the commit as failed.
  2. Retry with Stale Metadata: RetryingMetaStoreClient retries the commit, but since the table has already been modified, the metastore returns a "The table has been modified" error. This triggers a CommitFailedException (see https://github.com/apache/iceberg/blob/1.8.x/hive-metastore/src/main/java/org/apache/iceberg/hive/HiveTableOperations.java#L277-L278).
  3. SnapshotProducer Retry: SnapshotProducer catches this exception and retries the operation. It reuses the same snapshot ID but generates a new manifest list file: snap-<snapshot_id>-2-<uuid>.avro (note the incremented attempt number), different from the already-committed manifest list snap-<snapshot_id>-1-<uuid>.avro.
  4. No-Op Detection: Since there are no actual changes between these two attempts (same snapshot content), Iceberg detects this as a no-op and skips the commit (see https://github.com/apache/iceberg/blob/1.8.x/core/src/main/java/org/apache/iceberg/SnapshotProducer.java#L448-L453).
  5. Incorrect Cleanup: The cleanup logic then runs, but it incorrectly assumes snap-<snapshot_id>-2-<uuid>.avro is the committed manifest list (since it's the most recent attempt). It therefore deletes snap-<snapshot_id>-1-<uuid>.avro as an "uncommitted" file, which corrupts the table because the active snapshot still references that file.
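
The sequence above can be modeled with a small, self-contained sketch. This is not Iceberg's code: the class `ManifestListCleanupSketch`, its methods, the snapshot ID `42`, and the `aaaa`/`bbbb` UUID placeholders are all hypothetical and only imitate the file-name pattern and the "delete everything except the list believed to be committed" behavior described in the steps. It compares cleanup driven by the in-memory snapshot with cleanup driven by what the table metadata actually references.

```java
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Simplified model of the scenario in this issue; NOT the Iceberg implementation.
class ManifestListCleanupSketch {

  // Hypothetical helper mirroring the pattern snap-<snapshot_id>-<attempt>-<uuid>.avro.
  static String manifestListName(long snapshotId, int attempt, String uuid) {
    return "snap-" + snapshotId + "-" + attempt + "-" + uuid + ".avro";
  }

  // Deletes every manifest list written by the producer except the one
  // the caller believes was committed. Returns the files left on disk.
  static Set<String> cleanup(
      Set<String> filesOnDisk, List<String> writtenByProducer, String believedCommitted) {
    Set<String> remaining = new LinkedHashSet<>(filesOnDisk);
    for (String path : writtenByProducer) {
      if (!path.equals(believedCommitted)) {
        remaining.remove(path);
      }
    }
    return remaining;
  }

  public static void main(String[] args) {
    // Attempt 1 was committed in the metastore, but the client saw a network error.
    String attempt1 = manifestListName(42L, 1, "aaaa");
    // Attempt 2 was written on retry, then the commit was skipped as a no-op.
    String attempt2 = manifestListName(42L, 2, "bbbb");

    Set<String> disk = new LinkedHashSet<>(List.of(attempt1, attempt2));
    List<String> written = List.of(attempt1, attempt2);

    // What the table metadata in the metastore actually references.
    String referencedByTable = attempt1;

    // Cleanup based on the in-memory snapshot (latest attempt): removes attempt 1.
    Set<String> afterInMemoryCleanup = cleanup(disk, written, attempt2);
    // Cleanup based on the refreshed table state: keeps attempt 1.
    Set<String> afterRefreshedCleanup = cleanup(disk, written, referencedByTable);

    System.out.println("in-memory cleanup keeps committed list: "
        + afterInMemoryCleanup.contains(referencedByTable));
    System.out.println("refreshed cleanup keeps committed list: "
        + afterRefreshedCleanup.contains(referencedByTable));
  }
}
```

In this model the only difference between the two runs is which manifest list is treated as committed, which is the assumption that step 5 gets wrong.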

Willingness to contribute

  • I can contribute a fix for this bug independently
  • I would be willing to contribute a fix for this bug with guidance from the Iceberg community
  • I cannot contribute a fix for this bug at this time

Labels

bug (Something isn't working)
