spark: Don't use table FileIO for checkpointing files #15239

Open

c2zwdjnlcg wants to merge 1 commit into apache:main from c2zwdjnlcg:fix-checkpoint-fs-impl

Conversation

@c2zwdjnlcg

Fixes: #14762

@github-actions bot added the spark label on Feb 5, 2026
@c2zwdjnlcg force-pushed the fix-checkpoint-fs-impl branch from f994146 to 13f02a1 on February 19, 2026 at 23:44
@c2zwdjnlcg
Author

@nastra could you take a look at this PR and see if you are aligned with separating the checkpoint IO from the table IO?

@nastra
Contributor

nastra commented Feb 20, 2026

@c2zwdjnlcg I currently don't have any cycles to review this. Maybe @huaxingao, @RussellSpitzer or @aokolnychyi have some time to review it

@RussellSpitzer
Member

First look: please make larger changes like this in a single module first, then backport to the others in a follow-up. Duplicated changes make reviewing more difficult. Taking an in-depth pass now.

Member

@RussellSpitzer RussellSpitzer left a comment

Rather than changing the IO here to something a user wouldn't expect, I think it's probably better to change InitialOffsetStore itself directly.

Since Spark checkpoints are expected to go through HadoopFS, we should probably use the Hadoop FileSystem API directly instead of the Iceberg FileIO class. This is of course a breaking change, so we probably also need to gate it, at least initially.

Maybe build two OffsetStores with the same interface and allow users to opt into the Hadoop-based one with a Spark read conf property?

interface InitialOffsetStore {
  StreamingOffset initialOffset();

  class TableIOOffsetStore implements InitialOffsetStore {
  }

  class HadoopOffsetStore implements InitialOffsetStore {
  }
}
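For context, a minimal self-contained sketch of the behavior an InitialOffsetStore provides: write the initial offset to the checkpoint location on the first run, and return the persisted offset on later runs. Here `java.nio` stands in for Hadoop FileSystem and a plain `String` stands in for `StreamingOffset`; the class and method names are illustrative, not the actual Iceberg code.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

public class FileBackedOffsetStore {
  private final Path offsetFile;

  FileBackedOffsetStore(Path checkpointDir) throws IOException {
    Files.createDirectories(checkpointDir);
    this.offsetFile = checkpointDir.resolve("offset.json");
  }

  // First run: persist the given initial offset and return it.
  // Later runs: ignore the argument and return the persisted offset.
  String initialOffset(String initial) throws IOException {
    if (!Files.exists(offsetFile)) {
      Files.writeString(offsetFile, initial);
    }
    return Files.readString(offsetFile);
  }

  public static void main(String[] args) throws IOException {
    Path dir = Files.createTempDirectory("checkpoint");
    FileBackedOffsetStore store = new FileBackedOffsetStore(dir);
    System.out.println(store.initialOffset("{\"snapshot_id\": 1}")); // writes, then returns it
    System.out.println(store.initialOffset("{\"snapshot_id\": 2}")); // returns the persisted offset
  }
}
```

The point of the PR is which filesystem abstraction performs these reads and writes, not the read-then-write-once logic itself.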

@c2zwdjnlcg force-pushed the fix-checkpoint-fs-impl branch twice, most recently from 7631a4f to 257b264 on February 24, 2026 at 23:18
@c2zwdjnlcg force-pushed the fix-checkpoint-fs-impl branch from 257b264 to 0f5ead4 on February 24, 2026 at 23:37
@c2zwdjnlcg
Author

@RussellSpitzer Thanks for the review.

please do larger changes like this only on a single module first, then backport to the others in a follow up.

Sorry about that; I'll keep it in mind next time.

Hopefully this is more in line with what you were thinking.

I named the setting streaming-checkpoint-use-table-io. If you're generally OK with this approach and the name, I'll also add documentation to this PR.
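For illustration, a hypothetical self-contained sketch of how a read option like this could gate the two stores. The class names, the `describe()` stand-in for `initialOffset()`, and the default of `true` (preserving the current table-IO behavior, per the suggestion to gate the breaking change) are all assumptions, not the code in this PR.

```java
import java.util.Map;

public class OffsetStoreSelection {

  interface InitialOffsetStore {
    String describe(); // stand-in for StreamingOffset initialOffset()
  }

  // Offset store backed by the table's FileIO (current behavior).
  static class TableIOOffsetStore implements InitialOffsetStore {
    @Override
    public String describe() {
      return "table-io";
    }
  }

  // Offset store backed by Hadoop FileSystem, matching how Spark
  // handles its own checkpoint files.
  static class HadoopOffsetStore implements InitialOffsetStore {
    @Override
    public String describe() {
      return "hadoop-fs";
    }
  }

  // Factory: keep table IO by default; users opt into the Hadoop-based
  // store by setting the read option to false.
  static InitialOffsetStore create(Map<String, String> readOptions) {
    boolean useTableIO =
        Boolean.parseBoolean(
            readOptions.getOrDefault("streaming-checkpoint-use-table-io", "true"));
    return useTableIO ? new TableIOOffsetStore() : new HadoopOffsetStore();
  }

  public static void main(String[] args) {
    System.out.println(create(Map.of()).describe());
    System.out.println(
        create(Map.of("streaming-checkpoint-use-table-io", "false")).describe());
  }
}
```

Defaulting to the existing behavior keeps the change non-breaking until the Hadoop-based store has seen enough use to become the default.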

Successfully merging this pull request may close these issues.

Spark Iceberg streaming - checkpoint leverages S3FileIO signer path instead of Hadoop's S3AFileSystem
