spark: Don't use table FileIO for checkpointing files #15239
c2zwdjnlcg wants to merge 1 commit into apache:main
Conversation
@nastra could you take a look at this PR and see if you are aligned with separating the checkpoint IO from the table IO?
@c2zwdjnlcg I currently don't have any cycles to review this. Maybe @huaxingao, @RussellSpitzer, or @aokolnychyi have some time to review it.
First look: please make larger changes like this on a single module first, then backport to the others in a follow-up. Duplicated changes make a PR more difficult to review. Taking an in-depth pass now.
Rather than changing the IO here to something a user wouldn't expect, I think it's probably better for us to change InitialOffsetStore itself directly.
Since Spark checkpoints are expected to go through the Hadoop FileSystem, we should probably use Hadoop FileSystem directly instead of the Iceberg FileIO class. This is of course a breaking change, so we probably also need to gate it, at least initially.
Maybe build two offset stores with the same interface and let users opt into the Hadoop-based one with a Spark read conf property?
interface InitialOffsetStore {
  StreamingOffset initialOffset();
}

class TableIOOffsetStore implements InitialOffsetStore {
}

class HadoopOffsetStore implements InitialOffsetStore {
}
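To make the gating idea concrete, here is a minimal, self-contained sketch. All names here are hypothetical (they are not from the PR), and plain java.nio stands in for the real Hadoop FileSystem and Iceberg FileIO calls, which are not reproduced here:

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical stand-in for Iceberg's StreamingOffset.
record StreamingOffset(long snapshotId, long position) {}

interface InitialOffsetStore {
  StreamingOffset initialOffset();
}

// Existing behavior: resolve the offset through the table's FileIO (stubbed out here).
class TableIOOffsetStore implements InitialOffsetStore {
  @Override
  public StreamingOffset initialOffset() {
    return new StreamingOffset(0L, 0L);
  }
}

// Proposed behavior: read the checkpointed offset straight from the file system,
// the way Spark handles its own checkpoint files (java.nio as a stand-in for Hadoop FS).
class HadoopOffsetStore implements InitialOffsetStore {
  private final Path checkpointFile;

  HadoopOffsetStore(Path checkpointFile) {
    this.checkpointFile = checkpointFile;
  }

  @Override
  public StreamingOffset initialOffset() {
    try {
      String[] parts = Files.readString(checkpointFile).trim().split(",");
      return new StreamingOffset(Long.parseLong(parts[0]), Long.parseLong(parts[1]));
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }
}

public class OffsetStores {
  // Gate the breaking change behind a flag, which in practice would come
  // from a Spark read conf property as suggested above.
  static InitialOffsetStore create(boolean useHadoopStore, Path checkpointFile) {
    return useHadoopStore ? new HadoopOffsetStore(checkpointFile) : new TableIOOffsetStore();
  }

  public static void main(String[] args) throws IOException {
    Path file = Files.createTempFile("offset", ".txt");
    Files.writeString(file, "42,7");
    StreamingOffset offset = OffsetStores.create(true, file).initialOffset();
    System.out.println(offset.snapshotId() + ":" + offset.position());
  }
}
```

The factory keeps the old TableIOOffsetStore as the default so existing pipelines are unaffected until a user opts in.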
@RussellSpitzer Thanks for the review.
Sorry about that, will keep it in mind for next time. Hopefully this is more in line with what you were thinking. I named the setting
Fixes: #14762