Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat(puffin): Add PuffinWriter #959

Open
wants to merge 1 commit into
base: main
Choose a base branch
from

Conversation

fqaiser94
Copy link
Contributor

Part of #744

Summary

  • Add PuffinWriter

Context

  • This is the fifth of a number of PRs to add support for Iceberg Puffin file format.
  • It might be helpful to refer to the overarching PR from which these changes were split to understand better how these changes will fit in to the larger picture.
  • It may also be helpful to refer to the Java reference implementation for PuffinWriter here.

@fqaiser94 fqaiser94 changed the title Add PuffinWriter feat(puffin): Add PuffinWriter Feb 9, 2025
@fqaiser94 fqaiser94 mentioned this pull request Feb 8, 2025
8 tasks
@fqaiser94 fqaiser94 force-pushed the puffin_writer branch 3 times, most recently from af396c4 to d3749f4 Compare February 9, 2025 21:26
@fqaiser94 fqaiser94 marked this pull request as ready for review February 10, 2025 11:01
@fqaiser94 fqaiser94 marked this pull request as draft February 10, 2025 11:03
@jonathanc-n
Copy link
Contributor

@fqaiser94 Whats the current idea for this? I can offer help if needed.

@fqaiser94
Copy link
Contributor Author

@fqaiser94 Whats the current idea for this? I can offer help if needed.

Forgot about this, let me try to revive this over the weekend

@fqaiser94 fqaiser94 marked this pull request as ready for review March 23, 2025 14:51
Comment on lines +41 to +43
pub fn bytes_written(&self) -> u64 {
self.written_size.load(std::sync::atomic::Ordering::SeqCst)
}
Copy link
Contributor Author

@fqaiser94 fqaiser94 Mar 23, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is the only part I'm concerned about and have been struggling to find time to investigate. What memory ordering option should we use here?

In the rest of the codebase, the written_size variable is accessed using Relaxed memory ordering ... I'm a little concerned about that TBH from a consistency perspective but for the moment, let's focus on just this PR's use case. For Puffin, it's important we get accurate written_size as that is included in the BlobMetadata that we write into Puffin file footer for each Blob so that users can efficiently extract blobs later.

Now that I'm thinking about this again, I think I might have a bug here lol I can't just SeqCst for the load side ... I think I need SeqCst for the store side too for this to work correctly ... in which case I can just use Release for load and Acquire for store probably ... What are people's thoughts on this, any memory ordering experts around? XD

Other alternatives:

  • Write a custom tracking writer for the puffin use case and avoid async/atomic variables entirely (like the reference Java implementation)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants