Contributor

@anoopj anoopj commented Nov 8, 2025

WIP PR for s.apache.org/iceberg-single-file-commit

Implemented so far:

  • Foundational types, such as the TrackedFile interface as a unified representation for all V4 entry types
  • Reader and basic root manifest expansion

* Manifest deletion vector entry (V4+ only) - marks entries in a manifest as deleted without
* rewriting the manifest.
*/
MANIFEST_DV(5);
Contributor

I prefer the option of having the DV located in a field of the data or delete manifest record. That way we don't have to wait to find the DV before processing a manifest file. Not sure what others think here, but since the DV metadata/content is likely going to be different between the Metadata DV (inline) and Data DV (stored in Puffin), I don't see much value in trying to reuse metadata fields for it.

Contributor Author

@anoopj anoopj Nov 15, 2025

That was my preference too, and I advocated for it in our community discussion, but this is what we settled on. Our current v4 proposal specifically uses MANIFEST_DV as a separate content type that references manifests via the referenced_file field. We can certainly change it, but I want to hear from others.

* <p>When present, the deletion vector is stored inline in the manifest rather than in a separate
* Puffin file.
*/
ByteBuffer inlineContent();
Contributor

I mentioned this in my comment below, but I don't think there's much value in combining the inline MDV metadata and fields to track data DVs stored in Puffin. These aren't overlapping, so I'd keep them separate.

Contributor Author

@anoopj anoopj Nov 15, 2025

Data and metadata DVs seemed like similar concepts to me. I can create a separate ManifestDeletionVector.
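
To make the proposed split concrete, here is a minimal sketch of two separate types: one for an inline manifest DV and one for a reference to a data DV stored in Puffin. All class and field names below (`SeparateDvTypesSketch`, `DataDeletionVectorRef`, `puffinLocation`, `offset`, `sizeInBytes`) are hypothetical and are not taken from this PR or the v4 proposal.

```java
import java.nio.ByteBuffer;

// Illustrative only: stand-in types, not Iceberg API.
final class SeparateDvTypesSketch {

  /** Hypothetical inline metadata DV: the serialized vector lives in the manifest itself. */
  static final class ManifestDeletionVector {
    private final ByteBuffer inlineContent;

    ManifestDeletionVector(ByteBuffer inlineContent) {
      // Keep a read-only view so callers cannot modify the stored bytes through this object.
      this.inlineContent = inlineContent.asReadOnlyBuffer();
    }

    ByteBuffer inlineContent() {
      // Hand out an independent (still read-only) view with its own position and limit.
      return inlineContent.duplicate();
    }
  }

  /** Hypothetical pointer to a data DV blob stored in a Puffin file. */
  static final class DataDeletionVectorRef {
    final String puffinLocation;
    final long offset;
    final long sizeInBytes;

    DataDeletionVectorRef(String puffinLocation, long offset, long sizeInBytes) {
      this.puffinLocation = puffinLocation;
      this.offset = offset;
      this.sizeInBytes = sizeInBytes;
    }
  }
}
```

With this shape, neither type carries fields that only make sense for the other, which is the non-overlap concern raised above.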

100,
"location",
Types.StringType.get(),
"Location of the file. Optional if content_type is 5 and deletion_vector.inline_content is not null");
Contributor

Not using a separate entry for inline would make this required, right?

Contributor Author

Yes.

104,
"file_size_in_bytes",
Types.LongType.get(),
"Total file size in bytes. Must be defined if location is defined");
Contributor

Same here.
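
For reference, the two field constraints as currently documented (`location` may be absent only when content_type is 5 with inline content, and `file_size_in_bytes` must be set whenever `location` is set) could be checked roughly as follows. This is a sketch under assumptions: `TrackedFileValidationSketch` and `validate` are hypothetical names, and the flat parameter list stands in for whatever record type the PR ends up using.

```java
import java.nio.ByteBuffer;

// Illustrative only: a stand-alone check of the documented field rules, not Iceberg API.
final class TrackedFileValidationSketch {
  /** Content type ID for MANIFEST_DV, as shown in the diff above. */
  static final int MANIFEST_DV = 5;

  /** Returns null when the entry is consistent, otherwise a description of the violation. */
  static String validate(
      int contentType, String location, Long fileSizeInBytes, ByteBuffer inlineContent) {
    boolean inlineDv = contentType == MANIFEST_DV && inlineContent != null;
    if (location == null && !inlineDv) {
      return "location is required unless content_type is 5 with inline content";
    }
    if (location != null && fileSizeInBytes == null) {
      return "file_size_in_bytes must be defined when location is defined";
    }
    return null;
  }
}
```

If inline DVs stop being a separate entry, the first branch reduces to a plain null check and both fields become required.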

* <p>Contains status, snapshot ID, sequence numbers, and first-row-id. Optional - may be null if
* tracking info is inherited.
*/
TrackingInfo trackingInfo();
Contributor

I think that this also requires a field to be defined, so we have a record of the ID used for it.

* @throws IllegalStateException if content_type is not DATA
* @throws UnsupportedOperationException if ContentStats not yet implemented
*/
DataFile asDataFile(PartitionSpec spec);
Contributor

I'm not sure that we want to pass in the spec, since the record contains an ID. Wouldn't it be better to pass in a map of specs by ID when reading manifests so that this is already known when adapting to DataFile?
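
A rough sketch of that alternative, assuming the reader is configured once with a map of specs by ID so the adapter method needs no spec argument. `SpecLookupSketch`, `Spec`, `Entry`, `DataFileView`, and `Reader` are hypothetical stand-ins for illustration, not the Iceberg `PartitionSpec`, `DataFile`, or manifest reader types.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;

// Illustrative only: stand-in types, not Iceberg API.
final class SpecLookupSketch {

  /** Stand-in for a partition spec; only the ID matters here. */
  static final class Spec {
    final int specId;

    Spec(int specId) {
      this.specId = specId;
    }
  }

  /** Stand-in for a tracked-file record that stores only the spec ID. */
  static final class Entry {
    final String location;
    final int specId;

    Entry(String location, int specId) {
      this.location = location;
      this.specId = specId;
    }
  }

  /** Stand-in for the adapted data-file view. */
  static final class DataFileView {
    final String location;
    final Spec spec;

    DataFileView(String location, Spec spec) {
      this.location = location;
      this.spec = spec;
    }
  }

  /** Reader that knows every spec up front and resolves each entry's spec by its stored ID. */
  static final class Reader {
    private final Map<Integer, Spec> specsById;

    Reader(Map<Integer, Spec> specsById) {
      // Defensive copy so later changes to the caller's map do not affect the reader.
      this.specsById = Collections.unmodifiableMap(new HashMap<>(specsById));
    }

    DataFileView asDataFile(Entry entry) {
      Spec spec = specsById.get(entry.specId);
      if (spec == null) {
        throw new IllegalArgumentException("Unknown partition spec ID: " + entry.specId);
      }
      return new DataFileView(entry.location, spec);
    }
  }
}
```

The caller can no longer pass a spec that disagrees with the ID stored in the record, because the lookup is driven by the record itself.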

DeleteFile asDeleteFile(PartitionSpec spec);

/** Set the status for this tracked file entry. */
void setStatus(TrackingInfo.Status status);
Contributor

This API should not expose any setter methods. The implementation used can, if needed, for things like inherited metadata. But the API interface itself should not force implementations to be mutable. In general, we want to think of the API interfaces as immutable.
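
One way to read this, sketched with hypothetical names: the interface exposes only accessors, and the implementation offers its own way to produce an updated entry. The sketch uses a copy-style `withStatus` helper; a mutable implementation class would satisfy the same interface. `ImmutableTrackingSketch`, `TrackedEntry`, `BaseTrackedEntry`, and the `Status` values are placeholders, not the types or enum constants from this PR.

```java
// Illustrative only: stand-in types, not Iceberg API.
final class ImmutableTrackingSketch {

  /** Placeholder status values for the sketch. */
  enum Status {
    EXISTING,
    ADDED,
    DELETED
  }

  /** Read-only API surface: accessors only, no setters. */
  interface TrackedEntry {
    String location();

    Status status();
  }

  /** Implementation that keeps state changes out of the interface. */
  static final class BaseTrackedEntry implements TrackedEntry {
    private final String location;
    private final Status status;

    BaseTrackedEntry(String location, Status status) {
      this.location = location;
      this.status = status;
    }

    @Override
    public String location() {
      return location;
    }

    @Override
    public Status status() {
      return status;
    }

    /** Implementation-level helper: returns a new entry instead of mutating this one. */
    BaseTrackedEntry withStatus(Status newStatus) {
      return new BaseTrackedEntry(location, newStatus);
    }
  }
}
```

Code that only sees `TrackedEntry` cannot change an entry, while internal code that holds the concrete class can still derive an updated copy.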

*/
public interface TrackingInfo {
/** Status of an entry in a tracked file */
enum Status {
Contributor

Isn't this enum already defined somewhere?

@Override
public CloseableIterable<FileScanTask> doPlanFiles() {
Snapshot snapshot = snapshot();

Contributor

Nit: Avoid unnecessary whitespace changes. They cause conflicts.

: 2;

if (formatVersion >= 4) {
return planV4Files(snapshot, io);
Contributor

Rather than modifying data table scan right now, let's leave this out. We don't need to plug anything into table scans at this point, since that is just a configuration API.

*
* <p>Use this method to copy data without stats when collecting files.
*/
F copyWithoutStats();
Contributor

Is it intentional that we removed copyWithStats(Set<Integer> requestedColumnIds) from ContentFile?
