[WIP] V4 Manifest Read Support #14533

anoopj · 2025-11-08T00:07:47Z

WIP PR for s.apache.org/iceberg-single-file-commit

Implemented so far:

Foundational types such as TrackedFile interface as unified representation for all V4 entry types
Reader and basic root manifest expansion

Introduces the foundational types for V4 manifest format support:

- TrackedFile interface as unified representation for all V4 entry types
- DeletionVector and ManifestStats interfaces
- GenericTrackedFile implementation and test

rdblue · 2025-11-13T00:45:13Z

api/src/main/java/org/apache/iceberg/FileContent.java

+   * Manifest deletion vector entry (V4+ only) - marks entries in a manifest as deleted without
+   * rewriting the manifest.
+   */
+  MANIFEST_DV(5);


I prefer the option of having the DV located in a field of the data or delete manifest record. That way we don't have to wait to find the DV before processing a manifest file. Not sure what others think here, but since the DV metadata/content is likely going to be different between the Metadata DV (inline) and Data DV (stored in Puffin), I don't see much value in trying to reuse metadata fields for it.

That was my preference too, and I advocated for it in our community discussion, but this is what we settled on. Our current v4 proposal specifically uses MANIFEST_DV as a separate content type that references manifests via the referenced_file field. We can certainly change it, but I want to hear from others.
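
To make the trade-off concrete, here is a minimal sketch of applying a manifest DV to the entries of one manifest by position. It assumes the DV is already known when the manifest is read, which is the case the inline-field option is meant to guarantee. `ManifestDvSketch` and `liveEntries` are hypothetical names, and `java.util.BitSet` is only a stand-in for the 32-bit roaring bitmap; none of this is the PR's code.

```java
import java.util.ArrayList;
import java.util.BitSet;
import java.util.List;

public class ManifestDvSketch {
  /**
   * Returns the entries whose positions are NOT set in the deletion vector.
   * BitSet is an assumption standing in for a roaring bitmap.
   */
  public static <T> List<T> liveEntries(List<T> entries, BitSet deletedPositions) {
    List<T> live = new ArrayList<>();
    for (int pos = 0; pos < entries.size(); pos++) {
      if (!deletedPositions.get(pos)) {
        live.add(entries.get(pos));
      }
    }
    return live;
  }

  public static void main(String[] args) {
    List<String> entries = List.of("a.parquet", "b.parquet", "c.parquet");
    BitSet dv = new BitSet();
    dv.set(1); // mark the entry at position 1 as deleted without rewriting the manifest
    System.out.println(liveEntries(entries, dv)); // prints [a.parquet, c.parquet]
  }
}
```

With a separate MANIFEST_DV entry, a reader would first have to locate the entry whose referenced_file points at this manifest before it could run a filter like this.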

rdblue · 2025-11-13T00:46:24Z

api/src/main/java/org/apache/iceberg/DeletionVector.java

+   * <p>When present, the deletion vector is stored inline in the manifest rather than in a separate
+   * Puffin file.
+   */
+  ByteBuffer inlineContent();


I mentioned this in my comment below, but I don't think there's much value in combining the inline MDV metadata and fields to track data DVs stored in Puffin. These aren't overlapping, so I'd keep them separate.

Data and metadata DVs seemed like similar concepts to me. I can create a separate ManifestDeletionVector.

rdblue · 2025-11-13T00:47:12Z

api/src/main/java/org/apache/iceberg/TrackedFile.java

+          100,
+          "location",
+          Types.StringType.get(),
+          "Location of the file. Optional if content_type is 5 and deletion_vector.inline_content is not null");


Not using a separate entry for inline would make this required, right?

rdblue · 2025-11-13T00:48:09Z

api/src/main/java/org/apache/iceberg/TrackedFile.java

+          104,
+          "file_size_in_bytes",
+          Types.LongType.get(),
+          "Total file size in bytes. Must be defined if location is defined");
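
The two field docs above imply a pair of consistency rules. The sketch below spells them out as a check, assuming the rules are exactly as written in the doc strings. `TrackedFileValidationSketch`, `validate`, and the `hasInlineDv` flag are hypothetical and not part of the PR.

```java
public class TrackedFileValidationSketch {
  /** Content type ID of MANIFEST_DV, taken from the FileContent diff in this PR. */
  public static final int MANIFEST_DV = 5;

  /**
   * Checks the rules stated in the field docs:
   * location may be null only for a MANIFEST_DV entry with inline content, and
   * file_size_in_bytes must be set whenever location is set.
   */
  public static void validate(
      int contentType, String location, Long fileSizeInBytes, boolean hasInlineDv) {
    boolean locationOptional = contentType == MANIFEST_DV && hasInlineDv;
    if (location == null && !locationOptional) {
      throw new IllegalArgumentException("location is required for content type " + contentType);
    }
    if (location != null && fileSizeInBytes == null) {
      throw new IllegalArgumentException("file_size_in_bytes must be defined if location is defined");
    }
  }
}
```

If the inline DV moves into a field of the data or delete manifest record, the first branch goes away and location can simply be required.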


rdblue · 2025-11-13T00:50:01Z

api/src/main/java/org/apache/iceberg/TrackedFile.java

+   * <p>Contains status, snapshot ID, sequence numbers, and first-row-id. Optional - may be null if
+   * tracking info is inherited.
+   */
+  TrackingInfo trackingInfo();


I think that this also requires a field to be defined, so we have a record of the ID used for it.

rdblue · 2025-11-13T00:51:26Z

api/src/main/java/org/apache/iceberg/TrackedFile.java

+   * @throws IllegalStateException if content_type is not DATA
+   * @throws UnsupportedOperationException if ContentStats not yet implemented
+   */
+  DataFile asDataFile(PartitionSpec spec);


I'm not sure that we want to pass in the spec, since the record contains an ID. Wouldn't it be better to pass in a map of specs by ID when reading manifests so that this is already known when adapting to DataFile?
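
A rough sketch of that shape, using stand-in types rather than the real Iceberg classes: the reader is configured once with specs by ID, and the adapter resolves the spec from the ID stored in the record. `Spec`, `TrackedEntry`, `DataFileView`, and `Reader` are all hypothetical names for illustration.

```java
import java.util.Map;

public class SpecLookupSketch {
  /** Hypothetical stand-in for org.apache.iceberg.PartitionSpec. */
  public record Spec(int specId, String description) {}

  /** Hypothetical stand-in for a tracked-file record that stores only the spec ID. */
  public record TrackedEntry(String location, int specId) {}

  /** Hypothetical stand-in for the DataFile view produced by the adapter. */
  public record DataFileView(String location, Spec spec) {}

  /** Reader that knows every spec up front, so adapting needs no spec argument. */
  public static final class Reader {
    private final Map<Integer, Spec> specsById;

    public Reader(Map<Integer, Spec> specsById) {
      this.specsById = Map.copyOf(specsById);
    }

    public DataFileView asDataFile(TrackedEntry entry) {
      Spec spec = specsById.get(entry.specId());
      if (spec == null) {
        throw new IllegalArgumentException("Unknown partition spec ID: " + entry.specId());
      }
      return new DataFileView(entry.location(), spec);
    }
  }
}
```

This keeps callers from passing a spec that disagrees with the ID in the record, at the cost of requiring the spec map when the reader is created.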

rdblue · 2025-11-13T00:52:48Z

api/src/main/java/org/apache/iceberg/TrackedFile.java

+  DeleteFile asDeleteFile(PartitionSpec spec);
+
+  /** Set the status for this tracked file entry. */
+  void setStatus(TrackingInfo.Status status);


This API should not expose any setter methods. The implementation used can, if needed, for things like inherited metadata. But the API interface itself should not force implementations to be mutable. In general, we want to think of the API interfaces as immutable.
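
One way to read this suggestion, as a sketch with hypothetical names (`TrackedEntryView`, `MutableEntry`) and a simplified status enum: the API interface exposes only getters, while the implementation class keeps a setter for internal use such as applying inherited metadata.

```java
public class ImmutableApiSketch {
  /** Simplified status values for illustration only. */
  public enum Status { EXISTING, ADDED, DELETED }

  /** Read-only API view: no setters, so implementations are free to be immutable. */
  public interface TrackedEntryView {
    String location();

    Status status();
  }

  /** Implementation that stays mutable for internal use, e.g. inherited metadata. */
  public static final class MutableEntry implements TrackedEntryView {
    private final String location;
    private Status status;

    public MutableEntry(String location, Status status) {
      this.location = location;
      this.status = status;
    }

    @Override
    public String location() {
      return location;
    }

    @Override
    public Status status() {
      return status;
    }

    /** Setter lives on the implementation only, not on the API interface. */
    public void setStatus(Status newStatus) {
      this.status = newStatus;
    }
  }
}
```

Code that only holds a `TrackedEntryView` cannot change the status, while the reader that owns the `MutableEntry` still can.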

rdblue · 2025-11-13T00:53:23Z

api/src/main/java/org/apache/iceberg/TrackingInfo.java

+ */
+public interface TrackingInfo {
+  /** Status of an entry in a tracked file */
+  enum Status {


Isn't this enum already defined somewhere?

rdblue · 2025-11-13T00:54:46Z

core/src/main/java/org/apache/iceberg/DataTableScan.java

  @Override
  public CloseableIterable<FileScanTask> doPlanFiles() {
    Snapshot snapshot = snapshot();
-


Nit: Avoid unnecessary whitespace changes. They cause conflicts.

rdblue · 2025-11-13T00:55:51Z

core/src/main/java/org/apache/iceberg/DataTableScan.java

+            : 2;
+
+    if (formatVersion >= 4) {
+      return planV4Files(snapshot, io);


Rather than modifying data table scan right now, let's leave this out. We don't need to plug anything into table scans at this point, since that is just a configuration API.

pvary · 2025-11-13T15:27:57Z

api/src/main/java/org/apache/iceberg/TrackedFile.java

+   *
+   * <p>Use this method to copy data without stats when collecting files.
+   */
+  F copyWithoutStats();


Is it intentional that we removed copyWithStats(Set<Integer> requestedColumnIds) from ContentFile?

anoopj added 11 commits November 3, 2025 09:37

Fix checkstyle error

c278fcc

Add position, setters

5925ff1

Fix checkstyle error

d6302a0

implement reader

bbb44cb

Checkstyle and formatter

3ae5e16

Rename file readers

2fbd886

implement root manifest reader

8e4849b

implement adapter for asData and asDelete

cff48fe

Trigger PR refresh

e53cbb7

Implement manifest expander

0380105

anoopj marked this pull request as draft November 8, 2025 00:07

github-actions bot added API core labels Nov 8, 2025

anoopj mentioned this pull request Nov 8, 2025

[Draft] V4 Manifest Read/Write Support anoopj/iceberg#3

Closed

anoopj added 3 commits November 10, 2025 11:35

Support for inline metadata DVs: 32 bit roaring bitmaps for now

f278ae4

Stub for snapshot integration

170d4bf

Implement parallel expansion

889db0d

anoopj mentioned this pull request Nov 12, 2025

Prototype: Parquet as manifest format for v4 #14577

Closed

rdblue reviewed Nov 13, 2025


pvary reviewed Nov 13, 2025

