[WIP] V4 Manifest Read Support #14533
Introduces the foundational types for V4 manifest format support:
- TrackedFile interface as unified representation for all V4 entry types
- DeletionVector and ManifestStats interfaces
- GenericTrackedFile implementation and test
 * Manifest deletion vector entry (V4+ only) - marks entries in a manifest as deleted without
 * rewriting the manifest.
 */
MANIFEST_DV(5);
I prefer the option of having the DV located in a field of the data or delete manifest record. That way we don't have to wait to find the DV before processing a manifest file. Not sure what others think here, but since the DV metadata/content is likely going to be different between the Metadata DV (inline) and Data DV (stored in Puffin), I don't see much value in trying to reuse metadata fields for it.
That was my preference too, and I advocated for it in our community discussion but this is what we settled on. Our current v4 proposal specifically uses MANIFEST_DV as a separate content type that references manifests via the referenced_file field. We can certainly change it, but want to hear from others.
 * <p>When present, the deletion vector is stored inline in the manifest rather than in a separate
 * Puffin file.
 */
ByteBuffer inlineContent();
I mentioned this in my comment below, but I don't think there's much value in combining the inline MDV metadata and fields to track data DVs stored in Puffin. These aren't overlapping, so I'd keep them separate.
Data and metadata DVs seemed like similar concepts to me. I can create a separate ManifestDeletionVector.
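For reference, the split discussed in this thread could look roughly like the sketch below. Only `ManifestDeletionVector` and `inlineContent()` appear in the thread; every other type name and field (`DataDeletionVector`, `contentOffset`, `contentSizeInBytes`, and both implementation classes) is a hypothetical stand-in, not the PR's actual API.

```java
import java.nio.ByteBuffer;

// Hypothetical sketch: two non-overlapping interfaces instead of one combined DeletionVector.

/** Data DV: describes a blob stored in a separate Puffin file (fields are assumptions). */
interface DataDeletionVector {
  long contentOffset();

  long contentSizeInBytes();
}

/** Metadata DV: the bitmap is carried inline in the manifest. */
interface ManifestDeletionVector {
  ByteBuffer inlineContent();
}

final class PuffinDataDv implements DataDeletionVector {
  private final long contentOffset;
  private final long contentSizeInBytes;

  PuffinDataDv(long contentOffset, long contentSizeInBytes) {
    this.contentOffset = contentOffset;
    this.contentSizeInBytes = contentSizeInBytes;
  }

  @Override
  public long contentOffset() {
    return contentOffset;
  }

  @Override
  public long contentSizeInBytes() {
    return contentSizeInBytes;
  }
}

final class InlineManifestDv implements ManifestDeletionVector {
  private final ByteBuffer content;

  InlineManifestDv(ByteBuffer content) {
    // Keep a read-only view so the inline bytes cannot be changed through this object.
    this.content = content.asReadOnlyBuffer();
  }

  @Override
  public ByteBuffer inlineContent() {
    // Hand out an independent view so callers cannot move the stored buffer's position.
    return content.duplicate();
  }
}
```

With this shape, neither interface carries fields that only make sense for the other kind of DV.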
100,
"location",
Types.StringType.get(),
"Location of the file. Optional if content_type is 5 and deletion_vector.inline_content is not null");
Not using a separate entry for inline would make this required, right?
Yes.
104,
"file_size_in_bytes",
Types.LongType.get(),
"Total file size in bytes. Must be defined if location is defined");
Same here.
 * <p>Contains status, snapshot ID, sequence numbers, and first-row-id. Optional - may be null if
 * tracking info is inherited.
 */
TrackingInfo trackingInfo();
I think that this also requires a field to be defined, so we have a record of the ID used for it.
 * @throws IllegalStateException if content_type is not DATA
 * @throws UnsupportedOperationException if ContentStats not yet implemented
 */
DataFile asDataFile(PartitionSpec spec);
I'm not sure that we want to pass in the spec, since the record contains an ID. Wouldn't it be better to pass in a map of specs by ID when reading manifests so that this is already known when adapting to DataFile?
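A minimal sketch of that suggestion, assuming the adapter is handed the spec map once and resolves the spec from the ID stored in each record. `Spec`, `TrackedRecord`, `AdaptedDataFile`, and `TrackedFileAdapter` are hypothetical, simplified stand-ins; the real `PartitionSpec` and `DataFile` types are much richer.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for a partition spec, identified by its ID.
final class Spec {
  final int specId;

  Spec(int specId) {
    this.specId = specId;
  }
}

// Hypothetical stand-in for a tracked file record that carries its own spec ID.
final class TrackedRecord {
  final String location;
  final int specId;

  TrackedRecord(String location, int specId) {
    this.location = location;
    this.specId = specId;
  }
}

// Hypothetical stand-in for the adapted DataFile view.
final class AdaptedDataFile {
  final String location;
  final Spec spec;

  AdaptedDataFile(String location, Spec spec) {
    this.location = location;
    this.spec = spec;
  }
}

final class TrackedFileAdapter {
  private final Map<Integer, Spec> specsById;

  // The spec map is supplied once, when the manifest reader is set up.
  TrackedFileAdapter(Map<Integer, Spec> specsById) {
    this.specsById = new HashMap<>(specsById);
  }

  // Look up the spec by the ID in the record instead of taking it as an argument.
  AdaptedDataFile asDataFile(TrackedRecord record) {
    Spec spec = specsById.get(record.specId);
    if (spec == null) {
      throw new IllegalArgumentException("Unknown partition spec ID: " + record.specId);
    }
    return new AdaptedDataFile(record.location, spec);
  }
}
```

Callers then cannot pass a spec that disagrees with the ID stored in the record, and an unknown ID fails fast at adaptation time.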
DeleteFile asDeleteFile(PartitionSpec spec);

/** Set the status for this tracked file entry. */
void setStatus(TrackingInfo.Status status);
This API should not expose any setter methods. The implementation used can, if needed, for things like inherited metadata. But the API interface itself should not force implementations to be mutable. In general, we want to think of the API interfaces as immutable.
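One way to read this comment, sketched with hypothetical, simplified types (`TrackedEntry`, `GenericTrackedEntry`, and this `Status` enum are illustrative stand-ins, not the PR's classes): the interface exposes getters only, and any setter needed for inherited metadata lives on the implementation.

```java
// Hypothetical status values, for illustration only.
enum Status {
  EXISTING,
  ADDED,
  DELETED
}

/** Read-only API view: getters only, so implementations are free to be immutable. */
interface TrackedEntry {
  String location();

  Status status();
}

/** Implementation class: mutable internally, for example to apply inherited metadata. */
final class GenericTrackedEntry implements TrackedEntry {
  private final String location;
  private Status status;

  GenericTrackedEntry(String location, Status status) {
    this.location = location;
    this.status = status;
  }

  @Override
  public String location() {
    return location;
  }

  @Override
  public Status status() {
    return status;
  }

  // The setter is declared on the implementation only; it is not reachable
  // through the TrackedEntry interface.
  void setStatus(Status newStatus) {
    this.status = newStatus;
  }
}
```

Code that only sees `TrackedEntry` has no mutation methods available, while the reader that owns a `GenericTrackedEntry` can still fill in inherited values.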
 */
public interface TrackingInfo {
  /** Status of an entry in a tracked file */
  enum Status {
Isn't this enum already defined somewhere?
@Override
public CloseableIterable<FileScanTask> doPlanFiles() {
  Snapshot snapshot = snapshot();

Nit: Avoid unnecessary whitespace changes. They cause conflicts.
: 2;

if (formatVersion >= 4) {
  return planV4Files(snapshot, io);
Rather than modifying data table scan right now, let's leave this out. We don't need to plug anything into table scans at this point, since that is just a configuration API.
 *
 * <p>Use this method to copy data without stats when collecting files.
 */
F copyWithoutStats();
Is it intentional that we removed copyWithStats(Set<Integer> requestedColumnIds) from ContentFile?
WIP PR for s.apache.org/iceberg-single-file-commit
Implemented so far: