-
Notifications
You must be signed in to change notification settings - Fork 55
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat: Add the in-commit timestamp field to CommitInfo #581
Conversation
Codecov ReportAll modified and coverable lines are covered by tests ✅
Additional details and impacted files@@ Coverage Diff @@
## main #581 +/- ##
==========================================
+ Coverage 83.68% 83.70% +0.02%
==========================================
Files 75 75
Lines 16951 16955 +4
Branches 16951 16955 +4
==========================================
+ Hits 14185 14193 +8
+ Misses 2100 2098 -2
+ Partials 666 664 -2 ☔ View full report in Codecov by Sentry. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Approach looks good, but we're using the wrong column name.
} else if let Some(timestamp) = getters[16].get_long(i, "commitInfo.timestamp")? { | ||
*self.timestamp = timestamp; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
According to the Delta spec:
The
commitInfo
action must be the first action in the commit... [and] must include a field namedinCommitTimestamp
, of typelong
The current CommitInfo
in kernel doesn't even define that field; the timestamp
field is just an example from the spec (that predates ICT) of the sorts of information a commit might include, and I believe it's something delta-spark was already emitting for years before ICT came along. I don't know why the ICT spec couldn't just "take over" that existing timestamp
field, nor whether two fields necessarily have the same value.
Also, either now or as a follow-up, we should probably enforce that the commit info does indeed come first? But we'd have to verify that ICT is actually enabled first, since the presence, position, and content of the commit info action is otherwise unconstrained.
Finally, the spec doesn't specifically say what we should do with an ICT that we find in a non-ICT table? Presumably we should ignore it because it's not the timestamp that time travel would use?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Wow this is really annoying. The test tables from delta-rs seem to just pickup the timestamp field in commitInfo. Meanwhile delta-spark uses some external service to get non-ict timestamps.
The issue is that without a commitInfo-based timestamp, we can't get reliable E2E testing for timestamps. Git doesn't preserve timestamps. Perhaps for tests, I should just project the timestamp column out?
Regarding ICT enablement, this is the behaviour of delta-spark:
// If the commit has an In-Commit Timestamp, we should use that as the commit timestamp.
// Note that it is technically possible for a commit range to begin with ICT commits
// followed by non-ICT commits, and end with ICT commits again. Ideally, for these commits
// we should use the file modification time for the first two ranges. However, this
// scenario is an edge case not worth optimizing for.
val ts = commitInfo
.flatMap(_.inCommitTimestamp)
.map(ict => new Timestamp(ict))
.getOrElse(nonICTTimestampsByVersion.get(v).orNull)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Okay so spoke to @zachschuermann and the plan is to fix the delta-rs tests by changing their field from "timestamp" to "inCommitTimestamp"
0163c63
to
99f1321
Compare
|
||
// Since writing in-commit timestamps is not supported, we remove the field so it is not | ||
// written to the log | ||
commit_info_data_type |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@zachschuermann just flagging this comment and approach to writing ICT.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
seems fine :) can we add an issue link/todo so we are tracking this?
@scovich This PR has been updated to only add the ICT field to CommitInfo. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM thanks! just a couple quick things
kernel/src/actions/mod.rs
Outdated
/// The time this logical file was created, as milliseconds since the epoch. Unlike | ||
/// `timestamp`, this field is guaranteed to be monotonically increase with each commit. | ||
/// If in-commit timestamps are enabled, this is always required. | ||
pub(crate) in_commit_timestamp: Option<i64>, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
maybe also note that (when enabled) it requires that commitInfo
be the first action in the commit?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
While that's technically what the protocol says, delta-spark doesn't check that this is true. It simply extracts the CommitInfo if present.
I think the error handling would be more code and let us capture fewer tables than delta spark, and commitInfo being the first action doesn't seem to impact us much in this scenario.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Delta-spark doesn't check, but it also does stop looking after the first line of json. So if the commit info is not first it will appear to not exist at all. Sometimes that causes an error downstream, and sometimes it would cause a silent wrong behavior.
I would prefer to check strictly unless/until we see a need to do otherwise.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@scovich Do you recommend that we error, or just ignore the ICT
|
||
// Since writing in-commit timestamps is not supported, we remove the field so it is not | ||
// written to the log | ||
commit_info_data_type |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
seems fine :) can we add an issue link/todo so we are tracking this?
@zachschuermann Added issue here #634 |
55144b2
to
f8909ba
Compare
What changes are proposed in this pull request?
This PR adds the inCommitTimestamp filed to CommitInfo. This lays the groundwork for supporting ICT for table properties and for writes.
We update transaction write tests to ignore the in-commit timestamp field since the write path does not currently support in-commit timestamps.
How was this change tested?
All existing tests pass.