Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: Add the in-commit timestamp field to CommitInfo #581

Merged
merged 13 commits into from
Jan 15, 2025

Conversation

OussamaSaoudi-db
Copy link
Collaborator

@OussamaSaoudi-db OussamaSaoudi-db commented Dec 9, 2024

What changes are proposed in this pull request?

This PR adds the inCommitTimestamp filed to CommitInfo. This lays the groundwork for supporting ICT for table properties and for writes.

We update transaction write tests to ignore the in-commit timestamp field since the write path does not currently support in-commit timestamps.

How was this change tested?

All existing tests pass.

Copy link

codecov bot commented Dec 9, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 83.70%. Comparing base (606db20) to head (d37a191).
Report is 1 commits behind head on main.

Additional details and impacted files
@@            Coverage Diff             @@
##             main     #581      +/-   ##
==========================================
+ Coverage   83.68%   83.70%   +0.02%     
==========================================
  Files          75       75              
  Lines       16951    16955       +4     
  Branches    16951    16955       +4     
==========================================
+ Hits        14185    14193       +8     
+ Misses       2100     2098       -2     
+ Partials      666      664       -2     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

Copy link
Collaborator

@scovich scovich left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Approach looks good, but we're using the wrong column name.

Comment on lines 363 to 364
} else if let Some(timestamp) = getters[16].get_long(i, "commitInfo.timestamp")? {
*self.timestamp = timestamp;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

According to the Delta spec:

The commitInfo action must be the first action in the commit... [and] must include a field named inCommitTimestamp, of type long

The current CommitInfo in kernel doesn't even define that field; the timestamp field is just an example from the spec (that predates ICT) of the sorts of information a commit might include, and I believe it's something delta-spark was already emitting for years before ICT came along. I don't know why the ICT spec couldn't just "take over" that existing timestamp field, nor whether two fields necessarily have the same value.

Also, either now or as a follow-up, we should probably enforce that the commit info does indeed come first? But we'd have to verify that ICT is actually enabled first, since the presence, position, and content of the commit info action is otherwise unconstrained.

Finally, the spec doesn't specifically say what we should do with an ICT that we find in a non-ICT table? Presumably we should ignore it because it's not the timestamp that time travel would use?

Copy link
Collaborator Author

@OussamaSaoudi-db OussamaSaoudi-db Dec 10, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wow this is really annoying. The test tables from delta-rs seem to just pickup the timestamp field in commitInfo. Meanwhile delta-spark uses some external service to get non-ict timestamps.

The issue is that without a commitInfo-based timestamp, we can't get reliable E2E testing for timestamps. Git doesn't preserve timestamps. Perhaps for tests, I should just project the timestamp column out?

Regarding ICT enablement, this is the behaviour of delta-spark:

// If the commit has an In-Commit Timestamp, we should use that as the commit timestamp.
// Note that it is technically possible for a commit range to begin with ICT commits
// followed by non-ICT commits, and end with ICT commits again. Ideally, for these commits
// we should use the file modification time for the first two ranges. However, this
// scenario is an edge case not worth optimizing for.
val ts = commitInfo
  .flatMap(_.inCommitTimestamp)
  .map(ict => new Timestamp(ict))
  .getOrElse(nonICTTimestampsByVersion.get(v).orNull)

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay so spoke to @zachschuermann and the plan is to fix the delta-rs tests by changing their field from "timestamp" to "inCommitTimestamp"

@OussamaSaoudi-db OussamaSaoudi-db marked this pull request as draft December 11, 2024 00:08
@github-actions github-actions bot added the breaking-change Change that will require a version bump label Dec 11, 2024
@OussamaSaoudi-db OussamaSaoudi-db changed the title Support in-commit timestamps in Change Data Feed Add the in-commit timestamp field to CommitInfo Dec 30, 2024

// Since writing in-commit timestamps is not supported, we remove the field so it is not
// written to the log
commit_info_data_type
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@zachschuermann just flagging this comment and approach to writing ICT.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

seems fine :) can we add an issue link/todo so we are tracking this?

@OussamaSaoudi-db
Copy link
Collaborator Author

@scovich This PR has been updated to only add the ICT field to CommitInfo.

@OussamaSaoudi-db OussamaSaoudi-db marked this pull request as ready for review December 30, 2024 22:43
@OussamaSaoudi-db OussamaSaoudi-db changed the title Add the in-commit timestamp field to CommitInfo feat: Add the in-commit timestamp field to CommitInfo Dec 30, 2024
Copy link
Collaborator

@scovich scovich left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!

@OussamaSaoudi-db OussamaSaoudi-db removed the breaking-change Change that will require a version bump label Jan 8, 2025
Copy link
Collaborator

@zachschuermann zachschuermann left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM thanks! just a couple quick things

Comment on lines 335 to 338
/// The time this logical file was created, as milliseconds since the epoch. Unlike
/// `timestamp`, this field is guaranteed to be monotonically increase with each commit.
/// If in-commit timestamps are enabled, this is always required.
pub(crate) in_commit_timestamp: Option<i64>,
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe also note that (when enabled) it requires that commitInfo be the first action in the commit?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

While that's technically what the protocol says, delta-spark doesn't check that this is true. It simply extracts the CommitInfo if present.

I think the error handling would be more code and let us capture fewer tables than delta spark, and commitInfo being the first action doesn't seem to impact us much in this scenario.

Copy link
Collaborator

@scovich scovich Jan 9, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Delta-spark doesn't check, but it also does stop looking after the first line of json. So if the commit info is not first it will appear to not exist at all. Sometimes that causes an error downstream, and sometimes it would cause a silent wrong behavior.

I would prefer to check strictly unless/until we see a need to do otherwise.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@scovich Do you recommend that we error, or just ignore the ICT


// Since writing in-commit timestamps is not supported, we remove the field so it is not
// written to the log
commit_info_data_type
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

seems fine :) can we add an issue link/todo so we are tracking this?

@OussamaSaoudi-db
Copy link
Collaborator Author

@zachschuermann Added issue here #634

@OussamaSaoudi OussamaSaoudi merged commit 616e9ac into delta-io:main Jan 15, 2025
21 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants