OussamaSaoudi-db
Collaborator

@OussamaSaoudi-db OussamaSaoudi-db commented Dec 30, 2024

What changes are proposed in this pull request?

This is a Stacked PR. Please look at the latest commit in the branch!

This adds support for in-commit timestamps (ICT) when reading the change data feed. When a commit contains a commitInfo action with inCommitTimestamp, that timestamp is now used for all changed rows in the commit.
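
The selection rule described above can be sketched roughly as follows. This is a minimal illustration, not the kernel's actual code: `CommitMeta`, `commit_timestamp`, and `stamp_rows` are hypothetical names, and timestamps are assumed to be milliseconds since the epoch.

```rust
/// Hypothetical, simplified view of one commit (not a kernel type).
#[derive(Debug, Clone, PartialEq)]
pub struct CommitMeta {
    /// Modification time of the commit file (assumed: ms since epoch).
    pub file_modification_time: i64,
    /// `inCommitTimestamp` from `commitInfo`, if the commit carries one.
    pub in_commit_timestamp: Option<i64>,
}

/// Pick the commit-level timestamp: the ICT if present,
/// otherwise fall back to the file modification time.
pub fn commit_timestamp(commit: &CommitMeta) -> i64 {
    commit
        .in_commit_timestamp
        .unwrap_or(commit.file_modification_time)
}

/// Stamp every changed row in the commit with the same timestamp.
/// Rows are represented by hypothetical `u64` ids for illustration.
pub fn stamp_rows(commit: &CommitMeta, row_ids: &[u64]) -> Vec<(u64, i64)> {
    let ts = commit_timestamp(commit);
    row_ids.iter().map(|&id| (id, ts)).collect()
}
```

The point of the sketch is that the choice is made once per commit, so all rows from the same commit share one timestamp.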

Please only review these commits.

How was this change tested?

Added tests to check that the timestamp extracted from commits containing in-commit timestamps is the ICT rather than the file modification time.

codecov bot commented Dec 30, 2024

Codecov Report

Attention: Patch coverage is 91.60305% with 22 lines in your changes missing coverage. Please review.

Project coverage is 84.87%. Comparing base (1f4e4c0) to head (680ba1c).

| Files with missing lines | Patch % | Lines |
|---|---|---|
| kernel/src/table_changes/log_replay.rs | 83.72% | 5 Missing and 9 partials ⚠️ |
| kernel/src/actions/visitors.rs | 41.66% | 0 Missing and 7 partials ⚠️ |
| kernel/src/table_changes/scan.rs | 88.88% | 0 Missing and 1 partial ⚠️ |
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #617      +/-   ##
==========================================
+ Coverage   84.78%   84.87%   +0.08%     
==========================================
  Files          88       88              
  Lines       22605    22758     +153     
  Branches    22605    22758     +153     
==========================================
+ Hits        19166    19316     +150     
- Misses       2459     2460       +1     
- Partials      980      982       +2     

@OussamaSaoudi-db OussamaSaoudi-db changed the title from "feat: Add in-commit timestamp support for change data fede" to "feat: Add in-commit timestamp support for change data feed" Jan 2, 2025
Collaborator

@scovich scovich left a comment

LGTM, one nit

Member

@zachschuermann zachschuermann left a comment

A few things, but looks good overall!

add_paths: &mut add_paths,
remove_dvs: &mut remove_dvs,
has_cdc_action: &mut has_cdc_action,
commit_timestamp: &mut timestamp,
Member

would this be clearer?

Suggested change
- commit_timestamp: &mut timestamp,
+ in_commit_timestamp: &mut timestamp,

Collaborator Author

We initialize this field with the file modification timestamp, so it would be inaccurate to call it that. I do like the update you made below, though, where we actually read the ICT from a CommitInfo.
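
To illustrate the naming point, here is a minimal sketch under stated assumptions: `TimestampVisitor`, `visit_commit_info`, and `timestamp_for_commit` are hypothetical stand-ins, and the real visitor carries more fields (`add_paths`, `remove_dvs`, `has_cdc_action`). The shared value starts out as the file modification time and is only overwritten when an ICT is actually read, which is why a generic name fits the field.

```rust
/// Hypothetical stand-in for the visitor discussed in this thread.
pub struct TimestampVisitor<'a> {
    /// Starts as the file modification time; may later hold the ICT.
    pub commit_timestamp: &'a mut i64,
}

impl TimestampVisitor<'_> {
    /// Called for each `commitInfo` action; overwrites the shared
    /// timestamp only when the action actually carries an ICT.
    pub fn visit_commit_info(&mut self, in_commit_timestamp: Option<i64>) {
        if let Some(ict) = in_commit_timestamp {
            *self.commit_timestamp = ict;
        }
    }
}

/// Hypothetical driver: seed with the file modification time, then
/// let the visitor see the ICT field of each `commitInfo` action.
pub fn timestamp_for_commit(file_mtime: i64, commit_infos: &[Option<i64>]) -> i64 {
    let mut timestamp = file_mtime;
    let mut visitor = TimestampVisitor {
        commit_timestamp: &mut timestamp,
    };
    for ict in commit_infos {
        visitor.visit_commit_info(*ict);
    }
    timestamp
}
```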

/// 2. Construct a map from path to deletion vector of remove actions that share the same path
/// as an add action.
/// 3. Perform validation on each protocol and metadata action in the commit.
/// 4. Extract the in-commit timestamp from [`CommitInfo`] if it is present.
Member

I can't comment on L130 above but I think we need to do some comment updates?

Collaborator Author

Good catch! I went through every mention of ICT and I think I got them all.

Comment on lines +622 to +678
Action::CommitInfo(CommitInfo {
in_commit_timestamp: Some(timestamp),
..Default::default()
}),
Member

what happens if commit info isn't first? do we still read it? I know the protocol says it must be first with ICT enabled but I wonder what the expected behavior is when it isn't first? do we do the right thing?

Member

(but probably don't solve here)

Collaborator Author

Discussed a little here:
#581 (comment)

I'm still quite certain that delta-spark doesn't care about the ordering, because it goes through all the actions in the commit looking for CommitInfo:

        var commitInfo: Option[CommitInfo] = None
        actions.foreach {
          case c: AddCDCFile =>
            cdcActions.append(c)
            totalFiles += 1L
            totalBytes += c.size
          case a: AddFile =>
            totalFiles += 1L
            totalBytes += a.size
          case r: RemoveFile =>
            totalFiles += 1L
            totalBytes += r.size.getOrElse(0L)
          case i: CommitInfo => commitInfo = Some(i)
          case _ => // do nothing
        }

I've added a check that only uses the ICT if CommitInfo is the first action in the log file, but that raises a question: should we fail if it isn't the first action?
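
The two behaviors under discussion can be contrasted in a small sketch. The `Action` enum here is a hypothetical, heavily simplified stand-in for the kernel's type, and both function names are made up for illustration.

```rust
/// Hypothetical, simplified action type (not the kernel's `Action`).
#[derive(Debug, Clone, PartialEq)]
pub enum Action {
    Add { path: String },
    Remove { path: String },
    CommitInfo { in_commit_timestamp: Option<i64> },
}

/// Order-insensitive: use the last `CommitInfo` found anywhere in the
/// commit, mirroring the delta-spark loop quoted above.
pub fn ict_anywhere(actions: &[Action]) -> Option<i64> {
    actions
        .iter()
        .rev()
        .find_map(|action| match action {
            Action::CommitInfo { in_commit_timestamp } => Some(*in_commit_timestamp),
            _ => None,
        })
        .flatten()
}

/// First-only: use the ICT only when `CommitInfo` is the first action.
pub fn ict_if_first(actions: &[Action]) -> Option<i64> {
    match actions.first() {
        Some(Action::CommitInfo { in_commit_timestamp }) => *in_commit_timestamp,
        _ => None,
    }
}
```

For a commit where an add action precedes the CommitInfo, `ict_anywhere` still finds the timestamp while `ict_if_first` silently returns `None`, which is the case the question about failing is concerned with.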

Collaborator Author

I can also revert the check that CommitInfo is first and revisit that in a future PR.

Collaborator

I'd prefer to be strict. ICT isn't widely adopted yet, so hopefully we don't have too many bad actors yet either. If kernel-rs is strict that will deter future bad actors.

Option::<Cdc>::get_struct_field(CDC_NAME),
Option::<Metadata>::get_struct_field(METADATA_NAME),
Option::<Protocol>::get_struct_field(PROTOCOL_NAME),
StructField::new("commitInfo", StructType::new([ict_type]), true),
Member

though i wonder if we can do something similar to above like Option<CommitInfo>::get_struct_field(COMMIT_INFO_NAME) and get struct field inCommitTimestamp of that?

but for now at least can use COMMIT_INFO_NAME?

Suggested change
- StructField::new("commitInfo", StructType::new([ict_type]), true),
+ StructField::new(COMMIT_INFO_NAME, StructType::new([ict_type]), true),

Collaborator Author

> i wonder if we can do something similar to above like Option<CommitInfo>::get_struct_field(COMMIT_INFO_NAME) and get struct field inCommitTimestamp of that?

We would get a StructField of type CommitInfo, for which we'd have to 1) get the data type, 2) cast it to a struct, and 3) get the ICT field. So I'll stick with your suggested change 👍
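
The trade-off can be sketched with simplified schema types. Everything below is a hypothetical model rather than the kernel's API: `DataType`, this `StructField`, `project_ict`, and `build_ict_field` exist only for illustration, and the nullability of the ICT field is an assumption.

```rust
/// Hypothetical, simplified schema types (the kernel's types are richer).
#[derive(Debug, Clone, PartialEq)]
pub enum DataType {
    Long,
    String,
    Struct(Vec<StructField>),
}

#[derive(Debug, Clone, PartialEq)]
pub struct StructField {
    pub name: String,
    pub data_type: DataType,
    pub nullable: bool,
}

impl StructField {
    pub fn new(name: &str, data_type: DataType, nullable: bool) -> Self {
        Self { name: name.to_string(), data_type, nullable }
    }
}

/// Route A: start from the full `commitInfo` field and dig out one child.
pub fn project_ict(commit_info: &StructField) -> Option<StructField> {
    // 1) get the data type, 2) check that it is a struct
    let DataType::Struct(children) = &commit_info.data_type else {
        return None;
    };
    // 3) pick out the `inCommitTimestamp` child
    let ict = children.iter().find(|f| f.name == "inCommitTimestamp")?.clone();
    Some(StructField::new(
        &commit_info.name,
        DataType::Struct(vec![ict]),
        commit_info.nullable,
    ))
}

/// Route B: build the narrow field directly, as in the suggested change.
pub fn build_ict_field(commit_info_name: &str) -> StructField {
    // Assumption: the ICT is a nullable long.
    let ict = StructField::new("inCommitTimestamp", DataType::Long, true);
    StructField::new(commit_info_name, DataType::Struct(vec![ict]), true)
}
```

In this model both routes yield the same projected field, but route B needs no fallible unwrapping, which matches the reasoning for keeping the direct construction.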

Comment on lines 330 to 332
Action::Cdc(cdc.clone()),
Action::CommitInfo(commit_info.clone()),
Member

are these ordered? should commit info be first?

Collaborator Author

swapped ordering

@OussamaSaoudi-db OussamaSaoudi-db force-pushed the cdf_ict_impl branch 2 times, most recently from 7242904 to f50c202 Compare January 10, 2025 00:13
@OussamaSaoudi OussamaSaoudi added the merge hold Don't allow the PR to merge label Jan 10, 2025
@github-actions github-actions bot added the breaking-change Change that require a major version bump label Feb 4, 2025
Comment on lines 362 to 364
if self.is_first_batch && i == 0 {
*self.commit_timestamp = in_commit_timestamp;
}
Collaborator

Relating to the other thread -- to enforce that commit info is first, we would just need an else here that returns Err?

Suggested change
- if self.is_first_batch && i == 0 {
-     *self.commit_timestamp = in_commit_timestamp;
- }
+ if self.is_first_batch && i == 0 {
+     *self.commit_timestamp = in_commit_timestamp;
+ } else {
+     return Err(...);
+ }

or even use require! macro?

Suggested change
- if self.is_first_batch && i == 0 {
-     *self.commit_timestamp = in_commit_timestamp;
- }
+ require!(self.is_first_batch && i == 0, Error::Something(...));
+ *self.commit_timestamp = in_commit_timestamp;

Collaborator

Went with the require macro. 👍
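
For readers unfamiliar with the pattern, the strict check could look roughly like the sketch below. It assumes a `require!`-style macro that returns early with the given error when the condition is false; the macro definition, the `Error` variant, and the `IctVisitor` shape are hypothetical stand-ins, not the code that was merged.

```rust
/// Hypothetical error type; the kernel's real error enum differs.
#[derive(Debug, PartialEq)]
pub enum Error {
    CommitInfoNotFirst { is_first_batch: bool, row: usize },
}

/// Stand-in for a `require!`-style macro: return `Err` early when the
/// condition does not hold.
macro_rules! require {
    ($cond:expr, $err:expr) => {
        if !($cond) {
            return Err($err);
        }
    };
}

/// Hypothetical visitor holding only the fields relevant to this check.
pub struct IctVisitor<'a> {
    pub is_first_batch: bool,
    pub commit_timestamp: &'a mut i64,
}

impl IctVisitor<'_> {
    /// `i` is the row index, within the current batch, where an ICT was found.
    pub fn visit_ict(&mut self, i: usize, in_commit_timestamp: i64) -> Result<(), Error> {
        require!(
            self.is_first_batch && i == 0,
            Error::CommitInfoNotFirst {
                is_first_batch: self.is_first_batch,
                row: i,
            }
        );
        // Only reached when the ICT came from the very first action.
        *self.commit_timestamp = in_commit_timestamp;
        Ok(())
    }
}
```

Compared with the `if`/`else` variant, the macro keeps the happy path unindented and leaves the timestamp untouched on failure.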

@OussamaSaoudi OussamaSaoudi force-pushed the cdf_ict_impl branch 2 times, most recently from 0230327 to 20228f6 Compare April 30, 2025 22:00
@OussamaSaoudi OussamaSaoudi force-pushed the cdf_ict_impl branch 2 times, most recently from c745a0c to 94bbe54 Compare May 2, 2025 15:41
@OussamaSaoudi OussamaSaoudi self-assigned this Jul 22, 2025
Labels
breaking-change Change that require a major version bump merge hold Don't allow the PR to merge

5 participants