feat: Add in-commit timestamp support for change data feed #617
Codecov Report
@@            Coverage Diff             @@
##             main     #617      +/-   ##
==========================================
+ Coverage   84.78%   84.87%   +0.08%     
==========================================
  Files          88       88              
  Lines       22605    22758     +153     
  Branches    22605    22758     +153     
==========================================
+ Hits        19166    19316     +150     
- Misses       2459     2460       +1     
- Partials      980      982       +2     
LGTM, one nit
A few things, but looks good though!
    add_paths: &mut add_paths,
    remove_dvs: &mut remove_dvs,
    has_cdc_action: &mut has_cdc_action,
    commit_timestamp: &mut timestamp,
would this be clearer?
    - commit_timestamp: &mut timestamp,
    + in_commit_timestamp: &mut timestamp,
We initialize this field with the file modification timestamp, so it would be inaccurate to call it that. I do like the update you made below, though, where we actually read the ICT from a CommitInfo.
    /// 2. Construct a map from path to deletion vector of remove actions that share the same path
    /// as an add action.
    /// 3. Perform validation on each protocol and metadata action in the commit.
    /// 4. Extract the in-commit timestamp from [`CommitInfo`] if it is present.
I can't comment on L130 above but I think we need to do some comment updates?
Good catch! I went through every mention of ICT and I think I got them all.
    Action::CommitInfo(CommitInfo {
        in_commit_timestamp: Some(timestamp),
        ..Default::default()
    }),
what happens if commit info isn't first? do we still read it? I know the protocol says it must be first with ICT enabled but I wonder what the expected behavior is when it isn't first? do we do the right thing?
(but probably don't solve here)
Discussed a little here:
#581 (comment)
I'm still quite certain that delta-spark doesn't care about the ordering, because it goes through all the actions in the commit looking for CommitInfo:
        var commitInfo: Option[CommitInfo] = None
        actions.foreach {
          case c: AddCDCFile =>
            cdcActions.append(c)
            totalFiles += 1L
            totalBytes += c.size
          case a: AddFile =>
            totalFiles += 1L
            totalBytes += a.size
          case r: RemoveFile =>
            totalFiles += 1L
            totalBytes += r.size.getOrElse(0L)
          case i: CommitInfo => commitInfo = Some(i)
          case _ => // do nothing
        }
I've added a check that only records the ICT if CommitInfo is the first action in the commit, but that raises a question: should we fail if it isn't the first action?
I can also revert the check that CommitInfo is first and revisit that in a future PR.
I'd prefer to be strict. ICT isn't widely adopted yet, so hopefully we don't have too many bad actors yet either. If kernel-rs is strict that will deter future bad actors.
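To make the two behaviors in this thread concrete, here is a minimal sketch that contrasts a lenient scan (roughly what the quoted delta-spark loop does, where the last CommitInfo seen wins) with a strict check that rejects a misplaced CommitInfo. The `Action` enum, `ict_lenient`, and `ict_strict` are hypothetical stand-ins written for illustration, not delta-kernel-rs types or functions, and the strict variant assumes any CommitInfo outside index 0 should be an error.

```rust
/// Hypothetical, simplified stand-in for a Delta log action.
#[derive(Debug, Clone, PartialEq)]
enum Action {
    CommitInfo { in_commit_timestamp: Option<i64> },
    Add { path: String },
    Remove { path: String },
}

/// Lenient: scan every action and use the last CommitInfo found,
/// wherever it sits in the commit.
fn ict_lenient(actions: &[Action]) -> Option<i64> {
    actions
        .iter()
        .rev()
        .find_map(|a| match a {
            Action::CommitInfo { in_commit_timestamp } => Some(*in_commit_timestamp),
            _ => None,
        })
        .flatten()
}

/// Strict (assumed policy): a CommitInfo is accepted only at index 0;
/// finding one anywhere else is an error.
fn ict_strict(actions: &[Action]) -> Result<Option<i64>, String> {
    let mut ict = None;
    for (i, action) in actions.iter().enumerate() {
        if let Action::CommitInfo { in_commit_timestamp } = action {
            if i != 0 {
                return Err(format!("CommitInfo found at index {}, expected index 0", i));
            }
            ict = *in_commit_timestamp;
        }
    }
    Ok(ict)
}
```

With a commit whose CommitInfo comes second, `ict_lenient` still returns the timestamp, while `ict_strict` returns an error, which is the deterrent effect argued for above.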
    Option::<Cdc>::get_struct_field(CDC_NAME),
    Option::<Metadata>::get_struct_field(METADATA_NAME),
    Option::<Protocol>::get_struct_field(PROTOCOL_NAME),
    StructField::new("commitInfo", StructType::new([ict_type]), true),
Though I wonder if we can do something similar to the lines above, like Option::<CommitInfo>::get_struct_field(COMMIT_INFO_NAME), and get the inCommitTimestamp struct field from that?
But for now, can we at least use COMMIT_INFO_NAME?
    - StructField::new("commitInfo", StructType::new([ict_type]), true),
    + StructField::new(COMMIT_INFO_NAME, StructType::new([ict_type]), true),
i wonder if we can do something similar to above like Option::<CommitInfo>::get_struct_field(COMMIT_INFO_NAME) and get struct field inCommitTimestamp of that?
We would get a StructField of type CommitInfo, from which we'd have to 1) get the data type, 2) cast it to a struct, and 3) get the ICT field. So I'll stick with your suggested change 👍
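The trade-off described here can be sketched with toy schema types. Everything below (`DataType`, `StructField`, the helper functions, and the extra `timestamp` and `operation` fields) is a hypothetical simplification for illustration and does not reproduce the kernel's actual schema API; only the name `COMMIT_INFO_NAME` and the `inCommitTimestamp` field come from the discussion.

```rust
/// Hypothetical, simplified schema types (not the kernel's real ones).
#[derive(Debug, Clone, PartialEq)]
enum DataType {
    Long,
    String,
    Struct(Vec<StructField>),
}

#[derive(Debug, Clone, PartialEq)]
struct StructField {
    name: String,
    data_type: DataType,
    nullable: bool,
}

impl StructField {
    fn new(name: &str, data_type: DataType, nullable: bool) -> Self {
        StructField { name: name.to_string(), data_type, nullable }
    }
}

const COMMIT_INFO_NAME: &str = "commitInfo";

/// Stand-in for a field describing the whole CommitInfo struct.
fn full_commit_info_field() -> StructField {
    let children = vec![
        StructField::new("timestamp", DataType::Long, true),
        StructField::new("inCommitTimestamp", DataType::Long, true),
        StructField::new("operation", DataType::String, true),
    ];
    StructField::new(COMMIT_INFO_NAME, DataType::Struct(children), true)
}

/// Route 1: start from the full field and drill down in three steps.
fn ict_field_from_full(field: &StructField) -> Option<StructField> {
    // 1) get the data type, 2) check that it is a struct
    match &field.data_type {
        // 3) pick out the ICT field
        DataType::Struct(children) => children
            .iter()
            .find(|f| f.name == "inCommitTimestamp")
            .cloned(),
        _ => None,
    }
}

/// Route 2: build the narrow projection directly, as in the suggested change.
fn projected_commit_info_field() -> StructField {
    let ict_type = StructField::new("inCommitTimestamp", DataType::Long, true);
    StructField::new(COMMIT_INFO_NAME, DataType::Struct(vec![ict_type]), true)
}
```

Both routes end with a `commitInfo` struct holding only the ICT field; the direct construction simply skips the unwrap-and-search steps, which is the reason given for keeping it.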
    Action::Cdc(cdc.clone()),
    Action::CommitInfo(commit_info.clone()),
are these ordered? should commit info be first?
swapped ordering
    if self.is_first_batch && i == 0 {
        *self.commit_timestamp = in_commit_timestamp;
    }
Relating to the other thread -- to enforce that commit info is first, we would just need an else here that returns Err?
    - if self.is_first_batch && i == 0 {
    -     *self.commit_timestamp = in_commit_timestamp;
    - }
    + if self.is_first_batch && i == 0 {
    +     *self.commit_timestamp = in_commit_timestamp;
    + } else {
    +     return Err(...);
    + }
or even use require! macro?
    - if self.is_first_batch && i == 0 {
    -     *self.commit_timestamp = in_commit_timestamp;
    - }
    + require!(self.is_first_batch && i == 0, Error::Something(...));
    + *self.commit_timestamp = in_commit_timestamp;
Went with the require macro. 👍
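For readers unfamiliar with the pattern, here is a self-contained sketch of how a `require!`-style guard could enforce the ordering. The macro definition, the `Error::Generic` variant, the `Visitor` struct, and `visit_commit_info` are all assumptions made for this example; the kernel's own macro and error variants may differ.

```rust
/// Hypothetical error type for this sketch.
#[derive(Debug, PartialEq)]
enum Error {
    Generic(String),
}

/// Assumed shape of a `require!`-style macro: return early with the
/// given error when the condition does not hold.
macro_rules! require {
    ($cond:expr, $err:expr) => {
        if !($cond) {
            return Err($err);
        }
    };
}

/// Hypothetical visitor holding only the fields named in the thread.
struct Visitor<'a> {
    is_first_batch: bool,
    commit_timestamp: &'a mut i64,
}

impl Visitor<'_> {
    /// Called for row `i` of a batch when that row carries an in-commit timestamp.
    fn visit_commit_info(&mut self, i: usize, in_commit_timestamp: i64) -> Result<(), Error> {
        require!(
            self.is_first_batch && i == 0,
            Error::Generic(format!(
                "CommitInfo with an in-commit timestamp must be the first action, found at row {}",
                i
            ))
        );
        // Only reached when the ordering requirement holds.
        *self.commit_timestamp = in_commit_timestamp;
        Ok(())
    }
}
```

Compared with the `if`/`else` form, the guard keeps the happy path unindented and puts the failure condition and its error on one statement.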
What changes are proposed in this pull request?
This is a Stacked PR. Please look at the latest commit in the branch!
This adds support for in-commit timestamps (ICT) when reading the change data feed. When a commit contains a commitInfo action with inCommitTimestamp, that timestamp is used for all changed rows in the commit.
Please only review these commits.
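As a rough illustration of the behavior described above, the sketch below starts from the file modification time and lets an in-commit timestamp, when present, override it for every changed row in the commit. `resolve_commit_timestamp`, `ChangeRow`, and `stamp_rows` are hypothetical names invented for this example, not part of the kernel.

```rust
/// Choose the timestamp for a commit: the ICT if present,
/// otherwise the commit file's modification time.
fn resolve_commit_timestamp(file_modification_time: i64, in_commit_timestamp: Option<i64>) -> i64 {
    in_commit_timestamp.unwrap_or(file_modification_time)
}

/// Hypothetical changed row carrying the commit-level timestamp.
#[derive(Debug, PartialEq)]
struct ChangeRow {
    id: u32,
    commit_timestamp: i64,
}

/// Stamp every changed row in one commit with the same resolved timestamp.
fn stamp_rows(ids: &[u32], file_modification_time: i64, ict: Option<i64>) -> Vec<ChangeRow> {
    let ts = resolve_commit_timestamp(file_modification_time, ict);
    ids.iter()
        .map(|&id| ChangeRow { id, commit_timestamp: ts })
        .collect()
}
```

A test in the spirit of this PR would build a commit with an ICT that differs from the file modification time and assert that every row carries the ICT.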
How was this change tested?
Added tests to check that the timestamp extracted from commits containing in-commit timestamps is the ICT rather than the file modification time.