
[WIP] feat: scan from previous result #829


Draft — wants to merge 3 commits into base: main

Conversation

@roeap (Collaborator) commented Apr 11, 2025

What changes are proposed in this pull request?

While integrating kernel in delta-rs, we ended up exposing quite a few internal functions to make state management and incremental log updates work. A hopefully cleaner approach might be to expose an API that allows engines to reuse existing scan results to facilitate scans.

see also: #825

Since I expect there might be some discussion around if, when, and how this should land, I'm putting up an initial draft to gather some feedback before cleaning things up.

The basic idea is to transform existing scan metadata back into the shape expected by the visitors and treat it as if it originated from a checkpoint. To keep things simple, it is the engine's responsibility to provide data that does not conflict with the current scan's predicate.
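The round trip described above can be sketched with a toy model. All types and names below are illustrative stand-ins, not the actual delta-kernel-rs API: previously returned scan files are treated as a checkpoint as of the hint version, and only commits newer than the hint are replayed on top.

```rust
// Toy model of "scan from previous result". `ScanFile`, `scan_from_existing`,
// and all parameters are hypothetical, for illustration only.

#[derive(Clone, Debug, PartialEq)]
struct ScanFile {
    path: String,
    version: u64, // version at which the file was added
}

/// Treat `previous` as if it were a checkpoint at `hint_version`; replay only
/// the adds/removes from commits newer than the hint on top of it.
fn scan_from_existing(
    previous: Vec<ScanFile>,
    hint_version: u64,
    new_adds: Vec<ScanFile>,  // adds from commits with version > hint_version
    new_removes: Vec<String>, // paths removed in commits > hint_version
) -> Vec<ScanFile> {
    let mut result: Vec<ScanFile> = previous
        .into_iter()
        .filter(|f| !new_removes.contains(&f.path))
        .collect();
    result.extend(new_adds.into_iter().filter(|f| f.version > hint_version));
    result
}
```

Note the sketch ignores predicates entirely, matching the PR's stance that conflicting data is the engine's problem.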

This PR affects the following public APIs

New Scan::scan_metadata_from_existing method that consumes a

How was this change tested?

Additional unit tests for the new APIs.


codecov bot commented Apr 11, 2025

Codecov Report

Attention: Patch coverage is 86.36364% with 18 lines in your changes missing coverage. Please review.

Project coverage is 85.09%. Comparing base (c94f1a4) to head (9917b04).

Files with missing lines Patch % Lines
kernel/src/scan/mod.rs 83.92% 11 Missing and 7 partials ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #829      +/-   ##
==========================================
+ Coverage   85.07%   85.09%   +0.01%     
==========================================
  Files          84       84              
  Lines       20797    20916     +119     
  Branches    20797    20916     +119     
==========================================
+ Hits        17694    17798     +104     
- Misses       2226     2234       +8     
- Partials      877      884       +7     


@scovich (Collaborator) left a comment

Interesting approach. I think it should be relatively inexpensive because it's only shuffling columns around rather than rewriting data?

// back into shape as we read it from the log. Since it is already reconciled data,
// we treat it as if it originated from a checkpoint.
let transform = engine.evaluation_handler().new_expression_evaluator(
Arc::new(scan_row_schema()),
Collaborator
Should scan_row_schema return a SchemaRef so we can create once and reuse it?

Comment on lines +343 to +350
Expression::Struct(vec![Expression::Struct(vec![
column_expr!("path"),
column_expr!("fileConstantValues.partitionValues"),
column_expr!("size"),
column_expr!("modificationTime"),
column_expr!("stats"),
column_expr!("deletionVector"),
])])
Collaborator
Since all of these are just column extracts, it should be pretty simple for an engine to rewire the corresponding columns without actually examining their bytes. I believe our default arrow evaluation would be cheap this way?

Collaborator Author

I think you are right. Even if clones are needed, it should be very cheap.
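A toy illustration of why the clones are cheap (hypothetical `rewire` helper, not kernel code): since the expression above only extracts columns, rewiring them amounts to cloning `Arc`s, which bumps a refcount without touching the underlying buffers. Arrow arrays share buffers the same way internally.

```rust
use std::sync::Arc;

// Reorder columns by cloning Arc handles. No element data is copied; the
// output shares the same underlying buffers as the input.
fn rewire(columns: &[Arc<Vec<i64>>], order: &[usize]) -> Vec<Arc<Vec<i64>>> {
    order.iter().map(|&i| Arc::clone(&columns[i])).collect()
}
```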

hint_version: Version,
hint_data: impl IntoIterator<Item = Box<dyn EngineData>> + 'static,
) -> DeltaResult<Box<dyn Iterator<Item = DeltaResult<ScanMetadata>>>> {
static RESTORED_ADD_SCHEMA: LazyLock<DataType> = LazyLock::new(|| {
Collaborator

How is this different from the original schema? Is it just a subset of fields? Asking because:

  1. How do we keep them from diverging accidentally?
  2. If the original schema included unnecessary fields, should we be projecting those out in the original scan as well?

Collaborator Author

I think we need all the fields we currently extract. We could not use the add schema due to the nullability of the fields, IIRC, and the scan row schema is in the wrong order to match the indices in the dedup visitor.

We can at least reuse the deletion vector schema, though.

RESTORED_ADD_SCHEMA.clone(),
);
let apply_transform =
move |data: Box<dyn EngineData>| Ok((transform.evaluate(data.as_ref())?, false));
Collaborator

The existing scan is equivalent to a checkpoint, because it already deduplicated everything, correct?

Collaborator Author

That was my reasoning, at least. The remaining constraint is predicates, but we deferred that to the engine...

Comment on lines 520 to 521
// If the current log segment contains a checkpoint newer than the hint version
// we disregard the existing data hint, and perform a full scan.
Collaborator

Hmm. In the incremental snapshot API, a newer checkpoint matters because downstream use sites (like scan) could be quite expensive as the number of deltas grows, even if the immediate incremental P&M is cheap. Here, though, we already paid the cost of a full scan previously, effectively giving us a checkpoint as of the hint version, and there are no further "downstream" operations to worry about if we aggressively pursue incrementality in our scan reuse.

Collaborator

But I guess that ultimately doesn't matter. The log segment only has deltas after the checkpoint, so a checkpoint after the hint version blocks any hope of an incremental scan.

Maybe a quick code comment could be helpful?
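The rule these two comments converge on can be stated as a tiny predicate (a hypothetical helper for illustration, not kernel code): reuse is only possible when the log segment's checkpoint, if any, is not newer than the hint version, because the segment no longer carries the commits between the hint and a newer checkpoint.

```rust
// Hypothetical helper capturing the reuse rule discussed above. A checkpoint
// newer than the hint version means the log segment is missing the commits
// between hint and checkpoint, so we must fall back to a full scan.
fn can_reuse_existing(hint_version: u64, checkpoint_version: Option<u64>) -> bool {
    match checkpoint_version {
        Some(cp) => cp <= hint_version,
        None => true, // no checkpoint: all commits after the hint are present
    }
}
```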

let scan_iter = self.scan_metadata(engine)?;
return Ok(Box::new(scan_iter));
};
};
Collaborator

I don't think we need ; after an if let block?

(surprised clippy/fmt didn't notice)

Comment on lines 530 to 537
let mut ascending_commit_files = self.snapshot.log_segment().ascending_commit_files.clone();
ascending_commit_files.retain(|f| f.version > hint_version);
let log_segment = LogSegment::try_new(
ascending_commit_files,
vec![],
self.snapshot.log_segment().log_root.clone(),
Some(self.snapshot.log_segment().end_version),
)?;
Collaborator

nit: pull out a local variable:

let log_segment = self.snapshot.log_segment();

to simplify L522 above as well as this block here:

Suggested change
let mut ascending_commit_files = self.snapshot.log_segment().ascending_commit_files.clone();
ascending_commit_files.retain(|f| f.version > hint_version);
let log_segment = LogSegment::try_new(
ascending_commit_files,
vec![],
self.snapshot.log_segment().log_root.clone(),
Some(self.snapshot.log_segment().end_version),
)?;
let mut ascending_commit_files = log_segment.ascending_commit_files.clone();
ascending_commit_files.retain(|f| f.version > hint_version);
let log_segment = LogSegment::try_new(
ascending_commit_files,
vec![],
log_segment.log_root.clone(),
Some(log_segment.end_version),
)?;

@github-actions github-actions bot added the breaking-change Change that require a major version bump label Apr 15, 2025