[WIP] feat: scan from previous result #829
base: main
Conversation
Codecov Report

Attention: Patch coverage is …
Additional details and impacted files

```
@@            Coverage Diff             @@
##             main     #829      +/-   ##
==========================================
+ Coverage   85.07%   85.09%   +0.01%
==========================================
  Files          84       84
  Lines       20797    20916     +119
  Branches    20797    20916     +119
==========================================
+ Hits        17694    17798     +104
- Misses       2226     2234       +8
- Partials      877      884       +7
```
Interesting approach. I think it should be relatively inexpensive because it's only shuffling columns around rather than rewriting data?
kernel/src/scan/mod.rs (Outdated)
```rust
// back into shape as we read it from the log. Since it is already reconciled data,
// we treat it as if it originated from a checkpoint.
let transform = engine.evaluation_handler().new_expression_evaluator(
    Arc::new(scan_row_schema()),
```
Should `scan_row_schema` return a `SchemaRef`, so we can create it once and reuse it?
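A minimal sketch of that suggestion, assuming `SchemaRef` is `Arc<StructType>` and that `scan_row_schema` currently rebuilds the struct type on every call; `build_scan_row_struct_type` is a hypothetical stand-in for today's construction logic:

```rust
use std::sync::{Arc, LazyLock};

// Hypothetical: construct the scan-row struct type exactly once and hand
// out cheap Arc clones afterwards.
static SCAN_ROW_SCHEMA: LazyLock<SchemaRef> =
    LazyLock::new(|| Arc::new(build_scan_row_struct_type()));

pub fn scan_row_schema() -> SchemaRef {
    // Cloning an Arc is a refcount bump, not a re-allocation of the schema.
    SCAN_ROW_SCHEMA.clone()
}
```

Callers like the evaluator above could then drop the extra `Arc::new(...)` wrapper.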
```rust
Expression::Struct(vec![Expression::Struct(vec![
    column_expr!("path"),
    column_expr!("fileConstantValues.partitionValues"),
    column_expr!("size"),
    column_expr!("modificationTime"),
    column_expr!("stats"),
    column_expr!("deletionVector"),
])])
```
Since all of these are just column extracts, it should be pretty simple for an engine to rewire the corresponding columns without actually examining their bytes. I believe our default arrow evaluation would be cheap this way?
I think you are right. Even if clones are needed, it should be very cheap.
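To illustrate why this is cheap with the default Arrow evaluation: Arrow columns are `Arc`'d, so re-assembling a batch reuses buffers instead of copying them. A self-contained sketch using the `arrow` crate (the column name here is illustrative):

```rust
use std::sync::Arc;
use arrow::array::{ArrayRef, Int64Array};
use arrow::datatypes::{DataType, Field, Schema};
use arrow::error::ArrowError;
use arrow::record_batch::RecordBatch;

fn main() -> Result<(), ArrowError> {
    let schema = Arc::new(Schema::new(vec![Field::new("size", DataType::Int64, false)]));
    let col: ArrayRef = Arc::new(Int64Array::from(vec![100_i64, 200, 300]));
    let batch = RecordBatch::try_new(schema.clone(), vec![col])?;

    // "Rewiring" a column into a new batch clones an Arc, not the data:
    // both batches share the same underlying buffers afterwards.
    let rewired = RecordBatch::try_new(schema, vec![batch.column(0).clone()])?;
    assert_eq!(rewired.num_rows(), 3);
    Ok(())
}
```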
```rust
    hint_version: Version,
    hint_data: impl IntoIterator<Item = Box<dyn EngineData>> + 'static,
) -> DeltaResult<Box<dyn Iterator<Item = DeltaResult<ScanMetadata>>>> {
    static RESTORED_ADD_SCHEMA: LazyLock<DataType> = LazyLock::new(|| {
```
How is this different from the original schema? Is it just a subset of fields? Asking because:
- How do we keep them from diverging accidentally?
- If the original schema included unnecessary fields, should we be projecting those out in the original scan as well?
I think we need all the fields we currently extract. We could not use the add schema due to the nullability of the fields, IIRC, and the scan row schema is in the wrong order to match the indices in the dedup visitor.
We can reuse the deletion vector schema at least, though.
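One way to get that reuse, sketched with hypothetical names (the exact constructors and where the DV fields live would come from the kernel's `DeletionVectorDescriptor`; this is not the current API, and the field set follows the Delta protocol's DV descriptor):

```rust
// Hypothetical shared helper: both the log add-file schema and the
// restored scan-row schema would pull the DV fields from here, so the
// two definitions cannot drift apart.
fn deletion_vector_type() -> DataType {
    StructType::new(vec![
        StructField::new("storageType", DataType::STRING, true),
        StructField::new("pathOrInlineDv", DataType::STRING, true),
        StructField::new("offset", DataType::INTEGER, true),
        StructField::new("sizeInBytes", DataType::INTEGER, true),
        StructField::new("cardinality", DataType::LONG, true),
    ])
    .into()
}

static RESTORED_ADD_SCHEMA: LazyLock<DataType> = LazyLock::new(|| {
    StructType::new(vec![StructField::new(
        "add",
        StructType::new(vec![
            // path, partitionValues, size, modificationTime, stats, ...
            StructField::new("deletionVector", deletion_vector_type(), true),
        ]),
        true,
    )])
    .into()
});
```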
```rust
    RESTORED_ADD_SCHEMA.clone(),
);
let apply_transform =
    move |data: Box<dyn EngineData>| Ok((transform.evaluate(data.as_ref())?, false));
```
The existing scan is equivalent to a checkpoint, because it already deduplicated everything, correct?
That was my reasoning at least - the constraint being predicates, but we deferred that to the engine...
kernel/src/scan/mod.rs (Outdated)
```rust
// If the current log segment contains a checkpoint newer than the hint version
// we disregard the existing data hint, and perform a full scan.
```
Hmm. In the incremental snapshot API, a newer checkpoint matters because downstream use sites (like scan) could be quite expensive as the number of deltas grows -- even if the immediate incremental P&M is cheap. Here though, we already paid the cost of a full scan previously, effectively giving us a checkpoint as-of the hint version, and there are no further "downstream" operations to worry about if we aggressively pursue incrementality in our scan reuse.
But I guess that ultimately doesn't matter. The log segment only has deltas after the checkpoint, so a checkpoint after the hint version blocks any hope of an incremental scan.
Maybe a quick code comment could be helpful?
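Something like this, perhaps (the wording is just a sketch of the reasoning above):

```rust
// The log segment only retains commits newer than its most recent
// checkpoint. If that checkpoint is newer than the hint version, the
// commits between the hint and the checkpoint are gone from the segment,
// so an incremental replay is impossible and we fall back to a full scan.
```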
kernel/src/scan/mod.rs (Outdated)
```rust
        let scan_iter = self.scan_metadata(engine)?;
        return Ok(Box::new(scan_iter));
    };
};
```
I don't think we need a `;` after an `if let` block? (Surprised clippy/fmt didn't notice.)
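For reference, an `if let` used as a statement is already complete without a trailing semicolon (`maybe_value` and `consume` are illustrative names):

```rust
if let Some(value) = maybe_value {
    consume(value);
} // no `;` needed after the block
```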
kernel/src/scan/mod.rs (Outdated)
```rust
let mut ascending_commit_files = self.snapshot.log_segment().ascending_commit_files.clone();
ascending_commit_files.retain(|f| f.version > hint_version);
let log_segment = LogSegment::try_new(
    ascending_commit_files,
    vec![],
    self.snapshot.log_segment().log_root.clone(),
    Some(self.snapshot.log_segment().end_version),
)?;
```
nit: pull out a local variable:

```rust
let log_segment = self.snapshot.log_segment();
```

to simplify L522 above as well as this block here:
```diff
-let mut ascending_commit_files = self.snapshot.log_segment().ascending_commit_files.clone();
-ascending_commit_files.retain(|f| f.version > hint_version);
-let log_segment = LogSegment::try_new(
-    ascending_commit_files,
-    vec![],
-    self.snapshot.log_segment().log_root.clone(),
-    Some(self.snapshot.log_segment().end_version),
-)?;
+let mut ascending_commit_files = log_segment.ascending_commit_files.clone();
+ascending_commit_files.retain(|f| f.version > hint_version);
+let log_segment = LogSegment::try_new(
+    ascending_commit_files,
+    vec![],
+    log_segment.log_root.clone(),
+    Some(log_segment.end_version),
+)?;
```
What changes are proposed in this pull request?
While integrating kernel in delta-rs, we ended up exposing quite a few internal functions to make state management and incremental log updates work. A hopefully cleaner approach might be to expose an API that allows engines to reuse existing scan results to facilitate scans.
see also: #825
Since I expect there might be some discussion around if, when, and how this should land, I'm putting up an initial draft to gather some feedback before cleaning things up.
The basic idea is to transform existing scan metadata back into the shape expected by the visitors and treat it as if it originated from a checkpoint. To keep things simple, it is the engine's responsibility to provide data that does not conflict with the current scan's predicate.
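A hedged sketch of the intended flow, using the signature from the diff; the engine setup and how the engine extracts `Box<dyn EngineData>` batches from its retained results (`extract_engine_data` here) are illustrative, not part of this PR:

```rust
// Initial full scan at some version: the engine materializes and keeps
// the returned scan metadata alongside the version it corresponds to.
let scan = snapshot.scan_builder().build()?;
let mut retained: Vec<Box<dyn EngineData>> = Vec::new();
for res in scan.scan_metadata(engine.as_ref())? {
    let metadata = res?;
    retained.push(extract_engine_data(metadata)); // engine-specific, hypothetical
}

// Later, against a newer snapshot of the same table: feed the retained
// data back in, and only the commits after `hint_version` get replayed.
let incremental = newer_scan.scan_metadata_from_exisiting(
    engine.as_ref(),
    hint_version, // the version the retained results were produced at
    retained,
)?;
for res in incremental {
    let scan_metadata = res?; // same shape as a regular scan_metadata result
}
```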
This PR affects the following public APIs:

New: `Scan::scan_metadata_from_exisiting` method that consumes a hint version and previously returned scan data.

How was this change tested?
Additional unit tests for the new APIs.