Part 1, Read transforms via expressions: Just compute the expression and return it. #607

nicklan · 2024-12-18T20:50:32Z

What changes are proposed in this pull request?

This is the initial part of moving to using expressions to express transformations when reading data. What this PR does is:

Compute a "static" transform, which is just a set of column expressions that need to be passed directly through without change, or enough metadata for lower levels to fill in a "fixup" expression
The static transform is passed into the iterator that parses each Add file
When parsing the Add file, if there are needed fix-ups (just partition columns today), the correct expression is created, and inserted into a row indexed map
This map is returned so the caller can find out for a given row what, if any, expression needs to be applied when reading the specified row

Follow-up PRs:

Part 2: propagate transform in visit_scan_files #612: Propagate this information through when using visit_scan_files
Part 3 of expression based transform: Use computed transform #613: Actually use the data to do transformation and remove transform_to_logical entirely
Part 4: read_table.c uses transform in ffi #614: Make this work over ffi and use it
(TODO): Clean up any existing code that's now over complicated in the scan building

Each of those are more invasive and end up touching significant code, so I'm staging this as much as possible to make reviews easier.

How was this change tested?

Unit tests, and inspection of resultant expressions when run on tables

nicklan · 2024-12-19T00:30:27Z

kernel/src/scan/log_replay.rs

    }
 }

 /// Given an iterator of (engine_data, bool) tuples and a predicate, returns an iterator of
 /// `(engine_data, selection_vec)`. Each row that is selected in the returned `engine_data` _must_
 /// be processed to complete the scan. Non-selected rows _must_ be ignored. The boolean flag
 /// indicates whether the record batch is a log or checkpoint batch.
-pub fn scan_action_iter(
+pub(crate) fn scan_action_iter(


Note this is a significant change as we not longer expose this function. In discussion so far we've agreed that it basically should never have been pub, and I just made a mistake when doing so. An engine should call scan_data which mostly just proxies to this, but doesn't expose internal details to the engine.

Open to discussion though.

pub(crate) SGTM!

codecov · 2024-12-19T20:22:54Z

Codecov Report

Attention: Patch coverage is 87.57396% with 21 lines in your changes missing coverage. Please review.

Project coverage is 84.05%. Comparing base (bf7e212) to head (08275cc).
Report is 1 commits behind head on main.

Files with missing lines	Patch %	Lines
kernel/src/scan/log_replay.rs	85.48%	7 Missing and 11 partials ⚠️
kernel/src/scan/mod.rs	94.87%	0 Missing and 2 partials ⚠️
ffi/src/scan.rs	0.00%	1 Missing ⚠️

Additional details and impacted files

@@            Coverage Diff             @@
##             main     #607      +/-   ##
==========================================
+ Coverage   84.00%   84.05%   +0.05%     
==========================================
  Files          75       75              
  Lines       17097    17251     +154     
  Branches    17097    17251     +154     
==========================================
+ Hits        14363    14501     +138     
- Misses       2045     2050       +5     
- Partials      689      700      +11

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

scovich

When parsing the Add file, if there are needed fix-ups (just partition columns today), the correct expression is created, and inserted into a row indexed map

Why do we need a map here? It seems like we either have a fixup for every row, or for no rows? Just apply the fixup conditionally if we see a non-empty vec of fixups?

scovich · 2024-12-30T22:13:14Z

kernel/src/engine_data.rs

+        val.ok_or_else(|| {
+            Error::MissingData(format!("Data missing for field {field_name}")).with_backtrace()
+        })


intentional/permanent change? Or just for debugging?

intentional, since this error occurs in more than one place

aside: I wonder if we should start adding some kind of "location code" as a (much) cheaper alternative to backtraces, that also stays stable as the code base evolves around it?

Yeah, that could work. I'm not too worried about perf for backtraces as they should only appear in error cases though

scovich · 2024-12-30T22:18:16Z

kernel/src/scan/log_replay.rs

+        let have_seen = self.check_and_record_seen(file_key);
+        if is_add && !have_seen {
+            // compute transform here
+            if let Some(ref transform) = self.transform {


Agree on both nesting and rule of 30 here.

Also, this code is redundant:

let have_seen = self.check_and_record_seen(file_key); if is_add && !have_seen { ... do stuff ... } Ok(is_add && !have_seen)

The early return would make very clear what's going on:

if !is_add || have_seen { return Ok(false); } ... do stuff ... Ok(true)

scovich · 2024-12-30T22:37:36Z

kernel/src/scan/log_replay.rs

+        let have_seen = self.check_and_record_seen(file_key);
+        if is_add && !have_seen {


Related to #615:

Data skipping runs before this visitor, which means we can't use the partition values for data skipping in its current form.

How should we proceed? Even if we run a partition value extraction visitor before data skipping, that builds a hash map of parsed partition value literals (instead of embedding them in a struct expression), we still can't use the normal data skipping expression machinery. We'd almost need the row visitor itself to apply partition pruning, using a DefaultPredicateEvaluator that sits on top of the partition values map. The (big) downside of that approach is it won't reliably handle predicates that mix references to partition columns and normal columns, e.g. the following predicate would have no data skipping at all, because both predicate evaluators would reject the OR due to a missing leg:

WHERE partition_col1 = 10 OR value_col2 = 20

It would at least handle top-level AND gracefully, tho:

WHERE partition_col1 = 10 AND value_col2 = 20

(because each predicate evaluator would work with the subset of the AND it understands)

Even if we run a partition value extraction visitor before data skipping, that builds a hash map of parsed partition value literals (instead of embedding them in a struct expression), we still can't use the normal data skipping expression machinery.

Could you explain why we can't use the normal data skipping expression machinery? Current data skipping reads the stats field of add actions. I imagine we could use a visitor to extract the partition values along with the stats, then write back the stats field with updated values. Then data skipping proceeds as normal. idk if this is perhaps expensive, but I think it'll be important to be able to do data skipping on predicates with mixed references.

We definitely want the effect of data skipping, one way or another. I just meant that today's data skipping flow happens before the row visitor that could extract and parse partition values.

Either we need to add a second visitor that runs first and updates the stats column, or we apply partition skipping as a completely separate step (that could run before or after normal data skipping). Updating the stats column has several disadvantages:

Needs a separate visitor pass (runtime cost)

We don't currently have any API for updating an EngineData (we only have expression eval). We know we need to eventually add such capability, but we don't have it yet.

Stats-based pruning code isn't a great fit for partition values, because it wouldn't support nullcount based pruning and min/max based pruning is needlessly complex when always min=max for partition values.

That makes me wonder if we should apply partition pruning after stats-based pruning, as part of the existing row visitor that already filters out previously seen files:

Parse partition values into a HashMap<ColumnName, Scalar>, which already has #[cfg(test)] impl ResolveColumnAsScalar in predicates/mod.rs (just need to remove the feature flag from it).

Wrap a DefaultPredicateEvaluator around the partition values hashmap, and evaluate it.

Aha that makes sense. So move it till later to avoid complicating the existing data skipping and avoiding the runtime cost.

As for mixed references -- it will work for a majority of cases, because most partition predicates are simple top-level conjuncts, like this:

WHERE partition_col1 = 10 AND value_col2 = 20

The partition pruning code would handle the first conjunct (ignoring the second), and stats pruning code would handle the second conjunct (ignoring the first). This is actually how Delta-spark does it today.

Seems like being out was a great way for me to get this resolved :)

In seriousness though, that suggestion makes sense. We can let the existing flow prune via stats, and then just run the predicate evaluator over the extracted hashmap in the visitor, which can modify its already existing selection vector to prune files where the partition doesn't match.

wrt. to this PR, I think the code flow then still makes sense, and we can take partition pruning as a follow-up?

kernel/src/scan/log_replay.rs

OussamaSaoudi-db

looks good to me 👍

OussamaSaoudi-db · 2025-01-07T22:00:12Z

kernel/src/scan/log_replay.rs

        state::{DvInfo, Stats},
-        test_utils::{add_batch_simple, add_batch_with_remove, run_with_validate_callback},
+        test_utils::{


nit: we can perhaps flatten these imports.

nicklan · 2025-01-08T00:30:41Z

Why do we need a map here? It seems like we either have a fixup for every row, or for no rows? Just apply the fixup conditionally if we see a non-empty vec of fixups?

It's not the same for every row. The file could have a different value for the partition column for instance, or in the future each file could need the variant cols fixed up differently.

Perhaps you meant we could just have a Vec<Option<Transform>> where the index maps to the row? I considered this as well and decided that we'd potentially be creating large vecs of None in the cases that the majority of the batch was not Add actions, and so this was more compact and efficient. Happy to revisit that choice if you think otherwise though.

scovich · 2025-01-08T17:02:45Z

Why do we need a map here? It seems like we either have a fixup for every row, or for no rows? Just apply the fixup conditionally if we see a non-empty vec of fixups?

It's not the same for every row. The file could have a different value for the partition column for instance, or in the future each file could need the variant cols fixed up differently.

Perhaps you meant we could just have a Vec<Option<Transform>> where the index maps to the row? I considered this as well and decided that we'd potentially be creating large vecs of None in the cases that the majority of the batch was not Add actions, and so this was more compact and efficient. Happy to revisit that choice if you think otherwise though.

I think we have three potential cases:

No fixups are needed (e.g. not partitioned, no variant shredding, etc). The container should be empty or even missing (don't bloat it with a bunch of None). Doesn't matter which container type we use.
~Most rows need a fixup. Vec would be preferable to HashMap in that case (zipper join instead of a hash join)
~Few rows need a fixup (e.g. because many add actions were filtered out by data skipping). Open question whether Vec<Option> or HashMap is better.

Vec is definitely good for 1/ and 2/, and I estimate that Vec will also work Just Fine for 3/, given the batch sizes we expect to encounter in practice.

Rationale:

Each fixup will be fairly large, perhaps O(100 bytes)
To hit a space problem we would need very good data skipping. For example, if we assume 1 fixup is equivalent to 25 None, we would need 96% skipping or better to get even 2x bloat.
Further, we would need a large batch size (128k rows or more) for all those None to occupy more than a couple MB of actual memory. Checkpoint parquet files only hold 30k rows each, so that leaves only json commits which are usually not very large.
Even if we did hit both conditions above, data skipping is very query-dependent and so we would have a 25x bigger space problem in case we ever got a query with poor data skipping.

nicklan · 2025-01-09T17:20:11Z

Vec is definitely good for 1/ and 2/, and I estimate that Vec will also work Just Fine for 3/, given the batch sizes we expect to encounter in practice.

Following discussion, agree a Vec makes more sense, and have changed it to that

scovich

Couple nits to fix before merge, but otherwise LGTM

scovich · 2025-01-13T17:51:49Z

kernel/src/engine_data.rs

+        val.ok_or_else(|| {
+            Error::MissingData(format!("Data missing for field {field_name}")).with_backtrace()
+        })


aside: I wonder if we should start adding some kind of "location code" as a (much) cheaper alternative to backtraces, that also stays stable as the code base evolves around it?

scovich · 2025-01-13T18:11:56Z

kernel/src/scan/log_replay.rs

+                        ));
+                    };
+                    let name = field.physical_name();
+                    let value_expression = super::parse_partition_value(


aside: Looking at #624, I wonder if there's a (worthwhile) way to parse partition values only once per file action? But partition pruning and data fixup happen so far apart that I suspect it would be simpler (and maybe even cheaper) to parse a second time rather than try to build up and track a big side collection of parsed partition values.

It would perhaps be a different story if we had a clean way to convert partition values from string-string map to parsed struct using expressions, because then the partition values would be conveniently embedded in the log replay engine data. But I don't see that happening any time soon, given how much effort it would take to add map and string parsing expression support.

Yeah, this is a good point. Depending on how we merge things, we should consider looking at it when the second of this or #624 go in

scovich · 2025-01-13T18:16:36Z

kernel/src/scan/log_replay.rs

+                        partition_values.get(name),
+                        field.data_type(),
+                    )?;
+                    Ok(value_expression.into())


Technically this isn't an expression (yet). Maybe better to call it partition_value (scalar), which then gets converted into a (literal) expression?

Suggested change

Ok(value_expression.into())

Ok(partition_value.into())

scovich · 2025-01-13T19:05:02Z

kernel/src/scan/log_replay.rs

+        let have_seen = self.check_and_record_seen(file_key);
+        if !is_add || have_seen {
+            return Ok(false);
+        }


This seems like a dangerous change (because somebody trying to "optimize" the code might produce control flow that skips non-adds without checking them first). Now that you no longer need the have_seen multiple times, can we partially revert so it matches the code comment at L142 again?

Suggested change

let have_seen = self.check_and_record_seen(file_key);

if !is_add || have_seen {

return Ok(false);

}

if self.check_and_record_seen(file_key) || !is_add {

return Ok(false);

}

Actually, we should probably update the comment to match the new code:

// Check both adds and removes (skipping already-seen), but only transform and return adds

scovich · 2025-01-13T19:12:34Z

kernel/src/scan/log_replay.rs

+
+        fn validate_transform(transform: Option<&ExpressionRef>, expected_date_offset: i32) {
+            assert!(transform.is_some());
+            if let Expression::Struct(inner) = transform.unwrap().as_ref() {


Seems like a good place for let-else matching?

let Expression::Struct(inner) = transform.unwrap().as_ref() else { panic!("Transform should always be a struct expr"); }; assert_eq!(...); let Expression::Column(ref name) = inner[0] else { panic!("Expected first expression to be a column"); }; assert_eq!(...); let Expression::Literal(ref scalar) = inner[1] else { panic!("Expected second expression to be a literal"); }; assert_eq!(...);

(less indentation => more readable)

scovich · 2025-01-13T19:19:50Z

kernel/src/scan/mod.rs

@@ -371,11 +399,22 @@ impl Scan {
    ///   the query. NB: If you are using the default engine and plan to call arrow's
    ///   `filter_record_batch`, you _need_ to extend this vector to the full length of the batch or
    ///   arrow will drop the extra rows.
+    /// - `HashMap<usize, Expression>`: Transformation expressions that need to be applied. For each


I think it's using a Vec now?

nice catch. updated and updated the description

scovich · 2025-01-13T19:22:22Z

kernel/src/scan/mod.rs

+        // Compute the static part of the transformation. This is `None` if no transformation is
+        // needed (currently just means no partition cols, but will be extended for other transforms
+        // as we support them)


The comment doesn't reference column mapping? Should it?

Also, what other kind of transform might there be, besides "static" referenced here?

"Other transforms" means future things we may need to apply transforms for. So, variant decoding for example. If something needed variant decoding then the static_transform would not be None.

zachschuermann

looks great mostly just questions as I got caught up on everything! I guess my only concern is the row-based nature which in the worst case would have an expression for every row? if we had the notion of stateful KDF's I wonder if that would remove the need to create an expression for every row?

zachschuermann · 2025-01-15T04:32:28Z

kernel/src/scan/mod.rs

@@ -320,7 +320,20 @@ pub enum ColumnType {
    Partition(usize),
 }

-pub type ScanData = (Box<dyn EngineData>, Vec<bool>);
+/// A transform is ultimately a `Struct` expr. This holds the set of expressions that make that struct expr up
+type Transform = Vec<TransformExpr>;


(idea for future) maybe worth introducing scan/transforms.rs for a transforms module?

also qq how does this actually become struct? just since you use that to colocate all the different exprs in the vec?

Yeah, not a bad idea on the module. I'll consider it when I handle merging this with #624.

A TransformExpr can either be a static "just select this col", or something that actually requires a transform, like "this is a partition col". In the log-replay code we iterate over this, turn the static "select cols" into Column expressions, and fill in the actual proper expressions for the others, which leaves a Vec<Expr> which we make a struct expr.

ah yep got it thanks!

zachschuermann · 2025-01-15T04:36:53Z

kernel/src/scan/mod.rs

+        // for other transforms as we support them)
+        let static_transform = (self.have_partition_cols
+            || self.snapshot.column_mapping_mode != ColumnMappingMode::None)
+            .then_some(Arc::new(Scan::get_static_transform(&self.all_fields)));


is this desirable over self.get_static_transform() sort of API just so we can recycle/use it elsewhere in the future?

It would be, but to do that the tests for this functionality would always have to construct a snapshot, which is... hard at the moment. Perhaps when we switch to ResolvedTable we can simplify this

zachschuermann · 2025-01-15T04:58:07Z

kernel/src/scan/mod.rs

+/// Transforms aren't computed all at once. So static ones can just go straight to `Expression`, but
+/// things like partition columns need to filled in. This enum holds an expression that's part of a
+/// `Transform`.
+pub(crate) enum TransformExpr {


wonder if we we should consider something to communicate that this is some sort of unresolved/partial transform. might be useful to call out that the static part is already resolved as an actual Expr for transform but then partition requires resolution to actually deduce the expr (i realize this is comment above haha but took me a little to catch on so wondering if we could optimize naming)

Hopefully people read the comment :) I don't really want to bloat the name

zachschuermann · 2025-01-15T05:02:51Z

kernel/src/scan/log_replay.rs

+                    let field = self.logical_schema.fields.get_index(*field_idx);
+                    let Some((_, field)) = field else {
+                        return Err(Error::generic(
+                            "logical schema did not contain expected field, can't transform data",


mega nit but might be useful to log the index?

Minor change: Propagate the computed transforms from #607 through calls to `visit_scan_files`.

Use the transform that has been computed (see #607 and #612) rather than using `transform_to_logical`. 1. Remove `column_mapping_mode` from `GlobalScanState` (it's not needed there anymore) 2. Remove the old `transform_to_logical` code 3. Add a new `scan::state::transform_to_logical` function that encapsulates the boilerplate of applying the transform expression 4. Use the new function where needed. Existing tests pass which test this functionality extensively.

nicklan added 12 commits December 9, 2024 15:08

also extract partitionValues

6e0b1c9

checkpoint

c53b7de

Merge branch 'main' into transform-expr

9ac7173

hey, it kinda works

f75b2e3

Merge branch 'main' into transform-expr

4d1d4f7

undo change to ColumnType, will go a different direction

c8cc84b

use TransformExpr

29ded0e

cleanup

9d4688c

Merge branch 'main' into transform-expr

631f403

optional transform

f791167

add initial tests

b7268e5

adjust comments

da5a9e8

github-actions bot assigned nicklan Dec 18, 2024

fix comment

e3fdfaa

nicklan requested review from zachschuermann, scovich and OussamaSaoudi-db and removed request for zachschuermann and scovich December 18, 2024 20:54

github-actions bot added the breaking-change Change that require a major version bump label Dec 18, 2024

nicklan added 2 commits December 18, 2024 16:21

oops, fix ffi

e9a8d1c

cleanup examples

b773614

nicklan commented Dec 19, 2024

View reviewed changes

nicklan added 2 commits December 18, 2024 16:45

Actually use ExpressionRef

ebcb42d

Merge branch 'main' into transform-expr

3a38785

nicklan added 2 commits December 19, 2024 13:51

remove unused try_from

58ad2a3

need transform if column mapping is enabled

3d040f7

nicklan mentioned this pull request Dec 20, 2024

Part 2: propagate transform in visit_scan_files #612

Merged

nicklan changed the title ~~Read transforms via expressions. Part 1: Just compute the expression and return it.~~ Part 1, Read transforms via expressions: Just compute the expression and return it. Dec 20, 2024

scovich mentioned this pull request Dec 30, 2024

Merge partition columns into scan statistics for data skipping #615

Closed

scovich reviewed Dec 30, 2024

View reviewed changes

OussamaSaoudi-db reviewed Jan 2, 2025

View reviewed changes

kernel/src/scan/log_replay.rs Show resolved Hide resolved

timsaucer mentioned this pull request Jan 7, 2025

partition skipping filter #624

Open

2 tasks

nicklan added 4 commits January 7, 2025 09:28

move getting transform into own function

999143d

rename transforms

de5c18a

address comments

d5e55a4

Merge branch 'main' into transform-expr

a97f605

nicklan requested review from OussamaSaoudi-db and scovich January 7, 2025 17:37

OussamaSaoudi-db reviewed Jan 7, 2025

View reviewed changes

flatten imports

1ea159c

switch to a Vec for transforms

f6b81ac

Merge branch 'main' into transform-expr

7254704

nicklan requested a review from OussamaSaoudi-db January 10, 2025 01:22

scovich approved these changes Jan 13, 2025

View reviewed changes

nicklan added 2 commits January 14, 2025 12:39

Merge branch 'main' into transform-expr

fe68f0a

address comments

0ea983d

zachschuermann approved these changes Jan 15, 2025

View reviewed changes

nicklan added 3 commits January 22, 2025 14:40

Merge branch 'main' into transform-expr

6084d7a

include index

9f5eca6

Merge remote-tracking branch 'upstream/main' into transform-expr

08275cc

nicklan merged commit 45eedf2 into delta-io:main Jan 23, 2025
21 checks passed

nicklan added a commit that referenced this pull request Feb 4, 2025

Part 2: propagate transform in visit_scan_files (#612)

6a82a57

Minor change: Propagate the computed transforms from #607 through calls to `visit_scan_files`.

		let have_seen = self.check_and_record_seen(file_key);
		if is_add && !have_seen {

Part 1, Read transforms via expressions: Just compute the expression and return it. #607

Part 1, Read transforms via expressions: Just compute the expression and return it. #607

Uh oh!

Conversation

nicklan commented Dec 18, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

What changes are proposed in this pull request?

How was this change tested?

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

codecov bot commented Dec 19, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Codecov Report

Uh oh!

scovich left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

OussamaSaoudi-db left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

nicklan commented Jan 8, 2025

Uh oh!

scovich commented Jan 8, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

nicklan commented Jan 9, 2025

Uh oh!

scovich left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

nicklan commented Dec 18, 2024 •

edited

Loading

codecov bot commented Dec 19, 2024 •

edited

Loading

scovich commented Jan 8, 2025 •

edited

Loading