feat!: Split out predicates as different from expressions #775

scovich · 2025-03-27T22:11:50Z

What changes are proposed in this pull request?

Teach kernel to treat "predicates" (boolean-valued invertible expressions) as different from normal expressions (which are generally neither boolean-valued nor invertible). Accomplished by splitting out a new Predicate type from today's Expression type, and then adjusting all the various transforms, visitors, and evaluation frameworks accordingly.

This change is highly invasive, but very important because kernel's data skipping cares very much about predicates (= expressions that return boolean values and are invertible), which are quite different from the ordinary expressions used for transforming data. We see that tension already in the fact that some of our binary operators are really (invertible) binary predicates (=, DISTINCT, etc.), while others are not (e.g. +, -). Further, a key piece of our predicate evaluation is the ability to push NOT through an invertible expression. Pushing down NOT is more than just a performance optimization -- it is required for correct stats-based data skipping because NOT skipping_predicate(<expr>) is NOT equivalent to skipping_predicate(NOT <expr>).
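
To make the last point concrete, here is a minimal sketch of why NOT has to be pushed into the predicate before a stats-based check is derived. It is a toy model, not kernel code: `Stats`, `Pred`, `push_down_not`, `may_match`, `keep_file_correct` and `keep_file_naive` are all hypothetical names, and the only statistics assumed are a min and max for a single integer column `x`.

```rust
/// Min/max statistics for one integer column `x` in one file (hypothetical).
#[derive(Clone, Copy)]
struct Stats {
    min: i64,
    max: i64,
}

/// A toy predicate language over the single column `x`.
#[derive(Clone, Debug, PartialEq)]
enum Pred {
    /// `x < value`
    Lt(i64),
    /// `x >= value`
    Ge(i64),
    Not(Box<Pred>),
}

/// Pushes NOT down by inverting the comparison it wraps.
fn push_down_not(pred: &Pred) -> Pred {
    match pred {
        Pred::Not(inner) => match push_down_not(inner) {
            Pred::Lt(v) => Pred::Ge(v),
            Pred::Ge(v) => Pred::Lt(v),
            // Cannot occur: the recursive call already removed every NOT.
            Pred::Not(p) => *p,
        },
        other => other.clone(),
    }
}

/// Stats check for a NOT-free predicate: true means the file MAY contain a
/// matching row and must be kept. Returns None if a NOT is still present.
fn may_match(pred: &Pred, stats: Stats) -> Option<bool> {
    match pred {
        Pred::Lt(v) => Some(stats.min < *v),
        Pred::Ge(v) => Some(stats.max >= *v),
        Pred::Not(_) => None,
    }
}

/// Correct: invert the predicate first, then derive the stats check.
fn keep_file_correct(pred: &Pred, stats: Stats) -> bool {
    may_match(&push_down_not(pred), stats).unwrap_or(true)
}

/// Incorrect: derive the stats check for the inner predicate, then negate it.
fn keep_file_naive(pred: &Pred, stats: Stats) -> bool {
    match pred {
        Pred::Not(inner) => !keep_file_naive(inner, stats),
        other => may_match(other, stats).unwrap_or(true),
    }
}
```

With `min = 5` and `max = 20`, the predicate `NOT (x < 10)` can match rows (for example `x = 15`), so the file must be kept. The pushed-down form `x >= 10` checks `max >= 10` and keeps it, while the naive form negates `min < 10` and wrongly skips the file.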

The work has been carefully split into a number of commits, each focusing on a different change. Most of the changes are preparatory work intended to gradually increase the amount of predicate awareness in the code, while reducing the churn of the final diff.

Closes #765

This PR affects the following public APIs

Everything related to expressions.

How was this change tested?

Added new unit tests and updated existing ones

codecov · 2025-04-24T17:49:13Z

Codecov Report

Attention: Patch coverage is 70.84399% with 342 lines in your changes missing coverage. Please review.

Project coverage is 84.68%. Comparing base (a93a85a) to head (886b87e).
Report is 1 commit behind head on main.

Files with missing lines	Patch %	Lines
ffi/src/expressions/engine.rs	0.00%	87 Missing ⚠️
ffi/src/expressions/kernel.rs	0.00%	72 Missing ⚠️
ffi/src/test_ffi.rs	0.00%	52 Missing ⚠️
kernel/src/kernel_predicates/tests.rs	65.62%	4 Missing and 29 partials ⚠️
kernel/src/expressions/mod.rs	86.25%	22 Missing ⚠️
kernel/src/expressions/transforms.rs	88.41%	13 Missing and 6 partials ⚠️
...src/engine/arrow_expression/evaluate_expression.rs	68.42%	11 Missing and 7 partials ⚠️
kernel/src/kernel_predicates/mod.rs	90.90%	14 Missing ⚠️
kernel/src/scan/data_skipping/tests.rs	90.14%	5 Missing and 2 partials ⚠️
kernel/src/engine/arrow_expression/mod.rs	81.48%	0 Missing and 5 partials ⚠️
... and 6 more

Additional details and impacted files

@@            Coverage Diff             @@
##             main     #775      +/-   ##
==========================================
- Coverage   85.28%   84.68%   -0.60%     
==========================================
  Files          88       88              
  Lines       22260    22610     +350     
  Branches    22260    22610     +350     
==========================================
+ Hits        18985    19148     +163     
- Misses       2305     2484     +179     
- Partials      970      978       +8

zachschuermann

WHEW i've made it :) I left a number of comments/questions but no major concerns, will go ahead and stamp it. really impressive PR thanks @scovich!!

zachschuermann · 2025-05-02T22:46:04Z

kernel/src/expressions/transforms.rs

        use Cow::*;
-        let u = match self.transform(&u.expr)? {
+        let u = match self.transform_expr(&u.expr)? {


why is this not doing transform_pred? if u is pred? (and below)

Unary predicates take expressions as input and produce boolean as output. And here we're recursing into that input expression.

NOT is a special case because it takes a predicate as input (which is why it gets a separate variant instead of being a type of unary predicate)
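
The distinction described in this reply can be sketched with two mutually recursive toy types. This is an illustration only: the variants and the `count_columns_expr` / `count_columns_pred` helpers are hypothetical and much smaller than the kernel's real `Expression` and `Predicate` enums. The point is that a unary predicate recurses into its child with the expression function, while NOT recurses with the predicate function.

```rust
#[derive(Debug, Clone, PartialEq)]
enum Expression {
    Column(String),
    Literal(i64),
    /// A predicate used as a boolean-valued expression.
    Predicate(Box<Predicate>),
}

#[derive(Debug, Clone, PartialEq)]
enum Predicate {
    /// A unary predicate: expression in, boolean out.
    IsNull(Box<Expression>),
    /// NOT gets its own variant because its input is a predicate.
    Not(Box<Predicate>),
}

/// Counts column references in an expression.
fn count_columns_expr(expr: &Expression) -> usize {
    match expr {
        Expression::Column(_) => 1,
        Expression::Literal(_) => 0,
        Expression::Predicate(p) => count_columns_pred(p),
    }
}

/// Counts column references in a predicate, recursing with the function
/// that matches the type of each child.
fn count_columns_pred(pred: &Predicate) -> usize {
    match pred {
        // The child of a unary predicate is an expression.
        Predicate::IsNull(e) => count_columns_expr(e),
        // The child of NOT is a predicate.
        Predicate::Not(p) => count_columns_pred(p),
    }
}
```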

zachschuermann · 2025-05-02T23:12:50Z

ffi/src/expressions/kernel.rs

+/// # Safety
+/// Engine is responsible for passing a valid SharedPredicate
+#[no_mangle]
+pub unsafe extern "C" fn free_kernel_predicate(data: Handle<SharedPredicate>) {


hah funny just realized we had 'free_kernel_predicate' before (just called all those exprs predicates)

Yeah, this PR definitely had to tighten up naming in a few places.

zachschuermann · 2025-05-02T23:15:49Z

kernel/src/engine/arrow_expression/mod.rs

+}
+
+impl PredicateEvaluator for DefaultPredicateEvaluator {
+    fn evaluate(&self, batch: &dyn EngineData) -> DeltaResult<Box<dyn EngineData>> {


recognizing this is same as expression evaluator's evaluate (just with evaluate_predicate) - should we unify (or maybe just track some unification whenever we do that TODO marked below?)

This is a similar situation to the parquet vs. json handlers. Just because the methods are structurally similar doesn't necessarily mean they're logically related in a way that justifies tying them together.

At least with json vs. parquet the "shape" of the output is identical, and we have situations like parquet vs. json checkpoint manifests. I'm having trouble imagining a case where we want to evaluate an expression or a predicate in a generic way, with downstream code unconditionally processing the result the same way?

zachschuermann · 2025-05-02T23:16:41Z

kernel/src/lib.rs

+    /// [`Schema`]: crate::schema::StructType
+    fn new_predicate_evaluator(
+        &self,
+        schema: SchemaRef,


nit: maybe just explicitly call it input_schema?

new_expression_evaluator has the same naming. Changed both.

zachschuermann · 2025-05-02T23:20:12Z

ffi/src/expressions/engine.rs

@@ -104,6 +121,42 @@ pub extern "C" fn visit_predicate_and(
    wrap_predicate(state, result)
 }

+#[no_mangle]
+pub extern "C" fn visit_expression_plus(


oh and these just didn't exist before?

zachschuermann · 2025-05-02T23:39:22Z

kernel/src/engine/arrow_expression/evaluate_expression.rs

+    match predicate {
+        BooleanExpression(expr) => {
+            // Grr -- there's no way to cast an `Arc<dyn Array>` back to its native type, so we
+            // can't use `Arc::into_inner` here and must unconditionally clone instead.


but this is still just Arc clone right?

No, that's the annoying part. We're calling clone on a &BooleanArray, whose values and null_buffer are not Arc. And unlike our AsAny trait, Array::as_any works with &self, not Arc<Self>, so we can't downcast to Arc<BooleanArray>. Even if we could, there's no reliable way to take ownership of an Arc's inner, tho Arc::try_unwrap would probably work fine in practice (with clone as a fallback in case it somehow was actually a shared reference).

Fortunately, it's very rare to treat a predicate as an expression -- even if technically possible. The most likely reason would be something like (a < b) IS NULL, since IS NULL takes an expression as input.

So hopefully this won't cause performance problems.
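
For reference, the "try to take ownership, clone as a fallback" idea mentioned above looks roughly like this in plain std Rust. It is only a sketch: `take_or_clone` is a hypothetical helper, and it needs an `Arc<T>` of a concrete, sized `T`, which is exactly what the thread says is not available when starting from an `Arc<dyn Array>`.

```rust
use std::sync::Arc;

/// Takes ownership of the value inside `arc` when this is the only strong
/// reference; otherwise clones the shared value.
fn take_or_clone<T: Clone>(arc: Arc<T>) -> T {
    // `Arc::try_unwrap` hands the `Arc` back in `Err` if it is still shared.
    Arc::try_unwrap(arc).unwrap_or_else(|shared| (*shared).clone())
}
```

When the `Arc` is uniquely owned no copy is made; when it is shared, the caller pays for one clone and the other holders are unaffected.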

zachschuermann · 2025-05-02T23:45:18Z

kernel/src/expressions/mod.rs

+/// of literals. It is up to the predicate evaluator to validate the
+/// predicate against a schema and add appropriate casts as required.
+#[derive(Debug, Clone, PartialEq)]
+pub enum Predicate {


I wonder if it would be useful to split this module into expressions/predicates modules?

(there's a good amount of overlap so could also see the case for keeping them together)

Yeah, there's just enough overlap, and a mutually recursive dependency, that I don't think we gain much by splitting the mod. Plus, doing so would make a much uglier diff -- prefer to do it in a follow-up PR if possible, to preserve a semblance of sanity for reviewers on this PR.

zachschuermann · 2025-05-02T23:48:30Z

kernel/src/expressions/mod.rs

    }

    /// Create a new predicate `self == other`
-    pub fn eq(self, other: impl Into<Self>) -> Self {
-        Self::binary(BinaryPredicateOp::Equal, self, other)
+    pub fn eq(a: impl Into<Expression>, b: impl Into<Expression>) -> Self {


why the departure from self like what's used in Expression above?

Because this is Predicate::eq and it takes expressions not predicates as input.

Same reason Expression::eq doesn't return Self (because it needs to return Predicate)

I'm on the fence whether we even want/need both forms. Expression::eq allows infix notation such as expr1.eq(expr2), while Predicate::eq allows things like Predicate::eq(column_name!("foo"), expr2) which are not possible with infix notation (because the first arg is Into<Expression>, not Expression).

A quick code search confirms that there are almost no callers of the Predicate::eq form tho -- except the corresponding Expression::eq that proxies it -- but things like Expression::eq(a, b) might be weird, and the PR diff would also get a lot uglier because of code movement etc.

So yeah... not sure.

You could use into() for the infix case right?, like expr2.eq(column_name!("foo").into()).

I'm generally in favor of having one way to do things. I feel like the Predicate::eq way is a bit more clear, but if we're not using it much maybe that suggests infix is preferred. Regardless I'd say let's not take it up here and follow-up after this merges.

I did a code audit:

Pred::not is already an associated function (not a method), and anyway no analogue exists for Expr

Expr::is_null has only three call sites, all in test code, while Pred::is_null has more call sites including prod code

Expr::is_not_null has eight call sites, half in prod code, while Pred::is_not_null is only used by test code

All binary comparisons (eq, lt, distinct, etc) are only used in test code

Overall, infix Expr is almost perfectly tied with Pred usage

Most operators see similar usage in both forms, 4-8 call sites each

lt and gt are a lot more popular, and have exactly opposite Expr/Pred usage patterns: 14/30 for lt and 30/15 for gt.

So yeah, overall we have an almost perfect split today between Expr (method) vs. Pred (associated function) forms. We probably should just choose one and fix up the other's call sites.
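
The two forms being compared can be sketched as follows. The types here are simplified stand-ins (two-variant `Expression`, one-variant `Predicate`, hypothetical `From` impls), not the kernel definitions. Only the shape of the two `eq` signatures is taken from the diff above.

```rust
#[derive(Debug, Clone, PartialEq)]
enum Expression {
    Column(String),
    Literal(i64),
}

#[derive(Debug, Clone, PartialEq)]
enum Predicate {
    Eq(Box<Expression>, Box<Expression>),
}

// Hypothetical conversions, standing in for things like `column_name!("foo")`.
impl From<i64> for Expression {
    fn from(v: i64) -> Self {
        Expression::Literal(v)
    }
}

impl<'a> From<&'a str> for Expression {
    fn from(name: &'a str) -> Self {
        Expression::Column(name.to_string())
    }
}

impl Predicate {
    /// Associated-function form: both operands may be anything that
    /// converts into an expression.
    fn eq(a: impl Into<Expression>, b: impl Into<Expression>) -> Self {
        Predicate::Eq(Box::new(a.into()), Box::new(b.into()))
    }
}

impl Expression {
    /// Method (infix) form: proxies the associated function and returns a
    /// `Predicate`, not `Self`. The receiver must already be an `Expression`.
    fn eq(self, other: impl Into<Expression>) -> Predicate {
        Predicate::eq(self, other)
    }
}
```

In this sketch `Predicate::eq("foo", 5_i64)` converts both operands implicitly, whereas the infix form needs an explicit conversion on the left: `Expression::from("foo").eq(5_i64)`. Both build the same value.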

nicklan

phew, sorry for the delay. finished first pass. Looks mostly great! Had a few comments, but nothing major.

nicklan · 2025-05-06T23:36:14Z

kernel/src/expressions/mod.rs

    }

    /// Create a new predicate `self == other`
-    pub fn eq(self, other: impl Into<Self>) -> Self {
-        Self::binary(BinaryPredicateOp::Equal, self, other)
+    pub fn eq(a: impl Into<Expression>, b: impl Into<Expression>) -> Self {


You could use into() for the infix case right?, like expr2.eq(column_name!("foo").into()).

I'm generally in favor of having one way to do things. I feel like the Predicate::eq way is a bit more clear, but if we're not using it much maybe that suggests infix is preferred. Regardless I'd say let's not take it up here and follow-up after this merges.

nicklan · 2025-05-06T23:38:21Z

ffi/src/expressions/engine.rs

+    state
+        .inflight_ids
+        .insert(ExpressionOrPredicate::Expression(expr.into()))


Could we save a line with:

Suggested change

-    state
-        .inflight_ids
-        .insert(ExpressionOrPredicate::Expression(expr.into()))
+    use ExpressionOrPredicate::*;
+    state.inflight_ids.insert(Expression(expr.into()))

same below

The method is using two different things called Expression at the same time. But I saved a line by pulling out an intermediate value instead.

nicklan · 2025-05-06T23:42:59Z

ffi/src/expressions/engine.rs

+    match left.zip(right) {
+        Some((left, right)) => wrap_predicate(state, Predicate::binary(op, left, right)),
+        None => 0, // invalid child => invalid node
+    }


Why zip? Could just do:

Suggested change

-    match left.zip(right) {
-        Some((left, right)) => wrap_predicate(state, Predicate::binary(op, left, right)),
-        None => 0, // invalid child => invalid node
-    }
+    match (left, right) {
+        (Some(left), Some(right)) => wrap_predicate(state, Predicate::binary(op, left, right)),
+        _ => 0, // invalid child => invalid node
+    }

Fair. Updated both versions of this method.
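
A small std-only sketch of the equivalence behind this suggestion, using hypothetical helpers `combine_zip` and `combine_tuple` over plain integers rather than the FFI visitor types:

```rust
/// Combines two optional children using `Option::zip`.
fn combine_zip(left: Option<i64>, right: Option<i64>) -> i64 {
    match left.zip(right) {
        Some((l, r)) => l + r,
        None => 0, // invalid child => invalid node
    }
}

/// The same logic, matching on the tuple directly.
fn combine_tuple(left: Option<i64>, right: Option<i64>) -> i64 {
    match (left, right) {
        (Some(l), Some(r)) => l + r,
        _ => 0, // invalid child => invalid node
    }
}
```

Both functions return the combined value only when both inputs are `Some`, and `0` otherwise. The tuple form avoids the intermediate `Option<(i64, i64)>`.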

nicklan · 2025-05-06T23:45:22Z

ffi/src/expressions/kernel.rs

+            let child_list_id = call!(visitor, make_field_list, 2);
+            visit_expression_impl(visitor, left, child_list_id);
+            visit_expression_impl(visitor, right, child_list_id);
+            let op = match op {


nit: this was in the old code, but I think this would be more clear as:

Suggested change

-            let op = match op {
+            let visit_fn = match op {

and then:

visit_fn(visitor.data, sibling_list_id, child_list_id);

below

Fixed all three locations I found.
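
The naming point can be shown with a toy dispatcher. Everything here is hypothetical (`BinaryOp`, `visit_plus`, `visit_minus`, `dispatch`), and the signatures are plain integer functions rather than the real FFI visitor callbacks.

```rust
#[derive(Clone, Copy)]
enum BinaryOp {
    Plus,
    Minus,
}

fn visit_plus(a: i64, b: i64) -> i64 {
    a + b
}

fn visit_minus(a: i64, b: i64) -> i64 {
    a - b
}

fn dispatch(op: BinaryOp, a: i64, b: i64) -> i64 {
    // The match yields a function, not an operator, so the binding is named
    // for what it holds instead of shadowing `op`.
    let visit_fn: fn(i64, i64) -> i64 = match op {
        BinaryOp::Plus => visit_plus,
        BinaryOp::Minus => visit_minus,
    };
    visit_fn(a, b)
}
```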

nicklan · 2025-05-07T00:06:30Z

kernel/src/engine/arrow_expression/evaluate_expression.rs

+            // Grr -- there's no way to cast an `Arc<dyn Array>` back to its native type, so we
+            // can't use `Arc::into_inner` here and must unconditionally clone instead.
+            let arr = evaluate_expression(expr, batch, Some(&DataType::BOOLEAN))?;
+            Ok(downcast_to_bool(&arr)?.clone())


Gah! I thought I had it with:

Suggested change

-            Ok(downcast_to_bool(&arr)?.clone())
+            arr.into_data().try_into().map_err(|_| Error::generic("expected boolean array"))

We can do arr.into_data() because we have a blanket impl Array for Arc<dyn Array>.

BUT, looking at the code for an Arc this just calls to_data, so it'll clone :(

I do think this is a little more clean though so maybe we want it anyway? And then we can also remove downcast_to_bool.

Problem is, TryFrom<ArrayData> for BooleanArray is just the (infallible) blanket impl based on From<ArrayData> for BooleanArray, which panics on type mismatch. I couldn't find any fallible version of that code?

Meanwhile, I noticed clone calls in that code, even tho the data array is supposedly owned. Sure enough, it turns out that a Buffer:

can be sliced and cloned without copying the underlying data

I can fold the downcast_to_bool method into its only remaining call site, tho.

Problem is, TryFrom<ArrayData> for BooleanArray is just the (infallible) blanket impl based on From<ArrayData> for BooleanArray, which panics on type mismatch. I couldn't find any fallible version of that code?

Ahh yeah. We could check the conditions of the asserts though, i.e. that the DataType of the ArrayData is Boolean and that there's only one buffer. That would give us confidence that it won't panic (although arrow updates could of course change the conditions).

But anyway it's mostly moot because we end up cloning anyway, and we can't get the inner out of the Arc since it's a trait.

nicklan · 2025-05-07T00:13:16Z

kernel/src/kernel_predicates/mod.rs

-
-    /// Dispatches an expression to the specific implementation for each expression variant.
-    ///
-    /// NOTE: [`Expression::Struct`] is not supported and always evaluates to `None`.


Nice that we just defined this corner out of existence :)

Definitely a nice side effect.

nicklan

lgtm! Thanks for tackling this monster!

scovich added the merge hold (Don't allow the PR to merge) and breaking-change (Change that require a major version bump) labels Mar 27, 2025

github-actions bot assigned scovich Mar 27, 2025

scovich force-pushed the expressions-and-predicates branch 2 times, most recently from 502aaf1 to 6e1f2fd Compare April 14, 2025 20:24

scovich force-pushed the expressions-and-predicates branch 4 times, most recently from 351f6b2 to 681e4c3 Compare April 23, 2025 15:37

scovich mentioned this pull request Apr 23, 2025

feat!: Add support for opaque engine expressions #686

Merged

scovich force-pushed the expressions-and-predicates branch from 681e4c3 to 1f64328 Compare April 24, 2025 17:46

scovich changed the title ~~[WIP DO NOT MERGE] Split out predicates as different from expressions~~ feat!: Split out predicates as different from expressions Apr 24, 2025

scovich removed the merge hold (Don't allow the PR to merge) label Apr 24, 2025

scovich requested review from OussamaSaoudi, zachschuermann and nicklan April 24, 2025 17:47

scovich marked this pull request as ready for review April 24, 2025 17:47

zachschuermann approved these changes May 2, 2025

scovich added 11 commits May 6, 2025 13:10

Misc code cleanups

96b1005

Add Predicate as type alias of Expression

17eea0a

Rename Unary/Binary/Junction Expression to Unary/Binary/Junction Predicate

e747eac

Adjust ExpressionTransform method names

eff0623

Adjust KernelPredicateEvaluator method names

50d3c7c

Define and use arrow evaluate_predicate

1a0372a

Adjust names of FFI predicate visitor functions

f657d09

Rename predicate-related structs and enum variants

8ef0b4c

Update FFI test to distinguish predicates from expressions

f4b3f63

Split out expression and predicate evaluators

887492f

Define (boolean-valued) Predicate type distinct from normal Expression type

0ed7a29

scovich added 4 commits May 6, 2025 13:26

fix - define output columns name of predicate evaluator

ac67537

fix: remove double not

d579135

fix: FFI expression visitor test

f050265

review feedback

d7bddd5

scovich force-pushed the expressions-and-predicates branch from 1f64328 to d7bddd5 Compare May 6, 2025 20:52

nicklan reviewed May 7, 2025

scovich added 3 commits May 7, 2025 05:45

review feedback

af09707

Merge remote-tracking branch 'oss/main' into expressions-and-predicates

2952bb8

fix logical merge conflict

886b87e

scovich requested a review from nicklan May 7, 2025 12:49

nicklan approved these changes May 7, 2025

scovich merged commit 2315d00 into delta-io:main May 7, 2025
19 of 21 checks passed
