Skip to content

feat!: Split out predicates as different from expressions #775

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 18 commits into from
May 7, 2025

Conversation

scovich
Copy link
Collaborator

@scovich scovich commented Mar 27, 2025

What changes are proposed in this pull request?

Teach kernel to treat "predicates" (boolean-valued invertible expressions) as different from normal expressions (which are generally neither boolean-valued nor invertible). Accomplished by splitting out a new Predicate type from today's Expression type, and then adjusting all the various transforms, visitors, and evaluation frameworks accordingly.

This change is highly invasive, but very important because kernel's data skipping cares very much about predicates (= expressions that return boolean values and are invertible), which are quite different from the ordinary expressions used for transforming data. We see that tension already in the fact that some of our binary operators are really (invertible) binary predicates (=, DISTINCT, etc.), while others are not (e.g. +, -). Further, a key piece of our predicate evaluation is the ability to push NOT through an invertible expression. Pushing down NOT is more than just a performance optimization -- it is required for correct stats-based data skipping because NOT skipping_predicate(<expr>) is NOT equivalent to skipping_predicate(NOT <expr>).

The work has been carefully split into a number of commits, each focusing on a different change. Most of the changes are preparatory work intended to gradually increase the amount of predicate awareness in the code, while reducing the churn of the final diff.

Closes #765

This PR affects the following public APIs

Everything related to expressions.

How was this change tested?

Added new unit tests and updated existing ones

@scovich scovich added merge hold Don't allow the PR to merge breaking-change Change that require a major version bump labels Mar 27, 2025
@scovich scovich force-pushed the expressions-and-predicates branch 2 times, most recently from 502aaf1 to 6e1f2fd Compare April 14, 2025 20:24
@scovich scovich force-pushed the expressions-and-predicates branch 4 times, most recently from 351f6b2 to 681e4c3 Compare April 23, 2025 15:37
@scovich scovich force-pushed the expressions-and-predicates branch from 681e4c3 to 1f64328 Compare April 24, 2025 17:46
@scovich scovich changed the title [WIP DO NOT MERGE] Split out predicates as different from expressions feat!: Split out predicates as different from expressions Apr 24, 2025
@scovich scovich removed the merge hold Don't allow the PR to merge label Apr 24, 2025
@scovich scovich marked this pull request as ready for review April 24, 2025 17:47
Copy link

codecov bot commented Apr 24, 2025

Codecov Report

Attention: Patch coverage is 70.84399% with 342 lines in your changes missing coverage. Please review.

Project coverage is 84.68%. Comparing base (a93a85a) to head (886b87e).
Report is 1 commits behind head on main.

Files with missing lines Patch % Lines
ffi/src/expressions/engine.rs 0.00% 87 Missing ⚠️
ffi/src/expressions/kernel.rs 0.00% 72 Missing ⚠️
ffi/src/test_ffi.rs 0.00% 52 Missing ⚠️
kernel/src/kernel_predicates/tests.rs 65.62% 4 Missing and 29 partials ⚠️
kernel/src/expressions/mod.rs 86.25% 22 Missing ⚠️
kernel/src/expressions/transforms.rs 88.41% 13 Missing and 6 partials ⚠️
...src/engine/arrow_expression/evaluate_expression.rs 68.42% 11 Missing and 7 partials ⚠️
kernel/src/kernel_predicates/mod.rs 90.90% 14 Missing ⚠️
kernel/src/scan/data_skipping/tests.rs 90.14% 5 Missing and 2 partials ⚠️
kernel/src/engine/arrow_expression/mod.rs 81.48% 0 Missing and 5 partials ⚠️
... and 6 more
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #775      +/-   ##
==========================================
- Coverage   85.28%   84.68%   -0.60%     
==========================================
  Files          88       88              
  Lines       22260    22610     +350     
  Branches    22260    22610     +350     
==========================================
+ Hits        18985    19148     +163     
- Misses       2305     2484     +179     
- Partials      970      978       +8     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

Copy link
Collaborator

@zachschuermann zachschuermann left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

WHEW i've made it :) I left a number of comments/questions but no major concerns, will go ahead and stamp it. really impressive PR thanks @scovich!!

use Cow::*;
let u = match self.transform(&u.expr)? {
let u = match self.transform_expr(&u.expr)? {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why is this not doing transform_pred? if u is pred? (and below)

Copy link
Collaborator Author

@scovich scovich May 6, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unary predicates take expressions as input and produce boolean as output. And here we're recursing into that input expression.

NOT is a special case because it takes a predicate as input (which is why it gets a separate variant instead of being a type of unary predicate)

/// # Safety
/// Engine is responsible for passing a valid SharedPredicate
#[no_mangle]
pub unsafe extern "C" fn free_kernel_predicate(data: Handle<SharedPredicate>) {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hah funny just realized we had 'free_kernel_predicate' before (just called all those exprs predicates)

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, this PR definitely had to tighten up naming in a few places.

}

impl PredicateEvaluator for DefaultPredicateEvaluator {
fn evaluate(&self, batch: &dyn EngineData) -> DeltaResult<Box<dyn EngineData>> {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

recognizing this is same as expression evaluator's evaluate (just with evaluate_predicate) - should we unify (or maybe just track some unification whenever we do that TODO marked below?)

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a similar situation to the parquet vs. json handlers. Just because the methods are structurally similar doesn't necessarily mean they're logically related in a way that justifies tying them together.

At least with json vs. parquet the "shape" of the output is identical, and we have situations like parquet vs. json checkpoint manifests. I'm having a trouble imagining a case where we want to evaluate an expression or a predicate in a generic way, with downstream code unconditionally processing the result the same way?

/// [`Schema`]: crate::schema::StructType
fn new_predicate_evaluator(
&self,
schema: SchemaRef,
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: maybe just explicitly call it input_schema?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

new_expression_evaluator has the same naming. Changed both.

@@ -104,6 +121,42 @@ pub extern "C" fn visit_predicate_and(
wrap_predicate(state, result)
}

#[no_mangle]
pub extern "C" fn visit_expression_plus(
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

oh and these just didn't exist before?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yup.

match predicate {
BooleanExpression(expr) => {
// Grr -- there's no way to cast an `Arc<dyn Array>` back to its native type, so we
// can't use `Arc::into_inner` here and must unconditionally clone instead.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

but this is still just Arc clone right?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, that's the annoying part. We're calling clone on a &BooleanArray, whose values and null_buffer are not Arc. And unlike our AsAny trait, Array::as_any works with &self, not Arc<Self>, so we can't downcast to Arc<BooleanArray>. Even if we could, there's no reliable way to take ownership of an Arc's inner, tho Arc::try_unwrap would probably work fine in practice (with clone as a fallback in case it somehow was actually a shared reference).

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fortunately, it's very rare to treat a predicate as an expression -- even if technically possible. The most likely reason would be something like (a < b) IS NULL, since IS NULL takes an expression as input.

So hopefully this won't cause performance problems.

/// of literals. It is up to the predicate evaluator to validate the
/// predicate against a schema and add appropriate casts as required.
#[derive(Debug, Clone, PartialEq)]
pub enum Predicate {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder if it would be useful to split this module into expressions/predicates modules?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(there's a good amount of overlap so could also see the case for keeping them together)

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, there's just enough overlap, and a mutually recursive dependency, that I don't think we gain much by splitting the mod. Plus, doing so would make a much uglier diff -- prefer to do it in a follow-up PR if possible, to preserve a semblance of sanity for reviewers on this PR.

}

/// Create a new predicate `self == other`
pub fn eq(self, other: impl Into<Self>) -> Self {
Self::binary(BinaryPredicateOp::Equal, self, other)
pub fn eq(a: impl Into<Expression>, b: impl Into<Expression>) -> Self {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why the departure from self like what's used in Expression above?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Because this is Predicate::eq and it takes expressions not predicates as input.

Same reason Expression::eq doesn't return Self (because it needs to return Predicate)

I'm on the fence whether we even want/need both forms. Expression::eq allows infix notation such as expr1.eq(expr2), while Predicate::eq allows things like Predicate::eq(column_name!("foo"), expr2) which are not possible with infix notation (because the first arg is Into<Expression>, not Expression.

A quick code search confirms that there are almost no callers of the Predicate::eq form tho -- except the corresponding Expression::eq that proxies it -- but things like Expression::eq(a, b) might be weird, and the PR diff would also get a lot uglier because of code movement etc.

So yeah... not sure.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You could use into() for the infix case right?, like expr2.eq(column_name!("foo").into()).

I'm generally in favor of having one way to do things. I feel like the Predicate::eq way is a bit more clear, but if we're not using it much maybe that suggests infix is preferred. Regardless I'd say let's not take it up here and follow-up after this merges.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I did a code audit:

  • Pred::not is already an associated function (not a method), and anyway no analogue exists for Expr
  • Expr::is_null has only a three call sites, all in test code, while Pred::is_null has more call sites including prod code
  • Expr::is_not_null has eight call sites, half in prod code, while Pred::is_not_null is only used by test code
  • All binary comparisons (eq, lt, distinct, etc) are only used in test code
    • Overall, infix Expr is almost perfectly tied with Pred usage
    • Most operators see similar usage in both forms, 4-8 call sites each
    • lt and gt are a lot more popular, and have exactly opposite Expr/Pred usage patterns: 14/30 for lt and 30/15 for gt.

So yeah, overall we have an almost perfect split today between Expr (method) vs. Pred (associated function) forms. We probably should just choose one and fix up the other's call sites.

@scovich scovich force-pushed the expressions-and-predicates branch from 1f64328 to d7bddd5 Compare May 6, 2025 20:52
Copy link
Collaborator

@nicklan nicklan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

phew, sorry for the delay. finished first pass. Looks mostly great! Had a few comments, but nothing major.

}

/// Create a new predicate `self == other`
pub fn eq(self, other: impl Into<Self>) -> Self {
Self::binary(BinaryPredicateOp::Equal, self, other)
pub fn eq(a: impl Into<Expression>, b: impl Into<Expression>) -> Self {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You could use into() for the infix case right?, like expr2.eq(column_name!("foo").into()).

I'm generally in favor of having one way to do things. I feel like the Predicate::eq way is a bit more clear, but if we're not using it much maybe that suggests infix is preferred. Regardless I'd say let's not take it up here and follow-up after this merges.

Comment on lines 44 to 46
state
.inflight_ids
.insert(ExpressionOrPredicate::Expression(expr.into()))
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could we save a line with:

Suggested change
state
.inflight_ids
.insert(ExpressionOrPredicate::Expression(expr.into()))
use ExpressionOrPredicate::*;
state.inflight_ids.insert(Expression(expr.into()))

same below

Copy link
Collaborator Author

@scovich scovich May 7, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The method is using two different things called Expression at the same time. But I saved a line by pulling out an intermediate value instead.

Comment on lines 97 to 100
match left.zip(right) {
Some((left, right)) => wrap_predicate(state, Predicate::binary(op, left, right)),
None => 0, // invalid child => invalid node
}
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why zip? Could just do:

Suggested change
match left.zip(right) {
Some((left, right)) => wrap_predicate(state, Predicate::binary(op, left, right)),
None => 0, // invalid child => invalid node
}
match (left, right) {
(Some(left), Some(right)) => wrap_predicate(state, Predicate::binary(op, left, right)),
_ => 0, // invalid child => invalid node
}

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fair. Updated both versions of this method.

let child_list_id = call!(visitor, make_field_list, 2);
visit_expression_impl(visitor, left, child_list_id);
visit_expression_impl(visitor, right, child_list_id);
let op = match op {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: this was in the old code, but I think this would be more clear as:

Suggested change
let op = match op {
let visit_fn = match op {

and then:

            visit_fn(visitor.data, sibling_list_id, child_list_id);

below

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed all three locations I found.

// Grr -- there's no way to cast an `Arc<dyn Array>` back to its native type, so we
// can't use `Arc::into_inner` here and must unconditionally clone instead.
let arr = evaluate_expression(expr, batch, Some(&DataType::BOOLEAN))?;
Ok(downcast_to_bool(&arr)?.clone())
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Gah! I thought I had it with:

Suggested change
Ok(downcast_to_bool(&arr)?.clone())
arr.into_data().try_into().map_err(|_| Error::generic("expected boolean array"))

We can do arr.into_data() because have a blanket impl Array for Arc<dyn Array> here

BUT, looking at the code for an Arc this just calls to_data, so it'll clone :(

I do think this is a little more clean though so maybe we want it anyway? And then we can also remove downcast_to_bool.

Copy link
Collaborator Author

@scovich scovich May 7, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Problem is, TryFrom<ArrayData> for BooleanArray is just the (infallible) blanket impl based on From<ArrayData> for BooleanArray, which panics on type mismatch. I couldn't find any fallible version of that code?

Meanwhile, I noticed clone calls in that code, even tho the data array is supposedly owned. Sure enough, it turns out that a Buffer:

can be sliced and cloned without copying the underlying data

I can fold the downcast_to_bool method into its only remaining call site, tho.

Copy link
Collaborator

@nicklan nicklan May 7, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Problem is, TryFrom for BooleanArray is just the (infallible) blanket impl based on From for BooleanArray, which panics on type mismatch. I couldn't find any fallible version of that code?

Ahh yeah. We could check the conditions of the asserts though, i.e. that the DataType of the ArrayData is Boolean and that there's only one buffer. That would give us confidence that it won't panic (although arrow updates could of course change the conditions).

But anyway it's mostly moot because we end up cloning anyway, and we can't get the inner out of the Arc since it's a trait.


/// Dispatches an expression to the specific implementation for each expression variant.
///
/// NOTE: [`Expression::Struct`] is not supported and always evaluates to `None`.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice that we just defined this corner out of existence :)

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Definitely a nice side effect.

@scovich scovich requested a review from nicklan May 7, 2025 12:49
Copy link
Collaborator

@nicklan nicklan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm! Thanks for tackling this monster!

@scovich scovich merged commit 2315d00 into delta-io:main May 7, 2025
19 of 21 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
breaking-change Change that require a major version bump
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Separate out predicates from expressions
3 participants