-
Notifications
You must be signed in to change notification settings - Fork 90
feat!: Split out predicates as different from expressions #775
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
502aaf1
to
6e1f2fd
Compare
351f6b2
to
681e4c3
Compare
681e4c3
to
1f64328
Compare
Codecov ReportAttention: Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## main #775 +/- ##
==========================================
- Coverage 85.28% 84.68% -0.60%
==========================================
Files 88 88
Lines 22260 22610 +350
Branches 22260 22610 +350
==========================================
+ Hits 18985 19148 +163
- Misses 2305 2484 +179
- Partials 970 978 +8 ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
WHEW i've made it :) I left a number of comments/questions but no major concerns, will go ahead and stamp it. really impressive PR thanks @scovich!!
use Cow::*; | ||
let u = match self.transform(&u.expr)? { | ||
let u = match self.transform_expr(&u.expr)? { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why is this not doing transform_pred
? if u
is pred? (and below)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Unary predicates take expressions as input and produce boolean as output. And here we're recursing into that input expression.
NOT is a special case because it takes a predicate as input (which is why it gets a separate variant instead of being a type of unary predicate)
/// # Safety | ||
/// Engine is responsible for passing a valid SharedPredicate | ||
#[no_mangle] | ||
pub unsafe extern "C" fn free_kernel_predicate(data: Handle<SharedPredicate>) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
hah funny just realized we had 'free_kernel_predicate' before (just called all those exprs predicates)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, this PR definitely had to tighten up naming in a few places.
} | ||
|
||
impl PredicateEvaluator for DefaultPredicateEvaluator { | ||
fn evaluate(&self, batch: &dyn EngineData) -> DeltaResult<Box<dyn EngineData>> { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
recognizing this is same as expression evaluator's evaluate
(just with evaluate_predicate
) - should we unify (or maybe just track some unification whenever we do that TODO marked below?)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a similar situation to the parquet vs. json handlers. Just because the methods are structurally similar doesn't necessarily mean they're logically related in a way that justifies tying them together.
At least with json vs. parquet the "shape" of the output is identical, and we have situations like parquet vs. json checkpoint manifests. I'm having a trouble imagining a case where we want to evaluate an expression or a predicate in a generic way, with downstream code unconditionally processing the result the same way?
kernel/src/lib.rs
Outdated
/// [`Schema`]: crate::schema::StructType | ||
fn new_predicate_evaluator( | ||
&self, | ||
schema: SchemaRef, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: maybe just explicitly call it input_schema
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
new_expression_evaluator
has the same naming. Changed both.
@@ -104,6 +121,42 @@ pub extern "C" fn visit_predicate_and( | |||
wrap_predicate(state, result) | |||
} | |||
|
|||
#[no_mangle] | |||
pub extern "C" fn visit_expression_plus( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
oh and these just didn't exist before?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yup.
match predicate { | ||
BooleanExpression(expr) => { | ||
// Grr -- there's no way to cast an `Arc<dyn Array>` back to its native type, so we | ||
// can't use `Arc::into_inner` here and must unconditionally clone instead. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
but this is still just Arc
clone right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No, that's the annoying part. We're calling clone
on a &BooleanArray
, whose values
and null_buffer
are not Arc. And unlike our AsAny
trait, Array::as_any
works with &self
, not Arc<Self>
, so we can't downcast to Arc<BooleanArray>
. Even if we could, there's no reliable way to take ownership of an Arc's inner, tho Arc::try_unwrap would probably work fine in practice (with clone
as a fallback in case it somehow was actually a shared reference).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fortunately, it's very rare to treat a predicate as an expression -- even if technically possible. The most likely reason would be something like (a < b) IS NULL
, since IS NULL takes an expression as input.
So hopefully this won't cause performance problems.
/// of literals. It is up to the predicate evaluator to validate the | ||
/// predicate against a schema and add appropriate casts as required. | ||
#[derive(Debug, Clone, PartialEq)] | ||
pub enum Predicate { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wonder if it would be useful to split this module into expressions/predicates modules?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(there's a good amount of overlap so could also see the case for keeping them together)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, there's just enough overlap, and a mutually recursive dependency, that I don't think we gain much by splitting the mod. Plus, doing so would make a much uglier diff -- prefer to do it in a follow-up PR if possible, to preserve a semblance of sanity for reviewers on this PR.
} | ||
|
||
/// Create a new predicate `self == other` | ||
pub fn eq(self, other: impl Into<Self>) -> Self { | ||
Self::binary(BinaryPredicateOp::Equal, self, other) | ||
pub fn eq(a: impl Into<Expression>, b: impl Into<Expression>) -> Self { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why the departure from self
like what's used in Expression
above?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Because this is Predicate::eq
and it takes expressions not predicates as input.
Same reason Expression::eq
doesn't return Self
(because it needs to return Predicate
)
I'm on the fence whether we even want/need both forms. Expression::eq
allows infix notation such as expr1.eq(expr2)
, while Predicate::eq
allows things like Predicate::eq(column_name!("foo"), expr2)
which are not possible with infix notation (because the first arg is Into<Expression>
, not Expression
.
A quick code search confirms that there are almost no callers of the Predicate::eq
form tho -- except the corresponding Expression::eq
that proxies it -- but things like Expression::eq(a, b)
might be weird, and the PR diff would also get a lot uglier because of code movement etc.
So yeah... not sure.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You could use into()
for the infix case right?, like expr2.eq(column_name!("foo").into())
.
I'm generally in favor of having one way to do things. I feel like the Predicate::eq
way is a bit more clear, but if we're not using it much maybe that suggests infix is preferred. Regardless I'd say let's not take it up here and follow-up after this merges.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I did a code audit:
Pred::not
is already an associated function (not a method), and anyway no analogue exists forExpr
Expr::is_null
has only a three call sites, all in test code, whilePred::is_null
has more call sites including prod codeExpr::is_not_null
has eight call sites, half in prod code, whilePred::is_not_null
is only used by test code- All binary comparisons (
eq
,lt
,distinct
, etc) are only used in test code- Overall, infix
Expr
is almost perfectly tied withPred
usage - Most operators see similar usage in both forms, 4-8 call sites each
lt
andgt
are a lot more popular, and have exactly opposite Expr/Pred usage patterns: 14/30 forlt
and 30/15 forgt
.
- Overall, infix
So yeah, overall we have an almost perfect split today between Expr (method) vs. Pred (associated function) forms. We probably should just choose one and fix up the other's call sites.
1f64328
to
d7bddd5
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
phew, sorry for the delay. finished first pass. Looks mostly great! Had a few comments, but nothing major.
} | ||
|
||
/// Create a new predicate `self == other` | ||
pub fn eq(self, other: impl Into<Self>) -> Self { | ||
Self::binary(BinaryPredicateOp::Equal, self, other) | ||
pub fn eq(a: impl Into<Expression>, b: impl Into<Expression>) -> Self { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You could use into()
for the infix case right?, like expr2.eq(column_name!("foo").into())
.
I'm generally in favor of having one way to do things. I feel like the Predicate::eq
way is a bit more clear, but if we're not using it much maybe that suggests infix is preferred. Regardless I'd say let's not take it up here and follow-up after this merges.
ffi/src/expressions/engine.rs
Outdated
state | ||
.inflight_ids | ||
.insert(ExpressionOrPredicate::Expression(expr.into())) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could we save a line with:
state | |
.inflight_ids | |
.insert(ExpressionOrPredicate::Expression(expr.into())) | |
use ExpressionOrPredicate::*; | |
state.inflight_ids.insert(Expression(expr.into())) |
same below
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The method is using two different things called Expression
at the same time. But I saved a line by pulling out an intermediate value instead.
ffi/src/expressions/engine.rs
Outdated
match left.zip(right) { | ||
Some((left, right)) => wrap_predicate(state, Predicate::binary(op, left, right)), | ||
None => 0, // invalid child => invalid node | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why zip
? Could just do:
match left.zip(right) { | |
Some((left, right)) => wrap_predicate(state, Predicate::binary(op, left, right)), | |
None => 0, // invalid child => invalid node | |
} | |
match (left, right) { | |
(Some(left), Some(right)) => wrap_predicate(state, Predicate::binary(op, left, right)), | |
_ => 0, // invalid child => invalid node | |
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fair. Updated both versions of this method.
ffi/src/expressions/kernel.rs
Outdated
let child_list_id = call!(visitor, make_field_list, 2); | ||
visit_expression_impl(visitor, left, child_list_id); | ||
visit_expression_impl(visitor, right, child_list_id); | ||
let op = match op { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: this was in the old code, but I think this would be more clear as:
let op = match op { | |
let visit_fn = match op { |
and then:
visit_fn(visitor.data, sibling_list_id, child_list_id);
below
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed all three locations I found.
// Grr -- there's no way to cast an `Arc<dyn Array>` back to its native type, so we | ||
// can't use `Arc::into_inner` here and must unconditionally clone instead. | ||
let arr = evaluate_expression(expr, batch, Some(&DataType::BOOLEAN))?; | ||
Ok(downcast_to_bool(&arr)?.clone()) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Gah! I thought I had it with:
Ok(downcast_to_bool(&arr)?.clone()) | |
arr.into_data().try_into().map_err(|_| Error::generic("expected boolean array")) |
We can do arr.into_data()
because have a blanket impl Array for Arc<dyn Array>
here
BUT, looking at the code for an Arc
this just calls to_data
, so it'll clone :(
I do think this is a little more clean though so maybe we want it anyway? And then we can also remove downcast_to_bool
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Problem is, TryFrom<ArrayData> for BooleanArray
is just the (infallible) blanket impl based on From<ArrayData> for BooleanArray, which panics on type mismatch. I couldn't find any fallible version of that code?
Meanwhile, I noticed clone
calls in that code, even tho the data array is supposedly owned. Sure enough, it turns out that a Buffer:
can be sliced and cloned without copying the underlying data
I can fold the downcast_to_bool
method into its only remaining call site, tho.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Problem is, TryFrom for BooleanArray is just the (infallible) blanket impl based on From for BooleanArray, which panics on type mismatch. I couldn't find any fallible version of that code?
Ahh yeah. We could check the conditions of the asserts though, i.e. that the DataType
of the ArrayData
is Boolean
and that there's only one buffer. That would give us confidence that it won't panic (although arrow updates could of course change the conditions).
But anyway it's mostly moot because we end up cloning anyway, and we can't get the inner out of the Arc since it's a trait.
|
||
/// Dispatches an expression to the specific implementation for each expression variant. | ||
/// | ||
/// NOTE: [`Expression::Struct`] is not supported and always evaluates to `None`. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice that we just defined this corner out of existence :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Definitely a nice side effect.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
lgtm! Thanks for tackling this monster!
What changes are proposed in this pull request?
Teach kernel to treat "predicates" (boolean-valued invertible expressions) as different from normal expressions (which are generally neither boolean-valued nor invertible). Accomplished by splitting out a new
Predicate
type from today'sExpression
type, and then adjusting all the various transforms, visitors, and evaluation frameworks accordingly.This change is highly invasive, but very important because kernel's data skipping cares very much about predicates (= expressions that return boolean values and are invertible), which are quite different from the ordinary expressions used for transforming data. We see that tension already in the fact that some of our binary operators are really (invertible) binary predicates (
=
,DISTINCT
, etc.), while others are not (e.g.+
,-
). Further, a key piece of our predicate evaluation is the ability to push NOT through an invertible expression. Pushing down NOT is more than just a performance optimization -- it is required for correct stats-based data skipping becauseNOT skipping_predicate(<expr>)
is NOT equivalent toskipping_predicate(NOT <expr>)
.The work has been carefully split into a number of commits, each focusing on a different change. Most of the changes are preparatory work intended to gradually increase the amount of predicate awareness in the code, while reducing the churn of the final diff.
Closes #765
This PR affects the following public APIs
Everything related to expressions.
How was this change tested?
Added new unit tests and updated existing ones