-
Notifications
You must be signed in to change notification settings - Fork 1.9k
Support API for "pre-image" for pruning predicate evaluation #19722
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
alamb
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you @sdf-jkl -- reviewed this PR carefully this morning and it looks great (thank you for splitting up the work), I found it well commented and well designed and a joy to read
I do think we need to add unit tests tests to for this feature, which I know you have lined up in #18789 but I think writing the unit tests in for the rewrite will make it easiest to validate.
I also have some questions about the rewrite for = (aka the boundary conditions)
| // NOTE: we only consider immutable UDFs with literal RHS values | ||
| Expr::BinaryExpr(BinaryExpr { left, op, right }) => { | ||
| use datafusion_expr::Operator::*; | ||
| let is_preimage_op = matches!( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it might be nice (as a follow on PR) to mention this list in the docs for preimage -- e.g. that it only applies to predicates =, !=, ...
f308662 to
5ffb704
Compare
Add tests for additional cases
|
Wow, this is much cleaner, thanks! |
|
I think this PR needs two more things:
(I am trying to keep my review context under control, so trying to focus on getting stuff through before starting more) |
|
Both done. Re-requested a review. |
alamb
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you very much @sdf-jkl
This looks good to me. I would like to change the signature to use Interval rather than Box<Interval> and there are a few other small comments, but we can also do this as a follow on PR (or I can push some commits to this PR)
Thank you for hanging with this one
FYI @colinmarc -- once we get this in, I think @sdf-jkl plans to implement preimage for date_part. Perhaps you are interested in something similar for date_trunc
Also, FYI @jonahgao and @xudong963 / @zhuqi-lucas in case you are interested in this PR (the primary usecase is improving the handling of date/timestamp predicates)
datafusion/optimizer/src/simplify_expressions/expr_simplifier.rs
Outdated
Show resolved
Hide resolved
| let Expr::ScalarFunction(ScalarFunction { func, args }) = left_expr else { | ||
| return Ok((None, None)); | ||
| }; | ||
| if !is_literal_or_literal_cast(right_expr) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this is still an open question, but it is ok to handle as a follow on PR (aka widen the expressions)
| if !is_literal_or_literal_cast(right_expr) { | ||
| return Ok(PreimageResult::None); | ||
| } | ||
| if func.signature().volatility != Volatility::Immutable { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also for a follow on PR, I think it would be safe to rewrite stable functions (whose values don't change during the statement)
| Operator::LtEq => expr.lt(upper), | ||
| // <expr> = x ==> (<expr> >= lower) and (<expr> < upper) | ||
| // | ||
| // <expr> is not distinct from x ==> (<expr> is NULL and x is NULL) or ((<expr> >= lower) and (<expr> < upper)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| // <expr> is not distinct from x ==> (<expr> is NULL and x is NULL) or ((<expr> >= lower) and (<expr> < upper)) | |
| // <expr> is not distinct from x ==> (<expr> is NULL) or ((<expr> >= lower) and (<expr> < upper)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am not sure this IS NOT DISTICNT rewrite is correctas it is rewritten to just the range predicate. If expr is NULL and the literal is non-NULL, the original expression is FALSE, but the rewrite evaluates to NULL (x >= lower AND x < upper), which is not equivalent and violates the “same nullability” expectation for simplified expressions.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@alamb In a WHERE clause, both FALSE and NULL might behave similarly (both filter out the row), so here may be safety?
If we want to keep false:
Operator::IsNotDistinctFrom => {
// expr IS NOT DISTINCT FROM x => must return FALSE if expr is NULL
// because we know x is NOT NULL.
expr.clone().is_not_null().and(
and(expr.clone().gt_eq(lower), expr.lt(upper))
)
}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@xudong963 this solves the issue. Thanks!
|
I'll have a look at the PR today |
| /// [preimage]: https://en.wikipedia.org/wiki/Image_(mathematics)#Inverse_image | ||
| /// | ||
| pub(super) fn rewrite_with_preimage( | ||
| _info: &SimplifyContext, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we need this arg?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@alamb mentioned that we should keep it in #18789 (comment), but it was a while ago.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it is important to pass to ScalarUDFImpl::preimage but probably it can be removed from this method call
Since I want to merge this PR up from main anyways before merge, I'll clean it up too
| Operator::LtEq => expr.lt(upper), | ||
| // <expr> = x ==> (<expr> >= lower) and (<expr> < upper) | ||
| // | ||
| // <expr> is not distinct from x ==> (<expr> is NULL and x is NULL) or ((<expr> >= lower) and (<expr> < upper)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@alamb In a WHERE clause, both FALSE and NULL might behave similarly (both filter out the row), so here may be safety?
If we want to keep false:
Operator::IsNotDistinctFrom => {
// expr IS NOT DISTINCT FROM x => must return FALSE if expr is NULL
// because we know x is NOT NULL.
expr.clone().is_not_null().and(
and(expr.clone().gt_eq(lower), expr.lt(upper))
)
}
Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>
xudong963
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for the work!
|
Thank you @sdf-jkl and @xudong963 -- I took a final look through this and it looks really nice. Thank you. I merged up from main and I'll plan to merge when the tests pass. Let's get this one in to keep things moving forward |
|
FYI @rcurtin |
|
Thanks again @sdf-jkl and @xudong963 |
|
Thank you for keeping me in the loop! 👍 Glad to see this. |
Which issue does this PR close?
Rationale for this change
Splitting the PR to make it more readable.
What changes are included in this PR?
Adding the udf_preimage logic without date_part implementation.
Are these changes tested?
Added unit tests for a test specific function
Are there any user-facing changes?
No