@sdf-jkl
Contributor

@sdf-jkl sdf-jkl commented Jan 9, 2026

Which issue does this PR close?

Rationale for this change

Splitting the PR to make it more readable.

What changes are included in this PR?

Adding the udf_preimage logic without date_part implementation.

Are these changes tested?

Added unit tests for a test-specific function.

Are there any user-facing changes?

No

@github-actions github-actions bot added logical-expr Logical plan and expressions optimizer Optimizer rules labels Jan 9, 2026
Contributor

@alamb alamb left a comment

Thank you @sdf-jkl -- I reviewed this PR carefully this morning and it looks great (thank you for splitting up the work). I found it well commented and well designed, and a joy to read.

I do think we need to add unit tests for this feature, which I know you have lined up in #18789, but I think writing the unit tests for the rewrite will make it easiest to validate.

I also have some questions about the rewrite for = (aka the boundary conditions)

// NOTE: we only consider immutable UDFs with literal RHS values
Expr::BinaryExpr(BinaryExpr { left, op, right }) => {
use datafusion_expr::Operator::*;
let is_preimage_op = matches!(
Contributor

it might be nice (as a follow on PR) to mention this list in the docs for preimage -- e.g. that it only applies to predicates =, !=, ...

@github-actions github-actions bot added documentation Improvements or additions to documentation sql SQL Planner development-process Related to development process of DataFusion physical-expr Changes to the physical-expr crates core Core DataFusion crate sqllogictest SQL Logic Tests (.slt) substrait Changes to the substrait crate catalog Related to the catalog crate common Related to common crate execution Related to the execution crate proto Related to proto crate functions Changes to functions implementation datasource Changes to the datasource crate physical-plan Changes to the physical-plan crate spark labels Jan 18, 2026
@sdf-jkl sdf-jkl force-pushed the smaller-preimage-pr-1 branch from f308662 to 5ffb704 Compare January 18, 2026 18:21
@github-actions github-actions bot removed the documentation Improvements or additions to documentation label Jan 18, 2026
Add tests for additional cases
@sdf-jkl
Contributor Author

sdf-jkl commented Jan 20, 2026

Wow, this is much cleaner, thanks!

@alamb
Contributor

alamb commented Jan 20, 2026

I think this PR needs two more things:

  1. Fix the NULL handling (probably by not calling preimage with null constants)
  2. Update the API to have only a single method

(I am trying to keep my review context under control, so trying to focus on getting stuff through before starting more)

@sdf-jkl sdf-jkl requested a review from alamb January 20, 2026 19:01
@sdf-jkl
Contributor Author

sdf-jkl commented Jan 20, 2026

Both done. Re-requested a review.

Contributor

@alamb alamb left a comment

Thank you very much @sdf-jkl

This looks good to me. I would like to change the signature to use Interval rather than Box<Interval> and there are a few other small comments, but we can also do this as a follow on PR (or I can push some commits to this PR)

Thank you for hanging with this one

FYI @colinmarc -- once we get this in, I think @sdf-jkl plans to implement preimage for date_part. Perhaps you are interested in something similar for date_trunc

Also, FYI @jonahgao and @xudong963 / @zhuqi-lucas in case you are interested in this PR (the primary usecase is improving the handling of date/timestamp predicates)

let Expr::ScalarFunction(ScalarFunction { func, args }) = left_expr else {
return Ok((None, None));
};
if !is_literal_or_literal_cast(right_expr) {
Contributor

I think this is still an open question, but it is ok to handle as a follow on PR (aka widen the expressions)

if !is_literal_or_literal_cast(right_expr) {
return Ok(PreimageResult::None);
}
if func.signature().volatility != Volatility::Immutable {
Contributor

Also for a follow on PR, I think it would be safe to rewrite stable functions (whose values don't change during the statement)

Operator::LtEq => expr.lt(upper),
// <expr> = x ==> (<expr> >= lower) and (<expr> < upper)
//
// <expr> is not distinct from x ==> (<expr> is NULL and x is NULL) or ((<expr> >= lower) and (<expr> < upper))
Contributor

Suggested change
- // <expr> is not distinct from x ==> (<expr> is NULL and x is NULL) or ((<expr> >= lower) and (<expr> < upper))
+ // <expr> is not distinct from x ==> (<expr> is NULL) or ((<expr> >= lower) and (<expr> < upper))
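
For readers following the comparison rewrites, here is a minimal standalone sketch that does not use DataFusion at all. It assumes a hypothetical monotonic function `bucket` (integer floor-division by 10) in place of a real UDF, and a hypothetical `preimage` helper returning the half-open range `[lower, upper)`. Only the `=` and `<=` mappings appear in the diff excerpt above; the other arms are derived here from the half-open range and may not match the PR's code line for line.

```rust
/// Hypothetical stand-in for a UDF: maps a value to its decade bucket.
/// It is monotonically non-decreasing, so each output has a contiguous preimage.
fn bucket(v: i64) -> i64 {
    v.div_euclid(10)
}

/// Hypothetical preimage helper: the half-open range [lower, upper)
/// of inputs for which `bucket(v) == x`.
fn preimage(x: i64) -> (i64, i64) {
    (x * 10, (x + 1) * 10)
}

#[derive(Clone, Copy, Debug, PartialEq)]
enum Op {
    Eq,
    NotEq,
    Lt,
    LtEq,
    Gt,
    GtEq,
}

/// Original predicate: `bucket(v) <op> x`.
fn original(op: Op, v: i64, x: i64) -> bool {
    let b = bucket(v);
    match op {
        Op::Eq => b == x,
        Op::NotEq => b != x,
        Op::Lt => b < x,
        Op::LtEq => b <= x,
        Op::Gt => b > x,
        Op::GtEq => b >= x,
    }
}

/// Rewritten predicate: compares `v` directly against the preimage bounds,
/// so the function call disappears from the predicate.
fn rewritten(op: Op, v: i64, x: i64) -> bool {
    let (lower, upper) = preimage(x);
    match op {
        Op::Eq => v >= lower && v < upper,
        Op::NotEq => v < lower || v >= upper,
        Op::Lt => v < lower,
        Op::LtEq => v < upper, // matches `Operator::LtEq => expr.lt(upper)`
        Op::Gt => v >= upper,
        Op::GtEq => v >= lower,
    }
}

/// Brute-force check that both forms agree on a small grid of non-NULL inputs.
fn rewrites_agree() -> bool {
    let ops = [Op::Eq, Op::NotEq, Op::Lt, Op::LtEq, Op::Gt, Op::GtEq];
    for &op in ops.iter() {
        for x in -3i64..=3 {
            for v in -50i64..=50 {
                if original(op, v, x) != rewritten(op, v, x) {
                    return false;
                }
            }
        }
    }
    true
}
```

This sketch only covers non-NULL inputs; NULL handling is a separate question.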

Contributor

I am not sure this IS NOT DISTINCT FROM rewrite is correct, as it is rewritten to just the range predicate. If expr is NULL and the literal is non-NULL, the original expression is FALSE, but the rewrite evaluates to NULL (expr >= lower AND expr < upper), which is not equivalent and violates the “same nullability” expectation for simplified expressions.

Member

@alamb In a WHERE clause, both FALSE and NULL might behave similarly (both filter out the row), so this may be safe here?

If we want to keep false:

Operator::IsNotDistinctFrom => {
    // expr IS NOT DISTINCT FROM x => must return FALSE if expr is NULL
    // because we know x is NOT NULL.
    expr.clone().is_not_null().and(
        and(expr.clone().gt_eq(lower), expr.lt(upper))
    )
}

Contributor Author

@xudong963 this solves the issue. Thanks!
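
To make the NULL discussion above concrete, here is a small standalone sketch of SQL three-valued logic, with `Option<bool>` modelling TRUE/FALSE/NULL. It is not DataFusion code: `bucket`, `range_only` and `guarded` are hypothetical names, and the function is assumed to return NULL on NULL input. It compares the range-only rewrite with the `IS NOT NULL`-guarded rewrite suggested above, for a non-NULL literal.

```rust
/// SQL three-valued boolean: `None` models NULL.
type Tri = Option<bool>;

/// Three-valued AND: FALSE dominates, otherwise NULL propagates.
fn and3(a: Tri, b: Tri) -> Tri {
    match (a, b) {
        (Some(false), _) | (_, Some(false)) => Some(false),
        (Some(true), Some(true)) => Some(true),
        _ => None,
    }
}

/// Hypothetical stand-in for a UDF (decade bucket); NULL in, NULL out.
fn bucket(col: Option<i64>) -> Option<i64> {
    col.map(|v| v.div_euclid(10))
}

/// Reference semantics: `bucket(col) IS NOT DISTINCT FROM x` with a
/// non-NULL literal `x`. This predicate never evaluates to NULL.
fn original(col: Option<i64>, x: i64) -> Tri {
    Some(bucket(col) == Some(x))
}

/// Range-only rewrite: `col >= lower AND col < upper`.
/// A NULL column makes both comparisons NULL, so the result is NULL.
fn range_only(col: Option<i64>, x: i64) -> Tri {
    let (lower, upper) = (x * 10, (x + 1) * 10);
    and3(col.map(|v| v >= lower), col.map(|v| v < upper))
}

/// Guarded rewrite: `col IS NOT NULL AND (col >= lower AND col < upper)`.
/// The guard turns the NULL case into FALSE, matching `original`.
fn guarded(col: Option<i64>, x: i64) -> Tri {
    and3(Some(col.is_some()), range_only(col, x))
}
```

With a NULL column and literal 3, `original` and `guarded` both give FALSE while `range_only` gives NULL. A WHERE clause filters the row either way, but the values differ wherever the expression result itself is observed.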

@xudong963
Member

I'll have a look at the PR today

/// [preimage]: https://en.wikipedia.org/wiki/Image_(mathematics)#Inverse_image
///
pub(super) fn rewrite_with_preimage(
_info: &SimplifyContext,
Member

Do we need this arg?

Contributor Author

@alamb mentioned that we should keep it in #18789 (comment), but it was a while ago.

Contributor

I think it is important to pass to ScalarUDFImpl::preimage but probably it can be removed from this method call

Since I want to merge this PR up from main anyways before merge, I'll clean it up too

Member

@xudong963 xudong963 left a comment

Thank you for the work!

@alamb
Contributor

alamb commented Jan 22, 2026

Thank you @sdf-jkl and @xudong963 -- I took a final look through this and it looks really nice. Thank you.

I merged up from main and I'll plan to merge when the tests pass. Let's get this one in to keep things moving forward

@alamb
Contributor

alamb commented Jan 22, 2026

FYI @rcurtin

@alamb
Contributor

alamb commented Jan 22, 2026

Thanks again @sdf-jkl and @xudong963

Merged via the queue into apache:main with commit c2f3d65 Jan 22, 2026
32 checks passed
@rcurtin

rcurtin commented Jan 22, 2026

Thank you for keeping me in the loop! 👍 Glad to see this.

Labels

logical-expr Logical plan and expressions optimizer Optimizer rules

Development

Successfully merging this pull request may close these issues.

Support "pre-image" for pruning predicate evaluation

4 participants