Skip to content

refactor!: Remove redundant binary predicate operations #949

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 6 commits into from
May 24, 2025

Conversation

scovich
Copy link
Collaborator

@scovich scovich commented May 14, 2025

What changes are proposed in this pull request?

The binary predicate operations NotIn, <=, >= and != are all redundant in kernel's invertible predicate evaluation framework: They are just inverted versions of In, >, <, and = (and inversion is just a flag passed to the predicate evaluator, not a physical operator). Remove them to simplify engine implementations.

Note that the convenience methods for those operators (e.g. Predicate::le) are still present, as are the FFI kernel expression visitors. They're just implemented as NOT(<predicate>) now.

NOTE: The commits in this PR form a curated stack that reviewers will likely want to examine individually:

  1. A prefactor that reduces diff churn for later commits
  2. Remove NotIn and update tests
  3. Change kernel predicate evaluator to use > instead of <=
  4. Remove <=, >= and !=; and update tests

This PR affects the following public APIs

Removed several variants from BinaryPredicateOp

Also removed the corresponding FFI engine expression visitors (because kernel would no longer call them even if they did exist).

How was this change tested?

Refactor. Existing unit tests validate the change.

Copy link

codecov bot commented May 14, 2025

Codecov Report

Attention: Patch coverage is 77.77778% with 34 lines in your changes missing coverage. Please review.

Project coverage is 85.07%. Comparing base (9da2514) to head (41f5e0c).
Report is 1 commits behind head on main.

Files with missing lines Patch % Lines
ffi/src/expressions/kernel_visitor.rs 0.00% 12 Missing ⚠️
ffi/src/test_ffi.rs 0.00% 11 Missing ⚠️
...src/engine/arrow_expression/evaluate_expression.rs 89.06% 1 Missing and 6 partials ⚠️
kernel/src/kernel_predicates/mod.rs 86.95% 3 Missing ⚠️
kernel/src/scan/data_skipping.rs 87.50% 1 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #949      +/-   ##
==========================================
+ Coverage   85.06%   85.07%   +0.01%     
==========================================
  Files          90       90              
  Lines       23090    23034      -56     
  Branches    23090    23034      -56     
==========================================
- Hits        19641    19597      -44     
+ Misses       2446     2435      -11     
+ Partials     1003     1002       -1     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@github-actions github-actions bot added the breaking-change Change that require a major version bump label May 14, 2025
@scovich scovich force-pushed the simpler-binary-predicates branch from c656483 to 8052975 Compare May 14, 2025 21:40
Copy link
Collaborator Author

@scovich scovich left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some commentary for reviewers

@@ -179,6 +179,15 @@ pub extern "C" fn visit_predicate_eq(
visit_predicate_binary(state, BinaryPredicateOp::Equal, a, b)
}

#[no_mangle]
pub extern "C" fn visit_predicate_ne(
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Somehow this was missing all along. Not sure why e.g. duckdb didn't notice?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

does this highlight a test gap on our side too for expr visitors?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Probably... but it's not so much a test gap as it is a feature gap. Perhaps duckdb just passed NOT(eq) instead?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

DuckDB delta currently ignores the NotEqual predicates, likely because it was missing at the time. I've created an issue to track this here duckdb/duckdb-delta#203.

Comment on lines +177 to +206
(Int8, Int8Type),
(Int16, Int16Type),
(Int32, Int32Type),
(Int64, Int64Type),
(UInt8, UInt8Type),
(UInt16, UInt16Type),
(UInt32, UInt32Type),
(UInt64, UInt64Type),
(Float16, Float16Type),
(Float32, Float32Type),
(Float64, Float64Type),
(Timestamp(TimeUnit::Second, _), TimestampSecondType),
(Timestamp(TimeUnit::Millisecond, _), TimestampMillisecondType),
(Timestamp(TimeUnit::Microsecond, _), TimestampMicrosecondType),
(Timestamp(TimeUnit::Nanosecond, _), TimestampNanosecondType),
(Date32, Date32Type),
(Date64, Date64Type),
(Time32(TimeUnit::Second), Time32SecondType),
(Time32(TimeUnit::Millisecond), Time32MillisecondType),
(Time64(TimeUnit::Microsecond), Time64MicrosecondType),
(Time64(TimeUnit::Nanosecond), Time64NanosecondType),
(Duration(TimeUnit::Second), DurationSecondType),
(Duration(TimeUnit::Millisecond), DurationMillisecondType),
(Duration(TimeUnit::Microsecond), DurationMicrosecondType),
(Duration(TimeUnit::Nanosecond), DurationNanosecondType),
(Interval(IntervalUnit::DayTime), IntervalDayTimeType),
(Interval(IntervalUnit::YearMonth), IntervalYearMonthType),
(Interval(IntervalUnit::MonthDayNano), IntervalMonthDayNanoType),
(Decimal128(_, _), Decimal128Type),
(Decimal256(_, _), Decimal256Type)
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This would have been an indentation-only change, except I removed the ArrowDataType:: prefix in order to keep fmt at bay. Even so, the diff is much more readable when reviewing with whitespace changes hidden.

Comment on lines +160 to +163
// IN is different from all the others, and also quite complex, so factor it out.
//
// TODO: Factor out as a stand-alone function instead of a closure?
let eval_in = || match (left, right) {
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

By pulling out this closure, I was able to eliminate the two "special" match arms for In and NotIn.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1 to the TODO; should we just factor this out as a function?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We could, but all this in-list code will anyway change drastically when #652 lands.

Comment on lines 220 to 223
LessThan => lt,
LessThanOrEqual => lt_eq,
GreaterThan => gt,
GreaterThanOrEqual => gt_eq,
Equal => eq,
NotEqual => neq,
Distinct => distinct,
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The predicate/expression split missed this simplification

@@ -173,19 +173,6 @@ impl BinaryPredicateOp {
Distinct | In | NotIn => false, // tolerates NULL input
}
}

/// Returns `<op2>` (if any) such that `B <op2> A` is equivalent to `A <op> B`.
pub(crate) fn commute(&self) -> Option<BinaryPredicateOp> {
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It turned out this (single-callsite) function added more complexity than it saved. It had marginal value all along, and it became downright confusing once the bloat of redundant operators went away.

(Ordering::Equal, true) => BinaryPredicateOp::NotEqual,
(Ordering::Greater, false) => BinaryPredicateOp::GreaterThan,
(Ordering::Greater, true) => BinaryPredicateOp::LessThanOrEqual,
let pred_fn = match (ord, inverted) {
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Referencing the helper functions instead of raw operators is not only simpler now, but avoids having to rework the code once e.g. NotEqual disappears.

(NotEqual, 7, vec![&batch2, &batch1]),
(NotEqual, 8, vec![&batch2, &batch1]),
#[allow(clippy::type_complexity)] // otherwise it's even more complex because no `_`
let test_cases: Vec<(fn(Expr, Expr) -> _, _, _)> = vec![
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As above, grabbing helper functions instead of raw predicates is both simpler and more robust.

Comment on lines +487 to 479
Distinct | In => {
debug!("Unsupported binary operator: {left:?} {op:?} {right:?}");
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

aside: AFAIK nothing prevents us from implementing Distinct (both scalar eval and data skipping forms).
We should probably implement it for completeness at some point.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

got an issue started! #963

/// A (possibly inverted) less-than-or-equal comparison, e.g. `<col> <= <value>`
fn eval_pred_le(&self, col: &ColumnName, val: &Scalar, inverted: bool) -> Option<Self::Output>;
/// A (possibly inverted) greater-than comparison, e.g. `<col> > <value>`
fn eval_pred_gt(&self, col: &ColumnName, val: &Scalar, inverted: bool) -> Option<Self::Output>;
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Turns out Ordering::Greater had the right idea all along. It's less confusing to work with than <=.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

so everywhere we used to translate in terms of lt/le and now everything is in terms of lt/gt? agree feels nicer!

Comment on lines +755 to +752
// Given `col <= val`:
// Skip if `val` is less than _all_ values in [min, max], implies
// Skip if `val < min AND val < max` implies
// Skip if `val < min` implies
// Keep if `NOT(val < min)` implies
// Keep if `NOT(min > val)`
self.partial_cmp_min_stat(col, val, Ordering::Greater, true)
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

NOTE: No actual change, the content of the if/else blocks just swapped places.

Copy link
Collaborator

@zachschuermann zachschuermann left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!

@@ -179,6 +179,15 @@ pub extern "C" fn visit_predicate_eq(
visit_predicate_binary(state, BinaryPredicateOp::Equal, a, b)
}

#[no_mangle]
pub extern "C" fn visit_predicate_ne(
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

does this highlight a test gap on our side too for expr visitors?

Comment on lines +160 to +163
// IN is different from all the others, and also quite complex, so factor it out.
//
// TODO: Factor out as a stand-alone function instead of a closure?
let eval_in = || match (left, right) {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1 to the TODO; should we just factor this out as a function?

Comment on lines +487 to 479
Distinct | In => {
debug!("Unsupported binary operator: {left:?} {op:?} {right:?}");
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

got an issue started! #963

/// A (possibly inverted) less-than-or-equal comparison, e.g. `<col> <= <value>`
fn eval_pred_le(&self, col: &ColumnName, val: &Scalar, inverted: bool) -> Option<Self::Output>;
/// A (possibly inverted) greater-than comparison, e.g. `<col> > <value>`
fn eval_pred_gt(&self, col: &ColumnName, val: &Scalar, inverted: bool) -> Option<Self::Output>;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

so everywhere we used to translate in terms of lt/le and now everything is in terms of lt/gt? agree feels nicer!

GreaterThan,
GreaterThaneOrEqual,
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

and now one less typo!

Copy link
Collaborator

@nicklan nicklan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm, just one small suggestion.

@scovich scovich merged commit d977f64 into delta-io:main May 24, 2025
20 of 21 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
breaking-change Change that require a major version bump
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants