refactor!: Remove redundant binary predicate operations #949

scovich · 2025-05-14T21:14:37Z

What changes are proposed in this pull request?

The binary predicate operations NotIn, <=, >= and != are all redundant in kernel's invertible predicate evaluation framework: They are just inverted versions of In, >, <, and = (and inversion is just a flag passed to the predicate evaluator, not a physical operator). Remove them to simplify engine implementations.

Note that the convenience methods for those operators (e.g. Predicate::le) are still present, as are the FFI kernel expression visitors. They're just implemented as NOT(<predicate>) now.
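
To illustrate the idea (not the kernel's actual types), here is a minimal sketch of how convenience constructors for the removed operators can be expressed as NOT(<predicate>), with inversion threaded through evaluation as a flag. The `Pred` and `BinaryOp` types and the `eval` function are hypothetical, operate on plain `i64` values, and ignore NULL handling entirely.

```rust
// Toy model for illustration only; these are NOT the kernel's real types.
#[derive(Debug, Clone, Copy)]
enum BinaryOp {
    LessThan,
    GreaterThan,
    Equal,
}

#[derive(Debug, Clone)]
enum Pred {
    Binary(BinaryOp, i64, i64),
    Not(Box<Pred>),
}

impl Pred {
    fn lt(a: i64, b: i64) -> Self { Pred::Binary(BinaryOp::LessThan, a, b) }
    fn gt(a: i64, b: i64) -> Self { Pred::Binary(BinaryOp::GreaterThan, a, b) }
    fn eq(a: i64, b: i64) -> Self { Pred::Binary(BinaryOp::Equal, a, b) }

    // The "redundant" operators are just inverted forms of the ones above.
    fn le(a: i64, b: i64) -> Self { Pred::Not(Box::new(Pred::gt(a, b))) }
    fn ge(a: i64, b: i64) -> Self { Pred::Not(Box::new(Pred::lt(a, b))) }
    fn ne(a: i64, b: i64) -> Self { Pred::Not(Box::new(Pred::eq(a, b))) }
}

// Inversion is a flag carried through evaluation, not a separate operator:
// each NOT flips the flag, and the leaf comparison applies it at the end.
fn eval(pred: &Pred, inverted: bool) -> bool {
    match pred {
        Pred::Not(inner) => eval(inner, !inverted),
        Pred::Binary(op, a, b) => {
            let result = match op {
                BinaryOp::LessThan => a < b,
                BinaryOp::GreaterThan => a > b,
                BinaryOp::Equal => a == b,
            };
            result != inverted
        }
    }
}
```

In this sketch, `eval(&Pred::le(1, 1), false)` evaluates `1 > 1` with the flag flipped once, so an engine only ever has to implement the three base comparisons.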

NOTE: The commits in this PR form a curated stack that reviewers will likely want to examine individually:

A prefactor that reduces diff churn for later commits
Remove NotIn and update tests
Change kernel predicate evaluator to use > instead of <=
Remove <=, >= and !=; and update tests

This PR affects the following public APIs

Removed several variants from BinaryPredicateOp

Also removed the corresponding FFI engine expression visitors (because kernel would no longer call them even if they did exist).

How was this change tested?

Refactor. Existing unit tests validate the change.

codecov · 2025-05-14T21:17:39Z

Codecov Report

Attention: Patch coverage is 77.77778% with 34 lines in your changes missing coverage. Please review.

Project coverage is 85.07%. Comparing base (9da2514) to head (41f5e0c).
Report is 1 commit behind head on main.

Files with missing lines	Patch %	Lines
ffi/src/expressions/kernel_visitor.rs	0.00%	12 Missing ⚠️
ffi/src/test_ffi.rs	0.00%	11 Missing ⚠️
...src/engine/arrow_expression/evaluate_expression.rs	89.06%	1 Missing and 6 partials ⚠️
kernel/src/kernel_predicates/mod.rs	86.95%	3 Missing ⚠️
kernel/src/scan/data_skipping.rs	87.50%	1 Missing ⚠️

Additional details and impacted files

@@            Coverage Diff             @@
##             main     #949      +/-   ##
==========================================
+ Coverage   85.06%   85.07%   +0.01%     
==========================================
  Files          90       90              
  Lines       23090    23034      -56     
  Branches    23090    23034      -56     
==========================================
- Hits        19641    19597      -44     
+ Misses       2446     2435      -11     
+ Partials     1003     1002       -1


scovich

Some commentary for reviewers

scovich · 2025-05-14T21:17:31Z

ffi/src/expressions/kernel_visitor.rs

@@ -179,6 +179,15 @@ pub extern "C" fn visit_predicate_eq(
    visit_predicate_binary(state, BinaryPredicateOp::Equal, a, b)
 }

+#[no_mangle]
+pub extern "C" fn visit_predicate_ne(


Somehow this was missing all along. Not sure why e.g. duckdb didn't notice?

does this highlight a test gap on our side too for expr visitors?

Probably... but it's not so much a test gap as it is a feature gap. Perhaps duckdb just passed NOT(eq) instead?

DuckDB delta currently ignores NotEqual predicates, likely because the visitor was missing at the time. I've created an issue to track this: duckdb/duckdb-delta#203.

scovich · 2025-05-14T21:19:21Z

kernel/src/engine/arrow_expression/evaluate_expression.rs

+                        (Int8, Int8Type),
+                        (Int16, Int16Type),
+                        (Int32, Int32Type),
+                        (Int64, Int64Type),
+                        (UInt8, UInt8Type),
+                        (UInt16, UInt16Type),
+                        (UInt32, UInt32Type),
+                        (UInt64, UInt64Type),
+                        (Float16, Float16Type),
+                        (Float32, Float32Type),
+                        (Float64, Float64Type),
+                        (Timestamp(TimeUnit::Second, _), TimestampSecondType),
+                        (Timestamp(TimeUnit::Millisecond, _), TimestampMillisecondType),
+                        (Timestamp(TimeUnit::Microsecond, _), TimestampMicrosecondType),
+                        (Timestamp(TimeUnit::Nanosecond, _), TimestampNanosecondType),
+                        (Date32, Date32Type),
+                        (Date64, Date64Type),
+                        (Time32(TimeUnit::Second), Time32SecondType),
+                        (Time32(TimeUnit::Millisecond), Time32MillisecondType),
+                        (Time64(TimeUnit::Microsecond), Time64MicrosecondType),
+                        (Time64(TimeUnit::Nanosecond), Time64NanosecondType),
+                        (Duration(TimeUnit::Second), DurationSecondType),
+                        (Duration(TimeUnit::Millisecond), DurationMillisecondType),
+                        (Duration(TimeUnit::Microsecond), DurationMicrosecondType),
+                        (Duration(TimeUnit::Nanosecond), DurationNanosecondType),
+                        (Interval(IntervalUnit::DayTime), IntervalDayTimeType),
+                        (Interval(IntervalUnit::YearMonth), IntervalYearMonthType),
+                        (Interval(IntervalUnit::MonthDayNano), IntervalMonthDayNanoType),
+                        (Decimal128(_, _), Decimal128Type),
+                        (Decimal256(_, _), Decimal256Type)


This would have been an indentation-only change, except I removed the ArrowDataType:: prefix in order to keep fmt at bay. Even so, the diff is much more readable when reviewing with whitespace changes hidden.

scovich · 2025-05-14T21:19:36Z

kernel/src/engine/arrow_expression/evaluate_expression.rs

+            // IN is different from all the others, and also quite complex, so factor it out.
+            //
+            // TODO: Factor out as a stand-alone function instead of a closure?
+            let eval_in = || match (left, right) {


By pulling out this closure, I was able to eliminate the two "special" match arms for In and NotIn.

+1 to the TODO; should we just factor this out as a function?

We could, but all this in-list code will anyway change drastically when #652 lands.
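
The shape of that refactor can be sketched roughly as follows. This is an illustrative toy over `i64` values, not the arrow-based code in the PR; the `Op` and `Rhs` types and the `evaluate` function are assumptions made up for the example.

```rust
#[derive(Debug, Clone, Copy)]
enum Op {
    LessThan,
    GreaterThan,
    Equal,
    In,
}

// Hypothetical right-hand operand: a single value, or a list for IN.
enum Rhs<'a> {
    Value(i64),
    List(&'a [i64]),
}

fn lt(a: i64, b: i64) -> bool { a < b }
fn gt(a: i64, b: i64) -> bool { a > b }
fn eq(a: i64, b: i64) -> bool { a == b }

fn evaluate(op: Op, left: i64, right: &Rhs) -> Option<bool> {
    // IN needs list handling that no other operator shares, so isolate it
    // in a closure instead of giving it special arms in the main match.
    let eval_in = || match right {
        Rhs::List(items) => Some(items.contains(&left)),
        Rhs::Value(_) => None, // IN against a non-list is unsupported here
    };

    // Every other operator maps directly to a plain comparison function.
    let cmp: fn(i64, i64) -> bool = match op {
        Op::In => return eval_in(),
        Op::LessThan => lt,
        Op::GreaterThan => gt,
        Op::Equal => eq,
    };

    match right {
        Rhs::Value(v) => Some(cmp(left, *v)),
        Rhs::List(_) => None, // comparisons against a list are unsupported here
    }
}
```

With the IN logic isolated, the remaining match is a simple operator-to-function table, which is the property the comment above describes.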

scovich · 2025-05-14T21:20:34Z

kernel/src/engine/arrow_expression/evaluate_expression.rs

+                LessThan => lt,
+                LessThanOrEqual => lt_eq,
+                GreaterThan => gt,
+                GreaterThanOrEqual => gt_eq,
+                Equal => eq,
+                NotEqual => neq,
+                Distinct => distinct,


The predicate/expression split missed this simplification

scovich · 2025-05-14T21:21:53Z

kernel/src/expressions/mod.rs

@@ -173,19 +173,6 @@ impl BinaryPredicateOp {
            Distinct | In | NotIn => false, // tolerates NULL input
        }
    }
-
-    /// Returns `<op2>` (if any) such that `B <op2> A` is equivalent to `A <op> B`.
-    pub(crate) fn commute(&self) -> Option<BinaryPredicateOp> {


It turned out this (single-callsite) function added more complexity than it saved. It had marginal value all along, and it became downright confusing once the bloat of redundant operators went away.
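
For readers who have not seen the removed helper, the contract in its doc comment (return `<op2>` such that `B <op2> A` is equivalent to `A <op> B`) can be sketched as below. This is a reconstruction from the doc comment alone, written against a hypothetical reduced `Op` enum; it is not the removed implementation.

```rust
#[derive(Debug, Clone, Copy, PartialEq)]
enum Op {
    LessThan,
    GreaterThan,
    Equal,
    Distinct,
    In,
}

// Returns the operator to use after swapping the two operands, if one exists.
fn commute(op: Op) -> Option<Op> {
    match op {
        Op::LessThan => Some(Op::GreaterThan), // A < B  <=>  B > A
        Op::GreaterThan => Some(Op::LessThan), // A > B  <=>  B < A
        Op::Equal | Op::Distinct => Some(op),  // symmetric operators
        Op::In => None,                        // `x IN list` has no swapped form
    }
}
```

With only `<` and `>` left as ordered comparisons, the mapping is a plain swap, which is consistent with the observation that a dedicated helper no longer pays for itself.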

scovich · 2025-05-14T21:26:46Z

kernel/src/scan/data_skipping.rs

-            (Ordering::Equal, true) => BinaryPredicateOp::NotEqual,
-            (Ordering::Greater, false) => BinaryPredicateOp::GreaterThan,
-            (Ordering::Greater, true) => BinaryPredicateOp::LessThanOrEqual,
+        let pred_fn = match (ord, inverted) {


Referencing the helper functions instead of raw operators is not only simpler now, but avoids having to rework the code once e.g. NotEqual disappears.
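
A rough sketch of that pattern: map `(ord, inverted)` to a constructor function rather than to an operator variant, so the call site keeps working no matter how the constructor is implemented. The `Pred` type and the `comparison` function are hypothetical stand-ins, not the kernel's API.

```rust
use std::cmp::Ordering;

// Toy predicate type for illustration only.
#[derive(Debug)]
enum Pred {
    Lt(i64, i64),
    Gt(i64, i64),
    Eq(i64, i64),
    Not(Box<Pred>),
}

impl Pred {
    fn lt(a: i64, b: i64) -> Self { Pred::Lt(a, b) }
    fn gt(a: i64, b: i64) -> Self { Pred::Gt(a, b) }
    fn eq(a: i64, b: i64) -> Self { Pred::Eq(a, b) }
    fn le(a: i64, b: i64) -> Self { Pred::Not(Box::new(Pred::gt(a, b))) }
    fn ge(a: i64, b: i64) -> Self { Pred::Not(Box::new(Pred::lt(a, b))) }
    fn ne(a: i64, b: i64) -> Self { Pred::Not(Box::new(Pred::eq(a, b))) }
}

// Pick a constructor function instead of naming an operator variant, so this
// code does not care whether e.g. `ne` is a real operator or NOT(eq).
fn comparison(ord: Ordering, inverted: bool) -> fn(i64, i64) -> Pred {
    match (ord, inverted) {
        (Ordering::Less, false) => Pred::lt,
        (Ordering::Less, true) => Pred::ge,
        (Ordering::Equal, false) => Pred::eq,
        (Ordering::Equal, true) => Pred::ne,
        (Ordering::Greater, false) => Pred::gt,
        (Ordering::Greater, true) => Pred::le,
    }
}
```

Calling `comparison(Ordering::Greater, true)(1, 2)` yields a predicate equivalent to `1 <= 2` without the caller ever referring to a less-than-or-equal operator.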

scovich · 2025-05-14T21:28:13Z

kernel/tests/read.rs

-        (NotEqual, 7, vec![&batch2, &batch1]),
-        (NotEqual, 8, vec![&batch2, &batch1]),
+    #[allow(clippy::type_complexity)] // otherwise it's even more complex because no `_`
+    let test_cases: Vec<(fn(Expr, Expr) -> _, _, _)> = vec![


As above, grabbing helper functions instead of raw predicates is both simpler and more robust.

scovich · 2025-05-14T21:32:35Z

kernel/src/kernel_predicates/mod.rs

+            Distinct | In => {
                debug!("Unsupported binary operator: {left:?} {op:?} {right:?}");


aside: AFAIK nothing prevents us from implementing Distinct (both scalar eval and data skipping forms).
We should probably implement it for completeness at some point.

got an issue started! #963

scovich · 2025-05-14T21:41:22Z

kernel/src/kernel_predicates/mod.rs

-    /// A (possibly inverted) less-than-or-equal comparison, e.g. `<col> <= <value>`
-    fn eval_pred_le(&self, col: &ColumnName, val: &Scalar, inverted: bool) -> Option<Self::Output>;
+    /// A (possibly inverted) greater-than comparison, e.g. `<col> > <value>`
+    fn eval_pred_gt(&self, col: &ColumnName, val: &Scalar, inverted: bool) -> Option<Self::Output>;


Turns out Ordering::Greater had the right idea all along. It's less confusing to work with than <=.

So everywhere we used to translate in terms of lt/le, we now translate in terms of lt/gt? Agreed, that feels nicer!

scovich · 2025-05-14T21:42:22Z

kernel/src/kernel_predicates/mod.rs

+            // Given `col <= val`:
+            // Skip if `val` is less than _all_ values in [min, max], implies
+            // Skip if `val < min AND val < max` implies
+            // Skip if `val < min` implies
+            // Keep if `NOT(val < min)` implies
+            // Keep if `NOT(min > val)`
+            self.partial_cmp_min_stat(col, val, Ordering::Greater, true)


NOTE: No actual change, the content of the if/else blocks just swapped places.
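
The derivation in the code comment can be checked with a small worked example. The function below is a simplified, hypothetical stand-in for `partial_cmp_min_stat`: it takes the file's min statistic directly as an `Option<i64>` instead of a column name and `Scalar`, and its exact semantics are an assumption based on the comment rather than the kernel's implementation.

```rust
use std::cmp::Ordering;

// Hypothetical stand-in: reports whether `min <ord> val` holds, optionally
// inverted. Returns None when the min statistic is unavailable.
fn partial_cmp_min_stat(
    min: Option<i64>,
    val: i64,
    ord: Ordering,
    inverted: bool,
) -> Option<bool> {
    let matched = min?.cmp(&val) == ord;
    Some(matched != inverted)
}

// Data skipping for `col <= val`: the file can be skipped only if every value
// exceeds `val`, i.e. `min > val`. So keep the file if NOT(min > val).
fn keep_for_le(min: Option<i64>, val: i64) -> Option<bool> {
    partial_cmp_min_stat(min, val, Ordering::Greater, true)
}
```

For `col <= 10`, a file with `min = 5` or `min = 10` is kept, a file with `min = 11` is skipped, and a file with no min statistic yields `None` (no skipping decision).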

zachschuermann

LGTM!

zachschuermann · 2025-05-19T20:26:09Z

ffi/examples/visit-expression/expression.h

  GreaterThan,
-  GreaterThaneOrEqual,


and now one less typo!

nicklan

lgtm, just one small suggestion.

kernel/src/kernel_predicates/mod.rs

scovich requested review from nicklan and hntd187 May 14, 2025 21:14

github-actions bot assigned scovich May 14, 2025

github-actions bot added the breaking-change Change that require a major version bump label May 14, 2025

scovich force-pushed the simpler-binary-predicates branch from c656483 to 8052975 Compare May 14, 2025 21:40

scovich added 4 commits May 14, 2025 14:59

refactor: Various expression-related code cleanups

3e0d29f

remove redundant NotIn operator

b873516

refactor: Kernel predicate evaluation use gt instead of le

6c35d87

Remove redundant binary LE, GE, NE predicate operations

c609c19

scovich force-pushed the simpler-binary-predicates branch from 8052975 to c609c19 Compare May 14, 2025 22:08

scovich mentioned this pull request May 15, 2025

refactor: Make arrow predicate eval directly invertible #956

Merged

zachschuermann approved these changes May 19, 2025

nicklan approved these changes May 22, 2025

scovich added 2 commits May 23, 2025 18:02

review feedback

9798077

Merge remote-tracking branch 'oss/main' into simpler-binary-predicates

41f5e0c

scovich merged commit d977f64 into delta-io:main May 24, 2025
20 of 21 checks passed

samansmink mentioned this pull request May 27, 2025

Support & test all types of predicates duckdb/duckdb-delta#203

Open
