feat: add `tiledb_query_add_predicate` API to parse a SQL expression string into a `QueryCondition` #5566

rroelke · 2025-06-29T14:17:44Z

Resolves CORE-25.

#5546 enabled QueryCondition to evaluate DataFusion expression trees in its apply_sparse function. This was used only in test scenarios, via a configuration parameter that makes QueryCondition translate its syntax tree into a DataFusion Expr and then evaluate it.

This pull request builds upon this capability and connects it to the C and C++ APIs. We add tiledb_query_add_predicate(tiledb_ctx_t*, tiledb_query_t*, const char* predicate), which uses DataFusion to parse predicate into an Expr that can be evaluated by QueryCondition.

This enables users to add a much more expressive set of predicates to their queries, which can perform arithmetic, function calls, conditional (CASE) expressions, and so on.

Design

DataFusion has a SessionContext which contains the resources and catalog of a DataFusion session. This object is used to parse and evaluate expressions and is where one registers tables, UDFs, etc.

We will lazily attach a SessionContext to a Query when tiledb_query_add_predicate is first invoked. Doing so lazily means that existing applications should observe no changes.

tiledb_query_add_predicate initializes the SessionContext if needed and uses it to parse the expression into a LogicalExpr. We accumulate a list of LogicalExpr values and then combine them into a single conjunctive expression tree (possibly together with a QueryCondition syntax tree) and embed that in a new QueryCondition.
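The flow described above can be sketched in plain C++. This is only an illustration under stated assumptions: `Session`, `Query`, `add_predicate`, and `combined` are hypothetical stand-ins, the "parse" step merely parenthesizes the text, and the real implementation works with DataFusion expression trees on the Rust side rather than strings.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <optional>
#include <string>
#include <vector>

// Hypothetical stand-in for the DataFusion SessionContext.
struct Session {
  // Pretend to parse: wrap the text so that it composes safely under AND.
  std::string parse(const std::string& predicate) const {
    return "(" + predicate + ")";
  }
};

// Hypothetical stand-in for the Query class.
class Query {
 public:
  // Create the session on first use, parse, and remember the result.
  void add_predicate(const std::string& predicate) {
    assert(!predicate.empty());
    if (!session_) {
      session_ = std::make_unique<Session>();
    }
    predicates_.push_back(session_->parse(predicate));
  }

  bool has_session() const { return session_ != nullptr; }

  // Combine all predicates into one conjunction, optionally together with
  // an existing query condition (represented here as text).
  std::optional<std::string> combined(
      const std::optional<std::string>& query_condition = std::nullopt) const {
    std::vector<std::string> terms;
    if (query_condition) {
      terms.push_back("(" + *query_condition + ")");
    }
    terms.insert(terms.end(), predicates_.begin(), predicates_.end());
    if (terms.empty()) {
      return std::nullopt;
    }
    std::string out = terms.front();
    for (std::size_t i = 1; i < terms.size(); ++i) {
      out += " AND " + terms[i];
    }
    return out;
  }

 private:
  std::unique_ptr<Session> session_;  // stays null until the first predicate
  std::vector<std::string> predicates_;
};
```

A query that never calls `add_predicate` never allocates a session in this sketch, which is the property that keeps existing applications unaffected.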

Enumerations

To reduce the scope of this pull request, we hoped to defer supporting attributes with enumerations in predicates to a future pull request (see CORE-287). However, #5546 demonstrated some examples of rewriting a QueryCondition into a DataFusion expression tree when attributes with enumerations were used in the predicates. This occurred after QueryCondition::rewrite_for_schema, which resolves the literal into a key for comparison with the key stored in the tile. Continuing to pass these tests requires some amount of enumeration support to be implemented here.

The "obvious" thing to do would be to represent enumerations in the schema using the Arrow Dictionary data type. However, this notion has a few shortcomings:

- The data type and cell val num of an enumeration are co-located with the enumeration variants in storage, not with the enumeration name in the schema. This means that resolving an enumeration data type requires loading it.
- The Rust DictionaryArray requires that each of the keys is a valid index into the dictionary values, whereas TileDB enumerations specifically do not have this requirement.

To work around these things, we sometimes need the Arrow schema that we generate to use the enumeration key type, and sometimes we will want it to use the enumeration value type.

This pull request adds a switch to tiledb::oxidize::arrow::schema::create, and to the functions downstream of it, which selects whether the generated schema is the "storage" schema or the "view" schema. We use the "storage" schema to support the existing tests. CORE-287 will use the "view" schema to support text predicates on attributes with enumerations.
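A minimal sketch of the storage/view toggle, written in C++ for illustration only. The name `WhichSchema` appears in the commit list for this pull request; `AttributeInfo`, `field_type`, and the type names used as strings are hypothetical, and the actual switch lives in the Rust `oxidize` code.

```cpp
#include <cassert>
#include <optional>
#include <string>

// The two variants mirror the "storage" and "view" schemas described above.
enum class WhichSchema { Storage, View };

// Hypothetical, simplified description of an attribute.
struct AttributeInfo {
  std::string name;
  std::string key_type;  // type of the cells physically stored in the tile
  // Value type of the attached enumeration, if there is one.
  std::optional<std::string> enumeration_value_type;
};

// Choose the field type to expose for one attribute.
inline std::string field_type(const AttributeInfo& attr, WhichSchema which) {
  if (which == WhichSchema::View && attr.enumeration_value_type) {
    return *attr.enumeration_value_type;  // what a text predicate would see
  }
  return attr.key_type;  // what the tiles actually contain
}

inline void usage_example() {
  AttributeInfo color{"color", "uint8", std::string("utf8")};
  assert(field_type(color, WhichSchema::Storage) == "uint8");
  assert(field_type(color, WhichSchema::View) == "utf8");
}
```

Attributes without an enumeration have the same type under both settings in this sketch, so the toggle only matters for enumerated attributes.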

There is some more code related to enumerations added here which was in draft along the way to implementing a correct "view" schema. I left it in but we can remove it if preferred.

Testing

- We add examples for the C API and C++ API which demonstrate using this new API for the same conditions used in the query_condition_sparse API examples, as well as some more complicated predicates which cannot be expressed as a query condition.
- We add some sanity tests, examples, and API tests in unit-query-add-predicate.cc.
- We add a mode to the rapidcheck tests in unit-sparse-global-order-reader.cc to optionally render the generated QueryCondition syntax tree as a SQL string, which we then add as a predicate instead of applying it to the query normally.
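Rendering a condition syntax tree as SQL text, as the rapidcheck mode above does, might look roughly like the following. `Node`, `compare`, `combine`, `quote_ident`, and `to_sql` are hypothetical names for a miniature tree, not the QueryCondition AST, and quoting identifiers by doubling embedded double quotes is the usual SQL convention assumed here rather than something taken from this pull request.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <string>
#include <utility>
#include <vector>

// Hypothetical miniature of a query condition syntax tree.
struct Node {
  enum class Kind { Compare, And, Or } kind;
  std::string field, op, literal;               // used when kind == Compare
  std::vector<std::unique_ptr<Node>> children;  // used for And / Or
};

inline std::unique_ptr<Node> compare(
    std::string field, std::string op, std::string literal) {
  auto n = std::make_unique<Node>();
  n->kind = Node::Kind::Compare;
  n->field = std::move(field);
  n->op = std::move(op);
  n->literal = std::move(literal);
  return n;
}

inline std::unique_ptr<Node> combine(
    Node::Kind kind, std::unique_ptr<Node> lhs, std::unique_ptr<Node> rhs) {
  auto n = std::make_unique<Node>();
  n->kind = kind;
  n->children.push_back(std::move(lhs));
  n->children.push_back(std::move(rhs));
  return n;
}

// Quote a field name as a SQL identifier, doubling embedded quotes.
inline std::string quote_ident(const std::string& name) {
  std::string out = "\"";
  for (char c : name) {
    if (c == '"') {
      out += '"';
    }
    out += c;
  }
  out += '"';
  return out;
}

// Render the tree; combinations are parenthesized to preserve grouping.
inline std::string to_sql(const Node& n) {
  if (n.kind == Node::Kind::Compare) {
    return quote_ident(n.field) + " " + n.op + " " + n.literal;
  }
  assert(!n.children.empty());
  const char* joiner = (n.kind == Node::Kind::And) ? " AND " : " OR ";
  std::string out = "(";
  for (std::size_t i = 0; i < n.children.size(); ++i) {
    if (i > 0) {
      out += joiner;
    }
    out += to_sql(*n.children[i]);
  }
  return out + ")";
}
```

The value of such a mode is that the same generated condition can be checked through both code paths, so any disagreement points at the translation or the predicate evaluation.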

Unsupported

- Attributes with enumerations cannot be used in predicates (CORE-287)
- Dimension labels are of unknown status (issue TODO)
- We add errors for non-sparse global order reader for now (CORE-272, CORE-273)

Next Steps

This does not analyze predicates for attributes/dimensions to improve R-tree traversal (CORE-28) or add the desired intersection function (CORE-26).

TYPE: FEATURE
DESC: add tiledb_query_add_predicate API to parse a SQL expression string into a QueryCondition

…han a copy in order to add Rust binding (#5567)

Differences in copy and move semantics between Rust and C++ mean that they are generally not very good at passing values across the boundary - most of the time it is necessary to pass references instead.

#5566 intends to call `Attribute::get_enumeration_name` from Rust. Prior to these changes, that function returns `std::optional<std::string>`, which cannot be passed across the boundary (neither `std::optional` nor `std::string` can be). We can either pass a pointer to a string, or pass a string contained within a smart pointer. To avoid the additional memory allocations required for both copying the `std::string` and placing it in a `unique_ptr`, we choose the former.

However, `std::optional` does not naturally support references, so we cannot do `std::optional<std::string&>`. Instead we change the return type to `std::optional<std::reference_wrapper<std::string>>`.

In addition to enabling passing the result of this function to Rust without allocating additional memory, this also avoids making additional copies of the string. This is not likely to matter from a performance perspective, but is still nice!

---
TYPE: NO_HISTORY
DESC: Rust binding for `Attribute::get_enumeration_name`
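A small sketch of the return-type pattern described in that commit message. The `Attribute` class below is a hypothetical, stripped-down stand-in, the wrapper holds a `const std::string` (an assumption; the message names `std::reference_wrapper<std::string>`), and `enumeration_name_ptr` is an invented helper showing how the optional reference could be turned into a nullable pointer for a foreign caller.

```cpp
#include <cassert>
#include <functional>
#include <optional>
#include <string>
#include <utility>

// Hypothetical, stripped-down Attribute.
class Attribute {
 public:
  Attribute() = default;
  explicit Attribute(std::string enumeration_name)
      : enumeration_name_(std::move(enumeration_name)) {}

  // Returns a reference to the stored name rather than a copy.
  // std::optional cannot hold a bare reference, hence the wrapper.
  std::optional<std::reference_wrapper<const std::string>>
  get_enumeration_name() const {
    if (enumeration_name_.has_value()) {
      return std::cref(*enumeration_name_);
    }
    return std::nullopt;
  }

 private:
  std::optional<std::string> enumeration_name_;
};

// Hypothetical helper: null means "no enumeration"; otherwise the pointer
// refers to the string owned by the attribute, with no allocation or copy.
inline const std::string* enumeration_name_ptr(const Attribute& attr) {
  auto name = attr.get_enumeration_name();
  return name ? &name->get() : nullptr;
}

inline void usage_example() {
  Attribute attr("colors");
  const std::string* name = enumeration_name_ptr(attr);
  assert(name != nullptr && *name == "colors");
  assert(enumeration_name_ptr(Attribute()) == nullptr);
}
```

The returned reference or pointer is only valid while the attribute is alive, which is the usual trade-off for avoiding the copy.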

teo-tsirpanis

Preliminary review

tiledb/sm/query/query.cc

tiledb/sm/query/query.h

teo-tsirpanis · 2025-07-01T19:57:18Z

tiledb/sm/query/query_condition.cc

@@ -2989,7 +3023,7 @@ void QueryCondition::Datafusion::apply(
        result_bitmap[i] *= bitmap[i];
      }
    } else {
-      throw std::logic_error(
+      throw QueryConditionException(


Unrelated to this PR, the trend of our exception (and previously status) types is to indicate where the error occurred, rather than what the error was, putting us against the paradigm of exception types in C++ and other languages. This is not a good thing, and leads to C APIs almost always returning the monolithic TILEDB_ERR status code when they fail, which leads to poor error handling and reporting practices downstream, including the very fragile parsing of the error message, which I have seen done in both testing and production code.

Of course we cannot fix this problem at once and in this PR, but what we can do is prevent the "where"-typed exceptions from proliferating in new code, or at the very least in existing code. Which is a long-winded way for me to say "I think we should revert this line (and the one below)".

rroelke · Jul 21, 2025

I definitely agree with the root notion that the error reporting needs improvement. I prefer leaving this in, however. What proper error handling looks like is a question for the future; and what this change does today is at least make it easier to identify all the instances of thrown exceptions in this file by searching on a common string.
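To illustrate the distinction raised in this thread, here is a sketch of exception types that say what went wrong, mapped to distinct status codes by type instead of by parsing the message. Everything here is hypothetical (`Status`, `InvalidArgumentError`, `UnsupportedError`, `run_and_classify`); it is not a proposal for the TileDB error model, only an example of the general pattern.

```cpp
#include <cassert>
#include <stdexcept>
#include <string>

// Hypothetical status codes, finer-grained than a single generic error.
enum class Status { Ok, InvalidArgument, Unsupported, Error };

// Hypothetical "what"-typed exceptions: the type names the failure.
struct InvalidArgumentError : std::runtime_error {
  using std::runtime_error::runtime_error;
};
struct UnsupportedError : std::runtime_error {
  using std::runtime_error::runtime_error;
};

// Translate an exception into a status by its type.
template <class F>
Status run_and_classify(F&& f) {
  try {
    f();
    return Status::Ok;
  } catch (const InvalidArgumentError&) {
    return Status::InvalidArgument;
  } catch (const UnsupportedError&) {
    return Status::Unsupported;
  } catch (const std::exception&) {
    return Status::Error;  // anything else stays generic
  }
}

inline void usage_example() {
  Status s = run_and_classify([] { throw UnsupportedError("dense reader"); });
  assert(s == Status::Unsupported);
}
```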

tiledb/sm/c_api/tiledb_experimental.h

teo-tsirpanis · 2025-07-01T20:12:49Z

tiledb/sm/query/query.h

@@ -58,8 +58,24 @@
 #include "tiledb/sm/storage_manager/cancellation_source.h"
 #include "tiledb/sm/subarray/subarray.h"

+#ifdef HAVE_RUST
+#include "tiledb/oxidize/rust.h"


How big is rust.h? If it adds large amounts of code to the compilation, we should avoid including it in other headers, for example by converting LogicalExpr and Session to self-contained types, and storing tdb_unique_ptrs of them in the class. There is also the concern that it will affect developers' inner loop: if the header's content is too sensitive to changes in the Rust code's implementation details, it will force recompilations of the C++ code, which moving the include to the implementation file would limit.

rroelke · Jul 2, 2025

The contents of this header are pinned to the version of our cxx crate dependency. I wouldn't be shocked if it never changed.

rroelke · Jul 2, 2025

This header is generated by the cxxbridge dependency and provides C++ definitions which correspond to Rust standard library types, e.g. rust::Box and rust::Vec. It isn't generated by compiling any of our Rust crates.
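For reference, the alternative raised at the start of this thread (keeping the include out of the header) is the familiar forward-declaration pattern. The sketch below puts both halves in one file for brevity; `SessionHandle` and this `Query` are hypothetical, and `std::unique_ptr` stands in for `tdb_unique_ptr`.

```cpp
#include <cassert>
#include <memory>
#include <string>

// ---- would live in the header: only a forward declaration is needed
struct SessionHandle;  // hypothetical opaque wrapper, defined in the .cc file

class Query {
 public:
  Query();
  ~Query();  // defined where SessionHandle is a complete type
  void add_predicate(const std::string& text);
  bool has_session() const;

 private:
  std::unique_ptr<SessionHandle> session_;
};

// ---- would live in the implementation file, next to the heavy include
struct SessionHandle {
  int predicates_parsed = 0;
};

Query::Query() = default;
Query::~Query() = default;

void Query::add_predicate(const std::string& text) {
  assert(!text.empty());
  if (!session_) {
    session_ = std::make_unique<SessionHandle>();
  }
  ++session_->predicates_parsed;
}

bool Query::has_session() const {
  return session_ != nullptr;
}
```

Whether this indirection is worth it depends on how often the included header changes, which is the point the replies above address.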

teo-tsirpanis · 2025-07-01T20:16:43Z

tiledb/oxidize/cxx-interface/src/sm/array_schema/mod.rs

        #[cxx_name = "attribute"]
        fn attribute_by_idx(&self, idx: u32) -> *const Attribute;

        #[cxx_name = "attribute"]
        fn attribute_by_name(&self, name: &CxxString) -> *const Attribute;

+        #[cxx_name = "get_enumeration"]
+        fn const_enumeration_cxx(&self, name: &CxxString) -> SharedPtr<ConstEnumeration>;


Can we have ConstSharedPtr instead of const editions of the individual classes?

rroelke · Jul 2, 2025

Unfortunately not. The #[cxx::bridge] macro is pretty strict about what types are allowed. SharedPtr is a special type which it knows how to expand to code on both the Rust and C++ side; custom opaque types are never allowed to be returned or passed by value.

test/support/src/query_helpers.cc

tiledb/sm/query/query.cc

rroelke · 2025-07-02T20:35:22Z

My plan was to avoid dealing with enumerations for now - but the existing QueryCondition tests rely on them, so I will have to do something with them here...

rroelke marked this pull request as draft June 29, 2025 14:17

rroelke mentioned this pull request Jun 29, 2025

chore: Attribute::get_enumeration_name returns a reference rather than a copy in order to add Rust binding #5567

Merged

rroelke added 4 commits June 30, 2025 10:31

tiledb_query_add_predicate

94eeeff

Remove assert post rebase

ad3f43c

Fill in WHERE a IS NOT NULL example

c10ef48

Fill in other global order unit test examples

6017582

rroelke force-pushed the rr/core-25-add-predicate branch from 3e877a3 to 6017582 Compare June 30, 2025 14:32

rroelke added 12 commits June 30, 2025 11:39

unit-query-add-predicate.cc tests for other readers

303e831

Move datafusion session to query instead of ContextResources

7aaedd5

Tweak example comment

3668cda

Enumeration::create const std::vector

4e2eebf

Tweak example

bd16e7b

cpp example

40cb894

clippy

2139297

Add test on evolved schema

f10d157

Change test names

2b72ba4

Fix non-rust build

1d962ec

Fix osx build errors

b0d36d3

Fix C API example print_elem buffer

f246bd7

teo-tsirpanis reviewed Jul 1, 2025

rroelke added 2 commits July 1, 2025 19:21

Comment new/updated test support functions

744728d

Remove unnecessary ExternType impl

bead919

rroelke commented Jul 1, 2025

tiledb/sm/query/query.cc (outdated, resolved)

rroelke commented Jul 1, 2025

tiledb/sm/query/query.cc (outdated, resolved)

rroelke added 5 commits July 1, 2025 19:33

Self-review code comments

02fb0ca

Attempt to fix query_add_predicate error

ec829e0

Undo clang-format-17 string splits

f2ddb4b

Change C++ API to use std::string

176a021

Remove logger_->status

d1c7680

rroelke added 5 commits July 1, 2025 22:12

SQL dialect in API comments

653c890

Query add predicate to in progress query

a00301c

Fix bizarre -Warray-bounds error for b_data_offsets

be32216

Query add predicate with query condition

763a3e2

Add tests demonstrating field escaping

1ab697f

rroelke added 10 commits July 3, 2025 08:18

Add some FFI for sm Buffer

2c05e33

FFI use_enumeration

666e138

Bindings for accessing enumeration contents and locating them in a schema

069162b

ArrowSchema => ArrowArraySchema, contains dyn ArrowArray for enumeration contents

9758e7f

Move definitions to .cc file to avoid multiple definition error

b41d1aa

Add WhichSchema to distinguish schema for view vs. storage, passes unit query condition tests

d78e434

Fix wrong write size in unit_query_condition.cc

860d3d5

Fix UTF-8, unit_query_condition passes

9d8e4ff

Stopgap for enumerations in WhichSchema::View

d6889bd

clippy

d5caf2c

rroelke marked this pull request as ready for review July 3, 2025 17:44

rroelke requested review from davisp, ypatia and teo-tsirpanis July 3, 2025 17:44

rroelke and others added 6 commits July 21, 2025 10:10

Merge remote-tracking branch 'origin/main' into rr/core-25-add-predicate

314b169

RestClientFactory can construct in place

9046ef1

Handle TILEDB_RUST=OFF in unit-query-add-predicate.cc

33c8a72

HeapMemoryLinter ignores oxidize dir

a3e6617

Fix empty dimension tuple

e18ccd7

make format

f1d7eb5

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

feat: add `tiledb_query_add_predicate` API to parse a SQL expression string into a `QueryCondition` #5566

feat: add `tiledb_query_add_predicate` API to parse a SQL expression string into a `QueryCondition` #5566

rroelke commented Jun 29, 2025 •

edited

Loading

Uh oh!

teo-tsirpanis left a comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

teo-tsirpanis Jul 1, 2025

Uh oh!

rroelke Jul 21, 2025

Uh oh!

Uh oh!

teo-tsirpanis Jul 1, 2025 •

edited

Loading

Uh oh!

rroelke Jul 2, 2025

Uh oh!

rroelke Jul 2, 2025

Uh oh!

teo-tsirpanis Jul 1, 2025

Uh oh!

rroelke Jul 2, 2025

Uh oh!

Uh oh!

Uh oh!

Uh oh!

rroelke commented Jul 2, 2025

Uh oh!

Uh oh!

feat: add tiledb_query_add_predicate API to parse a SQL expression string into a QueryCondition #5566

Are you sure you want to change the base?

feat: add tiledb_query_add_predicate API to parse a SQL expression string into a QueryCondition #5566

Conversation

rroelke commented Jun 29, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Design

Enumerations

Testing

Unsupported

Next Steps

Uh oh!

teo-tsirpanis left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

teo-tsirpanis Jul 1, 2025

Choose a reason for hiding this comment

Uh oh!

rroelke Jul 21, 2025

Choose a reason for hiding this comment

Uh oh!

Uh oh!

teo-tsirpanis Jul 1, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

rroelke Jul 2, 2025

Choose a reason for hiding this comment

Uh oh!

rroelke Jul 2, 2025

Choose a reason for hiding this comment

Uh oh!

teo-tsirpanis Jul 1, 2025

Choose a reason for hiding this comment

Uh oh!

rroelke Jul 2, 2025

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

rroelke commented Jul 2, 2025

Uh oh!

Uh oh!

feat: add `tiledb_query_add_predicate` API to parse a SQL expression string into a `QueryCondition` #5566

feat: add `tiledb_query_add_predicate` API to parse a SQL expression string into a `QueryCondition` #5566

rroelke commented Jun 29, 2025 •

edited

Loading

teo-tsirpanis Jul 1, 2025 •

edited

Loading