feat: add tiledb_query_add_predicate API to parse a SQL expression string into a QueryCondition #5566

Open
wants to merge 44 commits into
base: main

Member

@rroelke rroelke commented Jun 29, 2025

Resolves CORE-25.

#5546 enabled QueryCondition to evaluate DataFusion expression trees in its apply_sparse function. This was used only in test scenarios, behind a configuration parameter which makes QueryCondition translate its syntax tree into a DataFusion Expr and then evaluate it.

This pull request builds upon this capability and connects it to the C and C++ APIs. We add tiledb_query_add_predicate(tiledb_ctx_t*, tiledb_query_t*, const char* predicate), which uses DataFusion to parse the predicate string into an Expr which can be evaluated by QueryCondition.

This enables users to add a much more expressive set of predicates to their queries, which can perform arithmetic, function calls, conditional (CASE) expressions, and so on.

Design

DataFusion has a SessionContext which contains the resources and catalog of a DataFusion session. This object is used to parse and evaluate expressions and is where one registers tables, UDFs, etc.

We will lazily attach a SessionContext to a Query when tiledb_query_add_predicate is first invoked. Doing so lazily means that existing applications should observe no changes.

tiledb_query_add_predicate initializes the SessionContext if needed and uses it to parse the expression into a LogicalExpr. We accumulate a list of LogicalExpr values, combine them into a single conjunctive expression tree (possibly together with a QueryCondition syntax tree), and embed that tree in a new QueryCondition.
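The lazy attachment and the conjunctive accumulation can be modeled in a few lines. This is only a toy sketch, not the TileDB implementation: ToySession and ToyQuery are hypothetical names, and "parsing" merely wraps the text in parentheses where the real code would build a DataFusion expression tree.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <optional>
#include <string>
#include <vector>

// Hypothetical stand-in for a DataFusion SessionContext.
struct ToySession {
  // Stand-in for parsing: real code would return an expression tree.
  std::string parse(const std::string& text) const {
    return "(" + text + ")";
  }
};

// Hypothetical stand-in for a query which accepts text predicates.
class ToyQuery {
 public:
  // Creates the session on first use, then parses and stores the predicate.
  void add_predicate(const std::string& text) {
    if (!session_) {
      session_ = std::make_unique<ToySession>();
    }
    predicates_.push_back(session_->parse(text));
  }

  // True once any predicate has been added.
  bool has_session() const { return session_ != nullptr; }

  // Joins all accumulated predicates into a single conjunction.
  std::optional<std::string> combined() const {
    if (predicates_.empty()) {
      return std::nullopt;
    }
    std::string out = predicates_.front();
    for (std::size_t i = 1; i < predicates_.size(); ++i) {
      out += " AND " + predicates_[i];
    }
    return out;
  }

 private:
  std::unique_ptr<ToySession> session_;  // null until first add_predicate
  std::vector<std::string> predicates_;
};
```

A query which never calls add_predicate never allocates a session, which is the property that keeps existing applications unaffected.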

Enumerations

To reduce the scope of this pull request, we hoped to defer supporting attributes with enumerations in predicates to a future pull request (see CORE-287). However, #5546 demonstrated some examples of rewriting a QueryCondition into a DataFusion expression tree when attributes with enumerations were used in the predicates. This occurred after QueryCondition::rewrite_for_schema, which resolves the literal into a key for comparison with the key stored in the tile. Continuing to pass these tests requires some amount of enumeration support to be implemented here.

The "obvious" thing to do would be to represent enumerations in the schema using the Arrow Dictionary data type. However, this notion has a few shortcomings:

  1. the data type and cell val num of an enumeration are co-located with the enumeration variants in storage, not with the enumeration name in the schema. This means that resolving an enumeration data type requires loading it.
  2. the Rust DictionaryArray requires that each of the keys is a valid index into the dictionary values, whereas TileDB enumerations specifically do not have this requirement.
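The second shortcoming can be illustrated without Arrow or TileDB. In this hypothetical model (ToyEnumeration is an assumed name), a stored key is allowed to fall outside the list of variants, so a lookup has to be able to report "no variant" rather than assume every key is a valid index.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <optional>
#include <string>
#include <vector>

// Toy enumeration: the variants play the role of dictionary values.
struct ToyEnumeration {
  std::vector<std::string> variants;

  // Returns the variant for a key, or nullopt if the key has no variant.
  std::optional<std::string> lookup(std::uint64_t key) const {
    if (key >= variants.size()) {
      return std::nullopt;
    }
    return variants[key];
  }
};

// Counts stored keys which do not index into the variants. A dictionary
// array which requires valid keys could not represent such cells.
std::size_t count_invalid_keys(
    const ToyEnumeration& enumeration,
    const std::vector<std::uint64_t>& keys) {
  std::size_t invalid = 0;
  for (std::uint64_t key : keys) {
    if (!enumeration.lookup(key).has_value()) {
      ++invalid;
    }
  }
  return invalid;
}
```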

To work around these things, we sometimes need the Arrow schema that we generate to use the enumeration key type, and sometimes we will want it to use the enumeration value type.

This pull request adds a switch to tiledb::oxidize::arrow::schema::create, and to the functions downstream of it, which selects whether the generated schema is the "storage" schema (enumeration key types) or the "view" schema (enumeration value types). We use the "storage" schema to support the existing tests. CORE-287 will use the "view" schema to support text predicates on attributes with enumerations.

Some additional enumeration-related code added here remains from drafts written on the way to a correct "view" schema. I left it in, but we can remove it if preferred.
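As a rough illustration of the toggle, the sketch below picks a field type per attribute depending on the requested mode. Every name here (SchemaMode, ToyDatatype, AttributeInfo, field_type) is hypothetical, and the real switch lives in Rust code whose signature is not shown in this pull request description.

```cpp
#include <cassert>
#include <optional>
#include <string>

// Which flavor of schema to generate.
enum class SchemaMode { Storage, View };

// A tiny, made-up set of data types for the example.
enum class ToyDatatype { UInt8, Int32, StringUtf8 };

struct AttributeInfo {
  std::string name;
  // Type of the cells as stored; for an enumerated attribute, the key type.
  ToyDatatype stored_type;
  // Value type of the enumeration, present only for enumerated attributes.
  std::optional<ToyDatatype> enumeration_value_type;
};

// Storage mode always reports the stored type. View mode reports the
// enumeration value type when the attribute has an enumeration.
ToyDatatype field_type(const AttributeInfo& attribute, SchemaMode mode) {
  if (mode == SchemaMode::View && attribute.enumeration_value_type) {
    return *attribute.enumeration_value_type;
  }
  return attribute.stored_type;
}
```

Attributes without an enumeration get the same type in both modes, so the toggle only matters for enumerated attributes.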

Testing

  • We add examples for the C API and C++ API which demonstrate using this new API for the same conditions used in the query_condition_sparse API examples, as well as some more complicated predicates which cannot be expressed as a query condition.
  • We add some sanity tests, examples, and API tests in unit-query-add-predicate.cc.
  • We add a mode to the rapidcheck tests in unit-sparse-global-order-reader.cc to optionally render the generated QueryCondition syntax tree as a SQL string, which we then add as a predicate instead of applying it to the query normally.
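The last testing mode relies on rendering a condition tree as SQL text. A minimal sketch of that idea follows, using a made-up ConditionNode type rather than TileDB's actual syntax tree classes; literal values are assumed to be pre-formatted as SQL.

```cpp
#include <cassert>
#include <memory>
#include <string>
#include <utility>

// Hypothetical condition tree. Leaves compare a field against a literal;
// inner nodes combine two subtrees with AND or OR.
struct ConditionNode {
  std::string op;     // "<", "=", ... for leaves; "AND" or "OR" otherwise
  std::string field;  // leaves only
  std::string value;  // leaves only, already formatted as a SQL literal
  std::unique_ptr<ConditionNode> left;   // inner nodes only
  std::unique_ptr<ConditionNode> right;  // inner nodes only
};

std::unique_ptr<ConditionNode> leaf(
    std::string field, std::string op, std::string value) {
  auto node = std::make_unique<ConditionNode>();
  node->field = std::move(field);
  node->op = std::move(op);
  node->value = std::move(value);
  return node;
}

std::unique_ptr<ConditionNode> combine(
    std::unique_ptr<ConditionNode> left,
    std::string op,
    std::unique_ptr<ConditionNode> right) {
  auto node = std::make_unique<ConditionNode>();
  node->op = std::move(op);
  node->left = std::move(left);
  node->right = std::move(right);
  return node;
}

// Renders the tree as SQL, parenthesizing both sides of every combination
// so that operator precedence never changes the meaning.
std::string to_sql(const ConditionNode& node) {
  if (!node.left) {
    return node.field + " " + node.op + " " + node.value;
  }
  return "(" + to_sql(*node.left) + ") " + node.op + " (" +
         to_sql(*node.right) + ")";
}
```

Because the rendered string should select the same cells as the original tree, a test can compare the results of the two paths against each other.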

Unsupported

  • attributes with enumerations cannot be used in predicates (CORE-287)
  • dimension labels are of unknown status (issue TODO)
  • we add errors for non-sparse global order reader for now (CORE-272, CORE-273)

Next Steps

This does not analyze predicates for attributes/dimensions to improve R-tree traversal (CORE-28) or add the desired intersection function (CORE-26).


TYPE: FEATURE
DESC: add tiledb_query_add_predicate API to parse a SQL expression string into a QueryCondition

@rroelke rroelke marked this pull request as draft June 29, 2025 14:17
rroelke added a commit that referenced this pull request Jun 30, 2025
…han a copy in order to add Rust binding (#5567)

Differences in copy and move semantics between Rust and C++ mean that
they are generally not very good at passing values across the boundary -
most of the time it is necessary to pass references instead.

#5566 intends to call `Attribute::get_enumeration_name` from Rust. Prior
to these changes, that function returned `std::optional<std::string>`,
which cannot be passed across the boundary (neither `std::optional` nor
`std::string` can be). We can either pass a pointer to a string, or pass
a string contained within a smart pointer.

To avoid the additional memory allocations required for both copying the
`std::string` and placing it in a `unique_ptr`, we choose the former.

However, `std::optional` does not naturally support references, so we
cannot do `std::optional<std::string&>`. Instead we change the return
type to `std::optional<std::reference_wrapper<std::string>>`.

In addition to enabling passing the result of this function to Rust
without allocating additional memory, this also avoids making additional
copies of the string. This is not likely to matter from a performance
perspective, but is still nice!
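
A minimal, self-contained sketch of the pattern follows. `ToyAttribute` is a
hypothetical stand-in for the real class, and wrapping a `const std::string`
is an assumption made here so that a `const` getter compiles cleanly.

```cpp
#include <cassert>
#include <functional>
#include <optional>
#include <string>
#include <utility>

// Hypothetical stand-in for an attribute which may name an enumeration.
class ToyAttribute {
 public:
  ToyAttribute() = default;
  explicit ToyAttribute(std::string enumeration_name)
      : enumeration_name_(std::move(enumeration_name)) {
  }

  // Returns a reference to the stored name instead of a copy.
  // std::optional cannot hold a reference directly, so the reference is
  // wrapped in std::reference_wrapper.
  std::optional<std::reference_wrapper<const std::string>>
  get_enumeration_name() const {
    if (enumeration_name_.has_value()) {
      return std::cref(*enumeration_name_);
    }
    return std::nullopt;
  }

 private:
  std::optional<std::string> enumeration_name_;
};
```

Callers reach the string with `name->get()`. The reference stays valid only
as long as the attribute it came from is alive.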

---
TYPE: NO_HISTORY
DESC: Rust binding for `Attribute::get_enumeration_name`
@rroelke rroelke force-pushed the rr/core-25-add-predicate branch from 3e877a3 to 6017582 June 30, 2025 14:32
Member

@teo-tsirpanis teo-tsirpanis left a comment

Preliminary review

@@ -2989,7 +3023,7 @@ void QueryCondition::Datafusion::apply(
 result_bitmap[i] *= bitmap[i];
 }
 } else {
-throw std::logic_error(
+throw QueryConditionException(
Member

Unrelated to this PR, the trend of our exception (and previously status) types is to indicate where the error occurred rather than what the error was, which puts us at odds with the paradigm of exception types in C++ and other languages. This is not a good thing: it leads to C APIs almost always returning the monolithic TILEDB_ERR status code when they fail, which in turn leads to poor error handling and reporting practices downstream, including very fragile parsing of the error message, which I have seen done in both testing and production code.

Of course we cannot fix this problem at once and in this PR, but what we can do is prevent the "where"-typed exceptions from proliferating in new code, or at the very least in existing code. Which is a long-winded way for me to say "I think we should revert this line (and the one below)".
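To make the "what" versus "where" distinction concrete, here is a small sketch of how exception types that describe what went wrong could be mapped to distinct status codes at an API boundary. The ToyStatus values and the to_status helper are hypothetical and are not part of the TileDB C API.

```cpp
#include <cassert>
#include <stdexcept>
#include <utility>

// Hypothetical status codes, finer-grained than a single error value.
enum class ToyStatus { Ok, InvalidArgument, InternalError, Error };

// Runs a callable and translates the kind of failure into a status.
// More derived exception types must be caught before their bases:
// std::invalid_argument derives from std::logic_error.
template <class F>
ToyStatus to_status(F&& f) {
  try {
    std::forward<F>(f)();
    return ToyStatus::Ok;
  } catch (const std::invalid_argument&) {
    return ToyStatus::InvalidArgument;
  } catch (const std::logic_error&) {
    return ToyStatus::InternalError;
  } catch (const std::exception&) {
    return ToyStatus::Error;
  }
}
```

With types like these, a caller can branch on the status instead of parsing the error message.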

Member Author

I definitely agree with the root notion that the error reporting needs improvement. However, I prefer leaving this in. What proper error handling looks like is a question for the future. What this change does today is at least make it easier to identify all the instances of thrown exceptions in this file by searching on a common string.

@@ -58,8 +58,24 @@
#include "tiledb/sm/storage_manager/cancellation_source.h"
#include "tiledb/sm/subarray/subarray.h"

#ifdef HAVE_RUST
#include "tiledb/oxidize/rust.h"
Member

@teo-tsirpanis teo-tsirpanis Jul 1, 2025

How big is rust.h? If it adds large amounts of code to the compilation, we should avoid including it in other headers, for example by converting LogicalExpr and Session to self-contained types and storing tdb_unique_ptrs of them in the class. There is also a concern for developers' inner loop: if the header's content is sensitive to changes in the Rust code's implementation details, it will force recompilations of the C++ code. Moving the include to the implementation file would limit that.
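The suggestion amounts to the usual forward-declaration idiom, sketched below in a single file. The names Session and Query are illustrative only, and std::unique_ptr stands in for tdb_unique_ptr, whose exact interface is not shown in this thread.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <string>

// ---- Would live in the header: no heavy include, only a declaration ----
struct Session;  // defined in the implementation file

class Query {
 public:
  Query();
  ~Query();  // defined where Session is a complete type
  void add_predicate(const std::string& text);
  std::size_t num_predicates() const;

 private:
  std::unique_ptr<Session> session_;
};

// ---- Would live in the implementation file ----
// Only this part needs the full definition (and the heavy include).
#include <vector>

struct Session {
  std::vector<std::string> predicates;
};

Query::Query() = default;
Query::~Query() = default;

void Query::add_predicate(const std::string& text) {
  if (!session_) {
    session_ = std::make_unique<Session>();
  }
  session_->predicates.push_back(text);
}

std::size_t Query::num_predicates() const {
  return session_ ? session_->predicates.size() : 0;
}
```

Code that includes only the header part never sees the definition of Session, so changes to it do not trigger recompilation of that code.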

Member Author

The contents of this header are pinned to the version of our cxx crate dependency. I wouldn't be shocked if it never changed.

Member Author

This header is generated by the cxxbridge dependency and provides C++ definitions which correspond to Rust standard library types, e.g. rust::Box, rust::Vec, etc. It isn't generated by compiling any of our Rust crates.

#[cxx_name = "attribute"]
fn attribute_by_idx(&self, idx: u32) -> *const Attribute;

#[cxx_name = "attribute"]
fn attribute_by_name(&self, name: &CxxString) -> *const Attribute;

#[cxx_name = "get_enumeration"]
fn const_enumeration_cxx(&self, name: &CxxString) -> SharedPtr<ConstEnumeration>;
Member

Can we have ConstSharedPtr instead of const editions of the individual classes?

Member Author

Unfortunately not. The #[cxx::bridge] macro is pretty strict about what types are allowed. SharedPtr is a special type which it knows how to expand to code on both the Rust and C++ side; custom opaque types are never allowed to be returned or passed by value.

@rroelke
Member Author

rroelke commented Jul 2, 2025

My plan was to avoid dealing with enumerations for now - but the existing QueryCondition tests rely on them, so I will have to do something with them here...

@rroelke rroelke marked this pull request as ready for review July 3, 2025 17:44
2 participants