feat: add `tiledb_query_add_predicate` API to parse a SQL expression string into a `QueryCondition` #5566
…han a copy in order to add Rust binding (#5567)

Differences in copy and move semantics between Rust and C++ mean that they are generally not very good at passing values across the boundary; most of the time it is necessary to pass references instead.

#5566 intends to call `Attribute::get_enumeration_name` from Rust. Prior to these changes, that function returned `std::optional<std::string>`, which cannot be passed across the boundary (neither `std::optional` nor `std::string` can be). We can either pass a pointer to a string, or pass a string contained within a smart pointer. To avoid the additional memory allocations required for both copying the `std::string` and placing it in a `unique_ptr`, we choose the former.

However, `std::optional` does not naturally support references, so we cannot do `std::optional<std::string&>`. Instead we change the return type to `std::optional<std::reference_wrapper<std::string>>`. In addition to enabling passing the result of this function to Rust without allocating additional memory, this also avoids making additional copies of the string. This is not likely to matter from a performance perspective, but is still nice!

---
TYPE: NO_HISTORY
DESC: Rust binding for `Attribute::get_enumeration_name`
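The return-type change described in this commit message can be sketched in isolation. This is a minimal illustration, not the code from the PR: the `Attribute` class below is a hypothetical stand-in, `set_enumeration_name` and `enumeration_name_ptr` are names invented for the example, and the sketch wraps a `const std::string` (an assumption; the message above names `std::reference_wrapper<std::string>`).

```cpp
#include <cassert>
#include <functional>
#include <optional>
#include <string>
#include <utility>

// Hypothetical stand-in for the real Attribute class; only the shape of the
// accessor matters here.
class Attribute {
 public:
  void set_enumeration_name(std::string name) {
    enumeration_name_ = std::move(name);
  }

  // Returns a reference to the stored name instead of a copy. The reference
  // is wrapped in std::reference_wrapper because std::optional cannot hold a
  // plain reference in C++17/20.
  std::optional<std::reference_wrapper<const std::string>>
  get_enumeration_name() const {
    if (enumeration_name_.has_value()) {
      return std::cref(*enumeration_name_);
    }
    return std::nullopt;
  }

 private:
  std::optional<std::string> enumeration_name_;
};

// A pointer-returning shim of the kind a binding layer could call (an
// assumption for illustration): nullptr means "no enumeration", otherwise the
// pointer aliases the attribute's own storage, so nothing is allocated.
inline const std::string* enumeration_name_ptr(const Attribute& attr) {
  auto name = attr.get_enumeration_name();
  return name.has_value() ? &name->get() : nullptr;
}
```

The pointer returned by the shim stays valid only as long as the attribute is alive and its name is not reassigned, which is the usual trade-off of handing out references instead of copies.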
3e877a3 to 6017582
Preliminary review
@@ -2989,7 +3023,7 @@ void QueryCondition::Datafusion::apply(
 result_bitmap[i] *= bitmap[i];
 }
 } else {
-throw std::logic_error(
+throw QueryConditionException(
Unrelated to this PR: the trend in our exception (and previously status) types is to indicate where the error occurred rather than what the error was, which puts us at odds with how exception types are used in C++ and other languages. This is not a good thing. It leads to C APIs almost always returning the monolithic `TILEDB_ERR` status code when they fail, which in turn leads to poor error handling and reporting practices downstream, including very fragile parsing of error messages, which I have seen done in both test and production code.

Of course we cannot fix this problem all at once, or in this PR, but what we can do is prevent the "where"-typed exceptions from proliferating in new code, or at the very least in existing code. Which is a long-winded way for me to say "I think we should revert this line (and the one below)".
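To make the distinction in the comment above concrete, here is a small sketch of "what"-typed exceptions being mapped to distinct status codes at an API boundary. The `Status` enum and `call_at_api_boundary` are hypothetical names invented for this example; they are not TileDB's API, which per the comment mostly reports `TILEDB_ERR`.

```cpp
#include <cassert>
#include <exception>
#include <new>
#include <stdexcept>

// Hypothetical status codes, finer-grained than a single catch-all error.
enum class Status { Ok, InvalidArgument, OutOfMemory, InternalError };

// Runs `fn` and translates any exception into a status code. Because the
// exception *type* says what went wrong, the caller gets a meaningful code
// without parsing the message; the message is free to say where it happened.
template <class Fn>
Status call_at_api_boundary(Fn&& fn) {
  try {
    fn();
    return Status::Ok;
  } catch (const std::invalid_argument&) {
    // Must precede std::logic_error, which is its base class.
    return Status::InvalidArgument;
  } catch (const std::bad_alloc&) {
    return Status::OutOfMemory;
  } catch (const std::logic_error&) {
    return Status::InternalError;
  } catch (const std::exception&) {
    return Status::InternalError;
  }
}
```

With a "where"-typed exception such as one class per source file, every handler above would collapse into a single branch, which is the situation the comment describes.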
I definitely agree with the root notion that the error reporting needs improvement. However, I prefer leaving this in. What proper error handling looks like is a question for the future; what this change does today is at least make it easier to identify all the instances of thrown exceptions in this file by searching on a common string.
@@ -58,8 +58,24 @@
#include "tiledb/sm/storage_manager/cancellation_source.h"
#include "tiledb/sm/subarray/subarray.h"

#ifdef HAVE_RUST
#include "tiledb/oxidize/rust.h"
How big is `rust.h`? If it adds large amounts of code to the compilation, we should avoid including it in other headers, for example by converting `LogicalExpr` and `Session` to self-contained types and storing `tdb_unique_ptr`s of them in the class. There is also a concern about developers' inner loop: if the header's content is too sensitive to changes in the Rust code's implementation details, it will force recompilations of the C++ code. Moving the include to the implementation file would limit that.
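The suggestion above is the usual forward-declaration pattern. A single-file sketch follows, with comments marking what would go in the header versus the implementation file. Everything here is an assumption for illustration: `PredicateHolder` is an invented class, `Session` is a placeholder struct rather than the real bridged type, and `std::unique_ptr` stands in for `tdb_unique_ptr`.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <string>
#include <utility>
#include <vector>

namespace sketch {

// ---- Would live in the class's header ----
// Only a forward declaration is needed here, so the header does not have to
// include the heavy bridge header.
struct Session;

class PredicateHolder {
 public:
  PredicateHolder();
  ~PredicateHolder();  // declared here, defined where Session is complete
  void add(std::string text);
  std::size_t size() const;

 private:
  std::unique_ptr<Session> session_;
};

// ---- Would live in the implementation file ----
// This is where the heavy include would move; a plain struct stands in for
// whatever it would define.
struct Session {
  std::vector<std::string> parsed;
};

PredicateHolder::PredicateHolder()
    : session_(std::make_unique<Session>()) {
}

PredicateHolder::~PredicateHolder() = default;

void PredicateHolder::add(std::string text) {
  session_->parsed.push_back(std::move(text));
}

std::size_t PredicateHolder::size() const {
  return session_->parsed.size();
}

}  // namespace sketch
```

The constructor and destructor are defined out of line because `std::unique_ptr` needs the complete type at the point where it may destroy the object; files that include only the header never see the definition of `Session` and are not rebuilt when it changes.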
The contents of this header are pinned to the version of our `cxx` crate dependency. I wouldn't be shocked if it never changed.
This header is generated by the `cxxbridge` dependency and provides C++ definitions which correspond to Rust standard library types, e.g. `rust::Box`, `rust::Vec`, etc. It isn't generated by compiling any of our Rust crates.
#[cxx_name = "attribute"]
fn attribute_by_idx(&self, idx: u32) -> *const Attribute;

#[cxx_name = "attribute"]
fn attribute_by_name(&self, name: &CxxString) -> *const Attribute;

#[cxx_name = "get_enumeration"]
fn const_enumeration_cxx(&self, name: &CxxString) -> SharedPtr<ConstEnumeration>;
Can we have `ConstSharedPtr` instead of `const` editions of the individual classes?
Unfortunately not. The `#[cxx::bridge]` macro is pretty strict about what types are allowed. `SharedPtr` is a special type which it knows how to expand to code on both the Rust and C++ side; custom opaque types are never allowed to be returned or passed by value.
My plan was to avoid dealing with enumerations for now - but the existing QueryCondition tests rely on them, so I will have to do something with them here...
…it query condition tests
Resolves CORE-25.
#5546 enabled `QueryCondition` to evaluate DataFusion expression trees in its `apply_sparse` function. This was used only in test scenarios, with a configuration parameter which would have `QueryCondition` translate its syntax tree into a DataFusion `Expr` and then evaluate it.

This pull request builds upon this capability and connects it to C and C++ level APIs. We add `tiledb_query_add_predicate(tiledb_ctx_t*, tiledb_query_t*, const char* predicate)` which uses DataFusion to parse `predicate` into an `Expr` which can be evaluated by `QueryCondition`.

This enables users to add a much more expressive set of predicates to their queries, which can perform arithmetic, function calls, conditional (`CASE`) expressions, and so on.

Design
DataFusion has a `SessionContext` which contains the resources and catalog of a DataFusion session. This object is used to parse and evaluate expressions and is where one registers tables, UDFs, etc.

We will lazily attach a `SessionContext` to a `Query` when `tiledb_query_add_predicate` is first invoked. Doing so lazily means that existing applications should observe no changes.

`tiledb_query_add_predicate` initializes the `SessionContext` if needed and uses it to parse the expression into a `LogicalExpr`. We accumulate a list of `LogicalExpr` and then combine them into a single conjunctive expression tree (possibly together with a `QueryCondition` syntax tree) and embed that in a new `QueryCondition`.

Enumerations
To reduce the scope of this pull request, we hoped to defer supporting attributes with enumerations in predicates to a future pull request (see CORE-287). However, #5546 demonstrated some examples of rewriting a `QueryCondition` into a DataFusion expression tree when attributes with enumerations were used in the predicates. This occurred after `QueryCondition::rewrite_for_schema`, which resolves the literal into a key for comparison with the key stored in the tile. Continuing to pass these tests requires some amount of enumeration support to be implemented here.

The "obvious" thing to do would be to represent enumerations in the schema using the Arrow `Dictionary` data type. However, this notion has a few shortcomings:
- `DictionaryArray` requires that each of the keys is a valid index into the dictionary values, whereas TileDB enumerations specifically do not have this requirement.

To work around these things, we sometimes need the Arrow schema that we generate to use the enumeration key type, and sometimes we will want it to use the enumeration value type.

This pull request adds a switch to `tiledb::oxidize::arrow::schema::create`, and to the functions that follow it, which selects whether the generated schema is the "storage" schema or the "view" schema. We use the "storage" schema to support the existing tests. CORE-287 will use the "view" schema to support text predicates on attributes with enumerations.

There is some more code related to enumerations added here which was in draft along the way to implementing a correct "view" schema. I left it in but we can remove it if preferred.
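The "storage" versus "view" toggle described above can be modelled in a few lines. This is only a conceptual sketch in C++: the real switch lives in the Rust function `tiledb::oxidize::arrow::schema::create`, and every type and function name below (`SchemaMode`, `AttributeInfo`, `field_type`, `make_schema`, and the reduced `Datatype` enum) is an assumption invented for the example.

```cpp
#include <cassert>
#include <optional>
#include <string>
#include <vector>

// Reduced, hypothetical set of data types.
enum class Datatype { Int32, UInt8, StringUtf8 };

// Hypothetical name for the toggle: Storage uses the type stored in tiles
// (the enumeration key type), View uses the enumeration value type.
enum class SchemaMode { Storage, View };

struct AttributeInfo {
  std::string name;
  Datatype cell_type;  // what is stored in tiles; the key type if enumerated
  std::optional<Datatype> enumeration_value_type;  // set only if enumerated
};

// Picks the field type for one attribute under the requested mode.
inline Datatype field_type(const AttributeInfo& attr, SchemaMode mode) {
  if (mode == SchemaMode::View && attr.enumeration_value_type.has_value()) {
    return *attr.enumeration_value_type;
  }
  return attr.cell_type;
}

// Builds the list of field types for a whole schema.
inline std::vector<Datatype> make_schema(
    const std::vector<AttributeInfo>& attrs, SchemaMode mode) {
  std::vector<Datatype> fields;
  fields.reserve(attrs.size());
  for (const auto& attr : attrs) {
    fields.push_back(field_type(attr, mode));
  }
  return fields;
}
```

Attributes without an enumeration get the same field type in both modes, so the toggle only affects enumerated attributes.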
Testing
- `query_condition_sparse` API examples, as well as some more complicated predicates which cannot be expressed as a query condition.
- `unit-query-add-predicate.cc`.
- `unit-sparse-global-order-reader.cc` to optionally render the generated `QueryCondition` syntax tree as a SQL string, which we then add as a predicate instead of applying it to the query normally.

Unsupported

Next Steps
This does not analyze predicates for attributes/dimensions to improve R-tree traversal (CORE-28) or add the desired intersection function (CORE-26).
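As a rough model of the flow described in the Design section, the sketch below attaches a session lazily on the first predicate and joins the accumulated predicates into one conjunction. It does not use DataFusion or TileDB: `Session`, `Query`, `has_session`, and `combined_condition` are hypothetical stand-ins, and a "parsed expression" is represented by its parenthesized text.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <optional>
#include <string>
#include <vector>

// Stand-in for the parsing session; "parsing" just parenthesizes the text.
struct Session {
  std::string parse(const std::string& predicate) const {
    return "(" + predicate + ")";
  }
};

class Query {
 public:
  // Creates the session on first use only, so queries that never add a
  // predicate never pay for it.
  void add_predicate(const std::string& predicate) {
    if (!session_) {
      session_ = std::make_unique<Session>();
    }
    predicates_.push_back(session_->parse(predicate));
  }

  bool has_session() const {
    return session_ != nullptr;
  }

  // Joins an optional pre-existing condition and all accumulated predicates
  // with AND. Returns nullopt when there is nothing to combine.
  std::optional<std::string> combined_condition(
      const std::optional<std::string>& existing = std::nullopt) const {
    std::vector<std::string> terms;
    if (existing.has_value()) {
      terms.push_back("(" + *existing + ")");
    }
    terms.insert(terms.end(), predicates_.begin(), predicates_.end());
    if (terms.empty()) {
      return std::nullopt;
    }
    std::string out = terms.front();
    for (std::size_t i = 1; i < terms.size(); ++i) {
      out += " AND " + terms[i];
    }
    return out;
  }

 private:
  std::unique_ptr<Session> session_;
  std::vector<std::string> predicates_;
};
```

A real implementation would combine expression trees rather than strings; strings are used here only to keep the sketch self-contained.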
TYPE: FEATURE
DESC: add `tiledb_query_add_predicate` API to parse a SQL expression string into a `QueryCondition`