First pass on IN/Not In #270
Conversation
}
prim_array_cmp! {
    left_arr, right_arr,
    (ArrowDataType::Int8, Int8Type),
aside: It seems like we have several of these arrow/kernel type pairs. I wonder if there's a way to capture the pairings just once and reuse them? Maybe some kind of trait?
enum Foo {
INT,
STRING,
}
trait FooFromBar {
const F: Foo;
}
impl FooFromBar for i32 {
const F: Foo = Foo::INT;
}
impl FooFromBar for String {
const F: Foo = Foo::STRING;
}
You are suggesting something very similar to what arrow itself does for primitive types...
pub trait ArrowPrimitiveType: primitive::PrimitiveTypeSealed + 'static {
/// Corresponding Rust native type for the primitive type.
type Native: ArrowNativeTypeOp;
/// the corresponding Arrow data type of this primitive type.
const DATA_TYPE: DataType;
/// Returns the byte width of this primitive type.
#[deprecated(note = "Use ArrowNativeType::get_byte_width")]
fn get_byte_width() -> usize {
std::mem::size_of::<Self::Native>()
}
/// Returns a default value of this primitive type.
///
/// This is useful for aggregate array ops like `sum()`, `mean()`.
fn default_value() -> Self::Native {
Default::default()
}
}
mod primitive {
pub trait PrimitiveTypeSealed {}
}
That's lifted directly from their code. I like the idea too: we will have to do a lot of this, and it will help implementations beyond arrow model their own engine types onto ours without much hassle.
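To make the suggestion concrete, here is a minimal sketch of what a kernel-side analogue of `ArrowPrimitiveType` might look like. All names here (`KernelDataType`, `KernelPrimitiveType`, `data_type_of`) are hypothetical, not existing kernel API:

```rust
// Hypothetical sketch, not real kernel API: record each Rust native type's
// kernel data type exactly once, the way ArrowPrimitiveType pairs a Native
// type with a DATA_TYPE constant.

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum KernelDataType {
    Int8,
    Int32,
    Int64,
}

pub trait KernelPrimitiveType {
    /// The kernel data type corresponding to this Rust native type.
    const DATA_TYPE: KernelDataType;
}

impl KernelPrimitiveType for i8 {
    const DATA_TYPE: KernelDataType = KernelDataType::Int8;
}
impl KernelPrimitiveType for i32 {
    const DATA_TYPE: KernelDataType = KernelDataType::Int32;
}
impl KernelPrimitiveType for i64 {
    const DATA_TYPE: KernelDataType = KernelDataType::Int64;
}

/// Generic code can recover the pairing without repeating a match.
pub fn data_type_of<T: KernelPrimitiveType>() -> KernelDataType {
    T::DATA_TYPE
}

fn main() {
    assert_eq!(data_type_of::<i32>(), KernelDataType::Int32);
    println!("i32 maps to {:?}", data_type_of::<i32>());
}
```

With something like this, the `prim_array_cmp!` pairings would be written once per type rather than once per call site.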
Is this something you hope to pursue in this PR? Or defer to an issue as future work?
No, I'd hope to do this in a subsequent PR, I'd like to keep this PR simple (and perhaps a bit inefficient) because I'd imagine we have design discussions to have in order to better support this.
I love the idea of capturing these mappings more formally - delta-rs is littered with mappings from and to arrow...
let left_arr = evaluate_expression(left.as_ref(), batch, None)?;
let right_arr = evaluate_expression(right.as_ref(), batch, None)?;
Maybe I misunderstand, but I thought the SQL [NOT] IN clause was used to check whether an array does [not] contain a specific scalar value? This seems to be array vs. array?
update: or perhaps `left_arr` and `string_arr` are inaccurately named?
NVM... the "array" in those names refers to the batch of results, not the array-ness of individual results!
The "array-ness" of the underlying physical data is called a list in arrow, which I'd imagine is meant to reduce confusion.
kernel/src/expressions/scalars.rs
Outdated
#[derive(Clone, Debug, PartialEq)]
pub struct ArrayData {
    tpe: ArrayType,
    pub elements: Vec<Scalar>,
Don't scalars already carry type info? Besides having one more thing that can get out of sync and which needs checking, this would lead to a pretty inefficient data representation due to enum discriminants etc.
If we're only using this for short literal arrays such as `x in (10, 20, 30)`, this might be ok... but if it also gets used for `x in (select * from ...)`, then we could quickly be in trouble?
Do we need to define an `ArrayData` that mirrors `Scalar` but with each variant being a `Vec<T>` instead of `T`? Except then null values would be really annoying. Should we require arrays to just always be engine data?
(I don't want to over-optimize here, but this concept is built into kernel, and so if we pick something pessimal the engine has no way to work around it)
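For illustration only (not a proposal for actual kernel API), a `Scalar`-mirroring layout with one `Vec` per variant might look like the sketch below. Note how nulls force an `Option` per element, which is exactly the annoyance mentioned above; all names are hypothetical:

```rust
// Illustrative sketch: store the element type once per array (in the enum
// discriminant) instead of once per element. Names are hypothetical.

#[derive(Debug, Clone, PartialEq)]
pub enum TypedArrayData {
    // Option per element models nulls, at the cost of awkward ergonomics.
    Int32(Vec<Option<i32>>),
    Int64(Vec<Option<i64>>),
    String(Vec<Option<String>>),
}

impl TypedArrayData {
    pub fn len(&self) -> usize {
        match self {
            TypedArrayData::Int32(v) => v.len(),
            TypedArrayData::Int64(v) => v.len(),
            TypedArrayData::String(v) => v.len(),
        }
    }

    /// Count of non-null elements, showing the per-element Option handling.
    pub fn non_null_count(&self) -> usize {
        match self {
            TypedArrayData::Int32(v) => v.iter().flatten().count(),
            TypedArrayData::Int64(v) => v.iter().flatten().count(),
            TypedArrayData::String(v) => v.iter().flatten().count(),
        }
    }
}

fn main() {
    let arr = TypedArrayData::Int32(vec![Some(10), None, Some(30)]);
    assert_eq!(arr.len(), 3);
    assert_eq!(arr.non_null_count(), 2);
}
```

The upside is one discriminant per array instead of one per element; the downside is the `Option` wrapping and a combinatorial explosion of variants for nested types.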
update: even if it's only for "short" literal arrays, BI tools are notorious for emitting literal arrays with thousands or even millions of entries -- producing multi-MB SQL queries. So we might get into trouble even without subqueries.
Hmmm, I see your point, but I don't have an immediate idea for what might be best here. Off the top of my head, using engine data across the board, instead of introducing another opaque type, would be a better option (assuming it's reasonably achievable).
This code was written very much to suit delta-rs's partition filtering logic, which rarely has more than a few values; I've only seen it get into the thousands a few times in the wild, for a hive-style partition filter. But to your other point about subqueries and BI tools generating large constant lists: I have sadly seen those hit the max length of a java array, and I have seen more than a few that are in the hundreds of thousands or millions. So your concern is well placed, but since the representation needs more conversation, would it be acceptable to table that discussion for a subsequent PR?
Yeah, we don't need to solve it immediately, but definitely worth a tracking issue because it will bite us eventually. The advantage of going engine data route is then the engine can decide how to layout the data, whether to build a hash table, etc -- which seems very in line with kernel philosophy. I don't know how easily we could abstract that away without going the engine data route.
Not to solution too much here, but I think this would make sense as an `EngineDataLayout` trait of sorts, where certain layout-specific constants are filled in. This would be somewhat of a combination of `OffsetSizeTrait`, `ArrowNativeType`, and `ArrowPrimitiveType`, in which the end implementation denotes layout-specific information about its engine data structure. This also has some interesting applications beyond arrays: for example, duckdb has a row_idx type for its row IDs. It's a uint64_t for them, but for another engine this may not be the case. I think we should reasonably strive to speak the same structure as the engine when dealing with its data, yes?
Thinking more about this situation -- we're choosing to model [NOT] IN as a binary expression, which requires us to introduce some new expression type that can represent the RHS. We could go one of two ways here:
- Introduce a general array literal type, which is transparent to kernel and which [NOT] IN expressions happen to employ as their RHS.
- Introduce a generic in-list type, which is largely opaque to kernel and which is always the RHS of a [NOT] IN expression.
More and more, I wonder if 2/ is a better choice:
- Adding transparent array literals permanently and significantly widens the kernel API. Absent at least a second use case (preferably 3+ use cases), this seems like a lot of work and future liability for small gain?
- Forcing the engine to materialize the in-list as a rust array of (literal, type) pairs will blow up memory usage, and reduces the maximum in-list size an engine could otherwise handle. This is a "compile time" problem, independent of what evaluation strategy the engine employs later.
- Using an opaque specific expression type for IN list allows the engine to do whatever it wants, without having to tell kernel anything. Kernel would just see a trait of some kind which it dutifully and blindly forwards to the engine's expression eval. We might even be able to generalize the "opaque engine expression" concept so that kernel can "handle" engine expression types it otherwise doesn't know about.
The downside of 2/ is it might make data skipping harder, since kernel doesn't have any idea what's "inside the box." It may be that we don't care, because long in-lists may turn out to not be good data skipping targets in the first place (low pruning power). If we did decide the pruning power is worth chasing, then we would need to expand the opaque in-list expression trait with method(s) that can perform data skipping. Probably very doable.
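A rough sketch of what option 2/ could look like, with hypothetical names (`OpaqueInList`, `may_contain_in_range`) and plain `i64` stats for simplicity. Kernel would only see the trait object and forward it to the engine's expression eval; the data-skipping hook is the optional extension mentioned above:

```rust
use std::any::Any;

// Hypothetical sketch of an opaque in-list: kernel sees only this trait and
// blindly forwards the value to the engine's expression evaluator.
pub trait OpaqueInList {
    /// Escape hatch so the engine can downcast back to its concrete type.
    fn as_any(&self) -> &dyn Any;

    /// Optional data-skipping hook: could the list contain any value in
    /// [min, max]? `None` means "unknown", so the file cannot be skipped.
    fn may_contain_in_range(&self, _min: i64, _max: i64) -> Option<bool> {
        None
    }
}

// One engine might back it with a plain Vec; another with a hash table,
// without ever telling kernel which it chose.
pub struct VecIntList(pub Vec<i64>);

impl OpaqueInList for VecIntList {
    fn as_any(&self) -> &dyn Any {
        self
    }
    fn may_contain_in_range(&self, min: i64, max: i64) -> Option<bool> {
        Some(self.0.iter().any(|v| (min..=max).contains(v)))
    }
}

fn main() {
    let list = VecIntList(vec![5, 50, 500]);
    assert_eq!(list.may_contain_in_range(10, 60), Some(true));
    assert_eq!(list.may_contain_in_range(600, 700), Some(false));
}
```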
I'd be okay with data skipping being a bit harder, because another upside of 2/ is that it lets us generalize expressions in a way that allows the engine to decide what is correct. What I mean is that currently there is no subset of expression functions that delta says an expression evaluator must have, or of what you can write in an expression. I've seen some very "interesting" expressions in the wild, and making this opaque to the kernel, save for a trait to evaluate it, would allow engines to define their own semantics here without worrying about whether a function is supported or not. I know this doesn't have much to do with NOT/IN, but it's a point I figured was worth mentioning.
Sounds good! Let's go with the opaque approach and worry about data skipping if/when needed
If we do eventually want to represent in-lists as opaque engine-defined expressions rather than array literals, is there a way to mark `ArrayData` as temporary/unstable/etc so we are able to remove it later without breaking the world? If nothing else, a doc comment explaining the situation might go a long way?
@@ -67,6 +74,20 @@ impl Scalar {
        .try_collect()?;
    Arc::new(StructArray::try_new(fields, arrays, None)?)
}
Array(data) => {
    let values = data.array_elements();
    let vecs: Vec<_> = values.iter().map(|v| v.to_array(num_rows)).try_collect()?;
I'm not 100% confident I understand what this code is doing, but:
It looks like we need to replicate an "array scalar" value into an arrow array suitable for comparing against a batch of primitive values, right? So e.g. if I had `x in (1, 2, 3)`, then `values` would be `[1, 2, 3]`, and we need to explode that into `[[1, 2, 3], [1, 2, 3], ..., [1, 2, 3]]` so that the eventual `evaluate_expression` call can invoke e.g. `in_list`? The documentation for that function is sorely incomplete, but I guess it marches element by element through the two arrays, producing a true output element for each row whose list array element contains the corresponding primitive array element? That would be general enough to handle a correlated subquery, but would be quite space-inefficient for the common case (literal in-list or uncorrelated subquery) where we compare against the same array in every row, especially in case said array is large.
Does arrow rust provide a "scalar" version of `in_list` that takes two primitive arrays instead? So that e.g. `scalar_in_list([1, 2, 3, 4, 5, 6], [1, 3, 5])` returns `[true, false, true, false, true, false]`?
You are correct: we go through and build the offsets. This approach was actually heavily inspired by how datafusion does this, so while I agree a very large list likely doesn't have great performance, there isn't (to my current knowledge) a more idiomatic way to accomplish this. I generally dislike working with the list types in arrow, but to my knowledge arrow does not provide a scalar version of it. You would have to replicate the static value N times, once for each row: a primitive array or generic string array on the left, and a list array on the right.
I mean, we should really build+probe a hash table for all but the smallest in-lists, to avoid paying `O(n**2)` work. It would likely pay for itself for any query that has more than about 100 rows (and smaller queries would run so fast, who cares).
If arrow doesn't give a way to do that, maybe we need to do it ourselves?
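The build+probe idea, sketched over plain slices rather than arrow arrays (the arrow-specific plumbing and null handling are deliberately omitted). The set is built once and probed per row, giving O(n + m) work instead of O(n * m):

```rust
use std::collections::HashSet;

// Sketch of hash-based in-list evaluation over plain i64 slices. A real
// implementation would work over arrow arrays and handle nulls.
fn in_list_hashed(column: &[i64], in_list: &[i64]) -> Vec<bool> {
    // Build the probe set once...
    let probe: HashSet<i64> = in_list.iter().copied().collect();
    // ...then probe each column value in O(1) expected time.
    column.iter().map(|v| probe.contains(v)).collect()
}

fn main() {
    let result = in_list_hashed(&[1, 2, 3, 4, 5, 6], &[1, 3, 5]);
    assert_eq!(result, vec![true, false, true, false, true, false]);
}
```

For very short in-lists a linear scan of the slice would likely beat the hash build, so a length-based cutover (as suggested below) seems reasonable.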
Update: the hash table thing gets pretty clearly into engine territory, so if we can figure out how to offload the optimization to engine [data], that's probably better.
> Update: the hash table thing gets pretty clearly into engine territory, so if we can figure out how to offload the optimization to engine [data], that's probably better.

We're already in engine territory here, since this is the expression evaluator. That said, I think we could make this better. See my comment below where we evaluate `IN`.
Down with the turbo fish Co-authored-by: Ryan Johnson <[email protected]>
This is cool. Left a few comments.
kernel/src/engine/arrow_utils.rs
Outdated
$(
    $data_ty => in_list(
        $left_arr.as_primitive::<$prim_ty>(),
        $right_arr.as_list::<i32>(),
this will `panic!` if `$right_arr` isn't a list (like the user passed `3 in 3` or something like that). We should instead return an error saying that `IN` needs an `ArrayData` type on the right.
},
_,
) => {
    let left_arr = evaluate_expression(left.as_ref(), batch, None)?;
Can't we optimize this quite easily?
If `right` is a `Literal(ArrayData)`, we can loop over `left_arr` ourselves and probe `right`, possibly switching between array vs. hashtable based on the length of `left_arr`.
If `right` is a `Literal(something_else)`, I think that's an error and we can just return.
If `right` is an actual expression, we don't need to do the crazy exploding above, because it'll just evaluate to something and then we can use the `in_list` case.
Possibly I'm missing something, but that seems preferable to the "building of a repeated list column" above.
You were thinking something similar to this?
match (left.as_ref(), right.as_ref()) {
    (_, Column(_)) => { /* Use Arrow in_list */ },
    (Literal(lit), Literal(Scalar::Array(ad))) => { /* Just do the comparison ourselves. */ },
    (_, _) => { /* Some invalid comparison */ }
}
My understanding is that we can encounter one of four cases:
- Literal value vs. array of literal values -- not really meaningful for data skipping since no column stats are implicated (and anyway the engine's query optimizer's constant expression eval should optimize it away).
- Column vs. array of literal values -- the "standard" case, where I believe @nicklan was suggesting to manually search the array for each column value, rather than replicating the array N times.
- Column vs. uncorrelated subquery -- The engine would need to evaluate subqueries before triggering data skipping on the resulting array of literals. Might require some thought on how to integrate that with kernel's data skipping.
- Column vs. correlated subquery -- If the engine is unable to decorrelate the subquery into a join, it would need to evaluate subqueries and produce an arrow-style array of arrays. Probably not possible to integrate with data skipping, tho (because min and max column values are meaningless in this context).
> You were thinking something similar to this?

Yep!
Regarding Ryan's comment: I was indeed discussing the second bullet point case.
For uncorrelated subqueries, I think the engine could/would execute them independently and then pass the result as just an array of literal values in an `IN` predicate.
Correlated subqueries are obviously trickier, but as Ryan notes we probably can't use data skipping there anyway.
Maybe I misunderstood something? This code:
(Literal(lit), Literal(Scalar::Array(ad))) => { /* Just do the comparison ourselves. */ },
would correspond to my first bullet point case, `<literal scalar> IN (<... literal values ...>)`, which could be evaluated at query compile time (no need to read anything off disk, not even stats).
The second bullet point, which I suggested we might want to optimize, would be
(Column(...), Literal(Scalar::Array(ad))) => { /* Just do the comparison ourselves. */ },
To handle this case, we'd need to grab the min/max stats for the column and check whether they bracket any of the values in the array of scalars that represents the in-list. Any file whose stats don't bracket any in-list values can be skipped.
And then, at query eval time, we'd need to march through each scalar value in the column chunk, probing it against the array of scalars, to identify qualifying rows.
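The skipping check described here can be sketched like this, using plain `i64` min/max stats for illustration (the actual stats plumbing is the open question; the function name is hypothetical):

```rust
// Sketch: for `col IN (<values>)`, a file can be skipped exactly when no
// in-list value falls inside the file's [min, max] stats for that column.
fn can_skip_file(min: i64, max: i64, in_list: &[i64]) -> bool {
    !in_list.iter().any(|v| (min..=max).contains(v))
}

fn main() {
    // File stats bracket none of the in-list values: safe to skip.
    assert!(can_skip_file(10, 20, &[1, 2, 30]));
    // 15 falls within [10, 20]: must read the file.
    assert!(!can_skip_file(10, 20, &[15, 99]));
}
```

NOT IN is trickier: min/max stats alone generally cannot prove that every row matches some in-list value, so this check does not simply invert.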
I think you understand this correctly, and I'm having a rough time coming up with a way to handle (column, literal) that doesn't blow up the code complexity a massive amount. It's just that without some re-plumbing there doesn't look to be a good way to feed column statistics into expression evaluation. Would it be acceptable to start with an inefficient impl, and then discuss the larger refactoring we might have to do to support better handling of this?
A couple reactions:
First, data skipping support for in-list is definitely an extra piece of work outside the scope of this PR, so we can defer the necessary stats plumbing to an issue and future work.
Second: The sub-optimal evaluation is (a) in our default engine's expression eval and (b) apparently The Arrow Way ("why that way?" being an obvious but out of scope question). So I agree we can defer being more clever, or possibly even choose not to be clever at all as long as we meet the "don't pessimize" bar.
The bigger priority IMO is ensuring that kernel's general handling of IN-list allows engines to do a better job if they are willing to implement the expression eval themselves. That part gets trickier because engines have a lot of flexibility in how they rewrite and later eval such expressions. For example, I would hope most engines are smart enough to optimize `x in (10, 11)` as `x == 10 OR x == 11`, and some engines (IBM DB2 and sqlite3 among them) choose to convert oversized in-lists into a temp relation with an equijoin clause. In both cases, kernel would never even see an in-list. And for engines that do handle (medium-sized) IN-list directly, I would expect hash tables to figure prominently for their efficient probing. So we'll need to do some thinking about what kinds of in-list eval we expect to see, and try not to accidentally block engines from using their preferred approach.
However, see my other comment below -- maybe we don't want kernel to expose an array literal type in the first place?
…d of nullable pointer
LGTM for now, assuming we can somehow retain the ability to (a) clean up the type matching; and (b) remove the `ArrayData` concept once improved in-lists of the future no longer require it.
Codecov Report
Attention: Patch coverage is
Additional details and impacted files
@@ Coverage Diff @@
##             main     #270   +/-   ##
=======================================
  Coverage        ?   73.60%
=======================================
  Files           ?       43
  Lines           ?     8075
  Branches        ?     8075
=======================================
  Hits            ?     5944
  Misses          ?     1752
  Partials        ?      379
☔ View full report in Codecov by Sentry.
…r an in/not in expression was not actually an arrow list
Very interesting discussions on this PR!!
For me it looks good. The only comment I would have is that we may need to open a few more follow-up issues to keep track; I think I saw only one mentioned, but I feel there were a few more follow-ups coming up in the discussions.
opened #314 to capture centralizing the type mappings, which I think we hadn't done yet.
Changes addressed + Ryan/Robert's approval
I still want to add another test or two for these (my current ones aren't sufficient), but I'm just getting the conversation started here. This is also the first part of partition filtering, for eventually moving delta-rs over to kernel partition filters.