feat!: new null_row ExpressionHandler API
#662
Conversation
Note: I'll be cleaning up/adding more tests. Wanted to get some eyes on this approach first.
One thing I want to consider/discuss about this approach is that we require the expression struct hierarchy to match the schema's. So a schema Struct(Struct(Scalar(Int))) requires an expression Struct(Struct(Literal(int))); this code wouldn't allow a flat Literal(int) expression.
Idk if we want to enforce that requirement in the long run? It's very common for kernel to flatten out the fields of a schema (e.g. in a visitor), so I don't see why we shouldn't allow flattened expressions.
Perhaps this acts as a safety thing: kernel is the only one calling create_one, and it ensures that things are nested as we expected.
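For concreteness, a minimal sketch of that nesting requirement, using the Expression/StructType constructors that appear elsewhere in this PR (exact constructor names are assumptions):

// Hypothetical sketch: a schema of shape Struct(Struct(Int))...
let schema = StructType::new(vec![StructField::not_null(
    "outer",
    StructType::new(vec![StructField::not_null("inner", DataType::INTEGER)]),
)]);
// ...currently requires an expression with matching nesting:
let nested = Expression::struct_from([Expression::struct_from([Expression::literal(1)])]);
// ...while a flattened expression with the same single leaf would be rejected:
let flattened = Expression::literal(1);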
quick pass, couple reactions
(overall approach looks good)
Codecov Report
Attention: Patch coverage is …
Additional details and impacted files
@@ Coverage Diff @@
## main #662 +/- ##
==========================================
+ Coverage 84.35% 84.64% +0.29%
==========================================
Files 81 82 +1
Lines 19253 19735 +482
Branches 19253 19735 +482
==========================================
+ Hits 16241 16705 +464
- Misses 2209 2214 +5
- Partials 803 816 +13
Flushing comments from an interrupted-and-forgotten review...
match self.stack.pop() {
    Some(array) => Ok(array),
    None => Err(Error::generic("didn't build array")),
}
Relating to the other FIXME about panicking:

Original:
match self.stack.pop() {
    Some(array) => Ok(array),
    None => Err(Error::generic("didn't build array")),
}

Suggested:
let Some(array) = self.stack.pop() else {
    return Err(Error::generic("didn't build array"));
};
let Some(array) = array.as_struct_opt() else {
    return Err(Error::generic("not a struct"));
};
Ok(array)
I think as_struct_opt will return an &StructArray - and I want to avoid having to clone that
just ended up checking array.data_type(), though I wonder if it would be better to actually return an Arc<StructArray> instead of the trait object ArrayRef?
for (child, field) in child_arrays.iter().zip(struct_type.fields()) {
    if !field.is_nullable() && child.is_null(0) {
        // if we have a null child array for a not-nullable field, either all other
        // children must be null (and we make a null struct) or error
        if child_arrays.iter().all(|c| c.is_null(0))
            && self.nullability_stack.iter().any(|n| *n)
        {
            self.stack.push(Arc::new(StructArray::new_null(fields, 1)));
            return Some(Cow::Borrowed(struct_type));
        } else {
            self.set_error(Error::Generic(format!(
                "Non-nullable field {} is null in single-row struct",
                field.name()
            )));
            return None;
        }
    }
}
Hmm i'm not convinced by this. Pls correct me if I'm missing something! We're keeping track of the parent nullability with the nullability stack. Seems that we allow a nullability violation if any ancestor node is nullable and all the children are null. But I may have found a counter example:
Consider this schema
{
    x (nullable): {
        a (non-nullable),
        b (non-nullable) {
            c (non-nullable)
        }
    }
}
Suppose we get the scalars: [1, NULL].
When we're processing struct b, we'll iterate over all of its fields. We'll find that c is null when it's non-nullable. At b, I think the nullability stack would be [true, false] from x and b respectively.
Given all these, we don't return an error. We allowed c to be null because we thought its ancestor x is null. That's this check:
if child_arrays.iter().all(|c| c.is_null(0)) && self.nullability_stack.iter().any(|n| *n)
But if x is null, then a should also be null, which it isn't.
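To make the counterexample concrete, the schema and leaf inputs would look roughly like this (constructor names assumed, as elsewhere in this thread):

let schema = StructType::new(vec![StructField::nullable(
    "x",
    StructType::new(vec![
        StructField::not_null("a", DataType::INTEGER),
        StructField::not_null(
            "b",
            StructType::new(vec![StructField::not_null("c", DataType::INTEGER)]),
        ),
    ]),
)]);
// Leaf scalars in depth-first schema order: a = 1, c = NULL
let values = [Scalar::Integer(1), Scalar::Null(DataType::INTEGER)];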
Assuming I'm not missing something, I thought up an alternate solution.
Definitions
We should fail if there is a nullability violation. Nullability violations can happen in 2 cases:
- Base case: a leaf field is non-nullable, but the value is null.
- Struct case: a struct has a nullability violation if both hold:
  - at least one of its children has a nullability violation
  - the struct does not resolve the nullability violation.
A nullability violation for a struct node is resolved when both hold:
1) all of its children are null
2) the node is nullable.
This is the case where the entire struct is null. All of its children may be null, and violations can be safely ignored.
Solution
We keep track of 2 variables for each node:
- null_subtree: true if the node and all its descendants are null.
- null_violation: true if the node has a nullability violation (as defined above).
And an additional variable for struct nodes:
- is_resolved: true if the node is nullable and its null_subtree is true.
Base case:
- null_subtree = true if the leaf is null
- null_violation = true if the field is non-nullable, but the value is null
Inductive case:
- null_subtree = true if all the children are null
- is_resolved = true if null_subtree and the current node is nullable
- null_violation = true if (any child has null_violation) and !(is_resolved)
Return an error at the top level if null_violation == true.
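As a self-contained illustration of those rules on a toy tree model (not the kernel's actual SchemaTransform; this is just a sketch of the definitions above, and the next comment pokes a hole in exactly these rules):

// Toy model: a node is either a leaf value (possibly null) or a struct of children,
// where each child carries the nullability declared by its field.
enum Node {
    Leaf { is_null: bool },
    Struct { children: Vec<(bool /* field nullable */, Node)> },
}

// Returns (null_subtree, null_violation) for a node whose field is declared `nullable`.
fn check(node: &Node, nullable: bool) -> (bool, bool) {
    match node {
        // Base case: a null leaf is a null subtree; it is a violation if the field is non-nullable.
        Node::Leaf { is_null } => (*is_null, *is_null && !nullable),
        Node::Struct { children } => {
            let mut null_subtree = true;
            let mut child_violation = false;
            for (child_nullable, child) in children {
                let (ns, nv) = check(child, *child_nullable);
                null_subtree &= ns;
                child_violation |= nv;
            }
            // is_resolved: the whole subtree is null and this node is allowed to be null.
            let is_resolved = null_subtree && nullable;
            (null_subtree, child_violation && !is_resolved)
        }
    }
}
// At the root: error out if the returned null_violation is true.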
I may have found a counter example:
Ignoring all code for a moment, and tweaking slightly to add d as a sibling to c:
x (nullable): { a (non-nullable), b (non-nullable) { c (non-nullable), d (nullable) } }
Analysis
At the time we encounter NULL for c, there are only two possible outcomes:
- b is non-NULL => definitely an error
- b is NULL => possibly allowed (depending on whether b is allowed to be NULL, which in turn depends on whether x is NULL)
However, we are doing a depth-first traversal. So at the time we process e.g. c we have not even seen d yet, let alone processed parent b and grandparent x. The stack is [a:<whatever>, c:NULL].
Since we cannot yet know the correct handling of c, we just push its NULL value on the stack and move on to d (which we also just push onto the stack). Once the recursion unwinds to b, we have two possibilities:
- [a:<whatever>, c:NULL, d:NULL] -- because all children of b are NULL (c and d), and at least one of those children is "immediately" non-nullable, we assume the intent was to express (by transitivity) the fact that b itself is NULL (recall that b is not a leaf so we can't represent its nullness directly). Result: [a:<whatever>, b:NULL]. Whether that's good or bad is still to be determined transitively as the recursion unwinds.
- [a:<whatever>, c:NULL, d:<something>] -- because d is non-NULL, we know b cannot be NULL and therefore it is an error for "immediately" non-nullable c to be NULL. Result: **ERROR**.
Assuming we did not already error out, we again have two possibilities:
- [a:NULL, b:NULL] -- as before, all children of x are NULL (a and b), and at least one of those children is "immediately" non-nullable, so we assume the intent was to express that x is NULL. Since x is immediately nullable, this is totally legitimate and the recursion completes successfully.
- [a:<something>, b:NULL] -- again as before, x cannot be NULL because it has a non-NULL child a. So a NULL value for "immediately" non-nullable b is illegal and the recursion errors out.
Coming back to code:
The recursive algorithm would seem to be:
- For all leaf values, accept NULL values unconditionally, deferring correctness checks to the parent.
- Whenever the recursion unwinds to reach a (now complete) struct node, examine the children. We have several possible child statuses:
  - All children non-NULL -- no problem, nothing to see here, move on.
  - All children NULL:
    - If all children are nullable, this is fine, and we interpret the parent as non-NULL with all-null children.
    - Otherwise, we interpret this as an indirect way of expressing that the parent itself is NULL. As with a leaf value, we accept that NULL value unconditionally, deferring correctness checks to the parent.
  - Otherwise, we have a mix of NULL and non-NULL children. The parent thus cannot be NULL:
    - If any of the NULL children are immediately non-nullable => ERROR
    - Otherwise, no problem, nothing to see here, move on.
If we consider all combos of the above schema that involve at least one NULL:
- [a:<something>, c:<something>, d:NULL] - OK (x.b.d is nullable)
- [a:<something>, c:NULL, d:<something>] - ERROR (x.b.c is non-nullable, detected by b)
- [a:NULL, c:<something>, d:<something>] - ERROR (x.a is non-nullable, detected by x)
- [a:<something>, c:NULL, d:NULL] - ERROR (x.b is non-nullable, detected by x)
- [a:NULL, c:<something>, d:NULL] - ERROR (x.a is non-nullable, detected by x)
- [a:NULL, c:NULL, d:<something>] - ERROR (x.b.c is non-nullable, detected by b)
- [a:NULL, c:NULL, d:NULL] - OK (x is nullable)
Notably, I don't think we need a stack to track nullability -- each parent just verifies its direct children for correct match-up of their nullability (and NULL values) vs. its own nullability. If there is no obvious local conflict, it makes itself either NULL or non-null as appropriate and then trusts its parent to do the same checking as needed.
Code
fn transform_struct(&mut self, struct_type: &'a StructType) -> Option<Cow<'a, StructType>> {
    // NOTE: This is an optimization; the other early-return suffices to produce correct behavior.
    if self.error.is_some() {
        return None;
    }
    // Only consume newly-added entries (if any). There could be fewer than expected if
    // the recursion encountered an error.
    let mark = self.stack.len();
    let _ = self.recurse_into_struct(struct_type);
    let field_values = self.stack.split_off(mark);
    if self.error.is_some() {
        return None;
    }
    require!(field_values.len() == struct_type.len(), ...);
    let mut found_non_nullable_null = false;
    let mut all_null = true;
    for (f, v) in struct_type.fields().zip(&field_values) {
        if v.is_valid(0) {
            all_null = false;
        } else if !f.is_nullable() {
            found_non_nullable_null = true;
        }
    }
    let null_buffer = found_non_nullable_null.then(|| {
        // The struct had a non-nullable NULL. This is only legal if all fields were NULL, which we
        // interpret as the struct itself being NULL.
        require!(all_null, ...);
        // We already have the all-null columns we need, just need a null buffer
        NullBuffer::new_null(1)
    });
    // Assemble the struct normally but mark it NULL? Or make a NULL struct directly?
    let sa = match StructArray::try_new(..., null_buffer) { ... };
    self.stack.push(sa);
    None
}
For completeness of testing, we probably need a schema that exercises every possible combo of fields, along with one set of leaf scalars for every possible combo of NULL and non-NULL.
There are six "interesting" combos (n = nullable, ! = non-null):
n { n, n }
n { n, ! }
n { !, ! }
! { n, n }
! { n, ! }
! { !, ! }
Each one can have 4 distinct input value combinations, for a total of 6x4 = 24 cases to test.
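A rough sketch of enumerating those cases mechanically (the helper names in the comments are made up; the PR ends up doing this with a test macro instead):

// The six parent/child nullability combos: (parent_nullable, a_nullable, b_nullable).
let combos = [
    (true, true, true),
    (true, true, false),
    (true, false, false),
    (false, true, true),
    (false, true, false),
    (false, false, false),
];
// Each combo gets all four null/non-null patterns for the two leaves: 6 x 4 = 24 cases.
for (parent_nullable, a_nullable, b_nullable) in combos {
    for (a_is_null, b_is_null) in [(false, false), (false, true), (true, false), (true, true)] {
        // build_schema / run_create_one / expected_outcome are hypothetical helpers:
        // let schema = build_schema(parent_nullable, a_nullable, b_nullable);
        // assert_eq!(run_create_one(&schema, a_is_null, b_is_null).is_ok(),
        //            expected_outcome(parent_nullable, a_nullable, b_nullable, a_is_null, b_is_null));
        let _ = (parent_nullable, a_nullable, b_nullable, a_is_null, b_is_null);
    }
}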
aside: What happens when a struct is non-nullable, but all its children are nullable? Does this mean that we enforce that at least one of the children is non-null?
One last thing I wanted to flag: my original 'nullability stack' started off with 'false' for the root (the root struct array must not be null in order to create a RecordBatch out of it). In the new approach, it's slightly more general and could produce a NULL top-level StructArray which is unable to become a RecordBatch, so I've introduced just a simple one-off check that will cause create_one to fail if the transform hands back a NULL StructArray.
aside: I'm not sure why there isn't just an easy API for StructArray to RecordBatch that doesn't panic..? Am I missing it?
what is the expected behavior in this case? do we need to treat top-level NULLs differently? I would expect the following to fail but it seems that arrow disagrees...
x: (not_null) {
    a: (nullable) LONG,
    b: (not_null) LONG,
}
if values = [Null, Null], we get the "all null" struct collapsing at level a,b. this gives x: (not_null) { NULL }
if we consider all-null children to always be safe, this will also simplify to just a single top-level NULL (feels incorrect)
for some additional context it seems arrow will happily create a StructArray with a not-null field if the null buffer passed in to try_new contains all of the corresponding child array's nulls.
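A small arrow-only sketch of what that masking behavior looks like (my reading of arrow-rs's StructArray::try_new validation; worth double-checking against the version we pin):

use std::sync::Arc;
use arrow::array::{ArrayRef, Int64Array, StructArray};
use arrow::buffer::NullBuffer;
use arrow::datatypes::{DataType, Field, Fields};

// Field "b" is declared non-nullable, but its child array holds a null in row 0.
let fields = Fields::from(vec![Field::new("b", DataType::Int64, false)]);
let child: ArrayRef = Arc::new(Int64Array::from(vec![None::<i64>]));
// With a struct-level null buffer marking row 0 null, the child's null is "masked"
// by the parent, and try_new accepts it.
let masked = StructArray::try_new(fields.clone(), vec![child.clone()], Some(NullBuffer::new_null(1)));
assert!(masked.is_ok());
// Without that mask, the unmasked null in a non-nullable field is rejected.
let unmasked = StructArray::try_new(fields, vec![child], None);
assert!(unmasked.is_err());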
my original 'nullability stack' started off with 'false' for the root (the root struct array must not be null in order to create a RecordBatch out of it). In the new approach, it's slightly more general and could produce a NULL top-level StructArray which is unable to become a RecordBatch
That's definitely annoying, and possibly a good reason to keep old behavior that all-null only translates to null struct if some fields are non-nullable...
it seems arrow will happily create a StructArray with a not-null field if the null buffer passed in to try_new contains all of the of the corresponding child array's nulls.
Right, this is similar to our recursive algo -- whether that null top-level value is bad depends on the parent. For example, record batch as a parent does not like top-level NULL, but a nullable field as a parent is totally fine.
i went ahead and reverted to the "only make struct null if required" semantics - I've also documented the exhaustive list of test cases in the description (and implemented)
Just thinking out loud here... I do think we can already do a lot of data generation using the existing expression API. The main thing that is missing is the ability to communicate the desired number of rows in … The code below produces data much like we want it to.

let add_expr = Expression::struct_from([
    Expression::literal("file:///path"),
    Expression::literal(100),
    Expression::literal(Scalar::Null(DeltaDataTypes::INTEGER)),
]);
let schema = StructType::new(vec![
    StructField::new("path", DeltaDataTypes::STRING, false),
    StructField::new("size", DeltaDataTypes::INTEGER, false),
    StructField::new("size_null", DeltaDataTypes::INTEGER, true),
]);
let dummy_schema = Schema::new(vec![Field::new("a", DataType::Boolean, false)]);
let dummy_batch = RecordBatch::try_new(
    Arc::new(dummy_schema),
    vec![Arc::new(BooleanArray::from(vec![true]))],
)
.unwrap();
let handler = ArrowExpressionHandler {};
let evaluator = handler.get_evaluator(schema.clone().into(), add_expr, schema.into());
let data = Box::new(ArrowEngineData::new(dummy_batch));
let result = evaluator.evaluate(data.as_ref()).unwrap();
let result = result
    .any_ref()
    .downcast_ref::<ArrowEngineData>()
    .unwrap()
    .record_batch()
    .clone();
print_batches(&[result]).unwrap();

As the implementation we expect engines to provide for expression evaluation, I wonder if it is simpler for the engine if we use the expression mechanics and maybe add a method … The current approach here feels more explicit, but would also incur more work for engines wanting to adopt?
Interesting. If I try to distill/refine the idea, is it basically this? …
(***) The ideal "dummy" engine data would have no columns, but arrow probably doesn't allow that. So the next best would wrap a …
let (fields, columns, nulls) = applied.into_parts();
if let Some(nulls) = nulls {
    if nulls.null_count() != 0 {
        return Err(Error::invalid_struct_data(
            "Top-level nulls in struct are not supported",
        ));
    }
}
Ok(RecordBatch::try_new(
    Arc::new(ArrowSchema::new(fields)),
    columns,
)?)
I included this change so that we could leverage the existing arrow_expression infra with the new changes. without this we will panic within arrow on some of the tests I have for top-level nulls (instead of just returning an error)
We currently workaround this limitation for "scalar" expressions elsewhere in the code by always embedding them in a dummy struct?
correct!
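Roughly what that workaround looks like, as a sketch (the real code path lives elsewhere in the crate; names as used earlier in this thread):

// Evaluating a bare literal would produce a non-struct result that can't become a RecordBatch...
let bare = Expression::literal(100);
// ...so it gets embedded in a single-field struct expression, making the evaluated
// output a StructArray that converts cleanly.
let wrapped = Expression::struct_from([bare]);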
/// `values`.
// Note: we will stick with a Schema instead of DataType (more constrained can expand in
// future)
fn create_one(&self, schema: SchemaRef, values: &[Scalar]) -> DeltaResult<Box<dyn EngineData>> {
Yeah, I'd keep this pub(crate) for now if we don't know we need it. We can always move it to the trait if someone asks, but it's much harder to go the other way.
lgtm, thanks!
/// `values`.
// Note: we will stick with a Schema instead of DataType (more constrained can expand in
// future)
fn create_one(&self, schema: SchemaRef, values: &[Scalar]) -> DeltaResult<Box<dyn EngineData>> {
Hrmm, odd. Yeah let's make an issue just so we don't completely forget it
LGTM! Just a nit to maybe think about (sometime :)).
/// Any error for [`LiteralExpressionTransform`]
#[derive(thiserror::Error, Debug)]
pub enum Error {
While I do very much like the pattern of having dedicated and very specific errors in sub-modules, I also learned (the hard way :)) that this sometimes ends up in a nested mess ... One thing that worked in the past is to use such errors, but not expose them in the top-level error and make the struct non-pub.
This is likely a thing for a follow-up though, if others feel the same.
Honestly, I suspect we need to do what rust has always done, and what spark is now doing after many years of arbitrary expression hierarchies: have a single error class that encapsulates a "soft" hierarchy of error classification codes (which are traditionally strings satisfying the regexp [0-9A-Z]+, i.e. all-caps alphanumeric). Easier to extend, easier to document, etc.
NOTE: This approach does not prevent us from internally using and defining exception hierarchies, enums, etc. It just makes the crate a lot easier to deal with because adding new private exception types is no longer a breaking change.
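A sketch of what that single-error-class shape could look like (entirely hypothetical names, not an existing kernel API):

// One public error type; the "hierarchy" lives in string classification codes, so new
// internal error kinds don't require new public enum variants (i.e. no breaking change).
#[derive(Debug)]
pub struct KernelError {
    /// All-caps alphanumeric classification code, e.g. "LITERAL_SCHEMA_MISMATCH".
    code: &'static str,
    message: String,
}

impl KernelError {
    pub fn new(code: &'static str, message: impl Into<String>) -> Self {
        Self { code, message: message.into() }
    }
}

impl std::fmt::Display for KernelError {
    fn fmt(&self, f: &mut std::fmt::Formatter<'_>) -> std::fmt::Result {
        write!(f, "[{}] {}", self.code, self.message)
    }
}

impl std::error::Error for KernelError {}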
yea and this dovetails nicely with the fact that we need to figure out how to continue increasing our possible Error cases without breaking changes every time
in the short-term: I wonder if I can make this a private error and then wrap it all up in some public one just via to_string?
Should we make a tracking ticket for the error code idea? Or do we already have one?
@zachschuermann I personally prefer Box<dyn Error> for wrapping errors instead of an err.msg. That way you can choose which level of info you get with a Debug print or a Display print. Also you can trace the error if there is a lineage of error sources.
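For example (a sketch only, with a made-up variant name), thiserror lets a wrapping variant keep the boxed source, so both the Display message and the Error::source() chain stay available:

#[derive(thiserror::Error, Debug)]
pub enum Error {
    /// Wraps the underlying error rather than flattening it to a message string,
    /// so callers can walk err.source() or Debug-print the full chain.
    #[error("literal expression transform failed: {0}")]
    LiteralTransform(#[source] Box<dyn std::error::Error + Send + Sync>),
}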
StructField::nullable("b", DataType::LONG), | ||
StructField::not_null("b", DataType::LONG), | ||
StructField::nullable("c", DataType::LONG), | ||
StructField::nullable("c", DataType::LONG), |
This seems like a pretty sketchy scenario... should we at least document the expected behavior? Do we keep the first or the last version for each field name?
yea I ran across this and was surprised. I wonder if we should instead error or warn if construction of a StructType with duplicate field names is attempted? does SQL generally allow you to have columns named the same if they differ in metadata etc.? (i'll look into this)
I'm not sure SQL has an opinion. You could do e.g.
SELECT 1 as x, 2 as x, 3 as x
and I learned the hard way once that spark does not block structs with duplicate names. Not sure if that is a spark bug or if spark assumes that you know what you're doing in such cases? For example, field ids could potentially allow distinguishing same-named fields in a struct, tho spark knows nothing about those.
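A minimal sketch of the duplicate-name validation floated above, independent of how StructType construction actually ends up looking:

use std::collections::HashSet;

// Hypothetical check that could run when building a StructType: reject duplicate field names.
fn check_duplicate_field_names<'a>(names: impl IntoIterator<Item = &'a str>) -> Result<(), String> {
    let mut seen = HashSet::new();
    for name in names {
        if !seen.insert(name) {
            return Err(format!("duplicate field name in StructType: {name}"));
        }
    }
    Ok(())
}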
    }
}

fn set_error(&mut self, e: Error) {
If we make this return None, then all the call sites simplify. For example, check_error below turns into:
match result {
    Ok(val) => Some(val),
    Err(err) => self.set_error(err.into()),
}
Is the "cleverness" worth it?
hm don't think type inference is good enough (would have to turbofish unit type or something) - i've made one of the edits lmk what you think
LGTM. Several nits/simplifications to consider before merging.
    return Err(e);
}
pub(crate) fn try_into_expr(mut self) -> Result<Expression, Error> {
    self.error?;
I guess no let _ = needed because the compiler knows it's unit?
yep exactly, no expectation to consume unit type but if it was another type you'd get the "warn unused"
fn transform_array(&mut self, _array_type: &'a ArrayType) -> Option<Cow<'a, ArrayType>> {
    self.set_error(Error::Unsupported(
No error check? Seems like the debug! log isn't especially helpful if we know it could easily be triggered here?
(again below)
ah yea makes sense - added back the error checks
    };
}

macro_rules! test_nullability_combinations {
wow, fancy! 🤯
What changes are proposed in this pull request?
Adds a new required method: a new_null API for creating a new single-row null-literal EngineData. Then, we provide the create_one API for creating single-row EngineData by implementing a SchemaTransform (LiteralExpressionTransform) to transform the given schema + leaf values into an Expression which evaluates to literal values at the leaves of the schema (implemented in a new private ExpressionHandlerExtension trait).
- fn new_null added to our ExpressionHandler trait (breaking)
- fn create_one added to an ExpressionHandlerExtension trait
- new_null implemented for ArrowExpressionHandler
Additionally, adds a new fields_len() method to StructType.
This PR affects the following public APIs
- new_null API for ExpressionHandler
- LiteralExpressionTransformError
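A hedged sketch of how kernel-internal code might use the new API end to end, based on the create_one signature shown above (SchemaRef as an Arc'd StructType, Scalar::from for string scalars, and the constructor names are assumptions):

let handler = ArrowExpressionHandler {};
let schema: SchemaRef = Arc::new(StructType::new(vec![
    StructField::not_null("path", DataType::STRING),
    StructField::nullable("size", DataType::LONG),
]));
// One leaf scalar per field, in schema order; nullable fields may be Scalar::Null.
let values = [Scalar::from("file:///path"), Scalar::Null(DataType::LONG)];
let single_row = handler.create_one(schema, &values).unwrap();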
How was this change tested?
Bunch of new unit tests. For the nullability tests of our new SchemaTransform we came up with a set of 24 exhaustive test cases: