feat: new create_one ExpressionHandler API #662

Open
wants to merge 4 commits into
base: main

zachschuermann
Collaborator

@zachschuermann zachschuermann commented Jan 24, 2025

What changes are proposed in this pull request?

Adds a new create_one API for creating single-row EngineData. It is implemented as a SchemaTransform that turns the given schema plus leaf values into a single-row ArrowEngineData.

  1. Adds the new fn create_one to our ExpressionHandler trait (breaking)
  2. Implements create_one for ArrowExpressionHandler

This PR affects the following public APIs

create_one is a new required method on ExpressionHandler. This PR also adds a new len() method to StructType.

How was this change tested?

Bunch of new unit tests.

@zachschuermann
Collaborator Author

Note: I'll be cleaning up and adding more tests. I wanted to get some eyes on this approach first.

@github-actions bot added the breaking-change label (Change that will require a version bump) Jan 24, 2025
Collaborator

@OussamaSaoudi OussamaSaoudi left a comment

One thing I want to consider/discuss about this approach is that we require the expression struct hierarchy to match the schema's. So a schema Struct(Struct(Scalar(Int))) requires an expression Struct(Struct(Literal(int))). This code wouldn't allow a bare Literal(int) expression.

I don't know if we want to enforce that requirement in the long run. It's very common for kernel to flatten out the fields of a schema (e.g., in a visitor), so I don't see why we shouldn't allow flattened expressions.

Perhaps this acts as a safety check: kernel is the only caller of create_one, and the requirement ensures that things are nested as we expect.
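To make the trade-off concrete, here is a minimal sketch of the two matching policies. It uses toy stand-ins only: the Schema and Expr enums and the functions matches_nested, leaf_count, flatten, and matches_flattened are all hypothetical and are not the kernel's types or the code in this PR.

```rust
// Toy stand-ins for the kernel's schema and expression types (hypothetical).
#[derive(Debug, Clone, PartialEq)]
enum Schema {
    Int,
    Struct(Vec<Schema>),
}

#[derive(Debug, Clone, PartialEq)]
enum Expr {
    Literal(i32),
    Struct(Vec<Expr>),
}

/// Strict policy: the expression tree must mirror the schema tree exactly.
fn matches_nested(schema: &Schema, expr: &Expr) -> bool {
    match (schema, expr) {
        (Schema::Int, Expr::Literal(_)) => true,
        (Schema::Struct(fields), Expr::Struct(exprs)) => {
            fields.len() == exprs.len()
                && fields.iter().zip(exprs).all(|(f, e)| matches_nested(f, e))
        }
        _ => false,
    }
}

/// Number of leaf (non-struct) fields in a schema.
fn leaf_count(schema: &Schema) -> usize {
    match schema {
        Schema::Int => 1,
        Schema::Struct(fields) => fields.iter().map(leaf_count).sum(),
    }
}

/// Collect the literals of an expression in depth-first order.
fn flatten(expr: &Expr, out: &mut Vec<i32>) {
    match expr {
        Expr::Literal(v) => out.push(*v),
        Expr::Struct(children) => {
            for child in children {
                flatten(child, out);
            }
        }
    }
}

/// Relaxed policy: only the number of leaves has to agree with the schema.
fn matches_flattened(schema: &Schema, expr: &Expr) -> bool {
    let mut leaves = Vec::new();
    flatten(expr, &mut leaves);
    leaves.len() == leaf_count(schema)
}
```

Under the strict policy a bare Literal is rejected for a doubly nested schema, while the relaxed policy accepts both shapes. The relaxed variant gives up the structural sanity check mentioned above.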

Comment on lines +989 to +993
let actual_rb: RecordBatch = actual
    .into_any()
    .downcast::<ArrowEngineData>()
    .unwrap()
    .into();
Collaborator

It might be worth moving these lines, the handler, and the create_one call into a helper function. Something like:

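// Parameter and return types are assumed from the surrounding tests.
fn test_create_one(schema: Arc<StructType>, expr: Expression) -> RecordBatch {
    let handler = ArrowExpressionHandler;
    let actual = handler.create_one(schema, &expr).unwrap();
    // Downcast the returned engine data back to arrow and unwrap the batch.
    actual
        .into_any()
        .downcast::<ArrowEngineData>()
        .unwrap()
        .into()
}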

match expr {
    // simple case: for literals, just create a single-row array and ensure the data types match
    Expression::Literal(scalar) => {
        let array = scalar.to_array(1)?;
Collaborator

We lose the nullability info in field.is_nullable(). Should we add a check:

if !field.is_nullable() && array.is_null(0) {
   // return error
}
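
A minimal sketch of that check, written against hypothetical stand-ins (Field, LeafValue, and check_nullability are illustrative names, not kernel or arrow types). The real code would inspect the arrow array via array.is_null(0) instead of an Option:

```rust
/// Hypothetical stand-in for a schema field; only the parts needed here.
struct Field {
    name: String,
    nullable: bool,
}

/// Hypothetical single-row leaf value: `None` models a null literal.
type LeafValue = Option<i64>;

/// Reject a null value for a field that is declared non-nullable.
fn check_nullability(field: &Field, value: &LeafValue) -> Result<(), String> {
    if !field.nullable && value.is_none() {
        return Err(format!(
            "field '{}' is not nullable but got a null value",
            field.name
        ));
    }
    Ok(())
}
```

Returning an error here, rather than silently building the array, keeps the single-row output consistent with the schema it claims to have.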

#[test]
fn test_create_one_string() {
    let expr = Expression::struct_from([Expression::literal("a")]);
    let schema = Arc::new(crate::schema::StructType::new([StructField::new(
Collaborator

Given we use schema::StructType several times, I think we should import it.

    Expression::literal(3),
]);
let schema = Arc::new(crate::schema::StructType::new([
    StructField::new("a", DeltaDataTypes::INTEGER, true),
Collaborator

After a rebase, I think these can switch to StructField::nullable / StructField::not_null.

Collaborator

@scovich scovich left a comment

quick pass, couple reactions
(overall approach looks good)

Comment on lines +583 to +584
let scalar = &self.scalars[self.next_scalar_idx];
self.next_scalar_idx += 1;
Collaborator

Missing a type check?

let DataType::Primitive(scalar_type) = scalar.data_type() else { /* boom */ };
require!(scalar_type == ptype, /* boom */);
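
Spelled out as a self-contained sketch, the check could look roughly like this. PrimitiveType, Scalar, LeafCursor, and next_leaf are simplified, hypothetical stand-ins for illustration; they are not the kernel's definitions or the code that was eventually added:

```rust
// Simplified, hypothetical stand-ins for the kernel's primitive types and scalars.
#[derive(Debug, Clone, Copy, PartialEq)]
enum PrimitiveType {
    Integer,
    String,
}

#[derive(Debug, Clone, PartialEq)]
enum Scalar {
    Integer(i32),
    String(String),
}

impl Scalar {
    fn primitive_type(&self) -> PrimitiveType {
        match self {
            Scalar::Integer(_) => PrimitiveType::Integer,
            Scalar::String(_) => PrimitiveType::String,
        }
    }
}

/// Hands out scalars one leaf at a time, in schema order.
struct LeafCursor<'a> {
    scalars: &'a [Scalar],
    next_scalar_idx: usize,
}

impl<'a> LeafCursor<'a> {
    /// Take the next scalar and verify it has the type the schema leaf expects.
    fn next_leaf(&mut self, expected: PrimitiveType) -> Result<&'a Scalar, String> {
        let idx = self.next_scalar_idx;
        let scalars: &'a [Scalar] = self.scalars;
        // Bounds-checked access instead of indexing, so a short scalar list is an error.
        let scalar = scalars
            .get(idx)
            .ok_or_else(|| format!("no scalar left for leaf {}", idx))?;
        self.next_scalar_idx += 1;
        let actual = scalar.primitive_type();
        if actual != expected {
            return Err(format!(
                "type mismatch at leaf {}: expected {:?}, got {:?}",
                idx, expected, actual
            ));
        }
        Ok(scalar)
    }
}
```

Using get rather than direct indexing also turns a too-short scalar list into an error instead of a panic.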

Collaborator Author

added!

}

fn transform_struct_field(&mut self, field: &'a StructField) -> Option<Cow<'a, StructField>> {
    self.recurse_into_struct_field(field)
Collaborator

@scovich scovich Jan 24, 2025

Watch out: the struct field (or the containing array/map) decides whether the child type is nullable
(any nullable ancestor makes the field nullable).
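
A rough sketch of that propagation, assuming a toy schema model (DataType, Field, and leaf_nullability here are hypothetical and cover only integers and structs, not arrays or maps):

```rust
// Toy schema model (hypothetical); arrays and maps are left out for brevity.
enum DataType {
    Integer,
    Struct(Vec<Field>),
}

struct Field {
    name: String,
    nullable: bool,
    data_type: DataType,
}

/// Walk the schema and record, for every leaf, whether it is effectively nullable:
/// a leaf is nullable if it or any of its ancestors is nullable.
fn leaf_nullability(
    fields: &[Field],
    prefix: &str,
    ancestor_nullable: bool,
    out: &mut Vec<(String, bool)>,
) {
    for field in fields {
        let path = if prefix.is_empty() {
            field.name.clone()
        } else {
            format!("{}.{}", prefix, field.name)
        };
        let effective = ancestor_nullable || field.nullable;
        match &field.data_type {
            DataType::Integer => out.push((path, effective)),
            DataType::Struct(children) => leaf_nullability(children, &path, effective, out),
        }
    }
}
```

For a nullable struct a containing a non-null field b, the leaf a.b comes out as nullable, which is the case the comment above warns about.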

Collaborator Author

still need to add this - will get to it soon


codecov bot commented Jan 28, 2025

Codecov Report

Attention: Patch coverage is 86.41509% with 36 lines in your changes missing coverage. Please review.

Project coverage is 84.28%. Comparing base (3305d3a) to head (f5471b1).

Files with missing lines                 Patch %   Lines
kernel/src/engine/arrow_expression.rs    87.73%    32 Missing ⚠️
kernel/src/schema/mod.rs                 0.00%     4 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #662      +/-   ##
==========================================
+ Coverage   84.22%   84.28%   +0.06%     
==========================================
  Files          77       77              
  Lines       17694    17959     +265     
  Branches    17694    17959     +265     
==========================================
+ Hits        14902    15136     +234     
- Misses       2080     2113      +33     
+ Partials      712      710       -2     

