Expression evaluators do not handle nested non-nullable fields in nullable structs. #578

Open
OussamaSaoudi-db opened this issue Dec 7, 2024 · 0 comments
Labels
bug Something isn't working

Describe the bug

The Arrow expression evaluator used by the default and sync engines does not handle non-nullable fields nested inside a nullable struct. If every field of a struct is null in a given row (including fields declared non-nullable) and the top-level struct is nullable, the evaluator must allow the output struct itself to be null for that row.

Consider the schema

        let nested = StructType::new([StructField::new("path", DeltaDataTypes::STRING, false)]);
        let schema = StructType::new([StructField::new("add", nested, true)]);

For engine data

[Some("fake_path"), None, Some("other_fake_path")]

I expect the following:

[ Add { path: "fake_path" }, null, Add { path: "other_fake_path" }]
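
To make the expected row-level semantics concrete, here is a minimal sketch in plain Rust that does not use arrow or the kernel at all. The `Add` type and the `assemble_add_column` and `render` helpers are hypothetical names introduced only for illustration. The sketch assumes the struct has a single child, so a null `path` means the whole `add` struct is null in that row.

```rust
/// Plain-Rust stand-in for one non-null `add` struct.
/// Inside a non-null `Add`, `path` is always present (non-nullable).
#[derive(Debug, Clone, PartialEq)]
struct Add {
    path: String,
}

/// Hypothetical helper: build the nullable `add` column from raw `path` values.
/// A row whose only child value is null becomes a null struct (`None`).
fn assemble_add_column(paths: &[Option<&str>]) -> Vec<Option<Add>> {
    paths
        .iter()
        .map(|p| p.map(|s| Add { path: s.to_string() }))
        .collect()
}

/// Render the column in the same notation used in this issue.
fn render(column: &[Option<Add>]) -> String {
    let rows: Vec<String> = column
        .iter()
        .map(|row| match row {
            Some(add) => format!("Add {{ path: {:?} }}", add.path),
            None => "null".to_string(),
        })
        .collect();
    format!("[ {}]", rows.join(", "))
}
```

Calling `render(&assemble_add_column(&[Some("fake_path"), None, Some("other_fake_path")]))` yields the expected output shown above. The point of the model is that nullability of `path` is only enforced inside rows where `add` itself is non-null.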

To Reproduce

I created a minimal reproducible example (MRE) as a test that can be run in arrow_expression.rs. It has a top-level nullable struct add with a non-nullable field path.

    #[test]
    fn test_nested_nullability() {
        // Arrow Schema
        let field = Field::new("path", DataType::Utf8, false);
        let top = Arc::new(Field::new(
            "add",
            DataType::Struct(Fields::from(vec![field.clone()])),
            true,
        ));
        let schema = Schema::new([top]);

        // Arrow data
        let values = StringArray::from(vec![Some("fake_path"), None, Some("other_fake_path")]);
        let struct_values: ArrayRef = Arc::new(values);
        let struct_array = StructArray::from(vec![(Arc::new(field), struct_values.clone())]);
        let batch =
            RecordBatch::try_new(Arc::new(schema), vec![Arc::new(struct_array.clone())]).unwrap();

        // Delta Schema
        let nested = StructType::new([StructField::new("path", DeltaDataTypes::STRING, false)]);
        let schema = StructType::new([StructField::new("add", nested, true)]);

        let expression = Expression::struct_from([column_expr!("add.path")]);

        let evaluator = DefaultExpressionEvaluator {
            input_schema: schema.clone().into(),
            expression: Box::new(expression),
            output_type: schema.into(),
        };

        let data = ArrowEngineData::new(batch);

        evaluator.evaluate(&data).unwrap();
    }

Expected behavior

I expect the test to successfully transform the data to something like this:

[ Add { path: "fake_path" }, null, Add { path: "other_fake_path" }]
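
The expected behavior can also be stated as a check on validity flags: a null in a non-nullable child is acceptable only when the parent struct is null in the same row. The following is a rough sketch of that rule in plain Rust, not the evaluator's actual code. `ChildColumn` and `check_nested_nullability` are hypothetical names, and per-row validity is modeled as a `Vec<bool>` rather than an Arrow null buffer.

```rust
/// Hypothetical model of one child column of a struct: its declared
/// nullability plus a per-row validity flag (`true` = value present).
struct ChildColumn {
    name: &'static str,
    nullable: bool,
    valid: Vec<bool>,
}

/// Accept a struct column only if every null in a non-nullable child is
/// masked by a null parent row. `parent_valid[i] == false` means the struct
/// itself is null in row `i`.
fn check_nested_nullability(
    parent_valid: &[bool],
    children: &[ChildColumn],
) -> Result<(), String> {
    for child in children {
        if child.valid.len() != parent_valid.len() {
            return Err(format!("field {} has the wrong length", child.name));
        }
        if child.nullable {
            // Nullable children may be null in any row.
            continue;
        }
        for (row, (&parent, &value)) in parent_valid.iter().zip(&child.valid).enumerate() {
            if parent && !value {
                return Err(format!(
                    "row {}: non-nullable field {} is null inside a non-null struct",
                    row, child.name
                ));
            }
        }
    }
    Ok(())
}
```

With the data from this issue, a parent validity of `[true, false, true]` passes the check, while `[true, true, true]` (the struct treated as never null) is rejected because row 1 has a null `path` inside a non-null `add`.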

Additional context

This was caught while working on the CDF (Change Data Feed) scan file transformation and its schema.
