Expression evaluators do not handle nested non-nullable fields in nullable structs. #578

Open
@OussamaSaoudi-db

Describe the bug

The Arrow expression evaluator used by the default and sync engines does not handle non-nullable fields nested inside a nullable struct. If every field of a struct is null in a given row (including fields declared non-nullable) and the top-level struct is nullable, the expression evaluator must allow the output struct for that row to be null as well.

Consider the schema

        let nested = StructType::new([StructField::new("path", DeltaDataTypes::STRING, false)]);
        let schema = StructType::new([StructField::new("add", nested, true)]);

For engine data

[Some("fake_path"), None, Some("other_fake_path")]

I expect the following:

[ Add { path: "fake_path" }, null, Add { path: "other_fake_path" }]
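
To make the expected rule concrete, here is a minimal sketch in plain Rust that does not use arrow or the kernel at all. The `Column` type and the `struct_validity` function are hypothetical names invented for illustration; they only model the per-row decision described above. The sketch also assumes that a row whose children are all null is treated as a null struct whenever the struct is nullable, which is one possible reading of the expected behavior, and that all columns have the same length.

```rust
/// Hypothetical model of one child column of a struct (not a kernel or arrow type).
struct Column<'a> {
    name: &'a str,
    nullable: bool,
    values: Vec<Option<&'a str>>,
}

/// Compute per-row validity of the parent struct (`true` = struct is present).
///
/// Assumption: every column has the same number of rows.
/// Assumption: if all children are null in a row and the struct is nullable,
/// the row is treated as a null struct instead of a struct of nulls.
fn struct_validity(columns: &[Column<'_>], struct_nullable: bool) -> Result<Vec<bool>, String> {
    let len = columns.first().map_or(0, |c| c.values.len());
    let mut validity = Vec::with_capacity(len);
    for row in 0..len {
        let all_null = columns.iter().all(|c| c.values[row].is_none());
        if all_null && struct_nullable {
            // The parent null masks the nulls of its non-nullable children.
            validity.push(false);
            continue;
        }
        // The struct is present in this row, so non-nullable children must have a value.
        if let Some(c) = columns
            .iter()
            .find(|c| !c.nullable && c.values[row].is_none())
        {
            return Err(format!(
                "non-nullable field `{}` is null at row {}",
                c.name, row
            ));
        }
        validity.push(true);
    }
    Ok(validity)
}
```

With the data above (a single non-nullable `path` column holding a value, a null, and a value), this model marks the first and third rows as present and the second as a null struct when `struct_nullable` is true, and reports an error when it is false.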

To Reproduce

I created an MRE (minimal reproducible example) test that can be run in arrow_expression.rs. It has a top-level nullable struct add with a non-nullable field path.

    #[test]
    fn test_nested_nullability() {
        // Arrow Schema
        let field = Field::new("path", DataType::Utf8, false);
        let top = Arc::new(Field::new(
            "add",
            DataType::Struct(Fields::from(vec![field.clone()])),
            true,
        ));
        let schema = Schema::new([top]);

        // Arrow data
        let values = StringArray::from(vec![Some("fake_path"), None, Some("other_fake_path")]);
        let struct_values: ArrayRef = Arc::new(values);
        let struct_array = StructArray::from(vec![(Arc::new(field), struct_values.clone())]);
        let batch =
            RecordBatch::try_new(Arc::new(schema), vec![Arc::new(struct_array.clone())]).unwrap();

        // Delta Schema
        let nested = StructType::new([StructField::new("path", DeltaDataTypes::STRING, false)]);
        let schema = StructType::new([StructField::new("add", nested, true)]);

        let expression = Expression::struct_from([column_expr!("add.path")]);

        let evaluator = DefaultExpressionEvaluator {
            input_schema: schema.clone().into(),
            expression: Box::new(expression),
            output_type: schema.into(),
        };

        let data = ArrowEngineData::new(batch);

        evaluator.evaluate(&data).unwrap();
    }

Expected behavior

I expect the test to successfully transform the data to something like this:

[ Add { path: "fake_path" }, null, Add { path: "other_fake_path" }]

Additional context

This was caught when working on CDF scan file transformation and schema.
