Skip to content

Extraneous casts added due to overly strict type comparison #19943

@adragomir

Description

@adragomir

Describe the bug

When using deeply nested schemas, for example (duckdb format)

CREATE OR REPLACE TABLE raw (
  timestamp TIMESTAMP_S,  -- 0
  web STRUCT(
    webPageDetails STRUCT(
      name VARCHAR, 
      pageViews STRUCT(value INT8)
    )
  ), 
  identityMap MAP(
    VARCHAR, 
    STRUCT(
      id VARCHAR, 
      prim BOOLEAN
    )[]
  )
);
INSERT INTO raw VALUES
(
  '2025-03-01 00:00:01',
  {
    webPageDetails: {
      name: 'page1', 
      pageViews: {
        value: 100
      }
    }
  }, 
  MAP {
    'ECID': [
      {
        id: 1, 
        prim: true
      }, 
      {
        id: 2, 
        prim: false
      }, 
    ]
  }
);

COPY raw TO 'raw.parquet' (FORMAT PARQUET);

and doing the query

    SELECT
        identityMap['ECID'][1]['id']
    FROM
        raw

The datafusion_optimizer::analyzer::type_coercion step transforms the plan from

    Projection: get_field(array_element(get_field(raw.identityMap, Utf8("ECID")), Int64(1)), Utf8("id"))
      TableScan: raw

to

    Projection: get_field(array_element(CAST(get_field(raw.identityMap, Utf8("ECID")) AS List(nullable Struct("id": nullable Utf8, "prim": nullable Boolean))), Int64(1)), Utf8("id"))
      TableScan: raw

The reason is that inside the guts of coerce_arguments_for_signature_with_scalar_udf & friends, the type comparisons are done using equality. But for arrays, there are functions that return the same types, the only difference being the missing field name ("element", in the case of parquet lists). So we add a cast, even though the schemas are identical.

Arrow has

    /// Compares the datatype with another, ignoring nested field names
    /// and metadata.
    pub fn equals_datatype(&self, other: &DataType) -> bool {

Maybe we should use this instead of plain comparison ?

To Reproduce

Execute duckdb < raw.sql with the sql in the beginning, and then run the sql query in a df context with the loaded parquet

Expected behavior

The optimized plan should not have a cast

Additional context

Metadata

Metadata

Assignees

No one assigned

    Labels

    bugSomething isn't working

    Type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions