-
Notifications
You must be signed in to change notification settings - Fork 1.9k
Open
Labels
bugSomething isn't workingSomething isn't working
Description
Describe the bug
When using deeply nested schemas, for example (duckdb format)
CREATE OR REPLACE TABLE raw (
timestamp TIMESTAMP_S, -- 0
web STRUCT(
webPageDetails STRUCT(
name VARCHAR,
pageViews STRUCT(value INT8)
)
),
identityMap MAP(
VARCHAR,
STRUCT(
id VARCHAR,
prim BOOLEAN
)[]
)
);
INSERT INTO raw VALUES
(
'2025-03-01 00:00:01',
{
webPageDetails: {
name: 'page1',
pageViews: {
value: 100
}
}
},
MAP {
'ECID': [
{
id: 1,
prim: true
},
{
id: 2,
prim: false
},
]
}
);
COPY raw TO 'raw.parquet' (FORMAT PARQUET);
and doing the query
SELECT
identityMap['ECID'][1]['id']
FROM
raw
The datafusion_optimizer::analyzer::type_coercion step transforms the plan from
Projection: get_field(array_element(get_field(raw.identityMap, Utf8("ECID")), Int64(1)), Utf8("id"))
TableScan: raw
to
Projection: get_field(array_element(CAST(get_field(raw.identityMap, Utf8("ECID")) AS List(nullable Struct("id": nullable Utf8, "prim": nullable Boolean))), Int64(1)), Utf8("id"))
TableScan: raw
The reason is that inside the guts of coerce_arguments_for_signature_with_scalar_udf & friends, the type comparisons are done using equality. But for arrays, there are functions that return the same types, the only difference being the missing field name ("element", in the case of parquet lists). So we add a cast, even though the schemas are identical.
Arrow has
/// Compares the datatype with another, ignoring nested field names
/// and metadata.
pub fn equals_datatype(&self, other: &DataType) -> bool {
Maybe we should use this instead of plain comparison ?
To Reproduce
Execute duckdb < raw.sql with the sql in the beginning, and then run the sql query in a df context with the loaded parquet
Expected behavior
The optimized plan should not have a cast
Additional context
aditanase
Metadata
Metadata
Assignees
Labels
bugSomething isn't workingSomething isn't working