Labels: bug (Something isn't working)
Description
Describe the bug
When querying parquet files whose fields carry metadata, without stripping that
metadata, DataFusion fails with an internal schema-mismatch error (quoted in full below).
To Reproduce
-- First, ensure that parquet metadata is not skipped (it is skipped by default)
> set datafusion.execution.parquet.skip_metadata = false;
SELECT
    'foo' AS name,
    COUNT(
        CASE
            WHEN prev_value = false AND value = TRUE THEN 1
            ELSE NULL
        END
    ) AS count_true_rises
FROM
    (
        SELECT
            value,
            LAG(value) OVER (ORDER BY time) AS prev_value
        FROM
            'repro.parquet'
    );

Results in
Internal error: Physical input schema should be the same as the one converted from logical input schema. Differences: .
This issue was likely caused by a bug in DataFusion's code. Please help us to resolve this by filing a bug report in our issue tracker: https://github.com/apache/datafusion/issues
I made the parquet file available here:
Here is the code to generate the parquet file (I am not sure how to create parquet files with metadata otherwise):
use std::collections::HashMap;
use std::fs::File;
use std::sync::Arc;

use arrow::array::{BooleanArray, RecordBatch, TimestampNanosecondArray};
use arrow::datatypes::{DataType, Field, Schema, SchemaRef, TimeUnit};

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Write a parquet file whose `value` field carries field-level metadata.
    let mut metadata = HashMap::new();
    metadata.insert(String::from("year"), String::from("2015"));

    let schema: SchemaRef = Arc::new(Schema::new(vec![
        Field::new("time", DataType::Timestamp(TimeUnit::Nanosecond, None), false),
        // Only this field has metadata attached.
        Field::new("value", DataType::Boolean, false)
            .with_metadata(metadata),
    ]));

    // Two rows, one second apart.
    let time = TimestampNanosecondArray::from(vec![
        1_420_070_400_000_000_000i64,
        1_420_070_401_000_000_000i64,
    ]);
    let value = BooleanArray::from(vec![true, false]);
    let batch = RecordBatch::try_new(schema.clone(), vec![
        Arc::new(time),
        Arc::new(value),
    ])?;

    println!("Writing parquet file with metadata repro.parquet...");
    let writer = File::create("repro.parquet")?;
    let mut arrow_writer = parquet::arrow::ArrowWriter::try_new(
        writer,
        schema.clone(),
        None, // default writer properties
    )?;
    arrow_writer.write(&batch)?;
    arrow_writer.close()?;
    Ok(())
}

Note that this is all the more confusing because the error lists no differences:
... converted from logical input schema. Differences: . <-- no differences are listed!!!
The difference is the metadata on the value field.
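To illustrate the kind of diagnostic that would have made this easier to spot, here is a minimal std-only sketch. It is not DataFusion's or Arrow's code: `SimpleField`, `field`, and `schema_differences` are hypothetical names, and the field model is a simplified stand-in for an Arrow field. It shows two schemas that are unequal only because of field metadata, and a diff routine that reports that metadata instead of an empty list.

```rust
use std::collections::HashMap;

/// Hypothetical, simplified stand-in for an Arrow field (illustration only).
#[derive(Debug, Clone, PartialEq)]
struct SimpleField {
    name: String,
    data_type: String,
    nullable: bool,
    metadata: HashMap<String, String>,
}

/// Hypothetical helper that builds a `SimpleField` from string slices.
fn field(name: &str, data_type: &str, nullable: bool, metadata: &[(&str, &str)]) -> SimpleField {
    SimpleField {
        name: name.to_string(),
        data_type: data_type.to_string(),
        nullable,
        metadata: metadata
            .iter()
            .map(|(k, v)| (k.to_string(), v.to_string()))
            .collect(),
    }
}

/// Collects human-readable differences between two schemas,
/// comparing field metadata as well as name, type, and nullability.
fn schema_differences(physical: &[SimpleField], logical: &[SimpleField]) -> Vec<String> {
    let mut diffs = Vec::new();
    if physical.len() != logical.len() {
        diffs.push(format!(
            "field count: physical {} vs logical {}",
            physical.len(),
            logical.len()
        ));
        return diffs;
    }
    for (i, (p, l)) in physical.iter().zip(logical).enumerate() {
        if p.name != l.name {
            diffs.push(format!("field {}: name {:?} vs {:?}", i, p.name, l.name));
        }
        if p.data_type != l.data_type {
            diffs.push(format!("field {}: type {} vs {}", i, p.data_type, l.data_type));
        }
        if p.nullable != l.nullable {
            diffs.push(format!("field {}: nullable {} vs {}", i, p.nullable, l.nullable));
        }
        if p.metadata != l.metadata {
            diffs.push(format!(
                "field {} ({}): metadata {:?} vs {:?}",
                i, p.name, p.metadata, l.metadata
            ));
        }
    }
    diffs
}
```

Given one schema whose `value` field carries `year = 2015` and another whose `value` field has empty metadata, the two compare as unequal and `schema_differences` returns a single entry naming the metadata mismatch.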
Expected behavior
I expect the query to run without error.
Additional context
No response