Internal error: Physical input schema should be the same as the one converted from logical input schema. #18337

Description


Describe the bug

When querying a parquet file whose fields carry metadata, and that metadata is not stripped on read,
DataFusion fails with the internal error shown in the title.

To Reproduce

Repro

-- First, ensure that parquet metadata is not skipped (it is skipped by default)
> set datafusion.execution.parquet.skip_metadata = false;

SELECT
  'foo' AS name,
  COUNT(
    CASE
      WHEN prev_value = false AND value = TRUE THEN 1
      ELSE NULL
      END
     ) AS count_true_rises
FROM
  (
    SELECT
      value,
      LAG(value) OVER (ORDER BY time ) AS prev_value
    FROM
      'repro.parquet'
);

Results in

Internal error: Physical input schema should be the same as the one converted from logical input schema. Differences: .
This issue was likely caused by a bug in DataFusion's code. Please help us to resolve this by filing a bug report in our issue tracker: https://github.com/apache/datafusion/issues

I made the parquet file available here:

parquet-with-metadata.zip

Here is the code to generate the parquet file (I am not sure how to create parquet files with metadata otherwise):

Details

use std::collections::HashMap;
use std::fs::File;
use std::sync::Arc;
use arrow::array::{BooleanArray, RecordBatch, TimestampNanosecondArray};
use arrow::datatypes::{DataType, Field, Schema, SchemaRef, TimeUnit};

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Write a parquet file whose `value` field carries key/value metadata
    let mut metadata = HashMap::new();
    metadata.insert(String::from("year"), String::from("2015"));
    let schema: SchemaRef = Arc::new(Schema::new(vec![
        Field::new("time", DataType::Timestamp(TimeUnit::Nanosecond, None), false),
        Field::new("value", DataType::Boolean, false)
            .with_metadata(metadata),
    ]));

    let time = TimestampNanosecondArray::from(vec![1_420_070_400_000_000_000i64, 1_420_070_401_000_000_000i64]);
    let value = BooleanArray::from(vec![true, false]);
    let batch = RecordBatch::try_new(schema.clone(), vec![
        Arc::new(time),
        Arc::new(value),
    ])?;


    println!("Writing parquet file with metadata repro.parquet...");
    let writer = File::create("repro.parquet")?;
    let mut arrow_writer = parquet::arrow::ArrowWriter::try_new(
        writer,
        schema.clone(),
        None,
    )?;
    arrow_writer.write(&batch)?;
    arrow_writer.close()?;

    Ok(())
}
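
For completeness, here is a minimal sketch of reproducing the failure programmatically rather than through datafusion-cli. It assumes a recent DataFusion release and the tokio runtime; API names such as SessionConfig::set_bool, SessionContext::new_with_config and register_parquet may differ slightly between versions.

use datafusion::error::Result;
use datafusion::prelude::*;

#[tokio::main]
async fn main() -> Result<()> {
    // Keep parquet field metadata instead of stripping it (it is stripped by default)
    let config =
        SessionConfig::new().set_bool("datafusion.execution.parquet.skip_metadata", false);
    let ctx = SessionContext::new_with_config(config);

    ctx.register_parquet("repro", "repro.parquet", ParquetReadOptions::default())
        .await?;

    let df = ctx
        .sql(
            "SELECT 'foo' AS name,
                    COUNT(CASE WHEN prev_value = false AND value = true THEN 1 END) AS count_true_rises
             FROM (SELECT value, LAG(value) OVER (ORDER BY time) AS prev_value FROM repro)",
        )
        .await?;

    // Expected: instead of printing one row, this fails with the internal error quoted above
    df.show().await?;
    Ok(())
}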

Note this is all the more confusing because the error lists no differences:

...  converted from logical input schema. Differences: . <-- no differences are listed!!!

The difference is the metadata on the value field.
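
To make that concrete: the logical and physical input schemas presumably differ only in field-level metadata, which Arrow's schema equality does take into account, while a comparison of just field names and data types sees nothing to report, which could be why the "Differences:" list comes out empty. A small arrow-rs sketch of that asymmetry (the "year" key mirrors the generator code above):

use std::collections::HashMap;

use arrow::datatypes::{DataType, Field, Schema};

fn main() {
    let mut metadata = HashMap::new();
    metadata.insert(String::from("year"), String::from("2015"));

    // Same field name and type; the only difference is the field-level metadata
    let with_meta = Schema::new(vec![
        Field::new("value", DataType::Boolean, false).with_metadata(metadata),
    ]);
    let without_meta = Schema::new(vec![Field::new("value", DataType::Boolean, false)]);

    // Schema (and Field) equality includes metadata, so the two schemas are not equal ...
    assert_ne!(with_meta, without_meta);

    // ... while a name/type-only comparison finds no difference to print
    assert!(with_meta
        .fields()
        .iter()
        .zip(without_meta.fields().iter())
        .all(|(a, b)| a.name() == b.name() && a.data_type() == b.data_type()));
}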

Expected behavior

I expect the query to succeed without error.

Additional context

No response
