The parquet format specification is not followed for Interval type (i.e. timedeltas) #937

@mgab

Describe the issue:

The way fastparquet stores timedelta values (a.k.a. durations or intervals) in Parquet files does not follow the file format specification. The Parquet specification defines the INTERVAL logical type as follows:

INTERVAL is used for an interval of time. It must annotate a fixed_len_byte_array of length 12. This array stores three little-endian unsigned integers that represent durations at different granularities of time. The first stores a number in months, the second stores a number in days, and the third stores a number in milliseconds. This representation is independent of any particular timezone or date.
(...)
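
To make the quoted layout concrete, here is a minimal sketch of packing a duration into those 12 bytes using only the Python standard library. The helper names `encode_interval` and `decode_interval` are hypothetical and are not part of fastparquet or any other Parquet library. The sketch also assumes that a timedelta maps to zero months, which the specification does not dictate.

```python
import struct
from datetime import timedelta

# Hypothetical helpers for illustration; not part of fastparquet.
_INTERVAL = struct.Struct("<III")  # three little-endian unsigned 32-bit integers


def encode_interval(td: timedelta) -> bytes:
    """Pack a non-negative timedelta into the 12-byte INTERVAL layout."""
    if td < timedelta(0):
        raise ValueError("INTERVAL fields are unsigned; negative durations do not fit")
    months = 0  # assumption: a timedelta carries no calendar months
    # Whole days go in the second field; the remainder of the day goes in the
    # third field as milliseconds. Anything below one millisecond is dropped.
    millis = td.seconds * 1000 + td.microseconds // 1000
    return _INTERVAL.pack(months, td.days, millis)


def decode_interval(raw: bytes) -> tuple[int, int, int]:
    """Unpack 12 bytes into (months, days, milliseconds)."""
    return _INTERVAL.unpack(raw)


encoded = encode_interval(timedelta(seconds=30))
print(len(encoded), decode_interval(encoded))  # 12 (0, 0, 30000)
```

Note that this layout cannot hold negative durations and keeps only millisecond precision, so a conversion from a pandas timedelta column would not necessarily be lossless.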

Currently, fastparquet does not follow the format specification for this type. For any file containing a field of this type, this affects interoperability in both directions: reading Parquet files written by other tools, and reading files written by fastparquet with other tools.

This may be a known limitation rather than a bug, but I couldn't find any information about it.

Minimal Complete Verifiable Example:

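import pandas as pd
from fastparquet import write

# One integer column and one timedelta column holding the same 30 seconds
df = pd.DataFrame([{'seconds': 30, 'duration': pd.to_timedelta(30, unit='seconds')}])

# Write with fastparquet defaults; the target directory must already exist
write('/test/test.parquet', df)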

Then inspect the schema with hangxie/parquet-tools, ktrueda/parquet-tools, or a similar tool. It looks like this:

{"Tag":"name=Schema",
 "Fields":[
  {"Tag":"name=Seconds, type=INT64, repetitiontype=OPTIONAL"},
  {"Tag":"name=Duration, type=INT64, convertedtype=TIME_MICROS, repetitiontype=OPTIONAL"}
]}

instead of something along the lines of:

{"Tag":"name=Duckdb_schema",
 "Fields":[
  {"Tag":"name=Seconds, type=INT32, convertedtype=INT_32, repetitiontype=OPTIONAL"},
  {"Tag":"name=Duration, type=FIXED_LEN_BYTE_ARRAY, convertedtype=INTERVAL, length=12, repetitiontype=OPTIONAL"}
]}

Anything else we need to know?:

There's a bit more context in this StackOverflow question.

Environment:

  • Pandas version: 2.2.2
  • Python version: 2024.5.0
  • Operating System: macOS 14.6.1
  • Install method (conda, pip, source): pip
