Use Arrow Extension Types Instead of Type Variation Hacks #19944

@kadinrabo

Description

DataFusion's Substrait producer encodes Arrow-specific types (such as unsigned integers) by overloading type_variation_reference. This violates Substrait's technology principle of avoiding specialization for a single producer.

Per the spec, type_variation_reference is for physical variations of the same type where "all variations are expected to have the same semantics." Signed and unsigned integers have different semantics.

Types affected:

  • UInt8/16/32/64
  • LargeUtf8/LargeBinary/LargeList
  • Decimal256
  • Duration
  • Date64
  • Time32
  • Time64

Solution

Use Arrow's official extension_types.yaml, which already defines these types (u8, u16, large_string, decimal256, etc.).

Before:

Type::I8 { type_variation_reference: 1 }  // means UInt8

After:

extension_uris: [{ uri: ".../extension_types.yaml" }]
Type::UserDefined { name: "u8" }

The consumer already handles extension types, so backwards compatibility can be maintained.
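
To make the backwards-compatibility point concrete, here is a minimal sketch of how a consumer could keep accepting the legacy encoding while normalizing it to extension type names. Everything in it is illustrative: IntKind, Encoded, and migrate_legacy_int are hypothetical stand-ins, not the real substrait or datafusion-substrait types. Only the "I8 with variation 1 means UInt8" case and the names u8 and u16 come from this issue; treating the other integer widths the same way, and the names u32 and u64, are assumptions by analogy.

```rust
/// Hypothetical stand-in for Substrait's signed integer type kinds.
/// These are NOT the real `substrait` crate types.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum IntKind {
    I8,
    I16,
    I32,
    I64,
}

/// Variation reference that the legacy encoding uses to mean "unsigned"
/// (taken from the `Before` example; assumed to apply to all widths).
pub const LEGACY_UNSIGNED_VARIATION: u32 = 1;

/// Hypothetical normalized result: either a plain built-in type or a
/// user-defined type named after an entry in extension_types.yaml.
#[derive(Debug, Clone, PartialEq, Eq)]
pub enum Encoded {
    Builtin { kind: IntKind },
    UserDefined { name: &'static str },
}

/// Maps a legacy (kind, type_variation_reference) pair to the proposed
/// encoding. Returns `None` for variation references this sketch does
/// not recognize, so the caller can report an error.
pub fn migrate_legacy_int(kind: IntKind, type_variation_reference: u32) -> Option<Encoded> {
    match type_variation_reference {
        // Variation 0 is the default: the type really is a signed integer.
        0 => Some(Encoded::Builtin { kind }),
        // Legacy hack: variation 1 on a signed integer meant "unsigned".
        LEGACY_UNSIGNED_VARIATION => {
            let name = match kind {
                IntKind::I8 => "u8",
                IntKind::I16 => "u16",
                // Assumed names, by analogy with u8 and u16.
                IntKind::I32 => "u32",
                IntKind::I64 => "u64",
            };
            Some(Encoded::UserDefined { name })
        }
        _ => None,
    }
}
```

With a shim like this, a plan produced by an older DataFusion (Type::I8 with type_variation_reference 1) and a plan produced after the change (Type::UserDefined named u8) would resolve to the same Arrow type on the consumer side.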
