Conversation

@sdf-jkl (Contributor) commented Oct 13, 2025

Which issue does this PR close?

Rationale for this change

Add support for Variant::Utf8, LargeUtf8, and Utf8View. This requires adding a new builder, VariantToStringArrowRowBuilder, because LargeUtf8 and Utf8View are not ArrowPrimitiveTypes.

What changes are included in this PR?

  • Added support for Utf8, LargeUtf8, and Utf8View by adding a new enum and builders for Utf8 and LargeUtf8, and adding Utf8View to the primitive builder (a sketch of the enum's shape follows this list).
  • Added a new data_capacity parameter to make_string_variant_to_arrow_row_builder to support string types.
  • Updated the make_string_variant_to_arrow_row_builder call in variant_get to pass the new parameter.
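For orientation, here is a minimal sketch of the new string builder's shape. The arrow builders stand in for the PR's row-builder structs (VariantToUtf8ArrowRowBuilder and VariantToLargeUtf8ArrowBuilder) so the snippet stands alone; the real enum lives in variant_to_arrow.rs and carries conversion and error handling logic:

use arrow::array::{LargeStringBuilder, StringBuilder};

// Sketch only: arrow builders stand in for the PR's row-builder structs.
enum StringVariantToArrowRowBuilder {
    Utf8(StringBuilder),
    LargeUtf8(LargeStringBuilder),
}

impl StringVariantToArrowRowBuilder {
    fn append_null(&mut self) {
        match self {
            Self::Utf8(b) => b.append_null(),
            Self::LargeUtf8(b) => b.append_null(),
        }
    }
}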

Are these changes tested?

Added a variant_get test for the Utf8 type, and created two separate tests for LargeUtf8 and Utf8View because those types can't be shredded.

Are there any user-facing changes?

No

@github-actions bot added the parquet-variant (parquet-variant* crates) label Oct 13, 2025
@sdf-jkl (Contributor, author) commented Oct 14, 2025

@alamb @scovich Please review when you can, thank you!


perfectly_shredded_to_arrow_primitive_test!(
    get_variant_perfectly_shredded_utf8_as_utf8,
    DataType::Utf8,
Member commented:

Do we need to add tests for other types (LargeUtf8/Utf8View) here?

The tests here are meant to cover the variant_get logic, while the tests added in variant_to_arrow.rs cover the builder logic?

@sdf-jkl (author) replied:

Shredding is not supported for LargeUtf8/Utf8View per the specification.

I originally added tests for them inside variant_get but got an error saying these types do not support shredding.

Contributor replied:

That would be from the VariantArray constructor, which invokes this code:

fn canonicalize_and_verify_data_type(
    data_type: &DataType,
) -> Result<Cow<'_, DataType>, ArrowError> {
    ...
    let new_data_type = match data_type {
        ...
        // We can _possibly_ allow (some of) these some day?
        LargeBinary | LargeUtf8 | Utf8View | ListView(_) | LargeList(_) | LargeListView(_) => {
            fail!()
        }

I originally added that code because I was not confident I knew what the correct behavior should be. The shredding spec says:

Shredded values must use the following Parquet types:

Variant Type | Parquet Physical Type   | Parquet Logical Type
...          |                         |
binary       | BINARY                  |
string       | BINARY                  | STRING
...          |                         |
array        | GROUP; see Arrays below | LIST

But I'm pretty sure that doesn't need to constrain the use of DataType::Utf8 vs. DataType::LargeUtf8 vs. DataType::Utf8View? (similar story for the various in-memory layouts of lists and binary values)?

A similar dilemma is that the metadata column is supposed to be parquet BINARY type, but arrow-parquet produces BinaryViewArray by default. Right now we replace DataType::Binary with DataType::BinaryView and force a cast as needed.

If we think the shredding spec forbids LargeUtf8 or Utf8View then we probably need to cast binary views back to normal binary as well.

If we don't think the shredding spec forbids those types, then we should probably support metadata: LargeBinaryArray (tho the narrowing cast to BinaryArray might fail if the offsets really don't fit in 32 bits).
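For illustration, a sketch of the kind of normalizing cast under discussion, using the standard arrow cast kernel rather than the crate's actual code path (the helper name is made up):

use arrow::array::ArrayRef;
use arrow::compute::cast;
use arrow::datatypes::DataType;
use arrow::error::ArrowError;

// Hypothetical helper: force a metadata column back to plain Binary.
// Note: a narrowing cast from LargeBinary can fail when offsets do not
// fit in 32 bits, as noted above.
fn metadata_as_binary(metadata: &ArrayRef) -> Result<ArrayRef, ArrowError> {
    cast(metadata.as_ref(), &DataType::Binary)
}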

@alamb @cashmand, any advice here?

TimestampNano(VariantToTimestampArrowRowBuilder<'a, datatypes::TimestampNanosecondType>),
TimestampNanoNtz(VariantToTimestampNtzArrowRowBuilder<'a, datatypes::TimestampNanosecondType>),
Date(VariantToPrimitiveArrowRowBuilder<'a, datatypes::Date32Type>),
StringView(VariantToUtf8ViewArrowBuilder<'a>),
Member commented:

We added StringView to the PrimitiveVariantToArrowRowBuilder and the other two to StringVariantToArrowRowBuilder; is there a particular reason for this?

@sdf-jkl (author) replied:

Allocating memory for primitive builders only requires a capacity field: the number of items to pre-allocate.

For Utf8/LargeUtf8 builders, a second field is required as well: data_capacity, the total number of UTF-8 bytes to pre-allocate.
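For context, the relevant arrow builder constructors differ in exactly this way (the capacity numbers below are arbitrary examples):

use arrow::array::{LargeStringBuilder, StringBuilder, StringViewBuilder};

fn main() {
    // Utf8/LargeUtf8 builders pre-allocate item slots and data bytes separately:
    let _utf8 = StringBuilder::with_capacity(1024, 16 * 1024);
    let _large_utf8 = LargeStringBuilder::with_capacity(1024, 16 * 1024);

    // The Utf8View builder takes only an item capacity, which is why it fits
    // the existing primitive-builder path:
    let _utf8_view = StringViewBuilder::with_capacity(1024);
}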

Contributor replied:

I don't see any meaningful call sites that pass a data capacity -- only some unit tests.

Ultimately, variant_get will call make_variant_to_arrow_row_builder, and I don't think that code has any way to predict what the correct data capacity might be? How could one even define "correct" when a single value would be applied to each of potentially many string row builders that will be created, when each of those builders could see a completely different distribution of string sizes and null values?

This is very different from the row capacity value, which IS precisely known and applies equally to all builders variant_get might need to create.

Also -- these capacities are just pre-allocation hints; passing too large a hint temporarily wastes a bit of memory, and passing too small a hint just means one or more internal reallocations.

I would vote to just choose a reasonable default "average string size" and multiply that by the row count to obtain a data capacity hint when needed.

TBD whether that average string size should be a parameter that originates with the caller of variant_get and gets plumbed all the way through -- but that seems like a really invasive API change for very little benefit. Seems like a simple const would be much better.
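A sketch of that proposal (the const name and value are illustrative, not from the PR):

use arrow::array::StringBuilder;

// Hypothetical default; the actual value would be a judgment call.
const AVERAGE_STRING_LENGTH: usize = 16;

fn string_builder_for(row_count: usize) -> StringBuilder {
    // Both capacities are only hints: too large briefly wastes memory,
    // too small costs a few internal reallocations.
    StringBuilder::with_capacity(row_count, row_count * AVERAGE_STRING_LENGTH)
}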

Contributor replied:

A big benefit of the simpler approach to data capacity: All the string builders are, in fact, primitive builders (see the macro invocations below) -- so we can just add three new enum variants to the primitive row builder enum and call it done.
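Concretely, that could look something like the following (variant names assumed, with arrow builders standing in for the crate's row-builder structs):

use arrow::array::{LargeStringBuilder, StringBuilder, StringViewBuilder};

enum PrimitiveVariantToArrowRowBuilder {
    // ...existing variants (timestamps, dates, numerics, ...)...
    String(StringBuilder),
    LargeString(LargeStringBuilder),
    StringView(StringViewBuilder),
}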

@scovich (Contributor) left a comment:

Thanks for tackling this! It seems to uncover a couple of issues that might need some guidance from experts; see comments.

Comment on lines +150 to +157
impl<'a> StringVariantToArrowRowBuilder<'a> {
    pub fn append_null(&mut self) -> Result<()> {
        use StringVariantToArrowRowBuilder::*;
        match self {
            Utf8(b) => b.append_null(),
            LargeUtf8(b) => b.append_null(),
        }
    }
Contributor commented:

I don't see any string-specific logic that would merit a nested enum like this?
Can we make this builder generic and use it in two new variants of the top-level enum?

Comment on lines +484 to +503
define_variant_to_primitive_builder!(
    struct VariantToUtf8ArrowRowBuilder<'a>
    |item_capacity, data_capacity: usize| -> StringBuilder { StringBuilder::with_capacity(item_capacity, data_capacity) },
    |value| value.as_string(),
    type_name: "Utf8"
);

define_variant_to_primitive_builder!(
    struct VariantToLargeUtf8ArrowBuilder<'a>
    |item_capacity, data_capacity: usize| -> LargeStringBuilder { LargeStringBuilder::with_capacity(item_capacity, data_capacity) },
    |value| value.as_string(),
    type_name: "LargeUtf8"
);

define_variant_to_primitive_builder!(
    struct VariantToUtf8ViewArrowBuilder<'a>
    |capacity| -> StringViewBuilder { StringViewBuilder::with_capacity(capacity) },
    |value| value.as_string(),
    type_name: "Utf8View"
);
Contributor commented:

Check out the ListLikeArray trait in arrow_to_variant.rs -- I suspect a StringLikeArrayBuilder trait would be very helpful here, because the "shape" of all the string arrays is very similar (even tho the specific array types and possibly some method names might differ).

Suggested change
-define_variant_to_primitive_builder!(
-    struct VariantToUtf8ArrowRowBuilder<'a>
-    |item_capacity, data_capacity: usize| -> StringBuilder { StringBuilder::with_capacity(item_capacity, data_capacity) },
-    |value| value.as_string(),
-    type_name: "Utf8"
-);
-define_variant_to_primitive_builder!(
-    struct VariantToLargeUtf8ArrowBuilder<'a>
-    |item_capacity, data_capacity: usize| -> LargeStringBuilder { LargeStringBuilder::with_capacity(item_capacity, data_capacity) },
-    |value| value.as_string(),
-    type_name: "LargeUtf8"
-);
-define_variant_to_primitive_builder!(
-    struct VariantToUtf8ViewArrowBuilder<'a>
-    |capacity| -> StringViewBuilder { StringViewBuilder::with_capacity(capacity) },
-    |value| value.as_string(),
-    type_name: "Utf8View"
-);
+define_variant_to_primitive_builder!(
+    struct VariantToStringArrowBuilder<'a, B: StringLikeArrayBuilder>
+    |capacity| -> B { B::with_capacity(capacity) },
+    |value| value.as_string(),
+    type_name: B::type_name()
+);

where

trait StringLikeArrayBuilder: ArrayBuilder {
    fn type_name() -> &'static str;
    fn with_capacity(capacity: usize) -> Self;
    fn append_value(&mut self, value: &str);
    fn append_null(&mut self);
}

impl StringLikeArrayBuilder for StringViewBuilder {
      ...
}

impl<O: OffsetSizeTrait> StringLikeArrayBuilder for GenericStringBuilder<O> {
      ...
    fn with_capacity(capacity: usize) -> Self {
        Self::with_capacity(capacity, capacity*AVERAGE_STRING_LENGTH)
    }
      ...
}

As noted in a different comment, we don't have any meaningful way to predict the needed data_capacity of a string builder, so IMO <GenericStringBuilder as StringLikeArrayBuilder>::with_capacity should just scale the item capacity by some reasonable guess at the average string size, and call that the data capacity.
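Filling in the elided bodies under those assumptions, a self-contained version of the trait might look like this (AVERAGE_STRING_LENGTH is the illustrative const from the earlier comment, not an existing name):

use arrow::array::{ArrayBuilder, GenericStringBuilder, OffsetSizeTrait, StringViewBuilder};

// Hypothetical default average string size, as proposed above.
const AVERAGE_STRING_LENGTH: usize = 16;

trait StringLikeArrayBuilder: ArrayBuilder {
    fn type_name() -> &'static str;
    fn with_capacity(capacity: usize) -> Self;
    fn append_value(&mut self, value: &str);
    fn append_null(&mut self);
}

impl StringLikeArrayBuilder for StringViewBuilder {
    fn type_name() -> &'static str {
        "Utf8View"
    }
    fn with_capacity(capacity: usize) -> Self {
        // Resolves to the inherent one-argument constructor.
        StringViewBuilder::with_capacity(capacity)
    }
    fn append_value(&mut self, value: &str) {
        StringViewBuilder::append_value(self, value)
    }
    fn append_null(&mut self) {
        StringViewBuilder::append_null(self)
    }
}

impl<O: OffsetSizeTrait> StringLikeArrayBuilder for GenericStringBuilder<O> {
    fn type_name() -> &'static str {
        if O::IS_LARGE { "LargeUtf8" } else { "Utf8" }
    }
    fn with_capacity(capacity: usize) -> Self {
        // Scale the item capacity by the guessed average string size to get
        // a data-capacity hint, since the caller cannot supply one.
        GenericStringBuilder::with_capacity(capacity, capacity * AVERAGE_STRING_LENGTH)
    }
    fn append_value(&mut self, value: &str) {
        GenericStringBuilder::append_value(self, value)
    }
    fn append_null(&mut self) {
        GenericStringBuilder::append_null(self)
    }
}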



Development

Successfully merging this pull request may close these issues:

[Variant] Add variant_to_arrow Utf-8, LargeUtf8, Utf8View types support
