Skip to content

Conversation

rluvaton
Copy link
Member

@rluvaton rluvaton commented Sep 29, 2025

Which issue does this PR close?

Rationale for this change

Fix reading old parquet files

What changes are included in this PR?

tests and the fix, but mostly tests.

Are these changes tested?

yes

Are there any user-facing changes?

No

…ated struct/primitive with provided arrow schema

closes:
- apache#8495
@github-actions github-actions bot added the parquet Changes to the parquet crate label Sep 29, 2025
@rluvaton rluvaton marked this pull request as ready for review September 29, 2025 18:33
Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you @rluvaton -- I took a quick review of this PR and the code looks reasonable to me, but I don't understand the legacy inferring logic / problem so I can't really review this PR fully yet

Can someone help me out with a link / document that describes what the legacy inferring is? Is it https://github.com/apache/parquet-format/blob/9fd57b59e0ce1a82a69237dcf8977d3e72a2965d/LogicalTypes.md?plain=1#L718-L723

Poking around that file, it looks like @etseidl may know something about this as he authored several commits, for example

/// Converts `self` into an arrow list, with its current type as the field type
/// accept an optional `list_data_type` to specify the type of list to create
///
/// This is used to convert deprecated repeated columns (not in a list), into their arrow representation
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there any reference we can add a link to (I am not familiar with the "deprecated repeated columns" you are referencing here

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Member Author

@rluvaton rluvaton Oct 8, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

exactly, updated

| Some(DataType::LargeList(field_hint))
| Some(DataType::FixedSizeList(field_hint, _)) => Some(field_hint.as_ref()),
Some(_) => unreachable!(
"should be validated earlier that list_data_type is only a type of list"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

even though this should be impossible, I worry about panic'ing here because if there is a bug that error is more severe than "just an error"

ALso, while this may be true at the moment, I can imagine that some future refactor messes it up, in which case this may become reachable and the compiler won't complain

I would prefer returning a general_err! with some sort of "Internal error: should be validated..." type message

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

updated

@alamb
Copy link
Contributor

alamb commented Oct 8, 2025

I will try and find time tomorrow to review this

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

parquet Changes to the parquet crate

Projects

None yet

Development

Successfully merging this pull request may close these issues.

providing the same schema that read from backward compatible parquet fails: incompatible arrow schema, expected struct got List

3 participants