Skip to content

Conversation

@ruolin59
Copy link

@ruolin59 ruolin59 commented Oct 30, 2025

What changes are proposed in this pull request, and why are they necessary?

When Avro schemas define single-element unions (e.g., ["string"] for primitives, [{"type":"record",...}] for array items/map values), the schema merge logic in MergeHiveSchemaWithAvro was incorrectly unwrapping them, changing non-nullable types to nullable types or losing the union structure entirely.

This issue occurs because some table systems (like OpenHouse) don't represent single-element unions as uniontype<...> in the Hive schema. When the Hive schema simply shows string or struct<...> instead of uniontype<string> or uniontype<struct<...>>, the merge logic treats them as regular types and unwraps the Avro union.

Changes made:

  1. HiveSchemaWithPartnerVisitor.java: Added logic at the top level of the visit() method to:

    • Detect when the Avro partner schema has a single-element union but the Hive TypeInfo doesn't declare a union (via shouldUnwrapPartner())
    • Temporarily unwrap the partner before the switch statement for proper type matching
    • Rewrap the result after processing to preserve the original Avro structure
  2. SchemaUtilities.java:

    • Added wrapInSingleElementUnion() utility method along with extractSingleElementUnion()
    • These methods provide a consistent API for single-element union handling
  3. Key behavior: The fix only applies when there's a mismatch between Hive and Avro:

    • If Hive declares uniontype<...>, normal union processing occurs (no unwrap/rewrap)
    • If Hive doesn't declare uniontype but Avro has a single-element union, we unwrap for matching then rewrap to preserve structure

This ensures that Avro schemas with single-element unions (commonly found in avro.schema.literal properties) maintain their exact structure after merging, preventing unintended nullability changes.

How was this patch tested?

Added two comprehensive unit tests in MergeHiveSchemaWithAvroTests:

  1. shouldHandleSingleElementUnionsInAllTypes:

    • Tests single-element unions in primitives (["string"], ["boolean"]), array items, and map values
    • Hive schema does NOT use uniontype (simulating OpenHouse-style schemas)
    • Verifies that all single-element unions are preserved in the output
  2. shouldHandleSingleElementUnionsWithHiveUnionType:

    • Tests backward compatibility when Hive explicitly declares uniontype<...>
    • Includes both regular multi-type unions and single-element unions
    • Ensures the fix doesn't interfere with normal union processing

Both tests verify that:

  • The input Avro schema structure is preserved (single-element unions remain wrapped)
  • Field names, nullability, docs, and default values are maintained
  • Complex nested structures (records within arrays/maps) work correctly

All existing tests continue to pass, confirming no regressions.

Also manually ran integration tests in spark, by generating a jar. Confirmed that the original symptoms are now fixed, ie. view on top of the table that had the single-element-union now show correct nullability for all fields

Also ran integration regression tests with coral-tools:

rexec ./coral-tools/scripts/rdevRegressionCheck.sh trino 2.2.54 https://github.com/ruolin59/coral.git fix-array-and-map-union-extraction 1-10

diff coral_v2.2.snapshot_failures.txt coral_v2.2.54_failures.txt yields no outputs;
diff coral_v2.2.54_TrinoSQL.txt coral_v2.2.snapshot_TrinoSQL.txt yields no outputs;
diff coral_v2.2.54_successes.txt coral_v2.2.snapshot_successes.txt yields no outputs

if (schema.getType() != Schema.Type.MAP) {
return null;
}
Schema valueType = schema.getValueType();

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There seems to be a decent amount of repeated code in both the methods - mapValuePartner and listElementPartner. Can we abstract that out to a private method?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

updated

Copy link

@pandaamit91 pandaamit91 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM! But would prefer to have another set of eyes (more experienced in the Coral repo) approve it as well.

@wmoustafa
Copy link
Contributor

The description is not quite adding up since "unions" are not a thing in Coral. Could you rewrite the description using "records" which is the actual primitive? That would help understand the problem better.

@wmoustafa
Copy link
Contributor

If you really need to use unions for the description, you may talk about unions in the Hive type system, then records in the Coral type system but you have to describe the whole flow and conversion chain with that segregation (but not mix unions with Coral type system description). If unions are not needed and we can state the problem just with records, then even better.

@ruolin59
Copy link
Author

@wmoustafa the "union" here isn't a union related to Hive, but union as related to avro literal schema definition. This is addressed in the description where it says

This causes a type mismatch when comparing against Hive schemas, which don't represent these structures as unions, leading to incorrect schema generation where fields that should be non-nullable (like "type": "string") become nullable (like "type": ["null", "string"])

the "union" here is more referring to the fact that it's de-facto either a union of [element, null], or "single-union" where it's just a union of [element]. Not sure if that make sense

@ruolin59 ruolin59 requested a review from wmoustafa November 5, 2025 17:47
@ruolin59 ruolin59 force-pushed the ruolin/fix-array-and-map-union-extraction branch from 6cf3099 to 45ab0d1 Compare November 7, 2025 22:57
@ruolin59 ruolin59 force-pushed the ruolin/fix-array-and-map-union-extraction branch from 45ab0d1 to 2ddf120 Compare November 7, 2025 23:01
@ruolin59 ruolin59 changed the title [Coral-Schema] fixes schema nullability mismatch bug [Coral-Schema] handle single-element-unions when the supplied column schema does not specify <uniontype> Nov 8, 2025
@ruolin59 ruolin59 changed the title [Coral-Schema] handle single-element-unions when the supplied column schema does not specify <uniontype> [Coral-Schema] handle single-element unions when the supplied column schema does not specify <uniontype> Nov 8, 2025
@aastha25
Copy link
Contributor

From Coral-Schema’s perspective, it converges hive schema & avro schema literal. The input here is a "single-element union", but it isn’t represented as a "uniontype" in the passed hive schema, which is unexpected. This appears to be a caller-side bug. The caller should supply it explicitly as uniontype; we shouldn’t introduce special handling in Coral-Schema to infer user intent.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants