-
Notifications
You must be signed in to change notification settings - Fork 200
[Coral-Schema] handle single-element unions when the supplied column schema does not specify <uniontype> #557
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
Conversation
| if (schema.getType() != Schema.Type.MAP) { | ||
| return null; | ||
| } | ||
| Schema valueType = schema.getValueType(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There seems to be a decent amount of repeated code in both the methods - mapValuePartner and listElementPartner. Can we abstract that out to a private method?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
updated
pandaamit91
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM! But would prefer to have another set of eyes (more experienced in the Coral repo) approve it as well.
|
The description is not quite adding up since "unions" are not a thing in Coral. Could you rewrite the description using "records" which is the actual primitive? That would help understand the problem better. |
|
If you really need to use unions for the description, you may talk about unions in the Hive type system, then records in the Coral type system but you have to describe the whole flow and conversion chain with that segregation (but not mix unions with Coral type system description). If unions are not needed and we can state the problem just with records, then even better. |
|
@wmoustafa the "union" here isn't a union related to Hive, but union as related to avro literal schema definition. This is addressed in the description where it says
the "union" here is more referring to the fact that it's de-facto either a union of [element, null], or "single-union" where it's just a union of [element]. Not sure if that make sense |
coral-schema/src/main/java/com/linkedin/coral/schema/avro/MergeHiveSchemaWithAvro.java
Outdated
Show resolved
Hide resolved
6cf3099 to
45ab0d1
Compare
45ab0d1 to
2ddf120
Compare
|
From Coral-Schema’s perspective, it converges hive schema & avro schema literal. The input here is a "single-element union", but it isn’t represented as a "uniontype" in the passed hive schema, which is unexpected. This appears to be a caller-side bug. The caller should supply it explicitly as uniontype; we shouldn’t introduce special handling in Coral-Schema to infer user intent. |
What changes are proposed in this pull request, and why are they necessary?
When Avro schemas define single-element unions (e.g.,
["string"]for primitives,[{"type":"record",...}]for array items/map values), the schema merge logic inMergeHiveSchemaWithAvrowas incorrectly unwrapping them, changing non-nullable types to nullable types or losing the union structure entirely.This issue occurs because some table systems (like OpenHouse) don't represent single-element unions as
uniontype<...>in the Hive schema. When the Hive schema simply showsstringorstruct<...>instead ofuniontype<string>oruniontype<struct<...>>, the merge logic treats them as regular types and unwraps the Avro union.Changes made:
HiveSchemaWithPartnerVisitor.java: Added logic at the top level of thevisit()method to:TypeInfodoesn't declare a union (viashouldUnwrapPartner())SchemaUtilities.java:wrapInSingleElementUnion()utility method along withextractSingleElementUnion()Key behavior: The fix only applies when there's a mismatch between Hive and Avro:
uniontype<...>, normal union processing occurs (no unwrap/rewrap)uniontypebut Avro has a single-element union, we unwrap for matching then rewrap to preserve structureThis ensures that Avro schemas with single-element unions (commonly found in
avro.schema.literalproperties) maintain their exact structure after merging, preventing unintended nullability changes.How was this patch tested?
Added two comprehensive unit tests in
MergeHiveSchemaWithAvroTests:shouldHandleSingleElementUnionsInAllTypes:["string"],["boolean"]), array items, and map valuesuniontype(simulating OpenHouse-style schemas)shouldHandleSingleElementUnionsWithHiveUnionType:uniontype<...>Both tests verify that:
All existing tests continue to pass, confirming no regressions.
Also manually ran integration tests in spark, by generating a jar. Confirmed that the original symptoms are now fixed, ie. view on top of the table that had the single-element-union now show correct nullability for all fields
Also ran integration regression tests with
coral-tools:diff coral_v2.2.snapshot_failures.txt coral_v2.2.54_failures.txtyields no outputs;diff coral_v2.2.54_TrinoSQL.txt coral_v2.2.snapshot_TrinoSQL.txtyields no outputs;diff coral_v2.2.54_successes.txt coral_v2.2.snapshot_successes.txtyields no outputs