Add support for collated strings in OpConverter #802
base: main
Conversation
Is this ready for review?

Yes, it should be. Although it is still not clear how we can have this change pass the 2.12 checks, given that it uses new changes added in Spark 4.0.
charlenelyu-db
left a comment
```diff
  case SqlIntegerType => OpDataTypes.IntType
  case SqlLongType => OpDataTypes.LongType
- case SqlStringType => OpDataTypes.StringType
+ case _: SqlStringType => OpDataTypes.StringType
```
nit: can revert
We can't, actually. Without this change, this line only matched the case object of the StringType class, but we want it to also match each instance we create for collated types, e.g. StringType("UTF8_LCASE").
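For readers less familiar with Scala pattern matching, here is a minimal sketch of the distinction, using a hypothetical stand-in for Spark's StringType (not the real class): a bare `case SqlStringType` is a stable-identifier pattern that only matches the companion case object, while `case _: SqlStringType` is a type pattern that matches any instance.

```scala
// Hypothetical stand-in for Spark's StringType: a class with collated
// instances, plus a companion object acting as the default (UTF8_BINARY).
class SqlStringType(val collation: String)
object SqlStringType extends SqlStringType("UTF8_BINARY")

def describe(t: Any): String = t match {
  // Stable-identifier pattern: matches only the companion object itself.
  case SqlStringType    => "default string type (case object only)"
  // Type pattern: matches any SqlStringType instance, collated or not.
  case _: SqlStringType => "any string type instance"
  case _                => "not a string type"
}

// The companion object hits the first case; a collated instance is a
// different reference, so it falls through to the type pattern.
val a = describe(SqlStringType)
val b = describe(new SqlStringType("UTF8_LCASE"))
```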
Consider adding this as a comment.
Done!
client/src/test/scala/io/delta/sharing/filters/OpConverterSuite.scala
If the change only applies to 2.13 and not 2.12, you would need to make two copies of the file: one in the 2.13 folder with the new change applied, and one in the 2.12 folder which still has the old code. Check out examples like client/src/main/scala-2.13/org/apache/spark/sql/DeltaSharingScanUtils.scala
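For reference, this is a minimal sketch of how such version-specific source folders are typically wired up in sbt; the settings below are an assumption about how a build like this is usually configured, not taken from this repo's build definition.

```scala
// build.sbt sketch (assumed layout): add an extra source directory per
// Scala binary version, so scala-2.12/ and scala-2.13/ can each hold a
// diverging copy of the same file.
Compile / unmanagedSourceDirectories += {
  val base = (Compile / sourceDirectory).value
  CrossVersion.partialVersion(scalaVersion.value) match {
    case Some((2, n)) if n >= 13 => base / "scala-2.13"
    case _                       => base / "scala-2.12"
  }
}
```

With this shape, sbt compiles exactly one copy of the file for a given cross build, so the 2.12 copy can keep the old code while the 2.13 copy uses the Spark 4.0 APIs.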
```scala
    case _ =>
      None
  }
}
```
Can we guard this change via some config/param?
We could, but since OpConverter does not have access to a SqlConfig, we would have to change a whole lot of code and public APIs to pipe it through to here. So I would say this is not worth doing for this specific change.
client/src/main/scala-2.12/io/delta/sharing/filters/CollationExtractor.scala
```diff
  case SqlIntegerType => OpDataTypes.IntType
  case SqlLongType => OpDataTypes.LongType
- case SqlStringType => OpDataTypes.StringType
+ case _: SqlStringType => OpDataTypes.StringType
```
Consider adding this as a comment.
charlenelyu-db
left a comment
LGTM. Thanks for the change!
Summary
Since its 4.0 release, Spark now supports parametrizing `StringType` with different collations, which define how string data is compared. This PR adds backwards-compatible support for passing collation information in Delta Sharing predicate expressions, with version-specific implementations for Spark 3.5 (Scala 2.12) and Spark 4.0 (Scala 2.13).

Changes
Core Implementation
- New `ExprContext` case class to hold expression metadata, specifically `collationIdentifier` for string comparisons
  - Applied to `EqualOp`, `LessThanOp`, `LessThanOrEqualOp`, `GreaterThanOp`, `GreaterThanOrEqualOp`
  - Also applies to `In` expressions, which are converted to `EqualOp` chains
- `CollationExtractor` implementations:
  - Scala 2.13 (Spark 4.0): extracts collation information from Spark's `StringType` and populates `collationIdentifier` in the format provider.collationName.icuVersion (e.g. `icu.UNICODE_CI.75.1`, `spark.UTF8_LCASE.75.1`)
  - Scala 2.12 (Spark 3.5): does not create `collationIdentifier` and instead defaults to `UTF8_BINARY` comparisons, as collations are just a writer feature in Delta
- Updated `OpConverter` to:
  - call `CollationExtractor.extractCollationIdentifier()` to extract collation information

Backwards Compatibility
- The `exprCtx` parameter is optional (`Option[ExprContext] = None`), ensuring existing code continues to work
- Collation information is carried in the new `ExprContext`, allowing non-collation-aware servers to ignore it
- `UTF8_BINARY` collations (non-collated strings) work on both Spark 3.5 and 4.0
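A rough sketch of what this optional-context shape could look like. The `ExprContext`, `collationIdentifier`, and `exprCtx` names come from this summary; `EqualOp`'s other fields here are simplified placeholders, not the real operator definition.

```scala
// Hypothetical sketch: collation metadata travels in an optional
// ExprContext, defaulted to None so existing call sites keep compiling.
case class ExprContext(collationIdentifier: Option[String])

case class EqualOp(
    left: String,
    right: String,
    exprCtx: Option[ExprContext] = None)  // optional => backwards compatible

// Old-style call site: no context, behaves as before.
val legacy = EqualOp("col", "value")

// Collation-aware call site: the context carries an identifier that a
// non-collation-aware server can simply ignore.
val collated = EqualOp("col", "value",
  Some(ExprContext(Some("spark.UTF8_LCASE.75.1"))))
```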
Added safety checks to prevent invalid comparisons:
- Throws `IllegalArgumentException` when comparing strings with different collations

Protocol Documentation
Updated PROTOCOL.md to document the new `exprCtx` field and the `collationIdentifier` format, with examples.
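The mismatch check described under Validation might be sketched like this; the helper name and signature are hypothetical, not the actual OpConverter code.

```scala
// Hypothetical sketch of the safety check: comparing two string sides
// whose collation identifiers differ is rejected up front, since such a
// comparison has no well-defined ordering.
def requireSameCollation(
    left: Option[String],
    right: Option[String]): Unit = {
  if (left != right) {
    throw new IllegalArgumentException(
      s"Cannot compare strings with different collations: $left vs $right")
  }
}
```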