
Conversation

@stefankandic commented Nov 19, 2025

Summary

Since its 4.0 release, Spark supports parametrizing StringType with different collations, which define how string data is compared. This PR adds backwards-compatible support for passing collation information in Delta Sharing predicate expressions, with version-specific implementations for Spark 3.5 (Scala 2.12) and Spark 4.0 (Scala 2.13).

Changes

Core Implementation

  1. Added an ExprContext case class to hold expression metadata, specifically a collationIdentifier for string comparisons (see the sketch after this list)
  2. Extended comparison operations to accept optional exprCtx parameter:
    - EqualOp, LessThanOp, LessThanOrEqualOp, GreaterThanOp, GreaterThanOrEqualOp
    - This also applies to In expressions, which are converted to EqualOp chains
  3. Created version-specific CollationExtractor implementations:
    - Scala 2.13 (Spark 4.0): Extracts collation information from Spark's StringType and populates collationIdentifier in the format provider.collationName.icuVersion (e.g., icu.UNICODE_CI.75.1, spark.UTF8_LCASE.75.1)
    - Scala 2.12 (Spark 3.5): Does not create a collationIdentifier and instead defaults to UTF8_BINARY comparisons, since collations are only a writer feature in Delta
  4. Updated OpConverter to call CollationExtractor.extractCollationIdentifier() to extract collation information
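
A minimal sketch of how these pieces could fit together. The names ExprContext, EqualOp, and collationIdentifier come from the description above; the op shapes and the wrapper object are simplified stand-ins, not the actual delta-sharing filter classes:

```scala
object CollationPredicateSketch {
  // Simplified stand-ins for the delta-sharing predicate ops.
  sealed trait BaseOp
  case class ColumnOp(name: String, valueType: String) extends BaseOp
  case class LiteralOp(value: String, valueType: String) extends BaseOp

  // Expression metadata; for now it only carries the collation identifier.
  case class ExprContext(collationIdentifier: String)

  // The new parameter defaults to None, so existing call sites keep compiling.
  case class EqualOp(
      children: Seq[BaseOp],
      exprCtx: Option[ExprContext] = None) extends BaseOp

  // A collated equality: name = 'Alice' under a case-insensitive ICU collation.
  val collatedEquals = EqualOp(
    Seq(ColumnOp("name", "string"), LiteralOp("Alice", "string")),
    Some(ExprContext("icu.UNICODE_CI.75.1")))
}
```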

Backwards Compatibility

  • The exprCtx parameter is optional (Option[ExprContext] = None), ensuring existing code continues to work
  • The valueType field remains plain "string" (not a "string collate ..." variant), maintaining compatibility with older clients
  • Collation information is stored separately in ExprContext, allowing non-collation-aware servers to ignore it
  • Default UTF8_BINARY collations (non-collated strings) work on both Spark 3.5 and 4.0

Validation

Added safety checks to prevent invalid comparisons:

  • Throws IllegalArgumentException when comparing strings with different collations
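
A sketch of what such a check could look like, reusing the hypothetical ExprContext from the sketch above (the PR's actual validation may be structured differently):

```scala
// Hypothetical guard comparing the collations of the two sides of a comparison.
def validateSameCollation(
    left: Option[ExprContext],
    right: Option[ExprContext]): Unit = {
  val l = left.map(_.collationIdentifier)
  val r = right.map(_.collationIdentifier)
  if (l != r) {
    throw new IllegalArgumentException(
      s"Cannot compare strings with different collations: $l vs $r")
  }
}
```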

Protocol Documentation

Updated PROTOCOL.md to document the new exprCtx field and collationIdentifier format with examples.
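
For illustration, a collated equality predicate might look roughly like this in the JSON predicate format; the placement of exprCtx is an approximation of what PROTOCOL.md documents, not a verbatim excerpt:

```json
{
  "op": "equal",
  "children": [
    { "op": "column", "name": "name", "valueType": "string" },
    { "op": "literal", "value": "Alice", "valueType": "string" }
  ],
  "exprCtx": { "collationIdentifier": "icu.UNICODE_CI.75.1" }
}
```

Note that valueType stays plain "string", so a server that does not understand exprCtx can ignore it and fall back to a UTF8_BINARY comparison.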

@linzhou-db (Collaborator)

Is this ready for review?

@stefankandic (Author)

> Is this ready for review?

Yes it should be. Although it is still not clear how we can have this change pass the 2.12 checks given that it uses new changes added in Spark 4.0.

@linzhou-db (Collaborator)

> Is this ready for review?
>
> Yes it should be. Although it is still not clear how we can have this change pass the 2.12 checks given that it uses new changes added in Spark 4.0.

cc @littlegrasscao

charlenelyu-db previously approved these changes Nov 21, 2025

@charlenelyu-db (Collaborator) left a comment


This looks good to me and thanks for the nice test coverage! Can you also update the PROTOCOL.md to reflect the new field?


  case SqlIntegerType => OpDataTypes.IntType
  case SqlLongType => OpDataTypes.LongType
- case SqlStringType => OpDataTypes.StringType
+ case _: SqlStringType => OpDataTypes.StringType
Collaborator:

nit: can revert

Author:

We can't, actually. Without this change, the line only matched the StringType case object, but we want it to also match every instance we create for collated types, e.g. StringType("UTF8_LCASE").
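
To illustrate with simplified stand-ins (this is not the actual SqlStringType hierarchy):

```scala
// A default singleton plus collated instances, mirroring how Spark 4.0
// models StringType with a case object for the default collation.
class MyStringType(val collation: String)
object MyStringType extends MyStringType("UTF8_BINARY")

def matchesSingletonOnly(t: MyStringType): Boolean = t match {
  case MyStringType => true // matches ONLY the default singleton
  case _ => false           // new MyStringType("UTF8_LCASE") falls through here
}

def matchesAnyInstance(t: MyStringType): Boolean = t match {
  case _: MyStringType => true // matches the singleton and collated instances alike
}
```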

Collaborator:

Consider adding this as a comment.

Author:

Done!

@littlegrasscao (Collaborator) commented Nov 21, 2025

> Is this ready for review?
>
> Yes it should be. Although it is still not clear how we can have this change pass the 2.12 checks given that it uses new changes added in Spark 4.0.
>
> cc @littlegrasscao

If the change only applies to 2.13 and not 2.12, you would need to make two copies of the file: one in the scala-2.13 folder with the new change applied, and one in the scala-2.12 folder that keeps the old code.

Check out examples like: client/src/main/scala-2.13/org/apache/spark/sql/DeltaSharingScanUtils.scala
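
For example, the version-specific extractor from this PR could be laid out like this; everything beyond the scala-2.12/scala-2.13 folder convention (the path suffix, the signature, the stub body) is an assumption:

```scala
// client/src/main/scala-2.12/.../CollationExtractor.scala (hypothetical sketch)
// Spark 3.5 has no collated StringType, so the 2.12 copy is a no-op stub
// that effectively falls back to UTF8_BINARY semantics.
object CollationExtractor {
  def extractCollationIdentifier(dataType: Any): Option[String] = None
}
```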

    case _ =>
      None
  }
}
Collaborator:

Can we guard this change via some config/param?

Author:

We could, but since OpConverter has no access to a SqlConfig, we would have to change a lot of code and public APIs to pipe a config through to here. I would say that is not worth doing for this specific change.

  case SqlIntegerType => OpDataTypes.IntType
  case SqlLongType => OpDataTypes.LongType
- case SqlStringType => OpDataTypes.StringType
+ case _: SqlStringType => OpDataTypes.StringType
Collaborator:

Consider adding this as a comment.

chakankardb previously approved these changes Nov 25, 2025

@charlenelyu-db (Collaborator) left a comment


LGTM. Thanks for the change!
