
Conversation

@stefankandic commented Nov 19, 2025

Summary

Since its 4.0 release, Spark supports parametrizing StringType with different collations, which define how string data is compared. This PR adds backwards-compatible support for passing collation information in Delta Sharing predicate expressions, with version-specific implementations for Spark 3.5 (Scala 2.12) and Spark 4.0 (Scala 2.13).

Changes

Core Implementation

  1. Added an ExprContext case class to hold expression metadata, specifically a collationIdentifier for string comparisons (see the sketch after this list)
  2. Extended comparison operations to accept optional exprCtx parameter:
    - EqualOp, LessThanOp, LessThanOrEqualOp, GreaterThanOp, GreaterThanOrEqualOp
    - This also applies to In expressions, which are converted to EqualOp chains
  3. Created version-specific CollationExtractor implementations:
    - Scala 2.13 (Spark 4.0): Extracts collation information from Spark's StringType and populates collationIdentifier in the format provider.collationName.icuVersion (e.g., icu.UNICODE_CI.75.1, spark.UTF8_LCASE.75.1)
    - Scala 2.12 (Spark 3.5): Does not create a collationIdentifier and instead defaults to UTF8_BINARY comparisons, since collations are only a writer feature in Delta
  4. Updated OpConverter to call CollationExtractor.extractCollationIdentifier() to extract collation information
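
A minimal sketch of how these pieces could fit together. The names ExprContext, EqualOp, and collationIdentifier come from the description above; the op shapes and the wrapper object are simplified stand-ins, not the actual delta-sharing filter classes:

```scala
object CollationPredicateSketch {
  // Simplified stand-ins for the delta-sharing predicate ops.
  sealed trait BaseOp
  case class ColumnOp(name: String, valueType: String) extends BaseOp
  case class LiteralOp(value: String, valueType: String) extends BaseOp

  // Expression metadata; for now it only carries the collation identifier.
  case class ExprContext(collationIdentifier: String)

  // The new parameter defaults to None, so existing call sites keep compiling.
  case class EqualOp(
      children: Seq[BaseOp],
      exprCtx: Option[ExprContext] = None) extends BaseOp

  // A collated equality: name = 'Alice' under a case-insensitive ICU collation.
  val collatedEquals = EqualOp(
    Seq(ColumnOp("name", "string"), LiteralOp("Alice", "string")),
    Some(ExprContext("icu.UNICODE_CI.75.1")))
}
```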

Backwards Compatibility

  • The exprCtx parameter is optional (Option[ExprContext] = None), ensuring existing code continues to work
  • The valueType field remains plain "string" (not a "string collate ..." variant), maintaining compatibility with older clients
  • Collation information is stored separately in ExprContext, allowing non-collation-aware servers to ignore it
  • Default UTF8_BINARY collations (non-collated strings) work on both Spark 3.5 and 4.0

Validation

Added safety checks to prevent invalid comparisons:

  • Throws IllegalArgumentException when comparing strings with different collations
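
A sketch of what such a check could look like, reusing the hypothetical ExprContext from the sketch above (the PR's actual validation may be structured differently):

```scala
// Hypothetical guard comparing the collations of the two sides of a comparison.
def validateSameCollation(
    left: Option[ExprContext],
    right: Option[ExprContext]): Unit = {
  val l = left.map(_.collationIdentifier)
  val r = right.map(_.collationIdentifier)
  if (l != r) {
    throw new IllegalArgumentException(
      s"Cannot compare strings with different collations: $l vs $r")
  }
}
```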

Protocol Documentation

Updated PROTOCOL.md to document the new exprCtx field and collationIdentifier format with examples.
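
For illustration, a collated equality predicate might look roughly like this in the JSON predicate format; the placement of exprCtx is an approximation of what PROTOCOL.md documents, not a verbatim excerpt:

```json
{
  "op": "equal",
  "children": [
    { "op": "column", "name": "name", "valueType": "string" },
    { "op": "literal", "value": "Alice", "valueType": "string" }
  ],
  "exprCtx": { "collationIdentifier": "icu.UNICODE_CI.75.1" }
}
```

Note that valueType stays plain "string", so a server that does not understand exprCtx can ignore it and fall back to a UTF8_BINARY comparison.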

@linzhou-db (Collaborator)

Is this ready for review?

@stefankandic (Author)

> Is this ready for review?

Yes it should be. Although it is still not clear how we can have this change pass the 2.12 checks given that it uses new changes added in Spark 4.0.

@linzhou-db (Collaborator)

> Is this ready for review?
>
> Yes it should be. Although it is still not clear how we can have this change pass the 2.12 checks given that it uses new changes added in Spark 4.0.

cc @littlegrasscao

charlenelyu-db previously approved these changes Nov 21, 2025

@charlenelyu-db (Collaborator) left a comment


This looks good to me and thanks for the nice test coverage! Can you also update the PROTOCOL.md to reflect the new field?


  case SqlIntegerType => OpDataTypes.IntType
  case SqlLongType => OpDataTypes.LongType
- case SqlStringType => OpDataTypes.StringType
+ case _: SqlStringType => OpDataTypes.StringType
Collaborator:

nit: can revert

Author:

We can't, actually. Without this change, the line only matched the StringType case object, but we want it to also match every instance we create for collated types, e.g. StringType("UTF8_LCASE").
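
To illustrate with simplified stand-ins (this is not the actual SqlStringType hierarchy):

```scala
// A default singleton plus collated instances, mirroring how Spark 4.0
// models StringType with a case object for the default collation.
class MyStringType(val collation: String)
object MyStringType extends MyStringType("UTF8_BINARY")

def matchesSingletonOnly(t: MyStringType): Boolean = t match {
  case MyStringType => true // matches ONLY the default singleton
  case _ => false           // new MyStringType("UTF8_LCASE") falls through here
}

def matchesAnyInstance(t: MyStringType): Boolean = t match {
  case _: MyStringType => true // matches the singleton and collated instances alike
}
```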

Collaborator:

Consider adding this as a comment.

Author:

Done!

@littlegrasscao (Collaborator) commented Nov 21, 2025

> Is this ready for review?
>
> Yes it should be. Although it is still not clear how we can have this change pass the 2.12 checks given that it uses new changes added in Spark 4.0.
>
> cc @littlegrasscao

If the change only applies to 2.13 and not 2.12, you would need to make two copies of the file: one in the scala-2.13 folder with the new change applied, and one in the scala-2.12 folder that keeps the old code.

Check out examples like: client/src/main/scala-2.13/org/apache/spark/sql/DeltaSharingScanUtils.scala
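
For example, the version-specific extractor from this PR could be laid out like this; everything beyond the scala-2.12/scala-2.13 folder convention (the path suffix, the signature, the stub body) is an assumption:

```scala
// client/src/main/scala-2.12/.../CollationExtractor.scala (hypothetical sketch)
// Spark 3.5 has no collated StringType, so the 2.12 copy is a no-op stub
// that effectively falls back to UTF8_BINARY semantics.
object CollationExtractor {
  def extractCollationIdentifier(dataType: Any): Option[String] = None
}
```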

    case _ =>
      None
  }
}
Collaborator:

Can we guard this change via some config/param?

Author:

We could, but since OpConverter has no access to a SqlConfig, we would have to change a lot of code and public APIs to pipe a config through to here. I would say that is not worth doing for this specific change.

  case SqlIntegerType => OpDataTypes.IntType
  case SqlLongType => OpDataTypes.LongType
- case SqlStringType => OpDataTypes.StringType
+ case _: SqlStringType => OpDataTypes.StringType
Collaborator:

Consider adding this as a comment.

chakankardb previously approved these changes Nov 25, 2025

@charlenelyu-db (Collaborator) left a comment


LGTM. Thanks for the change!
