[SPARK-51214][ML][PYTHON][CONNECT] Don't eagerly remove the cached models for fit_transform #49948

Open
wants to merge 4 commits into base: master

Conversation

@zhengruifeng (Contributor) commented Feb 14, 2025

What changes were proposed in this pull request?

Don't eagerly remove the cached models for fit_transform:
1. Keep the Delete ML command protobuf, but no longer call it from __del__ on the Python client side.
2. Build the ML cache with Guava's CacheBuilder using soft references, and specify a maximum size and timeout.
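
A minimal sketch of what such a cache could look like on the server side, assuming Guava's CacheBuilder; the size and timeout values here are illustrative, not necessarily the ones used in the PR:

import java.util.concurrent.TimeUnit
import com.google.common.cache.{Cache, CacheBuilder}

object MLCacheSketch {
  // Illustrative limit; the PR's actual value may differ.
  private val MaxCachedItems = 100L

  // Soft values let the GC reclaim cached models under memory pressure, while
  // maximumSize and expireAfterAccess bound the cache even without GC pressure.
  val cachedObjects: Cache[String, Object] = CacheBuilder
    .newBuilder()
    .softValues()
    .maximumSize(MaxCachedItems)
    .expireAfterAccess(1, TimeUnit.HOURS)
    .build[String, Object]()
}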

Why are the changes needed?

A common ML pipeline pattern is fit_transform, where the fitted model is used once and then discarded:

def fit_transform(df):
    model = estimator.fit(df)
    return model.transform(df)

df2 = fit_transform(df)
df2.count()

The existing implementation eagerly deletes the intermediate model from the ML cache right after fit_transform returns (the Python model object goes out of scope and __del__ issues the Delete command), so the subsequent df2.count(), whose plan still references that model, fails with an NPE:

pyspark.errors.exceptions.connect.SparkConnectGrpcException: (java.lang.NullPointerException) Cannot invoke "org.apache.spark.ml.Model.copy(org.apache.spark.ml.param.ParamMap)" because "model" is null

JVM stacktrace:
java.lang.NullPointerException
	at org.apache.spark.sql.connect.ml.ModelAttributeHelper.transform(MLHandler.scala:68)
	at org.apache.spark.sql.connect.ml.MLHandler$.transformMLRelation(MLHandler.scala:313)
	at org.apache.spark.sql.connect.planner.SparkConnectPlanner.$anonfun$transformRelation$1(SparkConnectPlanner.scala:231)
	at org.apache.spark.sql.connect.service.SessionHolder.$anonfun$usePlanCache$3(SessionHolder.scala:477)
	at scala.Option.getOrElse(Option.scala:201)
	at org.apache.spark.sql.connect.service.SessionHolder.usePlanCache(SessionHolder.scala:476)
	at org.apache.spark.sql.connect.planner.SparkConnectPlanner.transformRelation(SparkConnectPlanner.scala:147)
	at org.apache.spark.sql.connect.planner.SparkConnectPlanner.transformRelation(SparkConnectPlanner.scala:133)
	at org.apache.spark.sql.connect.planner.SparkConnectPlanner.transformRelationalGroupedAggregate(SparkConnectPlanner.scala:2318)
	at org.apache.spark.sql.connect.planner.SparkConnectPlanner.transformAggregate(SparkConnectPlanner.scala:2299)
	at org.apache.spark.sql.connect.planner.SparkConnectPlanner.$anonfun$transformRelation$1(SparkConnectPlanner.scala:165)
	at org.apache.spark.sql.connect.service.SessionHolder.$anonfun$usePlanCache$3(SessionHolder.scala:477)

Does this PR introduce any user-facing change?

Yes. Intermediate models are no longer eagerly removed from the ML cache, so the fit_transform pattern above no longer fails with an NPE.

How was this patch tested?

added tests

Was this patch authored or co-authored using generative AI tooling?

no


private[connect] object MLCache {
  // The maximum number of distinct items in the cache.
  private val MAX_CACHED_ITEMS = 100
zhengruifeng (Contributor, author):
we can add configs later (in 4.1)

Reviewer (Contributor):
What happens if we can't find an item in the cache? Do we throw a proper exception?

zhengruifeng (Contributor, author):
Currently, it throws an NPE.

Let me add an error class for it.
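
For illustration, a rough sketch of what the lookup could do instead of surfacing a NullPointerException later; the helper name and message are hypothetical, not the PR's final error class:

import com.google.common.cache.Cache

object MLCacheLookupSketch {
  // Hypothetical helper for illustration only; the PR's actual error class
  // and message will differ.
  def getOrThrow(cache: Cache[String, Object], refId: String): Object = {
    val obj = cache.getIfPresent(refId)
    if (obj == null) {
      // Fail fast with a descriptive error instead of letting a later
      // Model.copy(...) call blow up with a NullPointerException.
      throw new NoSuchElementException(
        s"ML object $refId was not found in the cache; it may have been evicted.")
    }
    obj
  }
}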
