[SPARK-51974][CONNECT][ML] Limit model size and per-session model cache size #50751


Closed
wants to merge 25 commits into from

Conversation

WeichenXu123
Contributor

@WeichenXu123 WeichenXu123 commented Apr 29, 2025

What changes were proposed in this pull request?

Limit model size and per-session model cache size.

Configurations:

spark.connect.session.connectML.mlCache.memoryControl.enabled
spark.connect.session.connectML.mlCache.memoryControl.maxInMemorySize
spark.connect.session.connectML.mlCache.memoryControl.offloadingTimeout
spark.connect.session.connectML.mlCache.memoryControl.maxModelSize
spark.connect.session.connectML.mlCache.memoryControl.maxSize

The other configurations take effect only when spark.connect.session.connectML.mlCache.memoryControl.enabled is true.
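A hypothetical launch example setting these limits (the config names are from this PR; the values shown are placeholders only, and the exact value formats accepted by each config are not specified here):

```shell
# Config names come from this PR; the values are illustrative, not recommendations.
./sbin/start-connect-server.sh \
  --conf spark.connect.session.connectML.mlCache.memoryControl.enabled=true \
  --conf spark.connect.session.connectML.mlCache.memoryControl.maxInMemorySize=1073741824 \
  --conf spark.connect.session.connectML.mlCache.memoryControl.maxModelSize=268435456
```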

Why are the changes needed?

Motivation: this adds ML cache management to prevent an oversized ML cache from affecting Spark driver availability.

Does this PR introduce any user-facing change?

No.

How was this patch tested?

UT.

Was this patch authored or co-authored using generative AI tooling?

No.

Signed-off-by: Weichen Xu <[email protected]>
@WeichenXu123 WeichenXu123 marked this pull request as draft April 29, 2025 09:09
@WeichenXu123 WeichenXu123 changed the title from "[draft] Tree training early stop capped by model size" to "Tree training early stop capped by model size" May 1, 2025
@WeichenXu123 WeichenXu123 changed the title from "Tree training early stop capped by model size" to "[SPARK-51974][CONNECT][ML] Limit model size and per-session model cache size" May 1, 2025
@WeichenXu123 WeichenXu123 marked this pull request as ready for review May 1, 2025 14:29
@@ -61,6 +61,31 @@ private[connect] class MLCache(sessionHolder: SessionHolder) extends Logging {

private[ml] val totalSizeBytes: AtomicLong = new AtomicLong(0)

private[ml] val totalModelCacheSizeBytes: AtomicLong = new AtomicLong(0)
Contributor

The "totalSizeBytes" above is tracking the total size in memory. Why do we need to use a different totalModelCacheSizeBytes?

Contributor Author

totalModelCacheSizeBytes counts the in-memory plus offloaded data size, while totalSizeBytes counts only the in-memory size. I will rename the latter to totalInMemorySizeBytes.
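A minimal sketch (not the actual MLCache code) of the two-counter scheme described above, assuming hypothetical cache/offload operations: offloading a model removes it from the in-memory counter but it still counts toward the per-session total.

```scala
import java.util.concurrent.atomic.AtomicLong

// Hypothetical illustration of the two counters discussed in this thread:
// totalInMemorySizeBytes tracks only models currently held in memory, while
// totalModelCacheSizeBytes also includes models offloaded out of memory.
object CounterSketch {
  private val totalInMemorySizeBytes = new AtomicLong(0)
  private val totalModelCacheSizeBytes = new AtomicLong(0)

  // Adding a model to the cache grows both counters.
  def cacheModel(sizeBytes: Long): Unit = {
    totalInMemorySizeBytes.addAndGet(sizeBytes)
    totalModelCacheSizeBytes.addAndGet(sizeBytes)
  }

  // Offloading frees memory but the model still occupies the session cache.
  def offloadModel(sizeBytes: Long): Unit = {
    totalInMemorySizeBytes.addAndGet(-sizeBytes)
  }

  def inMemory: Long = totalInMemorySizeBytes.get()
  def total: Long = totalModelCacheSizeBytes.get()
}
```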

@zhengruifeng zhengruifeng left a comment


I think we need to reformat the error classes, otherwise LGTM

@github-actions github-actions bot added the PYTHON label May 5, 2025
@WeichenXu123
Contributor Author

Refactored configurations: @zhengruifeng @xi-db

  • spark.connect.session.connectML.mlCache.memoryControl.enabled
  • spark.connect.session.connectML.mlCache.memoryControl.maxInMemorySize
  • spark.connect.session.connectML.mlCache.memoryControl.offloadingTimeout
  • spark.connect.session.connectML.mlCache.memoryControl.maxModelSize
  • spark.connect.session.connectML.mlCache.memoryControl.maxSize

The other configurations take effect only when spark.connect.session.connectML.mlCache.memoryControl.enabled is true.

@zhengruifeng
Contributor

LGTM pending CI

@WeichenXu123
Contributor Author

Merged to master.

3 participants