[Core] Support Jvm memory shrinking for DynamicOffHeapSizingMemoryTarget #9585

Merged
merged 5 commits into from
Jul 9, 2025

Contributor

@zhli1142015 zhli1142015 commented May 9, 2025

What changes were proposed in this pull request?

Support JVM memory shrinking for DynamicOffHeapSizingMemoryTarget.
When a memory request cannot be satisfied, we first lower the value of the MaxHeapFreeRatio option and then trigger a full GC, so that the JVM returns as much memory as possible to the operating system. This approach incurs the overhead of a full GC each time.
Currently, only Java 11 and Java 17 are supported, and this has only been validated with the G1 GC.
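
For readers unfamiliar with the mechanism, here is a minimal sketch of lowering MaxHeapFreeRatio at runtime and then requesting a full GC. It is not the code from this PR: the class and method names (HeapShrinkSketch, shrinkHeap, readRatio) are hypothetical, and it assumes a HotSpot JVM, where MinHeapFreeRatio and MaxHeapFreeRatio are manageable flags that can be changed through HotSpotDiagnosticMXBean.

```java
import com.sun.management.HotSpotDiagnosticMXBean;
import java.lang.management.ManagementFactory;

/** Hypothetical helper for illustration only; not Gluten's actual implementation. */
final class HeapShrinkSketch {
  private static final HotSpotDiagnosticMXBean HOTSPOT =
      ManagementFactory.getPlatformMXBean(HotSpotDiagnosticMXBean.class);

  /** Reads the current value of an integer-valued VM flag such as MaxHeapFreeRatio. */
  static int readRatio(String flag) {
    return Integer.parseInt(HOTSPOT.getVMOption(flag).getValue());
  }

  /**
   * Lowers the heap free-ratio bounds, requests a full GC, and returns the change in
   * committed heap in bytes (positive if the heap shrank, negative if it grew).
   */
  static long shrinkHeap(int targetMaxFreeRatio) {
    long before = Runtime.getRuntime().totalMemory();
    // MinHeapFreeRatio may not exceed MaxHeapFreeRatio, so lower the minimum first.
    if (readRatio("MinHeapFreeRatio") > targetMaxFreeRatio) {
      HOTSPOT.setVMOption("MinHeapFreeRatio", Integer.toString(targetMaxFreeRatio));
    }
    HOTSPOT.setVMOption("MaxHeapFreeRatio", Integer.toString(targetMaxFreeRatio));
    // System.gc() is only a request; it is ignored if -XX:+DisableExplicitGC is set.
    System.gc();
    long after = Runtime.getRuntime().totalMemory();
    return before - after;
  }

  public static void main(String[] args) {
    long released = shrinkHeap(10);
    System.out.println("Committed heap change after full GC: " + released + " bytes");
  }
}
```

How much memory actually goes back to the operating system depends on the collector and heap settings (for example, a heap pinned with -Xms equal to -Xmx cannot shrink), which may be part of why the description limits validation to G1.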
Config

......
"spark.memory.offHeap.enabled": "false",
"spark.gluten.memory.dynamic.offHeap.sizing.memory.fraction": "0.95",
"spark.gluten.memory.dynamic.offHeap.sizing.enabled": "true"
......

JVM memory shrinking

2025-05-09 09:35:46,862 WARN DynamicOffHeapSizingMemoryTarget [gc-thread-pool]: Starting async full gc to shrink JVM memory: Total On-heap: 58074333184, Free On-heap: 33397024368, Total Off-heap: 8388608, Used On-Heap: 24677308816, Executor memory: 60129542144.
2025-05-09 09:35:48,811 WARN DynamicOffHeapSizingMemoryTarget [gc-thread-pool]: Finished async full gc to shrink JVM memory: Total On-heap: 1199570944, Free On-heap: 116309888, Total Off-heap: 8388608, Used On-Heap: 1083261056, Executor memory: 60129542144, [GC Retry times: 0].
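
Reading the two log lines together: the committed on-heap size drops from 58074333184 to 1199570944 bytes, a release of 56874762240 bytes (about 52.97 GiB), in roughly two seconds (09:35:46,862 to 09:35:48,811). The small sketch below extracts that figure from a pair of such lines. The class name GcLogDelta and its methods are hypothetical helpers, and the log strings in main are shortened copies of the lines above.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical log helper for illustration; only the "Total On-heap" field is parsed. */
final class GcLogDelta {
  private static final Pattern TOTAL_ON_HEAP = Pattern.compile("Total On-heap: (\\d+)");

  /** Extracts the committed on-heap size in bytes from one log line. */
  static long totalOnHeap(String logLine) {
    Matcher m = TOTAL_ON_HEAP.matcher(logLine);
    if (!m.find()) {
      throw new IllegalArgumentException("no 'Total On-heap' field in: " + logLine);
    }
    return Long.parseLong(m.group(1));
  }

  /** Bytes of committed heap released between the "Starting" and "Finished" lines. */
  static long releasedBytes(String startLine, String finishLine) {
    return totalOnHeap(startLine) - totalOnHeap(finishLine);
  }

  public static void main(String[] args) {
    String start = "Starting async full gc to shrink JVM memory: "
        + "Total On-heap: 58074333184, Free On-heap: 33397024368";
    String finish = "Finished async full gc to shrink JVM memory: "
        + "Total On-heap: 1199570944, Free On-heap: 116309888";
    System.out.println("Released bytes: " + releasedBytes(start, finish));
  }
}
```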

github-actions bot added the CORE and VELOX labels May 9, 2025

github-actions bot commented May 9, 2025

Thanks for opening a pull request!

Could you open an issue for this pull request on Github Issues?

https://github.com/apache/incubator-gluten/issues

Then could you also rename commit message and pull request title in the following format?

[GLUTEN-${ISSUES_ID}][COMPONENT]feat/fix: ${detailed message}

See also:

github-actions bot commented May 9, 2025

Run Gluten Clickhouse CI on x86

@zhli1142015
Contributor Author

cc @FelixYBW and @zhztheplayer , thanks.

@zhztheplayer
Member

Thank you for iterating this feature!

Curious, are there any future plans for the feature from your end? Or is this the final revision?

@zhli1142015
Contributor Author

From a functionality standpoint, this covers everything for now. We may add some monitoring and debugging improvements later, and adjust the strategies used based on customer feedback.

@zhli1142015
Contributor Author

Is there something wrong with the build pipeline? cc @zhouyuan and @FelixYBW
https://github.com/apache/incubator-gluten/actions/runs/14986805726/job/42102259698?pr=9585

/usr/bin/docker pull apache/gluten:centos-8-jdk8
  centos-8-jdk8: Pulling from apache/gluten
  no matching manifest for linux/amd64 in the manifest list entries
  Warning: Docker pull failed with exit code 1, back off 4.417 seconds before retry.
  /usr/bin/docker pull apache/gluten:centos-8-jdk8
  centos-8-jdk8: Pulling from apache/gluten
  no matching manifest for linux/amd64 in the manifest list entries
  Warning: Docker pull failed with exit code 1, back off 1.601 seconds before retry.
  /usr/bin/docker pull apache/gluten:centos-8-jdk8
  centos-8-jdk8: Pulling from apache/gluten
  no matching manifest for linux/amd64 in the manifest list entries
  Error: Docker pull failed with exit code 1

Member

@zhztheplayer zhztheplayer left a comment

I haven't had time to take a closer look at this, but feel free to merge if you are confident.

By the way, it feels like we'd need to have basic spilling support for the feature to make it usable. Any thoughts on that?

@zhli1142015
Contributor Author

zhli1142015 commented May 15, 2025

> By the way, it feels like we'd need to have basic spilling support for the feature to make it usable. Any thoughts on that?

Thanks for pointing this out. Through testing I found that with this PR the spill logic is still functioning. In our tests we observed that once the dynamic memory manager frees enough JVM memory, if the native task still needs more memory, spilling will be triggered as expected.
Please let me know if I have missed anything here.

@zhztheplayer
Member

> Thanks for pointing this out. Through testing I found that with this PR the spill logic is still functioning. In our tests we observed that once the dynamic memory manager frees enough JVM memory, if the native task still needs more memory, spilling will be triggered as expected.
> Please let me know if I have missed anything here.

Thanks for the input. I thought the feature would not trigger spilling correctly because it was not reporting its usage to Spark (Spark triggers Velox spilling), but maybe I missed something there. I will have a look once I am available. (non-blocking)

@zhli1142015
Contributor Author

zhli1142015 commented May 27, 2025

This PR previously missed a change: when this feature is enabled, the memory allocations that Gluten reports to Spark should be counted against the ON_HEAP quota rather than OFF_HEAP. From my understanding, this change is the key for this feature to report memory usage correctly; with it, spilling will work as expected.
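
To illustrate the accounting point, here is a self-contained sketch in which allocations are charged to the on-heap quota when dynamic off-heap sizing is enabled, and a grant smaller than the request is the caller's cue to spill. Everything in it is hypothetical (the AccountingSketch class, its Mode enum, and the quota numbers); it does not use Spark's or Gluten's real classes, and it models the quota bookkeeping only.

```java
import java.util.EnumMap;
import java.util.Map;

/** Hypothetical quota model for illustration; not Spark's or Gluten's memory manager. */
final class AccountingSketch {
  enum Mode { ON_HEAP, OFF_HEAP }

  /** With dynamic off-heap sizing enabled, usage is charged to the on-heap quota. */
  static Mode modeFor(boolean dynamicOffHeapSizingEnabled) {
    return dynamicOffHeapSizingEnabled ? Mode.ON_HEAP : Mode.OFF_HEAP;
  }

  private final Map<Mode, Long> quota = new EnumMap<>(Mode.class);
  private final Map<Mode, Long> used = new EnumMap<>(Mode.class);

  AccountingSketch(long onHeapQuota, long offHeapQuota) {
    quota.put(Mode.ON_HEAP, onHeapQuota);
    quota.put(Mode.OFF_HEAP, offHeapQuota);
    used.put(Mode.ON_HEAP, 0L);
    used.put(Mode.OFF_HEAP, 0L);
  }

  /** Grants as much of the request as the quota allows and records it as used. */
  long acquire(Mode mode, long bytes) {
    long free = quota.get(mode) - used.get(mode);
    long granted = Math.max(0L, Math.min(bytes, free));
    used.merge(mode, granted, Long::sum);
    return granted;
  }

  public static void main(String[] args) {
    // Made-up quotas: 100 bytes on-heap, nothing off-heap.
    AccountingSketch pool = new AccountingSketch(100L, 0L);
    Mode mode = modeFor(true);
    long first = pool.acquire(mode, 60L);
    long second = pool.acquire(mode, 60L);
    System.out.println("first grant: " + first + ", second grant: " + second);
    if (second < 60L) {
      System.out.println("short grant: the caller would spill here");
    }
  }
}
```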

Contributor

@zhouyuan zhouyuan left a comment

Thanks! I just learned more details on this feature

@zhli1142015 zhli1142015 merged commit 07b18e7 into apache:main Jul 9, 2025
50 checks passed