[Serve] Group `DeploymentHandle` autoscaling metrics pushes by process #55016

JoshKarpel · 2025-07-29T19:02:20Z

Why are these changes needed?

We're seeing a lot of pressure on the Serve Controller from metrics push tasks when running thousands of Serve apps. A lot of this pressure is purely from the overhead of lots of RPC connections incoming to the controller. We can amortize this overhead (and presumably similar overhead in the handles too) by having the metrics push happen at the per-process level instead of the per-handle level.

We've made this change in our setup, and it has reduced the CPU time spent on metrics pushes both in the Controller and in our ingress application replicas, which hold all the handles.
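The per-process grouping described above can be sketched roughly as follows. This is a minimal illustration rather than the code in this PR: `FakeHandle`, `SharedPusherSketch`, and the `send_bulk` callback are hypothetical names, and the real pusher runs on a timer instead of being called by hand.

```python
from typing import Callable, Dict, List


class FakeHandle:
    """Hypothetical stand-in for a DeploymentHandle that tracks queued requests."""

    def __init__(self, handle_id: str) -> None:
        self.handle_id = handle_id
        self.queued_requests = 0.0

    def snapshot(self) -> Dict[str, float]:
        # One small metrics record per handle.
        return {"queued_requests": self.queued_requests}


class SharedPusherSketch:
    """One instance per process: gathers every handle's metrics into one call."""

    def __init__(self, send_bulk: Callable[[List[dict]], None]) -> None:
        self._send_bulk = send_bulk
        self._handles: Dict[str, FakeHandle] = {}

    def register(self, handle: FakeHandle) -> None:
        self._handles[handle.handle_id] = handle

    def unregister(self, handle: FakeHandle) -> None:
        self._handles.pop(handle.handle_id, None)

    def push_once(self) -> None:
        batch = [
            {"handle_id": handle_id, **handle.snapshot()}
            for handle_id, handle in self._handles.items()
        ]
        if batch:  # skip the call entirely when no handles are registered
            self._send_bulk(batch)


# Record each "RPC" so the number of calls can be inspected.
calls: List[List[dict]] = []
pusher = SharedPusherSketch(send_bulk=calls.append)
for i in range(1000):
    pusher.register(FakeHandle(f"handle-{i}"))
pusher.push_once()
print(len(calls), len(calls[0]))  # 1 1000
```

With 1000 registered handles, one push cycle produces a single call carrying 1000 records instead of 1000 separate calls.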

Related issue number

Closes #45777

Second attempt at #45957; see the previous discussion there for more context.

Checks

- I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
- I've run scripts/format.sh to lint the changes in this PR.
- I've included any doc changes needed for https://docs.ray.io/en/master/.
  - I've added any new APIs to the API Reference. For example, if I added a
    method in Tune, I've added it in doc/source/tune/api/ under the
    corresponding .rst file.
- I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
- Testing Strategy
  - Unit tests
  - Release tests
  - This PR is not tested :(

Signed-off-by: Josh Karpel <[email protected]>


gemini-code-assist

Summary of Changes

Hello @JoshKarpel, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request focuses on optimizing the autoscaling metrics reporting within Ray Serve to alleviate RPC overhead on the Serve Controller. The core change involves grouping DeploymentHandle autoscaling metrics at the process level before transmitting them, rather than sending individual reports. This refactoring is expected to improve the scalability and performance of the Serve system, particularly when managing a large number of applications. Additionally, it introduces more explicit and configurable intervals for metrics collection and pushing.

Highlights

- Metrics Grouping for Performance: I've introduced a SharedHandleMetricsPusher class that aggregates autoscaling metrics from multiple DeploymentHandles within the same process. This allows for sending these metrics to the controller in a single bulk RPC call, significantly reducing network overhead and pressure on the Serve Controller, especially in large-scale deployments.
- Standardized Metrics Reporting: I've refactored the internal metrics reporting by creating a dedicated HandleMetricReport dataclass. This new structure encapsulates all necessary handle metrics, simplifying the API for AutoscalingStateManager and the Serve Controller, which now directly consume these unified report objects.
- Granular Control over Intervals: I've deprecated the metrics_interval_s field in AutoscalingConfig in favor of new environment variables (e.g., RAY_SERVE_REPLICA_AUTOSCALING_METRIC_PUSH_INTERVAL_S, RAY_SERVE_HANDLE_AUTOSCALING_METRIC_PUSH_INTERVAL_S). This provides more precise and independent control over the recording and pushing intervals for both replica and handle autoscaling metrics.
- Controller API Enhancement: I've added a new bulk_record_handle_metrics method to the Serve Controller. This method is designed to efficiently receive and process sequences of HandleMetricReport objects, supporting the new grouped metrics push mechanism.
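As a rough illustration of the report object and the bulk controller method named in the highlights, the receiving side might look something like the sketch below. The field names on `HandleMetricReport`, the `ControllerSketch` class, and the "keep the newest report per handle" rule are assumptions made for the example; only the names `HandleMetricReport` and `bulk_record_handle_metrics` come from this PR.

```python
from dataclasses import dataclass
from typing import Dict, Sequence, Tuple


@dataclass(frozen=True)
class HandleMetricReport:
    # Field names are illustrative assumptions, not the PR's actual schema.
    deployment: str
    handle_id: str
    queued_requests: float
    timestamp: float


class ControllerSketch:
    """Hypothetical controller-side store for handle metric reports."""

    def __init__(self) -> None:
        # Latest report seen for each (deployment, handle_id) pair.
        self._latest: Dict[Tuple[str, str], HandleMetricReport] = {}

    def record_handle_metrics(self, report: HandleMetricReport) -> None:
        key = (report.deployment, report.handle_id)
        current = self._latest.get(key)
        # Keep only the newest report so out-of-order delivery is harmless.
        if current is None or report.timestamp >= current.timestamp:
            self._latest[key] = report

    def bulk_record_handle_metrics(self, reports: Sequence[HandleMetricReport]) -> None:
        # One incoming call carries many reports; each goes through the same path.
        for report in reports:
            self.record_handle_metrics(report)

    def total_queued(self, deployment: str) -> float:
        return sum(
            r.queued_requests for r in self._latest.values() if r.deployment == deployment
        )


controller = ControllerSketch()
controller.bulk_record_handle_metrics(
    [
        HandleMetricReport("app", "h1", 2.0, timestamp=1.0),
        HandleMetricReport("app", "h2", 3.0, timestamp=1.0),
        HandleMetricReport("app", "h1", 5.0, timestamp=2.0),
    ]
)
print(controller.total_queued("app"))  # 8.0
```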


gemini-code-assist

Code Review

Overall, this is a well-executed pull request that addresses a significant performance bottleneck in Ray Serve. The introduction of SharedHandleMetricsPusher to batch metric reports is a smart solution. The related code refactoring, such as using the HandleMetricReport dataclass, improves code clarity and maintainability.

My review includes one high-priority comment regarding a potential bug in an environment variable name and one medium-priority comment about a future performance consideration for the new shared pusher. Once these are addressed, the PR should be in great shape.

python/ray/serve/_private/constants.py

python/ray/serve/_private/router.py


JoshKarpel · 2025-09-16T21:12:47Z

python/requirements/test-requirements.txt

 pytest==7.4.4
 pytest-asyncio==0.17.2
 pytest-aiohttp==1.1.0
+pytest-env


https://pypi.org/project/pytest-env/

I was struggling with how to conveniently override the environment variables that control the metrics push interval early enough, and in every case where Pytest is being used, so I introduced this new dependency to handle it.

I mentioned this in the other comment: use python/ray/serve/tests/BUILD.bazel to set env variables.

You should revert this change.

JoshKarpel · 2025-09-16T21:44:46Z

Could you update the description with profile/benchmark information that showcases the benefit of this implementation?

I went back through our internal discussions on this and unfortunately I can't find any profiling results from this change specifically in our setup :(

I'll see if I can get a chance to reproduce them but my time is pretty limited right now.


abrarsheikh · 2025-09-17T16:50:25Z

pytest.ini

+    RAY_SERVE_REPLICA_AUTOSCALING_METRIC_PUSH_INTERVAL_S = 0.1
+    RAY_SERVE_HANDLE_AUTOSCALING_METRIC_PUSH_INTERVAL_S = 0.1


The place you want to add this is python/ray/serve/tests/BUILD.bazel. In Ray, we manage env variables through Bazel.

The trouble with that is it won't apply to local runs if you're not using bazel - is the expectation that bazel is used for local test runs too? If so I'm happy to set these there, but I guess I assumed that it should be possible to use plain pytest commands locally.

Yeah, that is a pain point we have accepted for now: local tests are run through Bazel too.

Ah, I see! I will set them there then.
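For context on why the override has to be applied "early enough": the sketch below assumes the push interval is read from the environment once, when the constants module is imported, so a later change to the variable is not picked up. The default of 10.0 and the `read_interval_s` helper are hypothetical; only the variable name comes from this thread.

```python
import os

ENV_NAME = "RAY_SERVE_HANDLE_AUTOSCALING_METRIC_PUSH_INTERVAL_S"
DEFAULT_INTERVAL_S = 10.0  # hypothetical default, for illustration only


def read_interval_s() -> float:
    """Parse the interval from the environment, falling back to the default."""
    return float(os.environ.get(ENV_NAME, DEFAULT_INTERVAL_S))


# Simulate a module-level constant that is evaluated once at import time.
os.environ.pop(ENV_NAME, None)
frozen_at_import = read_interval_s()

# Setting the variable afterwards does not change the already-computed constant.
os.environ[ENV_NAME] = "0.1"
late_value = read_interval_s()
os.environ.pop(ENV_NAME, None)  # leave the environment as we found it

print(frozen_at_import)  # 10.0
print(late_value)        # 0.1
```

Setting the variables in the test runner's environment, as suggested above, sidesteps the ordering problem because they are already present when the first import happens.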

abrarsheikh · 2025-09-17T19:31:32Z

python/ray/serve/_private/autoscaling_state.py

-            2 * self._config.metrics_interval_s,
-            RAY_SERVE_MIN_HANDLE_METRICS_TIMEOUT_S,
-        )
+        timeout_s = 2.5 * RAY_SERVE_HANDLE_AUTOSCALING_METRIC_PUSH_INTERVAL_S


Right, but why drop RAY_SERVE_MIN_HANDLE_METRICS_TIMEOUT_S? Wouldn't that be a feature regression?
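One way to keep the floor while still deriving the timeout from the push interval is sketched below. The function name and the example values are hypothetical; the 2.5 multiplier is taken from the diff above, and the use of `max` is an assumption about how the minimum would be applied.

```python
def handle_metrics_timeout_s(
    push_interval_s: float,
    min_timeout_s: float,
    multiplier: float = 2.5,
) -> float:
    """Scale the timeout with the push interval, but never go below the floor."""
    return max(multiplier * push_interval_s, min_timeout_s)


# A short test interval would give a very small timeout without the floor.
print(handle_metrics_timeout_s(push_interval_s=0.1, min_timeout_s=10.0))   # 10.0
# A long interval dominates the floor.
print(handle_metrics_timeout_s(push_interval_s=10.0, min_timeout_s=10.0))  # 25.0
```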

abrarsheikh · 2025-09-17T19:35:00Z

python/ray/serve/_private/router.py

@@ -1,3 +1,5 @@
+from __future__ import annotations


What does this do?

It lets https://github.com/ray-project/ray/pull/55016/files#diff-b81ab953bc243bf2c453db721719903571d53fdd13d9607a292a4658951d66f0R1068 work: with it, forward-referenced types don't need to be quoted.
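A small standalone example of the effect (the `Registry` and `Pusher` classes are made up for illustration): with the future import, annotations are stored as strings and not evaluated at definition time, so a method can name a class that is defined later in the module.

```python
from __future__ import annotations

from typing import List


class Registry:
    def __init__(self) -> None:
        self._items: list = []

    def register(self, pusher: Pusher) -> List[Pusher]:
        # `Pusher` is defined further down; without the future import this
        # annotation would have to be written as the string "Pusher".
        self._items.append(pusher)
        return self._items


class Pusher:
    pass


registry = Registry()
registered = registry.register(Pusher())
print(len(registered))                              # 1
print(Registry.register.__annotations__["pusher"])  # Pusher
```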

abrarsheikh · 2025-09-17T19:36:58Z

python/ray/serve/_private/router.py

                )
+                shared.register(self)
            else:
                self.metrics_pusher.register_or_update_task(


I think we should also use the shared pusher in the else case.

abrarsheikh · 2025-09-17T19:39:08Z

python/ray/serve/_private/router.py

+        self._metrics_pusher.start()
+
+        self._metrics_pusher.register_or_update_task(
+            "push_metrics_to_controller",


nit: extract ReplicaMetricsManager.PUSH_METRICS_TO_CONTROLLER_TASK_NAME into a constant in commons.py and use that here.

abrarsheikh · 2025-09-17T19:40:59Z

python/requirements/test-requirements.txt

 pytest==7.4.4
 pytest-asyncio==0.17.2
 pytest-aiohttp==1.1.0
+pytest-env


I mentioned this in the other comment: use python/ray/serve/tests/BUILD.bazel to set env variables.

You should revert this change.


JoshKarpel added 12 commits June 10, 2024 16:54

introduce shared handle metrics pusher

9b3e368

Signed-off-by: Josh Karpel <[email protected]>

minor optimizations

45dc6e5

Signed-off-by: Josh Karpel <[email protected]>

Merge branch 'refs/heads/master' into issue-45777-amortize-metrics-pushing-v2

7cf7299

Signed-off-by: Josh Karpel <[email protected]>

debug logs

84f6c10

Signed-off-by: Josh Karpel <[email protected]>

updated HandleMetricReport docstring, change to debug log

4a48902

Signed-off-by: Josh Karpel <[email protected]>

Merge branch 'master' into issue-45777-amortize-metrics-pushing-v2

af5d76a

tidy up

415a44b

Signed-off-by: Josh Karpel <[email protected]>

Merge branch 'master' into issue-45777-amortize-metrics-pushing-v2

088800f

Merge branch 'master' into issue-45777-amortize-metrics-pushing-v2

86da5fa

# Conflicts: # python/ray/serve/_private/router.py

Merge branch 'master' into issue-45777-amortize-metrics-pushing-v2

b87c75e

# Conflicts: # python/ray/serve/_private/router.py

deprecate AutoscalingConfig.metrics_interval_s

f771410

Signed-off-by: Josh Karpel <[email protected]>

Merge branch 'master' into issue-45777-amortize-metrics-pushing-v2

e5b4123

gemini-code-assist bot reviewed Jul 29, 2025

python/ray/serve/_private/constants.py (outdated)

python/ray/serve/_private/router.py (outdated)

JoshKarpel and others added 8 commits July 29, 2025 14:07

Update python/ray/serve/_private/constants.py

ff60ebe

Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com> Signed-off-by: Josh Karpel <[email protected]>

tidy up a few things on the metrics report generation path

49b2188

Signed-off-by: Josh Karpel <[email protected]>

lint

bf6abbf

Signed-off-by: Josh Karpel <[email protected]>

Merge branch 'master' into issue-45777-amortize-metrics-pushing-v2

c99a762

still supporting 3.9

a0618d5

Signed-off-by: Josh Karpel <[email protected]>

Merge branch 'master' into issue-45777-amortize-metrics-pushing-v2

e16cae0

remove more uses of metrics_interval_s

3bdfd6a

Signed-off-by: Josh Karpel <[email protected]>

Merge branch 'master' into issue-45777-amortize-metrics-pushing-v2

e093563

JoshKarpel mentioned this pull request Jul 31, 2025

[Serve] Mark per-deployment autoscaling metrics push interval config deprecated #55102

Merged

8 tasks

JoshKarpel added 3 commits August 11, 2025 09:44

Merge branch 'master' into issue-45777-amortize-metrics-pushing-v2

b56e914

# Conflicts: # python/ray/serve/_private/constants.py # python/ray/serve/_private/replica.py # python/ray/serve/_private/router.py # python/ray/serve/config.py # src/ray/protobuf/serve.proto

use statistics.fmean, increase timeout

69df796

Signed-off-by: Josh Karpel <[email protected]>

update deprecation statements

ebb95c0

Signed-off-by: Josh Karpel <[email protected]>

JoshKarpel added 2 commits August 12, 2025 09:43

revert

ef7726a

Signed-off-by: Josh Karpel <[email protected]>

Merge branch 'master' into issue-45777-amortize-metrics-pushing-v2

c58162a

JoshKarpel added 4 commits September 11, 2025 11:05

continued cleanup after merging master

5de0523

Signed-off-by: Josh Karpel <[email protected]>

remove dead import

352a8b7

Signed-off-by: Josh Karpel <[email protected]>

Merge branch 'refs/heads/master' into issue-45777-amortize-metrics-pushing-v2

87f71c3

use pytest-env to set overrides

ea4bf41

Signed-off-by: Josh Karpel <[email protected]>

JoshKarpel force-pushed the issue-45777-amortize-metrics-pushing-v2 branch from 0c63f4c to ea4bf41 Compare September 16, 2025 17:22

JoshKarpel added 2 commits September 16, 2025 13:35

loosen pin

e2ae6f2

Signed-off-by: Josh Karpel <[email protected]>

compiled reqs

22b43a4

Signed-off-by: Josh Karpel <[email protected]>

JoshKarpel commented Sep 16, 2025


flo-ri-an mentioned this pull request Sep 17, 2025

[Core]: Fatal error reported in dashboard #56581

Closed

JoshKarpel added 2 commits September 17, 2025 11:02

fix method name

df78bef

Signed-off-by: Josh Karpel <[email protected]>

fix more method names

5f71b67

Signed-off-by: Josh Karpel <[email protected]>

abrarsheikh reviewed Sep 17, 2025


undo pytest-env, set envs in bazel, constify, reintroduce min handle metrics timeout

e6a6c25

Signed-off-by: Josh Karpel <[email protected]>

JoshKarpel force-pushed the issue-45777-amortize-metrics-pushing-v2 branch from 2bf1f57 to e6a6c25 Compare September 17, 2025 20:36

JoshKarpel added 2 commits September 17, 2025 16:12

Merge branch 'master' into issue-45777-amortize-metrics-pushing-v2

719562a

fix reqs

2274c03

Signed-off-by: Josh Karpel <[email protected]>

JoshKarpel mentioned this pull request Sep 17, 2025

Aggregate autoscaling metrics on controller #56306

Merged

JoshKarpel added 2 commits September 18, 2025 13:28

Merge branch 'master' into issue-45777-amortize-metrics-pushing-v2

240b869

Merge branch 'master' into issue-45777-amortize-metrics-pushing-v2

de9c3b1

JoshKarpel added 4 commits October 10, 2025 12:58

fix pre-commit error

34f1a13

Signed-off-by: Josh Karpel <[email protected]>

fix pre-commit error again

16910cc

Signed-off-by: Josh Karpel <[email protected]>

fix pre-commit error again again

7695ceb

Signed-off-by: Josh Karpel <[email protected]>


JoshKarpel added 4 commits October 10, 2025 14:20

fix import

8f0b5cc

Signed-off-by: Josh Karpel <[email protected]>

fix another thing

83dda95

Signed-off-by: Josh Karpel <[email protected]>

unregister when autoscaling config is None

5697380

Signed-off-by: Josh Karpel <[email protected]>

Merge branch 'master' into issue-45777-amortize-metrics-pushing-v2

b8386c9

