-
Notifications
You must be signed in to change notification settings - Fork 6.8k
[Serve] Group DeploymentHandle
autoscaling metrics pushes by process
#55016
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
[Serve] Group DeploymentHandle
autoscaling metrics pushes by process
#55016
Conversation
Signed-off-by: Josh Karpel <[email protected]>
Signed-off-by: Josh Karpel <[email protected]>
…shing-v2 Signed-off-by: Josh Karpel <[email protected]>
Signed-off-by: Josh Karpel <[email protected]>
Signed-off-by: Josh Karpel <[email protected]>
Signed-off-by: Josh Karpel <[email protected]>
# Conflicts: # python/ray/serve/_private/router.py
# Conflicts: # python/ray/serve/_private/router.py
Signed-off-by: Josh Karpel <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Summary of Changes
Hello @JoshKarpel, I'm Gemini Code Assist1! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!
This pull request focuses on optimizing the autoscaling metrics reporting within Ray Serve to alleviate RPC overhead on the Serve Controller. The core change involves grouping DeploymentHandle
autoscaling metrics at the process level before transmitting them, rather than sending individual reports. This refactoring is expected to improve the scalability and performance of the Serve system, particularly when managing a large number of applications. Additionally, it introduces more explicit and configurable intervals for metrics collection and pushing.
Highlights
- Metrics Grouping for Performance: I've introduced a
SharedHandleMetricsPusher
class that aggregates autoscaling metrics from multipleDeploymentHandle
s within the same process. This allows for sending these metrics to the controller in a single bulk RPC call, significantly reducing network overhead and pressure on the Serve Controller, especially in large-scale deployments. - Standardized Metrics Reporting: I've refactored the internal metrics reporting by creating a dedicated
HandleMetricReport
dataclass. This new structure encapsulates all necessary handle metrics, simplifying the API forAutoscalingStateManager
and the Serve Controller, which now directly consume these unified report objects. - Granular Control over Intervals: I've deprecated the
metrics_interval_s
field inAutoscalingConfig
in favor of new environment variables (e.g.,RAY_SERVE_REPLICA_AUTOSCALING_METRIC_PUSH_INTERVAL_S
,RAY_SERVE_HANDLE_AUTOSCALING_METRIC_PUSH_INTERVAL_S
). This provides more precise and independent control over the recording and pushing intervals for both replica and handle autoscaling metrics. - Controller API Enhancement: I've added a new
bulk_record_handle_metrics
method to the Serve Controller. This method is designed to efficiently receive and process sequences ofHandleMetricReport
objects, supporting the new grouped metrics push mechanism.
Using Gemini Code Assist
The full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips.
Invoking Gemini
You can request assistance from Gemini at any point in your pull request via creating an issue comment (i.e. comment on the pull request page) using either /gemini <command>
or @gemini-code-assist <command>
. Below is a summary of the supported commands.
Feature | Command | Description |
---|---|---|
Code Review | /gemini review |
Performs a code review for the current pull request in its current state. |
Pull Request Summary | /gemini summary |
Provides a summary of the current pull request in its current state. |
Comment | @gemini-code-assist | Responds in comments when explicitly tagged, both in issue comments and review comments. |
Help | /gemini help |
Displays a list of available commands. |
Customization
To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a .gemini/
folder in the base of the repository. Detailed instructions can be found here.
Limitations & Feedback
Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments or fill out our survey to provide feedback.
You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.
Footnotes
-
Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution. ↩
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Code Review
Overall, this is a well-executed pull request that addresses a significant performance bottleneck in Ray Serve. The introduction of SharedHandleMetricsPusher
to batch metric reports is a smart solution. The related code refactoring, such as using the HandleMetricReport
dataclass, improves code clarity and maintainability.
My review includes one high-priority comment regarding a potential bug in an environment variable name and one medium-priority comment about a future performance consideration for the new shared pusher. Once these are addressed, the PR should be in great shape.
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com> Signed-off-by: Josh Karpel <[email protected]>
Signed-off-by: Josh Karpel <[email protected]>
Signed-off-by: Josh Karpel <[email protected]>
Signed-off-by: Josh Karpel <[email protected]>
Signed-off-by: Josh Karpel <[email protected]>
…deprecated (#55102) ## Why are these changes needed? Deprecation announcements for this future PR #55016 ## Related issue number Related to #45777 ## Checks - [x] I've signed off every commit(by using the -s flag, i.e., `git commit -s`) in this PR. - [x] I've run `scripts/format.sh` to lint the changes in this PR. - [x] I've included any doc changes needed for https://docs.ray.io/en/master/. - [ ] I've added any new APIs to the API Reference. For example, if I added a method in Tune, I've added it in `doc/source/tune/api/` under the corresponding `.rst` file. - [x] I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/ - Testing Strategy - [x] Unit tests - [ ] Release tests - [ ] This PR is not tested :( --------- Signed-off-by: Josh Karpel <[email protected]>
# Conflicts: # python/ray/serve/_private/constants.py # python/ray/serve/_private/replica.py # python/ray/serve/_private/router.py # python/ray/serve/config.py # src/ray/protobuf/serve.proto
Signed-off-by: Josh Karpel <[email protected]>
Signed-off-by: Josh Karpel <[email protected]>
…deprecated (#55102) ## Why are these changes needed? Deprecation announcements for this future PR #55016 ## Related issue number Related to #45777 ## Checks - [x] I've signed off every commit(by using the -s flag, i.e., `git commit -s`) in this PR. - [x] I've run `scripts/format.sh` to lint the changes in this PR. - [x] I've included any doc changes needed for https://docs.ray.io/en/master/. - [ ] I've added any new APIs to the API Reference. For example, if I added a method in Tune, I've added it in `doc/source/tune/api/` under the corresponding `.rst` file. - [x] I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/ - Testing Strategy - [x] Unit tests - [ ] Release tests - [ ] This PR is not tested :( --------- Signed-off-by: Josh Karpel <[email protected]> Signed-off-by: sampan <[email protected]>
Signed-off-by: Josh Karpel <[email protected]>
Signed-off-by: Josh Karpel <[email protected]>
Signed-off-by: Josh Karpel <[email protected]>
Signed-off-by: Josh Karpel <[email protected]>
0c63f4c
to
ea4bf41
Compare
Signed-off-by: Josh Karpel <[email protected]>
Signed-off-by: Josh Karpel <[email protected]>
pytest==7.4.4 | ||
pytest-asyncio==0.17.2 | ||
pytest-aiohttp==1.1.0 | ||
pytest-env |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
https://pypi.org/project/pytest-env/
I was struggling with how to conveniently override the environment variables that control the metrics push interval early enough, and in every case where Pytest is being used, so I introduced this new dependency to handle it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i mentioned this in the other comment. Use python/ray/serve/tests/BUILD.bazel
to set env variables.
you should revert this change
I went back through our internal discussions on this and unfortunately I can't find any profiling results from this change specifically in our setup :( I'll see if I can get a chance to reproduce them but my time is pretty limited right now. |
Signed-off-by: Josh Karpel <[email protected]>
Signed-off-by: Josh Karpel <[email protected]>
pytest.ini
Outdated
RAY_SERVE_REPLICA_AUTOSCALING_METRIC_PUSH_INTERVAL_S = 0.1 | ||
RAY_SERVE_HANDLE_AUTOSCALING_METRIC_PUSH_INTERVAL_S = 0.1 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The place you want to add this is python/ray/serve/tests/BUILD.bazel. In ray we manage env variables through bazel
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The trouble with that is it won't apply to local runs if you're not using bazel - is the expectation that bazel is used for local test runs too? If so I'm happy to set these there, but I guess I assumed that it should be possible to use plain pytest
commands locally.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yea, that is a pain point we have accepted for now, to use bazel to run local tests.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, I see! I will set them there then.
2 * self._config.metrics_interval_s, | ||
RAY_SERVE_MIN_HANDLE_METRICS_TIMEOUT_S, | ||
) | ||
timeout_s = 2.5 * RAY_SERVE_HANDLE_AUTOSCALING_METRIC_PUSH_INTERVAL_S |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
right but why drop RAY_SERVE_MIN_HANDLE_METRICS_TIMEOUT_S
, wouldn't that be regression in feature?
@@ -1,3 +1,5 @@ | |||
from __future__ import annotations |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
what does this do?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It lets https://github.com/ray-project/ray/pull/55016/files#diff-b81ab953bc243bf2c453db721719903571d53fdd13d9607a292a4658951d66f0R1068 work - no need to quote forward-referenced types
python/ray/serve/_private/router.py
Outdated
) | ||
shared.register(self) | ||
else: | ||
self.metrics_pusher.register_or_update_task( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i think we should also use the shared pusher in the else case.
python/ray/serve/_private/router.py
Outdated
self._metrics_pusher.start() | ||
|
||
self._metrics_pusher.register_or_update_task( | ||
"push_metrics_to_controller", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: extract ReplicaMetricsManager.PUSH_METRICS_TO_CONTROLLER_TASK_NAME
into a constant in commons.py. And use that here
pytest==7.4.4 | ||
pytest-asyncio==0.17.2 | ||
pytest-aiohttp==1.1.0 | ||
pytest-env |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i mentioned this in the other comment. Use python/ray/serve/tests/BUILD.bazel
to set env variables.
you should revert this change
…metrics timeout Signed-off-by: Josh Karpel <[email protected]>
2bf1f57
to
e6a6c25
Compare
Signed-off-by: Josh Karpel <[email protected]>
…deprecated (ray-project#55102) ## Why are these changes needed? Deprecation announcements for this future PR ray-project#55016 ## Related issue number Related to ray-project#45777 ## Checks - [x] I've signed off every commit(by using the -s flag, i.e., `git commit -s`) in this PR. - [x] I've run `scripts/format.sh` to lint the changes in this PR. - [x] I've included any doc changes needed for https://docs.ray.io/en/master/. - [ ] I've added any new APIs to the API Reference. For example, if I added a method in Tune, I've added it in `doc/source/tune/api/` under the corresponding `.rst` file. - [x] I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/ - Testing Strategy - [x] Unit tests - [ ] Release tests - [ ] This PR is not tested :( --------- Signed-off-by: Josh Karpel <[email protected]> Signed-off-by: Douglas Strodtman <[email protected]>
# Conflicts: # python/ray/serve/_private/autoscaling_state.py # python/ray/serve/_private/controller.py # python/ray/serve/_private/replica.py # python/ray/serve/_private/router.py # python/ray/serve/tests/BUILD.bazel # python/ray/serve/tests/test_deploy_2.py
Signed-off-by: Josh Karpel <[email protected]>
Signed-off-by: Josh Karpel <[email protected]>
Signed-off-by: Josh Karpel <[email protected]>
Signed-off-by: Josh Karpel <[email protected]>
Signed-off-by: Josh Karpel <[email protected]>
Signed-off-by: Josh Karpel <[email protected]>
Why are these changes needed?
We're seeing a lot of pressure on the Serve Controller from metrics push tasks when running thousands of Serve apps. A lot of this pressure is purely from the overhead of lots of RPC connections incoming to the controller. We can amortize this overhead (and presumably similar overhead in the handles too) by having the metrics push happen at the per-process level instead of the per-handle level.
We've made this change on our setup and it has reduced CPU time spent on this in the Controller, and also our ingress application replicas that have all the handles.
See also #55102
Related issue number
Closes #45777
Second attempt on #45957 , see previous discussion there for more context.
Checks
git commit -s
) in this PR.scripts/format.sh
to lint the changes in this PR.method in Tune, I've added it in
doc/source/tune/api/
under thecorresponding
.rst
file.