
Conversation

nadongjun
Contributor

Why are these changes needed?

This PR introduces deployment-level autoscaling observability in Serve. The controller now emits a single, structured JSON log line (serve_autoscaling_snapshot) per autoscaling-enabled deployment on each control-loop tick.

This avoids recomputation in the controller call sites and provides a stable, machine-parsable surface for tooling and debugging.

Changed

  • Add get_observability_snapshot in AutoscalingState and manager wrapper to generate compact snapshots (replica counts, queued/total requests, metric freshness).
  • Add ServeEventSummarizer to build payloads, reduce duplicate logs, and summarize recent scaling decisions.

Example log (single line):

Logs can be found in controller log files, e.g. /tmp/ray/session_2025-09-03_21-12-01_095657_13385/logs/serve/controller_13474.log.

serve_autoscaling_snapshot {"ts":"2025-09-04T06:12:11Z","app":"default","deployment":"worker","current_replicas":2,"target_replicas":2,"replicas_allowed":{"min":1,"max":8},"scaling_status":"stable","policy":"default","metrics":{"look_back_period_s":10.0,"queued_requests":0.0,"total_requests":0.0},"metrics_health":"ok","errors":[],"decisions":[{"ts":"2025-09-04T06:12:11Z","from":0,"to":2,"reason":"current=0, proposed=2"},{"ts":"2025-09-04T06:12:11Z","from":2,"to":2,"reason":"current=2, proposed=2"}]}
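For tooling that consumes these logs, here is a minimal parsing sketch (the serve_autoscaling_snapshot prefix and payload fields are taken from the example above; the session path differs per run):

```python
import json
from pathlib import Path

PREFIX = "serve_autoscaling_snapshot "

def iter_snapshots(log_path: str):
    """Yield parsed snapshot payloads from a Serve controller log."""
    for line in Path(log_path).read_text().splitlines():
        # The prefix may follow the standard log-record preamble.
        idx = line.find(PREFIX)
        if idx != -1:
            yield json.loads(line[idx + len(PREFIX):])

for snap in iter_snapshots(
    "/tmp/ray/session_latest/logs/serve/controller_13474.log"
):
    print(snap["deployment"], snap["current_replicas"], "->", snap["target_replicas"])
```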

Follow-ups

  • Expose the same snapshot data via serve status -v and CLI/SDK surfaces.
  • Aggregate per-app snapshots and external scaler history.

Related issue number

#55834

Checks

  • I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
    • I've added any new APIs to the API Reference. For example, if I added a
      method in Tune, I've added it in doc/source/tune/api/ under the
      corresponding .rst file.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

@nadongjun nadongjun requested a review from a team as a code owner September 4, 2025 06:49
@nadongjun nadongjun changed the title [Serve][1/N] Add deployment-level autoscaling snapshot and event summarizer [Serve][2/N] Add deployment-level autoscaling snapshot and event summarizer Sep 4, 2025
Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request introduces valuable observability features for autoscaling in Ray Serve by adding structured JSON logs for autoscaling snapshots. The implementation is solid, with a new ServeEventSummarizer to handle log formatting and throttling, and new methods in AutoscalingState to provide the necessary data.

My review includes a few suggestions for improvement:

  • A high-severity issue where a hardcoded policy name is used in ScalingDecision objects, which should be corrected to use the dynamically determined policy name.
  • A medium-severity issue in the logging utility where missing timestamps are replaced with the current time, which could be misleading.
  • A medium-severity suggestion to refactor duplicated logic for accessing configuration values to improve code maintainability.

@ray-gardener ray-gardener bot added serve Ray Serve Related Issue community-contribution Contributed by the community labels Sep 4, 2025
Signed-off-by: Dongjun Na <[email protected]>
Contributor

@abrarsheikh abrarsheikh left a comment


My main feedback on this PR is that we create many intermediate free-form dictionaries, and it is not clear to me why we need them all. More importantly, they create future ambiguity about what each dictionary is supposed to contain, making the code harder to maintain. The code can be reorganized to use typed objects for functions that need to return large dictionaries.

Contributor

@akyang-anyscale akyang-anyscale left a comment


Thanks for the contribution @nadongjun! Have you thought about how this would change/work with application-level autoscaling, which is in flight (#56149)? When application-level autoscaling is enabled, a deployment does not autoscale by itself, so that may change how users should interpret the logs.

As feedback for the PR, I would recommend packaging the various autoscaling-relevant values into objects and passing those objects around. It's somewhat difficult to track all the different variables and where they come from, which makes the code a bit harder to parse.

- Rename get_observability_snapshot → get_snapshot for clarity
- Rename proposed_replicas → target_replicas across snapshot flow
- Return last_metrics_age_s=None when no metrics; map to "unknown" in summarizer
- Flatten replicas_allowed{min,max} into top-level min, max in snapshot payload
- Move look_back_period_s to top-level for consistency
- Rename DecisionSummary → AutoscalingDecisionSummary for clarity
- Replace tuple-based SnapshotSignature with typed dataclass
- Use DeploymentID directly as dedupe key instead of (app_name, dep_name)
- Inline snapshot computation in controller; remove _compute_snapshot_inputs
- Push scaling_status formatting into log_deployment_snapshot
- Update tests to validate new payload shape (min/max, no replicas_allowed)

Signed-off-by: Dongjun Na <[email protected]>
- Standardize payload to return 'timestamp_s' for snapshots.

- Return metrics health as last_metrics_age_s

Signed-off-by: Dongjun Na <[email protected]>
@nadongjun
Contributor Author

@abrarsheikh @akyang-anyscale Thanks for the detailed review!

@akyang-anyscale That’s a fair point. The serve_autoscaling_snapshot log format currently covers only deployment-level autoscaling. Once application-level autoscaling lands, we’ll log deployment-level and application-level snapshots separately.

I’ve already switched to typed dataclasses (e.g., DeploymentSnapshot, AutoscalingDecisionSummary) so the controller passes structured objects instead of dicts. I’ll do the same for application-level autoscaling to keep things consistent.
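For context, a rough sketch of what those typed objects could look like, using field names from the example payload and the rename notes above (illustrative only, not the exact classes in this PR):

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class AutoscalingDecisionSummary:
    timestamp_str: str
    current_num_replicas: int
    target_num_replicas: int
    reason: str

@dataclass
class DeploymentSnapshot:
    timestamp_s: float
    app: str
    deployment: str
    current_replicas: int
    target_replicas: int
    min: int  # flattened from replicas_allowed.min
    max: int  # flattened from replicas_allowed.max
    scaling_status: str
    policy: str
    look_back_period_s: float
    queued_requests: float
    ongoing_requests: float
    last_metrics_age_s: Optional[float]  # None is rendered as "unknown"
    decisions: List[AutoscalingDecisionSummary] = field(default_factory=list)
```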


return total_requests

def get_deployment_snapshot(self, curr_target_num_replicas: int) -> Dict[str, Any]:
Contributor


get_deployment_snapshot is an expensive operation to perform on every control-loop iteration because it calls get_total_num_requests, which loops over replicas and handles; these are expensive operations on a large cluster. Second, it calls self.get_decision_num_replicas, which internally executes the autoscaling policy and can also be expensive.

I suggest instead constructing the DeploymentAutoscalingSnapshot object every time get_decision_num_replicas runs and storing it on the class object. Then get_deployment_snapshot simply returns the cached DeploymentAutoscalingSnapshot object.

Contributor Author


Good call, I’ve applied this. Now the snapshot is constructed once during get_decision_num_replicas() and cached, and get_deployment_snapshot() just returns the cached object.
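For readers following along, a minimal sketch of that caching pattern (method names come from the discussion; _run_policy and _build_snapshot are hypothetical stand-ins for the real internals):

```python
from typing import Optional

class AutoscalingState:
    def __init__(self) -> None:
        self._cached_snapshot: Optional["DeploymentSnapshot"] = None

    def get_decision_num_replicas(self, curr_target_num_replicas: int) -> int:
        # Run the autoscaling policy; expensive on large clusters.
        decision = self._run_policy(curr_target_num_replicas)
        # Build the snapshot once here, where all inputs are already in hand,
        # instead of recomputing them on every get_deployment_snapshot() call.
        self._cached_snapshot = self._build_snapshot(decision)
        return decision

    def get_deployment_snapshot(self) -> Optional["DeploymentSnapshot"]:
        # Cheap: return whatever the last control-loop tick produced.
        return self._cached_snapshot
```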

and self.app == other.app
and self.deployment == other.deployment
)


Bug: Timestamp Comparison Fails Snapshot Deduplication

The is_scaling_equivalent method, meant for autoscaling snapshot log deduplication, compares timestamp_str, app, and deployment. Since timestamp_str always differs between snapshots, this logic prevents effective deduplication, causing the controller to log every snapshot even when the scaling state is unchanged. Comparing actual scaling-related fields would be more effective.
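A sketch of one possible fix for the method, comparing only scaling-relevant fields and deliberately ignoring the timestamp (field names follow the snapshot payload above):

```python
def is_scaling_equivalent(self, other: "DeploymentSnapshot") -> bool:
    # Two snapshots are equivalent for dedupe purposes when the scaling
    # state itself is unchanged; timestamps always differ, so skip them.
    return (
        self.app == other.app
        and self.deployment == other.deployment
        and self.current_replicas == other.current_replicas
        and self.target_replicas == other.target_replicas
        and self.scaling_status == other.scaling_status
    )
```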


@nadongjun nadongjun requested a review from abrarsheikh October 2, 2025 07:41

then validates the JSON payload shape and a few key fields. This test validates only the earliest snapshot.
"""

DEPLOY_NAME = f"snap_app_{int(time.time())}"
Contributor


It's desirable to test with actual autoscaling behavior so that we assert some real values; the zero-traffic case is not interesting.

Contributor Author


I removed the fixed sleeps and tuned the autoscaling config so it scales quickly. The test now sends load to trigger 1 -> 2 replicas, making it deterministic and less flaky.
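For illustration, this is the kind of config that makes such a test scale quickly without fixed sleeps (the values and deployment name here are illustrative, not the exact ones in the PR; parameter names follow recent Ray Serve releases):

```python
from ray import serve

@serve.deployment(
    autoscaling_config={
        "min_replicas": 1,
        "max_replicas": 2,
        "target_ongoing_requests": 1,  # scale up as soon as requests queue
        "look_back_period_s": 1,
        "upscale_delay_s": 0,          # react immediately
        "downscale_delay_s": 0,
    },
    max_ongoing_requests=1,
)
class Snap:
    async def __call__(self) -> str:
        return "ok"
```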

Signed-off-by: Dongjun Na <[email protected]>
@nadongjun nadongjun requested a review from abrarsheikh October 3, 2025 05:20
Signed-off-by: Dongjun Na <[email protected]>
Comment on lines +802 to +807
@staticmethod
def format_metrics_health_text(
*,
time_since_last_collected_metrics_s: Optional[float],
look_back_period_s: Optional[float],
) -> str:
Contributor


@nadongjun Why are we passing the look_back_period_s parameter here? Is there some use case for this parameter in this function?

Comment on lines +448 to +450
self._autoscaling_logger.info(
"", extra={"type": "deployment", "snapshot": payload}
)
Contributor


Why do we have extras here? What is type, and why do we need it? Also, why not just do self._autoscaling_logger.info(payload)?

Contributor Author


The type field is included to help CLI tools easily distinguish between snapshot types such as application, deployment, and external in the autoscaling_snapshot_*.log files.

The extra parameter is used to ensure the logs are emitted in structured JSON format; calling self._autoscaling_logger.info(payload) alone would output a plain string instead of structured data.
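To illustrate the point about extra: with the stdlib logging module, keyword fields passed via extra become attributes on the LogRecord, which a JSON formatter can then serialize, whereas a dict passed as the message is simply str()-ed. A standalone sketch (not the actual Serve formatter):

```python
import json
import logging

class SnapshotJSONFormatter(logging.Formatter):
    def format(self, record: logging.LogRecord) -> str:
        # Fields passed via `extra` land directly on the record object.
        return json.dumps({
            "type": getattr(record, "type", None),
            "snapshot": getattr(record, "snapshot", None),
        })

logger = logging.getLogger("autoscaling_snapshot")
handler = logging.StreamHandler()
handler.setFormatter(SnapshotJSONFormatter())
logger.addHandler(handler)
logger.setLevel(logging.INFO)

logger.info("", extra={"type": "deployment", "snapshot": {"app": "default"}})
# prints: {"type": "deployment", "snapshot": {"app": "default"}}
```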

Comment on lines +288 to +292
DecisionRecord(
timestamp_str=time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime()),
current_num_replicas=ctx.current_num_replicas,
target_num_replicas=decision_num_replicas,
reason=f"current={ctx.current_num_replicas}, target={decision_num_replicas}",
Contributor


The reason for the decision here is currently just the current_num_replicas and target_num_replicas. Shouldn't the reason be policy-specific, explaining why this decision was made?

Contributor Author


Yes, that’s right. The current version is a simple implementation that only logs the current and target replica counts to indicate when autoscaling is triggered. We plan to extend it later to include policy-specific reasons for external and custom policies.

Comment on lines 682 to 685
policy_name=ctx.config.policy.name,
look_back_period_s=look_back_period_s,
queued_requests=float(queued_requests),
ongoing_requests=float(ctx.total_num_requests),
Contributor


Are the metrics displayed policy agnostic?

Contributor Author


Yes, the displayed metrics are policy-agnostic.

While fields like policy_name or look_back_period_s can differ depending on the active policy, they’re only contextual information. The metrics such as queued_requests and ongoing_requests are runtime values gathered from handles and replicas, independent of any specific policy logic.

Signed-off-by: Dongjun Na <[email protected]>
Signed-off-by: Dongjun Na <[email protected]>
Signed-off-by: Dongjun Na <[email protected]>
@nadongjun nadongjun requested a review from abrarsheikh October 13, 2025 02:27