
Conversation

nadongjun
Contributor

Why are these changes needed?

This PR introduces deployment-level autoscaling observability in Serve. The controller now emits a single, structured JSON log line (serve_autoscaling_snapshot) per autoscaling-enabled deployment on each control-loop tick.

This avoids recomputation in the controller call sites and provides a stable, machine-parsable surface for tooling and debugging.

Changed

  • Add get_observability_snapshot in AutoscalingState and manager wrapper to generate compact snapshots (replica counts, queued/total requests, metric freshness).
  • Add ServeEventSummarizer to build payloads, reduce duplicate logs, and summarize recent scaling decisions.

Example log (single line):

Logs can be found in controller log files, e.g. /tmp/ray/session_2025-09-03_21-12-01_095657_13385/logs/serve/controller_13474.log.

serve_autoscaling_snapshot {"ts":"2025-09-04T06:12:11Z","app":"default","deployment":"worker","current_replicas":2,"target_replicas":2,"replicas_allowed":{"min":1,"max":8},"scaling_status":"stable","policy":"default","metrics":{"look_back_period_s":10.0,"queued_requests":0.0,"total_requests":0.0},"metrics_health":"ok","errors":[],"decisions":[{"ts":"2025-09-04T06:12:11Z","from":0,"to":2,"reason":"current=0, proposed=2"},{"ts":"2025-09-04T06:12:11Z","from":2,"to":2,"reason":"current=2, proposed=2"}]}
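For tooling that consumes these logs, here is a minimal parsing sketch (the serve_autoscaling_snapshot prefix and payload fields are taken from the example above; the session path differs per run):

```python
import json
from pathlib import Path

PREFIX = "serve_autoscaling_snapshot "

def iter_snapshots(log_path: str):
    """Yield parsed snapshot payloads from a Serve controller log."""
    for line in Path(log_path).read_text().splitlines():
        # The prefix may follow the standard log-record preamble.
        idx = line.find(PREFIX)
        if idx != -1:
            yield json.loads(line[idx + len(PREFIX):])

for snap in iter_snapshots(
    "/tmp/ray/session_latest/logs/serve/controller_13474.log"
):
    print(snap["deployment"], snap["current_replicas"], "->", snap["target_replicas"])
```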

Follow-ups

  • Expose the same snapshot data via serve status -v and CLI/SDK surfaces.
  • Aggregate per-app snapshots and external scaler history.

Related issue number

#55834

Checks

  • I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
    • I've added any new APIs to the API Reference. For example, if I added a
      method in Tune, I've added it in doc/source/tune/api/ under the
      corresponding .rst file.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

@nadongjun nadongjun requested a review from a team as a code owner September 4, 2025 06:49
@nadongjun nadongjun changed the title [Serve][1/N] Add deployment-level autoscaling snapshot and event summarizer [Serve][2/N] Add deployment-level autoscaling snapshot and event summarizer Sep 4, 2025
Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request introduces valuable observability features for autoscaling in Ray Serve by adding structured JSON logs for autoscaling snapshots. The implementation is solid, with a new ServeEventSummarizer to handle log formatting and throttling, and new methods in AutoscalingState to provide the necessary data.

My review includes a few suggestions for improvement:

  • A high-severity issue where a hardcoded policy name is used in ScalingDecision objects, which should be corrected to use the dynamically determined policy name.
  • A medium-severity issue in the logging utility where missing timestamps are replaced with the current time, which could be misleading.
  • A medium-severity suggestion to refactor duplicated logic for accessing configuration values to improve code maintainability.

@ray-gardener ray-gardener bot added serve Ray Serve Related Issue community-contribution Contributed by the community labels Sep 4, 2025
Signed-off-by: Dongjun Na <[email protected]>
Contributor

@abrarsheikh abrarsheikh left a comment


My main feedback on this PR is that we create many intermediate free-form dictionaries, and it is not clear to me why we need them all. More importantly, they create future ambiguity about what each dictionary is supposed to contain, making the code harder to maintain. The code can be reorganized to use typed objects for functions that need to return large dictionaries.

Contributor

@akyang-anyscale akyang-anyscale left a comment


Thanks for the contribution @nadongjun! Have you thought about how this would change/work with application-level autoscaling, which is in flight (#56149)? When application-level autoscaling is enabled, a deployment does not autoscale by itself, so that may change how users should interpret the logs.

As feedback for the PR, I would recommend packaging the various autoscaling-relevant values into objects and passing those objects around. It's somewhat difficult to track all the different variables and where they come from, which makes the code a bit harder to parse.

- Rename get_observability_snapshot → get_snapshot for clarity
- Rename proposed_replicas → target_replicas across snapshot flow
- Return last_metrics_age_s=None when no metrics; map to "unknown" in summarizer
- Flatten replicas_allowed{min,max} into top-level min, max in snapshot payload
- Move look_back_period_s to top-level for consistency
- Rename DecisionSummary → AutoscalingDecisionSummary for clarity
- Replace tuple-based SnapshotSignature with typed dataclass
- Use DeploymentID directly as dedupe key instead of (app_name, dep_name)
- Inline snapshot computation in controller; remove _compute_snapshot_inputs
- Push scaling_status formatting into log_deployment_snapshot
- Update tests to validate new payload shape (min/max, no replicas_allowed)

Signed-off-by: Dongjun Na <[email protected]>
- Standardize payload to return 'timestamp_s' for snapshots.

- Return metrics health as last_metrics_age_s

Signed-off-by: Dongjun Na <[email protected]>
@nadongjun
Contributor Author

@abrarsheikh @akyang-anyscale Thanks for the detailed review!

@akyang-anyscale That’s a fair point. The serve_autoscaling_snapshot log format currently covers only deployment-level autoscaling. Once application-level autoscaling lands, we’ll log deployment-level and application-level snapshots separately.

I’ve already switched to typed dataclasses (e.g., DeploymentSnapshot, AutoscalingDecisionSummary) so the controller passes structured objects instead of dicts. I’ll do the same for application-level autoscaling to keep things consistent.
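For context, a rough sketch of what those typed objects could look like, using field names from the example payload and the rename notes above (illustrative only, not the exact classes in this PR):

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class AutoscalingDecisionSummary:
    timestamp_str: str
    current_num_replicas: int
    target_num_replicas: int
    reason: str

@dataclass
class DeploymentSnapshot:
    timestamp_s: float
    app: str
    deployment: str
    current_replicas: int
    target_replicas: int
    min: int  # flattened from replicas_allowed.min
    max: int  # flattened from replicas_allowed.max
    scaling_status: str
    policy: str
    look_back_period_s: float
    queued_requests: float
    ongoing_requests: float
    last_metrics_age_s: Optional[float]  # None is rendered as "unknown"
    decisions: List[AutoscalingDecisionSummary] = field(default_factory=list)
```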


return total_requests

def get_deployment_snapshot(self, curr_target_num_replicas: int) -> Dict[str, Any]:
Contributor


get_deployment_snapshot is an expensive operation to perform on every control-loop iteration because it calls get_total_num_requests, which loops over replicas and handles; these are expensive operations on a large cluster. Second, it calls self.get_decision_num_replicas, which internally executes the autoscaling policy and can also be expensive.

I suggest instead constructing the DeploymentAutoscalingSnapshot object every time get_decision_num_replicas runs and storing it on the class object. Then get_deployment_snapshot simply returns the cached DeploymentAutoscalingSnapshot object.

Contributor Author


Good call, I’ve applied this. Now the snapshot is constructed once during get_decision_num_replicas() and cached, and get_deployment_snapshot() just returns the cached object.
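For readers following along, a minimal sketch of that caching pattern (method names come from the discussion; _run_policy and _build_snapshot are hypothetical stand-ins for the real internals):

```python
from typing import Optional

class AutoscalingState:
    def __init__(self) -> None:
        self._cached_snapshot: Optional["DeploymentSnapshot"] = None

    def get_decision_num_replicas(self, curr_target_num_replicas: int) -> int:
        # Run the autoscaling policy; expensive on large clusters.
        decision = self._run_policy(curr_target_num_replicas)
        # Build the snapshot once here, where all inputs are already in hand,
        # instead of recomputing them on every get_deployment_snapshot() call.
        self._cached_snapshot = self._build_snapshot(decision)
        return decision

    def get_deployment_snapshot(self) -> Optional["DeploymentSnapshot"]:
        # Cheap: return whatever the last control-loop tick produced.
        return self._cached_snapshot
```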

and self.app == other.app
and self.deployment == other.deployment
)


Bug: Timestamp Comparison Fails Snapshot Deduplication

The is_scaling_equivalent method, meant for autoscaling snapshot log deduplication, compares timestamp_str, app, and deployment. Since timestamp_str always differs between snapshots, this logic prevents effective deduplication, causing the controller to log every snapshot even when the scaling state is unchanged. Comparing actual scaling-related fields would be more effective.
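A sketch of one possible fix for the method, comparing only scaling-relevant fields and deliberately ignoring the timestamp (field names follow the snapshot payload above):

```python
def is_scaling_equivalent(self, other: "DeploymentSnapshot") -> bool:
    # Two snapshots are equivalent for dedupe purposes when the scaling
    # state itself is unchanged; timestamps always differ, so skip them.
    return (
        self.app == other.app
        and self.deployment == other.deployment
        and self.current_replicas == other.current_replicas
        and self.target_replicas == other.target_replicas
        and self.scaling_status == other.scaling_status
    )
```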


@nadongjun nadongjun requested a review from abrarsheikh October 2, 2025 07:41

then validates the JSON payload shape and a few key fields. This test validates only the earliest snapshot.
"""

DEPLOY_NAME = f"snap_app_{int(time.time())}"
Contributor


It's desirable to test with actual autoscaling behavior so that we assert some real values; the zero-traffic case is not interesting.

Contributor Author


I removed the fixed sleeps and tuned the autoscaling config so it scales quickly. The test now sends load to trigger 1 -> 2 replicas, making it deterministic and less flaky.
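For illustration, this is the kind of config that makes such a test scale quickly without fixed sleeps (the values and deployment name here are illustrative, not the exact ones in the PR; parameter names follow recent Ray Serve releases):

```python
from ray import serve

@serve.deployment(
    autoscaling_config={
        "min_replicas": 1,
        "max_replicas": 2,
        "target_ongoing_requests": 1,  # scale up as soon as requests queue
        "look_back_period_s": 1,
        "upscale_delay_s": 0,          # react immediately
        "downscale_delay_s": 0,
    },
    max_ongoing_requests=1,
)
class Snap:
    async def __call__(self) -> str:
        return "ok"
```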

Signed-off-by: Dongjun Na <[email protected]>
@nadongjun nadongjun requested a review from abrarsheikh October 3, 2025 05:20
Signed-off-by: Dongjun Na <[email protected]>
Comment on lines +802 to +807
@staticmethod
def format_metrics_health_text(
*,
time_since_last_collected_metrics_s: Optional[float],
look_back_period_s: Optional[float],
) -> str:
Contributor


@nadongjun Why are we passing the look_back_period_s parameter here? Is there some use case for this parameter in this function?

Comment on lines +448 to +450
self._autoscaling_logger.info(
"", extra={"type": "deployment", "snapshot": payload}
)
Contributor


Why do we have extras here? What is type, and why do we need it? Also, why not just do self._autoscaling_logger.info(payload)?

Contributor Author


The type field is included to help CLI tools easily distinguish between snapshot types such as application, deployment, and external in the autoscaling_snapshot_*.log files.

The extra parameter is used to ensure the logs are emitted in structured JSON format; calling self._autoscaling_logger.info(payload) alone would output a plain string instead of structured data.
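To illustrate the point about extra: with the stdlib logging module, keyword fields passed via extra become attributes on the LogRecord, which a JSON formatter can then serialize, whereas a dict passed as the message is simply str()-ed. A standalone sketch (not the actual Serve formatter):

```python
import json
import logging

class SnapshotJSONFormatter(logging.Formatter):
    def format(self, record: logging.LogRecord) -> str:
        # Fields passed via `extra` land directly on the record object.
        return json.dumps({
            "type": getattr(record, "type", None),
            "snapshot": getattr(record, "snapshot", None),
        })

logger = logging.getLogger("autoscaling_snapshot")
handler = logging.StreamHandler()
handler.setFormatter(SnapshotJSONFormatter())
logger.addHandler(handler)
logger.setLevel(logging.INFO)

logger.info("", extra={"type": "deployment", "snapshot": {"app": "default"}})
# prints: {"type": "deployment", "snapshot": {"app": "default"}}
```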

Comment on lines +288 to +292
DecisionRecord(
timestamp_str=time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime()),
current_num_replicas=ctx.current_num_replicas,
target_num_replicas=decision_num_replicas,
reason=f"current={ctx.current_num_replicas}, target={decision_num_replicas}",
Contributor


The reason for the decision here is currently just the current_num_replicas and target_num_replicas. Shouldn't the reason be policy-specific, explaining why this decision was made?

Contributor Author


Yes, that’s right. The current version is a simple implementation that only logs the current and target replica counts to indicate when autoscaling is triggered. We plan to extend it later to include policy-specific reasons for external and custom policies.

Comment on lines 682 to 685
policy_name=ctx.config.policy.name,
look_back_period_s=look_back_period_s,
queued_requests=float(queued_requests),
ongoing_requests=float(ctx.total_num_requests),
Contributor


Are the metrics displayed policy agnostic?

Contributor Author


Yes, the displayed metrics are policy-agnostic.

While fields like policy_name or look_back_period_s can differ depending on the active policy, they’re only contextual information. The metrics such as queued_requests and ongoing_requests are runtime values gathered from handles and replicas, independent of any specific policy logic.

Signed-off-by: Dongjun Na <[email protected]>
Signed-off-by: Dongjun Na <[email protected]>
Signed-off-by: Dongjun Na <[email protected]>
@nadongjun nadongjun requested a review from abrarsheikh October 13, 2025 02:27