
Conversation

ryanaoleary
Collaborator

@ryanaoleary ryanaoleary commented Mar 7, 2025

Why are these changes needed?

This PR implements an alpha version of the RayService Incremental Upgrade REP.

The RayService controller logic to reconcile a RayService during an incremental upgrade is as follows:

  1. Validate the IncrementalUpgradeOptions and accept/reject the RayService CR accordingly
  2. Call reconcileGateway - on the first call this creates a new Gateway CR; subsequent calls update the Listeners as needed based on any changes to the RayService.Spec.
  3. Call reconcileHTTPRoute - on the first call this creates an HTTPRoute CR with two backendRefs, one pointing to the old cluster and one to the pending cluster, with weights 100 and 0 respectively. Each subsequent call to reconcileHTTPRoute updates the HTTPRoute by changing the weight of each backendRef by StepSizePercent until the weight associated with each cluster equals that cluster's TargetCapacity. The backendRef weight is exposed through the RayService status field TrafficRoutedPercent. The weight is only changed if at least IntervalSeconds have passed since RayService.Status.LastTrafficMigratedTime; otherwise the controller waits until the next iteration and checks again.
  4. The controller then checks whether TrafficRoutedPercent == TargetCapacity; if so, the target_capacity can be updated for one of the clusters.
  5. The controller then calls reconcileServeTargetCapacity. If the total target_capacity of both Serve configs is less than or equal to 100%, the pending cluster's target_capacity can be safely scaled up by MaxSurgePercent. If the total target_capacity is greater than 100%, the active cluster's target_capacity can be decreased by MaxSurgePercent.
  6. The controller then continues with the reconciliation logic as normal. Once the TargetCapacity and TrafficRoutedPercent of the pending RayService's RayCluster both equal 100%, the upgrade is complete. A simplified sketch of steps 3-5 follows this list.
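
For illustration, here is a minimal Go sketch of steps 3-5. The UpgradeState struct and reconcileStep helper are hypothetical condensations written for this description (using Go 1.21+ built-in min/max); they are not the actual types or functions in this PR.

package sketch

import "time"

// UpgradeState condenses the RayService status fields and IncrementalUpgradeOptions
// referenced in steps 3-5. Illustrative only; not the real KubeRay API types.
type UpgradeState struct {
	StepSizePercent         int32
	MaxSurgePercent         int32
	IntervalSeconds         int32
	TrafficRoutedPercent    int32 // HTTPRoute weight currently assigned to the pending cluster
	TargetCapacityPending   int32 // Serve target_capacity of the pending cluster
	TargetCapacityActive    int32 // Serve target_capacity of the active cluster
	LastTrafficMigratedTime time.Time
}

// reconcileStep mirrors one pass of steps 3-5: shift traffic by StepSizePercent at most once
// per IntervalSeconds, then grow or shrink target_capacity by MaxSurgePercent.
func reconcileStep(s *UpgradeState, now time.Time) {
	// Step 3: only move weights if IntervalSeconds have elapsed and traffic still lags behind
	// the pending cluster's target_capacity.
	if now.Sub(s.LastTrafficMigratedTime) >= time.Duration(s.IntervalSeconds)*time.Second &&
		s.TrafficRoutedPercent < s.TargetCapacityPending {
		s.TrafficRoutedPercent = min(s.TrafficRoutedPercent+s.StepSizePercent, s.TargetCapacityPending)
		s.LastTrafficMigratedTime = now
		return // the new weight is written to the HTTPRoute; wait for the next reconcile
	}

	// Steps 4-5: once traffic has caught up to the pending target_capacity, adjust capacity.
	if s.TrafficRoutedPercent == s.TargetCapacityPending && s.TargetCapacityPending < 100 {
		if s.TargetCapacityPending+s.TargetCapacityActive <= 100 {
			// Total target_capacity is at or under 100%: safely scale the pending cluster up.
			s.TargetCapacityPending = min(s.TargetCapacityPending+s.MaxSurgePercent, 100)
		} else {
			// Total target_capacity exceeds 100%: scale the active cluster down instead.
			s.TargetCapacityActive = max(s.TargetCapacityActive-s.MaxSurgePercent, 0)
		}
	}
	// Step 6: the upgrade is complete once TrafficRoutedPercent and TargetCapacityPending reach 100.
}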

Related issue number

#3209

Checks

  • I've made sure the tests are passing.
  • Testing Strategy
    • Unit tests
    • Manual tests
    • This PR is not tested :(

@MortalHappiness MortalHappiness self-assigned this Mar 11, 2025
@ryanaoleary ryanaoleary marked this pull request as ready for review March 25, 2025 11:26
@ryanaoleary ryanaoleary force-pushed the incremental-upgrade branch from 8f9a396 to 486f98b Compare May 15, 2025 00:35
@ryanaoleary
Collaborator Author

ryanaoleary commented May 15, 2025

I've now added unit tests and one basic e2e test for the incremental upgrade feature, so this should be good to start reviewing. In addition to the unit tests, here are instructions for manually testing this feature in your cluster.

  1. Create a cloud provider cluster with a Gateway controller installed. I used a GKE cluster with Gateway API enabled, which installs the GKE gateway controller in your cluster.

  2. Retrieve the name of the Gateway class to use:

kubectl get gatewayclass
NAME                                       CONTROLLER                                   ACCEPTED   AGE
gke-l7-global-external-managed             networking.gke.io/gateway                    True       10h
gke-l7-gxlb                                networking.gke.io/gateway                    True       10h
gke-l7-regional-external-managed           networking.gke.io/gateway                    True       10h
gke-l7-rilb                                networking.gke.io/gateway                    True       10h
gke-persistent-regional-external-managed   networking.gke.io/persistent-ip-controller   True       10h
gke-persistent-regional-internal-managed   networking.gke.io/persistent-ip-controller   True       10h
  3. Install the KubeRay operator in your cluster with a dev image built with these changes:
# from helm-chart/kuberay-operator
# replace image in values.yaml with: `us-docker.pkg.dev/ryanaoleary-gke-dev/kuberay/kuberay` and tag with `latest`, or use your own image

helm install kuberay-operator .
  4. Create a RayService CR named ray-service-incremental-upgrade.yaml with the following specification; feel free to edit the IncrementalUpgradeOptions to test different upgrade behaviors.
apiVersion: ray.io/v1
kind: RayService
metadata:
  name: rayservice-incremental-upgrade
spec:
  upgradeStrategy:
    type: IncrementalUpgrade
    incrementalUpgradeOptions:
      gatewayClassName: gke-l7-rilb
      stepSizePercent: 20
      intervalSeconds: 30
      maxSurgePercent: 80
  serveConfigV2: |
    applications:
      - name: fruit_app
        import_path: fruit.deployment_graph
        route_prefix: /fruit
        runtime_env:
          working_dir: "https://github.com/ray-project/test_dag/archive/78b4a5da38796123d9f9ffff59bab2792a043e95.zip"
        deployments:
          - name: MangoStand
            num_replicas: 1
            user_config:
              price: 3
            ray_actor_options:
              num_cpus: 0.1
          - name: OrangeStand
            num_replicas: 1
            user_config:
              price: 2
            ray_actor_options:
              num_cpus: 0.1
          - name: PearStand
            num_replicas: 1
            user_config:
              price: 1
            ray_actor_options:
              num_cpus: 0.1
          - name: FruitMarket
            num_replicas: 1
            ray_actor_options:
              num_cpus: 0.1
  rayClusterConfig:
    rayVersion: "2.44.0"
    enableInTreeAutoscaling: true
    headGroupSpec:
      rayStartParams: {}
      template:
        spec:
          containers:
            - name: ray-head
              image: rayproject/ray:2.44.0
              resources:
                requests:
                  cpu: "2"
                  memory: "2Gi"
                limits:
                  cpu: "2"
                  memory: "2Gi"
              ports:
                - containerPort: 6379
                  name: gcs-server
                - containerPort: 8265
                  name: dashboard
                - containerPort: 10001
                  name: client
                - containerPort: 8000
                  name: serve
    workerGroupSpecs:
      - groupName: small-group
        replicas: 1
        minReplicas: 1
        maxReplicas: 5
        rayStartParams: {}
        template:
          spec:
            containers:
              - name: ray-worker
                image: rayproject/ray:2.44.0
                resources:
                  requests:
                    cpu: "500m"
                    memory: "2Gi"
                  limits:
                    cpu: "1"
                    memory: "2Gi"
kubectl apply -f ray-service-incremental-upgrade.yaml
  5. Trigger an incremental upgrade by editing the above serveConfigV2 or a field in the PodSpec and re-applying the RayService YAML.

I'll put more comments with manual test results and add more e2e test cases, but this should be good to start reviewing/iterating on to get merge-ready before the v1.4 release.

@andrewsykim
Member

andrewsykim commented May 16, 2025

Tried to test this manually and not seeing Gateway reconcile with this log line:

{"level":"info","ts":"2025-05-16T00:39:32.181Z","logger":"controllers.RayService","msg":"checkIfNeedIncrementalUpgradeUpdate","RayService":{"name":"deepseek-r1-distill-qwen-32b","namespace":"default"},"reconcileID":"ea9e1839-24fd-464a-9f59-49e917f6c495","incrementalUpgradeUpdate":false,"reason":"Gateway for RayService IncrementalUpgrade is not ready."}

Do I need to set spec.gateway for the gateway reconcile to trigger? I didn't think it was needed since the example you shared didn't have it

@ryanaoleary
Collaborator Author

Tried to test this manually and not seeing Gateway reconcile with this log line:

{"level":"info","ts":"2025-05-16T00:39:32.181Z","logger":"controllers.RayService","msg":"checkIfNeedIncrementalUpgradeUpdate","RayService":{"name":"deepseek-r1-distill-qwen-32b","namespace":"default"},"reconcileID":"ea9e1839-24fd-464a-9f59-49e917f6c495","incrementalUpgradeUpdate":false,"reason":"Gateway for RayService IncrementalUpgrade is not ready."}

Do I need to set spec.gateway for the gateway reconcile to trigger? I didn't think it was needed since the example you shared didn't have it

No, it should be called automatically when IncrementalUpgrade is enabled and there are non-nil pending and active RayClusters. I was thinking that spec.Gateway should be set by the controller if it doesn't exist, but the reconcile appears to be failing for some reason, so I'm debugging it now. For the stale cached value issue, I'll change the other places in the code to return the actual Gateway object (rather than just rayServiceInstance.Spec.Gateway) from context, similar to how getRayServiceInstance works.
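
For reference, a minimal sketch of fetching the live Gateway object with the controller-runtime client (rather than reading a possibly stale copy off the RayService spec) could look like the following; the function name and signature are illustrative, not the code in this PR.

package sketch

import (
	"context"

	"k8s.io/apimachinery/pkg/types"
	"sigs.k8s.io/controller-runtime/pkg/client"
	gwv1 "sigs.k8s.io/gateway-api/apis/v1"
)

// getGatewayInstance looks up the Gateway owned by the RayService through the controller-runtime
// client, so callers work with the current object instead of a stale value stored on the spec.
func getGatewayInstance(ctx context.Context, c client.Client, namespace, name string) (*gwv1.Gateway, error) {
	gateway := &gwv1.Gateway{}
	if err := c.Get(ctx, types.NamespacedName{Namespace: namespace, Name: name}, gateway); err != nil {
		return nil, err
	}
	return gateway, nil
}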

@ryanaoleary ryanaoleary force-pushed the incremental-upgrade branch from 83f265f to 17f7aa4 Compare May 21, 2025 11:56
@ryanaoleary
Collaborator Author

I'm running into some issues now with the allowed ports/protocols for listeners with different Gateway controllers (e.g. the GKE controller is pretty restrictive). I'm working on how to send traffic from the Serve service to the Gateway, and on to the active and pending RayCluster head services, through the HTTPRoute. An alternative would be to have users send traffic directly to the Gateway, which would be set to HTTP on port 80, but I don't really want users to have to change the path/endpoint they send traffic to for the Serve applications during an upgrade.
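
For context, the port/protocol decision shows up in the Gateway listener. A rough sketch with the Gateway API Go types, where the listener name and the fall-back-to-80 behavior are illustrative assumptions rather than the code in this PR:

package sketch

import gwv1 "sigs.k8s.io/gateway-api/apis/v1"

// serveListener builds the listener the controller would attach to the Gateway for Serve traffic.
// Ideally this uses the Serve port (8000), but some Gateway controllers (e.g. the GKE controller
// mentioned above) restrict the allowed port/protocol combinations, in which case a commonly
// allowed port such as 80 has to be used instead.
func serveListener(port gwv1.PortNumber) gwv1.Listener {
	return gwv1.Listener{
		Name:     "rayservice-serve-listener", // illustrative name
		Port:     port,
		Protocol: gwv1.HTTPProtocolType,
	}
}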

@andrewsykim
Member

Discussed with Ryan offline: there's a validation in the GKE gateway controller that disallows port 8000 for Serve, but this validation will be removed soon. For now we will test with allowed ports like port 80 and change it back to 8000 before merging.

@ryanaoleary ryanaoleary requested a review from andrewsykim May 24, 2025 02:20
@ryanaoleary
Collaborator Author

I moved the e2e test to its own folder since it's an experimental feature and shouldn't be part of the pre-submit tests yet.

@andrewsykim
Member

@ryanaoleary can you resolve all the merge conflicts? I can do some testing on this branch once the conflicts are resolved

@ryanaoleary ryanaoleary force-pushed the incremental-upgrade branch from adc6236 to 7694114 Compare June 4, 2025 04:20
@ryanaoleary
Collaborator Author

ryanaoleary commented Jun 4, 2025

@ryanaoleary can you resolve all the merge conflicts? I can do some testing on this branch once the conflicts are resolved

All the conflicts have been resolved, this is the image I'm currently using for testing: us-docker.pkg.dev/ryanaoleary-gke-dev/kuberay/kuberay:latest

@ryanaoleary
Collaborator Author

ryanaoleary commented Jun 4, 2025

Fixed a bug with setting target_capacity in 5f46a1b. I've now tested it fully end-to-end with Istio as follows:

  1. Install Istio
istioctl install --set profile=test -y
  2. Create an Istio GatewayClass with the following contents:
apiVersion: gateway.networking.k8s.io/v1
kind: GatewayClass
metadata:
  name: istio
spec:
  controllerName: istio.io/gateway-controller
kubectl apply -f istio-gateway-class.yaml
  3. Create a RayService CR with the following spec:
apiVersion: ray.io/v1
kind: RayService
metadata:
  name: rayservice-incremental-upgrade
spec:
  upgradeStrategy:
    type: IncrementalUpgrade
    incrementalUpgradeOptions:
      gatewayClassName: istio
      stepSizePercent: 20
      intervalSeconds: 30
      maxSurgePercent: 80
  serveConfigV2: |
    applications:
      - name: fruit_app
        import_path: fruit.deployment_graph
        route_prefix: /fruit
        runtime_env:
          working_dir: "https://github.com/ray-project/test_dag/archive/78b4a5da38796123d9f9ffff59bab2792a043e95.zip"
        deployments:
          - name: MangoStand
            num_replicas: 1
            user_config:
              price: 3
            ray_actor_options:
              num_cpus: 1
          - name: OrangeStand
            num_replicas: 1
            user_config:
              price: 2
            ray_actor_options:
              num_cpus: 1
          - name: PearStand
            num_replicas: 1
            user_config:
              price: 1
            ray_actor_options:
              num_cpus: 1
          - name: FruitMarket
            num_replicas: 1
            ray_actor_options:
              num_cpus: 1
  rayClusterConfig:
    rayVersion: "2.44.0"
    enableInTreeAutoscaling: true
    headGroupSpec:
      rayStartParams: {}
      template:
        spec:
          containers:
            - name: ray-head
              image: rayproject/ray:2.44.0
              resources:
                requests:
                  cpu: "2"
                  memory: "2Gi"
                limits:
                  cpu: "2"
                  memory: "2Gi"
              ports:
                - containerPort: 6379
                  name: gcs-server
                - containerPort: 8265
                  name: dashboard
                - containerPort: 10001
                  name: client
                - containerPort: 8000
                  name: serve
    workerGroupSpecs:
      - groupName: small-group
        replicas: 1
        minReplicas: 0
        maxReplicas: 5
        rayStartParams: {}
        template:
          spec:
            containers:
              - name: ray-worker
                image: rayproject/ray:2.44.0
                lifecycle:
                  preStop:
                    exec:
                      command: ["/bin/sh", "-c", "ray stop"]
                resources:
                  requests:
                    cpu: "500m"
                    memory: "2Gi"
                  limits:
                    cpu: "1"
                    memory: "2Gi"
kubectl apply -f ray-service-incremental-upgrade.yaml 
  4. Validate RayService Pods are created:
kubectl get pods
NAME                                                            READY   STATUS    RESTARTS   AGE
kuberay-operator-86b59b85f5-7kpwl                               1/1     Running   0          4m17s
rayservice-incremental-upgrade-gateway-istio-7c9447d7c9-dxcgj   1/1     Running   0          100s
rayservice-incremental-upgrade-qfmsg-head                       2/2     Running   0          98s
rayservice-incremental-upgrade-qfmsg-small-group-worker-7dnn5   1/1     Running   0          76s
rayservice-incremental-upgrade-qfmsg-small-group-worker-jsbjs   1/1     Running   0          98s
  5. Validate the Gateway and HTTPRoute are created

Gateway:

kubectl describe gateways
Name:         rayservice-incremental-upgrade-gateway
Namespace:    default
Labels:       <none>
Annotations:  networking.gke.io/addresses: /projects/624496840364/global/addresses/gkegw1-9ok6-defaul-rayservice-incremental-upgrade--0zyq5o7gnhde
              networking.gke.io/backend-services:
                /projects/624496840364/global/backendServices/gkegw1-9ok6-default-gw-serve404-80-8zjp3d8cqfsu, /projects/624496840364/global/backendServic...
              networking.gke.io/firewalls: /projects/624496840364/global/firewalls/gkegw1-9ok6-l7-default-global
              networking.gke.io/forwarding-rules:
                /projects/624496840364/global/forwardingRules/gkegw1-9ok6-defaul-rayservice-incremental-upgrade--ux5k3yt9mxdb
              networking.gke.io/health-checks:
                /projects/624496840364/global/healthChecks/gkegw1-9ok6-default-gw-serve404-80-8zjp3d8cqfsu, /projects/624496840364/global/healthChecks/gke...
              networking.gke.io/last-reconcile-time: 2025-06-04T10:59:25Z
              networking.gke.io/lb-route-extensions: 
              networking.gke.io/lb-traffic-extensions: 
              networking.gke.io/ssl-certificates: 
              networking.gke.io/target-http-proxies:
                /projects/624496840364/global/targetHttpProxies/gkegw1-9ok6-defaul-rayservice-incremental-upgrade--v14nm0nv81bk
              networking.gke.io/target-https-proxies: 
              networking.gke.io/url-maps: /projects/624496840364/global/urlMaps/gkegw1-9ok6-defaul-rayservice-incremental-upgrade--v14nm0nv81bk
API Version:  gateway.networking.k8s.io/v1
Kind:         Gateway
Metadata:
  Creation Timestamp:             2025-06-04T10:58:28Z
  Deletion Grace Period Seconds:  0
  Deletion Timestamp:             2025-06-04T10:59:22Z
  Finalizers:
    gateway.finalizer.networking.gke.io
  Generation:  3
  Owner References:
    API Version:           ray.io/v1
    Block Owner Deletion:  true
    Controller:            true
    Kind:                  RayService
    Name:                  rayservice-incremental-upgrade
    UID:                   945c03ba-6551-4928-b294-6522e13e91af
  Resource Version:        1749034812216063007
  UID:                     f2273e8f-cf19-4cae-9508-9cebe76b21aa
Spec:
  Gateway Class Name:  istio
  Listeners:
    Allowed Routes:
      Namespaces:
        From:  Same
    Name:      rayservice-incremental-upgrade-listener
    Port:      80
    Protocol:  HTTP
Status:
  Addresses:
    Type:   IPAddress
    Value:  173.255.121.76
  Conditions:
    Last Transition Time:  2025-06-04T10:59:26Z
    Message:               The OSS Gateway API has deprecated this condition, do not depend on it.
    Observed Generation:   1
    Reason:                Scheduled
    Status:                True
    Type:                  Scheduled
    Last Transition Time:  2025-06-04T10:59:26Z
    Message:               Resource accepted
    Observed Generation:   3
    Reason:                Accepted
    Status:                True
    Type:                  Accepted
    Last Transition Time:  2025-06-04T11:00:03Z
    Message:               Resource programmed, assigned to service(s) rayservice-incremental-upgrade-gateway-istio.default.svc.cluster.local:80
    Observed Generation:   3
    Reason:                Programmed
    Status:                True
    Type:                  Programmed
    Last Transition Time:  2025-06-04T10:59:26Z
    Message:               The OSS Gateway API has altered the "Ready" condition semantics and reserved it for future use.  GKE Gateway will stop emitting it in a future update, use "Programmed" instead.
    Observed Generation:   1
    Reason:                Ready
    Status:                True
    Type:                  Ready
    Last Transition Time:  2025-06-04T10:59:26Z
    Message:               
    Observed Generation:   1
    Reason:                Healthy
    Status:                True
    Type:                  networking.gke.io/GatewayHealthy
  Listeners:
    Attached Routes:  1
    Conditions:
      Last Transition Time:  2025-06-04T10:59:26Z
      Message:               No errors found
      Observed Generation:   3
      Reason:                ResolvedRefs
      Status:                True
      Type:                  ResolvedRefs
      Last Transition Time:  2025-06-04T10:59:26Z
      Message:               No errors found
      Observed Generation:   3
      Reason:                Accepted
      Status:                True
      Type:                  Accepted
      Last Transition Time:  2025-06-04T10:59:26Z
      Message:               No errors found
      Observed Generation:   3
      Reason:                Programmed
      Status:                True
      Type:                  Programmed
      Last Transition Time:  2025-06-04T10:59:26Z
      Message:               The OSS Gateway API has altered the "Ready" condition semantics and reserved it for future use.  GKE Gateway will stop emitting it in a future update, use "Programmed" instead.
      Observed Generation:   1
      Reason:                Ready
      Status:                True
      Type:                  Ready
      Last Transition Time:  2025-06-04T10:59:34Z
      Message:               No errors found
      Observed Generation:   3
      Reason:                NoConflicts
      Status:                False
      Type:                  Conflicted
    Name:                    rayservice-incremental-upgrade-listener
    Supported Kinds:
      Group:  gateway.networking.k8s.io
      Kind:   HTTPRoute
      Group:  gateway.networking.k8s.io
      Kind:   GRPCRoute
Events:
  Type     Reason  Age                  From                   Message
  ----     ------  ----                 ----                   -------
  Normal   ADD     2m10s                sc-gateway-controller  default/rayservice-incremental-upgrade-gateway
  Normal   SYNC    76s (x8 over 90s)    sc-gateway-controller  default/rayservice-incremental-upgrade-gateway
  Normal   UPDATE  72s (x4 over 2m10s)  sc-gateway-controller  default/rayservice-incremental-upgrade-gateway
  Normal   SYNC    72s                  sc-gateway-controller  SYNC on default/rayservice-incremental-upgrade-gateway was a success

HTTPRoute:

kubectl describe httproutes
Name:         httproute-rayservice-incremental-upgrade
Namespace:    default
Labels:       <none>
Annotations:  <none>
API Version:  gateway.networking.k8s.io/v1
Kind:         HTTPRoute
Metadata:
  Creation Timestamp:  2025-06-04T11:00:12Z
  Generation:          1
  Owner References:
    API Version:           ray.io/v1
    Block Owner Deletion:  true
    Controller:            true
    Kind:                  RayService
    Name:                  rayservice-incremental-upgrade
    UID:                   2b74618c-a884-4b3d-a960-27b32d6d7c50
  Resource Version:        1749034812216031016
  UID:                     f0906529-40ea-4e46-ad50-31bfb239acb0
Spec:
  Parent Refs:
    Group:      gateway.networking.k8s.io
    Kind:       Gateway
    Name:       rayservice-incremental-upgrade-gateway
    Namespace:  default
  Rules:
    Backend Refs:
      Group:      
      Kind:       Service
      Name:       rayservice-incremental-upgrade-qfmsg-head-svc
      Namespace:  default
      Port:       8000
      Weight:     100
    Matches:
      Path:
        Type:   PathPrefix
        Value:  /
Status:
  Parents:
    Conditions:
      Last Transition Time:  2025-06-04T11:00:12Z
      Message:               Route was valid
      Observed Generation:   1
      Reason:                Accepted
      Status:                True
      Type:                  Accepted
      Last Transition Time:  2025-06-04T11:00:12Z
      Message:               All references resolved
      Observed Generation:   1
      Reason:                ResolvedRefs
      Status:                True
      Type:                  ResolvedRefs
    Controller Name:         istio.io/gateway-controller
    Parent Ref:
      Group:      gateway.networking.k8s.io
      Kind:       Gateway
      Name:       rayservice-incremental-upgrade-gateway
      Namespace:  default
Events:
  Type    Reason  Age   From                   Message
  ----    ------  ----  ----                   -------
  Normal  ADD     107s  sc-gateway-controller  default/httproute-rayservice-incremental-upgrade
  6. Send a request to the RayService using the Gateway
kubectl run curl --image=radial/busyboxplus:curl -i --tty

curl -X POST -H 'Content-Type: application/json' http://173.255.121.76/fruit/ -d '["MANGO", 2]'
6
  7. Modify the RayService spec as follows and initiate an upgrade by re-applying the spec:
apiVersion: ray.io/v1
kind: RayService
metadata:
  name: rayservice-incremental-upgrade
spec:
  upgradeStrategy:
    type: IncrementalUpgrade
    incrementalUpgradeOptions:
      gatewayClassName: istio
      stepSizePercent: 20
      intervalSeconds: 30
      maxSurgePercent: 80
  serveConfigV2: |
    applications:
      - name: fruit_app_updated
        import_path: fruit.deployment_graph
        route_prefix: /fruit
        runtime_env:
          working_dir: "https://github.com/ray-project/test_dag/archive/78b4a5da38796123d9f9ffff59bab2792a043e95.zip"
        deployments:
          - name: MangoStand
            num_replicas: 1
            user_config:
              price: 3
            ray_actor_options:
              num_cpus: 0.5
          - name: OrangeStand
            num_replicas: 1
            user_config:
              price: 2
            ray_actor_options:
              num_cpus: 0.5
          - name: PearStand
            num_replicas: 1
            user_config:
              price: 1
            ray_actor_options:
              num_cpus: 0.5
          - name: FruitMarket
            num_replicas: 1
            ray_actor_options:
              num_cpus: 0.5
  rayClusterConfig:
    rayVersion: "2.44.0"
    enableInTreeAutoscaling: true
    headGroupSpec:
      rayStartParams: {}
      template:
        spec:
          containers:
            - name: ray-head
              image: rayproject/ray:2.44.0
              resources:
                requests:
                  cpu: "2"
                  memory: "2Gi"
                limits:
                  cpu: "2"
                  memory: "2Gi"
              ports:
                - containerPort: 6379
                  name: gcs-server
                - containerPort: 8265
                  name: dashboard
                - containerPort: 10001
                  name: client
                - containerPort: 8000
                  name: serve
    workerGroupSpecs:
      - groupName: small-group
        replicas: 1
        minReplicas: 0
        maxReplicas: 5
        rayStartParams: {}
        template:
          spec:
            containers:
              - name: ray-worker
                image: rayproject/ray:2.44.0
                lifecycle:
                  preStop:
                    exec:
                      command: ["/bin/sh", "-c", "ray stop"]
                resources:
                  requests:
                    cpu: "1"
                    memory: "2Gi"
                  limits:
                    cpu: "1"
                    memory: "2Gi"
kubectl apply -f ray-service-incremental-upgrade.yaml 
  8. Verify the IncrementalUpgrade is triggered:
kubectl get pods
NAME                                                            READY   STATUS              RESTARTS        AGE
curl                                                            1/1     Running             1 (2m10s ago)   3m17s
kuberay-operator-86b59b85f5-7kpwl                               1/1     Running             0               8m40s
rayservice-incremental-upgrade-44r5r-head                       0/2     ContainerCreating   0               1s
rayservice-incremental-upgrade-44r5r-small-group-worker-p6kmz   0/1     Init:0/1            0               1s
rayservice-incremental-upgrade-gateway-istio-7c9447d7c9-dxcgj   1/1     Running             0               6m3s
rayservice-incremental-upgrade-qfmsg-head                       2/2     Running             0               6m1s
rayservice-incremental-upgrade-qfmsg-small-group-worker-7dnn5   1/1     Running             0               5m39s
rayservice-incremental-upgrade-qfmsg-small-group-worker-jsbjs   1/1     Running             0               6m1s
  9. Use serve status to validate that target_capacity is updated on both the active and pending RayClusters
# pending cluster Head pod
kubectl exec -it rayservice-incremental-upgrade-44r5r-head -- /bin/bash
Defaulted container "ray-head" out of: ray-head, autoscaler
(base) ray@rayservice-incremental-upgrade-44r5r-head:~$ serve status
proxies:
  b4bfd9d59aa2fdc8a3c464268f83a124f178b3f59c8291b7ac72b4eb: HEALTHY
  98e99c593f7c0ab6546faf7ffd4279e52bfdecb589862df29a84a285: HEALTHY
applications:
  fruit_app_updated:
    status: RUNNING
    message: ''
    last_deployed_time_s: 1749035152.4872093
    deployments:
      MangoStand:
        status: HEALTHY
        status_trigger: CONFIG_UPDATE_COMPLETED
        replica_states:
          RUNNING: 1
        message: ''
      OrangeStand:
        status: HEALTHY
        status_trigger: CONFIG_UPDATE_COMPLETED
        replica_states:
          RUNNING: 1
        message: ''
      PearStand:
        status: HEALTHY
        status_trigger: CONFIG_UPDATE_COMPLETED
        replica_states:
          RUNNING: 1
        message: ''
      FruitMarket:
        status: HEALTHY
        status_trigger: CONFIG_UPDATE_COMPLETED
        replica_states:
          RUNNING: 1
        message: ''
target_capacity: 80.0

# active cluster Head pod - after one iteration
serve status
proxies:
  c52833bdc0af5dedaa852dbdc78c9f76bd940dd144a7032f795f51cf: HEALTHY
  3c6f771276464144b506ab5c2e5ac98e7529c67790df9db98e464e52: HEALTHY
  efd31857830f53a8d39c21c404e823d2e5d0a3a8dc7ddc9d4af7e1bc: HEALTHY
applications:
  fruit_app_updated_again:
    status: RUNNING
    message: ''
    last_deployed_time_s: 1749039739.5597744
    deployments:
      MangoStand:
        status: HEALTHY
        status_trigger: CONFIG_UPDATE_COMPLETED
        replica_states:
          RUNNING: 1
        message: ''
      OrangeStand:
        status: HEALTHY
        status_trigger: CONFIG_UPDATE_COMPLETED
        replica_states:
          RUNNING: 1
        message: ''
      PearStand:
        status: HEALTHY
        status_trigger: CONFIG_UPDATE_COMPLETED
        replica_states:
          RUNNING: 1
        message: ''
      FruitMarket:
        status: HEALTHY
        status_trigger: CONFIG_UPDATE_COMPLETED
        replica_states:
          RUNNING: 1
        message: ''
target_capacity: 20.0
  10. Validate the incremental upgrade completes successfully
kubectl get pods
NAME                                                            READY   STATUS    RESTARTS        AGE
curl                                                            1/1     Running   1 (6m43s ago)   7m50s
kuberay-operator-86b59b85f5-7kpwl                               1/1     Running   0               13m
rayservice-incremental-upgrade-44r5r-head                       2/2     Running   0               4m34s
rayservice-incremental-upgrade-gateway-istio-7c9447d7c9-dxcgj   1/1     Running   0               10m

The behavior that TrafficRoutedPercent (i.e. the weights on the HTTPRoute) for the pending and active RayClusters is incrementally migrated by StepSizePercent every IntervalSeconds is validated through the e2e test, which can be run with make test-e2e-incremental-upgrade.
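
As a rough illustration of what that check looks like (not the actual test code in this PR), an assertion on the HTTPRoute weights with the controller-runtime client and Gomega might be shaped like this; the client setup, backendRef ordering, and timeouts are assumptions:

package sketch

import (
	"context"
	"testing"
	"time"

	. "github.com/onsi/gomega"
	"k8s.io/apimachinery/pkg/types"
	"sigs.k8s.io/controller-runtime/pkg/client"
	gwv1 "sigs.k8s.io/gateway-api/apis/v1"
)

// pendingClusterWeight reads the HTTPRoute created for the RayService and returns the weight
// currently assigned to the pending cluster's backendRef (assumed here to be index 1).
func pendingClusterWeight(ctx context.Context, c client.Client, namespace string) int32 {
	route := &gwv1.HTTPRoute{}
	key := types.NamespacedName{Namespace: namespace, Name: "httproute-rayservice-incremental-upgrade"}
	if err := c.Get(ctx, key, route); err != nil || len(route.Spec.Rules) == 0 || len(route.Spec.Rules[0].BackendRefs) < 2 {
		return 0
	}
	if w := route.Spec.Rules[0].BackendRefs[1].Weight; w != nil {
		return *w
	}
	return 0
}

// assertTrafficFullyMigrated waits for the pending cluster's weight to reach 100 as the
// controller steps it by StepSizePercent every IntervalSeconds.
func assertTrafficFullyMigrated(t *testing.T, c client.Client) {
	g := NewWithT(t)
	g.Eventually(func() int32 {
		return pendingClusterWeight(context.Background(), c, "default")
	}, 5*time.Minute, 5*time.Second).Should(Equal(int32(100)))
}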

const (
	// During upgrade, IncrementalUpgrade strategy will create an upgraded cluster to gradually scale
	// and migrate traffic to using Gateway API.
	IncrementalUpgrade RayServiceUpgradeType = "IncrementalUpgrade"
Member

Maybe too late to change this, but wondering if RollingUpgrade would be a more appropriate name? I assume most people are more familiar with this term. WDYT @ryanaoleary @kevin85421 @MortalHappiness

Member

(not blocking this PR, we can change it during the alpha phase)

Collaborator Author

Late to reply to this, but I have no strong preference either way. IncrementalUpgrade is what was used in the feature request and REP so that's why I stuck with it, but if there's a preference from any KubeRay maintainers or users I'm down to go through and change the feature name / all the related variable names.

Member

cc @rueian for sharing an opinion.
I think RollingUpgrade is a more straightforward name for me too.

Collaborator Author

cc: @kevin85421 since from offline discussion you seemed to have a preference against using RollingUpgrade here

Member

@kevin85421 what do you think about ClusterUpgrade and ClusterUpgradeOptions? I prefer to keep the upgrade term generic as the exact behavior could be changed in the future.

@Future-Outlier was also wondering about the history of why we called it "incremental" upgrades.

@ryanaoleary ryanaoleary requested a review from andrewsykim June 4, 2025 19:28
@joshdevins

Just wondering if there is an update on this PR. Will it make it into a KubeRay 1.4.x or 1.5 release?

@ryanaoleary
Collaborator Author

Just wondering if there is an update on this PR. Will it make it into a KubeRay 1.4.x or 1.5 release?

This PR is targeted for KubeRay v1.5; it still needs review, and I'll prioritize iterating and resolving comments to get it merged.

Signed-off-by: Ryan O'Leary <[email protected]>

Update go mod dependencies for gateway v1

Signed-off-by: Ryan O'Leary <[email protected]>

Add reconcile Gateway and HTTPRoute

Signed-off-by: Ryan O'Leary <[email protected]>

Add TargetCapacity and TrafficRoutedPercent to RayServiceStatus

Signed-off-by: Ryan O'Leary <[email protected]>

Add controller logic initial commit

Signed-off-by: Ryan O'Leary <[email protected]>

Add IncrementalUpgrade check to ShouldUpdate

Signed-off-by: Ryan O'Leary <[email protected]>

Update controller logic to reconcile incremental upgrade

Signed-off-by: Ryan O'Leary <[email protected]>

TrafficRoutedPercent should not set default value

Signed-off-by: Ryan O'Leary <[email protected]>

Remove test changes to TPU manifest

Signed-off-by: Ryan O'Leary <[email protected]>

Move helper function to utils

Signed-off-by: Ryan O'Leary <[email protected]>

Fix lint

Signed-off-by: Ryan O'Leary <[email protected]>

Fix field alignment

Signed-off-by: Ryan O'Leary <[email protected]>

Fix bad merge

Signed-off-by: Ryan O'Leary <[email protected]>

Fix CRDs and add validation test case

Signed-off-by: Ryan O'Leary <[email protected]>

Test create HTTPRoute and create Gateway

Signed-off-by: Ryan O'Leary <[email protected]>

Add reconcile tests for Gateway and HTTPRoute

Signed-off-by: Ryan O'Leary <[email protected]>

Fix lint

Signed-off-by: Ryan O'Leary <[email protected]>

Add tests for util functions and fix golangci-lint

Signed-off-by: Ryan O'Leary <[email protected]>

Add basic e2e test case

Signed-off-by: Ryan O'Leary <[email protected]>

Fix GetGatewayListeners logic and test

Signed-off-by: Ryan O'Leary <[email protected]>

Add gatewayv1 scheme to util runtime

Signed-off-by: Ryan O'Leary <[email protected]>

Check if IncrementalUpgrade is enabled before checking Gateway

Signed-off-by: Ryan O'Leary <[email protected]>

Fix reconcile logic for Gateway and HTTPRoute

Signed-off-by: Ryan O'Leary <[email protected]>

Add feature gate

Signed-off-by: Ryan O'Leary <[email protected]>

Always create Gateway and HTTPRoute for IncrementalUpgrade

Signed-off-by: Ryan O'Leary <[email protected]>

Fix target_capacity reonciliation logic

Signed-off-by: Ryan O'Leary <[email protected]>

Add additional unit tests

Signed-off-by: Ryan O'Leary <[email protected]>

Move e2e test and add another unit test

Signed-off-by: Ryan O'Leary <[email protected]>
Signed-off-by: Ryan O'Leary <[email protected]>
ryanaoleary and others added 5 commits October 6, 2025 22:02
Co-authored-by: Han-Ju Chen (Future-Outlier) <[email protected]>
Signed-off-by: Ryan O'Leary <[email protected]>
Co-authored-by: Han-Ju Chen (Future-Outlier) <[email protected]>
Signed-off-by: Ryan O'Leary <[email protected]>
Co-authored-by: Han-Ju Chen (Future-Outlier) <[email protected]>
Signed-off-by: Ryan O'Leary <[email protected]>
Co-authored-by: Han-Ju Chen (Future-Outlier) <[email protected]>
Signed-off-by: Ryan O'Leary <[email protected]>
Co-authored-by: Han-Ju Chen (Future-Outlier) <[email protected]>
Signed-off-by: Ryan O'Leary <[email protected]>
@Future-Outlier
Member

Future-Outlier commented Oct 7, 2025

Hi, @ryanaoleary

I think it would be safer (to make sure no requests are dropped) if we test the PR (RayService Incremental Upgrade) under two conditions:
1. On GKE, ideally in a larger-scale scenario if possible (because I can't mock that properly on my laptop).
2. Test many combinations of parameters on my local macOS machine and your Debian machine (maybe via a Python script with 3 nested loops over the variables stepSizePercent, intervalSeconds, and maxSurgePercent)

and continuously send requests to the gateway.

You can also DM me on ray slack before this is merged if you think that’s a faster way to coordinate.
My slack handle is Han-Ju Chen (ray team) (GitHub - Future-Outlier)

Member

@Future-Outlier Future-Outlier left a comment

During the upgrade process, we sometimes hit a scenario where requests are dropped; we have to figure out why.

--------------------------------------------------
Statistics after 200 requests: Success: 195, Failure: 5
--------------------------------------------------
Request #201: Success! Status: 200, Response: 10
Request #202: Success! Status: 200, Response: 10
Request #203: Failed! Status: 503, Response: no healthy upstream
Request #204: Success! Status: 200, Response: 10
Request #205: Success! Status: 200, Response: 10
Request #206: Success! Status: 200, Response: 10
Request #207: Success! Status: 200, Response: 10
Request #208: Failed! Status: 503, Response: no healthy upstream
Request #209: Failed! Status: 503, Response: no healthy upstream
Request #210: Success! Status: 200, Response: 10
Request #211: Success! Status: 200, Response: 10
Request #212: Success! Status: 200, Response: 10
Request #213: Success! Status: 200, Response: 10
Request #214: Success! Status: 200, Response: 10
Request #215: Failed! Status: 503, Response: no healthy upstream

Member

@Future-Outlier Future-Outlier left a comment

This is my Python script for you to test serving requests while upgrading locally.

import subprocess
import time
import sys

def run_kubectl_command(cmd, check=True):
    """Execute kubectl command and return result"""
    try:
        # Execute command with check=True, if command fails it will throw an exception
        result = subprocess.run(
            cmd,
            capture_output=True,
            text=True,
            check=check,
            encoding='utf-8' # Explicitly specify encoding
        )
        return result
    except subprocess.CalledProcessError as e:
        print(f"Error running command: {' '.join(cmd)}")
        print(f"Stderr: {e.stderr}")
        return None
    except FileNotFoundError:
        print("Error: 'kubectl' command not found. Please ensure it is installed and in your PATH.")
        sys.exit(1)

def ensure_curl_pod_is_running(pod_name="curl", namespace="default"):
    """Check if Pod exists and is running, if not, create it"""
    # Check Pod status
    check_cmd = ["kubectl", "get", "pod", pod_name, "-n", namespace, "-o", "jsonpath='{.status.phase}'"]
    try:
        result = subprocess.run(check_cmd, capture_output=True, text=True, check=True, encoding='utf-8')
        # Remove single quotes
        status = result.stdout.strip("'")
        if status == "Running":
            print(f"Pod '{pod_name}' is already running.")
            return True
        else:
            print(f"Pod '{pod_name}' exists but is not running (Status: {status}). Deleting it to recreate.")
            delete_cmd = ["kubectl", "delete", "pod", pod_name, "-n", namespace, "--grace-period=0"]
            run_kubectl_command(delete_cmd, check=False)
            time.sleep(5) # Wait for deletion to complete
    except subprocess.CalledProcessError:
        # get pod failed, means pod doesn't exist
        print(f"Pod '{pod_name}' not found. Creating it...")

    # Create Pod and execute a command that never ends
    run_cmd = [
        "kubectl", "run", pod_name,
        "--image=radial/busyboxplus:curl",
        "--restart=Never",
        "-n", namespace,
        "--", "sh", "-c", "tail -f /dev/null" # Keep Pod running permanently
    ]

    if not run_kubectl_command(run_cmd):
        print(f"Failed to start pod '{pod_name}'.")
        return False

    # Wait for Pod to enter Running state
    print(f"Waiting for pod '{pod_name}' to be ready...")
    wait_cmd = ["kubectl", "wait", "--for=condition=Ready", f"pod/{pod_name}", "-n", namespace, "--timeout=120s"]
    if not run_kubectl_command(wait_cmd):
        print("Pod did not become ready in time.")
        return False

    print(f"Pod '{pod_name}' is ready!")
    return True

def send_curl_request(pod_name, request_count, namespace="default"):
    """Execute curl request in Pod"""
    # Update URL according to your requirements
    url = "http://192.168.8.201/fruit"
    headers = {
        "Content-Type": "application/json",
        "Host": "rayservice-incremental-upgrade.default.svc.cluster.local"
    }
    data = '["MANGO", 2]'

    # Build curl command to execute inside Pod
    curl_cmd = [
        "kubectl", "exec", pod_name, "-n", namespace, "--",
        "curl",
        "-s",                   # Silent mode
        "-w", "\\n%{http_code}", # Print HTTP status code on the last line of output
        "-X", "POST",
        "-H", f"Content-Type: {headers['Content-Type']}",
        "-H", f"Host: {headers['Host']}",
        "-d", data,
        url
    ]
    # print("curl_cmd: ", curl_cmd)

    result = run_kubectl_command(curl_cmd, check=False)
    if not result:
        print(f"Request #{request_count}: Failed to execute curl command in pod.")
        return False

    # Parse output from kubectl exec
    output = result.stdout.strip()
    if '\n' not in output:
        print(f"Request #{request_count}: Unexpected response format: {output}")
        if result.stderr:
            print(f"Stderr: {result.stderr.strip()}")
        return False

    # Split response body and status code
    parts = output.rsplit('\n', 1)
    response_body = parts[0]
    status_code_str = parts[1]

    try:
        status_code = int(status_code_str)
        # Check status code and response content
        if status_code == 200:
            print(f"Request #{request_count}: Success! Status: {status_code}, Response: {response_body}")
            return True
        else:
            print(f"Request #{request_count}: Failed! Status: {status_code}, Response: {response_body}")
            return False
    except ValueError:
        print(f"Request #{request_count}: Invalid status code '{status_code_str}' in response.")
        return False

def main():
    pod_name = "curl"

    # Step 1: Ensure curl pod exists and is running
    if not ensure_curl_pod_is_running(pod_name):
        print("Exiting due to failure in preparing the curl pod.")
        return

    request_count = 0
    success_count = 0
    failure_count = 0

    print("\nStarting to send requests... Press Ctrl+C to stop.")
    try:
        # Step 2: Enter loop, send request every 0.1 seconds
        while True:
            request_count += 1
            if send_curl_request(pod_name, request_count):
                success_count += 1
            else:
                failure_count += 1

            # Every 100 requests, display statistics once
            if request_count % 100 == 0:
                print("-" * 50)
                print(f"Statistics after {request_count} requests: Success: {success_count}, Failure: {failure_count}")
                print("-" * 50)

            time.sleep(0.1)

    except KeyboardInterrupt:
        print("\nScript interrupted by user.")
    except Exception as e:
        print(f"\nAn unexpected error occurred: {e}")
    finally:
        print("\nScript finished.")
        print(f"Final Statistics: Total: {request_count}, Success: {success_count}, Failure: {failure_count}")
        # Note: The script will not automatically delete the Pod after completion, you can manually execute the following command to clean up:
        # kubectl delete pod curl -n default

if __name__ == "__main__":
    main()

Member

@Future-Outlier Future-Outlier left a comment

Upgrade scenario

(screenshot of request statistics during the upgrade)

After 300 requests (1 request per 0.1 seconds), 7 failed, and I think this is not acceptable.

No-upgrade scenario

(screenshot of request statistics without an upgrade)

I am not sure whether this is a Ray Serve bug or a KubeRay bug; it needs discussion.
cc @ryanaoleary @andrewsykim @kevin85421 @rueian

@Future-Outlier
Member

Just wondering if there is an update on this PR. Will it make it into a KubeRay 1.4.x or 1.5 release?

Hi @joshdevins, are you interested in testing this PR before the release?
It would help a lot.

@ryanaoleary
Collaborator Author

Hi, @ryanaoleary

I think it would be safer (this can make sure no requests drop) if we test the PR (RayService Incremental Upgrade) under two conditions: 1. On GKE — ideally in a larger-scale scenario if possible (because I can’t mock that properly on my laptop). 2. Test many combinations of parameters on my local macOS machine and your Debian machine (maybe via a Python script with 3 nested loops)

and continuously send requests to the gateway.

You can also DM me on ray slack before this is merged if you think that’s a faster way to coordinate. My slack handle is Han-Ju Chen (ray team) (GitHub - Future-Outlier)

Sounds good, thank you!! I'll work on testing this more today before the community meeting and I'll message if I have questions / have any updates. I'm currently looking into the two issues you mentioned (#3166 (review) and #3166 (comment)) since I think they may be related.

@Future-Outlier
Member

Hi @ryanaoleary,

I'm just checking in on this, as the branch cut date is approaching around 10/20.

I wanted to let you know that I'm happy to help debug, brainstorm solutions, or work on the implementation with you if that would be helpful. Please let me know!

@ryanaoleary
Collaborator Author

Hi @ryanaoleary,

I'm just checking in on this, as the branch cut date is approaching around 10/20.

I wanted to let you know that I'm happy to help debug, brainstorm solutions, or work on the implementation with you if that would be helpful. Please let me know!

Sounds good, thank you!! I really appreciate the help. I just pushed f785c6e which should fix the two main issues brought up in the comments.

  1. The old cluster's serve config was no longer being served during the upgrade. Traffic was being correctly split, but the new reconcile path also reconciled the serve config on the active cluster during the upgrade (previously only the pending cluster's serve config was reconciled), so the active cluster was being updated with the new serve config. The fix is to avoid updating the serve config of the active RayService, except for the target_capacity.
  2. Requests were being dropped during the upgrade. This was because we were only checking whether the head pod was ready before increasing traffic to the new cluster. We now check that the actual Serve application and deployments are healthy before increasing traffic, which should provide a more reliable upgrade (see the sketch at the end of this comment).

I'll do some testing later today with the load-test Python script you provided to ensure no requests are dropped, but it is passing the e2e test consistently for me.
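
For fix 2, here is a minimal sketch of the health gate, with hypothetical helper and type names (the real controller reads these statuses from the Serve REST API): traffic is only stepped up once every application and deployment on the pending cluster reports healthy, not merely once the head pod is ready.

package sketch

// serveDetails is a trimmed-down stand-in for the Serve application/deployment statuses the
// controller inspects; it is not the real KubeRay or Serve API type.
type serveDetails struct {
	Applications map[string]struct {
		Status      string
		Deployments map[string]struct{ Status string }
	}
}

// pendingClusterServeHealthy reports whether every application and deployment on the pending
// cluster is healthy. Only then is another StepSizePercent of traffic migrated to it;
// checking head pod readiness alone is not enough (the cause of the dropped requests above).
func pendingClusterServeHealthy(details serveDetails) bool {
	for _, app := range details.Applications {
		if app.Status != "RUNNING" {
			return false
		}
		for _, d := range app.Deployments {
			if d.Status != "HEALTHY" {
				return false
			}
		}
	}
	return true
}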

Member

@Future-Outlier Future-Outlier left a comment

It seems correct!! I am testing more, but I am really excited about this progress.

(screen recording attached: 2025-10-10.15-54-37.mp4)


Member

@Future-Outlier Future-Outlier left a comment

Hi, @ryanaoleary, here is an async script.

  1. I think this is better than the synchronous one. You can probably send batch requests to a production cluster with a GPU workload and see what happens when these requests try to hit the limit while upgrading. This is the stress test I have in mind.

  2. I am also wondering whether it is possible to test this PR with GPUs using Locust, following Ray's official documentation, since I also want to know the CPU and resource usage.

https://docs.ray.io/en/latest/serve/autoscaling-guide.html#basic-example

  3. When using the script below, I triggered the incremental upgrade and got the response shown below.
    (I think this is flaky; you have to try multiple times. I believe that on average it fails about once every three tries.)
    I'm not sure whether this problem is related to the incremental upgrade or not.

(screenshot of the failed response)
import asyncio
import subprocess
import time
import sys
import json
from typing import Tuple, Optional

async def run_kubectl_command_async(cmd, check=True):
    """Execute kubectl command asynchronously and return result"""
    try:
        process = await asyncio.create_subprocess_exec(
            *cmd,
            stdout=asyncio.subprocess.PIPE,
            stderr=asyncio.subprocess.PIPE
        )
        stdout, stderr = await process.communicate()
        
        if check and process.returncode != 0:
            print(f"Error running command: {' '.join(cmd)}")
            print(f"Stderr: {stderr.decode('utf-8')}")
            return None
            
        return subprocess.CompletedProcess(
            args=cmd,
            returncode=process.returncode,
            stdout=stdout.decode('utf-8'),
            stderr=stderr.decode('utf-8')
        )
    except FileNotFoundError:
        print("Error: 'kubectl' command not found. Please ensure it is installed and in your PATH.")
        sys.exit(1)

async def ensure_curl_pod_is_running(pod_name="curl", namespace="default"):
    """Check if Pod exists and is running, if not, create it"""
    # Check Pod status
    check_cmd = ["kubectl", "get", "pod", pod_name, "-n", namespace, "-o", "jsonpath='{.status.phase}'"]
    try:
        result = await run_kubectl_command_async(check_cmd, check=True)
        if result:
            # Remove single quotes
            status = result.stdout.strip("'")
            if status == "Running":
                print(f"Pod '{pod_name}' is already running.")
                return True
            else:
                print(f"Pod '{pod_name}' exists but is not running (Status: {status}). Deleting it to recreate.")
                delete_cmd = ["kubectl", "delete", "pod", pod_name, "-n", namespace, "--grace-period=0"]
                await run_kubectl_command_async(delete_cmd, check=False)
                await asyncio.sleep(5)  # Wait for deletion to complete
    except Exception:
        # get pod failed, means pod doesn't exist
        print(f"Pod '{pod_name}' not found. Creating it...")

    # Create Pod and execute a command that never ends
    run_cmd = [
        "kubectl", "run", pod_name,
        "--image=radial/busyboxplus:curl",
        "--restart=Never",
        "-n", namespace,
        "--", "sh", "-c", "tail -f /dev/null"  # Keep Pod running permanently
    ]

    result = await run_kubectl_command_async(run_cmd)
    if not result:
        print(f"Failed to start pod '{pod_name}'.")
        return False

    # Wait for Pod to enter Running state
    print(f"Waiting for pod '{pod_name}' to be ready...")
    wait_cmd = ["kubectl", "wait", "--for=condition=Ready", f"pod/{pod_name}", "-n", namespace, "--timeout=120s"]
    result = await run_kubectl_command_async(wait_cmd)
    if not result:
        print("Pod did not become ready in time.")
        return False

    print(f"Pod '{pod_name}' is ready!")
    return True

async def send_curl_request(pod_name, request_count, namespace="default") -> Tuple[bool, int, str]:
    """Execute curl request in Pod and return (success, status_code, response_body)"""
    # Update URL according to your requirements
    url = "http://192.168.8.201/fruit"
    headers = {
        "Content-Type": "application/json",
        "Host": "rayservice-incremental-upgrade.default.svc.cluster.local"
    }
    data = '["MANGO", 2]'

    # Build curl command to execute inside Pod
    curl_cmd = [
        "kubectl", "exec", pod_name, "-n", namespace, "--",
        "curl",
        "-s",                   # Silent mode
        "-w", "\\n%{http_code}", # Print HTTP status code on the last line of output
        "-X", "POST",
        "-H", f"Content-Type: {headers['Content-Type']}",
        "-H", f"Host: {headers['Host']}",
        "-H", "Connection: close",
        "-d", data,
        url
    ]

    result = await run_kubectl_command_async(curl_cmd, check=False)
    if not result:
        print(f"Request #{request_count}: Failed to execute curl command in pod.")
        return False, 0, ""

    # Parse output from kubectl exec
    output = result.stdout.strip()
    if '\n' not in output:
        print(f"Request #{request_count}: Unexpected response format: {output}")
        if result.stderr:
            print(f"Stderr: {result.stderr.strip()}")
        return False, 0, output

    # Split response body and status code
    parts = output.rsplit('\n', 1)
    response_body = parts[0]
    status_code_str = parts[1]

    try:
        status_code = int(status_code_str)
        # Check status code and response content
        if status_code == 200:
            # print(f"Request #{request_count}: Success! Status: {status_code}, Response: {response_body}")
            return True, status_code, response_body
        else:
            print(f"Request #{request_count}: Failed! Status: {status_code}, Response: {response_body}")
            return False, status_code, response_body
    except ValueError:
        print(f"Request #{request_count}: Invalid status code '{status_code_str}' in response.")
        return False, 0, response_body

async def send_requests_batch(pod_name, start_count, batch_size=100, namespace="default"):
    """Send a batch of requests concurrently"""
    tasks = []
    for i in range(batch_size):
        request_count = start_count + i
        task = send_curl_request(pod_name, request_count, namespace)
        tasks.append(task)
    
    results = await asyncio.gather(*tasks, return_exceptions=True)
    
    success_count = 0
    failure_count = 0
    failed_requests = []
    
    for i, result in enumerate(results):
        request_count = start_count + i
        if isinstance(result, Exception):
            print(f"Request #{request_count}: Exception occurred: {result}")
            failure_count += 1
            failed_requests.append({
                'request_count': request_count,
                'status_code': 0,
                'response_body': str(result)
            })
        else:
            success, status_code, response_body = result
            if success:
                success_count += 1
            else:
                failure_count += 1
                failed_requests.append({
                    'request_count': request_count,
                    'status_code': status_code,
                    'response_body': response_body
                })
    
    return success_count, failure_count, failed_requests

async def main():
    pod_name = "curl"

    # Step 1: Ensure curl pod exists and is running
    if not await ensure_curl_pod_is_running(pod_name):
        print("Exiting due to failure in preparing the curl pod.")
        return

    total_request_count = 0
    total_success_count = 0
    total_failure_count = 0
    all_failed_requests = []

    print("\nStarting to send requests continuously... Press Ctrl+C to stop.")
    print("Using async batch processing for faster requests...")
    
    try:
        # Step 2: Enter continuous loop, send requests in batches
        batch_size = 10  # Number of concurrent requests per batch
        # delay_between_batches = 0.05  # Delay between batches in seconds
        
        while True:
            success_count, failure_count, failed_requests = await send_requests_batch(
                pod_name, total_request_count + 1, batch_size
            )
            
            total_request_count += batch_size
            total_success_count += success_count
            total_failure_count += failure_count
            all_failed_requests.extend(failed_requests)

            # Every 10 batches (200 requests), display statistics once
            if total_request_count % (batch_size * 10) == 0:
                print("-" * 50)
                print(f"Statistics after {total_request_count} requests: Success: {total_success_count}, Failure: {total_failure_count}")
                print("-" * 50)
                if total_failure_count > 0:
                    print("FAILED REQUESTS DETAILS:")
                    print("=" * 60)
                    for failed_req in all_failed_requests:
                        print(f"Request #{failed_req['request_count']}: Status: {failed_req['status_code']}, Response: {failed_req['response_body']}")
                    print("=" * 60)
                    exit(1)

            # Wait before next batch
            # await asyncio.sleep(delay_between_batches)

    except KeyboardInterrupt:
        print("\nScript interrupted by user.")
    except Exception as e:
        print(f"\nAn unexpected error occurred: {e}")
    finally:
        print("\nScript finished.")
        print(f"Final Statistics: Total: {total_request_count}, Success: {total_success_count}, Failure: {total_failure_count}")
        
        # Print all failed requests with details
        if all_failed_requests:
            print("\n" + "=" * 60)
            print("FAILED REQUESTS DETAILS:")
            print("=" * 60)
            for failed_req in all_failed_requests:
                print(f"Request #{failed_req['request_count']}: Status: {failed_req['status_code']}, Response: {failed_req['response_body']}")
            print("=" * 60)
        
        # Note: the script will not automatically delete the Pod after completion; you can clean it up manually with:
        # kubectl delete pod curl -n default

if __name__ == "__main__":
    asyncio.run(main())

Copy link
Member

@Future-Outlier Future-Outlier left a comment

I think here is the todo list for this PR:

  1. If we are already upgrading from the old cluster (1st cluster) to the new cluster (2nd cluster), and the user now runs kubectl apply with a new cluster (3rd cluster), we should reject the request for the 3rd cluster, right? Do we handle this now?
  2. Resolve my comments for better readability
    (or reply if my suggestion doesn't make sense, and I will take a look)
    #3166 (comment)
    #3166 (comment)
  3. Rename IncrementalUpgradeOptions to ClusterUpgradeOptions
    #3166 (comment)
  4. Do more load testing; I think we should try Locust.
  5. After 1, 2, 3, and 4 are done, make the e2e test work in Buildkite.

Copy link
Member

@Future-Outlier Future-Outlier left a comment

I need more time to check whether this change is correct; I will give you an update next Monday or Tuesday, thank you.

@ryanaoleary
Copy link
Collaborator Author

I think here is the todo list for this PR:

  1. If we are already upgrading from the old cluster (1st cluster) to the new cluster (2nd cluster), and the user now runs kubectl apply with a new cluster (3rd cluster), we should reject the request for the 3rd cluster, right? Do we handle this now?
  2. Resolve my comments for better readability
    (or reply if my suggestion doesn't make sense, and I will take a look)
    #3166 (comment)
    #3166 (comment)
  3. Rename IncrementalUpgradeOptions to ClusterUpgradeOptions
    #3166 (comment)
  4. Do more load testing; I think we should try Locust.
  5. After 1, 2, 3, and 4 are done, make the e2e test work in Buildkite.
  1. The current behavior is that if we are upgrading from cluster 1 to cluster 2 and the user runs kubectl apply with a new cluster 3, the upgrade from 1 -> 2 continues, and once it completes we upgrade again from cluster 2 -> 3. This is the same behavior as the NewCluster upgrade strategy. Should we instead add validation to reject the kubectl apply while UpgradeInProgress is true? I'm not sure whether that matches the behavior we discussed for the rollback strategy. I believe the desired behavior was that during the upgrade from 1 -> 2, if the goal cluster changes, we roll back to cluster 1 and then upgrade from 1 -> the goal cluster. I put that implementation in a separate PR that I think should be part of graduating this feature from alpha to beta.

  2. Done in 94636e2

I'll work on 3. and 4. and update this PR with the changes/results by Monday night.

  5. For the Buildkite change, I'd prefer to do it in a follow-up/separate PR if possible, since this PR is quite large. I think this would be better because, if we need to remove it from Buildkite for some reason, we can revert that commit rather than the entire feature. It would also make this PR easier to review. I can work on the Buildkite change and have it ready for review by Monday or Tuesday so that we can still try to include it in the v1.5 release.

@ryanaoleary
Copy link
Collaborator Author

ryanaoleary commented Oct 10, 2025

I think this is better than the synchronous one. I am thinking that you can probably send batch requests to a production cluster with a GPU workload and see what happens when these requests hit the limit while upgrading. This is the stress test I have in mind.

And I am wondering if it is possible to test this PR with GPUs using Locust, following Ray's official documentation, since I also want to know the CPU and resource usage.

I should have some GPU quota; I'll try that test and update this PR with the results.

For the flaky test, I think that could be unrelated to the upgrade; the head Pod could be hitting memory or CPU limits. I'll also try to replicate it to debug further, but I suspect it's not unique to incremental upgrades.


// GetGatewayListenersForRayService constructs the default HTTP listener for a RayService Gateway.
func GetGatewayListenersForRayService(rayServiceInstance *rayv1.RayService) []gwv1.Listener {
    hostname := fmt.Sprintf("%s.%s.svc.cluster.local", rayServiceInstance.Name, rayServiceInstance.Namespace)
Copy link
Collaborator

Does .cluster.local always work? IIRC, it is customizable.

Copy link
Collaborator Author

IIUC hostname can be set to whatever we want as long as it's unique and matches what the users expect. It's not based on an actual service name or anything.

Is there somewhere in the RayCluster spec where users can specify a hostname? If so, I can make this customizable based on that; I just didn't see a relevant field.

Copy link
Collaborator

@rueian rueian Oct 11, 2025

Okay, .cluster.local makes me think it is tied to a k8s Service FQDN.

Shouldn't we just leave the hostname unspecified so it accepts all requests?
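
For illustration, a minimal sketch of that option, assuming gwv1 is sigs.k8s.io/gateway-api/apis/v1 (not code from this PR; names and ports are placeholders):

package gatewaysketch

import (
    gwv1 "sigs.k8s.io/gateway-api/apis/v1"
)

// buildHTTPListener returns an HTTP listener with no Hostname set. Per the
// Gateway API spec, an unset hostname matches requests for any host, so no
// assumption about the cluster DNS suffix (e.g. .cluster.local) is needed.
func buildHTTPListener() gwv1.Listener {
    allNamespaces := gwv1.NamespacesFromAll
    return gwv1.Listener{
        Name:     gwv1.SectionName("http"),
        Protocol: gwv1.HTTPProtocolType,
        Port:     gwv1.PortNumber(80),
        // Hostname is intentionally left nil so the listener accepts all requests.
        AllowedRoutes: &gwv1.AllowedRoutes{
            Namespaces: &gwv1.RouteNamespaces{From: &allNamespaces},
        },
    }
}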

@Future-Outlier
Copy link
Member

Future-Outlier commented Oct 11, 2025

Hi, @ryanaoleary
here is an effective way to do a stress test that I just asked @kevin85421 about:

  1. Figure out a RayCluster's QPS limit:
    a. use Locust or another tool
    b. keep increasing QPS until you see response times begin to slow
    c. find the QPS number near saturation
  2. Use the QPS number near saturation from step 1 while doing the incremental upgrade.

I will do it too. See you next week, and thank you for the hard work.

@Future-Outlier
Copy link
Member

I think here is the todo list for this PR:

  1. If we are already upgrading from the old cluster (1st cluster) to the new cluster (2nd cluster), and the user now runs kubectl apply with a new cluster (3rd cluster), we should reject the request for the 3rd cluster, right? Do we handle this now?
  1. The current behavior is that if we are upgrading from cluster 1 to cluster 2 and the user runs kubectl apply with a new cluster 3, the upgrade from 1 -> 2 continues, and once it completes we upgrade again from cluster 2 -> 3. This is the same behavior as the NewCluster upgrade strategy. Should we instead add validation to reject the kubectl apply while UpgradeInProgress is true? I'm not sure whether that matches the behavior we discussed for the rollback strategy. I believe the desired behavior was that during the upgrade from 1 -> 2, if the goal cluster changes, we roll back to cluster 1 and then upgrade from 1 -> the goal cluster. I put that implementation in a separate PR that I think should be part of graduating this feature from alpha to beta.
  1. You are correct!

@Future-Outlier
Copy link
Member

TODO:
when upgrading from cluster A to cluster B, if there are not enough resources, we should have a timeout or a similar mechanism to cancel the upgrade.
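
A rough sketch of the kind of check that could implement this; the helper name and the timeout value are hypothetical and not defined in this PR:

package upgradetimeoutsketch

import (
    "time"

    metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// hasUpgradeStalled reports whether more than `timeout` has elapsed since traffic
// last moved (LastTrafficMigratedTime in the RayService status). The timeout value
// would have to come from a new spec field, and how the controller cancels or rolls
// back the upgrade on timeout is still an open question.
func hasUpgradeStalled(lastMigrated *metav1.Time, timeout time.Duration, now time.Time) bool {
    if lastMigrated == nil || lastMigrated.IsZero() {
        // No traffic migration recorded yet; nothing to time out against.
        return false
    }
    return now.Sub(lastMigrated.Time) > timeout
}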

Comment on lines +1558 to +1559
if incrementalUpgradeUpdate {
if err := r.reconcileServeTargetCapacity(ctx, rayServiceInstance, rayClusterInstance, rayDashboardClient); err != nil {
Copy link
Member

Hi, @ryanaoleary
Since isIncrementalUpgradeInProgress already includes incrementalUpgradeUpdate,
can we delete the if incrementalUpgradeUpdate statement?

Comment on lines +1530 to +1531
isIncrementalUpgradeInProgress := utils.IsIncrementalUpgradeEnabled(&rayServiceInstance.Spec) &&
rayServiceInstance.Status.PendingServiceStatus.RayClusterName != ""
Copy link
Member

Can you explain the difference between this statement and
isIncrementalUpgradeInProgress := utils.IsIncrementalUpgradeEnabled(&rayServiceInstance.Spec) && meta.IsStatusConditionTrue(rayServiceInstance.Status.Conditions, string(rayv1.UpgradeInProgress))?

Can we be consistent? If not, why?
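
For reference, one way to keep the two call sites consistent would be a small helper built from the exact expression quoted above (a sketch, with import paths assumed from the kuberay layout, not code from this PR):

package upgradehelpersketch

import (
    "k8s.io/apimachinery/pkg/api/meta"

    rayv1 "github.com/ray-project/kuberay/ray-operator/apis/ray/v1"
    "github.com/ray-project/kuberay/ray-operator/controllers/ray/utils"
)

// isIncrementalUpgradeInProgress gives every call site the same definition:
// the incremental upgrade feature is enabled and the controller has already
// set the UpgradeInProgress condition to true.
func isIncrementalUpgradeInProgress(rayServiceInstance *rayv1.RayService) bool {
    return utils.IsIncrementalUpgradeEnabled(&rayServiceInstance.Spec) &&
        meta.IsStatusConditionTrue(rayServiceInstance.Status.Conditions, string(rayv1.UpgradeInProgress))
}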

Comment on lines +286 to 287
promotedPendingCluster := false
if headSvc != nil && serveSvc != nil {
Copy link
Member

Can we change the name to isPendingClusterPromoted?

Comment on lines +291 to 330
if utils.IsIncrementalUpgradeEnabled(&rayServiceInstance.Spec) && meta.IsStatusConditionTrue(rayServiceInstance.Status.Conditions, string(rayv1.UpgradeInProgress)) {
// An incremental upgrade is complete when the active cluster has 0% capacity and the pending cluster has
// 100% of the traffic. We can't promote the pending cluster until traffic has been fully migrated.
if pendingCluster != nil &&
ptr.Deref(rayServiceInstance.Status.ActiveServiceStatus.TargetCapacity, -1) == 0 &&
ptr.Deref(rayServiceInstance.Status.PendingServiceStatus.TrafficRoutedPercent, -1) == 100 {

logger.Info("Promoting pending cluster to active: Incremental upgrade complete.",
"oldCluster", rayServiceInstance.Status.ActiveServiceStatus.RayClusterName,
"newCluster", rayServiceInstance.Status.PendingServiceStatus.RayClusterName)

rayServiceInstance.Status.ActiveServiceStatus = rayServiceInstance.Status.PendingServiceStatus
rayServiceInstance.Status.PendingServiceStatus = rayv1.RayServiceStatus{}
promotedPendingCluster = true
}
}
isPendingClusterServing = clusterName == pendingClusterName

// If services point to a different cluster than the active one, promote pending to active
logger.Info("calculateStatus", "clusterSvcPointingTo", clusterName, "pendingClusterName", pendingClusterName, "activeClusterName", activeClusterName)
if activeClusterName != clusterName {
logger.Info("Promoting pending cluster to active",
"oldCluster", rayServiceInstance.Status.ActiveServiceStatus.RayClusterName,
"newCluster", clusterName)
rayServiceInstance.Status.ActiveServiceStatus = rayServiceInstance.Status.PendingServiceStatus
rayServiceInstance.Status.PendingServiceStatus = rayv1.RayServiceStatus{}
if !utils.IsIncrementalUpgradeEnabled(&rayServiceInstance.Spec) || !promotedPendingCluster {
// Promote the pending cluster to the active cluster if both RayService's head and serve services
// have already pointed to the pending cluster.
clusterName := utils.GetRayClusterNameFromService(headSvc)
if clusterName != utils.GetRayClusterNameFromService(serveSvc) {
panic("headSvc and serveSvc are not pointing to the same cluster")
}
// Verify cluster name matches either pending or active cluster
if clusterName != pendingClusterName && clusterName != activeClusterName {
panic("clusterName is not equal to pendingCluster or activeCluster")
}
isPendingClusterServing = clusterName == pendingClusterName

// If services point to a different cluster than the active one, promote pending to active
logger.Info("calculateStatus", "clusterSvcPointingTo", clusterName, "pendingClusterName", pendingClusterName, "activeClusterName", activeClusterName)
if activeClusterName != clusterName {
logger.Info("Promoting pending cluster to active",
"oldCluster", rayServiceInstance.Status.ActiveServiceStatus.RayClusterName,
"newCluster", clusterName)
rayServiceInstance.Status.ActiveServiceStatus = rayServiceInstance.Status.PendingServiceStatus
rayServiceInstance.Status.PendingServiceStatus = rayv1.RayServiceStatus{}
}
}
}
Copy link
Member

Can we do something like this?
If we don't, it will be hard for developers to understand the code.
You can change these helper functions' names; I just want to give a direction.

// Check if incremental upgrade is enabled in the RayService specification
if utils.IsIncrementalUpgradeEnabled(&rayServiceInstance.Spec) {
    logger.Info("Processing incremental upgrade strategy")
    
    // Check if incremental upgrade is currently in progress
    if meta.IsStatusConditionTrue(rayServiceInstance.Status.Conditions, string(rayv1.UpgradeInProgress)) {
        // Incremental upgrade is in progress
        logger.Info("Incremental upgrade is in progress")
        
        // Check if incremental upgrade completion conditions are met
        if isIncrementalUpgradeComplete(rayServiceInstance, pendingCluster) {
            // All conditions met: active cluster has 0% capacity and 
            // pending cluster has 100% traffic
            logger.Info("Incremental upgrade completed, promoting pending cluster")
            
            // Execute the promotion logic
            promotePendingClusterToActive(rayServiceInstance)
            isPendingClusterPromoted = true
            
            // No need to execute traditional logic after successful promotion
            return
        } else {
            logger.Info("Incremental upgrade not complete, waiting for completion",
                "activeTargetCapacity", ptr.Deref(rayServiceInstance.Status.ActiveServiceStatus.TargetCapacity, -1),
                "pendingTrafficRoutedPercent", ptr.Deref(rayServiceInstance.Status.PendingServiceStatus.TrafficRoutedPercent, -1))
            
            // Continue with traditional logic as fallback during upgrade
            executeTraditionalUpgradeLogic(headSvc, serveSvc, rayServiceInstance)
        }
    } else {
        // Incremental upgrade is enabled but not started yet
        logger.Info("Incremental upgrade not started, using traditional logic")
        executeTraditionalUpgradeLogic(headSvc, serveSvc, rayServiceInstance)
    }
} else {
    // Incremental upgrade is not enabled, use traditional Blue/Green strategy
    logger.Info("Processing traditional Blue/Green upgrade strategy")
    executeTraditionalUpgradeLogic(headSvc, serveSvc, rayServiceInstance)
}

func isIncrementalUpgradeComplete(rayServiceInstance *rayv1.RayService, pendingCluster *rayv1.RayCluster) bool {
    return pendingCluster != nil &&
           ptr.Deref(rayServiceInstance.Status.ActiveServiceStatus.TargetCapacity, -1) == 0 &&
           ptr.Deref(rayServiceInstance.Status.PendingServiceStatus.TrafficRoutedPercent, -1) == 100
}

func promotePendingClusterToActive(rayServiceInstance *rayv1.RayService) {
    logger.Info("Promoting pending cluster to active: Incremental upgrade complete.",
        "oldCluster", rayServiceInstance.Status.ActiveServiceStatus.RayClusterName,
        "newCluster", rayServiceInstance.Status.PendingServiceStatus.RayClusterName)
    
    // Perform the actual status swap
    rayServiceInstance.Status.ActiveServiceStatus = rayServiceInstance.Status.PendingServiceStatus
    rayServiceInstance.Status.PendingServiceStatus = rayv1.RayServiceStatus{}
}

func executeTraditionalUpgradeLogic(headSvc, serveSvc *corev1.Service, rayServiceInstance *rayv1.RayService) {
    logger.Info("Executing traditional Blue/Green upgrade logic")
    
    // Validate that both services exist
    if headSvc == nil || serveSvc == nil {
        logger.Info("Services not ready, skipping traditional upgrade logic")
        return
    }
    
    // Step 1: Service Consistency Check
    // Ensure head and serve services point to the same cluster
    if err := validateServiceConsistency(headSvc, serveSvc); err != nil {
        logger.Error(err, "Service consistency validation failed")
        return
    }
    
    // Step 2: Cluster Switching Logic
    // Determine which cluster the services are currently pointing to
    clusterName := utils.GetRayClusterNameFromService(headSvc)
    pendingClusterName := rayServiceInstance.Status.PendingServiceStatus.RayClusterName
    activeClusterName := rayServiceInstance.Status.ActiveServiceStatus.RayClusterName
    
    // Update the serving status
    isPendingClusterServing = clusterName == pendingClusterName
    
    // Step 3: Execute Cluster Promotion (if needed)
    // If services are pointing to a different cluster than the current active one
    if activeClusterName != clusterName {
        logger.Info("Promoting pending cluster to active (Traditional upgrade)",
            "oldCluster", activeClusterName,
            "newCluster", clusterName)
        
        // Perform the cluster status swap
        rayServiceInstance.Status.ActiveServiceStatus = rayServiceInstance.Status.PendingServiceStatus
        rayServiceInstance.Status.PendingServiceStatus = rayv1.RayServiceStatus{}
    }
}

func validateServiceConsistency(headSvc, serveSvc *corev1.Service) error {
    clusterName := utils.GetRayClusterNameFromService(headSvc)
    serveSvcClusterName := utils.GetRayClusterNameFromService(serveSvc)
    
    // Check if both services point to the same cluster
    if clusterName != serveSvcClusterName {
        return fmt.Errorf("service inconsistency detected: head service points to %s but serve service points to %s", 
            clusterName, serveSvcClusterName)
    }
    
    return nil
}

@rueian rueian added the 1.5.0 label Oct 13, 2025
@Future-Outlier
Copy link
Member

Future-Outlier commented Oct 14, 2025

How I did the load test:

  1. kubectl port-forward
kubectl port-forward svc/rayservice-incremental-upgrade-gateway-istio 8080:80 -n default
  2. Locust code
from locust import HttpUser, task, constant

class RayServiceUser(HttpUser):
    wait_time = constant(0)  # Zero wait time between requests
    
    def on_start(self):
        """Called when a user starts"""
        self.host = "http://localhost:8080"
    
    @task
    def call_fruit_endpoint(self):
        """Test the fruit endpoint with the same payload as your curl example"""
        headers = {
            "Content-Type": "application/json",
            "Host": "rayservice-incremental-upgrade.default.svc.cluster.local",
            "Connection": "close"
        }
        
        payload = '["MANGO", 2]'
        
        with self.client.post(
            "/fruit",
            data=payload,
            headers=headers,
            catch_response=True
        ) as response:
            if response.status_code == 200:
                response.success()
            else:
                response.failure(f"Got status code {response.status_code}")
  3. The RPS limit in my setup is near 27-28.
  4. Do the incremental upgrade and check whether there are any failures.
  5. I ran the load test with both excludeHeadPodFromServeSvc: false and excludeHeadPodFromServeSvc: true;
    both work.

Copy link
Member

@Future-Outlier Future-Outlier left a comment

Hi, @ryanaoleary

I tested this PR using Locust in my local environment.
I think we can use Locust to test this feature, but we don't have to do that in this PR.
https://github.com/ray-project/kuberay/blob/master/ray-operator/test/e2erayservice/rayservice_ha_test.go

  1. My Locust test proof:
    #3166 (comment)

  2. Rename IncrementalUpgrade to ClusterUpgrade:
    #3166 (comment)

  3. Do the refactor I mentioned here:
    #3166 (review)

  4. Could you explain this commit? I read it yesterday, and I want to make sure my understanding matches yours.
    f785c6e

After these are done, I think we can get this merged.

cc @rueian @andrewsykim
to align our understanding.
