
Unable to access service/ingress to an offloaded pod: 504 Gateway Time-out #2909

Closed
remmen-io opened this issue Jan 22, 2025 · 4 comments · Fixed by #2924
Labels: fix (Fixes a bug in the codebase.)


@remmen-io

What happened:

I've deployed a pod, a service, and an ingress on a cluster with the network fabric and offloading enabled.
The service and ingress are excluded from resource reflection.

The pod starts successfully, I can see its logs, and I can access the service or the pod directly with kubectl port-forward.

Accessing the ingress, however, returns a 504:

➜ curl https://vllm.e1-mfmm-lab-b.mm.ch/v1/models
<html>
<head><title>504 Gateway Time-out</title></head>
<body>
<center><h1>504 Gateway Time-out</h1></center>
<hr><center>nginx</center>
</body>
</html>


⎈ e1-k8s-mfmm-lab-b-admin (liqo-demo) ~ on ☁️  (local) took 15s
➜ k port-forward svc/service 8000:8000 &
[1] 97549

⎈ e1-k8s-mfmm-lab-b-admin (liqo-demo) ~ on ☁️  (local)
✦ ➜ Forwarding from 127.0.0.1:8000 -> 8000
Forwarding from [::1]:8000 -> 8000
curl 127.0.0.1:8000/v1/models
Handling connection for 8000
{"object":"list","data":[{"id":"microsoft/Phi-3.5-mini-instruct","object":"model","created":1737552716,"owned_by":"vllm","root":"/models-cache/Phi-3.5-mini-instruct","parent":null,"max_model_len":20000,"permission":[{"id":"modelperm-1d8b39fd8eeb49c8ab9503778c4b5c47","object":"model_permission","created":1737552716,"allow_create_engine":false,"allow_sampling":true,"allow_logprobs":true,"allow_search_indices":false,"allow_view":true,"allow_fine_tuning":false,"organization":"*","group":null,"is_blocking":false}]}]}

What you expected to happen:

To be able to access the service through the ingress.

How to reproduce it (as minimally and precisely as possible):

Anything else we need to know?:

We are using Cilium as the CNI with native routing. There is no NetworkPolicy preventing any traffic.

We have noticed that the pod gets the IP 10.71.72.225, which is not inside the remapped 10.71.0.0/18 pod CIDR.
As a result, traffic from a pod to this IP is routed over the default gateway, which we think is wrong.
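
One way to double-check which route the node itself would pick for that address (a hypothetical check, not output captured in this issue; mark- or iif-based policy rules may change the result):

# hypothetical check: which route does the kernel pick for the mis-remapped IP,
# and for an address that sits inside the expected remapped /18?
ip route get 10.71.72.225   # outside 10.71.0.0/18, so it should fall through to the default route via ens192
ip route get 10.71.8.1      # inside 10.71.0.0/18, expected to match the liqo.9v2lmk2jtf routes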

On the node where the debug pod was running (curl against the service/pod IP):

$ ip route show table 0
10.68.0.0/16 via 10.80.0.7 dev liqo.9v2lmk2jtf table 1878365176
10.71.0.0/18 via 10.80.0.7 dev liqo.9v2lmk2jtf table 1878365176
10.80.0.7 dev liqo.9v2lmk2jtf table 1878365176 scope link
default via 172.16.183.254 dev ens192 onlink

$ sudo tcpdump -ni any host 10.71.72.225
tcpdump: data link type LINUX_SLL2
tcpdump: verbose output suppressed, use -v[v]... for full protocol decode
listening on any, link-type LINUX_SLL2 (Linux cooked v2), snapshot length 262144 bytes
13:59:05.578297 lxcb8de2f30aae0 In  IP 10.127.65.220.40990 > 10.71.72.225.8000: Flags [S], seq 1002186232, win 62160, options [mss 8880,sackOK,TS val 632166751 ecr 0,nop,wscale 7], length 0
13:59:05.578337 ens192 Out IP 172.16.182.21.40990 > 10.71.72.225.8000: Flags [S], seq 1002186232, win 62160, options [mss 8880,sackOK,TS val 632166751 ecr 0,nop,wscale 7], length 0
13:59:07.370282 lxcb8de2f30aae0 In  IP 10.127.65.220.55646 > 10.71.72.225.8000: Flags [S], seq 447635172, win 62160, options [mss 8880,sackOK,TS val 632168543 ecr 0,nop,wscale 7], length 0
13:59:07.370313 ens192 Out IP 172.16.182.21.55646 > 10.71.72.225.8000: Flags [S], seq 447635172, win 62160, options [mss 8880,sackOK,TS val 632168543 ecr 0,nop,wscale 7], length 0

But even after manually adding a route, I still got no response, so we might be wrong.
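
For illustration only, the kind of host route that would force this address onto the liqo interface (a sketch based on the table entries above; not necessarily the exact command used) is:

# hypothetical /32 route towards the liqo gateway shown in table 1878365176
sudo ip route add 10.71.72.225/32 via 10.80.0.7 dev liqo.9v2lmk2jtf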

Additional information:

Provider: e1-k8s-mfmm-lab-t
Consumer: e1-k8s-mfmm-lab-b

Liqo Status
➜ liqoctl info -n kube-liqo peer
┌─ Peer cluster info ──────────────────────────────────────────────────────────────┐
|  Cluster ID: e1-k8s-mfmm-lab-t                                                  |
|  Role:       Provider                                                            |
└──────────────────────────────────────────────────────────────────────────────────┘
┌─ Network ────────────────────────────────────────────────────────────────────────┐
|  Status: Healthy                                                                 |
|  CIDR                                                                            |
|      Remote                                                                      |
|          Pod CIDR:      10.127.64.0/18 → Remapped to 10.71.0.0/18                |
|          External CIDR: 10.70.0.0/16 → Remapped to 10.68.0.0/16                  |
|  Gateway                                                                         |
|      Role:    Server                                                             |
|      Address: 172.16.182.238                                                     |
|      Port:    31311                                                              |
└──────────────────────────────────────────────────────────────────────────────────┘
┌─ Authentication ─────────────────────────────────────────────────────────────────┐
|  Status:     Healthy                                                             |
|  API server: https://e1-k8s-mfmm-lab-t-internal.mm.ch                         |
|  Resource slices                                                                 |
|      gpupool                                                                     |
|          Action: Consuming                                                       |
|          Resource slice accepted                                                 |
|          Resources                                                               |
|              cpu:               4                                                |
|              ephemeral-storage: 20Gi                                             |
|              memory:            10Gi                                             |
|              nvidia.com/gpu:    2                                                |
|              pods:              110                                              |
└──────────────────────────────────────────────────────────────────────────────────┘
┌─ Offloading ─────────────────────────────────────────────────────────────────────┐
|  Status: Healthy                                                                 |
|  Virtual nodes                                                                   |
|      gpupool                                                                     |
|          Status:         Healthy                                                 |
|          Secret:         kubeconfig-resourceslice-gpupool                        |
|          Resource slice: gpupool                                                 |
|          Resources                                                               |
|              cpu:               4                                                |
|              ephemeral-storage: 20Gi                                             |
|              memory:            10Gi                                             |
|              nvidia.com/gpu:    2                                                |
|              pods:              110                                              |
└──────────────────────────────────────────────────────────────────────────────────┘




⎈ e1-k8s-mfmm-lab-t-admin (talos-vllm) kube-liqo on  liqo-network on ☁️  (local)
➜ liqoctl info -n kube-liqo peer
┌─ Peer cluster info ──────────────────────────────────────────────────────────────┐
|  Cluster ID: e1-k8s-mfmm-lab-b                                                  |
|  Role:       Consumer                                                            |
└──────────────────────────────────────────────────────────────────────────────────┘
┌─ Network ────────────────────────────────────────────────────────────────────────┐
|  Status: Healthy                                                                 |
|  CIDR                                                                            |
|      Remote                                                                      |
|          Pod CIDR:      10.127.64.0/18 → Remapped to 10.71.0.0/18                |
|          External CIDR: 10.70.0.0/16 → Remapped to 10.68.0.0/16                  |
|  Gateway                                                                         |
|      Role:    Client                                                             |
|      Address: 172.16.182.238                                                     |
|      Port:    31311                                                              |
└──────────────────────────────────────────────────────────────────────────────────┘
┌─ Authentication ─────────────────────────────────────────────────────────────────┐
|  Status: Healthy                                                                 |
|  Resource slices                                                                 |
|      gpupool                                                                     |
|          Action: Providing                                                       |
|          Resource slice accepted                                                 |
|          Resources                                                               |
|              cpu:               4                                                |
|              ephemeral-storage: 20Gi                                             |
|              memory:            10Gi                                             |
|              nvidia.com/gpu:    2                                                |
|              pods:              110                                              |
└──────────────────────────────────────────────────────────────────────────────────┘
┌─ Offloading ─────────────────────────────────────────────────────────────────────┐
|  Status: Disabled                                                                |
└──────────────────────────────────────────────────────────────────────────────────┘
Deployment
⎈ e1-k8s-mfmm-lab-b-admin (liqo-demo) kube-liqo on  liqo-network on ☁️  (local)
➜ k describe deployments.apps vllm
Name:               vllm
Namespace:          liqo-demo
CreationTimestamp:  Wed, 22 Jan 2025 10:57:34 +0100
Labels:             <none>
Annotations:        deployment.kubernetes.io/revision: 1
Selector:           app=vllm
Replicas:           1 desired | 1 updated | 1 total | 1 available | 0 unavailable
StrategyType:       Recreate
MinReadySeconds:    0
Pod Template:
  Labels:       app=vllm
  Annotations:  prometheus.io/path: /metrics
                prometheus.io/port: 8000
                prometheus.io/scheme: http
                prometheus.io/scrape: true
  Init Containers:
   s3toolbox:
    Image:      linux-docker-local.repo.mm.ch/herrem/s3toolbox:0.0.1
    Port:       <none>
    Host Port:  <none>
    Command:
      sh
      -c
      aws s3 sync s3://appl-tgi-e1/microsoft/Phi-3.5-mini-instruct/ /models-cache/Phi-3.5-mini-instruct --endpoint-url https://s3-tpfhyst.mm.ch
    Environment:
      AWS_ACCESS_KEY_ID:      <set to the key 'access-key' in secret 's3-credentials'>  Optional: false
      AWS_SECRET_ACCESS_KEY:  <set to the key 'secret-key' in secret 's3-credentials'>  Optional: false
    Mounts:
      /models-cache from models-cache (rw)
  Containers:
   vllm:
    Image:      vllm/vllm-openai:v0.6.5
    Port:       8000/TCP
    Host Port:  0/TCP
    Args:
      --model
      /models-cache/Phi-3.5-mini-instruct
      --served-model-name
      microsoft/Phi-3.5-mini-instruct
      --gpu-memory-utilization
      0.95
      --max-model-len
      20000
      --enforce-eager
      --disable-log-requests
    Limits:
      memory:          8Gi
      nvidia.com/gpu:  1
    Requests:
      cpu:             1
      nvidia.com/gpu:  1
    Liveness:          http-get http://:8000/health delay=0s timeout=8s period=120s #success=1 #failure=3
    Readiness:         http-get http://:8000/health delay=0s timeout=5s period=120s #success=1 #failure=3
    Startup:           http-get http://:8000/health delay=0s timeout=1s period=120s #success=1 #failure=24
    Environment:       <none>
    Mounts:
      /dev/shm from shm (rw)
      /models-cache from models-cache (rw)
  Volumes:
   models-cache:
    Type:       EmptyDir (a temporary directory that shares a pod's lifetime)
    Medium:
    SizeLimit:  <unset>
   shm:
    Type:       EmptyDir (a temporary directory that shares a pod's lifetime)
    Medium:     Memory
    SizeLimit:  1Gi
Conditions:
  Type           Status  Reason
  ----           ------  ------
  Progressing    True    NewReplicaSetAvailable
  Available      True    MinimumReplicasAvailable
OldReplicaSets:  <none>
NewReplicaSet:   vllm-85f5dfdb49 (1/1 replicas created)
Events:          <none>

⎈ e1-k8s-mfmm-lab-b-admin (liqo-demo) kube-liqo on  liqo-network on ☁️  (local)
➜ k describe pod vllm-85f5dfdb49-584f5
Name:             vllm-85f5dfdb49-584f5
Namespace:        liqo-demo
Priority:         0
Service Account:  default
Node:             gpupool/10.127.65.176
Start Time:       Wed, 22 Jan 2025 13:22:31 +0100
Labels:           app=vllm
                  liqo.io/shadowPod=true
                  pod-template-hash=85f5dfdb49
Annotations:      prometheus.io/path: /metrics
                  prometheus.io/port: 8000
                  prometheus.io/scheme: http
                  prometheus.io/scrape: true
Status:           Running
IP:               10.71.72.225
IPs:
  IP:           10.71.72.225
Controlled By:  ReplicaSet/vllm-85f5dfdb49
Init Containers:
  s3toolbox:
    Container ID:  containerd://402423a117a8fea10669ec7a13710cc07affc431bb8a6fdb50a0cdf6bdee7e20
    Image:         linux-docker-local.repo.mm.ch/herrem/s3toolbox:0.0.1
    Image ID:      linux-docker-local.repo.mm.ch/herrem/s3toolbox@sha256:0c0cc08325c39f68bf1a04399e9f3f225472187b936df65098502a73b5776484
    Port:          <none>
    Host Port:     <none>
    Command:
      sh
      -c
      aws s3 sync s3://appl-tgi-e1/microsoft/Phi-3.5-mini-instruct/ /models-cache/Phi-3.5-mini-instruct --endpoint-url https://s3-tpfhyst.mm.ch
    State:          Terminated
      Reason:       Completed
      Exit Code:    0
      Started:      Wed, 22 Jan 2025 13:22:35 +0100
      Finished:     Wed, 22 Jan 2025 13:24:14 +0100
    Ready:          True
    Restart Count:  0
    Environment:
      AWS_ACCESS_KEY_ID:      <set to the key 'access-key' in secret 's3-credentials'>  Optional: false
      AWS_SECRET_ACCESS_KEY:  <set to the key 'secret-key' in secret 's3-credentials'>  Optional: false
    Mounts:
      /models-cache from models-cache (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-l9w4q (ro)
Containers:
  vllm:
    Container ID:  containerd://597f320cb67244783638d1f994894c479aa3798460e44cfdecbde4294ef984cf
    Image:         vllm/vllm-openai:v0.6.5
    Image ID:      docker.io/vllm/vllm-openai@sha256:42f117dffe16e978f9567084e5cda18f85fdcfbc18568536a1208a69419c77cf
    Port:          8000/TCP
    Host Port:     0/TCP
    Args:
      --model
      /models-cache/Phi-3.5-mini-instruct
      --served-model-name
      microsoft/Phi-3.5-mini-instruct
      --gpu-memory-utilization
      0.95
      --max-model-len
      20000
      --enforce-eager
      --disable-log-requests
    State:          Running
      Started:      Wed, 22 Jan 2025 13:24:21 +0100
    Ready:          True
    Restart Count:  0
    Limits:
      memory:          8Gi
      nvidia.com/gpu:  1
    Requests:
      cpu:             1
      memory:          8Gi
      nvidia.com/gpu:  1
    Liveness:          http-get http://:8000/health delay=0s timeout=8s period=120s #success=1 #failure=3
    Readiness:         http-get http://:8000/health delay=0s timeout=5s period=120s #success=1 #failure=3
    Startup:           http-get http://:8000/health delay=0s timeout=1s period=120s #success=1 #failure=24
    Environment:       <none>
    Mounts:
      /dev/shm from shm (rw)
      /models-cache from models-cache (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-l9w4q (ro)
Conditions:
  Type                        Status
  PodReadyToStartContainers   True
  Initialized                 True
  Ready                       True
  ContainersReady             True
  PodScheduled                True
Volumes:
  models-cache:
    Type:       EmptyDir (a temporary directory that shares a pod's lifetime)
    Medium:
    SizeLimit:  <unset>
  shm:
    Type:       EmptyDir (a temporary directory that shares a pod's lifetime)
    Medium:     Memory
    SizeLimit:  1Gi
  kube-api-access-l9w4q:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  3607
    ConfigMapName:           kube-root-ca.crt
    ConfigMapOptional:       <nil>
    DownwardAPI:             true
QoS Class:                   Burstable
Node-Selectors:              <none>
Tolerations:                 node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                             node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
                             nvidia.com/gpu:NoSchedule op=Exists
                             virtual-node.liqo.io/not-allowed:NoExecute op=Exists
Events:
  Type     Reason                        Age                   From                            Message
  ----     ------                        ----                  ----                            -------
  Normal   Scheduled                     7m7s                  default-scheduler               Successfully assigned liqo-demo/vllm-85f5dfdb49-584f5 to gpupool
  Warning  FailedScheduling              7m49s                 default-scheduler               0/7 nodes are available: 1 Insufficient memory, 3 Insufficient nvidia.com/gpu, 3 node(s) had untolerated taint {node-role.kubernetes.io/control-plane: }. preemption: 0/7 nodes are available: 3 Preemption is not helpful for scheduling, 4 No preemption victims found for incoming pod.
  Normal   Scheduled                     7m7s                   (remote)                       Successfully assigned liqo-demo-e1-k8s-mfmm-lab-b/vllm-85f5dfdb49-584f5
to e1-k8shpc-alzf001
  Normal   SuccessfulSATokensReflection  7m7s (x2 over 7m7s)   liqo-serviceaccount-reflection  Successfully reflected object to cluster "e1-k8s-mfmm-lab-t"
  Normal   Created                       7m3s                  kubelet (remote)                Created container s3toolbox
  Normal   Started                       7m3s                  kubelet (remote)                Started container s3toolbox
  Normal   Pulled                        7m3s                  kubelet (remote)                Container image "linux-docker-local.repo.mm.ch/herrem/s3toolbox:0.0.1"
already present on machine
  Normal   Pulled                        5m19s                 kubelet (remote)                Container image "vllm/vllm-openai:v0.6.5" already present on machine
  Normal   Created                       5m17s                 kubelet (remote)                Created container vllm
  Normal   Started                       5m17s                 kubelet (remote)                Started container vllm
  Warning  Unhealthy                     5m7s                  kubelet (remote)                Startup probe failed: Get "http://10.127.72.225:8000/health": dial tcp 10.127.72.225:8000: connect: connection refused
  Normal   SuccessfulReflection          3m7s (x8 over 7m7s)   liqo-pod-reflection             Successfully reflected object status back from cluster "e1-k8s-mfmm-lab-t"
  Normal   SuccessfulReflection          3m7s (x17 over 7m7s)  liqo-pod-reflection             Successfully reflected object to cluster "e1-k8s-mfmm-lab-t"

⎈ e1-k8s-mfmm-lab-b-admin (liqo-demo) kube-liqo on  liqo-network on ☁️  (local)
➜ k describe service service
Name:              service
Namespace:         liqo-demo
Labels:            app=vllm-service
Annotations:       liqo.io/skip-reflection: true
Selector:          app=vllm
Type:              ClusterIP
IP Family Policy:  SingleStack
IP Families:       IPv4
IP:                10.127.21.32
IPs:               10.127.21.32
Port:              vllm-port  8000/TCP
TargetPort:        8000/TCP
Endpoints:         10.71.72.225:8000
Session Affinity:  None
Events:            <none>

⎈ e1-k8s-mfmm-lab-b-admin (liqo-demo) kube-liqo on  liqo-network on ☁️  (local)
➜ k describe ingress
Name:             phi3-ingress
Labels:           <none>
Namespace:        liqo-demo
Address:          172.16.178.18
Ingress Class:    nginx
Default backend:  <default>
TLS:
  SNI routes vllm.e1-mfmm-lab-b.mm.ch
Rules:
  Host                         Path  Backends
  ----                         ----  --------
  vllm.e1-mfmm-lab-b.mm.ch
                               /   service:8000 (10.71.72.225:8000)
Annotations:                   liqo.io/skip-reflection: true
Events:                        <none>

Environment:

  • Liqo version: v1.0.0-rc.3
  • Liqoctl version: v1.0.0-rc.2
  • Kubernetes version (use kubectl version): v1.30.4
  • Cloud provider or hardware configuration:
  • Node image:
  • Network plugin and version: cilium:v1.15.8-cee.1 replacing kube-proxy
  • Install tools: Helm
  • Others:
remmen-io added the fix label (Fixes a bug in the codebase.) on Jan 22, 2025
@remmen-io (Author)

It seems the fix label is added automatically.

@cheina97 (Member) commented Feb 6, 2025

Hi @remmen-io, can you give us more insight into how you configured your ingress?

@remmen-io (Author)

Hi @cheina97

Here is the full deployment:

apiVersion: v1
kind: Secret
metadata:
  name: s3-credentials
stringData:
  access-key: XXX
  secret-key: XXX
type: Opaque
---
apiVersion: v1
kind: Service
metadata:
  annotations:
    liqo.io/skip-reflection: "true"
  labels:
    app: vllm-service
  name: service
spec:
  ports:
  - name: vllm-port
    port: 8000
    protocol: TCP
    targetPort: 8000
  selector:
    app: vllm
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: vllm
spec:
  selector:
    matchLabels:
      app: vllm
  strategy:
    type: Recreate
  template:
    metadata:
      annotations:
        prometheus.io/path: /metrics
        prometheus.io/port: "8000"
        prometheus.io/scheme: http
        prometheus.io/scrape: "true"
      labels:
        app: vllm
    spec:
      containers:
      - args:
        - --model
        - /models-cache/Phi-3.5-mini-instruct
        - --served-model-name
        - microsoft/Phi-3.5-mini-instruct
        - --gpu-memory-utilization
        - "0.95"
        - --max-model-len
        - "20000"
        - --enforce-eager
        - --disable-log-requests
        image: vllm/vllm-openai:v0.6.5
        imagePullPolicy: IfNotPresent
        livenessProbe:
          failureThreshold: 3
          httpGet:
            path: /health
            port: 8000
            scheme: HTTP
          periodSeconds: 120
          successThreshold: 1
          timeoutSeconds: 8
        name: vllm
        ports:
        - containerPort: 8000
          protocol: TCP
        readinessProbe:
          failureThreshold: 3
          httpGet:
            path: /health
            port: 8000
            scheme: HTTP
          periodSeconds: 120
          successThreshold: 1
          timeoutSeconds: 5
        resources:
          limits:
            memory: 8Gi
            nvidia.com/gpu: "1"
          requests:
            cpu: "1"
            nvidia.com/gpu: "1"
        startupProbe:
          failureThreshold: 24
          httpGet:
            path: /health
            port: 8000
            scheme: HTTP
          periodSeconds: 120
          successThreshold: 1
          timeoutSeconds: 1
        volumeMounts:
        - mountPath: /models-cache
          name: models-cache
        - mountPath: /dev/shm
          name: shm
      initContainers:
      - command:
        - sh
        - -c
        - aws s3 sync s3://appl-tgi-e1/microsoft/Phi-3.5-mini-instruct/ /models-cache/Phi-3.5-mini-instruct
          --endpoint-url https://s3-tpfhyst.mm.ch
        env:
        - name: AWS_ACCESS_KEY_ID
          valueFrom:
            secretKeyRef:
              key: access-key
              name: s3-credentials
        - name: AWS_SECRET_ACCESS_KEY
          valueFrom:
            secretKeyRef:
              key: secret-key
              name: s3-credentials
        image: linux-docker-local.repo.mm.ch/herrem/s3toolbox:0.0.1
        name: s3toolbox
        volumeMounts:
        - mountPath: /models-cache
          name: models-cache
      volumes:
      - name: models-cache
        emptyDir: {}
      - emptyDir:
          medium: Memory
          sizeLimit: 1Gi
        name: shm
---
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: phi3-ingress
  annotations:
    liqo.io/skip-reflection: "true"
spec:
  ingressClassName: nginx
  rules:
  - host: vllm.e1-mfmm-lab-b.mm.ch
    http:
      paths:
      - backend:
          service:
            name: service
            port:
              number: 8000
        path: /
        pathType: Prefix
  tls:
  - hosts:
    - vllm.e1-mfmm-lab-b.mm.ch
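
To reproduce with this manifest, the rough steps are (a sketch; the filename is a placeholder and exact liqoctl usage may differ in your setup):

# hypothetical reproduction steps, not copied from the issue
liqoctl offload namespace liqo-demo               # offload the namespace to the provider cluster
kubectl -n liqo-demo apply -f vllm.yaml           # the manifest above, saved as vllm.yaml (placeholder name)
curl https://vllm.e1-mfmm-lab-b.mm.ch/v1/models   # through the ingress this returned 504 in our case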

cheina97 linked a pull request on Feb 7, 2025 that will close this issue
@cheina97 (Member) commented Feb 7, 2025

Hi @remmen-io, I think we fixed your issue in PR #2924. We found a bug in the IP remapping algorithm. Thanks for helping us spot it.
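
For context, a minimal sketch (plain shell arithmetic, not Liqo's actual remapping code) of where the provider-side pod IP 10.127.72.225 would land if the remapping preserved the host offset within the /18:

# hypothetical illustration, assuming the remapping keeps the host offset within the /18
ip2int() { local a b c d; IFS=. read -r a b c d <<< "$1"; echo $(( (a<<24) | (b<<16) | (c<<8) | d )); }
int2ip() { local n=$1; echo "$(( (n>>24) & 255 )).$(( (n>>16) & 255 )).$(( (n>>8) & 255 )).$(( n & 255 ))"; }
offset=$(( $(ip2int 10.127.72.225) - $(ip2int 10.127.64.0) ))   # offset inside the original 10.127.64.0/18
int2ip $(( $(ip2int 10.71.0.0) + offset ))                      # prints 10.71.8.225, inside 10.71.0.0/18
# the observed 10.71.72.225 falls outside the remapped /18, consistent with the reported remapping bug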
