Tigera Gateway API v3.31.0: xDS connection fails with "no healthy upstream" - Envoy proxy cannot maintain connection to envoy-gateway controller #4257

@charlybeaupe01

Warning

I don't have deep knowledge of the Gateway API yet, but I wanted to test it for our new deployment.
This troubleshooting was therefore done with substantial help from Claude Code.

Environment

  • Calico Version: v3.31.0
  • Kubernetes Version: v1.34.1
  • Deployment Type: Bare metal (3 nodes: 1 master, 2 workers)
  • CNI Configuration: Calico with IPIP mode (ipipMode: Always)
  • LoadBalancer: MetalLB v0.14.8 (Layer 2 mode)
  • Operating System: Ubuntu noble 24.04.3 LTS
  • Installation Method: Tigera Operator

Description

The Envoy proxy pod in Tigera Gateway API consistently fails to maintain an xDS gRPC connection with the envoy-gateway controller. As a result, the Envoy proxy never receives complete listener and cluster configurations, which makes the Gateway API unusable.

The issue is 100% reproducible and occurs regardless of:

  • Pod placement (same node or different nodes)
  • Network topology (with or without IPIP encapsulation)
  • Gateway listener mode (TLS Passthrough or TLS Terminate)
  • Protocol (HTTPRoute or TLSRoute)

Reproduction Steps

1. Install Calico v3.31.0 via Tigera Operator

We used the Helm chart.

helm install calico projectcalico/tigera-operator --version v3.31.0 --create-namespace --namespace tigera-operator 

2. Enable Gateway API

apiVersion: operator.tigera.io/v1
kind: GatewayAPI
metadata:
  name: default
spec:
  crdManagement: PreferExisting
  gatewayClasses:
  - name: tigera-gateway-class
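
After applying this, the operator should bring up the envoy-gateway controller. A quick sanity check (the manifest filename is ours; the namespace and class name are from the default install):

kubectl apply -f gatewayapi.yaml
kubectl get pods -n tigera-gateway
kubectl get gatewayclass tigera-gateway-class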

3. Create namespace and certificate

kubectl create namespace gateway-infra
kubectl create secret tls gateway-tls-secret -n gateway-infra \
  --cert=<cert-file> --key=<key-file>
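
For testing, a self-signed certificate is enough; for example (the hostname is a placeholder, adjust to your own):

openssl req -x509 -newkey rsa:2048 -nodes -days 365 \
  -keyout tls.key -out tls.crt \
  -subj "/CN=apps.example.internal" \
  -addext "subjectAltName=DNS:apps.example.internal"

Then pass tls.crt and tls.key to the kubectl create secret command above.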

4. Deploy a Gateway

apiVersion: gateway.networking.k8s.io/v1
kind: Gateway
metadata:
  name: shared-gateway
  namespace: gateway-infra
spec:
  gatewayClassName: tigera-gateway-class
  listeners:
  - name: https
    protocol: HTTPS
    port: 443
    tls:
      mode: Terminate
      certificateRefs:
      - name: gateway-tls-secret
    allowedRoutes:
      namespaces:
        from: All

5. Create an HTTPRoute

apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: test-route
  namespace: default
spec:
  parentRefs:
  - name: shared-gateway
    namespace: gateway-infra
  rules:
  - backendRefs:
    - name: kubernetes
      port: 443
      kind: Service

6. Check Gateway status and envoy proxy logs

kubectl get gateway shared-gateway -n gateway-infra -o yaml
kubectl logs -n tigera-gateway -l gateway.envoyproxy.io/owning-gateway-name=shared-gateway -c envoy

Observed Behavior

Gateway Status

status:
  conditions:
  - status: "True"
    type: Programmed
    message: "Address assigned to the Gateway, 1/1 envoy replicas available"
  listeners:
  - conditions:
    - status: "True"
      type: Programmed
      message: "Sending translated listener configuration to the data plane"

Envoy Proxy Logs

[2025-11-10 19:31:30.053][1][warning][misc] [source/common/protobuf/message_validator_impl.cc:23] Deprecated field: type envoy.config.route.v3.HeaderMatcher Using deprecated option 'envoy.config.route.v3.HeaderMatcher.exact_match' from file route_components.proto...
[2025-11-10 19:31:45.065][1][warning][config] [source/extensions/config_subscription/grpc/grpc_subscription_impl.cc:130] gRPC config: initial fetch timed out for type.googleapis.com/envoy.config.cluster.v3.Cluster
[2025-11-10 19:32:00.067][1][warning][config] [source/extensions/config_subscription/grpc/grpc_subscription_impl.cc:130] gRPC config: initial fetch timed out for type.googleapis.com/envoy.config.listener.v3.Listener
[2025-11-10 19:32:04.340][1][warning][config] [./source/extensions/config_subscription/grpc/grpc_stream.h:226] DeltaAggregatedResources gRPC config stream to xds_cluster closed since 34s ago: 14, no healthy upstream
[2025-11-10 19:32:06.999][1][warning][config] [./source/extensions/config_subscription/grpc/grpc_stream.h:226] DeltaAggregatedResources gRPC config stream to xds_cluster closed since 36s ago: 14, no healthy upstream

This pattern repeats indefinitely.

Envoy-gateway Controller Logs (debug logging enabled)

2025-11-10T18:42:49.249Z  INFO  xds-server  runner/runner.go:89   loaded TLS certificate and key {"runner": "xds-server"}
2025-11-10T18:42:49.249Z  INFO  xds-server  runner/runner.go:104  started {"runner": "xds-server"}
2025-11-10T18:42:49.579Z  INFO  xds-server  runner/runner.go:151  received an update {"runner": "xds-server"}

Critical observation: No client connection attempts are ever logged by the envoy-gateway controller, despite the envoy proxy attempting connections.

Expected Behavior

  1. Envoy proxy establishes xDS gRPC connection to envoy-gateway.tigera-gateway.svc.cluster.local:18000
  2. Envoy proxy receives complete Cluster and Listener configurations
  3. Envoy proxy maintains stable xDS connection
  4. Gateway becomes functional and routes traffic
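
To measure how far the configuration transfer actually gets, the proxy's received config can be dumped through the Envoy admin interface. A sketch, assuming the Envoy Gateway default admin port of 19000 bound to localhost in the pod (adjust if your build differs):

ENVOY_POD=$(kubectl get pods -n tigera-gateway \
  -l gateway.envoyproxy.io/owning-gateway-name=shared-gateway \
  -o jsonpath='{.items[0].metadata.name}')
kubectl port-forward -n tigera-gateway "$ENVOY_POD" 19000:19000 &
curl -s localhost:19000/config_dump | less
# Look for ClustersConfigDump / ListenersConfigDump entries to see which
# dynamic resources, if any, the proxy actually received.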

Troubleshooting Performed

We performed extensive troubleshooting to isolate the issue:

✅ Infrastructure Validation (All Working)

1. DNS Resolution

kubectl run dns-test --image=nicolaka/netshoot --rm -i --restart=Never -n tigera-gateway -- \
  nslookup envoy-gateway.tigera-gateway.svc.cluster.local

Result: Resolves correctly to service ClusterIP

2. TCP Connectivity

kubectl run tcp-test --image=nicolaka/netshoot --rm -i --restart=Never -n tigera-gateway -- \
  nc -zv envoy-gateway.tigera-gateway.svc.cluster.local 18000

Result: Connection succeeds

3. TLS/mTLS Handshake

kubectl run tls-test --image=nicolaka/netshoot --rm -i --restart=Never -n tigera-gateway -- \
  openssl s_client -connect envoy-gateway.tigera-gateway.svc.cluster.local:18000

Result: Successfully completes TLS handshake, server requests client certificate
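
To go a step beyond a bare handshake, the same test can present the proxy's own client certificate. A sketch (the secret name envoy and its key names are assumptions; point the commands at whatever secret actually holds the proxy's xDS client cert):

kubectl get secret -n tigera-gateway envoy -o jsonpath='{.data.tls\.crt}' | base64 -d > client.crt
kubectl get secret -n tigera-gateway envoy -o jsonpath='{.data.tls\.key}' | base64 -d > client.key
kubectl get secret -n tigera-gateway envoy -o jsonpath='{.data.ca\.crt}' | base64 -d > ca.crt
kubectl port-forward -n tigera-gateway svc/envoy-gateway 18000:18000 &
openssl s_client -connect localhost:18000 -cert client.crt -key client.key -CAfile ca.crt

A completed handshake here (no alert from the server) narrows the failure to the gRPC layer above TLS.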

4. Network Policies

kubectl get networkpolicies -A

Result: No policies blocking tigera-gateway namespace traffic

5. Certificate Validity

  • Server cert (envoy-gateway): valid; SANs include DNS:envoy-gateway and DNS:envoy-gateway.tigera-gateway.svc.cluster.local (see the check below)
  • Client cert (envoy): valid, signed by the same CA
  • CA certificates match between both secrets
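
The SAN claims can be double-checked straight from the secret (the secret name envoy-gateway is an assumption; point it at the server-cert secret):

kubectl get secret -n tigera-gateway envoy-gateway -o jsonpath='{.data.tls\.crt}' \
  | base64 -d | openssl x509 -noout -text | grep -A1 'Subject Alternative Name'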

✅ Network Topology Testing

Test 1: Cross-node communication (default)

  • envoy-gateway controller on worker02
  • envoy proxy on worker01
  • Result: xDS connection fails

Test 2: Same-node communication

  • Forced both pods to worker02 using nodeSelector
  • Result: xDS connection still fails ← This rules out IPIP/cross-node issues

Test 3: Cross-node with direct pod IP

  • Tested direct connection from worker01 to worker02 pod IP
  • Result: TCP and TLS work, but xDS stream still fails

🔍 Critical Pattern Observed

The Envoy proxy logs show a consistent pattern:

  1. Initial configuration received (HeaderMatcher deprecation warnings prove this)
  2. Timeouts at 15 seconds for Cluster and Listener fetch
  3. Immediate disconnection with "no healthy upstream"
  4. Never reconnects successfully

This pattern suggests:

  • Initial mTLS connection succeeds
  • Partial xDS configuration is sent
  • Connection closes before complete configuration transfer
  • Reconnection attempts all fail immediately
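
One way to exercise the gRPC layer directly, beyond raw TCP/TLS, is grpcurl. A sketch, reusing the cert files and port-forward from the mTLS test above (assumes grpcurl is installed locally):

grpcurl -cacert ca.crt -cert client.crt -key client.key \
  -servername envoy-gateway localhost:18000 list
# Even a "server does not support the reflection API" reply would prove
# that an authenticated gRPC stream can be established end to end.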

❌ envoy-gateway Controller Never Sees Connections

Despite the envoy proxy attempting connections, the envoy-gateway controller logs no incoming connection attempts. With debug logging enabled, we see:

  • xDS server starts successfully
  • Configuration updates are generated
  • No client stream connections logged
  • No TLS handshake logs
  • No authentication/authorization logs

This suggests the connection is failing at a lower level (possibly during or immediately after the TLS handshake).
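
To see whether the proxy's connection attempts reach the controller at all, a packet capture from an ephemeral debug container sharing the controller pod's network namespace can help (sketch; substitute the actual controller pod name):

kubectl debug -n tigera-gateway <envoy-gateway-pod> -it --image=nicolaka/netshoot -- \
  tcpdump -i any -n port 18000
# SYNs with no payload exchange, or an immediate FIN/RST right after the
# TLS handshake, would pinpoint where the stream is being torn down.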

Configuration Details

Envoy Bootstrap xDS Cluster

clusters:
  - connect_timeout: 10s
    load_assignment:
      cluster_name: xds_cluster
      endpoints:
      - load_balancing_weight: 1
        lb_endpoints:
        - load_balancing_weight: 1
          endpoint:
            address:
              socket_address:
                address: envoy-gateway.tigera-gateway.svc.cluster.local
                port_value: 18000
    typed_extension_protocol_options:
      envoy.extensions.upstreams.http.v3.HttpProtocolOptions:
        "@type": "type.googleapis.com/envoy.extensions.upstreams.http.v3.HttpProtocolOptions"
        explicit_http_config:
          http2_protocol_options:
            connection_keepalive:
              interval: 30s
              timeout: 5s
    name: xds_cluster
    type: STRICT_DNS
    transport_socket:
      name: envoy.transport_sockets.tls
      typed_config:
        "@type": type.googleapis.com/envoy.extensions.transport_sockets.tls.v3.UpstreamTlsContext
        common_tls_context:
          tls_params:
            tls_maximum_protocol_version: TLSv1_3
          tls_certificate_sds_secret_configs:
          - name: xds_certificate
            sds_config:
              path_config_source:
                path: /sds/xds-certificate.json
              resource_api_version: V3
          validation_context_sds_secret_config:
            name: xds_trusted_ca
            sds_config:
              path_config_source:
                path: /sds/xds-trusted-ca.json
              resource_api_version: V3

SDS Validation Config

{
  "resources": [{
    "@type": "type.googleapis.com/envoy.extensions.transport_sockets.tls.v3.Secret",
    "name": "xds_trusted_ca",
    "validation_context": {
      "trusted_ca": {
        "filename": "/certs/ca.crt"
      },
      "match_typed_subject_alt_names": [{
        "san_type": "DNS",
        "matcher": {
          "exact": "envoy-gateway"
        }
      }]
    }
  }]
}
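
It is also worth confirming that the SDS files and certificates the bootstrap references actually exist inside the proxy container, using the $ENVOY_POD variable from the config_dump sketch above (this assumes the proxy image ships a shell; otherwise use kubectl debug):

kubectl exec -n tigera-gateway "$ENVOY_POD" -c envoy -- cat /sds/xds-trusted-ca.json
kubectl exec -n tigera-gateway "$ENVOY_POD" -c envoy -- ls -l /certs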

Workarounds Attempted

  • ❌ Restarting pods
  • ❌ Recreating Gateway resources
  • ❌ Using TLS Passthrough instead of Terminate
  • ❌ Using TLSRoute instead of HTTPRoute
  • ❌ Co-locating pods on same node
  • ❌ Upgrading from v3.30.4 to v3.31.0
  • ✅ Only working solution: Using alternative ingress controller (Traefik/NGINX)

Impact

Severity: Critical

Tigera Gateway API is completely non-functional in Calico v3.31.0, preventing:

  • Adoption of Kubernetes Gateway API standard
  • Migration from legacy Ingress resources
  • Using Calico-integrated gateway features
  • Production deployments requiring Gateway API

Users must fall back to alternative ingress controllers (nginx-ingress, Traefik) as a workaround.

Additional Context

This issue appears similar to envoyproxy/gateway#2813, which involved hostNetwork: true and DNS resolution issues. However, our issue occurs with:

  • Standard pod networking (no hostNetwork)
  • Confirmed working DNS resolution
  • Confirmed working TCP/TLS connectivity
  • Same behavior on same node (no network traversal)
