Tigera Gateway API v3.31.0: xDS connection fails with "no healthy upstream" - Envoy proxy cannot maintain connection to envoy-gateway controller #4257

@charlybeaupe01

Warning

I don't have deep knowledge of the Gateway API yet, but I wanted to test it for our new deployment.
This troubleshooting was therefore done with substantial help from Claude Code.

Environment

  • Calico Version: v3.31.0
  • Kubernetes Version: v1.34.1
  • Deployment Type: Bare metal (3 nodes: 1 master, 2 workers)
  • CNI Configuration: Calico with IPIP mode (ipipMode: Always)
  • LoadBalancer: MetalLB v0.14.8 (Layer 2 mode)
  • Operating System: Ubuntu noble 24.04.3 LTS
  • Installation Method: Tigera Operator

Description

The Envoy proxy pod in Tigera Gateway API consistently fails to maintain an xDS gRPC connection with the envoy-gateway controller. As a result, the Envoy proxy never receives complete listener and cluster configurations, which makes the Gateway API unusable.

The issue is 100% reproducible and occurs regardless of:

  • Pod placement (same node or different nodes)
  • Network topology (with or without IPIP encapsulation)
  • Gateway listener mode (TLS Passthrough or TLS Terminate)
  • Protocol (HTTPRoute or TLSRoute)

Reproduction Steps

1. Install Calico v3.31.0 via Tigera Operator

We used the Helm chart.

helm install calico projectcalico/tigera-operator --version v3.31.0 --create-namespace --namespace tigera-operator 

2. Enable Gateway API

apiVersion: operator.tigera.io/v1
kind: GatewayAPI
metadata:
  name: default
spec:
  crdManagement: PreferExisting
  gatewayClasses:
  - name: tigera-gateway-class
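
After applying this, the operator should bring up the envoy-gateway controller. A quick sanity check (the manifest filename is ours; the namespace and class name are from the default install):

kubectl apply -f gatewayapi.yaml
kubectl get pods -n tigera-gateway
kubectl get gatewayclass tigera-gateway-class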

3. Create namespace and certificate

kubectl create namespace gateway-infra
kubectl create secret tls gateway-tls-secret -n gateway-infra \
  --cert=<cert-file> --key=<key-file>
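
For testing, a self-signed certificate is enough; for example (the hostname is a placeholder, adjust to your own):

openssl req -x509 -newkey rsa:2048 -nodes -days 365 \
  -keyout tls.key -out tls.crt \
  -subj "/CN=apps.example.internal" \
  -addext "subjectAltName=DNS:apps.example.internal"

Then pass tls.crt and tls.key to the kubectl create secret command above.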

4. Deploy a Gateway

apiVersion: gateway.networking.k8s.io/v1
kind: Gateway
metadata:
  name: shared-gateway
  namespace: gateway-infra
spec:
  gatewayClassName: tigera-gateway-class
  listeners:
  - name: https
    protocol: HTTPS
    port: 443
    tls:
      mode: Terminate
      certificateRefs:
      - name: gateway-tls-secret
    allowedRoutes:
      namespaces:
        from: All

5. Create an HTTPRoute

apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: test-route
  namespace: default
spec:
  parentRefs:
  - name: shared-gateway
    namespace: gateway-infra
  rules:
  - backendRefs:
    - name: kubernetes
      port: 443
      kind: Service

6. Check Gateway status and envoy proxy logs

kubectl get gateway shared-gateway -n gateway-infra -o yaml
kubectl logs -n tigera-gateway -l gateway.envoyproxy.io/owning-gateway-name=shared-gateway -c envoy

Observed Behavior

Gateway Status

status:
  conditions:
  - status: "True"
    type: Programmed
    message: "Address assigned to the Gateway, 1/1 envoy replicas available"
  listeners:
  - conditions:
    - status: "True"
      type: Programmed
      message: "Sending translated listener configuration to the data plane"

Envoy Proxy Logs

[2025-11-10 19:31:30.053][1][warning][misc] [source/common/protobuf/message_validator_impl.cc:23] Deprecated field: type envoy.config.route.v3.HeaderMatcher Using deprecated option 'envoy.config.route.v3.HeaderMatcher.exact_match' from file route_components.proto...
[2025-11-10 19:31:45.065][1][warning][config] [source/extensions/config_subscription/grpc/grpc_subscription_impl.cc:130] gRPC config: initial fetch timed out for type.googleapis.com/envoy.config.cluster.v3.Cluster
[2025-11-10 19:32:00.067][1][warning][config] [source/extensions/config_subscription/grpc/grpc_subscription_impl.cc:130] gRPC config: initial fetch timed out for type.googleapis.com/envoy.config.listener.v3.Listener
[2025-11-10 19:32:04.340][1][warning][config] [./source/extensions/config_subscription/grpc/grpc_stream.h:226] DeltaAggregatedResources gRPC config stream to xds_cluster closed since 34s ago: 14, no healthy upstream
[2025-11-10 19:32:06.999][1][warning][config] [./source/extensions/config_subscription/grpc/grpc_stream.h:226] DeltaAggregatedResources gRPC config stream to xds_cluster closed since 36s ago: 14, no healthy upstream

This pattern repeats indefinitely.

Envoy-gateway Controller Logs (debug logging enabled)

2025-11-10T18:42:49.249Z  INFO  xds-server  runner/runner.go:89   loaded TLS certificate and key {"runner": "xds-server"}
2025-11-10T18:42:49.249Z  INFO  xds-server  runner/runner.go:104  started {"runner": "xds-server"}
2025-11-10T18:42:49.579Z  INFO  xds-server  runner/runner.go:151  received an update {"runner": "xds-server"}

Critical observation: No client connection attempts are ever logged by the envoy-gateway controller, despite the envoy proxy attempting connections.

Expected Behavior

  1. Envoy proxy establishes xDS gRPC connection to envoy-gateway.tigera-gateway.svc.cluster.local:18000
  2. Envoy proxy receives complete Cluster and Listener configurations
  3. Envoy proxy maintains stable xDS connection
  4. Gateway becomes functional and routes traffic
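
To measure how far the configuration transfer actually gets, the proxy's received config can be dumped through the Envoy admin interface. A sketch, assuming the Envoy Gateway default admin port of 19000 bound to localhost in the pod (adjust if your build differs):

ENVOY_POD=$(kubectl get pods -n tigera-gateway \
  -l gateway.envoyproxy.io/owning-gateway-name=shared-gateway \
  -o jsonpath='{.items[0].metadata.name}')
kubectl port-forward -n tigera-gateway "$ENVOY_POD" 19000:19000 &
curl -s localhost:19000/config_dump | less
# Look for ClustersConfigDump / ListenersConfigDump entries to see which
# dynamic resources, if any, the proxy actually received.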

Troubleshooting Performed

We performed extensive troubleshooting to isolate the issue:

✅ Infrastructure Validation (All Working)

1. DNS Resolution

kubectl run dns-test --image=nicolaka/netshoot --rm -i --restart=Never -n tigera-gateway -- \
  nslookup envoy-gateway.tigera-gateway.svc.cluster.local

Result: Resolves correctly to service ClusterIP

2. TCP Connectivity

kubectl run tcp-test --image=nicolaka/netshoot --rm -i --restart=Never -n tigera-gateway -- \
  nc -zv envoy-gateway.tigera-gateway.svc.cluster.local 18000

Result: Connection succeeds

3. TLS/mTLS Handshake

kubectl run tls-test --image=nicolaka/netshoot --rm -i --restart=Never -n tigera-gateway -- \
  openssl s_client -connect envoy-gateway.tigera-gateway.svc.cluster.local:18000

Result: Successfully completes TLS handshake, server requests client certificate
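
To go a step beyond a bare handshake, the same test can present the proxy's own client certificate. A sketch (the secret name envoy and its key names are assumptions; point the commands at whatever secret actually holds the proxy's xDS client cert):

kubectl get secret -n tigera-gateway envoy -o jsonpath='{.data.tls\.crt}' | base64 -d > client.crt
kubectl get secret -n tigera-gateway envoy -o jsonpath='{.data.tls\.key}' | base64 -d > client.key
kubectl get secret -n tigera-gateway envoy -o jsonpath='{.data.ca\.crt}' | base64 -d > ca.crt
kubectl port-forward -n tigera-gateway svc/envoy-gateway 18000:18000 &
openssl s_client -connect localhost:18000 -cert client.crt -key client.key -CAfile ca.crt

A completed handshake here (no alert from the server) narrows the failure to the gRPC layer above TLS.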

4. Network Policies

kubectl get networkpolicies -A

Result: No policies blocking tigera-gateway namespace traffic

5. Certificate Validity

  • Server cert (envoy-gateway): valid; SANs include DNS:envoy-gateway and DNS:envoy-gateway.tigera-gateway.svc.cluster.local (see the check below)
  • Client cert (envoy): valid, signed by the same CA
  • CA certificates match between both secrets
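
The SAN claims can be double-checked straight from the secret (the secret name envoy-gateway is an assumption; point it at the server-cert secret):

kubectl get secret -n tigera-gateway envoy-gateway -o jsonpath='{.data.tls\.crt}' \
  | base64 -d | openssl x509 -noout -text | grep -A1 'Subject Alternative Name'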

✅ Network Topology Testing

Test 1: Cross-node communication (default)

  • envoy-gateway controller on worker02
  • envoy proxy on worker01
  • Result: xDS connection fails

Test 2: Same-node communication

  • Forced both pods to worker02 using nodeSelector
  • Result: xDS connection still fails ← This rules out IPIP/cross-node issues

Test 3: Cross-node with direct pod IP

  • Tested direct connection from worker01 to worker02 pod IP
  • Result: TCP and TLS work, but xDS stream still fails

🔍 Critical Pattern Observed

The Envoy proxy logs show a consistent pattern:

  1. Initial configuration received (HeaderMatcher deprecation warnings prove this)
  2. Timeouts at 15 seconds for Cluster and Listener fetch
  3. Immediate disconnection with "no healthy upstream"
  4. Never reconnects successfully

This pattern suggests:

  • Initial mTLS connection succeeds
  • Partial xDS configuration is sent
  • Connection closes before complete configuration transfer
  • Reconnection attempts all fail immediately
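
One way to exercise the gRPC layer directly, beyond raw TCP/TLS, is grpcurl. A sketch, reusing the cert files and port-forward from the mTLS test above (assumes grpcurl is installed locally):

grpcurl -cacert ca.crt -cert client.crt -key client.key \
  -servername envoy-gateway localhost:18000 list
# Even a "server does not support the reflection API" reply would prove
# that an authenticated gRPC stream can be established end to end.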

❌ envoy-gateway Controller Never Sees Connections

Despite the envoy proxy attempting connections, the envoy-gateway controller logs no incoming connection attempts. With debug logging enabled, we see:

  • xDS server starts successfully
  • Configuration updates are generated
  • No client stream connections logged
  • No TLS handshake logs
  • No authentication/authorization logs

This suggests the connection is failing at a lower level (possibly during or immediately after the TLS handshake).
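
To see whether the proxy's connection attempts reach the controller at all, a packet capture from an ephemeral debug container sharing the controller pod's network namespace can help (sketch; substitute the actual controller pod name):

kubectl debug -n tigera-gateway <envoy-gateway-pod> -it --image=nicolaka/netshoot -- \
  tcpdump -i any -n port 18000
# SYNs with no payload exchange, or an immediate FIN/RST right after the
# TLS handshake, would pinpoint where the stream is being torn down.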

Configuration Details

Envoy Bootstrap xDS Cluster

clusters:
  - connect_timeout: 10s
    load_assignment:
      cluster_name: xds_cluster
      endpoints:
      - load_balancing_weight: 1
        lb_endpoints:
        - load_balancing_weight: 1
          endpoint:
            address:
              socket_address:
                address: envoy-gateway.tigera-gateway.svc.cluster.local
                port_value: 18000
    typed_extension_protocol_options:
      envoy.extensions.upstreams.http.v3.HttpProtocolOptions:
        "@type": "type.googleapis.com/envoy.extensions.upstreams.http.v3.HttpProtocolOptions"
        explicit_http_config:
          http2_protocol_options:
            connection_keepalive:
              interval: 30s
              timeout: 5s
    name: xds_cluster
    type: STRICT_DNS
    transport_socket:
      name: envoy.transport_sockets.tls
      typed_config:
        "@type": type.googleapis.com/envoy.extensions.transport_sockets.tls.v3.UpstreamTlsContext
        common_tls_context:
          tls_params:
            tls_maximum_protocol_version: TLSv1_3
          tls_certificate_sds_secret_configs:
          - name: xds_certificate
            sds_config:
              path_config_source:
                path: /sds/xds-certificate.json
              resource_api_version: V3
          validation_context_sds_secret_config:
            name: xds_trusted_ca
            sds_config:
              path_config_source:
                path: /sds/xds-trusted-ca.json
              resource_api_version: V3

SDS Validation Config

{
  "resources": [{
    "@type": "type.googleapis.com/envoy.extensions.transport_sockets.tls.v3.Secret",
    "name": "xds_trusted_ca",
    "validation_context": {
      "trusted_ca": {
        "filename": "/certs/ca.crt"
      },
      "match_typed_subject_alt_names": [{
        "san_type": "DNS",
        "matcher": {
          "exact": "envoy-gateway"
        }
      }]
    }
  }]
}
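
It is also worth confirming that the SDS files and certificates the bootstrap references actually exist inside the proxy container, using the $ENVOY_POD variable from the config_dump sketch above (this assumes the proxy image ships a shell; otherwise use kubectl debug):

kubectl exec -n tigera-gateway "$ENVOY_POD" -c envoy -- cat /sds/xds-trusted-ca.json
kubectl exec -n tigera-gateway "$ENVOY_POD" -c envoy -- ls -l /certs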

Workarounds Attempted

  • ❌ Restarting pods
  • ❌ Recreating Gateway resources
  • ❌ Using TLS Passthrough instead of Terminate
  • ❌ Using TLSRoute instead of HTTPRoute
  • ❌ Co-locating pods on same node
  • ❌ Upgrading from v3.30.4 to v3.31.0
  • ✅ Only working solution: Using alternative ingress controller (Traefik/NGINX)

Impact

Severity: Critical

Tigera Gateway API is completely non-functional in Calico v3.31.0, preventing:

  • Adoption of Kubernetes Gateway API standard
  • Migration from legacy Ingress resources
  • Using Calico-integrated gateway features
  • Production deployments requiring Gateway API

Users must fall back to alternative ingress controllers (nginx-ingress, Traefik) as a workaround.

Additional Context

This issue appears similar to envoyproxy/gateway#2813, which involved hostNetwork: true and DNS resolution issues. However, our issue occurs with:

  • Standard pod networking (no hostNetwork)
  • Confirmed working DNS resolution
  • Confirmed working TCP/TLS connectivity
  • Same behavior on same node (no network traversal)
