Warning
I don't have deep knowledge of the Gateway API yet, but I wanted to test it for our new deployment. This troubleshooting was therefore put together with substantial help from Claude Code.
Environment
- Calico Version: v3.31.0
- Kubernetes Version: v1.34.1
- Deployment Type: Bare metal (3 nodes: 1 master, 2 workers)
- CNI Configuration: Calico with IPIP mode (ipipMode: Always)
- LoadBalancer: MetalLB v0.14.8 (Layer 2 mode)
- Operating System: Ubuntu 24.04.3 LTS (Noble)
- Installation Method: Tigera Operator
Description
The Envoy proxy pod in Tigera Gateway API consistently fails to maintain an xDS gRPC connection with the envoy-gateway controller. As a result, the Envoy proxy never receives complete listener and cluster configurations, which makes the Gateway API unusable.
The issue is 100% reproducible and occurs regardless of:
- Pod placement (same node or different nodes)
- Network topology (with or without IPIP encapsulation)
- Gateway listener mode (TLS Passthrough or TLS Terminate)
- Protocol (HTTPRoute or TLSRoute)
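A quick way to observe the broken state from the proxy side (a sketch, assuming the Envoy admin interface listens on its default port 19000; the pod name is a placeholder):

# Forward the Envoy proxy's admin port (default 19000; adjust the pod name).
kubectl port-forward -n tigera-gateway <envoy-proxy-pod> 19000:19000 &

# control_plane.connected_state stays 0 while there is no live xDS stream.
curl -s http://localhost:19000/stats | grep control_plane.connected_state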
Reproduction Steps
1. Install Calico v3.31.0 via Tigera Operator
We used the Helm chart:
helm install calico projectcalico/tigera-operator --version v3.31.0 --create-namespace --namespace tigera-operator
2. Enable Gateway API
apiVersion: operator.tigera.io/v1
kind: GatewayAPI
metadata:
  name: default
spec:
  crdManagement: PreferExisting
  gatewayClasses:
  - name: tigera-gateway-class
3. Create namespace and certificate
kubectl create namespace gateway-infra
kubectl create secret tls gateway-tls-secret -n gateway-infra \
  --cert=<cert-file> --key=<key-file>
4. Deploy a Gateway
apiVersion: gateway.networking.k8s.io/v1
kind: Gateway
metadata:
  name: shared-gateway
  namespace: gateway-infra
spec:
  gatewayClassName: tigera-gateway-class
  listeners:
  - name: https
    protocol: HTTPS
    port: 443
    tls:
      mode: Terminate
      certificateRefs:
      - name: gateway-tls-secret
    allowedRoutes:
      namespaces:
        from: All
5. Create an HTTPRoute
apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: test-route
  namespace: default
spec:
  parentRefs:
  - name: shared-gateway
    namespace: gateway-infra
  rules:
  - backendRefs:
    - name: kubernetes
      port: 443
      kind: Service
6. Check Gateway status and envoy proxy logs
kubectl get gateway shared-gateway -n gateway-infra -o yaml
kubectl logs -n tigera-gateway -l gateway.envoyproxy.io/owning-gateway-name=shared-gateway -c envoy
Observed Behavior
Gateway Status
status:
  conditions:
  - status: "True"
    type: Programmed
    message: "Address assigned to the Gateway, 1/1 envoy replicas available"
  listeners:
  - conditions:
    - status: "True"
      type: Programmed
      message: "Sending translated listener configuration to the data plane"
Envoy Proxy Logs
[2025-11-10 19:31:30.053][1][warning][misc] [source/common/protobuf/message_validator_impl.cc:23] Deprecated field: type envoy.config.route.v3.HeaderMatcher Using deprecated option 'envoy.config.route.v3.HeaderMatcher.exact_match' from file route_components.proto...
[2025-11-10 19:31:45.065][1][warning][config] [source/extensions/config_subscription/grpc/grpc_subscription_impl.cc:130] gRPC config: initial fetch timed out for type.googleapis.com/envoy.config.cluster.v3.Cluster
[2025-11-10 19:32:00.067][1][warning][config] [source/extensions/config_subscription/grpc/grpc_subscription_impl.cc:130] gRPC config: initial fetch timed out for type.googleapis.com/envoy.config.listener.v3.Listener
[2025-11-10 19:32:04.340][1][warning][config] [./source/extensions/config_subscription/grpc/grpc_stream.h:226] DeltaAggregatedResources gRPC config stream to xds_cluster closed since 34s ago: 14, no healthy upstream
[2025-11-10 19:32:06.999][1][warning][config] [./source/extensions/config_subscription/grpc/grpc_stream.h:226] DeltaAggregatedResources gRPC config stream to xds_cluster closed since 36s ago: 14, no healthy upstream
This pattern repeats indefinitely.
Envoy-gateway Controller Logs (with debug logging enabled)
2025-11-10T18:42:49.249Z INFO xds-server runner/runner.go:89 loaded TLS certificate and key {"runner": "xds-server"}
2025-11-10T18:42:49.249Z INFO xds-server runner/runner.go:104 started {"runner": "xds-server"}
2025-11-10T18:42:49.579Z INFO xds-server runner/runner.go:151 received an update {"runner": "xds-server"}
Critical observation: No client connection attempts are ever logged by the envoy-gateway controller, despite the envoy proxy attempting connections.
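As an additional sanity check that the Service is actually wired to the controller (the label below is the standard kubernetes.io/service-name EndpointSlice selector):

# Confirm the envoy-gateway Service has healthy endpoints on port 18000,
# i.e. that the ClusterIP actually points at the controller pod.
kubectl get endpointslices -n tigera-gateway -l kubernetes.io/service-name=envoy-gateway
kubectl describe service envoy-gateway -n tigera-gateway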
Expected Behavior
- Envoy proxy establishes xDS gRPC connection to envoy-gateway.tigera-gateway.svc.cluster.local:18000
- Envoy proxy receives complete Cluster and Listener configurations
- Envoy proxy maintains stable xDS connection
- Gateway becomes functional and routes traffic
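A minimal sketch of how this expected end state can be verified, again assuming the default Envoy admin port 19000 and a placeholder pod name:

# In a healthy deployment, the xDS-delivered resources show up in the dynamic
# sections of the config dump; in the failing state above they remain empty.
kubectl port-forward -n tigera-gateway <envoy-proxy-pod> 19000:19000 &
curl -s http://localhost:19000/config_dump | grep -c dynamic_listeners
curl -s http://localhost:19000/config_dump | grep -c dynamic_active_clusters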
We performed extensive troubleshooting to isolate the issue:
✅ Infrastructure Validation (All Working)
1. DNS Resolution
kubectl run dns-test --image=nicolaka/netshoot --rm -i --restart=Never -n tigera-gateway -- \
  nslookup envoy-gateway.tigera-gateway.svc.cluster.local
Result: Resolves correctly to the service ClusterIP
2. TCP Connectivity
kubectl run tcp-test --image=nicolaka/netshoot --rm -i --restart=Never -n tigera-gateway -- \
  nc -zv envoy-gateway.tigera-gateway.svc.cluster.local 18000
Result: Connection succeeds
3. TLS/mTLS Handshake
kubectl run tls-test --image=nicolaka/netshoot --rm -i --restart=Never -n tigera-gateway -- \
  openssl s_client -connect envoy-gateway.tigera-gateway.svc.cluster.local:18000
Result: Successfully completes TLS handshake, server requests client certificate
4. Network Policies
kubectl get networkpolicies -A
Result: No policies blocking tigera-gateway namespace traffic
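Note that kubectl get networkpolicies -A only covers the Kubernetes NetworkPolicy API; Calico's own policy CRDs can also block traffic and are worth checking separately:

# Calico-native policies live in their own CRDs and are not listed by
# `kubectl get networkpolicies`.
kubectl get globalnetworkpolicies.crd.projectcalico.org
kubectl get networkpolicies.crd.projectcalico.org -A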
5. Certificate Validity
- Server cert (envoy-gateway): Valid, SANs include DNS:envoy-gateway, DNS:envoy-gateway.tigera-gateway.svc.cluster.local
- Client cert (envoy): Valid, signed by same CA
- CA certificates match between both secrets
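A sketch of the commands behind these checks, assuming the usual Envoy Gateway certificate secrets (envoy-gateway and envoy in the tigera-gateway namespace; the names may differ in the Tigera packaging):

# Inspect the server certificate's SANs (secret name assumed).
kubectl get secret envoy-gateway -n tigera-gateway -o jsonpath='{.data.tls\.crt}' \
  | base64 -d | openssl x509 -noout -text | grep -A1 'Subject Alternative Name'

# Verify the client certificate chains to the shared CA (secret names assumed).
kubectl get secret envoy-gateway -n tigera-gateway -o jsonpath='{.data.ca\.crt}' \
  | base64 -d > /tmp/xds-ca.crt
kubectl get secret envoy -n tigera-gateway -o jsonpath='{.data.tls\.crt}' \
  | base64 -d | openssl verify -CAfile /tmp/xds-ca.crt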
✅ Network Topology Testing
Test 1: Cross-node communication (default)
- envoy-gateway controller on worker02
- envoy proxy on worker01
- Result: xDS connection fails
Test 2: Same-node communication
- Forced both pods to worker02 using nodeSelector
- Result: xDS connection still fails ← This rules out IPIP/cross-node issues
Test 3: Cross-node with direct pod IP
- Tested direct connection from worker01 to worker02 pod IP
- Result: TCP and TLS work, but xDS stream still fails
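For reference, the direct pod-IP test in Test 3 can be run with something like the following (the pod IP is a placeholder taken from the first command's output):

# Find the controller pod's IP on worker02.
kubectl get pods -n tigera-gateway -o wide

# Connect straight to the pod IP from worker01, bypassing the Service VIP.
kubectl run pod-ip-test --image=nicolaka/netshoot --rm -i --restart=Never \
  --overrides='{"spec":{"nodeName":"worker01"}}' \
  -- nc -zv <controller-pod-ip> 18000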
🔍 Critical Pattern Observed
The Envoy proxy logs show a consistent pattern:
- Initial configuration received (HeaderMatcher deprecation warnings prove this)
- Timeouts at 15 seconds for Cluster and Listener fetch
- Immediate disconnection with "no healthy upstream"
- Never reconnects successfully
This pattern suggests:
- Initial mTLS connection succeeds
- Partial xDS configuration is sent
- Connection closes before complete configuration transfer
- Reconnection attempts all fail immediately
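To probe one layer deeper than the plain TLS handshake, the HTTP/2 ALPN negotiation can be tested while presenting the proxy's client certificate (a sketch; the /certs paths are assumptions based on the bootstrap shown below):

# From a debug pod with the xDS client cert, key, and CA available:
openssl s_client -connect envoy-gateway.tigera-gateway.svc.cluster.local:18000 \
  -alpn h2 -cert /certs/tls.crt -key /certs/tls.key -CAfile /certs/ca.crt
# A healthy gRPC endpoint should print "ALPN protocol: h2"; a failure here
# would implicate the TLS/HTTP2 layer rather than xDS itself.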
❌ envoy-gateway Controller Never Sees Connections
Despite the envoy proxy attempting connections, the envoy-gateway controller logs no incoming connection attempts. With debug logging enabled, we see:
- xDS server starts successfully
- Configuration updates are generated
- No client stream connections logged
- No TLS handshake logs
- No authentication/authorization logs
This suggests the connection is failing at a lower level (possibly during or immediately after TLS handshake).
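One way to confirm where the stream dies would be a packet capture on the controller's node (a sketch using an ephemeral host-network debug pod):

# Capture xDS traffic on the node running the controller; RSTs arriving
# right after the TLS handshake would confirm a low-level failure.
kubectl debug node/worker02 -it --image=nicolaka/netshoot -- \
  tcpdump -ni any tcp port 18000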
Configuration Details
Envoy Bootstrap xDS Cluster
clusters:
- connect_timeout: 10s
  load_assignment:
    cluster_name: xds_cluster
    endpoints:
    - load_balancing_weight: 1
      lb_endpoints:
      - load_balancing_weight: 1
        endpoint:
          address:
            socket_address:
              address: envoy-gateway.tigera-gateway.svc.cluster.local
              port_value: 18000
  typed_extension_protocol_options:
    envoy.extensions.upstreams.http.v3.HttpProtocolOptions:
      "@type": "type.googleapis.com/envoy.extensions.upstreams.http.v3.HttpProtocolOptions"
      explicit_http_config:
        http2_protocol_options:
          connection_keepalive:
            interval: 30s
            timeout: 5s
  name: xds_cluster
  type: STRICT_DNS
  transport_socket:
    name: envoy.transport_sockets.tls
    typed_config:
      "@type": type.googleapis.com/envoy.extensions.transport_sockets.tls.v3.UpstreamTlsContext
      common_tls_context:
        tls_params:
          tls_maximum_protocol_version: TLSv1_3
        tls_certificate_sds_secret_configs:
        - name: xds_certificate
          sds_config:
            path_config_source:
              path: /sds/xds-certificate.json
            resource_api_version: V3
        validation_context_sds_secret_config:
          name: xds_trusted_ca
          sds_config:
            path_config_source:
              path: /sds/xds-trusted-ca.json
            resource_api_version: V3
SDS Validation Config
{
  "resources": [{
    "@type": "type.googleapis.com/envoy.extensions.transport_sockets.tls.v3.Secret",
    "name": "xds_trusted_ca",
    "validation_context": {
      "trusted_ca": {
        "filename": "/certs/ca.crt"
      },
      "match_typed_subject_alt_names": [{
        "san_type": "DNS",
        "matcher": {
          "exact": "envoy-gateway"
        }
      }]
    }
  }]
}
Workarounds Attempted
- ❌ Restarting pods
- ❌ Recreating Gateway resources
- ❌ Using TLS Passthrough instead of Terminate
- ❌ Using TLSRoute instead of HTTPRoute
- ❌ Co-locating pods on same node
- ❌ Upgrading from v3.30.4 to v3.31.0
- ✅ Only working solution: Using alternative ingress controller (Traefik/NGINX)
Impact
Severity: Critical
Tigera Gateway API is completely non-functional in our Calico v3.31.0 environment, preventing:
- Adoption of Kubernetes Gateway API standard
- Migration from legacy Ingress resources
- Using Calico-integrated gateway features
- Production deployments requiring Gateway API
Users must fall back to alternative ingress controllers (nginx-ingress, Traefik) as a workaround.
Additional Context
This issue appears similar to envoyproxy/gateway#2813, which involved hostNetwork: true and DNS resolution issues. However, our issue occurs with:
- Standard pod networking (no hostNetwork)
- Confirmed working DNS resolution
- Confirmed working TCP/TLS connectivity
- Same behavior on same node (no network traversal)