Description
Is there an existing issue for this?
- I have searched the existing issues
Kong version ($ kong version)
3.8.0
Current Behavior
Hello,
We run Kong (KIC) on GKE clusters: every night the preemptible nodes are reclaimed in our staging environments, and most of the time this takes down all Kong gateway pods (2 replicas) for hours.
Versions:
- GKE control plane & node pools: 1.30.4-gke.1348000
- kong ingress chart: 0.14.1
- controller: 3.3.1
- gateway: 3.8.0
Additional info
- db-less mode
- using the Gateway API and Gateway resources only (e.g. HTTPRoutes)
- no istio sidecars (they have been removed to try to narrow down the issue)
It seems that the liveness probe responds OK while the readiness probe remains unhealthy, so the gateway pods just remain around, unable to process traffic.
Error logs
ERROR 2024/10/05 00:06:53 [emerg] 1#0: bind() to unix:/kong_prefix/sockets/we failed (98: Address already in use)
ERROR nginx: [emerg] bind() to unix:/kong_prefix/sockets/we failed (98: Address already in use)
[repeats over and over, yet the pod is not killed]
The controller fails to talk to the gateways with:
ERROR 2024-10-07T03:37:26.870415241Z [resource.labels.containerName: ingress-controller] Error: could not retrieve Kong admin root(s): making HTTP request: Get "https://10.163.37.7:8444/": dial tcp 10.163.37.7:8444: connect: connection refused
Kong finds itself in some sort of "deadlock" until the pods are deleted manually. Any insights?
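For context, here is the relevant fragment of the failing pod spec, abbreviated from the kubectl describe output under "Anything else?" below (the "we" socket appears to be Kong's worker-events socket). The prefix directory is an emptyDir, so anything under /kong_prefix/sockets/ survives a container restart and is only reset when the whole pod is deleted, which would explain why only a manual pod deletion clears the bind() error; the chart's clear-stale-pid init container removes only $KONG_PREFIX/pids:

# Abbreviated/reconstructed from the kubectl describe output below.
initContainers:
  - name: clear-stale-pid
    image: kong:3.8.0
    command: ["rm", "-vrf", "$KONG_PREFIX/pids"]   # sockets/ is not cleaned up
    volumeMounts:
      - name: kong-green-gateway-prefix-dir
        mountPath: /kong_prefix/
volumes:
  - name: kong-green-gateway-prefix-dir
    emptyDir:             # outlives container restarts; reset only on pod deletion
      sizeLimit: 256Mi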
Below is the values.yaml file configuring Kong:
ingress:
  deployment:
    test:
      enabled: false
  controller:
    enabled: true
    proxy:
      nameOverride: "{{ .Release.Name }}-gateway-proxy"
    postgresql:
      enabled: false
    env:
      database: "off"
    deployment:
      kong:
        enabled: false
    ingressController:
      enabled: true
      image:
        repository: kong/kubernetes-ingress-controller
        tag: "3.3.1"
        pullPolicy: IfNotPresent
      resources:
        requests:
          cpu: 100m
          memory: 256Mi
        limits:
          memory: 1G
      ingressClass: kong-green
      env:
        log_format: json
        log_level: error
        ingress_class: kong-green
        gateway_api_controller_name: konghq.com/kong-green
      gatewayDiscovery:
        enabled: true
        generateAdminApiService: true
    podAnnotations:
      sidecar.istio.io/inject: "false"
  gateway:
    enabled: true
    deployment:
      kong:
        enabled: true
    image:
      repository: kong
      tag: "3.8.0"
      pullPolicy: IfNotPresent
    resources:
      requests:
        cpu: 250m
        memory: 500Mi
      limits:
        memory: 2G
    replicaCount: 6
    podAnnotations:
      sidecar.istio.io/inject: "false"
    proxy:
      enabled: true
      type: ClusterIP
      annotations:
        konghq.com/protocol: "https"
        cloud.google.com/neg: '{"exposed_ports": {"80":{"name": "neg-kong-green"}}}'
      http:
        enabled: true
        servicePort: 80
        containerPort: 8000
        parameters: []
      tls:
        enabled: true
        servicePort: 443
        containerPort: 8443
        parameters:
          - http2
        appProtocol: ""
    ingressController:
      enabled: false
    postgresql:
      enabled: false
    env:
      role: traditional
      database: "off"
      proxy_access_log: "off"
      # proxy_error_log: "off"
      proxy_stream_access_log: "off"
      # proxy_stream_error_log: "off"
      admin_access_log: "off"
      # admin_error_log: "off"
      status_access_log: "off"
      # status_error_log: "off"
      log_level: warn
      headers: "off"
      request_debug: "off"
Expected Behavior
Kong gateway pods should either:
- not fail with the error above (bind() to unix:/kong_prefix/sockets/we failed (98: Address already in use)),
- or at least be able to recover from it by failing the liveness probe (or some other mechanism); a workaround sketch follows.
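In the meantime, a stopgap we are considering (untested sketch, not a fix for the root cause): an extra init container that clears the stale socket directory the same way clear-stale-pid clears the pid files. This assumes the chart exposes a deployment.initContainers hook on the gateway section and that the prefix emptyDir is named as in the pod spec below; the container name is ours:

gateway:
  deployment:
    initContainers:
      - name: clear-stale-sockets               # hypothetical name
        image: kong:3.8.0
        command: ["rm", "-vrf", "/kong_prefix/sockets"]
        volumeMounts:
          - name: kong-green-gateway-prefix-dir  # prefix emptyDir, as named in the pod spec
            mountPath: /kong_prefix/

Even if this works around the crash loop, the gateway should arguably recover on its own, as described above.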
Steps To Reproduce
I could reproduce the error by killing the nodes (kubectl delete nodes) on which the Kong pods were running. After killing the nodes, the gateway pods fail to restart as they enter the deadlock situation described above. See screenshot:
Anything else?
Dump of a failing gateway pod (kubectl describe):
k -n kong-dbless describe po kong-green-gateway-68f467ff98-qztm5
Name:             kong-green-gateway-68f467ff98-qztm5
Namespace:        kong-dbless
Priority:         0
Service Account:  kong-green-gateway
Node:             ---
Start Time:       Mon, 07 Oct 2024 13:49:02 +0200
Labels:           app=kong-green-gateway
                  app.kubernetes.io/component=app
                  app.kubernetes.io/instance=kong-green
                  app.kubernetes.io/managed-by=Helm
                  app.kubernetes.io/name=gateway
                  app.kubernetes.io/version=3.6
                  helm.sh/chart=gateway-2.41.1
                  pod-template-hash=68f467ff98
                  version=3.6
Annotations:      cni.projectcalico.org/containerID: 13864002653403e75b1ddb3ef661b5665f69e3b97c266b5833042f8dc4a4f39b
                  cni.projectcalico.org/podIP: 10.163.33.135/32
                  cni.projectcalico.org/podIPs: 10.163.33.135/32
                  kuma.io/gateway: enabled
                  kuma.io/service-account-token-volume: kong-green-gateway-token
                  sidecar.istio.io/inject: false
                  traffic.sidecar.istio.io/includeInboundPorts:
Status:           Running
IP:               10.163.33.135
IPs:
  IP:  10.163.33.135
Controlled By:  ReplicaSet/kong-green-gateway-68f467ff98
Init Containers:
  clear-stale-pid:
    Container ID:    containerd://ed0b35719cd87e11e849b42f20f1f328b1e2d63612d004b313ba981eda0bd790
    Image:           kong:3.8.0
    Image ID:        docker.io/library/kong@sha256:616b2ab5a4c7b6c14022e8a1495ff34930ced76f25f3d418e76758717fec335f
    Port:            <none>
    Host Port:       <none>
    SeccompProfile:  RuntimeDefault
    Command:
      rm
      -vrf
      $KONG_PREFIX/pids
    State:          Terminated
      Reason:       Completed
      Exit Code:    0
      Started:      Mon, 07 Oct 2024 13:49:20 +0200
      Finished:     Mon, 07 Oct 2024 13:49:21 +0200
    Ready:          True
    Restart Count:  0
    Limits:
      memory:  2G
    Requests:
      cpu:     250m
      memory:  500Mi
    Environment:
      KONG_ADMIN_ACCESS_LOG:         /dev/stdout
      KONG_ADMIN_ERROR_LOG:          /dev/stderr
      KONG_ADMIN_GUI_ACCESS_LOG:     /dev/stdout
      KONG_ADMIN_GUI_ERROR_LOG:      /dev/stderr
      KONG_ADMIN_LISTEN:             0.0.0.0:8444 http2 ssl, [::]:8444 http2 ssl
      KONG_CLUSTER_LISTEN:           off
      KONG_DATABASE:                 off
      KONG_LUA_PACKAGE_PATH:         /opt/?.lua;/opt/?/init.lua;;
      KONG_NGINX_WORKER_PROCESSES:   2
      KONG_PLUGINS:                  ---
      KONG_PORTAL_API_ACCESS_LOG:    /dev/stdout
      KONG_PORTAL_API_ERROR_LOG:     /dev/stderr
      KONG_PORT_MAPS:                80:8000, 443:8443
      KONG_PREFIX:                   /kong_prefix/
      KONG_PROXY_ACCESS_LOG:         /dev/stdout
      KONG_PROXY_ERROR_LOG:          /dev/stderr
      KONG_PROXY_LISTEN:             0.0.0.0:8000, [::]:8000, 0.0.0.0:8443 http2 ssl, [::]:8443 http2 ssl
      KONG_PROXY_STREAM_ACCESS_LOG:  /dev/stdout basic
      KONG_PROXY_STREAM_ERROR_LOG:   /dev/stderr
      KONG_ROLE:                     traditional
      KONG_ROUTER_FLAVOR:            traditional
      KONG_STATUS_ACCESS_LOG:        off
      KONG_STATUS_ERROR_LOG:         /dev/stderr
      KONG_STATUS_LISTEN:            0.0.0.0:8100, [::]:8100
      KONG_STREAM_LISTEN:            off
    Mounts:
      /kong_prefix/ from kong-green-gateway-prefix-dir (rw)
      /opt/kong/plugins/---
      /opt/kong/plugins/---
      /opt/kong/plugins/---
      /opt/kong/plugins/---
      /opt/kong/plugins/---
      /tmp from kong-green-gateway-tmp (rw)
Containers:
  proxy:
    Container ID:    containerd://0ed944478d25423c08c85146ed1528ae668d128f13bddaf6402990701e2ea3a1
    Image:           kong:3.8.0
    Image ID:        docker.io/library/kong@sha256:616b2ab5a4c7b6c14022e8a1495ff34930ced76f25f3d418e76758717fec335f
    Ports:           8444/TCP, 8000/TCP, 8443/TCP, 8100/TCP
    Host Ports:      0/TCP, 0/TCP, 0/TCP, 0/TCP
    SeccompProfile:  RuntimeDefault
    State:           Waiting
      Reason:        CrashLoopBackOff
    Last State:      Terminated
      Reason:        Error
      Exit Code:     1
      Started:       Mon, 07 Oct 2024 13:59:39 +0200
      Finished:      Mon, 07 Oct 2024 13:59:49 +0200
    Ready:           False
    Restart Count:   7
    Limits:
      memory:  2G
    Requests:
      cpu:     250m
      memory:  500Mi
    Liveness:   http-get http://:status/status delay=5s timeout=5s period=10s #success=1 #failure=3
    Readiness:  http-get http://:status/status/ready delay=5s timeout=5s period=10s #success=1 #failure=3
    Environment:
      KONG_ADMIN_ACCESS_LOG:         /dev/stdout
      KONG_ADMIN_ERROR_LOG:          /dev/stderr
      KONG_ADMIN_GUI_ACCESS_LOG:     /dev/stdout
      KONG_ADMIN_GUI_ERROR_LOG:      /dev/stderr
      KONG_ADMIN_LISTEN:             0.0.0.0:8444 http2 ssl, [::]:8444 http2 ssl
      KONG_CLUSTER_LISTEN:           off
      KONG_DATABASE:                 off
      KONG_LUA_PACKAGE_PATH:         /opt/?.lua;/opt/?/init.lua;;
      KONG_NGINX_WORKER_PROCESSES:   2
      KONG_PLUGINS:                  ---
      KONG_PORTAL_API_ACCESS_LOG:    /dev/stdout
      KONG_PORTAL_API_ERROR_LOG:     /dev/stderr
      KONG_PORT_MAPS:                80:8000, 443:8443
      KONG_PREFIX:                   /kong_prefix/
      KONG_PROXY_ACCESS_LOG:         /dev/stdout
      KONG_PROXY_ERROR_LOG:          /dev/stderr
      KONG_PROXY_LISTEN:             0.0.0.0:8000, [::]:8000, 0.0.0.0:8443 http2 ssl, [::]:8443 http2 ssl
      KONG_PROXY_STREAM_ACCESS_LOG:  /dev/stdout basic
      KONG_PROXY_STREAM_ERROR_LOG:   /dev/stderr
      KONG_ROLE:                     traditional
      KONG_ROUTER_FLAVOR:            traditional
      KONG_STATUS_ACCESS_LOG:        off
      KONG_STATUS_ERROR_LOG:         /dev/stderr
      KONG_STATUS_LISTEN:            0.0.0.0:8100, [::]:8100
      KONG_STREAM_LISTEN:            off
      KONG_NGINX_DAEMON:             off
    Mounts:
      /kong_prefix/ from kong-green-gateway-prefix-dir (rw)
      /opt/kong/plugins/---
      /opt/kong/plugins/---
      /opt/kong/plugins/---
      /opt/kong/plugins/---
      /opt/kong/plugins/---
      /tmp from kong-green-gateway-tmp (rw)
Readiness Gates:
  Type                                       Status
  cloud.google.com/load-balancer-neg-ready   True
Conditions:
  Type                                       Status
  cloud.google.com/load-balancer-neg-ready   True
  PodReadyToStartContainers                  True
  Initialized                                True
  Ready                                      False
  ContainersReady                            False
  PodScheduled                               True
Volumes:
  kong-green-gateway-prefix-dir:
    Type:       EmptyDir (a temporary directory that shares a pod's lifetime)
    Medium:
    SizeLimit:  256Mi
  kong-green-gateway-tmp:
    Type:       EmptyDir (a temporary directory that shares a pod's lifetime)
    Medium:
    SizeLimit:  1Gi
  kong-green-gateway-token:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  3607
    ConfigMapName:           kube-root-ca.crt
    ConfigMapOptional:       <nil>
    DownwardAPI:             true
QoS Class:       Burstable
Node-Selectors:  <none>
Tolerations:     node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                 node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
  Type     Reason             Age                     From     Message
  ----     ------             ----                    ----     -------
  Normal   Killing            11m                     kubelet  Container proxy failed liveness probe, will be restarted
  Warning  FailedPreStopHook  10m                     kubelet  PreStopHook failed
  Normal   Pulled             10m (x2 over 11m)       kubelet  Container image "kong:3.8.0" already present on machine
  Normal   Created            10m (x2 over 11m)       kubelet  Created container proxy
  Normal   Started            10m (x2 over 11m)       kubelet  Started container proxy
  Warning  Unhealthy          10m (x4 over 11m)       kubelet  Liveness probe failed: Get "http://10.163.33.135:8100/status": dial tcp 10.163.33.135:8100: connect: connection refused
  Warning  Unhealthy          10m (x9 over 11m)       kubelet  Readiness probe failed: Get "http://10.163.33.135:8100/status/ready": dial tcp 10.163.33.135:8100: connect: connection refused
  Warning  BackOff            114s (x26 over 7m19s)   kubelet  Back-off restarting failed container proxy in pod kong-green-gateway-68f467ff98-qztm5_kong-dbless(ab152a94-7ef0-4de0-b84c-1eb419327b88)
and the pod logs:
2024/10/07 12:05:10 [warn] 1#0: the "user" directive makes sense only if the master process runs with super-user privileges, ignored in /kong_prefix/nginx.conf:7
nginx: [warn] the "user" directive makes sense only if the master process runs with super-user privileges, ignored in /kong_prefix/nginx.conf:7
2024/10/07 12:05:14 [notice] 1#0: [lua] init.lua:791: init(): [request-debug] token for request debugging: ccbb05a0-6e76-4cb7-9e5d-346690a3c69f
2024/10/07 12:05:14 [emerg] 1#0: bind() to unix:/kong_prefix/sockets/we failed (98: Address already in use)
nginx: [emerg] bind() to unix:/kong_prefix/sockets/we failed (98: Address already in use)
2024/10/07 12:05:14 [notice] 1#0: try again to bind() after 500ms
2024/10/07 12:05:14 [emerg] 1#0: bind() to unix:/kong_prefix/sockets/we failed (98: Address already in use)
nginx: [emerg] bind() to unix:/kong_prefix/sockets/we failed (98: Address already in use)
2024/10/07 12:05:14 [notice] 1#0: try again to bind() after 500ms
2024/10/07 12:05:14 [emerg] 1#0: bind() to unix:/kong_prefix/sockets/we failed (98: Address already in use)
nginx: [emerg] bind() to unix:/kong_prefix/sockets/we failed (98: Address already in use)
2024/10/07 12:05:14 [notice] 1#0: try again to bind() after 500ms
2024/10/07 12:05:14 [emerg] 1#0: bind() to unix:/kong_prefix/sockets/we failed (98: Address already in use)
nginx: [emerg] bind() to unix:/kong_prefix/sockets/we failed (98: Address already in use)
2024/10/07 12:05:14 [notice] 1#0: try again to bind() after 500ms
2024/10/07 12:05:14 [emerg] 1#0: bind() to unix:/kong_prefix/sockets/we failed (98: Address already in use)
nginx: [emerg] bind() to unix:/kong_prefix/sockets/we failed (98: Address already in use)
2024/10/07 12:05:14 [notice] 1#0: try again to bind() after 500ms
2024/10/07 12:05:14 [emerg] 1#0: still could not bind()
nginx: [emerg] still could not bind()