Description
When restarting the coroot-node-agent DaemonSet, pods on nodes running php-fpm under heavy traffic take 10+ minutes to terminate. Pods on nodes without php-fpm workloads terminate normally, even though those nodes run other high-traffic web applications (Go apps, Java Spring, etc.).
We are running coroot-node-agent as a DaemonSet on AWS EKS, version v1.35.0-eks-3a10415.
Initial Findings
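The agent receives SIGTERM and begins cleanup, but we suspect an internal SIGALRM watchdog fires before cleanup completes, causing the process to exit immediately without detaching its eBPF programs. We observed the agent process in zombie state with SigPnd: 0000000000004000. Caveat: in /proc/<pid>/status masks, bit n-1 represents signal n, so 0x4000 decodes to signal 15 (SIGTERM); SIGALRM (signal 14) would appear as 0x2000. The pending mask therefore does not by itself confirm the watchdog theory.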
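Since the reading of SigPnd matters for the watchdog theory, here is a minimal sketch for decoding such a mask. It assumes the usual Linux convention that the least significant bit stands for signal 1; the helper name `decode_sigmask` is made up for illustration. Under that convention 0x4000 maps to signal 15 (SIGTERM), and SIGALRM would show up as 0x2000.

```python
import signal


def decode_sigmask(mask_hex: str) -> list[tuple[int, str]]:
    """Decode a hex signal mask from /proc/<pid>/status (SigPnd, SigBlk, ...).

    Assumes bit n-1 of the mask represents signal number n.
    """
    mask = int(mask_hex, 16)
    decoded = []
    for bit in range(mask.bit_length()):
        if mask & (1 << bit):
            signum = bit + 1  # bit 0 is signal 1
            try:
                name = signal.Signals(signum).name
            except ValueError:
                # Real-time or otherwise unnamed signal on this platform
                name = f"SIG{signum}"
            decoded.append((signum, name))
    return decoded


print(decode_sigmask("0000000000004000"))  # [(15, 'SIGTERM')]
print(decode_sigmask("0000000000002000"))  # [(14, 'SIGALRM')]
```

Running the same helper on the SigBlk, SigIgn and SigCgt lines of the same status file would show which signals the agent blocks, ignores or handles, which may help narrow down how it actually exited.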
Our suspicion is that cleanup is slow specifically on nodes running php-fpm pods because the agent attaches OpenSSL uprobes (openssl_SSL_read, openssl_SSL_write and variants) per php-fpm worker process. We believe the kernel cannot release these uprobe attachments until all in-flight events drain from each worker's perf ring buffer, and that php-fpm workers holding persistent TLS connections keep those ring buffers alive.
Disabling L7 tracing via the --disable-l7-tracing flag makes the agent terminate immediately, which points to the OpenSSL uprobes as the most likely cause of the delay.
Evidence
- Agent process in zombie state with a signal pending immediately after SIGTERM:
Name: coroot-node-age
State: Z (zombie)
Pid: 7082
PPid: 6426 ← containerd-shim
SigPnd: 0000000000004000 ← 1 << 14, i.e. signal 15 (SIGTERM); SIGALRM (14) would be 0x2000
We suspect the agent exited via an internal SIGALRM watchdog before completing eBPF cleanup, but this mask alone does not show it.
- All 46 eBPF programs orphaned (pids: null) after agent exit:
bpftool prog list --json | jq '.[] | select(.type == "kprobe" or .type == "tracepoint") | {id, name, pids}'
All 46 entries returned "pids": null — no owning process, fully orphaned.
- OpenSSL uprobes identified as the likely blocker:
openssl_SSL_write_enter
openssl_SSL_write_enter_v1_1_1
openssl_SSL_write_enter_v3_0
openssl_SSL_read_enter
openssl_SSL_read_enter_v1_1_1
openssl_SSL_read_enter_v3_0
openssl_SSL_read_ex_enter
openssl_SSL_read_ex_enter_v3_0
openssl_SSL_read_ex_enter_v1_1_1
openssl_SSL_read_exit
These are per-process uprobes. We suspect that, with many php-fpm workers each having their own uprobe attachment, the kernel holds all of them until every worker's perf ring buffer drains.
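The orphan check from the bpftool step can also be scripted, which makes it easier to compare counts across nodes. The sketch below assumes the JSON shape used by the jq filter above (an array of objects with `id`, `type`, `name` and an optional `pids` list); the function name `orphaned_programs` and the sample data are hypothetical.

```python
import json


def orphaned_programs(bpftool_json: str,
                      types: tuple[str, ...] = ("kprobe", "tracepoint")) -> list[dict]:
    """Return kprobe/tracepoint programs that report no owning process.

    A program counts as orphaned here when its "pids" field is missing,
    null or an empty list, mirroring the `pids: null` check done with jq.
    """
    programs = json.loads(bpftool_json)
    return [
        {"id": prog.get("id"), "name": prog.get("name")}
        for prog in programs
        if prog.get("type") in types and not prog.get("pids")
    ]


# Hypothetical sample standing in for `bpftool prog list --json` output.
sample = json.dumps([
    {"id": 101, "type": "kprobe", "name": "openssl_SSL_write_enter"},
    {"id": 102, "type": "kprobe", "name": "openssl_SSL_read_exit", "pids": None},
    {"id": 103, "type": "tracepoint", "name": "still_owned",
     "pids": [{"pid": 7082, "comm": "coroot-node-age"}]},
    {"id": 104, "type": "cgroup_skb", "name": "unrelated"},
])

orphans = orphaned_programs(sample)
print(len(orphans), [p["name"] for p in orphans])
# 2 ['openssl_SSL_write_enter', 'openssl_SSL_read_exit']
```

On a real node the input would come from saving the output of `bpftool prog list --json` to a file and passing its contents to the function.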
Reproduction Steps
- Deploy coroot-node-agent as a DaemonSet on a Kubernetes cluster
- Ensure at least one node is running a php-fpm web application under high traffic (not tested, but a dummy web service that returns immediately might not reproduce the issue)
- Trigger a rollout:
kubectl rollout restart daemonset/coroot-node-agent
- Observe pods on php-fpm nodes stuck in Terminating state for much longer than pods on the other nodes
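To put numbers on the last step, the following sketch reports how long each pod has been past its `deletionTimestamp`. It works on the parsed output of `kubectl get pods -o json`; the function name, the pod names and the timestamps in the sample are made up. As far as we understand, Kubernetes sets `deletionTimestamp` to the deletion request time plus the grace period, so a large positive value means the pod has outlived its grace period.

```python
from datetime import datetime, timezone


def seconds_past_deletion(pod_list: dict, now: datetime) -> dict[str, float]:
    """Map pod name -> seconds elapsed since metadata.deletionTimestamp.

    Pods without a deletionTimestamp (not being deleted) are skipped.
    """
    elapsed = {}
    for pod in pod_list.get("items", []):
        meta = pod.get("metadata", {})
        stamp = meta.get("deletionTimestamp")
        if stamp is None:
            continue
        # Kubernetes serializes timestamps as RFC 3339 in UTC, e.g. 2024-01-01T12:00:00Z
        deleted_at = datetime.strptime(stamp, "%Y-%m-%dT%H:%M:%SZ").replace(tzinfo=timezone.utc)
        elapsed[meta.get("name", "<unnamed>")] = (now - deleted_at).total_seconds()
    return elapsed


# Hypothetical sample standing in for `kubectl get pods -o json` output.
sample_pods = {
    "items": [
        {"metadata": {"name": "coroot-node-agent-aaaaa",
                      "deletionTimestamp": "2024-01-01T12:00:00Z"}},
        {"metadata": {"name": "coroot-node-agent-bbbbb"}},
    ]
}

now = datetime(2024, 1, 1, 12, 10, 30, tzinfo=timezone.utc)
print(seconds_past_deletion(sample_pods, now))
# {'coroot-node-agent-aaaaa': 630.0}
```

Comparing these values between php-fpm nodes and the other nodes during a rollout would show whether the delay is consistently tied to the php-fpm workloads.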