bug: coroot-node-agent exits uncleanly on SIGTERM leaving orphaned eBPF programs on nodes with active php-fpm workers #290

@serhatcetinkaya

Description

When restarting the coroot-node-agent DaemonSet, pods on nodes running php-fpm under heavy traffic take 10+ minutes to terminate. Pods on nodes without php-fpm workloads terminate normally, even when those nodes run other high-traffic web applications (Go apps, Java Spring, etc.).

We are running coroot-node-agent as a DaemonSet on AWS EKS, version v1.35.0-eks-3a10415.

Initial Findings

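The agent receives SIGTERM and begins cleanup, but the process exits before cleanup completes, without detaching its eBPF programs. Our hypothesis is that an internal SIGALRM watchdog fires first. We observed the agent process in zombie state with SigPnd: 0000000000004000. Note that in this mask bit n-1 corresponds to signal n, so 0x4000 decodes to signal 15 (SIGTERM), not signal 14 (SIGALRM, which would be 0x2000); this output therefore does not by itself confirm the watchdog.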

We suspect cleanup is slow specifically on nodes running php-fpm pods because the agent attaches OpenSSL uprobes (openssl_SSL_read, openssl_SSL_write and variants) per php-fpm worker process. We believe the kernel cannot release these uprobe attachments until all in-flight events drain from each worker's perf ring buffer, and php-fpm workers hold persistent TLS connections which keep the ring buffers alive.

Disabling L7 tracing via the --disable-l7-tracing flag lets the agent terminate immediately, which makes the OpenSSL uprobes the most likely cause of the delay.

Evidence

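  1. Agent process in zombie state with a signal pending immediately after SIGTERM:
Name:   coroot-node-age
State:  Z (zombie)
Pid:    7082
PPid:   6426  ← containerd-shim
SigPnd: 0000000000004000  ← bit 14 set = signal 15 (SIGTERM); SIGALRM (14) would be 0000000000002000

The agent exited before completing eBPF cleanup, possibly via an internal SIGALRM watchdog(?). The mask above decodes to SIGTERM pending, so it does not by itself confirm the watchdog.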
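
To make the mask reading reproducible, here is a minimal sketch of decoding a /proc/<pid>/status signal mask, assuming the usual Linux convention that bit n-1 (counting from the least significant bit) stands for signal n. The helper names `decode_sigmask` and `signal_names` are ours, not part of any tool. Under that convention the value from this report decodes to signal 15 (SIGTERM), while SIGALRM (14) would appear as 0x2000.

```python
import signal


def decode_sigmask(mask_hex: str) -> list[int]:
    """Return the signal numbers set in a /proc/<pid>/status mask such as SigPnd."""
    mask = int(mask_hex, 16)
    # Assumption: bit n-1 (from the least significant bit) represents signal n.
    return [bit + 1 for bit in range(mask.bit_length()) if (mask >> bit) & 1]


def signal_names(mask_hex: str) -> list[str]:
    """Map each set bit to a symbolic name where the platform knows one."""
    names = []
    for signo in decode_sigmask(mask_hex):
        try:
            names.append(signal.Signals(signo).name)
        except ValueError:
            # Real-time or otherwise unnamed signals fall back to a number.
            names.append(f"signal {signo}")
    return names


if __name__ == "__main__":
    # Value copied from the SigPnd line in this report.
    print(signal_names("0000000000004000"))
```

The same helper works for the SigBlk, SigIgn and SigCgt lines, which could help check whether the agent installs a handler for SIGALRM at all.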

  2. All 46 eBPF programs orphaned (pids: null) after agent exit:
bpftool prog list --json | jq '.[] | select(.type == "kprobe" or .type == "tracepoint") | {id, name, pids}'

All 46 entries returned "pids": null, meaning no owning process was reported, so we consider them fully orphaned.
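
For anyone who wants to repeat this check without jq, below is a rough Python equivalent of the filter. It is a sketch only: the function name `programs_without_pids` and the sample data are made up for illustration, and the sample merely imitates the shape of `bpftool prog list --json` output. One caveat worth keeping in mind: jq prints null both when the "pids" key is explicitly null and when it is missing entirely, so a null here may also mean that bpftool did not report owner PIDs at all.

```python
import json


def programs_without_pids(bpftool_json: str, types=("kprobe", "tracepoint")) -> list[dict]:
    """Return id/name of programs of the given types that list no owning PIDs."""
    result = []
    for prog in json.loads(bpftool_json):
        if prog.get("type") not in types:
            continue
        # .get() returns None for a missing key, mirroring jq's null output,
        # so "absent" and "explicitly empty" are treated the same way here.
        if not prog.get("pids"):
            result.append({"id": prog.get("id"), "name": prog.get("name")})
    return result


# Hypothetical sample, NOT real output from the affected node.
SAMPLE = """[
  {"id": 101, "type": "kprobe", "name": "openssl_SSL_read_enter"},
  {"id": 102, "type": "kprobe", "name": "openssl_SSL_write_enter",
   "pids": [{"pid": 7082, "comm": "coroot-node-age"}]},
  {"id": 103, "type": "cgroup_skb", "name": "unrelated_prog"}
]"""

if __name__ == "__main__":
    for prog in programs_without_pids(SAMPLE):
        print(prog["id"], prog["name"])
```

Comparing the result before and after the agent exits, on the same node and with the same bpftool build, would show whether the null values really appear only once the agent is gone.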

  3. OpenSSL uprobes confirmed as the blocker:
openssl_SSL_write_enter
openssl_SSL_write_enter_v1_1_1
openssl_SSL_write_enter_v3_0
openssl_SSL_read_enter
openssl_SSL_read_enter_v1_1_1
openssl_SSL_read_enter_v3_0
openssl_SSL_read_ex_enter
openssl_SSL_read_ex_enter_v3_0
openssl_SSL_read_ex_enter_v1_1_1
openssl_SSL_read_exit

These are per-process uprobes. We suspect that, with many php-fpm workers each having its own uprobe attachment, the kernel holds all of them until every worker's perf ring buffer drains.

Reproduction Steps

  • Deploy coroot-node-agent as a DaemonSet on a Kubernetes cluster
  • Ensure at least one node is running a php-fpm web application under high traffic (untested, but a dummy web service that returns immediately might not reproduce the issue)
  • Trigger a rollout: kubectl rollout restart daemonset/coroot-node-agent
  • Observe pods on php-fpm nodes stuck in Terminating state for much longer than pods on the other nodes
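
To put numbers on the last step, one option is to look at pods that are still present after their deletion deadline. The sketch below assumes the input has the shape of `kubectl get pods -o json` output and that `metadata.deletionTimestamp` holds the time by which the pod should be gone (deletion request plus grace period), which is our understanding of the Kubernetes API. The function name, pod names and node names are hypothetical.

```python
import json
from datetime import datetime, timezone


def overdue_terminating_pods(pods_json: str, now: datetime) -> list[tuple[str, str, float]]:
    """Return (pod, node, seconds past deletionTimestamp), longest overdue first."""
    rows = []
    for pod in json.loads(pods_json).get("items", []):
        meta = pod.get("metadata", {})
        ts = meta.get("deletionTimestamp")
        if ts is None:
            continue  # pod is not being deleted
        deadline = datetime.strptime(ts, "%Y-%m-%dT%H:%M:%SZ").replace(tzinfo=timezone.utc)
        overdue = (now - deadline).total_seconds()
        if overdue > 0:
            node = pod.get("spec", {}).get("nodeName", "")
            rows.append((meta.get("name", ""), node, overdue))
    return sorted(rows, key=lambda row: row[2], reverse=True)


# Hypothetical sample, NOT taken from the affected cluster.
SAMPLE_PODS = """{"items": [
  {"metadata": {"name": "coroot-node-agent-abcde",
                "deletionTimestamp": "2024-01-01T12:00:30Z"},
   "spec": {"nodeName": "node-php"}},
  {"metadata": {"name": "coroot-node-agent-fghij"},
   "spec": {"nodeName": "node-go"}}
]}"""

if __name__ == "__main__":
    now = datetime(2024, 1, 1, 12, 10, 30, tzinfo=timezone.utc)
    for name, node, seconds in overdue_terminating_pods(SAMPLE_PODS, now):
        print(f"{name} on {node}: {seconds:.0f}s past deadline")
```

Running this periodically during the rollout and grouping by node should make the difference between php-fpm nodes and the rest visible.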
