bug: coroot-node-agent exits uncleanly on SIGTERM leaving orphaned eBPF programs on nodes with active php-fpm workers #290

## Description

When the coroot-node-agent DaemonSet is restarted, agent pods on nodes running php-fpm under heavy traffic take 10+ minutes to terminate. Agent pods on nodes without php-fpm workloads terminate normally, even when those nodes host other high-traffic web applications (Go, Java Spring, etc.).

We run coroot-node-agent as a DaemonSet on AWS EKS, version `v1.35.0-eks-3a10415`.

## Initial Findings

The agent receives SIGTERM and begins cleanup, but an internal SIGALRM watchdog fires before cleanup completes, causing the process to exit immediately without detaching its eBPF programs. This was confirmed by observing the agent process in zombie state with `SigPnd: 0000000000004000` (signal 14 = SIGALRM pending).
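As a cross-check on the mask reading above, here is a minimal sketch for decoding a `SigPnd` value into signal numbers. It assumes the usual Linux encoding for the hex masks in `/proc/<pid>/status`, where bit n-1 represents signal n; `decode_sigmask` is a hypothetical helper name, not part of the agent. Under that encoding, `0000000000004000` decodes to signal 15 (SIGTERM on Linux x86-64 and arm64) rather than 14, so the SIGALRM interpretation may be worth re-verifying.

```bash
#!/usr/bin/env bash
# Decode a hex signal mask from /proc/<pid>/status (SigPnd, ShdPnd, SigBlk, ...)
# into a space-separated list of signal numbers.
# Assumption: bit (n-1) of the mask corresponds to signal n.
decode_sigmask() {
  local mask=$((16#$1))   # parse the hex string as an integer
  local sig out=""
  for ((sig = 1; sig <= 64; sig++)); do
    if (( (mask >> (sig - 1)) & 1 )); then
      out+="${out:+ }$sig"
    fi
  done
  echo "$out"
}

# Usage (map the numbers to names with bash's builtin `kill -l`):
#   for s in $(decode_sigmask 0000000000004000); do kill -l "$s"; done
```

The same helper works for the `ShdPnd`, `SigBlk`, and `SigCgt` fields, which can help tell whether a signal was pending on one thread or on the whole process.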

We suspect cleanup is slow specifically on nodes running php-fpm pods because the agent attaches OpenSSL uprobes (`openssl_SSL_read`, `openssl_SSL_write`, and variants) per php-fpm worker process. Our working theory is that the kernel cannot release these uprobe attachments until all in-flight events drain from each worker's perf ring buffer, and that php-fpm workers hold persistent TLS connections which keep the ring buffers alive.

Disabling L7 tracing via the `--disable-l7-tracing` flag causes the agent to terminate immediately, which points to the OpenSSL uprobes as the most likely cause of the delay.

### Evidence

1. Agent process in zombie state with SIGALRM pending immediately after SIGTERM:

```
Name:   coroot-node-age
State:  Z (zombie)
Pid:    7082
PPid:   6426  ← containerd-shim
SigPnd: 0000000000004000  ← signal 14 = SIGALRM
```

The agent appears to have exited via an internal SIGALRM watchdog (unconfirmed) before completing eBPF cleanup.

2. All 46 eBPF programs orphaned (pids: null) after agent exit:

```
bpftool prog list --json | jq '.[] | select(.type == "kprobe" or .type == "tracepoint") | {id, name, pids}'
```

All 46 entries returned `"pids": null` — no owning process, fully orphaned.

3. OpenSSL uprobes confirmed as the blocker:

```
openssl_SSL_write_enter
openssl_SSL_write_enter_v1_1_1
openssl_SSL_write_enter_v3_0
openssl_SSL_read_enter
openssl_SSL_read_enter_v1_1_1
openssl_SSL_read_enter_v3_0
openssl_SSL_read_ex_enter
openssl_SSL_read_ex_enter_v3_0
openssl_SSL_read_ex_enter_v1_1_1
openssl_SSL_read_exit
```

These are per-process uprobes. We suspect that, with many php-fpm workers each holding their own uprobe attachment, the kernel keeps all of them alive until every worker's perf ring buffer drains.
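To reduce the orphan check from item 2 to a single number that can be compared before and after a restart, the same `bpftool` query can be extended to count entries without an owning process. This is only a sketch: `count_orphaned_progs` is a hypothetical helper name, it requires `jq`, and it treats a missing `pids` field the same as `null`. Since `bpftool` may also omit `pids` when it cannot resolve holders, the count is best read as an upper bound.

```bash
#!/usr/bin/env bash
# Count kprobe/tracepoint programs that report no owning process.
# Reads `bpftool prog list --json` output on stdin.
# In jq, a missing "pids" key evaluates to null, so both cases are counted.
count_orphaned_progs() {
  jq '[.[]
       | select((.type == "kprobe" or .type == "tracepoint") and .pids == null)
      ] | length'
}

# Usage on the affected node (needs root):
#   bpftool prog list --json | count_orphaned_progs
```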

## Reproduction Steps

- Deploy coroot-node-agent as a DaemonSet on a Kubernetes cluster
- Ensure at least one node runs a php-fpm web application under high traffic (untested, but a dummy web service that returns immediately might not reproduce the issue)
- Trigger a rollout: `kubectl rollout restart daemonset/coroot-node-agent`
- Observe that agent pods on php-fpm nodes stay stuck in the `Terminating` state much longer than pods on the other worker nodes
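To put a number on the last step, a rough sketch like the following can track how many agent pods are still terminating over time. `count_terminating` is a hypothetical helper name, and the namespace and label selector in the usage comment are assumptions that should be adjusted to the actual deployment.

```bash
#!/usr/bin/env bash
# Count pods whose STATUS column reads "Terminating" in the default
# `kubectl get pods` table output (header line skipped), read from stdin.
count_terminating() {
  awk 'NR > 1 && $3 == "Terminating" { n++ } END { print n + 0 }'
}

# Usage: poll every 10 seconds after triggering the rollout.
# The namespace "coroot" and the label "app=coroot-node-agent" are assumptions.
#   while true; do
#     printf '%s terminating=%s\n' "$(date -u +%T)" \
#       "$(kubectl get pods -n coroot -l app=coroot-node-agent | count_terminating)"
#     sleep 10
#   done
```

Adding `-o wide` to the `kubectl get pods` call leaves the STATUS column in place and adds the node name, which makes it easier to confirm that the slow pods sit on php-fpm nodes.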


