[BUG] Datadog Agent Network Monitoring prevents AWS VPC CNI interface cleanup in EKS - causes pod IPv4 connectivity failures #41350

@alexandrucojocaru-creatopy

Problem Description

The Datadog Agent with network monitoring enabled prevents proper cleanup of network interfaces (veth pairs) when pods are deleted in AWS EKS clusters, leading to massive interface accumulation and intermittent IPv4 connectivity failures for pods.

Environment Details

  • Platform: Amazon EKS
  • Kubernetes Version: 1.33
  • Node OS: Bottlerocket OS 1.42.0 (aws-k8s-1.33)
  • Cluster IP address family: IPv6
  • Instance Type: c7i.4xlarge
  • AWS VPC CNI Version: v1.19.6-eksbuild.1
  • Datadog Agent Version: 7.70.2
  • Datadog Agent Configuration: Network monitoring enabled

Technical Symptoms Discovered

Expected Behavior:

  • Node with 35 active pods should have ~35-70 veth interfaces (one pair per pod)
  • When pods are deleted, AWS VPC CNI should clean up corresponding veth interfaces

Actual Behavior:

  • Node with 35 active pods accumulated 2000+ veth interfaces
  • Stale veth interfaces persist after pod deletion

Connectivity Impact:

  • Intermittent IPv4 connectivity failures for new pods
  • Pods cannot reach external services (DNS resolution works, but TCP connections fail)
  • IPv6 connectivity remains unaffected

Interface Counts:

# Healthy node (without issue):
$ ip addr show | grep ": veth" | wc -l
2  # Expected: matches number of active pods

# Affected node (with Datadog network monitoring):
$ ip addr show | grep ": veth" | wc -l
2000+  # Problem: massive interface accumulation

Observations:

  1. Without Datadog network monitoring: CNI cleanup works properly
  2. With Datadog network monitoring enabled: veth interfaces accumulate indefinitely
  3. Hypothesis: Network monitoring hooks prevent AWS VPC CNI from properly cleaning up network interfaces during pod deletion
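The expected-versus-actual comparison above could be scripted roughly as follows. This is a minimal sketch, not part of the original report: the function names `count_veth` and `veth_count_ok` are hypothetical, the active pod count is assumed to be supplied by the operator, and the "two interfaces per pod" threshold is simply the upper end of the ~35-70 range quoted under Expected Behavior.

```shell
#!/bin/sh
# Count interfaces whose name starts with "veth".
# Reads `ip -o link show` output (one interface per line) from stdin.
count_veth() {
    awk '$2 ~ /^veth/ { n++ } END { print n + 0 }'
}

# Exit 0 when the veth count is at most two per active pod
# (assumed threshold, taken from the expected range in this issue).
veth_count_ok() {
    veth_count=$1
    active_pods=$2
    [ "$veth_count" -le $((active_pods * 2)) ]
}

# Example usage on a node with 35 active pods:
#   ip -o link show | count_veth
#   veth_count_ok "$(ip -o link show | count_veth)" 35 || echo "possible veth leak"
```

Matching on the interface name mirrors the `grep ": veth"` pipeline used in this report, so the two counts should agree on the same node.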

Steps to Reproduce

  1. Deploy AWS EKS cluster with Bottlerocket nodes
  2. Install Datadog Agent with network monitoring enabled
  3. Deploy and delete pods repeatedly over several days
  4. Monitor interface count: ip addr show | grep ": veth" | wc -l
  5. Observe interface count growing and not decreasing when pods are deleted
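Steps 4 and 5 could be automated with something like the sketch below, which records the interface count over time and summarizes the growth. The function names `log_veth_sample` and `veth_growth` and the log format are assumptions made for illustration; how the sampler is scheduled (cron, a loop, a DaemonSet) is left open.

```shell
#!/bin/sh
# Append one "epoch_seconds veth_count" sample to a log file.
# grep -c counts matching lines, equivalent to `grep ... | wc -l` in step 4.
log_veth_sample() {
    log_file=$1
    count=$(ip addr show | grep -c ": veth" || true)
    printf '%s %s\n' "$(date +%s)" "$count" >> "$log_file"
}

# Summarize a log: first count, last count, and net growth between them.
veth_growth() {
    awk 'NR == 1 { first = $2 }
         { last = $2 }
         END { if (NR) printf "first=%d last=%d growth=%d\n", first, last, last - first }' "$1"
}

# Example usage (hypothetical log path):
#   log_veth_sample /tmp/veth.log    # run periodically
#   veth_growth /tmp/veth.log
```

A count that keeps growing while the number of running pods stays flat would match the behavior described in this issue.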

Disabling Datadog network monitoring immediately resolved the issue: the unused network interfaces were deleted after the agent was terminated.
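For readers who want to try the same workaround, one possible way to disable network monitoring is sketched below. It assumes the agent was installed with the official Datadog Helm chart under the repo alias `datadog`, which this report does not state; the release and namespace arguments and the function name `disable_network_monitoring` are hypothetical. Operator-based or manifest-based installs would need a different change.

```shell
#!/bin/sh
# Disable network monitoring on an existing Helm release of the Datadog chart.
# Assumes `helm repo add datadog https://helm.datadoghq.com` was run earlier.
disable_network_monitoring() {
    release=$1      # hypothetical release name, e.g. "datadog-agent"
    namespace=$2    # hypothetical namespace, e.g. "datadog"
    helm upgrade "$release" datadog/datadog \
        --namespace "$namespace" \
        --reuse-values \
        --set datadog.networkMonitoring.enabled=false
}

# Example usage:
#   disable_network_monitoring datadog-agent datadog
```

After the agent pods restart, the veth count on an affected node can be rechecked with the same `ip addr show | grep ": veth" | wc -l` command.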

Please let me know what additional diagnostic information would be helpful for investigating this issue.

Metadata

Assignees

No one assigned

    Labels

    pending (label for issues waiting on a Datadog member's response), team/triage
