
exporter/prometheus: Metrics clean up should be independent of Collect() #41123

@BloodyKnight

Description

Component(s)

  • servicegraph connector
  • prometheus exporter

What happened?

Description

When we use the servicegraph connector to gather application topology information and export the metrics with the prometheus exporter, the otel-collector slowly consumes memory (about 16 GiB over 2 weeks). The picture below shows the node memory usage:

[Image: node memory usage]

Steps to Reproduce

1. Enable the servicegraph connector.
2. Export metrics using the prometheus exporter.
3. Do NOT scrape the metrics endpoint.
4. The otel-collector eventually OOMs.
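
For context, here is a minimal Go sketch of the failure mode the issue title points at. This is not the exporter's actual code; the type and names (leakyCollector, Add) are hypothetical. It only illustrates why eviction that happens exclusively inside Collect() never runs when the endpoint is not scraped:

package leakdemo

import (
	"sync"
	"time"

	"github.com/prometheus/client_golang/prometheus"
)

// leakyCollector mimics an accumulator whose entries are evicted only while
// serving a scrape. All names are hypothetical, for illustration only.
type leakyCollector struct {
	mu      sync.Mutex
	entries map[string]time.Time // label value -> last update time
	ttl     time.Duration
	desc    *prometheus.Desc
}

func newLeakyCollector(ttl time.Duration) *leakyCollector {
	return &leakyCollector{
		entries: map[string]time.Time{},
		ttl:     ttl,
		desc:    prometheus.NewDesc("demo_series", "demo series", []string{"key"}, nil),
	}
}

// Add records or refreshes a series. With no scrapes this is the only method
// that ever runs, so the map grows without bound.
func (c *leakyCollector) Add(key string) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.entries[key] = time.Now()
}

func (c *leakyCollector) Describe(ch chan<- *prometheus.Desc) { ch <- c.desc }

// Collect evicts expired entries and emits the rest. Because eviction lives
// only here, it happens only when Prometheus actually scrapes the endpoint.
func (c *leakyCollector) Collect(ch chan<- prometheus.Metric) {
	c.mu.Lock()
	defer c.mu.Unlock()
	now := time.Now()
	for k, seen := range c.entries {
		if now.Sub(seen) > c.ttl {
			delete(c.entries, k)
			continue
		}
		ch <- prometheus.MustNewConstMetric(c.desc, prometheus.GaugeValue, 1, k)
	}
}

If something like this is what happens inside the exporter, the servicegraph connector pushing fresh series every metrics_flush_interval (100ms in the config below) would explain the steady growth.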

Expected Result

The otel-collector's memory usage stays steady.

Actual Result

The otel-collector leaks memory.

Collector version

v0.124.1

Environment information

Environment

OS: CentOS Linux release 7.9.2009 (Core), KVM
Binary: official release v0.124.1

OpenTelemetry Collector configuration

receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
      http:
        endpoint: 0.0.0.0:4318
processors:
  memory_limiter:
    check_interval: 5s
    limit_percentage: 85
    spike_limit_percentage: 15

  batch:
    timeout: 100ms
    send_batch_size: 4096
    send_batch_max_size: 5000

  tail_sampling:
    decision_wait: 10s
    num_traces: 50000
    expected_new_traces_per_sec: 1000
    policies:
      [
         {
             name: prob-policy,
             type: probabilistic,
             probabilistic: {sampling_percentage: 10}
         },
         {
           name: error-policy,
           type: status_code,
           status_code: {status_codes: [ERROR]}
         },
         {
           name: latency-policy,
           type: latency,
           latency: {threshold_ms: 3000}
         },
         {
           name: force-sample,  # always sample if the force_sample attribute is set to true
           type: boolean_attribute,
           boolean_attribute: { key: agent.force.sample, value: true }
         }
      ]

exporters:
  debug:
    verbosity: detailed
  kafka/trace:
    brokers:
      - log.gateway.collector:9092
    protocol_version: 2.1.0
    encoding: zipkin_proto
    retry_on_failure:
       enabled: true
    timeout: 2s
    sending_queue:
       enabled: true
       num_consumers: 30
       queue_size: 500000
    producer:
      max_message_bytes: 31457280
      required_acks: 0
      compression: lz4
    topic: mop-trace
  prometheus/servicegraph:
    endpoint: 0.0.0.0:18073
connectors:
  servicegraph:
    store:
      ttl: 2s
      max_items: 50000
    dimensions: [region.code, service.namespace]
    latency_histogram_buckets: [10ms, 50ms, 100ms, 500ms, 1s, 3s, 5s, 10s]
    cache_loop: 2m # how often to clean the cache
    store_expiration_loop: 30s 
    virtual_node_peer_attributes: [service.name, db.name, net.sock.peer.addr, net.peer.name, rpc.service, net.sock.peer.name, net.peer.name, http.url, http.target]
    metrics_flush_interval: 100ms
extensions:
  health_check:
    endpoint: 0.0.0.0:13133
  zpages:
    endpoint: 0.0.0.0:55679
  pprof:
    endpoint: :1777
service:
  extensions: [health_check, zpages, pprof]
  pipelines:
    traces:
      receivers: [otlp]
      processors: [memory_limiter, tail_sampling, batch]
      exporters: [kafka/trace]
    traces/servicegraph:
      receivers: [otlp]
      processors: [memory_limiter]
      exporters: [servicegraph]
    metrics/servicegraph:
      receivers: [servicegraph]
      exporters: [prometheus/servicegraph]
  telemetry:
    metrics:
      readers:
        - pull:
            exporter:
              prometheus:
                host: '0.0.0.0'
                port: 8888
    logs:
      level: "info"

Log output

2025-06-28T17:04:41.463+0800    info    [email protected]/memorylimiter.go:171     Memory usage after GC.  {"cur_mem_mib": 6230}
2025-06-28T17:05:27.071+0800    info    [email protected]/memorylimiter.go:205     Memory usage is above soft limit. Forcing a GC. {"cur_mem_mib": 11197}
2025-06-28T17:05:32.338+0800    info    [email protected]/memorylimiter.go:171     Memory usage after GC.  {"cur_mem_mib": 6263}
2025-06-28T17:25:37.072+0800    info    [email protected]/memorylimiter.go:205     Memory usage is above soft limit. Forcing a GC. {"cur_mem_mib": 11151}
2025-06-28T17:25:42.094+0800    info    [email protected]/memorylimiter.go:171     Memory usage after GC.  {"cur_mem_mib": 6312}
2025-06-28T17:30:47.071+0800    info    [email protected]/memorylimiter.go:205     Memory usage is above soft limit. Forcing a GC. {"cur_mem_mib": 11150}
2025-06-28T17:30:51.978+0800    info    [email protected]/memorylimiter.go:171     Memory usage after GC.  {"cur_mem_mib": 6208}
2025-06-28T17:35:32.071+0800    info    [email protected]/memorylimiter.go:205     Memory usage is above soft limit. Forcing a GC. {"cur_mem_mib": 11196}
2025-06-28T17:35:36.884+0800    info    [email protected]/memorylimiter.go:171     Memory usage after GC.  {"cur_mem_mib": 6308}
2025-06-28T17:40:42.072+0800    info    [email protected]/memorylimiter.go:205     Memory usage is above soft limit. Forcing a GC. {"cur_mem_mib": 11147}
2025-06-28T17:40:47.768+0800    info    [email protected]/memorylimiter.go:171     Memory usage after GC.  {"cur_mem_mib": 6264}
2025-06-28T17:43:07.072+0800    info    [email protected]/memorylimiter.go:205     Memory usage is above soft limit. Forcing a GC. {"cur_mem_mib": 11129}
2025-06-28T17:43:11.755+0800    info    [email protected]/memorylimiter.go:171     Memory usage after GC.  {"cur_mem_mib": 6210}
2025-06-28T17:45:32.072+0800    info    [email protected]/memorylimiter.go:205     Memory usage is above soft limit. Forcing a GC. {"cur_mem_mib": 11182}
2025-06-28T17:45:36.624+0800    info    [email protected]/memorylimiter.go:171     Memory usage after GC.  {"cur_mem_mib": 6315}
2025-06-28T17:53:07.072+0800    info    [email protected]/memorylimiter.go:205     Memory usage is above soft limit. Forcing a GC. {"cur_mem_mib": 11151}
2025-06-28T17:53:11.817+0800    info    [email protected]/memorylimiter.go:171     Memory usage after GC.  {"cur_mem_mib": 6284}
2025-06-28T17:55:32.072+0800    info    [email protected]/memorylimiter.go:205     Memory usage is above soft limit. Forcing a GC. {"cur_mem_mib": 11237}
2025-06-28T17:55:36.491+0800    info    [email protected]/memorylimiter.go:171     Memory usage after GC.  {"cur_mem_mib": 6306}
2025-06-28T18:00:17.071+0800    info    [email protected]/memorylimiter.go:205     Memory usage is above soft limit. Forcing a GC. {"cur_mem_mib": 11276}
2025-06-28T18:00:22.166+0800    info    [email protected]/memorylimiter.go:171     Memory usage after GC.  {"cur_mem_mib": 6670}
2025-06-28T18:00:42.072+0800    info    [email protected]/memorylimiter.go:205     Memory usage is above soft limit. Forcing a GC. {"cur_mem_mib": 11618}
2025-06-28T18:00:48.222+0800    info    [email protected]/memorylimiter.go:171     Memory usage after GC.  {"cur_mem_mib": 6559}
2025-06-28T18:02:37.071+0800    info    [email protected]/memorylimiter.go:205     Memory usage is above soft limit. Forcing a GC. {"cur_mem_mib": 11327}
2025-06-28T18:02:42.522+0800    info    [email protected]/memorylimiter.go:171     Memory usage after GC.  {"cur_mem_mib": 6452}
2025-06-28T18:03:47.071+0800    info    [email protected]/memorylimiter.go:205     Memory usage is above soft limit. Forcing a GC. {"cur_mem_mib": 11314}
2025-06-28T18:03:52.618+0800    info    [email protected]/memorylimiter.go:171     Memory usage after GC.  {"cur_mem_mib": 6459}
2025-06-28T18:04:37.071+0800    info    [email protected]/memorylimiter.go:205     Memory usage is above soft limit. Forcing a GC. {"cur_mem_mib": 11581}
2025-06-28T18:04:41.960+0800    info    [email protected]/memorylimiter.go:171     Memory usage after GC.  {"cur_mem_mib": 6352}
2025-06-28T18:05:47.073+0800    info    [email protected]/memorylimiter.go:205     Memory usage is above soft limit. Forcing a GC. {"cur_mem_mib": 11187}
2025-06-28T18:05:52.748+0800    info    [email protected]/memorylimiter.go:171     Memory usage after GC.  {"cur_mem_mib": 6284}
2025-06-28T18:06:37.071+0800    info    [email protected]/memorylimiter.go:205     Memory usage is above soft limit. Forcing a GC. {"cur_mem_mib": 11156}
2025-06-28T18:06:43.403+0800    info    [email protected]/memorylimiter.go:171     Memory usage after GC.  {"cur_mem_mib": 6371}
2025-06-28T18:07:07.072+0800    info    [email protected]/memorylimiter.go:205     Memory usage is above soft limit. Forcing a GC. {"cur_mem_mib": 11226}
2025-06-28T18:07:11.586+0800    info    [email protected]/memorylimiter.go:171     Memory usage after GC.  {"cur_mem_mib": 6340}
2025-06-28T18:10:37.072+0800    info    [email protected]/memorylimiter.go:205     Memory usage is above soft limit. Forcing a GC. {"cur_mem_mib": 11233}
2025-06-28T18:10:43.423+0800    info    [email protected]/memorylimiter.go:171     Memory usage after GC.  {"cur_mem_mib": 6491}
2025-06-28T18:11:27.071+0800    info    [email protected]/memorylimiter.go:205     Memory usage is above soft limit. Forcing a GC. {"cur_mem_mib": 11327}
2025-06-28T18:11:32.503+0800    info    [email protected]/memorylimiter.go:171     Memory usage after GC.  {"cur_mem_mib": 6413}
2025-06-28T18:18:07.071+0800    info    [email protected]/memorylimiter.go:205     Memory usage is above soft limit. Forcing a GC. {"cur_mem_mib": 11132}
2025-06-28T18:18:11.679+0800    info    [email protected]/memorylimiter.go:171     Memory usage after GC.  {"cur_mem_mib": 6340}
2025-06-28T18:25:37.072+0800    info    [email protected]/memorylimiter.go:205     Memory usage is above soft limit. Forcing a GC. {"cur_mem_mib": 11251}
2025-06-28T18:25:41.711+0800    info    [email protected]/memorylimiter.go:171     Memory usage after GC.  {"cur_mem_mib": 6325}
....

Additional context

The official v0.93.0 build does not show this problem, or at least not as obviously.

v0.124.1 memory pprof (inuse_space):

[Image: v0.124.1 heap profile (inuse_space)]

v0.93 memory pprof (inuse_space):

[Image: v0.93 heap profile (inuse_space)]
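
If the accumulation really is tied to Collect(), the change the title asks for could look roughly like the sketch below, building on the hypothetical leakyCollector above: run the expiry pass on a ticker started from the component's lifecycle, so it fires even when nobody scrapes. cleanupExpired and startCleanupLoop are assumed names, not the exporter's real API.

// cleanupExpired removes entries older than the TTL. Splitting it out of
// Collect() lets it run from more than one place.
func (c *leakyCollector) cleanupExpired(now time.Time) {
	c.mu.Lock()
	defer c.mu.Unlock()
	for k, seen := range c.entries {
		if now.Sub(seen) > c.ttl {
			delete(c.entries, k)
		}
	}
}

// startCleanupLoop evicts expired series on a timer, independent of scrapes.
// Stop it by closing done (e.g. from Shutdown).
func (c *leakyCollector) startCleanupLoop(interval time.Duration, done <-chan struct{}) {
	go func() {
		t := time.NewTicker(interval)
		defer t.Stop()
		for {
			select {
			case <-t.C:
				c.cleanupExpired(time.Now())
			case <-done:
				return
			}
		}
	}()
}

With that in place, Collect() can keep (or drop) its own eviction pass; either way memory stops depending on scrape traffic.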

Labels

bug · exporter/prometheus · help wanted · never stale
