
filelog receiver loses logs when exporter is down #45120

@ommkwn

Description

Component(s)

No response

Describe the issue you're reporting

The filelog receiver advances checkpoints even when the exporter (OpenSearch) is down, causing permanent log loss after the backend restarts.

I am using OpenTelemetry Collector Contrib with the filelog receiver and opensearch exporter.

My application writes log files continuously, and the collector reads them from disk and sends them to OpenSearch.

However, when OpenSearch is unavailable for some time, logs generated during that downtime are permanently lost, even though:

- `sending_queue` is enabled
- `retry_on_failure` is enabled
- `file_storage` is configured
- checkpoints are persisted on disk

I expected the collector to resume sending all logs generated during the outage once OpenSearch came back, but this does not happen.
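
For triage, here is the persistence-related wiring distilled from the full `values.yaml` below (a minimal sketch showing only the relevant pieces, abbreviated from the actual config):

```yaml
# Persistence-related pieces only, abbreviated from the full values.yaml below
extensions:
  file_storage:
    directory: /var/lib/otelcol/checkpoints   # backed by the otel-storage PVC
    create_directory: true

receivers:
  filelog:
    include:
      - /var/log/tadhak/tadhak-*.log
    start_at: beginning
    storage: file_storage            # read offsets (checkpoints) persisted here

exporters:
  opensearch:
    sending_queue:
      enabled: true
      storage: file_storage          # queue persisted across collector restarts
      queue_size: 20000
      num_consumers: 2
    retry_on_failure:
      enabled: true
      max_elapsed_time: 0            # retry forever

service:
  extensions: [health_check, basicauth/client, file_storage]
  pipelines:
    logs:
      receivers: [filelog]
      processors: [memory_limiter, batch]
      exporters: [opensearch]
```

My understanding is that with `storage: file_storage` on both the filelog receiver and the exporter's `sending_queue`, anything read during an OpenSearch outage should sit in the persistent queue and be retried until delivered, not dropped.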

My OpenTelemetry `values.yaml` file:

```yaml
  nameOverride: ""
  fullnameOverride: ""
  
  mode: "deployment"
  
  namespaceOverride: ""
  
  presets:
  
    logsCollection:
      enabled: false
      includeCollectorLogs: false
  
      storeCheckpoints: true
  
      maxRecombineLogSize: 102400
  
    hostMetrics:
      enabled: false
  
    kubernetesAttributes:
      enabled: false
  
      extractAllPodLabels: false
  
      extractAllPodAnnotations: false
  
    kubeletMetrics:
      enabled: false
  
    kubernetesEvents:
      enabled: false
  
    clusterMetrics:
      enabled: false
  
    annotationDiscovery:
      logs:
        enabled: false
      metrics:
        enabled: false
  
  configMap:
    create: true
  
    existingName: ""
  
  internalTelemetryViaOTLP:
  
    endpoint: ""
    headers: []
  
    traces:
      enabled: false
  
      endpoint: ""
  
      headers: []
    metrics:
      enabled: false
  
      endpoint: ""
  
      headers: []
    logs:
      enabled: false
  
      endpoint: ""
  
      headers: []
  
  config:
  
    exporters:
      opensearch:
        logs_index: ss4o_log-otel-timedloggermvc
        http:
          endpoint: https://opensearch-cluster-master.open.svc.cluster.local:9200
          auth:
            authenticator: basicauth/client
          tls:
            insecure_skip_verify: true
  
        sending_queue:
          enabled: true
          storage: file_storage
          queue_size: 20000
          num_consumers: 2
  
        retry_on_failure:
          enabled: true
          max_elapsed_time: 0   # retry forever
  
      debug: {}
    extensions:
      file_storage:
        directory: /var/lib/otelcol/checkpoints
        create_directory: true
      basicauth/client:
        client_auth:
          username: admin
          password: Tadhak**Dev02
  
      health_check:
        endpoint: ${env:MY_POD_IP}:13133
    processors:
      batch:
        send_batch_size: 50
        timeout: 5s
  
      memory_limiter:
  
        check_interval: 5s
  
        limit_percentage: 80
  
        spike_limit_percentage: 25
  
    receivers:
      filelog:
        include:
          - /var/log/tadhak/tadhak-*.log
        start_at: beginning
        include_file_name: true
        include_file_path: true
        delete_after_read: false
        storage: file_storage
        operators:
          - type: json_parser
            timestamp:
              parse_from: attributes.@t
              layout: '%Y-%m-%dT%H:%M:%S.%fZ'
  
      otlp:
        protocols:
          grpc:
            endpoint: ${env:MY_POD_IP}:4317
          http:
            endpoint: ${env:MY_POD_IP}:4318
      # if internalTelemetryViaOTLP.metrics.enabled = true, prometheus receiver will be removed
      prometheus:
        config:
          scrape_configs:
            - job_name: opentelemetry-collector
              scrape_interval: 10s
              static_configs:
                - targets:
                    - ${env:MY_POD_IP}:8888
      zipkin:
        endpoint: ${env:MY_POD_IP}:9411
    service:
      telemetry:
        metrics:
          readers:
            - pull:
                exporter:
                  prometheus:
                    host: ${env:MY_POD_IP}
                    port: 8888
      extensions:
        - health_check
        - basicauth/client
        - file_storage
      pipelines:
        logs:
          receivers: [filelog]
          processors: [memory_limiter, batch]   # 🔥 REQUIRED
          exporters: [opensearch] 
  
  alternateConfig: {}
  
  image:
  
    repository: "otel/opentelemetry-collector-contrib"
    pullPolicy: IfNotPresent
  
    tag: "latest"
  
    digest: ""
  imagePullSecrets: []
  
  command:
    name: otelcol-contrib
    extraArgs:
      - "--feature-gates=filelog.allowFileDeletion"
  
  serviceAccount:
    create: true
    annotations: {}
    name: ""
    automountServiceAccountToken: true
  
  clusterRole:
    create: false
    annotations: {}
    name: ""
  
    rules: []
  
    clusterRoleBinding:
  
      annotations: {}
  
      name: ""
  
  podSecurityContext: {}
  securityContext: {}
  
  nodeSelector: {}
  tolerations: []
  affinity: {}
  topologySpreadConstraints: []
  
  priorityClassName: ""
  
  runtimeClassName: ""
  
  extraEnvs: []
  extraEnvsFrom: []
  
  extraVolumes:
    - name: log-volume
      persistentVolumeClaim:
        claimName: log-storage-pvc1 # Use Nginx logs PVC
  
    - name: otel-storage
      persistentVolumeClaim:
        claimName: otel-storage-pvc
  
    - name: cleanup-script
      configMap:
        name: log-cleanup-script
        defaultMode: 0777
  
  extraVolumeMounts:
    - name: log-volume
      mountPath: /var/log/tadhak  # Match Nginx log directory
      # readOnly: true  # OpenTelemetry should only read logs
  
    - name: otel-storage
      mountPath: /var/lib/otelcol/checkpoints
  
  extraManifests: []
  
  ports:
    otlp:
      enabled: true
      containerPort: 4317
      servicePort: 4317
      hostPort: 4317
      protocol: TCP
      # nodePort: 30317
      appProtocol: grpc
    otlp-http:
      enabled: true
      containerPort: 4318
      servicePort: 4318
      hostPort: 4318
      protocol: TCP
    jaeger-compact:
      enabled: true
      containerPort: 6831
      servicePort: 6831
      hostPort: 6831
      protocol: UDP
    jaeger-thrift:
      enabled: true
      containerPort: 14268
      servicePort: 14268
      hostPort: 14268
      protocol: TCP
    jaeger-grpc:
      enabled: true
      containerPort: 14250
      servicePort: 14250
      hostPort: 14250
      protocol: TCP
    zipkin:
      enabled: true
      containerPort: 9411
      servicePort: 9411
      hostPort: 9411
      protocol: TCP
    metrics:
      # The metrics port is disabled by default. However you need to enable the port
      # in order to use the ServiceMonitor (serviceMonitor.enabled) or PodMonitor (podMonitor.enabled).
      enabled: false
      containerPort: 8888
      servicePort: 8888
      protocol: TCP
  
  useGOMEMLIMIT: true
  
  resources: {}
  
  enableConfigChecksumAnnotation: true
  podAnnotations: {}
  
  podLabels: {}
  
  additionalLabels: {}
  
  hostNetwork: false
  
  hostAliases: []
  
  dnsPolicy: ""
  
  dnsConfig: {}
  
  schedulerName: ""
  
  replicaCount: 1
  
  revisionHistoryLimit: 10
  
  annotations: {}
  
  extraContainers: []
  
  initContainers: []
  
  lifecycleHooks: {}
  
  livenessProbe:
  
    httpGet:
      port: 13133
      path: /
  
  readinessProbe:
  
    httpGet:
      port: 13133
      path: /
  
  startupProbe: {}
  
  service:
  
    type: ClusterIP
    annotations: {}
  ingress:
    enabled: false
  
    additionalIngresses: []
  
  podMonitor:
    enabled: false
    metricsEndpoints:
      - port: metrics
  
    extraLabels: {}
  
  serviceMonitor:
  
    enabled: false
    metricsEndpoints:
      - port: metrics
  
    extraLabels: {}
  
    relabelings: []
    metricRelabelings: []
  
  podDisruptionBudget:
    enabled: false
  
  autoscaling:
    enabled: false
    minReplicas: 1
    maxReplicas: 10
    behavior: {}
    targetCPUUtilizationPercentage: 80
    additionalMetrics: []
  
  rollout:
    rollingUpdate: {}
    strategy: RollingUpdate
  
  prometheusRule:
    enabled: false
    groups: []
    defaultRules:
      enabled: false
  
    extraLabels: {}
  
  statefulset:
    volumeClaimTemplates: []
    podManagementPolicy: "Parallel"
    persistentVolumeClaimRetentionPolicy:
      enabled: false
      whenDeleted: Retain
      whenScaled: Retain
  
  networkPolicy:
    enabled: false
  
    annotations: {}
  
    allowIngressFrom: []
  
    extraIngressRules: []
  
    egressRules: []
  
  shareProcessNamespace: false

```

Log Generation Pattern

My application behaves like this:

- Generates a new log file every 3 minutes
- Writes 5 logs per minute

Example:

| File   | Logs  | Time      |
|--------|-------|-----------|
| File 1 | 1–15  | 3 minutes |
| File 2 | 16–20 | 1 minute  |

These logs are successfully ingested into OpenSearch initially.
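
For reference, each line in these files is single-line JSON with an `@t` timestamp, which is what the `json_parser` operator above parses; a hypothetical example line (field names other than `@t` are made up):

```json
{"@t": "2026-02-03T10:15:00.123456Z", "level": "Information", "message": "log 21"}
```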

Failure Scenario

1. The OpenSearch pod crashes and is unavailable for ~7 minutes.
2. During this downtime, my application continues writing logs:

   | File   | Logs  |
   |--------|-------|
   | File 2 | 21–30 |
   | File 3 | 31–45 |
   | File 4 | 46–55 |

3. OpenSearch comes back up.
4. Only new logs written after the restart (e.g. 56–60) are sent to OpenSearch.

Logs 21–55 are never ingested, even though:

- The files still exist on disk
- Checkpoints exist on disk
- Exporter retries are enabled
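
One way to isolate whether the missed records are re-read from disk at all is to fan the logs pipeline out to the `debug` exporter that is already declared in the config, so re-emitted records show up in the collector's own output even while OpenSearch is down (a sketch on top of the config above, not yet verified):

```yaml
# Sketch: send the same logs pipeline to the debug exporter as well,
# using only components already declared in the config above.
exporters:
  debug:
    verbosity: detailed
service:
  pipelines:
    logs:
      receivers: [filelog]
      processors: [memory_limiter, batch]
      exporters: [opensearch, debug]
```

The collector's self metrics on port 8888 (already scraped by the prometheus receiver above) should also show whether records accumulate in the persistent queue during the outage, e.g. `otelcol_exporter_queue_size`, if I read the exporterhelper metrics correctly.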

