Component(s)
No response
Describe the issue you're reporting
filelog receiver advances checkpoints even when exporter (OpenSearch) is down, causing permanent log loss after backend restart
I am using OpenTelemetry Collector Contrib with the filelog receiver and opensearch exporter.
My application writes log files continuously, and the collector reads them from disk and sends them to OpenSearch.
However, when OpenSearch is unavailable for some time, logs generated during that downtime are permanently lost, even though:
- sending_queue is enabled
- retry_on_failure is enabled
- file_storage is configured
- checkpoints are persisted on disk

I expected the collector to resume sending all logs generated during the downtime once OpenSearch came back, but this does not happen.
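For reference, the durability-related wiring, extracted and trimmed from the full values file below, reduces to roughly this collector-config sketch (auth, other receivers, and unrelated pipelines omitted):

```yaml
# Trimmed sketch of the persistence-related pieces only (full Helm values below).
extensions:
  file_storage:
    directory: /var/lib/otelcol/checkpoints   # backed by a PersistentVolumeClaim
    create_directory: true

receivers:
  filelog:
    include:
      - /var/log/tadhak/tadhak-*.log
    start_at: beginning
    storage: file_storage            # persist file read offsets (checkpoints)

exporters:
  opensearch:
    http:
      endpoint: https://opensearch-cluster-master.open.svc.cluster.local:9200
    sending_queue:
      enabled: true
      storage: file_storage          # persistent sending queue on disk
      queue_size: 20000
    retry_on_failure:
      enabled: true
      max_elapsed_time: 0            # retry forever

service:
  extensions: [file_storage]         # health_check and basicauth omitted here
  pipelines:
    logs:
      receivers: [filelog]
      processors: [memory_limiter, batch]
      exporters: [opensearch]
```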
My OpenTelemetry Collector Helm chart values.yaml file:
```yaml
nameOverride: ""
fullnameOverride: ""
mode: "deployment"
namespaceOverride: ""

presets:
  logsCollection:
    enabled: false
    includeCollectorLogs: false
    storeCheckpoints: true
    maxRecombineLogSize: 102400
  hostMetrics:
    enabled: false
  kubernetesAttributes:
    enabled: false
    extractAllPodLabels: false
    extractAllPodAnnotations: false
  kubeletMetrics:
    enabled: false
  kubernetesEvents:
    enabled: false
  clusterMetrics:
    enabled: false

annotationDiscovery:
  logs:
    enabled: false
  metrics:
    enabled: false

configMap:
  create: true
  existingName: ""

internalTelemetryViaOTLP:
  endpoint: ""
  headers: []
  traces:
    enabled: false
    endpoint: ""
    headers: []
  metrics:
    enabled: false
    endpoint: ""
    headers: []
  logs:
    enabled: false
    endpoint: ""
    headers: []

config:
  exporters:
    opensearch:
      logs_index: ss4o_log-otel-timedloggermvc
      http:
        endpoint: https://opensearch-cluster-master.open.svc.cluster.local:9200
        auth:
          authenticator: basicauth/client
        tls:
          insecure_skip_verify: true
      sending_queue:
        enabled: true
        storage: file_storage
        queue_size: 20000
        num_consumers: 2
      retry_on_failure:
        enabled: true
        max_elapsed_time: 0 # retry forever
    debug: {}
  extensions:
    file_storage:
      directory: /var/lib/otelcol/checkpoints
      create_directory: true
    basicauth/client:
      client_auth:
        username: admin
        password: Tadhak**Dev02
    health_check:
      endpoint: ${env:MY_POD_IP}:13133
  processors:
    batch:
      send_batch_size: 50
      timeout: 5s
    memory_limiter:
      check_interval: 5s
      limit_percentage: 80
      spike_limit_percentage: 25
  receivers:
    filelog:
      include:
        - /var/log/tadhak/tadhak-*.log
      start_at: beginning
      include_file_name: true
      include_file_path: true
      delete_after_read: false
      storage: file_storage
      operators:
        - type: json_parser
          timestamp:
            parse_from: attributes.@t
            layout: '%Y-%m-%dT%H:%M:%S.%fZ'
    otlp:
      protocols:
        grpc:
          endpoint: ${env:MY_POD_IP}:4317
        http:
          endpoint: ${env:MY_POD_IP}:4318
    # if internalTelemetryViaOTLP.metrics.enabled = true, prometheus receiver will be removed
    prometheus:
      config:
        scrape_configs:
          - job_name: opentelemetry-collector
            scrape_interval: 10s
            static_configs:
              - targets:
                  - ${env:MY_POD_IP}:8888
    zipkin:
      endpoint: ${env:MY_POD_IP}:9411
  service:
    telemetry:
      metrics:
        readers:
          - pull:
              exporter:
                prometheus:
                  host: ${env:MY_POD_IP}
                  port: 8888
    extensions:
      - health_check
      - basicauth/client
      - file_storage
    pipelines:
      logs:
        receivers: [filelog]
        processors: [memory_limiter, batch] # 🔥 REQUIRED
        exporters: [opensearch]

alternateConfig: {}

image:
  repository: "otel/opentelemetry-collector-contrib"
  pullPolicy: IfNotPresent
  tag: "latest"
  digest: ""
imagePullSecrets: []

command:
  name: otelcol-contrib
  extraArgs:
    - "--feature-gates=filelog.allowFileDeletion"

serviceAccount:
  create: true
  annotations: {}
  name: ""
automountServiceAccountToken: true

clusterRole:
  create: false
  annotations: {}
  name: ""
  rules: []
  clusterRoleBinding:
    annotations: {}
    name: ""

podSecurityContext: {}
securityContext: {}
nodeSelector: {}
tolerations: []
affinity: {}
topologySpreadConstraints: []
priorityClassName: ""
runtimeClassName: ""
extraEnvs: []
extraEnvsFrom: []

extraVolumes:
  - name: log-volume
    persistentVolumeClaim:
      claimName: log-storage-pvc1 # Use Nginx logs PVC
  - name: otel-storage
    persistentVolumeClaim:
      claimName: otel-storage-pvc
  - name: cleanup-script
    configMap:
      name: log-cleanup-script
      defaultMode: 0777

extraVolumeMounts:
  - name: log-volume
    mountPath: /var/log/tadhak # Match Nginx log directory
    # readOnly: true # OpenTelemetry should only read logs
  - name: otel-storage
    mountPath: /var/lib/otelcol/checkpoints

extraManifests: []

ports:
  otlp:
    enabled: true
    containerPort: 4317
    servicePort: 4317
    hostPort: 4317
    protocol: TCP
    # nodePort: 30317
    appProtocol: grpc
  otlp-http:
    enabled: true
    containerPort: 4318
    servicePort: 4318
    hostPort: 4318
    protocol: TCP
  jaeger-compact:
    enabled: true
    containerPort: 6831
    servicePort: 6831
    hostPort: 6831
    protocol: UDP
  jaeger-thrift:
    enabled: true
    containerPort: 14268
    servicePort: 14268
    hostPort: 14268
    protocol: TCP
  jaeger-grpc:
    enabled: true
    containerPort: 14250
    servicePort: 14250
    hostPort: 14250
    protocol: TCP
  zipkin:
    enabled: true
    containerPort: 9411
    servicePort: 9411
    hostPort: 9411
    protocol: TCP
  metrics:
    # The metrics port is disabled by default. However you need to enable the port
    # in order to use the ServiceMonitor (serviceMonitor.enabled) or PodMonitor (podMonitor.enabled).
    enabled: false
    containerPort: 8888
    servicePort: 8888
    protocol: TCP

useGOMEMLIMIT: true
resources: {}
enableConfigChecksumAnnotation: true
podAnnotations: {}
podLabels: {}
additionalLabels: {}
hostNetwork: false
hostAliases: []
dnsPolicy: ""
dnsConfig: {}
schedulerName: ""
replicaCount: 1
revisionHistoryLimit: 10
annotations: {}
extraContainers: []
initContainers: []
lifecycleHooks: {}

livenessProbe:
  httpGet:
    port: 13133
    path: /
readinessProbe:
  httpGet:
    port: 13133
    path: /
startupProbe: {}

service:
  type: ClusterIP
  annotations: {}

ingress:
  enabled: false
  additionalIngresses: []

podMonitor:
  enabled: false
  metricsEndpoints:
    - port: metrics
  extraLabels: {}

serviceMonitor:
  enabled: false
  metricsEndpoints:
    - port: metrics
  extraLabels: {}
  relabelings: []
  metricRelabelings: []

podDisruptionBudget:
  enabled: false

autoscaling:
  enabled: false
  minReplicas: 1
  maxReplicas: 10
  behavior: {}
  targetCPUUtilizationPercentage: 80
  additionalMetrics: []

rollout:
  rollingUpdate: {}
  strategy: RollingUpdate

prometheusRule:
  enabled: false
  groups: []
  defaultRules:
    enabled: false
  extraLabels: {}

statefulset:
  volumeClaimTemplates: []
  podManagementPolicy: "Parallel"
  persistentVolumeClaimRetentionPolicy:
    enabled: false
    whenDeleted: Retain
    whenScaled: Retain

networkPolicy:
  enabled: false
  annotations: {}
  allowIngressFrom: []
  extraIngressRules: []
  egressRules: []

shareProcessNamespace: false
```
**Log Generation Pattern**

My application behaves like this:

- Generates a new log file every 3 minutes
- Writes 5 logs per minute

Example:

| File   | Logs  | Time      |
|--------|-------|-----------|
| File 1 | 1–15  | 3 minutes |
| File 2 | 16–20 | 1 minute  |

These logs are successfully ingested into OpenSearch initially.
**Failure Scenario**

- OpenSearch pod crashes and is unavailable for ~7 minutes
- During this downtime, my application continues writing logs:

| File   | Logs  |
|--------|-------|
| File 2 | 21–30 |
| File 3 | 31–45 |
| File 4 | 46–55 |

- OpenSearch comes back up
- Only logs written after the restart (e.g. 56–60) are sent to OpenSearch

Logs 21–55 are never ingested, even though:

- The files still exist on disk
- Checkpoints exist on disk
- Exporter retries are enabled
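To help narrow down where the records are lost (never re-read from the files, refused, or dropped from the queue), the collector's self-telemetry could be turned up. A minimal sketch of the extra settings, assuming the standard `service.telemetry` options and the `debug` exporter already declared in the config above (not yet applied in my setup):

```yaml
# Sketch only: raise collector self-log verbosity and mirror the logs pipeline
# to the debug exporter to see which records are actually re-read and exported.
config:
  exporters:
    debug:
      verbosity: basic
  service:
    telemetry:
      logs:
        level: debug               # verbose queue/retry/checkpoint activity in collector logs
    pipelines:
      logs:
        receivers: [filelog]
        processors: [memory_limiter, batch]
        exporters: [opensearch, debug]
```

The collector's own metrics on port 8888 (for example `otelcol_exporter_queue_size` and `otelcol_exporter_send_failed_log_records`) should also show whether data is queued and retried during the outage or dropped immediately.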