Component(s)
No response
Describe the issue you're reporting
filelog receiver advances checkpoints even when exporter (OpenSearch) is down, causing permanent log loss after backend restart
I am using OpenTelemetry Collector Contrib with the filelog receiver and opensearch exporter.
My application writes log files continuously, and the collector reads them from disk and sends them to OpenSearch.
However, when OpenSearch is unavailable for some time, logs generated during that downtime are permanently lost, even though:
- sending_queue is enabled
- retry_on_failure is enabled
- file_storage is configured
- checkpoints are persisted on disk

I expected the collector to resume sending all logs generated during the downtime once OpenSearch came back, but this does not happen.
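For reference, the durability-related wiring, extracted and trimmed from the full values file below, reduces to roughly this collector-config sketch (auth, other receivers, and unrelated pipelines omitted):

```yaml
# Trimmed sketch of the persistence-related pieces only (full Helm values below).
extensions:
  file_storage:
    directory: /var/lib/otelcol/checkpoints   # backed by a PersistentVolumeClaim
    create_directory: true

receivers:
  filelog:
    include:
      - /var/log/tadhak/tadhak-*.log
    start_at: beginning
    storage: file_storage            # persist file read offsets (checkpoints)

exporters:
  opensearch:
    http:
      endpoint: https://opensearch-cluster-master.open.svc.cluster.local:9200
    sending_queue:
      enabled: true
      storage: file_storage          # persistent sending queue on disk
      queue_size: 20000
    retry_on_failure:
      enabled: true
      max_elapsed_time: 0            # retry forever

service:
  extensions: [file_storage]         # health_check and basicauth omitted here
  pipelines:
    logs:
      receivers: [filelog]
      processors: [memory_limiter, batch]
      exporters: [opensearch]
```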
My OpenTelemetry Collector Helm chart values.yaml file:
```yaml
nameOverride: ""
fullnameOverride: ""
mode: "deployment"
namespaceOverride: ""

presets:
  logsCollection:
    enabled: false
    includeCollectorLogs: false
    storeCheckpoints: true
    maxRecombineLogSize: 102400
  hostMetrics:
    enabled: false
  kubernetesAttributes:
    enabled: false
    extractAllPodLabels: false
    extractAllPodAnnotations: false
  kubeletMetrics:
    enabled: false
  kubernetesEvents:
    enabled: false
  clusterMetrics:
    enabled: false

annotationDiscovery:
  logs:
    enabled: false
  metrics:
    enabled: false

configMap:
  create: true
  existingName: ""

internalTelemetryViaOTLP:
  endpoint: ""
  headers: []
  traces:
    enabled: false
    endpoint: ""
    headers: []
  metrics:
    enabled: false
    endpoint: ""
    headers: []
  logs:
    enabled: false
    endpoint: ""
    headers: []

config:
  exporters:
    opensearch:
      logs_index: ss4o_log-otel-timedloggermvc
      http:
        endpoint: https://opensearch-cluster-master.open.svc.cluster.local:9200
        auth:
          authenticator: basicauth/client
        tls:
          insecure_skip_verify: true
      sending_queue:
        enabled: true
        storage: file_storage
        queue_size: 20000
        num_consumers: 2
      retry_on_failure:
        enabled: true
        max_elapsed_time: 0 # retry forever
    debug: {}
  extensions:
    file_storage:
      directory: /var/lib/otelcol/checkpoints
      create_directory: true
    basicauth/client:
      client_auth:
        username: admin
        password: Tadhak**Dev02
    health_check:
      endpoint: ${env:MY_POD_IP}:13133
  processors:
    batch:
      send_batch_size: 50
      timeout: 5s
    memory_limiter:
      check_interval: 5s
      limit_percentage: 80
      spike_limit_percentage: 25
  receivers:
    filelog:
      include:
        - /var/log/tadhak/tadhak-*.log
      start_at: beginning
      include_file_name: true
      include_file_path: true
      delete_after_read: false
      storage: file_storage
      operators:
        - type: json_parser
          timestamp:
            parse_from: attributes.@t
            layout: '%Y-%m-%dT%H:%M:%S.%fZ'
    otlp:
      protocols:
        grpc:
          endpoint: ${env:MY_POD_IP}:4317
        http:
          endpoint: ${env:MY_POD_IP}:4318
    # if internalTelemetryViaOTLP.metrics.enabled = true, prometheus receiver will be removed
    prometheus:
      config:
        scrape_configs:
          - job_name: opentelemetry-collector
            scrape_interval: 10s
            static_configs:
              - targets:
                  - ${env:MY_POD_IP}:8888
    zipkin:
      endpoint: ${env:MY_POD_IP}:9411
  service:
    telemetry:
      metrics:
        readers:
          - pull:
              exporter:
                prometheus:
                  host: ${env:MY_POD_IP}
                  port: 8888
    extensions:
      - health_check
      - basicauth/client
      - file_storage
    pipelines:
      logs:
        receivers: [filelog]
        processors: [memory_limiter, batch] # 🔥 REQUIRED
        exporters: [opensearch]

alternateConfig: {}

image:
  repository: "otel/opentelemetry-collector-contrib"
  pullPolicy: IfNotPresent
  tag: "latest"
  digest: ""
imagePullSecrets: []

command:
  name: otelcol-contrib
  extraArgs:
    - "--feature-gates=filelog.allowFileDeletion"

serviceAccount:
  create: true
  annotations: {}
  name: ""
automountServiceAccountToken: true

clusterRole:
  create: false
  annotations: {}
  name: ""
  rules: []
  clusterRoleBinding:
    annotations: {}
    name: ""

podSecurityContext: {}
securityContext: {}
nodeSelector: {}
tolerations: []
affinity: {}
topologySpreadConstraints: []
priorityClassName: ""
runtimeClassName: ""
extraEnvs: []
extraEnvsFrom: []

extraVolumes:
  - name: log-volume
    persistentVolumeClaim:
      claimName: log-storage-pvc1 # Use Nginx logs PVC
  - name: otel-storage
    persistentVolumeClaim:
      claimName: otel-storage-pvc
  - name: cleanup-script
    configMap:
      name: log-cleanup-script
      defaultMode: 0777

extraVolumeMounts:
  - name: log-volume
    mountPath: /var/log/tadhak # Match Nginx log directory
    # readOnly: true # OpenTelemetry should only read logs
  - name: otel-storage
    mountPath: /var/lib/otelcol/checkpoints

extraManifests: []

ports:
  otlp:
    enabled: true
    containerPort: 4317
    servicePort: 4317
    hostPort: 4317
    protocol: TCP
    # nodePort: 30317
    appProtocol: grpc
  otlp-http:
    enabled: true
    containerPort: 4318
    servicePort: 4318
    hostPort: 4318
    protocol: TCP
  jaeger-compact:
    enabled: true
    containerPort: 6831
    servicePort: 6831
    hostPort: 6831
    protocol: UDP
  jaeger-thrift:
    enabled: true
    containerPort: 14268
    servicePort: 14268
    hostPort: 14268
    protocol: TCP
  jaeger-grpc:
    enabled: true
    containerPort: 14250
    servicePort: 14250
    hostPort: 14250
    protocol: TCP
  zipkin:
    enabled: true
    containerPort: 9411
    servicePort: 9411
    hostPort: 9411
    protocol: TCP
  metrics:
    # The metrics port is disabled by default. However you need to enable the port
    # in order to use the ServiceMonitor (serviceMonitor.enabled) or PodMonitor (podMonitor.enabled).
    enabled: false
    containerPort: 8888
    servicePort: 8888
    protocol: TCP

useGOMEMLIMIT: true
resources: {}
enableConfigChecksumAnnotation: true
podAnnotations: {}
podLabels: {}
additionalLabels: {}
hostNetwork: false
hostAliases: []
dnsPolicy: ""
dnsConfig: {}
schedulerName: ""
replicaCount: 1
revisionHistoryLimit: 10
annotations: {}
extraContainers: []
initContainers: []
lifecycleHooks: {}

livenessProbe:
  httpGet:
    port: 13133
    path: /
readinessProbe:
  httpGet:
    port: 13133
    path: /
startupProbe: {}

service:
  type: ClusterIP
  annotations: {}

ingress:
  enabled: false
  additionalIngresses: []

podMonitor:
  enabled: false
  metricsEndpoints:
    - port: metrics
  extraLabels: {}

serviceMonitor:
  enabled: false
  metricsEndpoints:
    - port: metrics
  extraLabels: {}
  relabelings: []
  metricRelabelings: []

podDisruptionBudget:
  enabled: false

autoscaling:
  enabled: false
  minReplicas: 1
  maxReplicas: 10
  behavior: {}
  targetCPUUtilizationPercentage: 80
  additionalMetrics: []

rollout:
  rollingUpdate: {}
  strategy: RollingUpdate

prometheusRule:
  enabled: false
  groups: []
  defaultRules:
    enabled: false
  extraLabels: {}

statefulset:
  volumeClaimTemplates: []
  podManagementPolicy: "Parallel"
  persistentVolumeClaimRetentionPolicy:
    enabled: false
    whenDeleted: Retain
    whenScaled: Retain

networkPolicy:
  enabled: false
  annotations: {}
  allowIngressFrom: []
  extraIngressRules: []
  egressRules: []

shareProcessNamespace: false
```
**Log Generation Pattern**

My application behaves like this:

- Generates a new log file every 3 minutes
- Writes 5 logs per minute

Example:

| File   | Logs  | Time      |
|--------|-------|-----------|
| File 1 | 1–15  | 3 minutes |
| File 2 | 16–20 | 1 minute  |

These logs are successfully ingested into OpenSearch initially.
**Failure Scenario**

- OpenSearch pod crashes and is unavailable for ~7 minutes
- During this downtime, my application continues writing logs:

| File   | Logs  |
|--------|-------|
| File 2 | 21–30 |
| File 3 | 31–45 |
| File 4 | 46–55 |

- OpenSearch comes back up
- Only logs written after the restart (e.g. 56–60) are sent to OpenSearch

Logs 21–55 are never ingested, even though:

- The files still exist on disk
- Checkpoints exist on disk
- Exporter retries are enabled
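To help narrow down where the records are lost (never re-read from the files, refused, or dropped from the queue), the collector's self-telemetry could be turned up. A minimal sketch of the extra settings, assuming the standard `service.telemetry` options and the `debug` exporter already declared in the config above (not yet applied in my setup):

```yaml
# Sketch only: raise collector self-log verbosity and mirror the logs pipeline
# to the debug exporter to see which records are actually re-read and exported.
config:
  exporters:
    debug:
      verbosity: basic
  service:
    telemetry:
      logs:
        level: debug               # verbose queue/retry/checkpoint activity in collector logs
    pipelines:
      logs:
        receivers: [filelog]
        processors: [memory_limiter, batch]
        exporters: [opensearch, debug]
```

The collector's own metrics on port 8888 (for example `otelcol_exporter_queue_size` and `otelcol_exporter_send_failed_log_records`) should also show whether data is queued and retried during the outage or dropped immediately.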