Description
What's wrong?
When a Kubernetes node running the Alloy agent is rebooted, the Alloy pod that comes back up fails to collect from Prometheus Operator PodMonitor and ServiceMonitor objects. Restarting the Alloy pod a second time restores functionality, but two restarts should not be required. This looks like it may be a race condition on reboot.
Steps to reproduce
- Reboot a Kubernetes node with a plain system-level reboot
- Once the node comes back, observe that the Alloy pod on that node is not collecting from Prometheus Operator objects
System information
Linux 5.14.0-427.37.1.el9_4.x86_64
Software version
v1.3.1
Configuration
alloy:
  configMap:
    content: |
      prometheus.remote_write "metrics_default" {
        external_labels = {
          agent       = "alloy",
          environment = env("ENV"),
          cluster     = env("CLUSTER"),
        }
        endpoint {
          name = env("prom_endpoint_1")
          url  = join(["http://", env("prom_endpoint_1"), ":9090/api/v1/write"], "")
          queue_config {
            max_samples_per_send = 1000
            max_shards           = 1
          }
          metadata_config { }
          write_relabel_config {
            action = "labeldrop"
            regex  = "^agent_hostname$"
          }
        }
        endpoint {
          name = env("prom_endpoint_2")
          url  = join(["http://", env("prom_endpoint_2"), ":9090/api/v1/write"], "")
          queue_config {
            max_samples_per_send = 1000
            max_shards           = 1
          }
          metadata_config { }
          write_relabel_config {
            action = "labeldrop"
            regex  = "^agent_hostname$"
          }
        }
      }

      discovery.kubernetes "pods" {
        role = "pod"
        selectors {
          role = "pod"
        }
      }

      discovery.relabel "pods" {
        targets = discovery.kubernetes.pods.targets
        rule {
          source_labels = ["__meta_kubernetes_pod_annotation_prometheus_io_scrape"]
          action        = "keep"
          regex         = "true"
        }
        rule {
          action = "labelmap"
          regex  = "__meta_kubernetes_pod_label_(.+)"
        }
        rule {
          source_labels = ["__meta_kubernetes_pod_annotation_prometheus_io_metric_path"]
          action        = "replace"
          target_label  = "__metrics_path__"
          regex         = "(.+)"
        }
        rule {
          source_labels = ["__address__", "__meta_kubernetes_pod_annotation_prometheus_io_port"]
          action        = "replace"
          regex         = "([^:]+)(?::\\d+)?;(\\d+)"
          replacement   = "$1:$2"
          target_label  = "__address__"
        }
        rule {
          source_labels = ["__meta_kubernetes_namespace"]
          action        = "replace"
          target_label  = "namespace"
        }
        rule {
          source_labels = ["__meta_kubernetes_pod_name"]
          action        = "replace"
          target_label  = "pod"
        }
      }

      prometheus.operator.podmonitors "primary" {
        forward_to = [prometheus.remote_write.metrics_default.receiver]
        clustering {
          enabled = true
        }
      }

      prometheus.operator.servicemonitors "primary" {
        forward_to = [prometheus.remote_write.metrics_default.receiver]
        clustering {
          enabled = true
        }
      }

      prometheus.scrape "k8s" {
        targets    = discovery.relabel.pods.output
        forward_to = [prometheus.remote_write.metrics_default.receiver]
        job_name   = "integrations/kubernetes-pods"
        clustering {
          enabled = true
        }
      }
Helm chart values:
alloy:
  configMap:
    create: true
  clustering:
    enabled: true
crds:
  create: false
controller:
  type: 'daemonset'
Logs
ts=2024-10-09T14:33:43.332318918Z level=error msg="error running crd manager" component_path=/ component_id=prometheus.operator.podmonitors.primary err="failed to configure informers: timeout exceeded while configuring informers. Check the connection to the Kubernetes API is stable and that Alloy has appropriate RBAC permissions for &{{ } { 0 0001-01-01 00:00:00 +0000 UTC <nil> <nil> map[] map[] [] [] []} { [] [] {map[] []} {false []} 0 0 0 0 0 <nil>}}"
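
A possible mitigation while the root cause is investigated: the prometheus.operator.* components accept an informer_sync_timeout argument (default "1m"), and the log above suggests the CRD manager gives up when informers fail to sync within that window during node boot. A minimal sketch, assuming that timeout is what is being hit; the "5m" value is an arbitrary example, not a verified fix:

prometheus.operator.podmonitors "primary" {
  forward_to = [prometheus.remote_write.metrics_default.receiver]
  // Assumption: give informers longer to sync while the node's
  // connectivity to the Kubernetes API settles after a reboot.
  informer_sync_timeout = "5m"
  clustering {
    enabled = true
  }
}

Even with a longer timeout, the component should arguably retry informer setup on its own rather than stay failed until the pod is restarted a second time.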