Kubernetes node reboot: Prometheus Operator CRDs not monitored on restart #1853

Open
@tman5

Description

What's wrong?

When a Kubernetes node running the Alloy agent is rebooted, Prometheus Operator PodMonitor and ServiceMonitor objects fail to be monitored once the Alloy pods restart. Restarting the Alloy pods a second time restores functionality, but two restarts should not be required. This looks like it could be a race condition on reboot.

Steps to reproduce

  • Reboot a Kubernetes node with a plain system-level reboot.
  • Once the node comes back, observe that the Alloy pod on that node is not collecting Prometheus Operator objects (a quick verification sketch follows below).
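
To check from the outside whether the problem is RBAC or CRD visibility, a couple of standard kubectl checks may help (the namespace and service-account names are placeholders for the real ones):

    # Can the Alloy service account list the Prometheus Operator CRDs?
    kubectl auth can-i list podmonitors.monitoring.coreos.com \
      --as=system:serviceaccount:<alloy-namespace>:<alloy-serviceaccount>
    kubectl auth can-i list servicemonitors.monitoring.coreos.com \
      --as=system:serviceaccount:<alloy-namespace>:<alloy-serviceaccount>

    # Are the CRD objects visible at all?
    kubectl get podmonitors,servicemonitors --all-namespaces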

System information

Linux 5.14.0-427.37.1.el9_4.x86_64

Software version

v1.3.1

Configuration

alloy:
    configMap:
        content: |
            prometheus.remote_write "metrics_default" {
                external_labels = {
                    agent = "alloy",
                    environment = env("ENV"),
                    cluster = env("CLUSTER"),
                }

                endpoint {
                    name = env("prom_endpoint_1")
                    url  = join(["http://", env("prom_endpoint_1"), ":9090/api/v1/write"], "")

                    queue_config {
                        max_samples_per_send = 1000
                        max_shards = 1
                    }

                    metadata_config { }

                    write_relabel_config {
                        action = "labeldrop"
                        regex = "^agent_hostname$"
                    }
                }

                endpoint {
                    name = env("prom_endpoint_2")
                    url  = join(["http://", env("prom_endpoint_2"), ":9090/api/v1/write"], "")

                    queue_config {
                        max_samples_per_send = 1000
                        max_shards = 1
                    }

                    metadata_config { }

                    write_relabel_config {
                        action = "labeldrop"
                        regex = "^agent_hostname$"
                    }
                }
            }

            discovery.kubernetes "pods" {
                role = "pod"
                selectors {
                    role = "pod"
                }
            }

            discovery.relabel "pods" {
                targets = discovery.kubernetes.pods.targets

                rule {
                    source_labels = ["__meta_kubernetes_pod_annotation_prometheus_io_scrape"]
                    action = "keep"
                    regex = "true"
                }

                rule {
                    action = "labelmap"
                    regex = "__meta_kubernetes_pod_label_(.+)"
                }

                rule {
                    source_labels = ["__meta_kubernetes_pod_annotation_prometheus_io_metric_path"]
                    action = "replace"
                    target_label = "__metrics_path__"
                    regex = "(.+)"
                }

                rule {
                    source_labels = ["__address__", "__meta_kubernetes_pod_annotation_prometheus_io_port"]
                    action        = "replace"
                    regex         = "([^:]+)(?::\\d+)?;(\\d+)"
                    replacement   = "$1:$2"
                    target_label  = "__address__"
                }

                rule {
                    source_labels = ["__meta_kubernetes_namespace"]
                    action        = "replace"
                    target_label  = "namespace"
                }

                rule {
                    source_labels = ["__meta_kubernetes_pod_name"]
                    action        = "replace"
                    target_label  = "pod"
                }
            }

            prometheus.operator.podmonitors "primary" {
                forward_to = [prometheus.remote_write.metrics_default.receiver]
                clustering {
                    enabled = true
                }
            }

            prometheus.operator.servicemonitors "primary" {
                forward_to = [prometheus.remote_write.metrics_default.receiver]
                clustering {
                    enabled = true
                }
            }

            prometheus.scrape "k8s" {
                targets    = discovery.relabel.pods.output
                forward_to = [prometheus.remote_write.metrics_default.receiver]
                job_name = "integrations/kubernetes-pods"
                clustering {
                    enabled = true
                }
            }
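
If this is a startup race, one possible mitigation (a sketch, untested here) is the informer_sync_timeout argument accepted by the prometheus.operator.* components, which defaults to "1m"; raising it gives the informers longer to reach the API server after a reboot:

prometheus.operator.podmonitors "primary" {
    forward_to            = [prometheus.remote_write.metrics_default.receiver]
    // Assumption: a longer informer sync timeout rides out slow API-server
    // reachability right after the node reboot.
    informer_sync_timeout = "5m"

    clustering {
        enabled = true
    }
}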

Helm chart values:

alloy:
  configMap:
    create: true
  clustering:
    enabled: true
crds:
  create: false
controller:
  type: 'daemonset'
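
For completeness, the configuration above reads ENV, CLUSTER, prom_endpoint_1, and prom_endpoint_2 from the environment; with this chart they can be supplied through alloy.extraEnv (the values below are illustrative, not the real ones):

alloy:
  extraEnv:
    - name: ENV
      value: 'production'          # illustrative
    - name: CLUSTER
      value: 'cluster-01'          # illustrative
    - name: prom_endpoint_1
      value: 'prom-1.example.com'  # illustrative
    - name: prom_endpoint_2
      value: 'prom-2.example.com'  # illustrative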

Logs

ts=2024-10-09T14:33:43.332318918Z level=error msg="error running crd manager" component_path=/ component_id=prometheus.operator.podmonitors.primary err="failed to configure informers: timeout exceeded while configuring informers. Check the connection to the Kubernetes API is stable and that Alloy has appropriate RBAC permissions for &{{ } {      0 0001-01-01 00:00:00 +0000 UTC <nil> <nil> map[] map[] [] [] []} { [] [] {map[] []} {false []} 0 0 0 0 0 <nil>}}"
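
The error points at either API-server connectivity or RBAC. RBAC seems unlikely to be the root cause here, since a second restart fixes the problem without any permission change, but for reference this is roughly the access the prometheus.operator.* components need (a sketch; the chart's default ClusterRole should already grant it when rbac.create is true):

apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: alloy   # illustrative name; the chart manages the real ClusterRole
rules:
  - apiGroups: ['monitoring.coreos.com']
    resources: ['podmonitors', 'servicemonitors', 'probes']
    verbs: ['get', 'list', 'watch']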
