False positive status reason by timing issues in alert rules #1410

Open
TeodorSAP opened this issue Aug 30, 2024 · 2 comments
Labels
area/logs LogPipeline area/metrics MetricPipeline area/traces TracePipeline kind/bug Categorizes issue or PR as related to a bug.

Comments

Member

TeodorSAP commented Aug 30, 2024

We observed single data points where the health status was reported as negative although the pipeline was healthy. The main cause is that the status evaluation is based on two rule evaluations, which can run at slightly different points in time. The rule for beginning ingestion might already have evaluated to true while the rule for starting export has not been evaluated yet, so for a short time window the status is seen as "no export possible".
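
The timing window described above can be reproduced with a small simulation. This is only a sketch under stated assumptions, not the project's actual code: the time ticks, the `startAt` constant, and all function names are hypothetical. It assumes ingestion and export begin at the same instant, but the export rule result that the code combines lags the ingestion rule result by a few ticks.

```go
package main

import "fmt"

// startAt is the (hypothetical) instant at which ingestion and export both begin.
const startAt = 10

// signalActive reports whether a signal that begins at startAt is visible at time t.
func signalActive(t int) bool { return t >= startAt }

// combinedUnhealthy mimics code that joins two separately evaluated rule results:
// "data is being ingested" AND NOT "data is being exported".
func combinedUnhealthy(ingestEvalTime, exportEvalTime int) bool {
	return signalActive(ingestEvalTime) && !signalActive(exportEvalTime)
}

// falsePositives counts evaluation ticks in [0, horizon) that report "unhealthy"
// when the export rule result lags the ingestion rule result by skew ticks.
func falsePositives(skew, horizon int) int {
	count := 0
	for t := 0; t < horizon; t++ {
		if combinedUnhealthy(t, t-skew) {
			count++
		}
	}
	return count
}

func main() {
	fmt.Println("skew=2:", falsePositives(2, 20)) // two ticks wrongly reported as unhealthy
	fmt.Println("skew=0:", falsePositives(0, 20)) // same-instant evaluation: none
}
```

With a skew of zero, which models both conditions being evaluated at the same instant as a single query would do, the false positives disappear in this model.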

The alerting for the LogPipeline was already refactored (#1397), and no false positives have been observed there since.

The existing self-monitor alerts, based on PromQL queries, should be refactored such that:

  • A single PromQL query (i.e. alert) is mapped to each unhealthy condition (i.e. no Go code is involved in logically evaluating the firing alerts)
  • The `for` clause is used to avoid timing issues (where justified)
  • Alerts in firing state should strictly be used for negative/unhealthy scenarios

This refactoring should follow the changes introduced in: #1397

@TeodorSAP TeodorSAP changed the title Refactor existing SelfMon / Prometheus alerts Refactor existing Self-Monitoring (Prometheus) alerts Aug 30, 2024
@TeodorSAP TeodorSAP added area/logs LogPipeline area/metrics MetricPipeline area/traces TracePipeline kind/chore Categorizes issue or PR as related to a chore. labels Aug 30, 2024
@TeodorSAP TeodorSAP changed the title Refactor existing Self-Monitoring (Prometheus) alerts Refactor existing Self-Monitor (Prometheus) alerts Aug 30, 2024

This issue has been automatically marked as stale due to the lack of recent activity. It will soon be closed if no further activity occurs.
Thank you for your contributions.

@github-actions github-actions bot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Oct 30, 2024

github-actions bot commented Nov 7, 2024

This issue has been automatically closed due to the lack of recent activity.
/lifecycle rotten

@github-actions github-actions bot closed this as not planned Won't fix, can't repro, duplicate, stale Nov 7, 2024
@kyma-bot kyma-bot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Nov 7, 2024
@a-thaler a-thaler removed the lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. label Nov 7, 2024
@a-thaler a-thaler reopened this Nov 7, 2024
@hisarbalik hisarbalik self-assigned this Dec 19, 2024
@a-thaler a-thaler added kind/bug Categorizes issue or PR as related to a bug. and removed kind/chore Categorizes issue or PR as related to a chore. labels Dec 20, 2024
@a-thaler a-thaler changed the title Refactor existing Self-Monitor (Prometheus) alerts False positive status reason by timing issues in alert rules Dec 20, 2024
@hisarbalik hisarbalik assigned skhalash and unassigned hisarbalik Jan 13, 2025
5 participants