Proposal: mimir.alertmanagerconfigs.kubernetes component to handle prometheus operator's AlertmanagerConfig CRD. #4216

Closed
captncraig opened this issue Jun 20, 2023 · 6 comments
Labels: frozen-due-to-age, proposal

Comments

@captncraig
Contributor

Background

Prometheus Operator introduced the AlertmanagerConfig CRD to manage alert manager routes and receivers.

We should investigate the possibility of implementing this as a flow component similar to mimir.rules.kubernetes.

Proposal

It is not immediately clear to me whether this is a simple task. The goal would be for a user to be able to sync Alertmanager configs to a local Mimir instance or to Grafana Cloud.

What does prometheus operator do with these CRDs?

Prometheus Operator deploys an instance of Alertmanager with a fully rendered config. At reconcile time, all relevant AlertmanagerConfigs and config secrets are merged together to generate that config (code).

What APIs could we use?

Plain Alertmanager does not have any APIs for managing routes or receivers. It only has a reload endpoint. Mimir does have some potentially useful APIs.

For mimir.rules.kubernetes, we are able to use the standard mimir apis for managing rules. Those apis have granularity to add, update, or delete individual rules, and mimir.rules.kubernetes uses them to allow you to mix and match manually-added and crd-defined rules.
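To illustrate why granular APIs make the mix-and-match approach workable, here is a minimal Python sketch of the kind of diff a sync loop can compute. It is not the actual mimir.rules.kubernetes code: the function name, the `(namespace, group)` key shape, and the `agent/` ownership prefix are all assumptions made for the example.

```python
from typing import Dict, List, Tuple

# Hypothetical model: rule groups keyed by (namespace, group name).
GroupKey = Tuple[str, str]


def diff_rule_groups(
    desired: Dict[GroupKey, dict],
    actual: Dict[GroupKey, dict],
    managed_prefix: str = "agent/",  # assumed marker for component-owned namespaces
) -> Dict[str, List[GroupKey]]:
    """Return which rule groups to create, update, or delete.

    Only groups whose namespace starts with managed_prefix count as owned,
    so manually added groups elsewhere are never updated or deleted.
    The desired groups are assumed to live in managed namespaces.
    """
    owned = {k: v for k, v in actual.items() if k[0].startswith(managed_prefix)}
    create = sorted(k for k in desired if k not in owned)
    update = sorted(k for k in desired if k in owned and desired[k] != owned[k])
    delete = sorted(k for k in owned if k not in desired)
    return {"create": create, "update": update, "delete": delete}
```

Each entry in the resulting plan maps to one call against a per-group endpoint, which is exactly the granularity the rest of this comment finds missing for Alertmanager configs.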

Mimir's alertmanager APIs, though, only have "get" and "update" endpoints. That makes it much harder to manage routes dynamically. Either:

  • CRDs are the only mechanism to configure alert manager configs. Any other changes will get overwritten.
  • This component would need to do a complicated compare-and-merge operation on the existing configs. It's not clear to me how cleanly this can be done, and I suspect things like deleted CRDs could leave remnants behind if we don't have a good cleanup mechanism.

I really don't love either of those options. With rules, we have unique IDs and granular apis. With alertmanager configs, it feels much harder to replicate the same paradigm Prometheus Operator provides.
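For what it's worth, the compare-and-merge option could look roughly like the Python sketch below. It assumes an ownership convention that nothing in Mimir or Alertmanager provides: every receiver created from a CRD gets a hypothetical `crd/` name prefix, and each sync drops all prefixed entries before adding the current ones. The function name and the prefix are illustrative only.

```python
import copy

MANAGED_PREFIX = "crd/"  # hypothetical marker for CRD-owned receivers


def merge_managed(existing: dict, crd_receivers: list, crd_routes: list) -> dict:
    """Replace CRD-owned receivers and top-level routes, keep everything else."""
    merged = copy.deepcopy(existing)  # never mutate the fetched config
    route = merged.setdefault("route", {})

    # Drop whatever a previous sync created, so deleted CRDs leave no remnants.
    kept_receivers = [
        r for r in merged.get("receivers", [])
        if not r["name"].startswith(MANAGED_PREFIX)
    ]
    kept_routes = [
        r for r in route.get("routes", [])
        if not r.get("receiver", "").startswith(MANAGED_PREFIX)
    ]

    merged["receivers"] = kept_receivers + copy.deepcopy(crd_receivers)
    # Placing CRD routes first is an arbitrary choice made for this sketch.
    route["routes"] = copy.deepcopy(crd_routes) + kept_routes
    return merged
```

Even this simple version has gaps: it only looks at top-level routes, it breaks if someone hand-writes a route pointing at a `crd/` receiver, and two writers doing get-then-update at the same time can still clobber each other.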

@captncraig
Contributor Author

In summary, Prometheus Operator is able to support AlertManagerConfigs only because it is also deploying AlertManager, and has exclusive control over the configs. Agent is not going to be in charge of deploying workloads, and so is not going to have exclusive control of any alertmanager configs from external Alert Managers.

I would vote that this CRD is not as good a match as the other Prometheus Operator CRDs we handle (ServiceMonitor, PodMonitor, Probe, and PrometheusRule), and we should not take this on.

If anyone more familiar with the alertmanager/mimir codebases or apis has better context or other reasons this would work, I would love to hear them.

@captncraig
Contributor Author

Although, if we want to assume this component has full control over an AlertManager instance, and are ok overwriting everything else, we could make a component that:

  • Finds all AlertManagerConfigs matching some set of selectors.
  • Merges them together into a single YAML blob.
  • Posts that to the Mimir API.
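A rough Python sketch of the first two steps, under heavy simplifications: the CRD `spec` is treated as if it were already in Alertmanager's native format (the real CRD uses different field names and structured matchers), nested routes are ignored, and the selector is a plain label equality match. The function names, the `null` default receiver, and the `namespace/name/receiver` naming scheme are assumptions for illustration. Serializing the result to YAML and posting it to Mimir is left out.

```python
def matches(labels: dict, selector: dict) -> bool:
    """True if every selector key/value pair is present in labels."""
    return all(labels.get(k) == v for k, v in selector.items())


def build_config(crds: list, selector: dict, default_receiver: str = "null") -> dict:
    """Merge all selected AlertmanagerConfig-like objects into one config dict."""
    receivers = [{"name": default_receiver}]
    routes = []
    # Sort for a deterministic result regardless of listing order.
    ordered = sorted(
        crds, key=lambda c: (c["metadata"]["namespace"], c["metadata"]["name"])
    )
    for crd in ordered:
        meta = crd["metadata"]
        if not matches(meta.get("labels", {}), selector):
            continue
        # Prefix receiver names so objects from different namespaces cannot collide.
        prefix = f'{meta["namespace"]}/{meta["name"]}/'
        spec = crd["spec"]
        for rcv in spec.get("receivers", []):
            receivers.append({**rcv, "name": prefix + rcv["name"]})
        route = dict(spec["route"])
        route["receiver"] = prefix + route["receiver"]
        # Scope each route to alerts from its own namespace.
        route["matchers"] = [f'namespace="{meta["namespace"]}"'] + list(
            route.get("matchers", [])
        )
        routes.append(route)
    return {
        "route": {"receiver": default_receiver, "routes": routes},
        "receivers": receivers,
    }
```

Because the output is built from scratch on every sync, anything configured outside the CRDs is lost when it is posted, which is the "overwriting everything else" trade-off described above.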

That is a possibility, but it strikes me as a bit dangerous, and maybe not how we want to promote people configuring cloud alerting.

@davidspek
Contributor

I've actually made an attempt at implementing this but I haven't had the time to finish it. For anybody wanting to take this on feel free to have a look at https://github.com/pluralsh/grafana-agent/tree/alertmanager-configs and take any code that might be useful.

@davidspek
Contributor

FYI, this issue is somewhat of a duplicate of grafana/alloy#504.

@davidspek
Contributor

@captncraig @rfratto Is this already being worked on or is there a timeline for it?

@captncraig
Contributor Author

Yes, this is a duplicate of grafana/alloy#504. I will close this in favor of that one. We have not committed to implementing this in any particular milestone, but any contribution would be welcome.

@github-project-automation github-project-automation bot moved this from Todo to Done in Grafana Agent (Public) Sep 5, 2023
@github-actions github-actions bot added the frozen-due-to-age Locked due to a period of inactivity. Please open new issues or PRs if more discussion is needed. label Feb 21, 2024
@github-actions github-actions bot locked as resolved and limited conversation to collaborators Feb 21, 2024