Proposal: `mimir.alertmanagerconfigs.kubernetes` component to handle Prometheus Operator's `AlertmanagerConfig` CRD. #4216

captncraig · 2023-06-20T21:46:02Z

Background

Prometheus Operator introduced the AlertmanagerConfig CRD to manage Alertmanager routes and receivers.

We should investigate the possibility of implementing this as a Flow component similar to `mimir.rules.kubernetes`.

Proposal

It is not immediately clear to me whether this is a simple task. The goal would be for a user to be able to sync Alertmanager configs to a local Mimir instance or to Grafana Cloud.

What does Prometheus Operator do with these CRDs?

Prometheus Operator deploys an Alertmanager instance with a fully manifested config. At reconcile time, it merges all relevant AlertmanagerConfigs and config secrets together to generate that config (see the operator's source code).

What APIs could we use?

Standalone Alertmanager has no APIs for managing routes and similar settings; it only has a reload endpoint. Mimir does have some potentially useful APIs.

For `mimir.rules.kubernetes`, we are able to use the standard Mimir APIs for managing rules. Those APIs are granular enough to add, update, or delete individual rule groups, and `mimir.rules.kubernetes` uses them to let you mix and match manually added and CRD-defined rules.

Mimir's Alertmanager APIs, though, only have "get" and "update" endpoints. That makes it much harder to manage routes dynamically. Either:

CRDs are the only mechanism for configuring Alertmanager, and any other changes get overwritten; or
this component performs a complicated compare-and-merge operation on the existing config. I'm not sure how cleanly that can be done, and I suspect deleted CRDs could leave remnants behind unless we have a good cleanup mechanism.
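The second option amounts to a read-modify-write loop over the get and update endpoints. Below is a minimal Go sketch under stated assumptions:

* A hypothetical naming convention (`ownedPrefix`) marks every receiver this component writes, so it can find and delete remnants of removed CRDs.
* `reconcile` is a hypothetical name.
* A plain list of receiver names stands in for the full config.

```go
package main

import (
	"fmt"
	"strings"
)

// ownedPrefix is a hypothetical marker: every receiver this component
// writes gets this prefix, so its own entries can be told apart from
// hand-written ones on the next reconcile.
const ownedPrefix = "crd/"

// reconcile takes the config fetched via the "get" endpoint, drops
// everything this component wrote previously, and appends what the
// currently existing CRDs want. The result would be sent back via the
// "update" endpoint. Receivers from deleted CRDs disappear because
// they are never re-added.
func reconcile(existing, desired []string) []string {
	out := make([]string, 0, len(existing)+len(desired))
	for _, r := range existing {
		if !strings.HasPrefix(r, ownedPrefix) {
			out = append(out, r) // manually added: keep untouched
		}
	}
	for _, r := range desired {
		out = append(out, ownedPrefix+r)
	}
	return out
}

func main() {
	fetched := []string{"manual-email", "crd/team-a/old/slack"}
	next := reconcile(fetched, []string{"team-a/new/slack"})
	fmt.Println(next)
}
```

Routes would need the same ownership marking. Also, the fetch and the update are separate calls, so a manual edit made between them could still be overwritten.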

I really don't love either of those options. With rules, we have unique IDs and granular APIs. With Alertmanager configs, it feels much harder to replicate the paradigm Prometheus Operator provides.


captncraig · 2023-06-21T17:45:31Z

In summary, Prometheus Operator can support AlertmanagerConfigs only because it also deploys Alertmanager and has exclusive control over its config. Grafana Agent is not in charge of deploying workloads, so it will not have exclusive control over the config of an external Alertmanager.

I would vote that this CRD is not as good a match as the other Prometheus Operator CRDs we handle (ServiceMonitor, PodMonitor, Probe, and PrometheusRule), and we should not take this on.

If anyone more familiar with the Alertmanager/Mimir codebases or APIs has better context, or other reasons this would work, I would love to hear them.

captncraig · 2023-06-21T17:59:16Z

That said, if we are willing to assume this component has full control over an Alertmanager instance, and we are OK with overwriting everything else, we could make a component that:

Finds all AlertmanagerConfigs matching some set of selectors.
Merges them together into a single YAML blob.
Posts that to the Mimir API.

That is a possibility, but it strikes me as a bit dangerous, and maybe not how we want to encourage people to configure alerting in Grafana Cloud.

davidspek · 2023-07-10T14:05:17Z

I've actually made an attempt at implementing this, but I haven't had the time to finish it. Anybody who wants to take this on should feel free to look at https://github.com/pluralsh/grafana-agent/tree/alertmanager-configs and take any code that might be useful.

davidspek · 2023-07-10T14:06:45Z

FYI, this issue is somewhat of a duplicate of grafana/alloy#504.

davidspek · 2023-09-05T18:10:03Z

@captncraig @rfratto Is this already being worked on or is there a timeline for it?

captncraig · 2023-09-05T18:12:29Z

Yes, this is a duplicate of grafana/alloy#504. I will close this in favor of that one. We have not committed to implementing this in any particular milestone, but any contribution would be welcome.

captncraig added the proposal Proposal or RFC label Jun 20, 2023

grafanabot added this to Grafana Agent (Public) Jun 20, 2023

github-project-automation bot moved this to Todo in Grafana Agent (Public) Jun 20, 2023

rfratto added the type/infrastructure label Jun 21, 2023

captncraig closed this as completed Sep 5, 2023

github-project-automation bot moved this from Todo to Done in Grafana Agent (Public) Sep 5, 2023

github-actions bot added the frozen-due-to-age Locked due to a period of inactivity. Please open new issues or PRs if more discussion is needed. label Feb 21, 2024

github-actions bot locked as resolved and limited conversation to collaborators Feb 21, 2024
