Commit 0b031cb

Merge pull request #101 from Kuadrant/gateway-api-metrics-exporter
RFC for gateway api metrics exporter
2 parents 6c9da7c + f58087d commit 0b031cb

1 file changed

+197
-0
lines changed

# Gateway API Metrics Exporter

- Feature Name: gateway-api-metrics-exporter
- Start Date: 2024-07-26
- RFC PR: [Kuadrant/architecture#0000](https://github.com/Kuadrant/architecture/pull/0000)
- Issue tracking: [kuadrant-operator/issues/675](https://github.com/Kuadrant/kuadrant-operator/issues/675)

# Summary
[summary]: #summary

A new sub-project for a [Prometheus exporter](https://prometheus.io/docs/instrumenting/writing_exporters/) that exports metrics about the state of [Gateway API resources](https://gateway-api.sigs.k8s.io/api-types/gateway/) in a Kubernetes cluster.

# Motivation
[motivation]: #motivation

Allow additional stateful information about Gateway API resources to be made available via metrics.
Currently, a set of metrics is made available via the [gateway-api-state-metrics](https://github.com/Kuadrant/gateway-api-state-metrics/blob/main/METRICS.md) project.
However, there are [limitations](https://github.com/Kuadrant/gateway-api-state-metrics/issues/1) on what resource information can be exposed using the underlying [kube-state-metrics](https://github.com/kubernetes/kube-state-metrics) project.
Additional stateful information would include:

- Individual listener status within a Gateway
- HTTPRoute status

For example, the individual listener status information from a Gateway:

```yaml
status:
  listeners:
  - attachedRoutes: 1
    conditions:
    - lastTransitionTime: "2023-08-15T13:22:06Z"
      message: No errors found
      observedGeneration: 1
      reason: Ready
      status: "True"
      type: Ready
    - lastTransitionTime: "2023-08-15T13:22:06Z"
      message: No errors found
      observedGeneration: 1
      reason: ResolvedRefs
      status: "True"
      type: ResolvedRefs
    name: api
```

and HTTPRoute parents status conditions:

```yaml
status:
  parents:
  - conditions:
    - lastTransitionTime: "2024-05-16T16:17:38Z"
      message: Object affected by AuthPolicy default/toystore
      observedGeneration: 1
      reason: Accepted
      status: "True"
      type: kuadrant.io/AuthPolicyAffected
    - lastTransitionTime: "2024-05-16T16:18:51Z"
      message: Object affected by RateLimitPolicy default/toystore
      observedGeneration: 1
      reason: Accepted
      status: "True"
      type: kuadrant.io/RateLimitPolicyAffected
    controllerName: kuadrant.io/policy-controller
    parentRef:
      group: gateway.networking.k8s.io
      kind: Gateway
      name: api-gateway
      namespace: kuadrant-system
  - conditions:
    - lastTransitionTime: "2024-05-16T16:17:38Z"
      message: Object affected by AuthPolicy default/toystore
      observedGeneration: 1
      reason: Accepted
      status: "True"
      type: kuadrant.io/AuthPolicyAffected
    - lastTransitionTime: "2024-05-16T16:18:51Z"
      message: Object affected by RateLimitPolicy default/toystore
      observedGeneration: 1
      reason: Accepted
      status: "True"
      type: kuadrant.io/RateLimitPolicyAffected
    - lastTransitionTime: "2024-05-20T11:45:33Z"
      message: Route was valid
      observedGeneration: 1
      reason: Accepted
      status: "True"
      type: Accepted
    - lastTransitionTime: "2024-05-20T11:45:33Z"
      message: All references resolved
      observedGeneration: 1
      reason: ResolvedRefs
      status: "True"
      type: ResolvedRefs
    controllerName: istio.io/gateway-controller
    parentRef:
      group: gateway.networking.k8s.io
      kind: Gateway
      name: api-gateway
      namespace: kuadrant-system
```

# Guide-level explanation
[guide-level-explanation]: #guide-level-explanation

First, implement the existing Gateway API metrics that are part of the [gateway-api-state-metrics](https://github.com/Kuadrant/gateway-api-state-metrics/blob/main/METRICS.md) project.
This will allow the new Prometheus exporter to replace that project as a whole.
The metrics will be backwards compatible with the gateway-api-state-metrics project, with the addition of new metrics.

Second, implement new metrics, as per the examples below, to capture the additional status information:

```promql
gatewayapi_gateway_status_listeners_conditions{namespace="<NAMESPACE>",name="<GATEWAY>",listener_name="<LISTENER_NAME>",type="<ResolvedRefs|Ready|Other>"} 1
```
This metric captures the status condition types of individual listeners in a Gateway.
It will allow the status of individual named listeners to be queried, graphed and alerted on via metrics.
The health status of listeners can then be visualised in a stat panel in Grafana, showing healthy and unhealthy listeners.
This expands the debugging path beyond just the overall health of a Gateway.

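As a rough illustration of how listener conditions could map onto `gatewayapi_gateway_status_listeners_conditions`, the following Go sketch renders one sample line per listener condition in the Prometheus text format. The `Condition` and `ListenerStatus` types and the `listenerConditionSamples` function are hypothetical simplifications, not Gateway API or client-go types, and emitting 0 when a condition's status is not `"True"` is an assumption that this RFC does not spell out.

```go
package main

import (
	"fmt"
	"sort"
)

// Condition and ListenerStatus are simplified, hypothetical stand-ins for the
// Gateway API status types. A real exporter would read these via client-go.
type Condition struct {
	Type   string
	Status string // "True", "False" or "Unknown"
}

type ListenerStatus struct {
	Name       string
	Conditions []Condition
}

// listenerConditionSamples returns one text-format sample per listener
// condition. Assumption: the value is 1 when the status is "True", else 0.
// %q is adequate for simple label values such as Kubernetes names.
func listenerConditionSamples(namespace, gateway string, listeners []ListenerStatus) []string {
	var out []string
	for _, l := range listeners {
		for _, c := range l.Conditions {
			value := 0
			if c.Status == "True" {
				value = 1
			}
			out = append(out, fmt.Sprintf(
				"gatewayapi_gateway_status_listeners_conditions{namespace=%q,name=%q,listener_name=%q,type=%q} %d",
				namespace, gateway, l.Name, c.Type, value))
		}
	}
	sort.Strings(out) // stable output order between scrapes
	return out
}

func main() {
	listeners := []ListenerStatus{{
		Name: "api",
		Conditions: []Condition{
			{Type: "Ready", Status: "True"},
			{Type: "ResolvedRefs", Status: "True"},
		},
	}}
	for _, line := range listenerConditionSamples("kuadrant-system", "api-gateway", listeners) {
		fmt.Println(line)
	}
}
```

The same shape would apply to HTTPRoute parent conditions, with the extra controller and parent reference labels added to each sample.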
```promql
gatewayapi_httproute_status_parents_conditions{namespace="<NAMESPACE>",name="<HTTPROUTE>",controller_name="<CONTROLLER_NAME>",parent_group="<PARENT_GROUP>",parent_kind="<PARENT_KIND>",parent_name="<PARENT_NAME>",parent_namespace="<PARENT_NAMESPACE>",type="<ResolvedRefs|Accepted|Other>"} 1
```
This metric captures the condition types for each parent of an HTTPRoute.
The `type` label also records any custom types set by a controller, for example `kuadrant.io/AuthPolicyAffected` and `kuadrant.io/RateLimitPolicyAffected`.
This will allow the health of HTTPRoutes to be reported via metrics. An HTTPRoute that has a `type` of `Accepted` and a value of 1 is accepted by the Gateway and can be considered healthy.
It will also allow policy-specific information about an HTTPRoute to be represented in metrics.
For example, alerting on any HTTPRoutes that don't have the `kuadrant.io/AuthPolicyAffected` type with a value of 1, i.e. HTTPRoutes without an AuthPolicy.

Tests will be added directly to the project in a similar manner to the [redis-exporter](https://github.com/oliver006/redis_exporter/blob/master/exporter/http_test.go).
The test environment will bring up a kind cluster, create the Gateway API CRDs and example Gateway & HTTPRoute resources, then test the scrape endpoint.
This will be the same as how [metrics are tested](https://github.com/Kuadrant/gateway-api-state-metrics/blob/main/tests/e2e.sh) for the gateway-api-state-metrics project.
There is a [separate test function](https://github.com/Kuadrant/gateway-api-state-metrics/blob/main/tests/e2e/main_test.go) for each resource.

Existing [example dashboards](https://github.com/Kuadrant/gateway-api-state-metrics/tree/main/src/dashboards) in the gateway-api-state-metrics project will be copied over to the exporter project and continue to work as before.
However, initially it will just be the Gateway, GatewayClass and HTTPRoute dashboards, as those are the metrics that will be implemented first.

# Reference-level explanation
[reference-level-explanation]: #reference-level-explanation

The exporter will be written in Go and follow the guidelines from https://prometheus.io/docs/instrumenting/writing_exporters/.
Other exporters, such as https://github.com/prometheus/node_exporter/tree/master and https://github.com/oliver006/redis_exporter, will be referenced for patterns and library usage.
Metrics will only be pulled from the Kubernetes API when Prometheus scrapes the exporter.
That is, the exporter will not perform scrapes based on its own timers.
All scrapes will be synchronous.


The [client-go](https://github.com/kubernetes/client-go) library will be used for all Kubernetes API calls.
As the number of Gateways and HTTPRoutes could vary greatly, there is a performance consideration with these API calls if there are a lot of resources.
To allow for this, a single list call for all resources of a kind will be used rather than fetching resources one by one.
If there are performance issues in future, there is an option to cache responses to expensive queries.


A `gateway_metrics_up` metric will be included, as per https://prometheus.io/docs/instrumenting/writing_exporters/#failed-scrapes,
so that the exporter can continue to respond in a standard way if there are issues with some aspects of scraping.
The scrape response should include all metrics that have information available at the time of scraping.

# Drawbacks
[drawbacks]: #drawbacks

This is an additional project to maintain.
However, it will supersede the gateway-api-state-metrics project and remove the dependency on kube-state-metrics.

# Rationale and alternatives
[rationale-and-alternatives]: #rationale-and-alternatives

In theory, it should be possible to get the desired functionality from the kube-state-metrics project if the change proposed in https://github.com/kubernetes/kube-state-metrics/pull/2059 is accepted and subsequently implemented.
However, that proposal has been open since May 2023.
I have ruled out the possibility of helping with the implementation of this change in that project due to:

- lack of detailed knowledge of the current implementation of selectors in kube-state-metrics
- potential complexity of implementing generic CEL support in that project
- it not being core to the Kuadrant project goals, combined with the ongoing maintenance commitment after implementation. It wouldn't be fair to land the change we need without following up on maintenance afterwards

The proposed design is a solution more focused on what the Kuadrant project needs from Gateway API resources in the form of metrics.
There are plenty of existing exporters that we can reference in order to follow established patterns.

If we don't make this change, we are limited to having just the overall Gateway status available via metrics,
and no HTTPRoute status information to visualise and alert on.

# Prior art
[prior-art]: #prior-art

The primary prior art is the kube-state-metrics and gateway-api-state-metrics projects.
The gateway-api-state-metrics project uses the [CustomResourceStateMetrics](https://github.com/Kuadrant/gateway-api-state-metrics/blob/main/config/default/custom-resource-state.yaml) configuration feature of kube-state-metrics to configure which fields in which resources should be made available via metrics.

# Unresolved questions
[unresolved-questions]: #unresolved-questions

n/a

# Future possibilities
[future-possibilities]: #future-possibilities

Although this exporter is intended to replace the gateway-api-state-metrics project,
it will likely take a phased approach to get to that point.
The initial goal is to get 'like for like' functionality from a Kuadrant project point of view (Gateways, GatewayClasses and HTTPRoutes),
followed by the new status functionality as detailed in this RFC.
Other resources, such as TLSRoute and UDPRoute, can be added later.
