Description
What is the problem you are trying to solve?
We are currently working to reduce the number of active metric series of the observability platform at Maersk. As part of this effort, we need a way to retrieve all metric names and, ideally, rank them by series count.
Knowing the series count of each metric is important as it allows us to do things like:
- Focusing on the highest cardinality metrics first.
- Having a lower limit of cardinality below which we ignore the metric (e.g., don't take action on metrics with fewer than 5000 series).
- Building intelligence around how many series we're saving with various measures; answering questions like "how many series would we save by dropping this metric?" or "how many series have we saved in total with the measures implemented thus far?".
The `GET,POST <prometheus-http-prefix>/api/v1/cardinality/label_values` endpoint is a prime candidate for our use case. The issue is that the endpoint has a maximum limit of `500`. In other words, with this endpoint, we can only get the top 500 highest-cardinality metrics of the platform.
This is an example of the request we're currently making:
curl --location --globoff 'localhost:8080/prometheus/api/v1/cardinality/label_values?label_names[]=__name__&count_method=active&limit=500' --header 'X-Scope-OrgID: fake'
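As a rough illustration of the use cases listed above, the sketch below ranks metrics by series count and estimates the savings from dropping some of them. It assumes the response body has the shape described in the label values cardinality documentation (a `labels` array whose entries carry a `cardinality` list of `label_value` / `series_count` pairs). The helper names `rank_metrics` and `series_saved` are hypothetical, and the `sample` payload contains made-up numbers, not data from our platform.

```python
from __future__ import annotations

from typing import Any

# Made-up payload in the assumed response shape; the values are illustrative only.
sample = {
    "series_count_total": 21500,
    "labels": [
        {
            "label_name": "__name__",
            "label_values_count": 3,
            "series_count": 21500,
            "cardinality": [
                {"label_value": "http_requests_total", "series_count": 12000},
                {"label_value": "up", "series_count": 1500},
                {"label_value": "node_cpu_seconds_total", "series_count": 8000},
            ],
        }
    ],
}


def _name_entries(response: dict[str, Any]) -> list[dict[str, Any]]:
    """Collect the per-metric entries reported for the __name__ label."""
    entries: list[dict[str, Any]] = []
    for label in response.get("labels", []):
        if label.get("label_name") == "__name__":
            entries.extend(label.get("cardinality", []))
    return entries


def rank_metrics(response: dict[str, Any], min_series: int = 5000) -> list[tuple[str, int]]:
    """Return (metric name, series count) pairs with at least min_series series, highest first."""
    ranked = [
        (entry["label_value"], entry["series_count"])
        for entry in _name_entries(response)
        if entry["series_count"] >= min_series
    ]
    return sorted(ranked, key=lambda pair: pair[1], reverse=True)


def series_saved(response: dict[str, Any], dropped_metrics: set[str]) -> int:
    """Sum the series counts of the metrics we plan to drop."""
    return sum(
        entry["series_count"]
        for entry in _name_entries(response)
        if entry["label_value"] in dropped_metrics
    )
```

With the current cap, a report like this can only ever cover the 500 highest-cardinality metrics, so any total computed by `series_saved` is limited to metrics that appear in that top 500.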
Which solution do you envision (roughly)?
Initially, we suspected that the `limit` parameter was in place to protect server-side resources. However, we inspected the code of the endpoint, and it is our understanding that the endpoint always fetches all label values, and that the only thing the `limit` parameter does is control how many are returned (source).
We'd like for the `limit` parameter to have no maximum value:
- Behaviour for all requests that don't specify the `limit` parameter is the same: `limit` will default to `20`.
- Behaviour for all requests that specify a `limit` of `500` or less is the same.
- Behaviour for all requests that specify a `limit` greater than `500` changes: they would previously fail with `'limit' param cannot be greater than '500'` but will succeed after the change.
- Clients who would like to analyse a larger set of label values can set `limit` to a greater value like `100000`.
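The behaviour described in the list above can be sketched as follows. This is a minimal illustration in Python, not Mimir's actual implementation: the function name `parse_limit` and its `max_limit` argument are hypothetical, and the handling of negative values is an assumption, since this proposal does not cover that case. Passing `max_limit=500` mimics the current cap, and `max_limit=None` models the proposed behaviour.

```python
from typing import Optional

DEFAULT_LIMIT = 20  # default applied when the request does not specify `limit`


def parse_limit(raw: Optional[str], max_limit: Optional[int] = None) -> int:
    """Validate the `limit` query parameter (hypothetical helper).

    max_limit=500 reproduces the current cap; max_limit=None removes it.
    """
    if raw is None or raw == "":
        return DEFAULT_LIMIT
    limit = int(raw)  # raises ValueError for non-numeric input
    if limit < 0:
        # Assumption: negative limits are rejected; the message is illustrative.
        raise ValueError("'limit' param cannot be negative")
    if max_limit is not None and limit > max_limit:
        raise ValueError(f"'limit' param cannot be greater than '{max_limit}'")
    return limit
```

Under this sketch, `parse_limit("100000", max_limit=500)` raises the error quoted above, while `parse_limit("100000")` returns `100000`. Requests without `limit`, or with a `limit` of `500` or less, behave identically in both modes.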
Have you considered any alternatives?
We've considered two alternative solutions:
- The label values endpoint can return all metric names: `GET /prometheus/api/v1/label/__name__/values`. The issue with this endpoint is that it doesn't return series counts.
- The instant query endpoint is theoretically capable of returning all metric names and the count of series for each name with this query: `count by (__name__) ({__name__=~".+"})`. This query is prohibitively expensive to run, and fails almost immediately with a `the query exceeded the maximum number of chunks` error.
Any additional context to share?
We might consider whether to make the same change for the `limit` parameter of the `GET,POST <prometheus-http-prefix>/api/v1/cardinality/label_names` endpoint.
My soft opinion is that the two endpoints should implement the `limit` parameter in the same way, but I am not privy to the details of the `label_names` endpoint.
How long do you think this would take to be developed?
Small (<= 1 month dev)
What are the documentation dependencies?
We'd need to update the documentation for the label cardinality endpoint: https://grafana.com/docs/mimir/v2.14.x/references/http-api/#label-values-cardinality
Proposer?
Lasse Hels