Metrics Improvement Project #42881

@dannyl1u

Description

From @ferruzzi:

Currently, when you add a new metric to the codebase, you must also manually update the docs page. The docs page inevitably gets out of date and misses some details. We want an automated system that generates the docs page from the actual metrics. There are also known instances where the same metric is created and emitted in more than one place, causing duplicate data. These will have to be fixed manually, and an automated check might (stretch goal?) look for identical or "too similar" names while collecting the names for the docs page.
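As a rough illustration of the stretch goal, here is a minimal sketch of a name check that could run over the collected metric names. It assumes the names are already available as a plain list of strings; the function name `find_similar_names`, the 0.9 threshold, and the sample names other than `local_task_job.task_exit` are hypothetical and not part of Airflow.

```python
from difflib import SequenceMatcher
from itertools import combinations


def find_similar_names(names, threshold=0.9):
    """Return (a, b, ratio) for every pair of names at or above the threshold.

    Exact duplicates in the input show up as pairs with a ratio of 1.0.
    The threshold is an arbitrary starting point and would need tuning.
    """
    flagged = []
    for a, b in combinations(sorted(names), 2):
        ratio = SequenceMatcher(None, a, b).ratio()
        if ratio >= threshold:
            flagged.append((a, b, ratio))
    return flagged


# Hypothetical collected names: one metric registered twice, one unrelated.
collected = [
    "local_task_job.task_exit",
    "local_task_job.task_exit",
    "scheduler.example_metric",
]
suspicious = find_similar_names(collected)
```

A pairwise comparison like this is quadratic in the number of names, which should be acceptable for a docs build but is worth keeping in mind.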

Phase 1
Situation:
We support multiple Metrics backends [0]. The two main ones are StatsD and OpenTelemetry. This is managed through an interface class [1] which is implemented for each backend (examples: StatsD [2] and OTel [3]). StatsD was the only supported backend well into Airflow 2.x and the entire codebase was designed with StatsD in mind, so it was a good chunk of work to abstract it out, and there are a few remaining tasks to perfect the new implementation.
Task 1:
StatsD has a name length limit of around 300 characters. OTel limits names to 34 characters, but allows tagging. Our temporary solution was to emit almost everything twice, once in the long format for StatsD and again in the short format with tags for OTel. We also had to add code [4] to make sure the name is safe for OTel, and other hacks to make it work.
The first task in this project is to understand the difference in how the two implementations handle their names and then add a "get_name" method to the interface: def get_name(metric_name: str, tags: dict[str, str]). In the statsd_logger [2] implementation it will concatenate the tags onto the name, and in the OTel implementation it will just return the name.
Once that is implemented, it can be used in the various emit methods (incr, decr, etc.) instead of all the name validation code. The next step is to search the code for places where we are emitting things more than once and clean them up.
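A minimal sketch of what the proposed method might look like. The class names here are hypothetical stand-ins, not the actual Airflow classes, and joining the tag values with dots in insertion order is an assumption based on the StatsD name format shown in this issue.

```python
from __future__ import annotations

from typing import Protocol


class SupportsGetName(Protocol):
    """Hypothetical slice of the logger interface with the proposed method."""

    def get_name(self, metric_name: str, tags: dict[str, str] | None = None) -> str: ...


class StatsdStyleLogger:
    """Stand-in for the StatsD implementation: tags are folded into the name."""

    def get_name(self, metric_name: str, tags: dict[str, str] | None = None) -> str:
        if not tags:
            return metric_name
        # Assumption: tag values are appended in insertion order, dot-separated.
        return ".".join([metric_name, *(str(value) for value in tags.values())])


class OtelStyleLogger:
    """Stand-in for the OTel implementation: tags travel separately."""

    def get_name(self, metric_name: str, tags: dict[str, str] | None = None) -> str:
        return metric_name


example_tags = {"job_id": "7", "dag_id": "my_dag", "task_id": "my_task", "return_code": "0"}
statsd_name = StatsdStyleLogger().get_name("local_task_job.task_exit", example_tags)
otel_name = OtelStyleLogger().get_name("local_task_job.task_exit", example_tags)
```

Whether tag keys should also appear in the StatsD name, and how unsafe characters in tag values are handled, is left open here.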
Example:
You can see an example in local_task_job_runner [5]. We emit local_task_job.task_exit.<job_id>.<dag_id>.<task_id>.<return_code> for StatsD but that results in a name too long for OTel so we also emit local_task_job.task_exit, and the name validation method [4] in the OTel implementation catches the one that is too long and just swallows it. What we should do instead is pass incr() the name and the tags and let StatsD and OTel handle them accordingly.
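To make the intended call site concrete, here is a sketch in which the caller emits once and each backend decides what to do with the tags. The two recording classes and `report_task_exit` are hypothetical; they store what would be emitted instead of talking to a real StatsD or OTel client.

```python
class RecordingStatsd:
    """Hypothetical StatsD-like backend that records the long dotted name."""

    def __init__(self):
        self.sent = []

    def incr(self, stat, count=1, tags=None):
        # Assumption: tag values are appended to the name, dot-separated.
        parts = [stat, *(str(value) for value in (tags or {}).values())]
        self.sent.append((".".join(parts), count))


class RecordingOtel:
    """Hypothetical OTel-like backend that keeps the short name and the tags."""

    def __init__(self):
        self.sent = []

    def incr(self, stat, count=1, tags=None):
        self.sent.append((stat, count, dict(tags or {})))


def report_task_exit(stats, job_id, dag_id, task_id, return_code):
    """Emit the metric once; the backend handles the name and tags."""
    stats.incr(
        "local_task_job.task_exit",
        tags={
            "job_id": job_id,
            "dag_id": dag_id,
            "task_id": task_id,
            "return_code": return_code,
        },
    )


statsd_backend = RecordingStatsd()
otel_backend = RecordingOtel()
for backend in (statsd_backend, otel_backend):
    report_task_exit(backend, job_id=7, dag_id="my_dag", task_id="my_task", return_code=0)
```

With this shape the second, shortened emit and the swallowed over-long name both go away, because the call site no longer needs to know which backend is active.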
[0] https://airflow.apache.org/docs/apache-airflow/stable/administration-and-deployment/logging-monitoring/metrics.html#metric-descriptions
[1] https://github.com/apache/airflow/blob/main/airflow/metrics/base_stats_logger.py
[2] https://github.com/apache/airflow/blob/main/airflow/metrics/statsd_logger.py
[3] https://github.com/apache/airflow/blob/main/airflow/metrics/otel_logger.py
[4] https://github.com/apache/airflow/blob/main/airflow/metrics/otel_logger.py#L128
[5] https://github.com/apache/airflow/blob/main/airflow/jobs/local_task_job_runner.py#L352

Are you willing to submit a PR?

  • Yes I am willing to submit a PR!
