| 1 | +## **Motivation** |
| 2 | +Apache Airflow is an open-source workflow management platform primarily used for scheduling and monitoring workflows. It can handle complex data pipelines and is widely used in data engineering and data science. Airflow lets users write workflows as DAGs (Directed Acyclic Graphs). Each DAG contains a series of tasks that are executed in a specific order according to their dependencies. Because Airflow runs many tasks in complex scenarios, monitoring its health and operational status is crucial. Airflow metrics help analyze task health, formulate optimization plans, and design risk prevention strategies. |
| 3 | + |
| 4 | + |
| 5 | +## **Architecture Graph** |
| 6 | + |
| 7 | +There is no significant architecture-level change. |
| 8 | + |
| 9 | +## **Proposed Changes** |
| 10 | +```mermaid |
| 11 | +graph LR; |
| 12 | + AirflowOTEL("Airflow OTEL") --> OpenTelemetryCollector("OpenTelemetry Collector") --> SkyWalkingOTELReceiver("SkyWalking OTEL Receiver") --> SkyWalkingMALEngine("SkyWalking MAL Engine") |
| 13 | + --> SkyWalkingUI("SkyWalking UI") |
| 14 | +``` |
| 15 | +1. Airflow sends metrics to the OpenTelemetry Collector, which pushes them to the SkyWalking OTEL |
| 16 | + Receiver via the OpenTelemetry exporter. |
| 17 | +2. The SkyWalking OAP Server evaluates MAL expressions to filter, calculate, and aggregate the metrics, then stores the results. |
| 18 | +3. The metrics are displayed in the SkyWalking UI, and their presentation on the UI dashboard can be customized. |
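
Step 1 depends on Airflow's OpenTelemetry metrics emitter being switched on. As a minimal sketch, the snippet below builds the environment variables that enable it, using Airflow's `AIRFLOW__{SECTION}__{KEY}` convention and the `[metrics]` options `otel_on`, `otel_host`, `otel_port`, `otel_prefix`, and `otel_interval_milliseconds`. The helper names, the `otel-collector` host, port `4318`, and the 30-second interval are illustrative assumptions, not values taken from this proposal.

```python
def airflow_env_name(section: str, key: str) -> str:
    """Return the environment variable Airflow reads for `[section] key`."""
    return f"AIRFLOW__{section.upper()}__{key.upper()}"


def otel_metrics_env(host: str, port: int, interval_ms: int = 30000) -> dict[str, str]:
    """Build the environment variables that turn on Airflow's OTel metrics."""
    options = {
        "otel_on": "True",
        "otel_host": host,            # assumed address of the OpenTelemetry Collector
        "otel_port": str(port),       # assumed OTLP/HTTP port of the Collector
        "otel_prefix": "airflow",
        "otel_interval_milliseconds": str(interval_ms),
    }
    return {airflow_env_name("metrics", key): value for key, value in options.items()}


# Placeholder host and port; point these at your own Collector.
env = otel_metrics_env("otel-collector", 4318)
for name, value in sorted(env.items()):
    print(f"{name}={value}")
```

The printed `NAME=value` pairs can be exported in the environment of the Airflow scheduler, triggerer, and workers; the same options can instead be set in the `[metrics]` section of `airflow.cfg`.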
| 19 | + |
| 20 | +### Airflow Service Supported Metrics
| 21 | +| Monitoring Panel | Unit | Metric Name | Description | Data Source | |
| 22 | +|-----|------|-----|-----|-----| |
| 23 | +| Airflow Job Started | count | <job_name>_start | Number of started jobs | OpenTelemetry from Airflow | |
| 24 | +| Tasks Executable | count | scheduler.tasks.executable | Number of tasks that are ready for execution (set to queued) with respect to pool limits, DAG concurrency, executor state, and priority. | OpenTelemetry from Airflow | |
| 25 | +| Tasks Cleared | count | scheduler.orphaned_tasks.cleared | Number of orphaned tasks cleared by the scheduler | OpenTelemetry from Airflow | |
| 26 | +| Tasks Adopted | count | scheduler.orphaned_tasks.adopted | Number of orphaned tasks adopted by the scheduler | OpenTelemetry from Airflow | |
| 27 | +| Queued Tasks | count | executor.queued_tasks | Number of queued tasks on executor | OpenTelemetry from Airflow | |
| 28 | +| Pool Open Slots | count | executor.open_slots | Number of open slots on executor | OpenTelemetry from Airflow | |
| 29 | +| Pool Queued Slots | count | pool.queued_slots | Number of queued slots in the pool. Metric with pool_name tagging. | OpenTelemetry from Airflow | |
| 30 | +| Deferred Slots | count | pool.deferred_slots | Number of deferred slots in the pool. Metric with pool_name tagging. | OpenTelemetry from Airflow | |
| 31 | +| Scheduler Heartbeat | rate | scheduler_heartbeat | Scheduler heartbeats | OpenTelemetry from Airflow | |
| 32 | +| DAG File Queue Size | count | dag_processing.file_path_queue_size | Number of DAG files to be considered for the next scan | OpenTelemetry from Airflow | |
| 33 | +| Dataset Updates | count | dataset.updates | Number of updated datasets | OpenTelemetry from Airflow | |
| 34 | + |
| 35 | +### Airflow Instance Supported Metrics |
| 36 | + |
| 37 | +| Monitoring Panel | Unit | Metric Name | Description | Data Source | |
| 38 | +|---------------------------|--------------|-----|-----|-----| |
| 39 | +| Airflow Job Started | count | <job_name>_start | Number of started jobs | OpenTelemetry from Airflow | |
| 40 | +| Pool Open Slots | count | pool.open_slots | Number of open slots in the pool. Metric with pool_name tagging. | OpenTelemetry from Airflow | |
| 41 | +| Pool Deferred Slots | count | pool.deferred_slots | Number of deferred slots in the pool. Metric with pool_name tagging. | OpenTelemetry from Airflow | |
| 42 | +| Pool Running Slots | count | pool.running_slots| Number of running slots in the pool. Metric with pool_name tagging.| OpenTelemetry from Airflow | |
| 43 | +| Triggerer Heartbeat | rate | triggerer_heartbeat | Triggerer heartbeats | OpenTelemetry from Airflow | |
| 44 | +| Triggers Main Thread | count | triggers.blocked_main_thread| Number of triggers that blocked the main thread (likely due to not being fully asynchronous)| OpenTelemetry from Airflow | |
| 45 | +| Triggers Succeeded | count | triggers.succeeded | Number of triggers that have fired at least one event | OpenTelemetry from Airflow | |
| 46 | +| Triggers Failed | count | triggers.failed | Number of triggers that errored before they could fire an event | OpenTelemetry from Airflow | |
| 47 | +| Tasks Executable | count | scheduler.tasks.executable | Number of tasks that are ready for execution (set to queued) with respect to pool limits, DAG concurrency, executor state, and priority. | OpenTelemetry from Airflow | |
| 48 | +| Tasks Cleared | count | scheduler.orphaned_tasks.cleared | Number of orphaned tasks cleared by the scheduler | OpenTelemetry from Airflow | |
| 49 | +| Tasks Adopted | count | scheduler.orphaned_tasks.adopted | Number of orphaned tasks adopted by the scheduler | OpenTelemetry from Airflow | |
| 50 | +| Queued Tasks | count | executor.queued_tasks | Number of queued tasks on executor | OpenTelemetry from Airflow | |
| 51 | +| Dataset Updates | count | dataset.updates | Number of updated datasets | OpenTelemetry from Airflow | |
| 52 | +| Dataset Orphaned | count | dataset.orphaned| Number of datasets marked as orphans because they are no longer referenced in DAG schedule parameters or task outlets| OpenTelemetry from Airflow | |
| 53 | +| Dataset Triggered Dagruns | count | dataset.triggered_dagruns | Number of DAG runs triggered by a dataset update| OpenTelemetry from Airflow | |
| 54 | + |
| 55 | +If a metric exists for both a service and its instances, the value displayed on the service dashboard is the sum over all instances. |
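
The service-level roll-up described above can be sketched in a few lines. This is plain Python for illustration only, not MAL and not SkyWalking code; the service name, instance names, and sample values are hypothetical.

```python
from collections import defaultdict

# Hypothetical samples: (service, instance, metric name, value).
samples = [
    ("airflow", "scheduler-0", "executor.queued_tasks", 4),
    ("airflow", "scheduler-1", "executor.queued_tasks", 6),
    ("airflow", "scheduler-0", "dataset.updates", 1),
]


def service_totals(rows):
    """Sum each metric over all instances that belong to the same service."""
    totals = defaultdict(int)
    for service, _instance, metric, value in rows:
        totals[(service, metric)] += value
    return dict(totals)


totals = service_totals(samples)
print(totals[("airflow", "executor.queued_tasks")])  # 4 + 6 = 10
```

A metric reported by only one instance, such as `dataset.updates` in this sample, appears on the service dashboard with that instance's value unchanged.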
| 56 | +## **Imported Dependencies libs and their licenses.** |
| 57 | + |
| 58 | +No new dependency. |
| 59 | + |
| 60 | +## **Compatibility** |
| 61 | + |
| 62 | +No breaking changes. |
| 63 | + |
| 64 | +## **General usage docs** |