Description
Apache Airflow version
3.1.0
If "Other Airflow 2/3 version" selected, which one?
No response
What happened?
Our large-scale setup includes:
- ~1000 Celery executor workers
- 15 API servers, 64 worker processes each (with enough resources; utilization was checked)

Also, maybe relevant:
- 6 scheduler replicas
- 2 dag processors
- a pgbouncer with a large enough `airflow` connection pool size (it does not reach its maximum)
- DAGs with up to 8k tasks running in parallel and a final task that depends on all of them; usually the DAGs are smaller than that, averaging ~5k tasks
When all workers are active and working on task instances, each of them logs the following warning 4 times:
[warning] Starting call to 'airflow.sdk.api.client.Client.request', this is the %d time calling it. [airflow.sdk.api.client]
and on the 5th attempt they get this error:
[error] Task execute_workload[$celery_task_uuid] raise unexpected: ReadTimeout('timed out') [celery.app.trace]
We investigated this error a little and found that it comes from httpx's default timeout. From the httpx docs (https://www.python-httpx.org/advanced/timeouts/):
"HTTPX is careful to enforce timeouts everywhere by default. The default behavior is to raise a TimeoutException after 5 seconds of network inactivity."
What you think should happen instead?
Airflow should allow users to configure this timeout via airflow.cfg to accommodate high-load deployments.
For example:
[api]
HTTPX_TIMEOUT = # 5 by default
It may also be worth adding a section to the docs detailing best practices for keeping the API server reliable under very high load.
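A sketch of how such a knob might look in airflow.cfg (the option name and default are this report's proposal, not an existing Airflow setting):

```ini
[api]
# Proposed (not an existing option): client-side HTTP timeout in seconds.
# httpx currently defaults to 5 seconds of network inactivity.
httpx_timeout = 30
```

Following Airflow's usual AIRFLOW__{SECTION}__{KEY} convention, this would also be settable as the environment variable AIRFLOW__API__HTTPX_TIMEOUT.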
How to reproduce
(1) Run Airflow in a Kubernetes cluster with:
~1k Celery workers
~15 API server replicas (64 worker processes each; resource limits: 25Gi RAM, 8 CPU cores)
(2) Use DAGs large enough that all 1k workers execute tasks in parallel (each task should take more than 5 minutes)
(3) Observe the workers for ReadTimeout errors
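The DAG shape that triggers this is a wide fan-in: N parallel tasks plus one final task depending on all of them. A stdlib-only sketch of that dependency structure (the ids are illustrative; this is not a runnable DAG file):

```python
N = 8000  # upper end of the parallel task count from this report

# task id -> set of upstream task ids
deps = {f"task_{i}": set() for i in range(N)}
deps["final"] = {f"task_{i}" for i in range(N)}

# All N tasks are immediately runnable, so ~1k workers pick them up at
# once and each opens its own session to the API server: peak load.
ready = [t for t, ups in deps.items() if not ups]
```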
Operating System
Debian GNU/Linux 12 (bookworm)
Versions of Apache Airflow Providers
apache-airflow-providers-celery==3.12.2
apache-airflow-providers-common-compat==1.7.3
apache-airflow-providers-common-io==1.6.2
apache-airflow-providers-common-sql==1.27.5
apache-airflow-providers-standard==1.6.0
apache-airflow-providers-postgres==6.2.3
Deployment
Official Apache Airflow Helm Chart
Deployment details
No response
Anything else?
The problem occurs every time all workers are executing task instances (i.e., at the highest load).
Logs:
[warning] Starting call to 'airflow.sdk.api.client.Client.request', this is the 1st time calling it. [airflow.sdk.api.client]
[warning] Starting call to 'airflow.sdk.api.client.Client.request', this is the 2nd time calling it. [airflow.sdk.api.client]
[warning] Starting call to 'airflow.sdk.api.client.Client.request', this is the 3rd time calling it. [airflow.sdk.api.client]
[warning] Starting call to 'airflow.sdk.api.client.Client.request', this is the 4th time calling it. [airflow.sdk.api.client]
[error] Task execute_workload[a7469ad-3481-4fd4-b8f236b37cf1] raise unexpected: ReadTimeout('timed out') [celery.app.trace]
Are you willing to submit PR?
- Yes I am willing to submit a PR!
Code of Conduct
- I agree to follow this project's Code of Conduct