-
Three things that come to my mind:
-
Hi potiuk, thank you for the suggestions 👍
We already use the CeleryKubernetesExecutor. Basically we started with the CeleryExecutor and later switched to the CeleryKubernetesExecutor. What we do within the DAGs is run "blackbox" Docker images that do computation on big chunks of data. The input data can vary in size from really small chunks to TB-sized files. The same goes for the computation images: some run fast, some take a long time and a lot of resources. Since we can neither influence the input data size nor the computation images, we cannot really estimate the runtime. The only thing we know about each computation image is how many resources it requires in a worst-case scenario. What we did in the past (and what worked for us, with a few downsides):
As an example:
Knowing the available resources of the Celery node, we limited the parallelism within the pool to the available resources. This way we know that Airflow won't queue tasks above our resource limits and cause individual tasks to get OOM-killed. But this approach had a few downsides:
What Airflow seems to be missing here is a way to schedule based on resources instead of the more abstract pools and slots. What we did to resolve the downsides (a sketch of both setups follows at the end of this comment):
This way we can just scale the resources within the Kubernetes cluster, and Kubernetes will manage the resource demands and always try to max out all available resources. During testing everything seemed to work just fine this way, except that Airflow removes the (expectedly) waiting tasks from the queue after a while.

What kind of issues would an int-max value for "task_queued_timeout" cause? Does it really mean that tasks stuck in the queue for known reasons (e.g. a Kubernetes API error) are not removed before task_queued_timeout is reached?

I think the main issue is not the removal of tasks from the queue; that would still be fine if the task were just put back into the backlog and scheduled again later on. It is more the "failed" state of the task which seems to be causing the issues on our side. We first thought "put back to backlog" would be equivalent to "retry", but this also seems not to be the same, because there is no way to differentiate between "retry tasks stuck in queue" and "retry failed tasks". A retry because of "task stuck in queue", in combination with retry backoff, would be great and might solve our issue.
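A minimal sketch of the two setups described in this comment. The DAG id, pool name, slot counts, resource numbers, and the default "kubernetes" queue name of the CeleryKubernetesExecutor are illustrative assumptions, not the exact setup from this thread:

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator
from kubernetes.client import models as k8s

with DAG(dag_id="blackbox_compute", start_date=datetime(2024, 1, 1), schedule=None) as dag:

    # Old approach: cap concurrency with a pool sized to the Celery node.
    # The (hypothetical) pool "celery_node_memory_gb" would be created via the
    # UI or `airflow pools set`, with one slot per GiB of node memory; each
    # task claims its worst-case memory as pool_slots.
    old_style_task = BashOperator(
        task_id="blackbox_via_pool",
        bash_command="echo 'placeholder for the real computation image'",
        pool="celery_node_memory_gb",
        pool_slots=8,  # worst-case memory of this computation, expressed in slots
    )

    # New approach: route the task to the Kubernetes side of the
    # CeleryKubernetesExecutor and declare worst-case resources, so the
    # Kubernetes scheduler decides when the pod can actually start.
    new_style_task = BashOperator(
        task_id="blackbox_via_kubernetes",
        bash_command="echo 'placeholder for the real computation image'",
        queue="kubernetes",  # default kubernetes_queue name; adjust if configured differently
        executor_config={
            "pod_override": k8s.V1Pod(
                spec=k8s.V1PodSpec(
                    containers=[
                        k8s.V1Container(
                            name="base",  # must match the worker container name
                            resources=k8s.V1ResourceRequirements(
                                requests={"cpu": "2", "memory": "8Gi"},
                                limits={"cpu": "2", "memory": "8Gi"},
                            ),
                        )
                    ]
                )
            )
        },
    )
```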
-
Hi,
we use Kubernetes to run long-running tasks in Airflow.
The current behavior is that tasks get queued as soon as the DAG preconditions are met.
Since we have tasks whose duration varies between a few minutes and, in the worst case (depending on the task and data), several days, and we only have limited resources, tasks often get marked "stuck in queued" and failed because Kubernetes resources are exhausted for a long period of time while the queue fills up.
For us it would be no issue if the tasks just waited in the queue until they can be picked up by Kubernetes; they should not fail simply because there are no resources available for a certain period of time.
Is there a way to handle such a scenario without causing side effects?
Stuck in queued because there are currently no resources left -> good case: the task should not fail and should just be picked up when resources become available again.
Stuck in queued because of non-resource issues (e.g. Kubernetes crashed, API not reachable, ...) -> the task should still fail.
It seems the "task_queued_timeout" parameter is what causes the failed tasks. We could now increase the timeout to a really big number to prevent the tasks from failing, but we are not sure whether we would then prevent Airflow or the scheduler from detecting tasks that are really stuck in the queue.
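For reference, a sketch of where that setting lives; the 30-day value below is purely illustrative, not a recommendation, and (as noted above) a very large value also delays detection of tasks that are stuck for non-resource reasons:

```ini
[scheduler]
# Seconds a task may sit in the "queued" state before the scheduler marks it
# failed. Can also be set via the environment variable
# AIRFLOW__SCHEDULER__TASK_QUEUED_TIMEOUT.
task_queued_timeout = 2592000
```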
Any recommendation on how we could work around this issue?