-
Three things that come to my mind:
-
Hi potiuk, thank you for the suggestions 👍
We already use the CeleryKubernetesExecutor. Basically we started with the CeleryExecutor and later switched to the CeleryKubernetesExecutor. What we do within the DAGs is run "blackbox" Docker images that do computation on big chunks of data. The input data can vary in size from really small chunks to TB-sized files. The same goes for the computation images: some run fast, some take a long time and a lot of resources. Since we can neither influence the input data size nor the computation images, we cannot really estimate the runtime. The only thing we know about each computation image is how many resources it requires in a worst-case scenario. What we did in the past (and what worked for us, with a few downsides):
As an example:
Knowing the available resources of the Celery node, we limited the parallelism within the pool to the available resources. This way we know that Airflow won't queue tasks above our resource limits and cause individual tasks to get OOM-killed. But this approach had a few downsides:
What Airflow seems to be missing here is a way to schedule based on resources instead of the more abstract pools and slots. What we did to resolve the downsides (a sketch of both setups follows at the end of this comment):
This way we can just scale the resources within the Kubernetes cluster, and Kubernetes will manage the resource demands and always try to max out all available resources. During testing everything seemed to work just fine this way, except that Airflow removes the (expectedly) waiting tasks from the queue after a while.

What kind of issues would an int-max value for "task_queued_timeout" cause? Does it really mean that tasks stuck in the queue for known reasons (e.g. a Kubernetes API error) are not removed before task_queued_timeout is reached?

I think the main issue is not the removal of tasks from the queue; that would still be fine if the task were just put back into the backlog and scheduled again later on. It is more the "failed" state of the task which seems to be causing the issues on our side. We first thought "put back to backlog" would be equivalent to "retry", but this also seems not to be the same, because there is no way to differentiate between "retry tasks stuck in queue" and "retry failed tasks". A retry because of "task stuck in queue", in combination with retry backoff, would be great and might solve our issue.
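A minimal sketch of the two setups described in this comment. The DAG id, pool name, slot counts, resource numbers, and the default "kubernetes" queue name of the CeleryKubernetesExecutor are illustrative assumptions, not the exact setup from this thread:

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator
from kubernetes.client import models as k8s

with DAG(dag_id="blackbox_compute", start_date=datetime(2024, 1, 1), schedule=None) as dag:

    # Old approach: cap concurrency with a pool sized to the Celery node.
    # The (hypothetical) pool "celery_node_memory_gb" would be created via the
    # UI or `airflow pools set`, with one slot per GiB of node memory; each
    # task claims its worst-case memory as pool_slots.
    old_style_task = BashOperator(
        task_id="blackbox_via_pool",
        bash_command="echo 'placeholder for the real computation image'",
        pool="celery_node_memory_gb",
        pool_slots=8,  # worst-case memory of this computation, expressed in slots
    )

    # New approach: route the task to the Kubernetes side of the
    # CeleryKubernetesExecutor and declare worst-case resources, so the
    # Kubernetes scheduler decides when the pod can actually start.
    new_style_task = BashOperator(
        task_id="blackbox_via_kubernetes",
        bash_command="echo 'placeholder for the real computation image'",
        queue="kubernetes",  # default kubernetes_queue name; adjust if configured differently
        executor_config={
            "pod_override": k8s.V1Pod(
                spec=k8s.V1PodSpec(
                    containers=[
                        k8s.V1Container(
                            name="base",  # must match the worker container name
                            resources=k8s.V1ResourceRequirements(
                                requests={"cpu": "2", "memory": "8Gi"},
                                limits={"cpu": "2", "memory": "8Gi"},
                            ),
                        )
                    ]
                )
            )
        },
    )
```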
-
Hi,
we use Kubernetes to run long-running tasks in Airflow.
The current behavior is that tasks get queued as soon as the DAG preconditions are met.
Since we have tasks whose duration varies between a few minutes and, in the worst case (depending on the task and data), several days, and we only have limited resources, tasks often get marked "stuck in queued" and failed because Kubernetes resources are exhausted for a long period of time while the queue fills up.
For us it would be no issue if the tasks just waited in the queue until they can be picked up by Kubernetes; they should not fail simply because there are no resources available for a certain period of time.
Is there a way to handle such a scenario without causing side effects?
Stuck in queued because there are currently no resources left -> good case: the task should not fail and should just be picked up when resources become available again.
Stuck in queued because of non-resource issues (e.g. Kubernetes crashed, API not reachable, ...) -> the task should still fail.
It seems the "task_queued_timeout" parameter is what causes the failed tasks. We could now increase the timeout to a really big number to prevent the tasks from failing, but we are not sure whether we would then prevent Airflow or the scheduler from detecting tasks that are really stuck in the queue.
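For reference, a sketch of where that setting lives; the 30-day value below is purely illustrative, not a recommendation, and (as noted above) a very large value also delays detection of tasks that are stuck for non-resource reasons:

```ini
[scheduler]
# Seconds a task may sit in the "queued" state before the scheduler marks it
# failed. Can also be set via the environment variable
# AIRFLOW__SCHEDULER__TASK_QUEUED_TIMEOUT.
task_queued_timeout = 2592000
```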
Any recommendation on how we could work around this issue?