How to setup low-latency task-workers in a true improvement over celery/redis? #15992

realimpat · 2024-11-12T19:28:25Z

realimpat
Nov 12, 2024

after reading blog:
https://www.prefect.io/blog/background-tasks-why-they-matter-in-prefect

I've tried to build out a flow in prefect where we currently use redis + celery + flower + ecs. It would be great to structure our processing steps as a DAG and get more monitoring and observability.

That said, it seems like there's no way to have fine-grained control over which task workers get which tasks, or trace which have processed what in a way that is visible in the prefect UI dashboard. So if we see a stage of a flow fail, we will know which python functions failed, but we won't for example know which worker had the failure to be able to then trace the logs of that container to see if there was some infrastructure reason for the failure.

Adding a discussion topic at the request of @zzstoatzz . I'm new to starting discussions here so forgive me if this isn't structured well. I'll just paste the full conversation from slack here for more context:

whole convo

Claude Summary

I couldn't find a way to monitor which task workers handle which tasks or trace failures at the infrastructure level (unlike our current Celery+Flower setup)

I wanted to avoid polling scheduler latency for user-triggered events. Nate clarified we could bypass Prefect's "Deployment" system by calling flows directly from background tasks, maintaining observability while achieving near-real-time processing. We had previously ruled out serverless alternatives due to cold-start latency issues in our use case.

Whole thread

@Nate
thanks for your help, i was able to build a prototype of prefect in our app using the background tasks examples.
I'm now looking into
A) monitoring the task workers
B) organizing 'real' code into flows
7:59
For A), the prefect ui is a good dashboard to see the tasks. That said, to see the status of a worker it looks like there's just a simple endpoint right now that can return json, within each individual worker, and not an aggregate view. As defined here: https://github.com/PrefectHQ/prefect/blob/main/src/prefect/task_worker.py
The 'worker pool' concept doesn't seem to apply to the task workers. Am I reading this correctly?

GitHubGitHub
prefect/src/prefect/task_worker.py at main · PrefectHQ/prefect
Prefect is a workflow orchestration framework for building resilient data pipelines in Python. - PrefectHQ/prefect (52 kB)
https://github.com/PrefectHQ/prefect/blob/main/src/prefect/task_worker.py

Nate
Yesterday at 8:00 PM
yes that sounds right!

yourstake
Yesterday at 8:01 PM
so, if we wanted a dashboard to monitor the health of each individual worker, we would need to do the aggregation on our side to build a ui that queries the status of each running worker at that endpoint?
8:02
(or just look at the infrastructure container logs themselves).
I'm used to flower from celery, which was a screenshot in your blog post announcing the task workers

Nate
Yesterday at 8:03 PM
yeah we dont have a ton of observability for task workers themselves at this time unfortunately. you'd just see all the task runs in the UI that get picked up by them (edited)

yourstake
Yesterday at 8:03 PM
And for B), is it correct that right now the background tasks only works for individual tasks, and not Flows?
I ask because in our situation what we are looking for most is a way to organize complex pipeline processing steps in near-real time (triggered when a user clicks a button in a web ui), so I'm hoping to use the DAG concept of flows and tasks, but without relying on a polling scheduler that would introduce latency

yourstake
Yesterday at 8:26 PM
I realize i could have a flow call background tasks just as they would call any other task.
That said: can the flows be triggered with the event-based websocket system, or do the flows need to be triggered by a polling scheduler?
Or, does this distinction not matter due to the way the prefect internals work; is a 'flow' just an organization of tasks, such that, if all the tasks in a flow were event-based, starting the flow itself would also be event-based with the same latency concerns as its individual tasks?

Nate
Yesterday at 8:47 PM
let me give this question proper attention tomorrow, this is something i’ve been thinking a lot about recently but i’m tied up at an event rn.

yourstake
Today at 9:55 AM
thanks. Would be helpful to hear your thoughts
@Nate
on whether the background task workers can be used with flows, or otherwise how best to solve for a DAG within event-based triggers from users at low latency

yourstake
1 hour ago
should I assume that it is the case that the background tasks will only work for handling small tasks and not for a whole DAG-like flow, in the current production version of prefect 3 as of today?

Nate
1 hour ago
so what do you think about something like this
@flow
def lots_of_nested_tasks_and_flows():
pass

@task
def calls_a_bunch_of_things():
lots_of_nested_tasks_and_flows() # blocking, does all the side effects

calls_a_bunch_of_things.serve() # websockets, low latency
(edited)
1:01
should I assume that it is the case that the background tasks will only work for handling small tasks and not for a whole DAG-like flow, in the current production version of prefect 3 as of today?
background tasks can call whatever, including flows

yourstake
13 minutes ago
if a background task calls a flow, would the flow wait to start until the polling scheduler finds it? or would the flow start immediately when called?

Nate
13 minutes ago
if you just call a flow it runs like a normal python function, no scheduler involved
1:40
its only if you hit POST /create_flow_run_from_deployment one way or another are you scheduling a run that a polling process would have to catch

yourstake
12 minutes ago
okay that clarifies things. so if we avoid using the 'deployment' parts of the framework and just call flows from task workers we avoid the polling scheduler?

Nate
12 minutes ago
yes
1:42
yeah to be candid we have this problem in our info architecture where we suggest that Deployment is the "one true path" but there's a lot of use you can get by avoiding Deployment altogether and using flows for observability, you still get events and the runs in the UI and everything else, just not dynamic dispatch of infra (edited)
1:42
for example you can do this in a lambda
@flow
def handler(event, context): ...
(edited)

yourstake
7 minutes ago
ya i think in our use case we could refactor things to run serverless, but currently we have a autoscaling number of worker containers that are created by celery/redis/ecs cdk setup. so i'm for starters seeing if we could create a similar number of prefect task workers that execute the processing.
it looks like there's some automatic routing logic to handle dispatches to multiple workers in the task_worker source code. how would that interact with a really complex flow? is there for example any way to trace and monitor which workers are doing what within a flow? so if something fails we can look at the logs for that specific worker?
1:47
"yeah to be candid we have this problem in our info architecture where we suggest that Deployment is the "one true path" but there's a lot of use you can get by avoiding Deployment altogether and using flows for observability, you still get events and the runs in the UI and everything else, just not dynamic dispatch of infra (edited)"
Ya I think this is the root of my confusion

Nate
4 minutes ago
thats a really good question
can you open a discussion about this? right now internally we're working on emitting more context about the client side exec env so the UI can show it, but its deployment-focused AFAIK right now
but it would be ideal to do the same for task workers so that the flow runs you call from background would be traceable to the infra that they're running on

yourstake
3 minutes ago
the othe rproblem with serverless is latency; a lot of serverless thing shav ecold starts an dit can be complex to pre-warm them. the aws ecs guidance is pre-warming still leaves several seconds for warm starts
1:50
so we ruled out serverless in the past for that reason for our own use-case. i think technically there's arguments that can work around the cold-start problem but most o fthose solutions are more time consuming to setup and maintain than the' traditional' worker container architecture

abrookins · 2024-11-14T00:16:23Z

abrookins
Nov 14, 2024
Maintainer

👋 Hello again! I see a few improvements we could make here:

Make it easier/possible to see task workers, task worker logs (i.e., the infra logs, not task run logs, which are currently visible in the UI), and see which worker ran a task.

All of this makes perfect sense to me and fits with work we're doing right now to enhance worker logging (for flow workers first, and only in Cloud at the moment). Check out this preview of the changes we made this week to bring worker log visibility to Cloud:

Making it possible to see the same thing for task workers feels like it would address this need.

NOTE: You aren't using flow workers, so the logging improvements above shouldn't apply to you yet, but if you were trying out this feature just note that for this week, you'd need to run prefect config set PREFECT_EXPERIMENTS_WORKER_LOGGING_TO_API_ENABLED=1 (or set that env var) wherever you run the worker.

Low-latency flow and task submission

This one is a little trickier. We haven't optimized starting flow runs at low latency, as you discovered. We have optimized tasks to run faster, though we still aren't targeting super low-latency start times. They should run fast enough for most use cases, though.

Can you elaborate on what you meant when you said you don't have fine-grained control over which workers get which tasks? To start a task worker, you have to specify which tasks the worker will run, so I'm guessing you meant something deeper than that.

And I know you said you wanted to run a DAG ("flows" in our product) at lower latency than our flow workers offer. But instead of using a task to run a flow, have you considered running background tasks from other tasks? Tasks can run other tasks in the background and wait for them to complete, so you could have a background task that starts other background tasks, waits for them all to complete, then continues, etc. When you call task.delay(), you get back a future that the calling task can use to call wait() to wait for completion or result() to wait and get the result of the task. This would all show up in the task logs and graph in the UI. Then if we made seeing worker logs easier, you might have most of what you need.

Flows are useful when you need more orchestration-like features. And that may be true for your use case, of course. So flows would still be the better choice if you want features like transaction semantics, pause/resume (with or without external input), dynamic infrastructure provisioning, running functions on a schedule, etc.

0 replies

realimpat · 2024-11-14T15:09:46Z

realimpat
Nov 14, 2024
Author

cool that's a nice ui!

What is a flow worker? Is that different from a task worker? I only saw a task worker for background tasks -- is the flow worker for something on a scheduler?

Can you elaborate on what you meant when you said you don't have fine-grained control over which workers get which tasks? To start a task worker, you have to specify which tasks the worker will run, so I'm guessing you meant something deeper than that.

Like
a) which worker on which infrastructure. Some workers need large cpu others need high memory, some need very small resources, some need to live for 30 minute jobs that have to execute sequentially others can juggle many jobs at once. In the task worker config it looked to me like it treat all workers as the same and interchangeable.
b) how to handle task assignment within workers, e.g. round robin vs other approaches.

And I know you said you wanted to run a DAG ("flows" in our product) at lower latency than our flow workers offer. But instead of using a task to run a flow, have you considered running background tasks from other tasks?

I did consider this. This is what we can currently accomplish with celery though. So unless we can leverage a flow that can handle complex branching and nested pipelines, there's not advantage over existing celery tasks.

0 replies

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

How to setup low-latency task-workers in a true improvement over celery/redis? #15992

{{title}}

{{editor}}'s edit

{{editor}}'s edit

Claude Summary

Whole thread

Replies: 2 comments

{{title}}

{{title}}

{{editor}}'s edit

{{editor}}'s edit

Select a reply

How to setup low-latency task-workers in a true improvement over celery/redis? #15992

realimpat Nov 12, 2024

Claude Summary

Whole thread

Replies: 2 comments

abrookins Nov 14, 2024 Maintainer

Make it easier/possible to see task workers, task worker logs (i.e., the infra logs, not task run logs, which are currently visible in the UI), and see which worker ran a task.

Low-latency flow and task submission

realimpat Nov 14, 2024 Author

realimpat
Nov 12, 2024

abrookins
Nov 14, 2024
Maintainer

realimpat
Nov 14, 2024
Author