-
Hi Jack, a few thoughts:
-
I’m not familiar with MPI, but are you basically running an external command? If so, you might get more mileage out of writing a wrapping `Task` that submits the job to your SLURM cluster directly, plus a cheap polling loop in that wrapping task that reports the status of the SLURM job in the Prefect GUI. This way you can control the requirements for each task separately, and the Dask cluster you’ll need to run the flow can be a lot smaller.
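A rough sketch of that pattern (the job script path, polling interval, and the way completion is detected are placeholders here, not Prefect- or SLURM-mandated choices):

```python
import subprocess
import time

import prefect
from prefect import task


@task
def run_on_slurm(job_script: str, poll_seconds: int = 30) -> str:
    """Submit a SLURM job script and block until it leaves the queue."""
    logger = prefect.context.get("logger")

    # `sbatch --parsable` prints only the job id on success.
    job_id = subprocess.run(
        ["sbatch", "--parsable", job_script],
        check=True, capture_output=True, text=True,
    ).stdout.strip()
    logger.info("Submitted SLURM job %s", job_id)

    # Cheap polling loop: while squeue still lists the job it is pending or
    # running; once it disappears it has finished. A real task would also
    # check `sacct` for the final state and raise on failure.
    while True:
        listing = subprocess.run(
            ["squeue", "-h", "-j", job_id],
            capture_output=True, text=True,
        ).stdout.strip()
        if not listing:
            break
        logger.info("SLURM job %s is still in the queue", job_id)
        time.sleep(poll_seconds)

    return job_id
```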
On Mon, 24 Aug 2020 at 17:00, jacksund wrote:
The adaptive deployment method looks super useful!! Thank you. I'll definitely switch to SLURMCluster.adapt() over the SLURMCluster.scale() method when I use this type of architecture. This solves one of my main problems with Dask (wasting resources by holding onto them), but I still need to see whether dask-jobqueue can limit a worker to one task, execute a task via mpirun, and localize tasks to each worker / a single directory. Dask seems more complicated than the simple agents I'm used to, but I'm reading more to see if this would work. I think it will end up coming down to how Dask manages tasks/memory between workers: the more isolated they are, the better. Prefect agents look to be more isolated (and higher level), which is why I'm starting here.
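For reference, the adaptive deployment mentioned above looks roughly like this with dask-jobqueue; the resource values are placeholders, and `processes=1, cores=1` is one way to keep each worker to a single task at a time:

```python
from dask.distributed import Client
from dask_jobqueue import SLURMCluster

# One single-threaded process per worker, so each worker runs one task at a time.
cluster = SLURMCluster(
    queue="general",           # placeholder partition name
    cores=1,
    processes=1,
    memory="4GB",
    walltime="02:00:00",
    local_directory="/tmp",    # scratch space for the worker itself
)

# Scale between 0 and 10 SLURM jobs based on pending work, so resources are
# released whenever the scheduler has nothing queued.
cluster.adapt(minimum=0, maximum=10)

client = Client(cluster)
```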
-
The alternative is to write a cluster-level `Executor` that will let you take control of task submission. You implement something that takes a `Callable` and returns something like a `Future`, and then something that can wait on those futures.
The upside of this is full control of task submission. The downside is that every single task will be a cluster task: even if it only takes two seconds, you’ll still pay queue time (but Prefect should gain you some parallelism).
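A very rough sketch of the submit/wait pair such an executor needs; the Prefect `Executor` base class and its exact hook signatures are left out here, and a thread pool stands in for the SLURM-side bookkeeping, so treat every name as illustrative:

```python
import concurrent.futures
from typing import Any, Callable, List


class SlurmExecutorSketch:
    """Illustrative only: each submitted callable would become one SLURM job.

    A real implementation would subclass Prefect's Executor and wire these
    two methods into its submit/wait hooks.
    """

    def __init__(self, max_workers: int = 8) -> None:
        # The threads only babysit sbatch/squeue calls; the heavy work runs
        # on the cluster, so a small pool is plenty.
        self._pool = concurrent.futures.ThreadPoolExecutor(max_workers)

    def submit(self, fn: Callable, *args: Any, **kwargs: Any) -> concurrent.futures.Future:
        # In a real executor, `fn` would be serialized into a job script and
        # handed to sbatch; running it in a thread is just a stand-in.
        return self._pool.submit(fn, *args, **kwargs)

    def wait(self, futures: List[concurrent.futures.Future]) -> List[Any]:
        # Block until every future resolves, preserving submission order.
        return [f.result() for f in futures]
```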
On Mon, 24 Aug 2020 at 18:16, jacksund wrote:
I won't be able to do this, unfortunately. This approach leads to a number of issues when I'm using multiple clusters (I submit to in-lab clusters, UNC clusters, and national clusters all through the same workflow manager), such as bottlenecks when one cluster hits a longer queue than the others. It also loses data, such as how long the task took to run (queue time is absorbed into the task duration). There are other problems that pop up, but these are the big two for me. It's a good idea, though; it just falls apart for my specific application.
-
Just wanted to leave an update. I've been looking into writing a minimalist Agent for this purpose, but I've admittedly gotten overwhelmed by the Agent base class... I'll probably drop this for now and perhaps revisit it later in the year. For the moment, I've settled on the LocalAgent for testing. So instead of submitting a single task per job, it's one flow per SLURM job. It works so far across multiple clusters, where I'm starting agents up with...
When I submit a bunch of SLURM jobs on one cluster (where many Agents are being launched/killed/relaunched), I have all the Agents simply use the same RUNNER API token. No errors occur when I run agents at the same time with the same token (tested without the max_polls kwarg), but I haven't fully tested this because I'm still using the developer edition of Prefect Cloud. I'm limited to just one concurrent flow, so I have no idea whether this setup will break down with >1 concurrency. Should I just fill out this form to request a small increase in flow concurrency for testing?

The key thing I have working is a queue manager that tries to maintain N SLURM jobs (and therefore N Agents) on a cluster at any given time. This is really just a rewrite of FireWorks' queue module. Once I have that ironed out, I'll share it here and perhaps submit it to prefect.contrib.
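For what it's worth, a sketch of that queue manager could be as small as the following; the agent command inside the job script, the resource lines, and the target job count are all placeholders that would need to match your Prefect version and cluster:

```python
import subprocess

TARGET_JOBS = 5                # keep roughly this many agent jobs queued/running
JOB_NAME = "prefect-agent"     # used to count our own jobs in squeue

# The payload each SLURM job runs: start one agent, let it poll once, exit.
# The agent command here is a placeholder for whatever your Prefect version uses.
JOB_SCRIPT = """#!/bin/bash
#SBATCH --job-name=prefect-agent
#SBATCH --time=02:00:00
#SBATCH --mem=4G
prefect agent start local --max-polls 1
"""


def count_agent_jobs() -> int:
    """How many of our agent jobs are currently pending or running."""
    listing = subprocess.run(
        ["squeue", "-h", "-n", JOB_NAME, "-o", "%i"],
        capture_output=True, text=True,
    ).stdout.strip()
    return len(listing.splitlines()) if listing else 0


def top_up_queue() -> None:
    """Submit enough new jobs to bring the total back up to TARGET_JOBS."""
    for _ in range(TARGET_JOBS - count_agent_jobs()):
        # sbatch reads the job script from stdin when no file is given.
        subprocess.run(["sbatch"], input=JOB_SCRIPT, text=True, check=True)
```

Calling `top_up_queue()` from cron or a small loop on a login node would keep the pool of agents topped up.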
-
Hey everyone,
I apologize in advance for how much I wrote here... I realize that I'm attempting to use Prefect in a manner it wasn't originally designed for, so I include a longer explanation of why I'm doing this below.
While most users are coming from Airflow, I'm trying to transition from FireWorks (https://github.com/materialsproject/fireworks).
So here's my main question:
Can I set up Prefect with many Agents, each of which makes a single task request and then terminates? In addition, if an Agent starts and no tasks are ready for execution, it should terminate as well. If this is possible, what would you estimate the overhead per task to be? I would expect the main overhead to come from the Agent's connection to Prefect Cloud/Server.
This may sound like an inefficient use of Prefect, but it is intentional. FireWorks is designed with this setup in mind and is thus limited to about 6 tasks per second; this is acceptable because the average task submitted via FireWorks is on the hour timescale.
I'm a materials chemistry researcher at UNC, where I must submit tasks as individual SLURM jobs, each with its own time and memory restrictions. These tasks (DFT energy calculations) vary drastically in their required resources (one could need <1 GB of memory while another needs >200 GB), launch in parallel using mpirun, and require their own isolated directory.
I could use a single Executor like Dask, which supports queueing systems like SLURM, but this would cause a number of problems for me. Dask holds onto worker resources indefinitely, which research-cluster admins don't want; if no tasks are ready to execute, the cluster's resources should be released. Dask (as far as I'm aware) does not allow setting time/memory limits on a per-task basis. And I'm unsure how Dask will handle tasks that execute via mpirun and also need isolated directories per task.
FireWorks was made by the materials chemistry community specifically for this. In that setup, you constantly submit SLURM jobs; once a SLURM job makes it through the queue, the job itself simply starts an Agent, runs a single task, and then closes. This submission architecture is something I would like to replicate with Prefect. FireWorks has a number of limitations that I think Prefect can fix, such as its Workflow classes and MongoDB meta-database, so I'm looking into switching over.
Should this be of interest to others, it may be worth making QueueAdaptors for Prefect, similar to FireWorks' adapters (https://materialsproject.github.io/fireworks/queue_tutorial.html).
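A hypothetical sketch of what such an adapter might look like on the SLURM side (class and field names are invented here; the payload would be whatever command the task ultimately runs, e.g. an mpirun line):

```python
import subprocess
from dataclasses import dataclass


@dataclass
class SlurmQueueAdapter:
    """Hypothetical adapter: turn per-task resource specs into an sbatch call."""
    walltime: str = "01:00:00"
    memory: str = "4G"
    ntasks: int = 1

    def render(self, payload: str) -> str:
        # Build a job script from this adapter's resource settings.
        return (
            "#!/bin/bash\n"
            f"#SBATCH --time={self.walltime}\n"
            f"#SBATCH --mem={self.memory}\n"
            f"#SBATCH --ntasks={self.ntasks}\n"
            f"{payload}\n"
        )

    def submit(self, payload: str) -> str:
        # sbatch reads the script from stdin; --parsable returns just the job id.
        result = subprocess.run(
            ["sbatch", "--parsable"],
            input=self.render(payload),
            capture_output=True, text=True, check=True,
        )
        return result.stdout.strip()


# e.g. SlurmQueueAdapter(memory="200G", ntasks=16).submit("mpirun my_dft_code")
```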
Again, sorry for the long write-up! Thanks for reading through, and let me know if you think a multi-Agent approach is possible with Prefect.
-Jack