Conversation


@xiaoyachong xiaoyachong commented Apr 29, 2025

Refactored the Prefect worker to use a job routing architecture:

  1. Modified parent_flow.py to act as a job router that:
  • Takes params_list as input
  • Selects the appropriate work pool based on worker.name (hard-coded for now; it may later check the workload of the HPCs to decide)
  • Extracts the relevant algorithm parameters from the MLflow algorithm registry for the target environment
  • Loads environment-specific details from config.yml instead of io_parameters
  • Routes jobs to the corresponding deployments (uses run_deployment() instead of calling launch_XX() directly)
  2. Created separate worker management scripts:
  • start_xx_child_worker.sh for foreground worker execution
  • start_xx_child_worker_background.sh for background execution with PID tracking
  • Credentials (e.g., the Tiled key and MLflow password) are loaded from .env instead of io_parameters
  3. Implemented dedicated work pools for different execution environments:
  • docker_pool (type: process) for Docker-based jobs
  • podman_pool (type: process) for Podman-based jobs
  • conda_pool (type: process) for Conda environment jobs
  • slurm_pool (type: process) for Slurm jobs
  • parent_pool (type: process) for the parent job router
  4. Added proper logging and process management (for background execution):
  • Log files include worker PIDs for easy tracking
  • All PIDs are saved for simplified worker management
  • Configurable health check ports to prevent conflicts
  5. The determine_best_environment() function in utils.py still needs improvement in a future PR.

Ideally, it should evaluate the workload across all available resources (e.g., ALS cluster Ball, NSLS-II, NERSC, ALCF) and automatically select the optimal environment to run different job types (e.g., Docker, Podman, Slurm) on the corresponding resources.

For now, we’ve fixed worker.name to als in config.yml, which will start a Docker job.
This environment-selection mechanism is the key feature we plan to implement in the next phase.
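For illustration, here is a rough sketch of the routing logic in parent_flow.py (the deployment names, config keys, and parameter shapes below are illustrative assumptions, not the exact implementation):

# Minimal sketch of the job router: pick a work-pool-specific deployment and
# dispatch each job with run_deployment() instead of calling launch_XX() directly.
import yaml
from prefect import flow, get_run_logger
from prefect.deployments import run_deployment

# Illustrative mapping; the real deployment names live in the prefect-*.yaml files
DEPLOYMENT_BY_WORKER_TYPE = {
    "docker": "launch_docker/launch_docker",
    "podman": "launch_podman/launch_podman",
    "conda": "launch_conda/launch_conda",
    "slurm": "launch_slurm/launch_slurm",
}

@flow(name="parent_flow")
def parent_flow(params_list: list):
    logger = get_run_logger()
    with open("config.yml") as f:
        config = yaml.safe_load(f)
    worker = config["worker"]                 # e.g. {"name": "als", "type": "docker"}
    deployment_name = DEPLOYMENT_BY_WORKER_TYPE[worker["type"]]
    logger.info(f"Routing {len(params_list)} job(s) to {deployment_name}")
    for params in params_list:
        run_deployment(name=deployment_name, parameters={"params": params})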

Companion PRs: Data Clinic #17 and LSE #43


@xiaoyachong xiaoyachong requested a review from taxe10 April 29, 2025 19:03
Member

@taxe10 taxe10 left a comment

Changes look very promising! I’ve added a couple of questions below. The main suggestion is to decouple the model metadata so it focuses purely on the algorithm itself, rather than environment or runtime-specific configuration (e.g. conda_env, network, volumes, etc.).

In addition, a few things to consider further:

  • Would it be worthwhile to use Prefect 3.x? I understand this would require changing mlex_utils, but are there other repos that would require major changes as well?
  • What mlflow version is compatible with these changes? Do we need to update the dependencies?
  • I'd also recommend moving sensitive information to Prefect Secrets.
  • As @dylanmcreynolds suggested, we should consider adding an SFapi worker for NERSC.
  • Updating README and creating documentation
  • Creating unit tests for this repo

return FlowType.conda

@task
def get_algorithm_details_from_mlflow(model_name: str):
Member

Should we add model version? If not defined, should we go for the latest?

Suggested change
def get_algorithm_details_from_mlflow(model_name: str):
def get_algorithm_details_from_mlflow(model_name: str, model_version: Optional[int] = None):

Author

The model_name sent from the application side follows the same naming convention as in default_model.json:
"model_name": "pytorch_autoencoder_v0.0.5"
where the version is included. It is also consistent with the original algorithm display:
[screenshot of the original algorithm display]

Alternatively, I could split it into two fields—model_name: "pytorch_autoencoder" and model_version: "1" (instead of "v0.0.5")—and have the application send them as separate variables. Note that MLflow assigns numeric versions (1, 2, …, n, with n being the latest) by default. If we want a custom version name (e.g., "v0.0.5"), we need to add it manually as a tag (which I have already done). Let me know which approach you think is better.
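For reference, resolving the split fields could look roughly like this (a sketch only; it defaults to MLflow's latest numeric version, and the custom_version tag name is an assumption):

# Sketch: resolve a registered model version, defaulting to the latest numeric version.
from typing import Optional
from mlflow.tracking import MlflowClient

def resolve_model_version(model_name: str, model_version: Optional[int] = None):
    client = MlflowClient()
    if model_version is None:
        versions = client.search_model_versions(f"name='{model_name}'")
        model_version = max(int(v.version) for v in versions)  # MLflow numeric versions: 1, 2, ..., n
    mv = client.get_model_version(model_name, str(model_version))
    custom_version = mv.tags.get("custom_version")  # e.g. "v0.0.5"; the tag name is illustrative
    return mv, custom_version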

Member

Splitting it into 2 fields and following mlflow version tracking convention sounds like a good plan to me

tags = run.data.tags

# Get relevant fields from MLflow params
algorithm_details = {
Member

I’d recommend simplifying this section to focus on metadata that’s directly relevant to the algorithm itself, rather than environment or runtime configuration (e.g. conda_env, network, volumes, etc.):

Suggested change
algorithm_details = {
algorithm_details = {
"model_name": model_name,
"image_name": params.get("image_name", ""),
"image_tag": params.get("image_tag", ""),
"source_code": params.get("source_code", ""), # e.g. repo link
}

Author

Thanks for your suggestion. I changed algorithm_details to:

  algorithm_details = {
      "model_name": model_name,
      # Core Algorithm Information
      "image_name": params.get("image_name", ""),
      "image_tag": params.get("image_tag", ""),
      "source": params.get("source", ""),
      "is_gpu_enabled": params.get("is_gpu_enabled", "False").lower() == "true"
  }

And I created another job_details dict that loads information from the newly created config.yml file:

job_details = {
    # Container settings
    "volumes": config.get("container", {}).get("volumes", []),
    "network": config.get("container", {}).get("network", ""),
    # Slurm settings
    "num_nodes": config.get("slurm", {}).get("num_nodes", 1),
    "partitions": config.get("slurm", {}).get("partitions", "[]"),
    "reservations": config.get("slurm", {}).get("reservations", "[]"),
    "max_time": config.get("slurm", {}).get("max_time", "1:00:00"),
    "submission_ssh_key": config.get("slurm", {}).get("submission_ssh_key", ""),
    "forward_ports": config.get("slurm", {}).get("forward_ports", "[]"),
    # Get conda environment based on the model type
    "conda_env": _get_conda_env_for_model(model_name, config)
}

@xiaoyachong xiaoyachong requested a review from taxe10 September 17, 2025 21:07
@dylanmcreynolds
Member

For invoking a container from a script, I would think that the mechanisms are exactly the same except for docker vs podman as the command. Have you considered maintaining only one set of scripts, with a parameter for docker vs podman ?

@xiaoyachong xiaoyachong commented Nov 5, 2025

For invoking a container from a script, I would think that the mechanisms are exactly the same except for docker vs podman as the command. Have you considered maintaining only one set of scripts, with a parameter for docker vs podman ?

With the current implementation, the setup differs slightly because the docker_pool uses the "docker" work pool type, while Podman uses the "process" type—since Prefect does not natively support a "podman" type (reference).

To elaborate, for the Docker work pool, the launch_docker() function runs inside a worker container and invokes a sibling container (e.g., Autoencoder) to execute the ML task. The workflow follows the Docker-out-of-Docker (Sibling Container Pattern) as illustrated below:

✅ Docker-out-of-Docker (Sibling Container Pattern)

┌──────────────────────────────────────────────┐
│ Host Machine                                │
│                                              │
│ ┌────────────────────────────────────────┐  │
│ │ Docker Daemon                           │ │
│ │ • Manages all containers on host        │ │
│ │ • Exposes socket: /var/run/docker.sock  │ │
│ └────────────────────────────────────────┘  │
│        ▲                  │                 │
│        │                  │                 │
│        │ ① API Call       │ ② Creates       │
│        │   via Socket     │   Container     │
│        │                  │                 │
│        │                  ▼                 │
│ ┌──────────────┐     ┌──────────────┐       │
│ │ Worker       │     │ Job          │       │
│ │ Container    │     │ Container    │       │
│ │              │     │              │       │
│ │ • Mounted    │     │ • Runs       │       │
│ │   docker.sock│     │   workload   │       │
│ │ • Uses SDK or│     │ • Isolated   │       │
│ │   CLI        │     │ • Sibling    │       │
│ └──────────────┘     └──────────────┘       │
│                                              │
│ ◄────────────── Sibling Level ─────────────► │
│                                              │
└──────────────────────────────────────────────┘

Flow Sequence:
1️⃣ The worker sends an API request through the mounted socket (docker.sock), which acts as the communication bridge.
2️⃣ The daemon receives the request and spawns a new job container at the host level (not nested).

I attempted to use the "docker" type for the podman_pool as well by mounting podman.sock in the Prefect YAML configuration. However, this approach does not work on macOS because Podman runs inside a VM. While it’s technically feasible on Linux, it’s generally not recommended to mount the Podman socket, so I’m unsure how to best proceed with the podman_worker. For now, I’ve kept its work pool type as "process".

prefect work-pool create docker_pool --type "docker"
prefect work-pool update docker_pool --concurrency-limit $PREFECT_WORK_POOL_CONCURRENCY
prefect deploy -n launch_docker --pool docker_pool --prefect-file prefect-docker.yaml
PREFECT_WORKER_WEBSERVER_PORT=8081 prefect worker start --pool docker_pool --limit $PREFECT_WORKER_LIMIT --with-healthcheck
Member

When we run on the ALS infrastructure, prefect server will be reached at something like https://prefect.computing.als.lbl.gov. I'm a little concerned to see the PREFECT_WORKER_WEBSERVER_PORT setting, which would throw that off. But I may not understand everything here.

Author

I added the PREFECT_WORKER_WEBSERVER_PORT setting because we now run multiple workers concurrently (parent_pool and docker_pool). The prefect worker start command exposes a health check webserver on port 8080 by default. When starting multiple workers on the same machine, each needs a unique port to avoid conflicts - hence PREFECT_WORKER_WEBSERVER_PORT=8081 for the docker_pool worker.

This port setting only affects the local health check endpoint and does not impact how workers communicate with the Prefect server at https://prefect.computing.als.lbl.gov (which is controlled by PREFECT_API_URL). However, we should verify this works as expected in the real ALS deployment.

@dylanmcreynolds
Member

For invoking a container from a script, I would think that the mechanisms are exactly the same except for docker vs podman as the command. Have you considered maintaining only one set of scripts, with a parameter for docker vs podman ?

With the current implementation, the setup differs slightly because the docker_pool uses the "docker" work pool type, while Podman uses the "process" type—since Prefect does not natively support a "podman" type (reference).

I guess I was looking at this script thinking that docker run implies that it's not talking to the docker socket directly. https://github.com/xiaoyachong/mlex_prefect_worker/blob/8d27648c9ba0a186c544ab746e86b6e5760f71ed/flows/docker/bash_run_docker.sh#L16

@xiaoyachong
Author

For invoking a container from a script, I would think that the mechanisms are exactly the same except for docker vs podman as the command. Have you considered maintaining only one set of scripts, with a parameter for docker vs podman ?

With the current implementation, the setup differs slightly because the docker_pool uses the "docker" work pool type, while Podman uses the "process" type—since Prefect does not natively support a "podman" type (reference).

I guess I was looking at this script thinking that docker run implies that it's not talking to the docker socket directly. https://github.com/xiaoyachong/mlex_prefect_worker/blob/8d27648c9ba0a186c544ab746e86b6e5760f71ed/flows/docker/bash_run_docker.sh#L16

I’m not entirely sure about this, but here’s the explanation from Claude:
The docker run command does communicate with the Docker socket — it’s just abstracted away from the user.
Here’s how the flow works:

[Our Script] 
    ↓ 
[docker run command]
    ↓
[Docker CLI client]
    ↓ (sends API request via socket)
[/var/run/docker.sock]
    ↓
[Docker Daemon]
    ↓ (creates container)
[Container]

Member

@taxe10 taxe10 left a comment

Thanks for addressing the previous comments! These are a few new suggestions to consider:

  • We should avoid spawning docker containers from other docker containers as this will likely be restricted in deployment environments
  • The conda workers are currently not working due to a missing path to their source code - I added a suggestion on how we could manage this in the config file
  • Error tracking between parent and child flows - e.g. when a child flow fails, the error is not passed to the parent flow, which keeps the Completed status

config.yml Outdated
pca: "mlex_dimension_reduction_pca"
umap: "mlex_dimension_reduction_umap"
clustering: "mlex_clustering"
dlsia: "dlsia"
Member

I'm wondering if here we should add

algorithm_path:
  pytorch_autoencoder: /path/to/autoencoders
  .
  .
  dlsia: /path/to/dlsia # /src/train.py

Author

I came up with another method to determine the algorithm_path. Since we register the image_name in MLflow (e.g., ghcr.io/mlexchange/mlex_dlsia_segmentation_prototype), we can easily extract the algorithm folder name (mlex_dlsia_segmentation_prototype) from it. With ALGORITHMS_DIR defined in the .env file, the full path becomes ALGORITHMS_DIR/folder_name. I’ve updated my code accordingly, so we no longer need to define algorithm_path in config.yml.
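A small sketch of that derivation (the helper name is hypothetical; ALGORITHMS_DIR comes from .env):

# Sketch: derive the local algorithm path from the MLflow-registered image name.
import os

def algorithm_path_from_image(image_name: str) -> str:
    # "ghcr.io/mlexchange/mlex_dlsia_segmentation_prototype" -> "mlex_dlsia_segmentation_prototype"
    folder_name = image_name.rstrip("/").split("/")[-1]
    return os.path.join(os.environ["ALGORITHMS_DIR"], folder_name)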

I also found another limitation in the segmentation app when switching the worker type. There is a model_dir variable in io_parameter (loaded from WRITE_DIR in .env) that determines where DVC results are saved. For the conda flow, this should be a local machine path; for the Docker flow, it needs to be a container path. This means that whenever we switch worker types, we must also update that value on the application side, which is not ideal.

I’m considering saving the DVC results as artifacts directly to the MLflow server instead. This would avoid path differences but would require changes in both the DLSIA and segmentation repositories.
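If we go that route, the upload itself would be roughly (a sketch, assuming an active MLflow run; the directory and artifact path are placeholders):

# Sketch: log DVC outputs as MLflow artifacts instead of writing them to model_dir.
import mlflow

with mlflow.start_run(run_name="dlsia_training"):            # run name is a placeholder
    # ... training and DVC tracking happen here ...
    mlflow.log_artifacts("results/dvc", artifact_path="dvc")  # uploads the directory to the MLflow server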

config.yml Outdated
@@ -0,0 +1,32 @@
# MLExchange Job Configuration
# Default HPC type to use
hpc_type: "als"
Member

I'm also wondering if we should rename this to:

worker:
  name: "als"
  type: "podman"

Author

Sure. I just updated it.

Member

This could be addressed in the next PR - should we add a --run-background flag? That way we can maintain a single file for each worker.

Author

That sounds good. I will address it in the next PR.

@xiaoyachong xiaoyachong commented Nov 19, 2025

Thanks for addressing the previous comments! These are a few new suggestions to consider:

  • We should avoid spawning docker containers from other docker containers as this will likely be restricted in deployment environments
  • The conda workers are currently not working due to a missing path to their source code - I added a suggestion on how we could manage this in the config file
  • Error tracking between parent and child flows - e.g. when a child flow fails, the error is not passed to the parent flow, which keeps the Completed status

Thanks for your reviews! I’ve addressed all the comments and closed the issue for the third one. Both the Conda and Docker workers are now ready for testing.
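For the error-tracking point, one possible approach is to check the child run's final state in the parent and raise on failure (a sketch only, assuming run_deployment waits for the child flow run and returns its FlowRun; the helper name is hypothetical):

# Sketch: surface child-flow failures in the parent instead of keeping Completed status.
from prefect.deployments import run_deployment

def run_child_and_check(deployment_name: str, parameters: dict):
    flow_run = run_deployment(name=deployment_name, parameters=parameters)  # waits by default
    if flow_run.state is None or not flow_run.state.is_completed():
        state_name = flow_run.state.name if flow_run.state else "Unknown"
        raise RuntimeError(f"Child flow run '{flow_run.name}' ended in state {state_name}")
    return flow_run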

flows/utils.py Outdated
return FlowType.conda


def _get_conda_env_for_model(model_name: str, config: dict) -> str:
Member

I believe this function would need to be modified as new algorithms are registered. I wonder if we could use the {model_name}_{version} as the key to find the conda environment in the config file?

I am trying to register dlsia 2.0 (refactored)

Author

Sounds good. I’m using {model_name}_{version} as the keys for looking up the Conda environment names in config.yml:

# Conda environment settings
conda:
  conda_env_name:
    DLSIA MSDNet_0.0.1: "dlsia"
    DLSIA TUNet_0.0.1: "dlsia"
    DLSIA TUNet3+_0.0.1: "dlsia"

Regarding versioning, there are two types to consider. You may have noticed the custom version defined in models.json (e.g., 0.0.1), and the version number used by MLflow (e.g., Version 1, Version 2, which are integer-based). In config.yml, I’m using our custom-defined version.

Not directly related—but for full algorithm version support, there will need to be substantial changes in highres_segmentation (front-end UI, params_list, and algorithm registration), as well as on the Prefect worker side. These updates are not implemented yet and can be addressed in a future PR.
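To make the lookup concrete, here is a sketch of resolving the conda environment from that key (the function signature is hypothetical):

# Sketch: look up the conda environment using the {model_name}_{version} key.
def get_conda_env(model_name: str, model_version: str, config: dict) -> str:
    env_map = config.get("conda", {}).get("conda_env_name", {})
    key = f"{model_name}_{model_version}"     # e.g. "DLSIA MSDNet_0.0.1"
    if key not in env_map:
        raise KeyError(f"No conda environment configured for '{key}' in config.yml")
    return env_map[key]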
