Conversation


@xiaoyachong xiaoyachong commented Apr 29, 2025

Refactored the Prefect worker to use a job routing architecture:

  1. Modified parent_flow.py to act as a job router that:
  • Takes params_list as input
  • Selects the appropriate work pool based on worker.name (hard-coded for now; it may later check the workload of the HPCs to decide)
  • Extracts the relevant algorithm parameters from the MLflow algorithm registry for the target environment
  • Loads environment-specific details from config.yml instead of io_parameters
  • Routes jobs to the corresponding deployments (uses run_deployment() instead of calling launch_XX() directly)
  2. Created separate worker management scripts:
  • start_xx_child_worker.sh for foreground worker execution
  • start_xx_child_worker_background.sh for background execution with PID tracking
  • Credentials (e.g., the Tiled key and MLflow password) are loaded from .env instead of io_parameters
  3. Implemented dedicated work pools for different execution environments:
  • docker_pool (type: process) for Docker-based jobs
  • podman_pool (type: process) for Podman-based jobs
  • conda_pool (type: process) for Conda environment jobs
  • slurm_pool (type: process) for Slurm jobs
  • parent_pool (type: process) for the parent job router
  4. Added proper logging and process management (for background execution):
  • Log files include worker PIDs for easy tracking
  • All PIDs are saved for simplified worker management
  • Configurable health check ports to prevent conflicts
  5. The determine_best_environment() function in utils.py still needs improvement in a future PR.

Ideally, it should evaluate the workload across all available resources (e.g., ALS cluster Ball, NSLS-II, NERSC, ALCF) and automatically select the optimal environment to run different job types (e.g., Docker, Podman, Slurm) on the corresponding resources.

For now, we’ve fixed worker.name to als in config.yml, which will start a Docker job.
This environment-selection mechanism is the key feature we plan to implement in the next phase.
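For illustration, here is a rough sketch of the routing logic in parent_flow.py (the deployment names, config keys, and parameter shapes below are illustrative assumptions, not the exact implementation):

# Minimal sketch of the job router: pick a work-pool-specific deployment and
# dispatch each job with run_deployment() instead of calling launch_XX() directly.
import yaml
from prefect import flow, get_run_logger
from prefect.deployments import run_deployment

# Illustrative mapping; the real deployment names live in the prefect-*.yaml files
DEPLOYMENT_BY_WORKER_TYPE = {
    "docker": "launch_docker/launch_docker",
    "podman": "launch_podman/launch_podman",
    "conda": "launch_conda/launch_conda",
    "slurm": "launch_slurm/launch_slurm",
}

@flow(name="parent_flow")
def parent_flow(params_list: list):
    logger = get_run_logger()
    with open("config.yml") as f:
        config = yaml.safe_load(f)
    worker = config["worker"]                 # e.g. {"name": "als", "type": "docker"}
    deployment_name = DEPLOYMENT_BY_WORKER_TYPE[worker["type"]]
    logger.info(f"Routing {len(params_list)} job(s) to {deployment_name}")
    for params in params_list:
        run_deployment(name=deployment_name, parameters={"params": params})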

Companion PRs: Data Clinic #17 and LSE #43


@xiaoyachong xiaoyachong requested a review from taxe10 April 29, 2025 19:03
Member

@taxe10 taxe10 left a comment

Changes look very promising! I’ve added a couple of questions below. The main suggestion is to decouple the model metadata so it focuses purely on the algorithm itself, rather than environment or runtime-specific configuration (e.g. conda_env, network, volumes, etc.).

In addition, a few things to consider further:

  • Would it be worthwhile to use Prefect 3.x? I understand this would require changing mlex_utils, but are there other repos that would require major changes as well?
  • What mlflow version is compatible with these changes? Do we need to update the dependencies?
  • I'd also recommend moving sensitive information to Prefect Secrets.
  • As @dylanmcreynolds suggested, we should consider adding an SFapi worker for NERSC.
  • Updating README and creating documentation
  • Creating unit tests for this repo

return FlowType.conda

@task
def get_algorithm_details_from_mlflow(model_name: str):
Member

Should we add model version? If not defined, should we go for the latest?

Suggested change
def get_algorithm_details_from_mlflow(model_name: str):
def get_algorithm_details_from_mlflow(model_name: str, model_version: Optional[int] = None):

Author

The model_name sent from the application side follows the same naming convention as in default_model.json:
"model_name": "pytorch_autoencoder_v0.0.5"
where the version is included. It is also consistent with the original algorithm display:
[screenshot of the original algorithm display]

Alternatively, I could split it into two fields—model_name: "pytorch_autoencoder" and model_version: "1" (instead of "v0.0.5")—and have the application send them as separate variables. Note that MLflow assigns numeric versions (1, 2, …, n, with n being the latest) by default. If we want a custom version name (e.g., "v0.0.5"), we need to add it manually as a tag (which I have already done). Let me know which approach you think is better.
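For reference, resolving the split fields could look roughly like this (a sketch only; it defaults to MLflow's latest numeric version, and the custom_version tag name is an assumption):

# Sketch: resolve a registered model version, defaulting to the latest numeric version.
from typing import Optional
from mlflow.tracking import MlflowClient

def resolve_model_version(model_name: str, model_version: Optional[int] = None):
    client = MlflowClient()
    if model_version is None:
        versions = client.search_model_versions(f"name='{model_name}'")
        model_version = max(int(v.version) for v in versions)  # MLflow numeric versions: 1, 2, ..., n
    mv = client.get_model_version(model_name, str(model_version))
    custom_version = mv.tags.get("custom_version")  # e.g. "v0.0.5"; the tag name is illustrative
    return mv, custom_version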

Member

Splitting it into 2 fields and following mlflow version tracking convention sounds like a good plan to me

tags = run.data.tags

# Get relevant fields from MLflow params
algorithm_details = {
Member

I’d recommend simplifying this section to focus on metadata that’s directly relevant to the algorithm itself, rather than environment or runtime configuration (e.g. conda_env, network, volumes, etc.):

Suggested change
algorithm_details = {
algorithm_details = {
"model_name": model_name,
"image_name": params.get("image_name", ""),
"image_tag": params.get("image_tag", ""),
"source_code": params.get("source_code", ""), # e.g. repo link
}

Author

Thanks for your suggestion. I changed algorithm_details to:

  algorithm_details = {
      "model_name": model_name,
      # Core Algorithm Information
      "image_name": params.get("image_name", ""),
      "image_tag": params.get("image_tag", ""),
      "source": params.get("source", ""),
      "is_gpu_enabled": params.get("is_gpu_enabled", "False").lower() == "true"
  }

And I created another job_details dict that loads information from the newly created config.yml file:

job_details = {
    # Container settings
    "volumes": config.get("container", {}).get("volumes", []),
    "network": config.get("container", {}).get("network", ""),
    # Slurm settings
    "num_nodes": config.get("slurm", {}).get("num_nodes", 1),
    "partitions": config.get("slurm", {}).get("partitions", "[]"),
    "reservations": config.get("slurm", {}).get("reservations", "[]"),
    "max_time": config.get("slurm", {}).get("max_time", "1:00:00"),
    "submission_ssh_key": config.get("slurm", {}).get("submission_ssh_key", ""),
    "forward_ports": config.get("slurm", {}).get("forward_ports", "[]"),
    # Get conda environment based on the model type
    "conda_env": _get_conda_env_for_model(model_name, config)
}

@xiaoyachong xiaoyachong requested a review from taxe10 September 17, 2025 21:07
@dylanmcreynolds
Member

For invoking a container from a script, I would think that the mechanisms are exactly the same except for docker vs podman as the command. Have you considered maintaining only one set of scripts, with a parameter for docker vs podman ?

@xiaoyachong xiaoyachong commented Nov 5, 2025

For invoking a container from a script, I would think that the mechanisms are exactly the same except for docker vs podman as the command. Have you considered maintaining only one set of scripts, with a parameter for docker vs podman ?

With the current implementation, the setup differs slightly because the docker_pool uses the "docker" work pool type, while Podman uses the "process" type—since Prefect does not natively support a "podman" type (reference).

To elaborate, for the Docker work pool, the launch_docker() function runs inside a worker container and invokes a sibling container (e.g., Autoencoder) to execute the ML task. The workflow follows the Docker-out-of-Docker (Sibling Container Pattern) as illustrated below:

✅ Docker-out-of-Docker (Sibling Container Pattern)

┌──────────────────────────────────────────────┐
│ Host Machine                                │
│                                              │
│ ┌────────────────────────────────────────┐  │
│ │ Docker Daemon                           │ │
│ │ • Manages all containers on host        │ │
│ │ • Exposes socket: /var/run/docker.sock  │ │
│ └────────────────────────────────────────┘  │
│        ▲                  │                 │
│        │                  │                 │
│        │ ① API Call       │ ② Creates       │
│        │   via Socket     │   Container     │
│        │                  │                 │
│        │                  ▼                 │
│ ┌──────────────┐     ┌──────────────┐       │
│ │ Worker       │     │ Job          │       │
│ │ Container    │     │ Container    │       │
│ │              │     │              │       │
│ │ • Mounted    │     │ • Runs       │       │
│ │   docker.sock│     │   workload   │       │
│ │ • Uses SDK or│     │ • Isolated   │       │
│ │   CLI        │     │ • Sibling    │       │
│ └──────────────┘     └──────────────┘       │
│                                              │
│ ◄────────────── Sibling Level ─────────────► │
│                                              │
└──────────────────────────────────────────────┘

Flow Sequence:
1️⃣ The worker sends an API request through the mounted socket (docker.sock), which acts as the communication bridge.
2️⃣ The daemon receives the request and spawns a new job container at the host level (not nested).

I attempted to use the "docker" type for the podman_pool as well by mounting podman.sock in the Prefect YAML configuration. However, this approach does not work on macOS because Podman runs inside a VM. While it’s technically feasible on Linux, it’s generally not recommended to mount the Podman socket, so I’m unsure how to best proceed with the podman_worker. For now, I’ve kept its work pool type as "process".

prefect work-pool create docker_pool --type "docker"
prefect work-pool update docker_pool --concurrency-limit $PREFECT_WORK_POOL_CONCURRENCY
prefect deploy -n launch_docker --pool docker_pool --prefect-file prefect-docker.yaml
PREFECT_WORKER_WEBSERVER_PORT=8081 prefect worker start --pool docker_pool --limit $PREFECT_WORKER_LIMIT --with-healthcheck
Member

When we run on the ALS infrastructure, prefect server will be reached at something like https://prefect.computing.als.lbl.gov. I'm a little concerned to see the PREFECT_WORKER_WEBSERVER_PORT setting, which would throw that off. But I may not understand everything here.

Author

I added the PREFECT_WORKER_WEBSERVER_PORT setting because we now run multiple workers concurrently (parent_pool and docker_pool). The prefect worker start command exposes a health check webserver on port 8080 by default. When starting multiple workers on the same machine, each needs a unique port to avoid conflicts - hence PREFECT_WORKER_WEBSERVER_PORT=8081 for the docker_pool worker.

This port setting only affects the local health check endpoint and does not impact how workers communicate with the Prefect server at https://prefect.computing.als.lbl.gov (which is controlled by PREFECT_API_URL). However, we should verify this works as expected in the real ALS deployment.

@dylanmcreynolds
Member

For invoking a container from a script, I would think that the mechanisms are exactly the same except for docker vs podman as the command. Have you considered maintaining only one set of scripts, with a parameter for docker vs podman ?

With the current implementation, the setup differs slightly because the docker_pool uses the "docker" work pool type, while Podman uses the "process" type—since Prefect does not natively support a "podman" type (reference).

I guess I was looking at this script thinking that docker run implies that it's not talking to the docker socket directly. https://github.com/xiaoyachong/mlex_prefect_worker/blob/8d27648c9ba0a186c544ab746e86b6e5760f71ed/flows/docker/bash_run_docker.sh#L16

@xiaoyachong
Author

For invoking a container from a script, I would think that the mechanisms are exactly the same except for docker vs podman as the command. Have you considered maintaining only one set of scripts, with a parameter for docker vs podman ?

With the current implementation, the setup differs slightly because the docker_pool uses the "docker" work pool type, while Podman uses the "process" type—since Prefect does not natively support a "podman" type (reference).

I guess I was looking at this script thinking that docker run implies that it's not talking to the docker socket directly. https://github.com/xiaoyachong/mlex_prefect_worker/blob/8d27648c9ba0a186c544ab746e86b6e5760f71ed/flows/docker/bash_run_docker.sh#L16

I’m not entirely sure about this, but here’s the explanation from Claude:
The docker run command does communicate with the Docker socket — it’s just abstracted away from the user.
Here’s how the flow works:

[Our Script] 
    ↓ 
[docker run command]
    ↓
[Docker CLI client]
    ↓ (sends API request via socket)
[/var/run/docker.sock]
    ↓
[Docker Daemon]
    ↓ (creates container)
[Container]

Member

@taxe10 taxe10 left a comment

Thanks for addressing the previous comments! These are a few new suggestions to consider:

  • We should avoid spawning docker containers from other docker containers as this will likely be restricted in deployment environments
  • The conda workers are currently not working due to a missing path to their source code - I added a suggestion on how we could manage this in the config file
  • Error tracking between parent and child flows - e.g. when a child flow fails, the error is not passed to the parent flow, which keeps the Completed status

config.yml Outdated
pca: "mlex_dimension_reduction_pca"
umap: "mlex_dimension_reduction_umap"
clustering: "mlex_clustering"
dlsia: "dlsia"
Member

I'm wondering if here we should add

algorithm_path:
  pytorch_autoencoder: /path/to/autoencoders
  .
  .
  dlsia: /path/to/dlsia # /src/train.py

Author

I came up with another method to determine the algorithm_path. Since we register the image_name in MLflow (e.g., ghcr.io/mlexchange/mlex_dlsia_segmentation_prototype), we can easily extract the algorithm folder name (mlex_dlsia_segmentation_prototype) from it. With ALGORITHMS_DIR defined in the .env file, the full path becomes ALGORITHMS_DIR/folder_name. I’ve updated my code accordingly, so we no longer need to define algorithm_path in config.yml.
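A small sketch of that derivation (the helper name is hypothetical; ALGORITHMS_DIR comes from .env):

# Sketch: derive the local algorithm path from the MLflow-registered image name.
import os

def algorithm_path_from_image(image_name: str) -> str:
    # "ghcr.io/mlexchange/mlex_dlsia_segmentation_prototype" -> "mlex_dlsia_segmentation_prototype"
    folder_name = image_name.rstrip("/").split("/")[-1]
    return os.path.join(os.environ["ALGORITHMS_DIR"], folder_name)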

I also found another limitation in the segmentation app when switching the worker type. There is a model_dir variable in io_parameter (loaded from WRITE_DIR in .env) that determines where DVC results are saved. For the conda flow, this should be a local machine path; for the Docker flow, it needs to be a container path. This means that whenever we switch worker types, we must also update that value on the application side, which is not ideal.

I’m considering saving the DVC results as artifacts directly to the MLflow server instead. This would avoid path differences but would require changes in both the DLSIA and segmentation repositories.
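If we go that route, the upload itself would be roughly (a sketch, assuming an active MLflow run; the directory and artifact path are placeholders):

# Sketch: log DVC outputs as MLflow artifacts instead of writing them to model_dir.
import mlflow

with mlflow.start_run(run_name="dlsia_training"):            # run name is a placeholder
    # ... training and DVC tracking happen here ...
    mlflow.log_artifacts("results/dvc", artifact_path="dvc")  # uploads the directory to the MLflow server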

config.yml Outdated
@@ -0,0 +1,32 @@
# MLExchange Job Configuration
# Default HPC type to use
hpc_type: "als"
Member

I'm also wondering if we should rename this to:

worker:
  name: "als"
  type: "podman"

Author

Sure. I just updated it.

Member

This could be addressed in the next PR - should we add a --run-background flag? That way we can maintain a single file for each worker.

Author

That sounds good. I will address it in the next PR.

@xiaoyachong xiaoyachong commented Nov 19, 2025

Thanks for addressing the previous comments! These are a few new suggestions to consider:

  • We should avoid spawning docker containers from other docker containers as this will likely be restricted in deployment environments
  • The conda workers are currently not working due to a missing path to their source code - I added a suggestion on how we could manage this in the config file
  • Error tracking between parent and child flows - e.g. when a child flow fails, the error is not passed to the parent flow, which keeps the Completed status

Thanks for your reviews! I’ve addressed all the comments and closed the issue for the third one. Both the Conda and Docker workers are now ready for testing.
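For the error-tracking point, one possible approach is to check the child run's final state in the parent and raise on failure (a sketch only, assuming run_deployment waits for the child flow run and returns its FlowRun; the helper name is hypothetical):

# Sketch: surface child-flow failures in the parent instead of keeping Completed status.
from prefect.deployments import run_deployment

def run_child_and_check(deployment_name: str, parameters: dict):
    flow_run = run_deployment(name=deployment_name, parameters=parameters)  # waits by default
    if flow_run.state is None or not flow_run.state.is_completed():
        state_name = flow_run.state.name if flow_run.state else "Unknown"
        raise RuntimeError(f"Child flow run '{flow_run.name}' ended in state {state_name}")
    return flow_run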

flows/utils.py Outdated
return FlowType.conda


def _get_conda_env_for_model(model_name: str, config: dict) -> str:
Member

I believe this function would need to be modified as new algorithms are registered. I wonder if we could use the {model_name}_{version} as the key to find the conda environment in the config file?

I am trying to register dlsia 2.0 (refactored)

Author

Sounds good. I’m using {model_name}_{version} as the keys for looking up the Conda environment names in config.yml:

# Conda environment settings
conda:
  conda_env_name:
    DLSIA MSDNet_0.0.1: "dlsia"
    DLSIA TUNet_0.0.1: "dlsia"
    DLSIA TUNet3+_0.0.1: "dlsia"

Regarding versioning, there are two types to consider. You may have noticed the custom version defined in models.json (e.g., 0.0.1), and the version number used by MLflow (e.g., Version 1, Version 2, which are integer-based). In config.yml, I’m using our custom-defined version.

Not directly related—but for full algorithm version support, there will need to be substantial changes in highres_segmentation (front-end UI, params_list, and algorithm registration), as well as on the Prefect worker side. These updates are not implemented yet and can be addressed in a future PR.
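To make the lookup concrete, here is a sketch of resolving the conda environment from that key (the function signature is hypothetical):

# Sketch: look up the conda environment using the {model_name}_{version} key.
def get_conda_env(model_name: str, model_version: str, config: dict) -> str:
    env_map = config.get("conda", {}).get("conda_env_name", {})
    key = f"{model_name}_{model_version}"     # e.g. "DLSIA MSDNet_0.0.1"
    if key not in env_map:
        raise KeyError(f"No conda environment configured for '{key}' in config.yml")
    return env_map[key]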
