# Ensemble Launcher

A lightweight, scalable tool for launching and orchestrating task ensembles across HPC clusters with intelligent resource management and hierarchical execution.
- Features
- Installation
- Quick Start
- Architecture
- Configuration
- Execution Modes
- Examples
- Performance Tuning
- API Reference
- Testing
- Contributing
- Support
## Features

- Flexible Execution: Support for serial, MPI, and mixed workloads
- Intelligent Scheduling: Automatic resource allocation with customizable policies
- Hierarchical Architecture: Efficient master-worker patterns for large-scale deployments (1-2048+ nodes)
- Multiple Communication Backends: Choose between Python multiprocessing, ZMQ, or DragonHPC for performance at scale
- Resource Pinning: Fine-grained CPU and GPU affinity control
- Real-time Monitoring: Track task execution with configurable status updates
- Fault Tolerance: Graceful handling of task failures with detailed error reporting
- Python & Shell Support: Execute Python callables or shell commands seamlessly
## Installation

### Requirements

- Python 3.6+
- numpy
- matplotlib
- scienceplots
- pytest
- cloudpickle
- pydantic
- pyzmq
- MPI implementation (for distributed execution via `mpirun` or `mpiexec`)
- DragonHPC (for extreme-scale deployment on HPC systems)
### Install from Source

```bash
git clone https://github.com/argonne-lcf/ensemble_launcher.git
cd ensemble_launcher
python3 -m pip install .
```
## Quick Start

Create a JSON configuration file describing your task ensemble:
```json
{
  "ensembles": {
    "example_ensemble": {
      "nnodes": 1,
      "ppn": 1,
      "cmd_template": "./exe -a {arg1} -b {arg2}",
      "arg1": "linspace(0, 10, 5)",
      "arg2": "linspace(0, 1, 5)",
      "relation": "one-to-one"
    }
  }
}
```
The configuration specifies an ensemble with:

- Tasks running on a single node with a single process per node
- Tasks executed with `./exe -a {arg1} -b {arg2}`, taking two input arguments
- The values of the two input arguments defined as 5 linearly spaced numbers between 0-10 and 0-1 for `arg1` and `arg2`, respectively
- The relationship between the values of the two arguments set to `one-to-one`, meaning the ensemble consists of 5 tasks, one for each pair of values
Supported Relations:

- `one-to-one`: Pair parameters element-wise (N tasks)
- `many-to-many`: Cartesian product of parameters (N×M tasks)
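To make the expansion concrete, here is a small sketch (illustrative only, not the launcher's internal parser) of how the two relations expand the Quick Start parameters:

```python
import itertools

import numpy as np

arg1 = np.linspace(0, 10, 5)  # [0.0, 2.5, 5.0, 7.5, 10.0]
arg2 = np.linspace(0, 1, 5)   # [0.0, 0.25, 0.5, 0.75, 1.0]

# one-to-one: element-wise pairing -> 5 tasks
one_to_one = [f"./exe -a {a} -b {b}" for a, b in zip(arg1, arg2)]

# many-to-many: Cartesian product -> 25 tasks
many_to_many = [f"./exe -a {a} -b {b}" for a, b in itertools.product(arg1, arg2)]
```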
Then launch the ensemble from a Python script:

```python
from ensemble_launcher import EnsembleLauncher

if __name__ == '__main__':
    # Auto-configure based on system and workload
    el = EnsembleLauncher("config.json")
    results = el.run()

    # Write results to file
    from ensemble_launcher import write_results_to_json
    write_results_to_json(results, "results.json")
```
Run the script:

```bash
python3 launcher_script.py
```
## Architecture

### Core Components

- EnsembleLauncher: Main API entry point with auto-configuration
- Global/Local Master: Orchestrates workers, handles task distribution and aggregation
- Worker: Executes tasks using configured executor
- Scheduler: Allocates resources across cluster nodes with intelligent policies
- Executors: Backend task launching engines (Python multiprocessing, MPI, DragonHPC)
- Communication Layer: ZMQ or Python multiprocessing pipes
### Hierarchical Scaling

The master-worker architecture scales from single nodes to thousands of nodes:
- Single Node (nlevels=0): Direct execution without master overhead
- Small Scale (nlevels=1): Global master coordinates workers directly
- Large Scale (nlevels=2): Global master → Local masters → Workers for thousands of tasks
- Extreme Scale (nlevels=3): Deep hierarchy for supercomputer-scale deployments
## Configuration

### Automatic Configuration

The launcher automatically configures itself based on your workload and system:
```python
from ensemble_launcher import EnsembleLauncher

el = EnsembleLauncher(
    ensemble_file="config.json",
    Nodes=["node-001", "node-002"],  # Optional: auto-detected from PBS_NODEFILE (PBS only)
    pin_resources=True,              # Enable CPU/GPU pinning
    return_stdout=True               # Capture task output
)
```
### Explicit Configuration

For fine-grained control, explicitly configure system and launcher settings:
```python
from ensemble_launcher import EnsembleLauncher
from ensemble_launcher.config import SystemConfig, LauncherConfig

# Define system resources
system_config = SystemConfig(
    name="my_cluster",
    ncpus=104,              # CPUs per node
    ngpus=12,               # GPUs per node
    cpus=list(range(104)),  # Specific CPU IDs (optional)
    gpus=list(range(12))    # Specific GPU IDs (optional)
)

# Configure launcher behavior
launcher_config = LauncherConfig(
    child_executor_name="mpi",  # multiprocessing, mpi, or dragon
    task_executor_name="mpi",   # Executor for tasks
    comm_name="zmq",            # multiprocessing, zmq, or dragon
    nlevels=2,                  # Hierarchy depth (auto-computed if None)
    report_interval=10.0,       # Status update frequency (seconds)
    return_stdout=True,         # Capture stdout
    worker_logs=True,           # Enable worker logging
    master_logs=True            # Enable master logging
)

el = EnsembleLauncher(
    ensemble_file="config.json",
    system_config=system_config,
    launcher_config=launcher_config,
    pin_resources=True
)
results = el.run()
```
### Resource Pinning

Pin tasks to specific CPUs and GPUs for optimal performance:
```json
{
  "ensembles": {
    "pinned_ensemble": {
      "nnodes": 1,
      "ppn": 4,
      "cmd_template": "./gpu_code",
      "cpu_affinity": "0,1,2,3",
      "gpu_affinity": "0,1,2,3",
      "ngpus_per_process": 1
    }
  }
}
```
Resources are pinned using the `gpu_selector` option in the `LauncherConfig` (defaults to `"ZE_AFFINITY_MASK"` for Intel GPUs). The value assigned to `gpu_selector` at run time depends on the `SystemConfig`. For example, setting:
```python
system_config = SystemConfig(
    name="my_cluster",
    cpus=list(range(104)),               # Specific CPU IDs (optional)
    gpus=['0', '0', '1', '1', '2', '3']  # Specific GPU IDs (optional)
)
```
will oversubscribe GPUs 0 and 1, and the Scheduler will assume the node has 6 GPUs instead of 4.
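On non-Intel hardware the selector string would presumably change; for example, a minimal sketch for NVIDIA GPUs, assuming `CUDA_VISIBLE_DEVICES` is the appropriate selector there:

```python
from ensemble_launcher.config import LauncherConfig

launcher_config = LauncherConfig(
    gpu_selector="CUDA_VISIBLE_DEVICES"  # assumption: env var used to pin NVIDIA GPUs
)
```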
## Execution Modes

### Python Tasks

Execute Python functions directly:
```python
from ensemble_launcher import EnsembleLauncher
from ensemble_launcher.ensemble import Task

def my_simulation(param_a, param_b):
    # Your simulation code goes here; return its result
    return param_a * param_b

tasks = {
    "task-1": Task(
        task_id="task-1",
        nnodes=1,
        ppn=1,
        executable=my_simulation,
        args=(10, 0.5)
    )
}

el = EnsembleLauncher(
    ensemble_file=tasks,  # Pass dict directly
    return_stdout=True
)
results = el.run()
```
Note that, internally, the dictionary definition of an ensemble is converted to a collection of `Task` objects.
### Shell Command Tasks

Execute binaries and shell commands with files as inputs:
```json
{
  "ensembles": {
    "shell_ensemble": {
      "nnodes": 1,
      "ppn": 1,
      "cmd_template": "./simulation --config {config_file}",
      "config_file": ["config1.json", "config2.json", "config3.json"],
      "relation": "one-to-one"
    }
  }
}
```
This ensemble is launched using the following script:
```python
from ensemble_launcher import EnsembleLauncher

if __name__ == '__main__':
    # Auto-configure based on system and workload
    el = EnsembleLauncher("config.json")
    results = el.run()

    # Write results to file
    from ensemble_launcher import write_results_to_json
    write_results_to_json(results, "results.json")
```
---
## Examples
See the [`examples`](examples/) directory for complete workflow samples:
### C++ Examples
- [`examples/c++/workflow_pattern1.py`](examples/c++/workflow_pattern1.py) - Basic parallel execution
- [`examples/c++/workflow_pattern2.py`](examples/c++/workflow_pattern2.py) - Parameter sweeps
- [`examples/c++/workflow_pattern3.py`](examples/c++/workflow_pattern3.py) - Complex dependencies
---
## Performance Tuning
### Communication Backend Selection
| Backend | Best For | Nodes |
|------------------|-----------------------------|----------|
| `multiprocessing`| Single node, small ensembles| 1 |
| `zmq` | Multi-node, large scale | 2-2048+ |
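For instance, a multi-node run would opt out of the default `multiprocessing` backend (a minimal sketch using the `LauncherConfig` options listed in the API Reference):

```python
from ensemble_launcher.config import LauncherConfig

# ZMQ scales across nodes; multiprocessing pipes are limited to one node
launcher_config = LauncherConfig(comm_name="zmq")
```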
### Hierarchy Levels
The launcher automatically determines hierarchy depth based on node count, but you can override it with:
```python
launcher_config = LauncherConfig(
    # nlevels=0: direct worker execution (single node)
    # nlevels=1: master + workers (up to ~64 nodes)
    # nlevels=2: master + sub-masters + workers (64-2048 nodes)
    # nlevels=3: deep hierarchy (2048+ nodes)
    nlevels=2
)
```
Auto-computed hierarchy:

- 1 node: `nlevels=0` (worker only)
- 2-64 nodes: `nlevels=1` (master + workers)
- 65-2048 nodes: `nlevels=2` (master + sub-masters + workers)
- 2048+ nodes: `nlevels=3` (deep hierarchy)
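These thresholds can be summarized in a small helper (illustrative only; `pick_nlevels` is not part of the launcher's API):

```python
def pick_nlevels(nnodes: int) -> int:
    """Mirror the auto-computed hierarchy thresholds listed above."""
    if nnodes == 1:
        return 0    # worker only
    if nnodes <= 64:
        return 1    # master + workers
    if nnodes <= 2048:
        return 2    # master + sub-masters + workers
    return 3        # deep hierarchy
```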
### Logging and Profiling

Enable logging for detailed execution traces:
```python
# import logging
# logging.basicConfig(level=logging.INFO)

launcher_config = LauncherConfig(
    worker_logs=True,
    master_logs=True,
    report_interval=5.0,  # Report status every 5 seconds
    profile="basic"       # or "timeline": "basic" outputs communication latencies
                          # and task runtimes; "timeline" outputs the mean, std,
                          # sum, and counts of various events in the orchestrator
)
```
Logs are written to `logs/master-*.log` and `logs/worker-*.log`. Profiles are written to `profiles/*`.
## API Reference

### EnsembleLauncher

```python
EnsembleLauncher(
    ensemble_file: Union[str, Dict[str, Dict]],
    system_config: SystemConfig = SystemConfig(name="local"),
    launcher_config: Optional[LauncherConfig] = None,
    Nodes: Optional[List[str]] = None,
    pin_resources: bool = True,
    return_stdout: bool = False
)
```
Parameters:

- `ensemble_file`: Path to JSON config or dict of task definitions
- `system_config`: System resource configuration
- `launcher_config`: Launcher behavior configuration (auto-configured if None)
- `Nodes`: List of compute nodes (auto-detected if None)
- `pin_resources`: Enable CPU/GPU affinity
- `return_stdout`: Capture task stdout
Methods:

- `run()`: Execute ensemble and return results
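A minimal end-to-end call, assuming `config.json` follows the Quick Start schema:

```python
from ensemble_launcher import EnsembleLauncher

el = EnsembleLauncher("config.json", return_stdout=True)
results = el.run()
```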
### SystemConfig

```python
SystemConfig(
    name: str,
    ncpus: int = mp.cpu_count(),
    ngpus: int = 0,
    cpus: List[int] = [],
    gpus: List[Union[str, int]] = []
)
```
### LauncherConfig

```python
LauncherConfig(
    child_executor_name: Literal["multiprocessing", "dragon", "mpi"] = "multiprocessing",
    task_executor_name: Literal["multiprocessing", "dragon", "mpi"] = "multiprocessing",
    comm_name: Literal["multiprocessing", "zmq", "dragon"] = "multiprocessing",
    report_interval: float = 10.0,
    nlevels: int = 1,
    return_stdout: bool = False,
    worker_logs: bool = False,
    master_logs: bool = False,
    nchildren: Optional[int] = None,  # Forces the number of children at every level
    profile: Optional[Literal["basic", "timeline"]] = None,
    gpu_selector: str = "ZE_AFFINITY_MASK"
)
```
## Testing

Run the test suite:

```bash
pytest tests/
```
Run specific tests:

```bash
pytest tests/test_el.py        # End-to-end tests
pytest tests/test_executor.py  # Executor tests
pytest tests/test_master.py    # Master/Worker tests
```
## Contributing

We welcome contributions! Please:

- Fork the repository
- Create a feature branch (`git checkout -b feature/amazing-feature`)
- Commit changes (`git commit -m 'Add amazing feature'`)
- Push to the branch (`git push origin feature/amazing-feature`)
- Open a Pull Request
Development setup:

```bash
git clone https://github.com/argonne-lcf/ensemble_launcher.git
cd ensemble_launcher
python3 -m pip install -e ".[dev]"
pytest tests/
```
## Support

- Issues: GitHub Issues
- Discussions: GitHub Discussions
- Documentation: See the [`examples`](examples/) directory
## Acknowledgments

This work was supported by the U.S. Department of Energy, Office of Science, under contract DE-AC02-06CH11357.
## Citation

If you use Ensemble Launcher in your research, please cite:
```bibtex
@software{ensemble_launcher,
  title  = {Ensemble Launcher: Scalable Task Orchestration for HPC},
  author = {Argonne National Laboratory},
  year   = {2025},
  url    = {https://github.com/argonne-lcf/ensemble_launcher}
}
```