140 changes: 140 additions & 0 deletions docs/irisrun.md
@@ -0,0 +1,140 @@
# irisrun

`irisrun` is a command-line tool for launching distributed Iris programs, similar to `torchrun`. It automatically manages distributed initialization by finding free ports and setting up the environment for multi-GPU execution.

## Features

- **Automatic Port Management**: Finds and uses a free TCP port, avoiding the stale-port conflicts left behind when a previous run crashes
- **Environment Setup**: Automatically sets `RANK`, `WORLD_SIZE`, `MASTER_ADDR`, and `MASTER_PORT` environment variables
- **Compatible with Existing Scripts**: Scripts can work with both `irisrun` and standalone execution

## Installation

`irisrun` is installed automatically when you install Iris:

```bash
pip install -e .
```

## Usage

Basic usage:

```bash
irisrun --nproc_per_node=N script.py [script_args...]
```

### Arguments

- `--nproc_per_node`: Number of processes to launch per node (typically the number of GPUs)
- `--master_addr`: Master node address (default: `127.0.0.1`)
- `--master_port`: Master node port (default: auto-selected free port)
- `script`: Python script to run
- `script_args`: Arguments to pass to the script

### Examples

Run the load benchmark on 2 GPUs:

```bash
irisrun --nproc_per_node=2 examples/00_load/load_bench.py --verbose
```

Run the store benchmark on 4 GPUs with custom buffer size:

```bash
irisrun --nproc_per_node=4 examples/01_store/store_bench.py --buffer_size 8192 --verbose
```

Run with a specific master port:

```bash
irisrun --nproc_per_node=2 --master_port=29600 examples/00_load/load_bench.py
```

## How It Works

1. `irisrun` finds a free TCP port, unless `--master_port` is specified (see the sketch after this list)
2. It spawns `N` processes using `torch.multiprocessing.spawn`
3. Each process gets environment variables set:
- `RANK`: The process rank (0 to N-1)
- `LOCAL_RANK`: Same as `RANK` for single-node execution
- `WORLD_SIZE`: Total number of processes
- `MASTER_ADDR`: Address of the master node
- `MASTER_PORT`: Port for distributed communication
4. The script executes in each process with these environment variables available
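
Step 1 comes down to binding an ephemeral port and reusing whatever the operating system hands back. A minimal sketch of that free-port selection, mirroring the launcher's `_find_free_port` helper:

```python
import socket


def find_free_port() -> int:
    """Ask the OS for an unused TCP port by binding to port 0."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
        s.bind(("127.0.0.1", 0))  # port 0 lets the OS choose any free port
        return s.getsockname()[1]
```

The chosen port is then exported to the workers as `MASTER_PORT` before they are spawned.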

## Updating Scripts to Support irisrun

Scripts can support both `irisrun` and standalone execution by checking for environment variables:

```python
import os

import torch
import torch.distributed as dist
import torch.multiprocessing as mp


def _worker(local_rank, world_size, init_url, args):
    backend = "nccl" if torch.cuda.is_available() else "gloo"

    # Check if running via irisrun
    if "RANK" in os.environ and "WORLD_SIZE" in os.environ:
        rank = int(os.environ["RANK"])
        world_size = int(os.environ["WORLD_SIZE"])
        master_addr = os.environ.get("MASTER_ADDR", "127.0.0.1")
        master_port = os.environ.get("MASTER_PORT", "29500")
        init_method = f"tcp://{master_addr}:{master_port}"

        dist.init_process_group(
            backend=backend,
            init_method=init_method,
            world_size=world_size,
            rank=rank,
            device_id=torch.device(f"cuda:{rank}"),
        )
    else:
        # Standalone execution with hardcoded port
        dist.init_process_group(
            backend=backend,
            init_method=init_url,
            world_size=world_size,
            rank=local_rank,
            device_id=torch.device(f"cuda:{local_rank}"),
        )


def main():
    args = parse_args()  # the script's own argument parser

    if "RANK" in os.environ and "WORLD_SIZE" in os.environ:
        # Running via irisrun - already spawned, run directly
        _worker(None, None, None, args)
    else:
        # Standalone - spawn processes
        init_url = "tcp://127.0.0.1:29500"
        mp.spawn(
            fn=_worker,
            args=(args["num_ranks"], init_url, args),
            nprocs=args["num_ranks"],
            join=True,
        )


if __name__ == "__main__":
    main()
```
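
With this pattern in place, the same file can be launched either way. Assuming it is saved as `my_bench.py` (a hypothetical name) and that its `parse_args()` accepts a `--num_ranks` flag like the benchmark scripts do:

```bash
# Via irisrun: the launcher spawns the processes and sets RANK/WORLD_SIZE
irisrun --nproc_per_node=2 my_bench.py

# Standalone: the script spawns its own processes on the hardcoded port 29500
python my_bench.py --num_ranks 2
```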

## Benefits

- **No Port Conflicts**: Automatically finds a free port, eliminating the stale-port conflicts that commonly follow a crashed run
- **Easier Development**: Simplifies multi-GPU development by handling distributed setup automatically
- **Cleaner Code**: Separates infrastructure concerns from application logic
- **Familiar Interface**: Similar to `torchrun`, making it easy for PyTorch users to adopt

## Troubleshooting

### Port Already in Use

If you specify `--master_port` and get a "port already in use" error, let `irisrun` auto-select a port by omitting the `--master_port` argument.

### CUDA Device Mismatch

Ensure `--nproc_per_node` matches the number of available GPUs or that `ROCR_VISIBLE_DEVICES` is set correctly.
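
For example, to pin the run to two specific GPUs on a ROCm system (the device indices below are illustrative):

```bash
ROCR_VISIBLE_DEVICES=0,1 irisrun --nproc_per_node=2 examples/00_load/load_bench.py
```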

### Script Not Found

Use absolute or relative paths to the script. For example:

```bash
irisrun --nproc_per_node=2 ./examples/00_load/load_bench.py
```
21 changes: 21 additions & 0 deletions examples/00_load/README.md
@@ -9,9 +9,30 @@ Load benchmark using Iris.

## Usage

### Using irisrun (Recommended)

The recommended way to run this example is using `irisrun`, which automatically manages port allocation:

```terminal
irisrun --nproc_per_node=8 examples/00_load/load_bench.py
```

With verbose output:

```terminal
irisrun --nproc_per_node=8 examples/00_load/load_bench.py --verbose
```

### Standalone execution

You can also run the example directly (uses hardcoded port 29500):

```terminal
python examples/00_load/load_bench.py --num_ranks 8
```

## Output

On an MI300X, this example will run on 8 GPUs. It prints:
```terminal
Unidirectional LOAD bandwidth GiB/s [Remote read]
55 changes: 41 additions & 14 deletions examples/00_load/load_bench.py
@@ -3,6 +3,7 @@
# Copyright (c) 2025 Advanced Micro Devices, Inc. All rights reserved.

import argparse
import os

import torch
import torch.distributed as dist
@@ -235,13 +236,33 @@ def print_bandwidth_matrix(matrix, label="Unidirectional LOAD bandwidth GiB/s [R
def _worker(local_rank: int, world_size: int, init_url: str, args: dict):
    """Worker function for PyTorch distributed execution."""
    backend = "nccl" if torch.cuda.is_available() else "gloo"
    dist.init_process_group(
        backend=backend,
        init_method=init_url,
        world_size=world_size,
        rank=local_rank,
        device_id=torch.device(f"cuda:{local_rank}"),
    )

    # Check if running via irisrun (environment variables will be set)
    if "RANK" in os.environ and "WORLD_SIZE" in os.environ:
        # Running via irisrun - use environment variables
        # In this mode, local_rank/world_size/init_url parameters are ignored
        rank = int(os.environ["RANK"])
        world_size_env = int(os.environ["WORLD_SIZE"])
        master_addr = os.environ.get("MASTER_ADDR", "127.0.0.1")
        master_port = os.environ.get("MASTER_PORT", "29500")
        init_method = f"tcp://{master_addr}:{master_port}"

        dist.init_process_group(
            backend=backend,
            init_method=init_method,
            world_size=world_size_env,
            rank=rank,
            device_id=torch.device(f"cuda:{rank}"),
        )
    else:
        # Running standalone - use provided parameters
        dist.init_process_group(
            backend=backend,
            init_method=init_url,
            world_size=world_size,
            rank=local_rank,
            device_id=torch.device(f"cuda:{local_rank}"),
        )

    # Main benchmark logic
    shmem = iris.iris(args["heap_size"])
@@ -283,13 +304,19 @@ def main():

    num_ranks = args["num_ranks"]

    init_url = "tcp://127.0.0.1:29500"
    mp.spawn(
        fn=_worker,
        args=(num_ranks, init_url, args),
        nprocs=num_ranks,
        join=True,
    )
    # Check if running via irisrun
    if "RANK" in os.environ and "WORLD_SIZE" in os.environ:
        # Running via irisrun - called once per process, run directly
        _worker(None, None, None, args)
    else:
        # Running standalone - spawn multiple processes
        init_url = "tcp://127.0.0.1:29500"
        mp.spawn(
            fn=_worker,
            args=(num_ranks, init_url, args),
            nprocs=num_ranks,
            join=True,
        )


if __name__ == "__main__":
4 changes: 4 additions & 0 deletions irisrun_cli/__init__.py
@@ -0,0 +1,4 @@
# SPDX-License-Identifier: MIT
# Copyright (c) 2025 Advanced Micro Devices, Inc. All rights reserved.

"""irisrun_cli package."""
139 changes: 139 additions & 0 deletions irisrun_cli/main.py
@@ -0,0 +1,139 @@
#!/usr/bin/env python3
# SPDX-License-Identifier: MIT
# Copyright (c) 2025 Advanced Micro Devices, Inc. All rights reserved.

"""
irisrun: A launcher for distributed Iris programs.

Similar to torchrun, this tool automatically manages distributed initialization
by finding free ports and setting up the environment for multi-GPU execution.

Usage:
    irisrun --nproc_per_node=N script.py [script_args...]

Example:
    irisrun --nproc_per_node=2 examples/00_load/load_bench.py --verbose
"""

import argparse
import os
import socket
import sys


def _find_free_port():
    """Find an available TCP port on localhost."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
        s.bind(("127.0.0.1", 0))
        return s.getsockname()[1]


def _distributed_worker(local_rank, world_size, master_addr, master_port, script_path, script_args):
    """Worker function that sets up environment and runs the target script."""
    # Set environment variables for distributed training
    os.environ["RANK"] = str(local_rank)
    os.environ["LOCAL_RANK"] = str(local_rank)
    os.environ["WORLD_SIZE"] = str(world_size)
    os.environ["MASTER_ADDR"] = master_addr
    os.environ["MASTER_PORT"] = str(master_port)

    # Set CUDA device for this process
    try:
        import torch

        if torch.cuda.is_available():
            torch.cuda.set_device(local_rank)
    except ImportError:
        pass  # torch may not be installed yet, that's ok

    # Restore sys.argv to make it appear as if the script was called directly
    sys.argv = [script_path] + script_args

    # Execute the script in the current namespace
    try:
        with open(script_path, encoding="utf-8") as f:
            code = compile(f.read(), script_path, "exec")
        exec(code, {"__name__": "__main__", "__file__": script_path})
    except SystemExit as e:
        # Propagate exit code from script
        sys.exit(e.code if isinstance(e.code, int) else 1)
    except Exception as e:
        print(f"Error in worker {local_rank}: {e}", file=sys.stderr)
        import traceback

        traceback.print_exc()
        sys.exit(1)


def main():
    """Main entry point for irisrun."""
    parser = argparse.ArgumentParser(
        description="Launch distributed Iris programs with automatic port management",
        formatter_class=argparse.RawDescriptionHelpFormatter,
        epilog="""
Examples:
  irisrun --nproc_per_node=2 examples/00_load/load_bench.py --verbose
  irisrun --nproc_per_node=4 examples/01_store/store_bench.py
""",
    )

    parser.add_argument(
        "--nproc_per_node",
        type=int,
        required=True,
        help="Number of processes to launch per node (typically number of GPUs)",
    )

    parser.add_argument(
        "--master_addr",
        type=str,
        default="127.0.0.1",
        help="Master node address (default: 127.0.0.1)",
    )

    parser.add_argument(
        "--master_port",
        type=int,
        default=None,
        help="Master node port (default: auto-selected free port)",
    )

    parser.add_argument("script", type=str, help="Python script to run")

    parser.add_argument("script_args", nargs=argparse.REMAINDER, help="Arguments for the script")

    args = parser.parse_args()

    # Find a free port if not specified
    master_port = args.master_port if args.master_port is not None else _find_free_port()
    master_addr = args.master_addr

    print(f"[irisrun] Launching {args.nproc_per_node} processes")
    print(f"[irisrun] Master address: {master_addr}:{master_port}")
    print(f"[irisrun] Script: {args.script}")
    print(f"[irisrun] Script args: {args.script_args}")

    try:
        # Import torch.multiprocessing here, after args are parsed
        import torch.multiprocessing as mp

        mp.spawn(
            _distributed_worker,
            args=(args.nproc_per_node, master_addr, master_port, args.script, args.script_args),
            nprocs=args.nproc_per_node,
            join=True,
        )
    except ImportError as e:
        print(f"[irisrun] Error: PyTorch is required to run irisrun: {e}", file=sys.stderr)
        sys.exit(1)
    except KeyboardInterrupt:
        print("\n[irisrun] Interrupted by user")
        sys.exit(130)
    except Exception as e:
        print(f"[irisrun] Error: {e}", file=sys.stderr)
        sys.exit(1)


if __name__ == "__main__":
    main()