140 changes: 140 additions & 0 deletions docs/irisrun.md
@@ -0,0 +1,140 @@
# irisrun

`irisrun` is a command-line tool for launching distributed Iris programs, similar to `torchrun`. It automatically manages distributed initialization by finding free ports and setting up the environment for multi-GPU execution.

## Features

- **Automatic Port Management**: Finds and uses a free TCP port, avoiding the stale-port conflicts left behind when a previous run crashes
- **Environment Setup**: Automatically sets `RANK`, `WORLD_SIZE`, `MASTER_ADDR`, and `MASTER_PORT` environment variables
- **Compatible with Existing Scripts**: Scripts can work with both `irisrun` and standalone execution

## Installation

`irisrun` is installed automatically when you install Iris:

```bash
pip install -e .
```

## Usage

Basic usage:

```bash
irisrun --nproc_per_node=N script.py [script_args...]
```

### Arguments

- `--nproc_per_node`: Number of processes to launch per node (typically the number of GPUs)
- `--master_addr`: Master node address (default: `127.0.0.1`)
- `--master_port`: Master node port (default: auto-selected free port)
- `script`: Python script to run
- `script_args`: Arguments to pass to the script

### Examples

Run the load benchmark on 2 GPUs:

```bash
irisrun --nproc_per_node=2 examples/00_load/load_bench.py --verbose
```

Run the store benchmark on 4 GPUs with custom buffer size:

```bash
irisrun --nproc_per_node=4 examples/01_store/store_bench.py --buffer_size 8192 --verbose
```

Run with a specific master port:

```bash
irisrun --nproc_per_node=2 --master_port=29600 examples/00_load/load_bench.py
```

## How It Works

1. `irisrun` finds a free TCP port, unless `--master_port` is specified (see the sketch after this list)
2. It spawns `N` processes using `torch.multiprocessing.spawn`
3. Each process gets environment variables set:
- `RANK`: The process rank (0 to N-1)
- `LOCAL_RANK`: Same as `RANK` for single-node execution
- `WORLD_SIZE`: Total number of processes
- `MASTER_ADDR`: Address of the master node
- `MASTER_PORT`: Port for distributed communication
4. The script executes in each process with these environment variables available
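
Step 1 comes down to binding an ephemeral port and reusing whatever the operating system hands back. A minimal sketch of that free-port selection, mirroring the launcher's `_find_free_port` helper:

```python
import socket


def find_free_port() -> int:
    """Ask the OS for an unused TCP port by binding to port 0."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
        s.bind(("127.0.0.1", 0))  # port 0 lets the OS choose any free port
        return s.getsockname()[1]
```

The chosen port is then exported to the workers as `MASTER_PORT` before they are spawned.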

## Updating Scripts to Support irisrun

Scripts can support both `irisrun` and standalone execution by checking for environment variables:

```python
import os

import torch
import torch.distributed as dist
import torch.multiprocessing as mp


def _worker(local_rank, world_size, init_url, args):
    backend = "nccl" if torch.cuda.is_available() else "gloo"

    # Check if running via irisrun
    if "RANK" in os.environ and "WORLD_SIZE" in os.environ:
        rank = int(os.environ["RANK"])
        world_size = int(os.environ["WORLD_SIZE"])
        master_addr = os.environ.get("MASTER_ADDR", "127.0.0.1")
        master_port = os.environ.get("MASTER_PORT", "29500")
        init_method = f"tcp://{master_addr}:{master_port}"

        dist.init_process_group(
            backend=backend,
            init_method=init_method,
            world_size=world_size,
            rank=rank,
            device_id=torch.device(f"cuda:{rank}"),
        )
    else:
        # Standalone execution with hardcoded port
        dist.init_process_group(
            backend=backend,
            init_method=init_url,
            world_size=world_size,
            rank=local_rank,
            device_id=torch.device(f"cuda:{local_rank}"),
        )


def main():
    args = parse_args()  # the script's own argument parser

    if "RANK" in os.environ and "WORLD_SIZE" in os.environ:
        # Running via irisrun - already spawned, run directly
        _worker(None, None, None, args)
    else:
        # Standalone - spawn processes
        init_url = "tcp://127.0.0.1:29500"
        mp.spawn(
            fn=_worker,
            args=(args["num_ranks"], init_url, args),
            nprocs=args["num_ranks"],
            join=True,
        )


if __name__ == "__main__":
    main()
```
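
With this pattern in place, the same file can be launched either way. Assuming it is saved as `my_bench.py` (a hypothetical name) and that its `parse_args()` accepts a `--num_ranks` flag like the benchmark scripts do:

```bash
# Via irisrun: the launcher spawns the processes and sets RANK/WORLD_SIZE
irisrun --nproc_per_node=2 my_bench.py

# Standalone: the script spawns its own processes on the hardcoded port 29500
python my_bench.py --num_ranks 2
```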

## Benefits

- **No Port Conflicts**: Automatically finds a free port, eliminating the stale-port conflicts that commonly follow a crashed run
- **Easier Development**: Simplifies multi-GPU development by handling distributed setup automatically
- **Cleaner Code**: Separates infrastructure concerns from application logic
- **Familiar Interface**: Similar to `torchrun`, making it easy for PyTorch users to adopt

## Troubleshooting

### Port Already in Use

If you specify `--master_port` and get a "port already in use" error, let `irisrun` auto-select a port by omitting the `--master_port` argument.

### CUDA Device Mismatch

Ensure `--nproc_per_node` matches the number of available GPUs or that `ROCR_VISIBLE_DEVICES` is set correctly.
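
For example, to pin the run to two specific GPUs on a ROCm system (the device indices below are illustrative):

```bash
ROCR_VISIBLE_DEVICES=0,1 irisrun --nproc_per_node=2 examples/00_load/load_bench.py
```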

### Script Not Found

Use absolute or relative paths to the script. For example:

```bash
irisrun --nproc_per_node=2 ./examples/00_load/load_bench.py
```
21 changes: 21 additions & 0 deletions examples/00_load/README.md
@@ -9,9 +9,30 @@ Load benchmark using Iris.

## Usage

### Using irisrun (Recommended)

The recommended way to run this example is using `irisrun`, which automatically manages port allocation:

```terminal
irisrun --nproc_per_node=8 examples/00_load/load_bench.py
```

With verbose output:

```terminal
irisrun --nproc_per_node=8 examples/00_load/load_bench.py --verbose
```

### Standalone execution

You can also run the example directly (uses hardcoded port 29500):

```terminal
python examples/00_load/load_bench.py --num_ranks 8
```

## Output

On an MI300X, this example will run on 8 GPUs. It prints:
```terminal
Unidirectional LOAD bandwidth GiB/s [Remote read]
55 changes: 41 additions & 14 deletions examples/00_load/load_bench.py
@@ -3,6 +3,7 @@
# Copyright (c) 2025 Advanced Micro Devices, Inc. All rights reserved.

import argparse
import os

import torch
import torch.distributed as dist
@@ -235,13 +236,33 @@ def print_bandwidth_matrix(matrix, label="Unidirectional LOAD bandwidth GiB/s [R
def _worker(local_rank: int, world_size: int, init_url: str, args: dict):
    """Worker function for PyTorch distributed execution."""
    backend = "nccl" if torch.cuda.is_available() else "gloo"
    dist.init_process_group(
        backend=backend,
        init_method=init_url,
        world_size=world_size,
        rank=local_rank,
        device_id=torch.device(f"cuda:{local_rank}"),
    )

    # Check if running via irisrun (environment variables will be set)
    if "RANK" in os.environ and "WORLD_SIZE" in os.environ:
        # Running via irisrun - use environment variables
        # In this mode, local_rank/world_size/init_url parameters are ignored
        rank = int(os.environ["RANK"])
        world_size_env = int(os.environ["WORLD_SIZE"])
        master_addr = os.environ.get("MASTER_ADDR", "127.0.0.1")
        master_port = os.environ.get("MASTER_PORT", "29500")
        init_method = f"tcp://{master_addr}:{master_port}"

        dist.init_process_group(
            backend=backend,
            init_method=init_method,
            world_size=world_size_env,
            rank=rank,
            device_id=torch.device(f"cuda:{rank}"),
        )
    else:
        # Running standalone - use provided parameters
        dist.init_process_group(
            backend=backend,
            init_method=init_url,
            world_size=world_size,
            rank=local_rank,
            device_id=torch.device(f"cuda:{local_rank}"),
        )

    # Main benchmark logic
    shmem = iris.iris(args["heap_size"])
@@ -283,13 +304,19 @@ def main():

    num_ranks = args["num_ranks"]

    init_url = "tcp://127.0.0.1:29500"
    mp.spawn(
        fn=_worker,
        args=(num_ranks, init_url, args),
        nprocs=num_ranks,
        join=True,
    )
    # Check if running via irisrun
    if "RANK" in os.environ and "WORLD_SIZE" in os.environ:
        # Running via irisrun - called once per process, run directly
        _worker(None, None, None, args)
    else:
        # Running standalone - spawn multiple processes
        init_url = "tcp://127.0.0.1:29500"
        mp.spawn(
            fn=_worker,
            args=(num_ranks, init_url, args),
            nprocs=num_ranks,
            join=True,
        )


if __name__ == "__main__":
4 changes: 4 additions & 0 deletions irisrun_cli/__init__.py
@@ -0,0 +1,4 @@
# SPDX-License-Identifier: MIT
# Copyright (c) 2025 Advanced Micro Devices, Inc. All rights reserved.

"""irisrun_cli package."""
139 changes: 139 additions & 0 deletions irisrun_cli/main.py
@@ -0,0 +1,139 @@
#!/usr/bin/env python3
# SPDX-License-Identifier: MIT
# Copyright (c) 2025 Advanced Micro Devices, Inc. All rights reserved.

"""
irisrun: A launcher for distributed Iris programs.

Similar to torchrun, this tool automatically manages distributed initialization
by finding free ports and setting up the environment for multi-GPU execution.

Usage:
    irisrun --nproc_per_node=N script.py [script_args...]

Example:
    irisrun --nproc_per_node=2 examples/00_load/load_bench.py --verbose
"""

import argparse
import os
import socket
import sys


def _find_free_port():
    """Find an available TCP port on localhost."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
        s.bind(("127.0.0.1", 0))
        return s.getsockname()[1]


def _distributed_worker(local_rank, world_size, master_addr, master_port, script_path, script_args):
    """Worker function that sets up environment and runs the target script."""
    # Set environment variables for distributed training
    os.environ["RANK"] = str(local_rank)
    os.environ["LOCAL_RANK"] = str(local_rank)
    os.environ["WORLD_SIZE"] = str(world_size)
    os.environ["MASTER_ADDR"] = master_addr
    os.environ["MASTER_PORT"] = str(master_port)

    # Set CUDA device for this process
    try:
        import torch

        if torch.cuda.is_available():
            torch.cuda.set_device(local_rank)
    except ImportError:
        pass  # torch may not be installed yet, that's ok

    # Restore sys.argv to make it appear as if the script was called directly
    sys.argv = [script_path] + script_args

    # Execute the script in the current namespace
    try:
        with open(script_path, encoding="utf-8") as f:
            code = compile(f.read(), script_path, "exec")
        exec(code, {"__name__": "__main__", "__file__": script_path})
    except SystemExit as e:
        # Propagate exit code from script
        sys.exit(e.code if isinstance(e.code, int) else 1)
    except Exception as e:
        print(f"Error in worker {local_rank}: {e}", file=sys.stderr)
        import traceback

        traceback.print_exc()
        sys.exit(1)


def main():
    """Main entry point for irisrun."""
    parser = argparse.ArgumentParser(
        description="Launch distributed Iris programs with automatic port management",
        formatter_class=argparse.RawDescriptionHelpFormatter,
        epilog="""
Examples:
  irisrun --nproc_per_node=2 examples/00_load/load_bench.py --verbose
  irisrun --nproc_per_node=4 examples/01_store/store_bench.py
""",
    )

    parser.add_argument(
        "--nproc_per_node",
        type=int,
        required=True,
        help="Number of processes to launch per node (typically number of GPUs)",
    )

    parser.add_argument(
        "--master_addr",
        type=str,
        default="127.0.0.1",
        help="Master node address (default: 127.0.0.1)",
    )

    parser.add_argument(
        "--master_port",
        type=int,
        default=None,
        help="Master node port (default: auto-selected free port)",
    )

    parser.add_argument("script", type=str, help="Python script to run")

    parser.add_argument("script_args", nargs=argparse.REMAINDER, help="Arguments for the script")

    args = parser.parse_args()

    # Find a free port if not specified
    master_port = args.master_port if args.master_port is not None else _find_free_port()
    master_addr = args.master_addr

    print(f"[irisrun] Launching {args.nproc_per_node} processes")
    print(f"[irisrun] Master address: {master_addr}:{master_port}")
    print(f"[irisrun] Script: {args.script}")
    print(f"[irisrun] Script args: {args.script_args}")

    try:
        # Import torch.multiprocessing here, after args are parsed
        import torch.multiprocessing as mp

        mp.spawn(
            _distributed_worker,
            args=(args.nproc_per_node, master_addr, master_port, args.script, args.script_args),
            nprocs=args.nproc_per_node,
            join=True,
        )
    except ImportError as e:
        print(f"[irisrun] Error: PyTorch is required to run irisrun: {e}", file=sys.stderr)
        sys.exit(1)
    except KeyboardInterrupt:
        print("\n[irisrun] Interrupted by user")
        sys.exit(130)
    except Exception as e:
        print(f"[irisrun] Error: {e}", file=sys.stderr)
        sys.exit(1)


if __name__ == "__main__":
    main()