This guide covers prerequisites, installation, and initial setup for AORTA.
Prerequisites:
- PyTorch >= 2.2 with FSDP2 APIs (ROCm 7/RCCL)
- ROCm tooling (`rocm-smi`, `rocminfo`)
- PyYAML, matplotlib
- GPU nodes with RCCL-capable interconnects
- Sufficient GPU memory for the configured model (see `config/default.yaml`)
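A quick way to confirm the installed PyTorch meets the version floor is to compare its version string. This is a minimal sketch; the helper name is ours, not part of AORTA:

```python
import re

def meets_min_version(version: str, minimum: tuple) -> bool:
    """Compare a version string like '2.5.0a0+rocm6.2' against (major, minor)."""
    parts = re.findall(r"\d+", version)
    return tuple(int(p) for p in parts[: len(minimum)]) >= minimum

try:
    import torch
    ok = meets_min_version(torch.__version__, (2, 2))
    print(f"PyTorch {torch.__version__}: {'OK' if ok else 'too old (need >= 2.2)'}")
except ImportError:
    print("PyTorch is not installed")
```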
Assumptions:
- Any TorchTitan components required by your wider stack are pre-installed (the synthetic workload does not import TorchTitan directly).
- The code gracefully degrades when optional dependencies are absent.
- All processes run under a job launcher that sets `LOCAL_RANK` (e.g., `torchrun`, Slurm, or similar).
- The synthetic dataset is intended for profiling and does not reflect production data distributions.
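Two of the assumptions above can be made concrete with a small sketch: optional dependencies are imported defensively, and the rank is read from the launcher-provided environment. The names here are illustrative, not AORTA's actual API:

```python
import os

# Optional dependency: degrade gracefully if matplotlib is absent.
try:
    import matplotlib.pyplot as plt  # noqa: F401
    HAVE_MATPLOTLIB = True
except ImportError:
    HAVE_MATPLOTLIB = False

def local_rank(default: int = 0) -> int:
    """Read LOCAL_RANK as set by torchrun/Slurm; fall back for single-process runs."""
    return int(os.environ.get("LOCAL_RANK", default))

if __name__ == "__main__":
    print(f"local rank: {local_rank()}, plotting available: {HAVE_MATPLOTLIB}")
```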
Training runs in Docker containers with all dependencies pre-installed.
The interactive setup script guides you through creating your personal `.env` configuration:

```bash
cd docker
bash setup-env.sh
docker compose -f docker-compose.build.yaml up -d
```

The script will prompt you to:
- Select a Dockerfile (ROCm version, with/without Shampoo optimizer, etc.)
- Choose a container name (defaults to `${USER}-${variant}-${date}`)
- Configure workspace and RCCL paths
- Set up optional volume mounts
For more control, manually create your `.env` file:

```bash
cd docker
cp .env.example .env

# Edit .env with your preferred editor
nano .env  # or vim, code, etc.

docker compose -f docker-compose.build.yaml up -d
```

Available Dockerfiles:
- `Dockerfile.rocm70_9-1` - Standard ROCm 7.0.9.1 build
- `Dockerfile.rocm70_9-1-shampoo` - ROCm 7.0.9.1 with Shampoo optimizer
- `Dockerfile.rocm70_2-ubuntu-pytorch` - ROCm 7.0.2 Ubuntu PyTorch build
- `Dockerfile.rocm70_2-ubuntu-nan` - ROCm 7.0.2 with NaN debugging tools
Example `.env` configurations:

For standard ROCm development:

```bash
DOCKERFILE=Dockerfile.rocm70_9-1
CONTAINER_NAME=myuser-rocm70-dev
AORTA_WORKSPACE=..
RCCL_PATH=/tmp/rccl_placeholder
```

For Shampoo optimizer testing with custom RCCL:
```bash
DOCKERFILE=Dockerfile.rocm70_9-1-shampoo
CONTAINER_NAME=myuser-shampoo-exp1
AORTA_WORKSPACE=/apps/username/aorta_work/aorta_1
RCCL_PATH=/apps/username/rccl
```

If you prefer using a pre-built image instead of building from a Dockerfile:

```bash
cd docker
docker compose up -d
```

This uses the default `docker-compose.yaml` with a pre-configured image.
Connect to the running container via CLI or VSCode:
```bash
# Via Docker CLI
docker exec -it <your-container-name> bash

# Or use VSCode's "Attach to Running Container" feature
```

For example, to run the TorchRec distributed training benchmark from inside the container:

```bash
python -m torchrec.distributed.benchmark.benchmark_train_pipeline \
  --yaml_config=$ROOT/config/torchrec_dist/sparse_data_dist_base.yaml \
  --name="sparse_data_dist_q_contend$(git rev-parse --short HEAD || echo $USER)"
```

This captures a profiler trace file locally.
What runs in Docker:
- `train.py` - Model training
- Distributed workloads
- GPU profiling
A local Python environment is needed for running analysis scripts and processing traces.
We recommend using `uv` for fast, reliable Python environment management.
```bash
# Install uv (if not already installed)
curl -LsSf https://astral.sh/uv/install.sh | sh

# Clone the repository
git clone https://github.com/ROCm/aorta.git
cd aorta

# Create and activate a virtual environment
uv venv && source .venv/bin/activate

# Install PyTorch nightly for ROCm 7.1
uv pip install --pre torch torchvision torchaudio --index-url https://download.pytorch.org/whl/nightly/rocm7.1/

# Install dependencies for analysis scripts
uv pip install -r requirements.txt

# For contributors: install development tools (pytest, pre-commit, etc.)
uv pip install -r requirements-dev.txt
pre-commit install
```

What runs locally:
- `scripts/utils/merge_gpu_trace_ranks.py` - Merge distributed traces
- `analysis/overlap_report.py` - Generate analysis reports
- `scripts/analyze_*.py` - Analysis utilities
- Test suite (`pytest tests/`)
Tips:
- On ROCm systems, verify `rocm-smi` and `rocminfo` are in `$PATH`.
- Run scripts from the repository root so path bootstrapping works correctly.
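The `$PATH` check can also be done programmatically. This is a generic sketch; AORTA itself may perform the check differently:

```python
import shutil

def missing_tools(tools=("rocm-smi", "rocminfo")):
    """Return the subset of required CLI tools not found on $PATH."""
    return [t for t in tools if shutil.which(t) is None]

if __name__ == "__main__":
    missing = missing_tools()
    if missing:
        print("missing from PATH:", ", ".join(missing))
    else:
        print("ROCm tooling found")
```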
Next steps:
- Running the Benchmark - Launch your first training run
- Configuration Guide - Customize model and training parameters