This guide covers how to capture, analyze, and interpret profiling data from AORTA benchmark runs.
Each rank writes `artifacts/rank_<rank>_metrics.jsonl` containing iteration-level telemetry:
- Per-stream durations (ms) for compute, all-reduce, reduce-scatter, and auxiliary streams
- Overlap segments with concurrency statistics and utilisation ratios
- ROCm diagnostic output when enabled (rank-local)
- Loss, learning rate, gradient norms, and global step counters
Events are captured using `torch.cuda.Event(enable_timing=True)` for microsecond fidelity. Distributed collectives are monkey-patched at runtime so that all-reduce and reduce-scatter operations execute on dedicated streams and contribute to the overlap calculation.
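As a starting point for working with these files, the sketch below parses one rank's JSONL log and averages a numeric field. It assumes each line is a standalone JSON object and that `overlap_ratio` appears as a top-level key; the helper names `parse_metrics`, `load_rank_metrics`, and `mean_field` are hypothetical and not part of the toolkit.

```python
import json
from pathlib import Path
from statistics import mean


def parse_metrics(lines):
    """Turn an iterable of JSONL lines into a list of dicts, skipping blank lines."""
    records = []
    for line in lines:
        line = line.strip()
        if line:
            records.append(json.loads(line))
    return records


def load_rank_metrics(path):
    """Read one rank's metrics file, e.g. artifacts/rank_0_metrics.jsonl."""
    with Path(path).open() as handle:
        return parse_metrics(handle)


def mean_field(records, field):
    """Average a numeric field over the records that contain it (None if absent).

    Assumption: the field is a top-level key in each record.
    """
    values = [rec[field] for rec in records if field in rec]
    return mean(values) if values else None
```

For example, `mean_field(load_rank_metrics("artifacts/rank_0_metrics.jsonl"), "overlap_ratio")` would give the average overlap for rank 0, provided the records are laid out as assumed.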
Enable PyTorch's profiler by toggling the profiling block in your config or via CLI override:
torchrun --nproc_per_node 4 train.py \
--config config/default.yaml \
--override profiling.enabled=true \
--override profiling.wait=1 \
--override profiling.warmup=1 \
--override profiling.active=2

- TensorBoard traces write to `artifacts/torch_profiler/rank*/` by default
- Launch `tensorboard --logdir artifacts/torch_profiler` and use the Profile tab for stream timelines
- Enable Chrome trace export via `profiling.chrome_trace=true`
- Not recommended on ROCm: the toolkit disables Chrome traces automatically to avoid known Kineto crashes
- Adjust `wait`, `warmup`, `active`, and `repeat` to control capture cadence
- Shapes and memory statistics are recorded by default
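How these four values shape the capture cadence can be sketched in plain Python. The function below follows the cycle semantics documented for `torch.profiler.schedule` (leaving out its `skip_first` option). That the toolkit forwards the config values to that scheduler unchanged is an assumption, and `profiler_phase` is a hypothetical helper name.

```python
def profiler_phase(step, wait, warmup, active, repeat=0):
    """Return the phase a training step falls into for a wait/warmup/active cycle.

    repeat=0 means the cycle repeats for the whole run; a positive value
    stops profiling after that many complete cycles.
    """
    cycle = wait + warmup + active
    if repeat > 0 and step >= repeat * cycle:
        return "none"  # all requested cycles have finished
    pos = step % cycle
    if pos < wait:
        return "none"  # profiler idle
    if pos < wait + warmup:
        return "warmup"  # tracing starts, results discarded
    if pos < cycle - 1:
        return "record"  # active capture
    return "record_and_save"  # last active step: trace is flushed


if __name__ == "__main__":
    for step in range(8):
        print(step, profiler_phase(step, wait=1, warmup=1, active=2))
```

With `wait=1`, `warmup=1`, and `active=2`, steps 0 to 3 map to none, warmup, record, and record_and_save, and the pattern then repeats every four steps unless `repeat` limits it.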
Use the wrapper script to profile an entire ROCm run:
bash scripts/rocprof_capture.sh config/default.yaml --override training.max_steps=50

Outputs land under `rocprof_traces/run_<timestamp>/`.
Override location or extra flags with environment variables:
ROCPROF_OUTPUT_DIR=/path/to/out
ROCPROF_ARGS="--att --kernel-trace --kernel-symbols"
The script mirrors launch_rocm.sh but executes through rocprofv3, so you can merge traces with the JSONL metrics using the shared iteration timestamps.
Run the analyser to build summaries and plots from one or more log directories:
python analysis/overlap_report.py \
--log-dir artifacts_rocm --label rocm \
--log-dir artifacts_cuda --label cuda \
--output reports/2024-roc-vs-cuda \
--reference cuda --candidate rocm

- `summary.json` - Aggregate metrics per dataset plus comparative ratios
- `{label}_timeline.png` - Overlays showing compute and overlap durations per global step
Use these artefacts to pinpoint scheduling or synchronisation regressions between hardware backends.
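To make the idea of a comparative ratio concrete, here is a minimal sketch that divides each candidate metric by its reference counterpart. The layout of `summary.json` is not documented here, so the flat metric-to-value dictionaries, the candidate/reference direction of the ratio, and the function name `comparative_ratios` are all assumptions for illustration.

```python
def comparative_ratios(reference, candidate):
    """Return candidate/reference for every metric present in both dicts.

    Metrics whose reference value is zero are skipped to avoid division by zero.
    Assumption: both inputs map metric names to plain numbers.
    """
    ratios = {}
    for name, ref_value in reference.items():
        if name in candidate and ref_value != 0:
            ratios[name] = candidate[name] / ref_value
    return ratios


if __name__ == "__main__":
    cuda = {"compute_allreduce_ms": 10.0, "compute_reducescatter_ms": 4.0}
    rocm = {"compute_allreduce_ms": 15.0, "compute_reducescatter_ms": 4.0}
    # A ratio above 1 means the candidate spent more time than the reference.
    print(comparative_ratios(cuda, rocm))
```

Under these assumptions, a ratio well above 1 for a duration metric flags the collective where the candidate backend lags the reference.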
| Metric | Interpretation |
|---|---|
| Overlap Ratio (`overlap_ratio`) | Values close to 1 indicate strong overlap; values near 0 imply communications block compute |
| Compute All-Reduce (`compute_allreduce_ms`) | Time spent in all-reduce operations |
| Compute Reduce-Scatter (`compute_reducescatter_ms`) | Time spent in reduce-scatter operations |
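One way to read the overlap ratio is as the fraction of communication time that coincides with compute time. The sketch below computes that fraction from two lists of `(start_ms, end_ms)` intervals. This definition is an assumption chosen to match the interpretation in the table, not necessarily the formula the toolkit uses, and the function names are hypothetical.

```python
def merge_intervals(intervals):
    """Merge overlapping (start, end) intervals into a sorted, disjoint list."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def overlap_ratio(compute, comm):
    """Fraction of communication time that runs concurrently with compute.

    Returns 0.0 when there is no communication time at all.
    """
    compute = merge_intervals(compute)
    comm = merge_intervals(comm)
    comm_total = sum(end - start for start, end in comm)
    if comm_total == 0:
        return 0.0
    overlapped = 0.0
    i = j = 0
    while i < len(compute) and j < len(comm):
        lo = max(compute[i][0], comm[j][0])
        hi = min(compute[i][1], comm[j][1])
        if hi > lo:
            overlapped += hi - lo
        # Advance whichever interval finishes first.
        if compute[i][1] < comm[j][1]:
            i += 1
        else:
            j += 1
    return overlapped / comm_total
```

Under this definition, a collective that runs entirely inside a compute window scores 1, and one that starts only after compute has finished scores 0.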
- Compare `compute_allreduce_ms` vs `compute_reducescatter_ms` to determine which collective dominates stall time
- Inspect `active_segments` in the JSONL logs to align iteration windows with external profilers (e.g., ROCm tracer)
- Cross-reference `rocm_smi_output` against overlap dips to correlate DVFS throttling or memory pressure with scheduling gaps
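The first and third checks can be sketched over records that have already been parsed from a JSONL log. The code assumes `compute_allreduce_ms`, `compute_reducescatter_ms`, and `overlap_ratio` are top-level numeric keys in each record. The 0.5 dip threshold and both function names are illustrative choices, not toolkit defaults.

```python
def dominant_collective(records):
    """Sum time per collective and report which one accounts for more of it."""
    allreduce = sum(rec.get("compute_allreduce_ms", 0.0) for rec in records)
    reducescatter = sum(rec.get("compute_reducescatter_ms", 0.0) for rec in records)
    name = "all-reduce" if allreduce >= reducescatter else "reduce-scatter"
    return name, allreduce, reducescatter


def overlap_dips(records, threshold=0.5):
    """Return the positions of records whose overlap_ratio falls below threshold.

    Positions index into the records list; records without the key are ignored.
    """
    return [
        index
        for index, rec in enumerate(records)
        if "overlap_ratio" in rec and rec["overlap_ratio"] < threshold
    ]


if __name__ == "__main__":
    sample = [
        {"compute_allreduce_ms": 4.0, "compute_reducescatter_ms": 1.0, "overlap_ratio": 0.9},
        {"compute_allreduce_ms": 6.0, "compute_reducescatter_ms": 1.5, "overlap_ratio": 0.3},
    ]
    print(dominant_collective(sample))
    print(overlap_dips(sample))
```

The positions returned by `overlap_dips` identify the iterations worth inspecting in `rocm_smi_output` or an external trace.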
For deeper inspection, combine these scripts with nsys, rocprof, or PyTorch profiler traces using the iteration timestamps documented in the JSON traces.
- Troubleshooting - Common issues and solutions
- Configuration Guide - Tune parameters for better overlap

