Common issues and solutions when running AORTA benchmarks.
Solution: Install PyYAML or supply a JSON config instead.

```bash
pip install pyyaml
```

Solution: Install the ROCm utilities or omit `--enable-rocm-metrics`.
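The YAML-or-JSON fallback can be sketched as a small loader that tries PyYAML first and falls back to JSON. This is an illustrative helper, not part of the AORTA toolkit; the function name is hypothetical.

```python
import json
from pathlib import Path

def load_config(path):
    """Load a YAML config if PyYAML is available, else parse JSON.

    Illustrative sketch -- not the toolkit's actual loader.
    """
    path = Path(path)
    if path.suffix in (".yaml", ".yml"):
        try:
            import yaml  # requires `pip install pyyaml`
        except ImportError as exc:
            raise RuntimeError(
                "PyYAML is not installed; install it or supply a JSON config"
            ) from exc
        return yaml.safe_load(path.read_text())
    return json.loads(path.read_text())
```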
Ensure the ROCm tools are in your `$PATH`:

```bash
export PATH=$PATH:/opt/rocm/bin
```

Solution: Verify that `CUDA_DEVICE_MAX_CONNECTIONS=1` is set (the launcher sets it) to encourage overlap-friendly scheduling.
This is typically set automatically by the launch scripts, but you can verify:

```bash
export CUDA_DEVICE_MAX_CONNECTIONS=1
```

Solution: Increase `dataloader.num_workers` or reduce the dataset volume.
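If you want a belt-and-braces check inside your own training script, a minimal sketch (pure stdlib; the helper name is illustrative, and the launch scripts normally export this variable for you):

```python
import os

def check_max_connections(env=os.environ):
    """Return a warning string if CUDA_DEVICE_MAX_CONNECTIONS is not
    pinned to 1, else None.

    Illustrative check only -- it just verifies the setting survived
    into the current process environment.
    """
    value = env.get("CUDA_DEVICE_MAX_CONNECTIONS")
    if value != "1":
        return f"CUDA_DEVICE_MAX_CONNECTIONS is {value!r}, expected '1'"
    return None
```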
```bash
torchrun --nproc_per_node 4 train.py \
  --config config/default.yaml \
  --override dataloader.num_workers=8
```

Cause: The launcher's `NPROC` exceeds the number of available GPUs.
Solution: The toolkit remaps surplus local ranks modulo the number of visible devices, but persistent failures usually indicate mismatched visibility.
Check your device visibility:

```bash
# CUDA
echo $CUDA_VISIBLE_DEVICES
# ROCm
echo $HIP_VISIBLE_DEVICES
```

Ensure the launcher's `--nproc_per_node` matches your visible GPU count.
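The modulo remapping of surplus local ranks can be illustrated with a pure-Python sketch. The helper name and environment handling are assumptions for illustration, not the toolkit's actual code:

```python
import os

def resolve_device_index(local_rank, env=os.environ):
    """Map a local rank onto a visible device index, wrapping modulo
    the visible-device count.

    Illustrative sketch of the documented remapping behaviour -- not
    the toolkit's implementation.
    """
    visible = env.get("CUDA_VISIBLE_DEVICES") or env.get("HIP_VISIBLE_DEVICES")
    if not visible:
        raise RuntimeError("no visible devices configured")
    count = len([d for d in visible.split(",") if d.strip()])
    return local_rank % count
```

For example, with four visible GPUs a surplus local rank 5 wraps back onto device 1.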
- Adjust model depth/width in `config/default.yaml` to stress-test memory and communication pressure
- Swap `MixedPrecision` modes via `training.mixed_precision` (`none`, `fp16`, or `bf16`)
- Leverage the JSONL logs to integrate with external profilers or dashboards (e.g., Prometheus, Weights & Biases)
- Implement custom communication hooks by editing `StreamProfiler.intercept_distributed_ops`
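As a starting point for the dashboard integration above, a minimal JSONL reader (the log path and any field names you filter on are assumptions; inspect your own profiling output for the real schema):

```python
import json
from pathlib import Path

def iter_records(path):
    """Yield one dict per JSONL line, skipping blank lines.

    Illustrative reader -- the record fields depend on your actual
    AORTA log output.
    """
    for line in Path(path).read_text().splitlines():
        line = line.strip()
        if line:
            yield json.loads(line)
```

Each yielded dict can then be forwarded to whatever backend you use (e.g., `wandb.log(record)` or a Prometheus exporter).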
If you encounter issues not covered here:
- Check that all prerequisites are installed (see Getting Started)
- Verify your configuration is valid (see Configuration Guide)
- Review profiling outputs for error messages (see Profiling Guide)
- Open an issue on the GitHub repository with:
- Your configuration file
- Error messages and stack traces
- Hardware/software environment details