Summary
Multi-node PyTorch DDP training on Kubeflow Trainer (torch-distributed runtime) shows high communication overhead. The NIC link speed is 2.5 Gbit/s, but effective pod-to-pod throughput during training plateaus around ~1.4 Gbit/s, which limits scaling.
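One simple way to see that plateau (not necessarily how it was measured here) is to sample the pod interface's kernel byte counters while training runs; the interface name and sampling interval below are assumptions:

```python
import time
from pathlib import Path


def sample_nic_throughput(iface: str = "eth0", interval_s: float = 5.0) -> None:
    """Print TX throughput of a pod interface by sampling kernel byte counters.

    iface and interval_s are assumptions; run this inside a training pod
    alongside the job to watch where throughput plateaus.
    """
    tx = Path(f"/sys/class/net/{iface}/statistics/tx_bytes")
    prev = int(tx.read_text())
    while True:
        time.sleep(interval_s)
        cur = int(tx.read_text())
        print(f"{(cur - prev) * 8 / interval_s / 1e9:.2f} Gbit/s TX on {iface}")
        prev = cur


if __name__ == "__main__":
    sample_nic_throughput()
```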
Measured throughput
An internal measurement script that times transfers of varying message sizes (point-to-point vs. collective) for about one minute reports inter-node bandwidth of ~1.8 Gbit/s with a few nodes and ~1.4 Gbit/s when all 7 nodes participate.
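The script itself is internal; below is a rough sketch of the collective half of that measurement, timing all_reduce over a few payload sizes for a fixed window and converting to Gbit/s. The payload sizes, per-size duration, and the algorithm-bandwidth reporting are assumptions, and the real script also exercises point-to-point send/recv:

```python
import os
import time

import torch
import torch.distributed as dist


def measure_allreduce_bandwidth(sizes_mib=(1, 8, 64, 256), seconds_per_size=10.0) -> None:
    """Time all_reduce over several payload sizes and report effective Gbit/s."""
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ.get("LOCAL_RANK", 0))
    torch.cuda.set_device(local_rank)
    rank = dist.get_rank()

    for size_mib in sizes_mib:
        payload = torch.zeros(size_mib * 1024 * 1024 // 4, device="cuda")  # float32 payload
        dist.all_reduce(payload)            # warm-up so ring/connection setup is excluded
        torch.cuda.synchronize()

        iters = 0
        start = time.perf_counter()
        while time.perf_counter() - start < seconds_per_size:
            dist.all_reduce(payload)
            iters += 1
        torch.cuda.synchronize()            # wait for all queued collectives to finish
        elapsed = time.perf_counter() - start

        # Algorithm bandwidth: payload bytes per completed all_reduce / wall time.
        # On-wire traffic per rank for a ring all-reduce is ~2*(N-1)/N of this.
        gbit_per_s = payload.numel() * payload.element_size() * iters * 8 / elapsed / 1e9
        if rank == 0:
            print(f"{size_mib:>4} MiB: {iters} iters, ~{gbit_per_s:.2f} Gbit/s")

    dist.destroy_process_group()


if __name__ == "__main__":
    measure_allreduce_bandwidth()
```

Launching it the same way as the training job (same pods, same network path) keeps the number comparable to what DDP sees.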
What I observe
Collective ops dominate step time once training spans more than one node (see the timing sketch below).
Sub-linear speedup as total GPUs/nodes increase.
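One rough way to put a number on the communication share (not necessarily the method used here) is to compare per-step time with and without DDP's gradient all-reduce using no_sync(); the toy model and batch sizes below are arbitrary stand-ins for the real workload:

```python
import contextlib
import os
import time

import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP


def timed_step(model, batch, target, sync_grads: bool) -> float:
    """One forward/backward pass; sync_grads=False makes DDP skip the NCCL all-reduce."""
    ctx = contextlib.nullcontext() if sync_grads else model.no_sync()
    torch.cuda.synchronize()
    start = time.perf_counter()
    with ctx:
        torch.nn.functional.mse_loss(model(batch), target).backward()
    torch.cuda.synchronize()
    model.zero_grad(set_to_none=True)
    return time.perf_counter() - start


def main() -> None:
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)
    device = torch.device("cuda", local_rank)

    # Toy model and batch; the real workload's sizes will shift the ratio.
    model = DDP(torch.nn.Sequential(
        torch.nn.Linear(4096, 4096), torch.nn.ReLU(), torch.nn.Linear(4096, 4096),
    ).to(device), device_ids=[local_rank])
    batch = torch.randn(64, 4096, device=device)
    target = torch.randn(64, 4096, device=device)

    for _ in range(5):                      # warm-up
        timed_step(model, batch, target, sync_grads=True)
    with_sync = sum(timed_step(model, batch, target, True) for _ in range(20)) / 20
    without = sum(timed_step(model, batch, target, False) for _ in range(20)) / 20
    if dist.get_rank() == 0:
        print(f"step with all-reduce: {with_sync * 1e3:.1f} ms, without: {without * 1e3:.1f} ms "
              f"(~{(with_sync - without) / with_sync:.0%} of step time is communication)")

    dist.destroy_process_group()


if __name__ == "__main__":
    main()
```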
Ask
Kubeflow-specific best practices to improve NCCL communication on Kubernetes.
Recommended Trainer/runtime settings or pod/network tweaks (and any example manifests) for higher throughput.
Known CNI/overlay pitfalls that could explain the gap between link speed (2.5 Gbit/s) and effective throughput (1.4–1.5 Gbit/s).