
[Trainer][torch-distributed/DDP] Communication bottleneck between pods during multi-node training — guidance for optimization and diagnosis #2870

@luigicollesi

Description

Summary
Running PyTorch DDP on the Kubeflow Trainer torch-distributed runtime shows high communication overhead during multi-node training. The NIC link speed is 2.5 Gbit/s, but effective pod-to-pod throughput during training plateaus at ~1.4 Gbit/s, which limits scaling.

Measured throughput
An internal measurement script times transfers of varying message sizes (both P2P and collective ops) for about one minute. It reports inter-node bandwidth of ~1.8 Gbit/s with a few machines and ~1.4 Gbit/s when all 7 nodes participate.
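
For context, a minimal sketch of this kind of probe (the internal script differs in detail). It assumes the processes are launched with torchrun, so RANK/WORLD_SIZE/MASTER_ADDR/LOCAL_RANK are already set; the message sizes and iteration counts are illustrative:

```python
# Rough bandwidth probe; an illustrative sketch, not the internal script above.
import os
import time
import torch
import torch.distributed as dist

def main():
    dist.init_process_group(backend="nccl")
    rank = dist.get_rank()
    world_size = dist.get_world_size()
    local_rank = int(os.environ.get("LOCAL_RANK", 0))
    torch.cuda.set_device(local_rank)

    for size_mb in (1, 8, 64, 256):
        numel = size_mb * 1024 * 1024 // 4  # float32 elements
        tensor = torch.ones(numel, dtype=torch.float32, device=f"cuda:{local_rank}")

        # Warm-up so NCCL builds its communicators before timing.
        for _ in range(5):
            dist.all_reduce(tensor)
        torch.cuda.synchronize()

        iters = 20
        start = time.perf_counter()
        for _ in range(iters):
            dist.all_reduce(tensor)
        torch.cuda.synchronize()
        elapsed = time.perf_counter() - start

        # Ring all-reduce moves roughly 2*(N-1)/N of the payload per rank ("bus bandwidth").
        bytes_moved = tensor.numel() * 4 * 2 * (world_size - 1) / world_size
        gbit_per_s = bytes_moved * iters / elapsed * 8 / 1e9
        if rank == 0:
            print(f"{size_mb} MiB all_reduce: ~{gbit_per_s:.2f} Gbit/s bus bandwidth")

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```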

What I observe

Collective ops dominate step time beyond one node (a profiling sketch for checking this follows this list).

Sub-linear speedup as total GPUs/nodes increase.
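
A hedged sketch of one way to verify the first point: run a few steps under torch.profiler and look at the share of CUDA time spent in NCCL kernels. The linear layer, batch size, and step count below are placeholders, not the actual training job:

```python
# Sketch: check whether NCCL collectives dominate step time under DDP.
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.profiler import profile, ProfilerActivity

dist.init_process_group(backend="nccl")
local_rank = int(os.environ.get("LOCAL_RANK", 0))
torch.cuda.set_device(local_rank)

model = DDP(torch.nn.Linear(4096, 4096).cuda(local_rank), device_ids=[local_rank])
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA]) as prof:
    for _ in range(10):
        x = torch.randn(32, 4096, device=f"cuda:{local_rank}")
        model(x).sum().backward()  # gradient all_reduce happens during backward
        optimizer.step()
        optimizer.zero_grad()

if dist.get_rank() == 0:
    # NCCL all_reduce entries in this table show the gradient-sync share of step time.
    print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=20))

dist.destroy_process_group()
```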

Ask

Kubeflow-specific best practices to improve NCCL communication on Kubernetes (see the NCCL/DDP settings sketch after this list for the kind of knobs in question).

Recommended Trainer/runtime settings or pod/network tweaks (and any example manifests) for higher throughput.

Known CNI/overlay pitfalls that could explain the gap between link speed (2.5 Gbit/s) and effective throughput (1.4–1.5 Gbit/s).
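
For reference, a hedged sketch of the NCCL and DDP-side knobs in question for a plain-TCP (no RDMA) cluster. The interface name eth0 and the specific values are assumptions, not confirmed recommendations:

```python
# Illustrative NCCL/DDP settings for a TCP-only Kubernetes network.
# Interface name, thread counts, and bucket size are assumptions for this cluster.
import os

os.environ.setdefault("NCCL_DEBUG", "INFO")          # log which interface/transport NCCL picks
os.environ.setdefault("NCCL_SOCKET_IFNAME", "eth0")  # pin NCCL to the pod's primary interface
os.environ.setdefault("NCCL_IB_DISABLE", "1")        # no InfiniBand/RoCE available here
os.environ.setdefault("NCCL_SOCKET_NTHREADS", "4")   # more socket threads can help on plain TCP
os.environ.setdefault("NCCL_NSOCKS_PERTHREAD", "4")

import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

dist.init_process_group(backend="nccl")
local_rank = int(os.environ.get("LOCAL_RANK", 0))
torch.cuda.set_device(local_rank)

# Larger buckets mean fewer, bigger all_reduce calls, which tends to suit
# high-latency links; gradient_as_bucket_view avoids an extra gradient copy.
model = DDP(
    torch.nn.Linear(4096, 4096).cuda(local_rank),
    device_ids=[local_rank],
    bucket_cap_mb=100,
    gradient_as_bucket_view=True,
)
```

With NCCL_DEBUG=INFO, the rank-0 log shows which interface and transport NCCL actually selected, which should make CNI/overlay misrouting easier to spot.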

Environment

Kubernetes version:
Client Version: v1.33.2
Kustomize Version: v5.6.0
Server Version: v1.33.2

Kubeflow Trainer version:

Kubeflow Python SDK version:
Name: kubeflow
Version: 0.1.0
Summary: Kubeflow Python SDK to manage ML workloads and to interact with Kubeflow APIs.
Home-page: https://github.com/kubeflow/sdk
Author:
Author-email: The Kubeflow Authors [email protected]
License-Expression: Apache-2.0
Location: /opt/conda/lib/python3.11/site-packages
Requires: kubeflow-trainer-api, kubernetes, pydantic
