
[Trainer][torch-distributed/DDP] Communication bottleneck between pods during multi-node training — guidance for optimization and diagnosis #2870

@luigicollesi

Description

Summary
Running PyTorch DDP on the Kubeflow Trainer torch-distributed runtime shows high communication overhead during multi-node training. The NIC link speed is 2.5 Gbit/s, but effective pod-to-pod throughput during training plateaus at ~1.4 Gbit/s, which limits scaling.

Measured throughput
An internal measurement script times transfers of varying message sizes (both P2P and collective ops) for about one minute. It reports inter-node bandwidth of ~1.8 Gbit/s with a few machines and ~1.4 Gbit/s when all 7 nodes participate.
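
For context, a minimal sketch of this kind of probe (the internal script differs in detail). It assumes the processes are launched with torchrun, so RANK/WORLD_SIZE/MASTER_ADDR/LOCAL_RANK are already set; the message sizes and iteration counts are illustrative:

```python
# Rough bandwidth probe; an illustrative sketch, not the internal script above.
import os
import time
import torch
import torch.distributed as dist

def main():
    dist.init_process_group(backend="nccl")
    rank = dist.get_rank()
    world_size = dist.get_world_size()
    local_rank = int(os.environ.get("LOCAL_RANK", 0))
    torch.cuda.set_device(local_rank)

    for size_mb in (1, 8, 64, 256):
        numel = size_mb * 1024 * 1024 // 4  # float32 elements
        tensor = torch.ones(numel, dtype=torch.float32, device=f"cuda:{local_rank}")

        # Warm-up so NCCL builds its communicators before timing.
        for _ in range(5):
            dist.all_reduce(tensor)
        torch.cuda.synchronize()

        iters = 20
        start = time.perf_counter()
        for _ in range(iters):
            dist.all_reduce(tensor)
        torch.cuda.synchronize()
        elapsed = time.perf_counter() - start

        # Ring all-reduce moves roughly 2*(N-1)/N of the payload per rank ("bus bandwidth").
        bytes_moved = tensor.numel() * 4 * 2 * (world_size - 1) / world_size
        gbit_per_s = bytes_moved * iters / elapsed * 8 / 1e9
        if rank == 0:
            print(f"{size_mb} MiB all_reduce: ~{gbit_per_s:.2f} Gbit/s bus bandwidth")

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```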

What I observe

Collective ops dominate step time beyond one node (a profiling sketch for checking this follows this list).

Sub-linear speedup as total GPUs/nodes increase.
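
A hedged sketch of one way to verify the first point: run a few steps under torch.profiler and look at the share of CUDA time spent in NCCL kernels. The linear layer, batch size, and step count below are placeholders, not the actual training job:

```python
# Sketch: check whether NCCL collectives dominate step time under DDP.
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.profiler import profile, ProfilerActivity

dist.init_process_group(backend="nccl")
local_rank = int(os.environ.get("LOCAL_RANK", 0))
torch.cuda.set_device(local_rank)

model = DDP(torch.nn.Linear(4096, 4096).cuda(local_rank), device_ids=[local_rank])
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA]) as prof:
    for _ in range(10):
        x = torch.randn(32, 4096, device=f"cuda:{local_rank}")
        model(x).sum().backward()  # gradient all_reduce happens during backward
        optimizer.step()
        optimizer.zero_grad()

if dist.get_rank() == 0:
    # NCCL all_reduce entries in this table show the gradient-sync share of step time.
    print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=20))

dist.destroy_process_group()
```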

Ask

Kubeflow-specific best practices to improve NCCL communication on Kubernetes (see the NCCL/DDP settings sketch after this list for the kind of knobs in question).

Recommended Trainer/runtime settings or pod/network tweaks (and any example manifests) for higher throughput.

Known CNI/overlay pitfalls that could explain the gap between link speed (2.5 Gbit/s) and effective throughput (1.4–1.5 Gbit/s).
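
For reference, a hedged sketch of the NCCL and DDP-side knobs in question for a plain-TCP (no RDMA) cluster. The interface name eth0 and the specific values are assumptions, not confirmed recommendations:

```python
# Illustrative NCCL/DDP settings for a TCP-only Kubernetes network.
# Interface name, thread counts, and bucket size are assumptions for this cluster.
import os

os.environ.setdefault("NCCL_DEBUG", "INFO")          # log which interface/transport NCCL picks
os.environ.setdefault("NCCL_SOCKET_IFNAME", "eth0")  # pin NCCL to the pod's primary interface
os.environ.setdefault("NCCL_IB_DISABLE", "1")        # no InfiniBand/RoCE available here
os.environ.setdefault("NCCL_SOCKET_NTHREADS", "4")   # more socket threads can help on plain TCP
os.environ.setdefault("NCCL_NSOCKS_PERTHREAD", "4")

import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

dist.init_process_group(backend="nccl")
local_rank = int(os.environ.get("LOCAL_RANK", 0))
torch.cuda.set_device(local_rank)

# Larger buckets mean fewer, bigger all_reduce calls, which tends to suit
# high-latency links; gradient_as_bucket_view avoids an extra gradient copy.
model = DDP(
    torch.nn.Linear(4096, 4096).cuda(local_rank),
    device_ids=[local_rank],
    bucket_cap_mb=100,
    gradient_as_bucket_view=True,
)
```

With NCCL_DEBUG=INFO, the rank-0 log shows which interface and transport NCCL actually selected, which should make CNI/overlay misrouting easier to spot.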

Environment

Kubernetes version:
Client Version: v1.33.2
Kustomize Version: v5.6.0
Server Version: v1.33.2

Kubeflow Trainer version:

Kubeflow Python SDK version:
Name: kubeflow
Version: 0.1.0
Summary: Kubeflow Python SDK to manage ML workloads and to interact with Kubeflow APIs.
Home-page: https://github.com/kubeflow/sdk
Author:
Author-email: The Kubeflow Authors [email protected]
License-Expression: Apache-2.0
Location: /opt/conda/lib/python3.11/site-packages
Requires: kubeflow-trainer-api, kubernetes, pydantic
