[Question] FSDP+TP CUDA_DEVICE_MAX_CONNECTIONS #1147


Open
ChenchaoZhao opened this issue Apr 27, 2025 · 3 comments
Labels
documentation · module: fsdp · question

Comments

@ChenchaoZhao

In Megatron repo https://github.com/NVIDIA/Megatron-LM/blob/4429e8ebe21fb011529d7401c370841ce530785a/megatron/training/arguments.py#L779

It’s recommended there that FSDP use larger values of CUDA_DEVICE_MAX_CONNECTIONS, but Megatron TP requires it to be 1. Is that also the case for the torch implementation of TP using DTensor?

How should I configure this environment variable when using the torch implementations of FSDP(2) and/or TP/CP/SP?

@fegin
Contributor

fegin commented Apr 29, 2025

@weifengpy Do you have insights on this?

@weifengpy
Contributor

@ChenchaoZhao @fegin For FSDP2 + torch-native TP, we recommend setting CUDA_DEVICE_MAX_CONNECTIONS to the number of CUDA streams, e.g. 16 or 32. This ensures that compute and NCCL kernels can execute in parallel.
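A minimal sketch of applying that recommendation: the CUDA driver reads CUDA_DEVICE_MAX_CONNECTIONS at context-creation time, so it must be set before anything initializes CUDA (the value 32 below is just the illustrative number from the comment above, not a tuned setting).

```python
import os

# Must run before any CUDA initialization (e.g. before importing modules
# that touch torch.cuda); the driver reads this at context creation.
# 32 is an illustrative value from the recommendation above.
os.environ["CUDA_DEVICE_MAX_CONNECTIONS"] = "32"

# ... only now import/initialize the training code, e.g.:
# import torch
# torch.cuda.init()
```

Equivalently, exporting the variable in the launch environment (e.g. before invoking torchrun) avoids any ordering concerns inside the script.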

@tianyu-l added the documentation, module: fsdp, and question labels Apr 29, 2025
@ChenchaoZhao
Author

Thanks for the quick answer. Does that mean PyTorch-native TP is superior to Megatron TP, which requires the variable to be 1 in order to enable TP comm overlap (comm+GEMM)?
