Description
Is your feature request related to a problem? Please describe.
I have found that using BF16 buffers for the tensor-parallel all-reduce causes major differences in the logit values across different TP sizes. Using BF16 buffers for TP transfers reduces memory and bandwidth, but it lets rounding errors accumulate with every layer.
However, this causes real problems when the exact logit values matter. Currently, an average logit error of 15% is not uncommon when comparing runs at different TP degrees with BF16 buffers, while the error is negligible with FP32 buffers (see the sketch below).
Also, now that async-TP has been introduced, I expect that using FP32 for the TP data transfer will no longer have the latency impact it once had.
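To illustrate the mechanism in isolation, here is a minimal sketch in plain PyTorch (it does not touch TE internals; the shapes and TP sizes are arbitrary assumptions). It simulates the partial-sum reduction of a row-parallel linear layer, once through BF16 buffers and once through FP32 buffers. The per-layer error from the BF16 path looks small on its own, but it depends on the TP size and compounds layer by layer in a deep model.

```python
import torch

torch.manual_seed(0)
x = torch.randn(1024, 4096)
w = torch.randn(4096, 4096)
ref = x @ w  # single-device FP32 reference


def rel_err(out):
    return ((out.float() - ref).abs().mean() / ref.abs().mean()).item()


for tp in (2, 4, 8):
    # Each simulated TP rank owns a slice of the reduction (K) dimension and
    # produces a partial output that would normally be all-reduced across ranks.
    partials = [x[:, k] @ w[k] for k in torch.arange(4096).chunk(tp)]

    # BF16 communication buffers: partials land in BF16 and every intermediate
    # sum of the reduction is rounded back to BF16.
    out_bf16 = partials[0].bfloat16()
    for p in partials[1:]:
        out_bf16 = out_bf16 + p.bfloat16()

    # FP32 communication buffers: no extra rounding on top of the local matmul.
    out_fp32 = torch.stack(partials).sum(dim=0)

    print(f"TP={tp}: rel. error with BF16 buffers {rel_err(out_bf16):.2e}, "
          f"with FP32 buffers {rel_err(out_fp32):.2e}")
```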
Describe the solution you'd like
Many users will prefer BF16 for TP data transfer to keep latency low, but an option to use FP32 for TP/CP communication would be great for accuracy-sensitive users. It would also make it easier to change the TP degree during training, among other things. A possible shape for the option is sketched below.
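Purely as an illustration of the requested knob, not a proposal for the exact API: the `tp_comm_dtype` keyword below is hypothetical and does not exist in TE today, while `tp_group`, `tp_size`, and `parallel_mode` are existing te.Linear arguments shown only for context. The snippet assumes a torchrun-launched job with torch.distributed available.

```python
import torch
import torch.distributed as dist
import transformer_engine.pytorch as te

# Assumes a torchrun launch; here the TP group is simply all ranks.
dist.init_process_group(backend="nccl")
tp_group = dist.new_group(ranks=list(range(dist.get_world_size())))

linear = te.Linear(
    4096,
    4096,
    tp_group=tp_group,               # existing argument: tensor-parallel process group
    tp_size=dist.get_world_size(),   # existing argument: tensor-parallel size
    parallel_mode="row",             # existing argument: row-parallel linear (all-reduce on output)
    tp_comm_dtype=torch.float32,     # hypothetical knob requested here: dtype of the TP/CP buffers
)
```

The default could stay BF16 so latency-sensitive users are unaffected, with FP32 buffers as an opt-in.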
Describe alternatives you've considered
N/A.
Additional context
From the code of the TE PyTorch Linear module, I am not 100% certain that BF16 is used for the TP/CP data transfer. However, I am reasonably confident that it is, since I could not find any documentation stating otherwise.
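In case it helps with reproducing the report, the following standalone check isolates the effect of the buffer dtype. It is a sketch that assumes a torchrun-launched multi-GPU job and uses plain torch.distributed rather than TE.

```python
import torch
import torch.distributed as dist

dist.init_process_group(backend="nccl")
rank = dist.get_rank()
torch.cuda.set_device(rank % torch.cuda.device_count())

# A rank-specific "partial output", standing in for one TP shard's contribution.
torch.manual_seed(rank)
partial = torch.randn(1024, 4096, device="cuda")

# Reduce through BF16 buffers (the behaviour this issue assumes TE uses today).
out_bf16 = partial.bfloat16()
dist.all_reduce(out_bf16)

# Reduce through FP32 buffers (the behaviour this issue requests as an option).
out_fp32 = partial.clone()
dist.all_reduce(out_fp32)

rel_diff = (out_bf16.float() - out_fp32).abs().mean() / out_fp32.abs().mean()
if rank == 0:
    print(f"relative difference between BF16 and FP32 reductions: {rel_diff.item():.2e}")

dist.destroy_process_group()
```

Run with e.g. `torchrun --nproc_per_node=4 <script>.py`; the printed gap shows how much the reduction dtype alone moves the values, before any layer-by-layer accumulation.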