Description
Is your feature request related to a problem? Please describe.
I have found that using BF16 buffers for the tensor-parallel all-reduce causes major differences in the logit values across different TP sizes. Using BF16 buffers for TP transfers reduces memory and bandwidth, but it lets rounding errors accumulate with every layer.
However, this causes real problems when the exact logit values matter. Currently, an average logit error of 15% is not uncommon when comparing runs at different TP degrees with BF16 buffers, while the error is negligible with FP32 buffers (see the sketch below).
Also, now that async-TP has been introduced, I expect that using FP32 for the TP data transfer will no longer have the latency impact it once had.
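To illustrate the mechanism in isolation, here is a minimal sketch in plain PyTorch (it does not touch TE internals; the shapes and TP sizes are arbitrary assumptions). It simulates the partial-sum reduction of a row-parallel linear layer, once through BF16 buffers and once through FP32 buffers. The per-layer error from the BF16 path looks small on its own, but it depends on the TP size and compounds layer by layer in a deep model.

```python
import torch

torch.manual_seed(0)
x = torch.randn(1024, 4096)
w = torch.randn(4096, 4096)
ref = x @ w  # single-device FP32 reference


def rel_err(out):
    return ((out.float() - ref).abs().mean() / ref.abs().mean()).item()


for tp in (2, 4, 8):
    # Each simulated TP rank owns a slice of the reduction (K) dimension and
    # produces a partial output that would normally be all-reduced across ranks.
    partials = [x[:, k] @ w[k] for k in torch.arange(4096).chunk(tp)]

    # BF16 communication buffers: partials land in BF16 and every intermediate
    # sum of the reduction is rounded back to BF16.
    out_bf16 = partials[0].bfloat16()
    for p in partials[1:]:
        out_bf16 = out_bf16 + p.bfloat16()

    # FP32 communication buffers: no extra rounding on top of the local matmul.
    out_fp32 = torch.stack(partials).sum(dim=0)

    print(f"TP={tp}: rel. error with BF16 buffers {rel_err(out_bf16):.2e}, "
          f"with FP32 buffers {rel_err(out_fp32):.2e}")
```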
Describe the solution you'd like
Many users will prefer BF16 for TP data transfer to keep latency low, but an option to use FP32 for TP/CP communication would be great for accuracy-sensitive users. It would also make it easier to change the TP degree during training, among other things. A possible shape for the option is sketched below.
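Purely as an illustration of the requested knob, not a proposal for the exact API: the `tp_comm_dtype` keyword below is hypothetical and does not exist in TE today, while `tp_group`, `tp_size`, and `parallel_mode` are existing te.Linear arguments shown only for context. The snippet assumes a torchrun-launched job with torch.distributed available.

```python
import torch
import torch.distributed as dist
import transformer_engine.pytorch as te

# Assumes a torchrun launch; here the TP group is simply all ranks.
dist.init_process_group(backend="nccl")
tp_group = dist.new_group(ranks=list(range(dist.get_world_size())))

linear = te.Linear(
    4096,
    4096,
    tp_group=tp_group,               # existing argument: tensor-parallel process group
    tp_size=dist.get_world_size(),   # existing argument: tensor-parallel size
    parallel_mode="row",             # existing argument: row-parallel linear (all-reduce on output)
    tp_comm_dtype=torch.float32,     # hypothetical knob requested here: dtype of the TP/CP buffers
)
```

The default could stay BF16 so latency-sensitive users are unaffected, with FP32 buffers as an opt-in.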
Describe alternatives you've considered
N/A.
Additional context
From the code of the TE PyTorch Linear module, I am not 100% certain that BF16 is used for the TP/CP data transfer. However, I am reasonably confident that it is, since I could not find any documentation stating otherwise.
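In case it helps with reproducing the report, the following standalone check isolates the effect of the buffer dtype. It is a sketch that assumes a torchrun-launched multi-GPU job and uses plain torch.distributed rather than TE.

```python
import torch
import torch.distributed as dist

dist.init_process_group(backend="nccl")
rank = dist.get_rank()
torch.cuda.set_device(rank % torch.cuda.device_count())

# A rank-specific "partial output", standing in for one TP shard's contribution.
torch.manual_seed(rank)
partial = torch.randn(1024, 4096, device="cuda")

# Reduce through BF16 buffers (the behaviour this issue assumes TE uses today).
out_bf16 = partial.bfloat16()
dist.all_reduce(out_bf16)

# Reduce through FP32 buffers (the behaviour this issue requests as an option).
out_fp32 = partial.clone()
dist.all_reduce(out_fp32)

rel_diff = (out_bf16.float() - out_fp32).abs().mean() / out_fp32.abs().mean()
if rank == 0:
    print(f"relative difference between BF16 and FP32 reductions: {rel_diff.item():.2e}")

dist.destroy_process_group()
```

Run with e.g. `torchrun --nproc_per_node=4 <script>.py`; the printed gap shows how much the reduction dtype alone moves the values, before any layer-by-layer accumulation.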