
Create option to control data type for tensor parallel all-reduce #1761

Open
@veritas9872


Is your feature request related to a problem? Please describe.

I have found that using BF16 for the tensor-parallel all-reduce causes large differences in logit values across different TP degrees. Using BF16 buffers for TP transfers reduces memory and communication volume, but the rounding error accumulates with each layer.

This becomes a real problem whenever the logit values themselves matter. In my measurements, an average relative error of around 15% between different TP degrees is not uncommon with BF16 buffers, while the error is negligible with FP32 buffers.
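Below is a minimal single-process sketch, not Transformer Engine code, that mimics the row-parallel all-reduce as a sum over `tp_size` partial layer outputs held in a given buffer dtype. The sizes (`tp_size`, `hidden`, `layers`) are illustrative assumptions, chosen only to show how the reduction dtype drives error accumulation.

```python
import torch

torch.manual_seed(0)
tp_size, hidden, layers = 8, 4096, 32  # illustrative sizes, not TE defaults


def reduce_in(dtype, partials):
    # Mimic an all-reduce whose buffers (and hence the running sum) are held
    # in `dtype`; return the result in float64 for comparison.
    acc = torch.zeros_like(partials[0], dtype=dtype)
    for p in partials:
        acc = acc + p.to(dtype)
    return acc.to(torch.float64)


x_ref = torch.zeros(hidden, dtype=torch.float64)  # exact reference activation
x_bf16 = x_ref.clone()                            # activation with BF16 reductions
x_fp32 = x_ref.clone()                            # activation with FP32 reductions

for _ in range(layers):
    # Each TP rank holds a partial layer output; the full output is their sum.
    partials = [torch.randn(hidden, dtype=torch.float64) / tp_size
                for _ in range(tp_size)]
    x_ref += sum(partials)
    x_bf16 += reduce_in(torch.bfloat16, partials)
    x_fp32 += reduce_in(torch.float32, partials)

rel = lambda a: ((a - x_ref).abs() / x_ref.abs().clamp_min(1e-6)).mean().item()
print(f"mean relative error with BF16 reduction buffers: {rel(x_bf16):.2e}")
print(f"mean relative error with FP32 reduction buffers: {rel(x_fp32):.2e}")
```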

Also, now that async-TP has been introduced, I expect that using FP32 for TP data transfers will no longer have the same latency impact it used to.

Describe the solution you'd like

Many users may prefer BF16 for TP data transfers to reduce latency, but an option to use FP32 for TP/CP communication would be valuable for accuracy-sensitive users. It would also make it easier to change the TP degree during training without shifting the numerics. A hypothetical sketch of what such an option could do is shown below.
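As a rough illustration only (the names `tp_all_reduce` and `tp_allreduce_dtype` below are hypothetical, not an existing Transformer Engine option), the behavior could amount to upcasting the partial output to FP32 around the standard torch.distributed all-reduce:

```python
# Hypothetical sketch, not an existing Transformer Engine API: the function
# and parameter names are illustrative; only the torch.distributed calls
# are standard.
from typing import Optional

import torch
import torch.distributed as dist


def tp_all_reduce(tensor: torch.Tensor,
                  group: Optional[dist.ProcessGroup] = None,
                  tp_allreduce_dtype: Optional[torch.dtype] = None) -> torch.Tensor:
    """All-reduce a TP partial output, optionally in a higher-precision dtype.

    With tp_allreduce_dtype=torch.float32, a BF16 activation is upcast before
    the reduction and cast back afterwards: roughly 2x the communication
    volume, but the running sum is never rounded to BF16.
    """
    if tp_allreduce_dtype is None or tensor.dtype == tp_allreduce_dtype:
        # Default behavior: reduce directly in the activation dtype (e.g. BF16).
        dist.all_reduce(tensor, op=dist.ReduceOp.SUM, group=group)
        return tensor

    # Upcast, reduce in the requested dtype, then cast back in place.
    buf = tensor.to(tp_allreduce_dtype)
    dist.all_reduce(buf, op=dist.ReduceOp.SUM, group=group)
    return tensor.copy_(buf)
```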

Describe alternatives you've considered

N/A.

Additional context

From reading the code of the TE PyTorch Linear module, I am not 100% certain that BF16 is used for TP/CP data transfers, but I believe this is the case, since I could not find any documentation on the topic.
