Open
Description
The fp8 implementation of TransformerEngine seems to cache the transposed fp8 inputs to get better performance (#1261), which brings the activation memory usage to roughly the same level as bf16. Is there any way to disable this behavior? I want to use the DualPipe/DualPipeV schedule proposed by DeepSeek, but its high GPU memory overhead leads to OOM. I therefore hoped to use TransformerEngine fp8 to reduce the activation memory usage, but I did not get the expected savings.
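For reference, a minimal sketch of the FP8 setup I am referring to, assuming the standard TransformerEngine PyTorch API (`te.Linear` + `te.fp8_autocast`); as far as I can tell there is no user-facing option on these calls to turn the transpose caching off:

```python
# Minimal sketch, assuming the standard TransformerEngine PyTorch usage.
import torch
import transformer_engine.pytorch as te
from transformer_engine.common import recipe

fp8_recipe = recipe.DelayedScaling(fp8_format=recipe.Format.HYBRID)

layer = te.Linear(4096, 4096, bias=False).cuda()
x = torch.randn(8, 4096, device="cuda", requires_grad=True)

with te.fp8_autocast(enabled=True, fp8_recipe=fp8_recipe):
    y = layer(x)

# The transposed fp8 copy of the input needed for the wgrad GEMM is
# cached during the forward pass (see #1261), so the saved activation is
# roughly 1 byte (fp8) + 1 byte (fp8 transpose) per element, i.e. about
# the same as keeping a single bf16 copy.
y.sum().backward()
```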
Metadata
Assignees
Labels
No labels