How to reduce cuda memory usage of fp8?

The FP8 implementation in TransformerEngine seems to cache the transposed FP8 inputs to get better performance (https://github.com/NVIDIA/TransformerEngine/issues/1261), which results in the same memory usage as BF16. Is there any way to disable this behavior? I want to use DualPipe/DualPipeV as described by DeepSeek, but its high GPU memory overhead leads to OOM. I therefore tried TransformerEngine FP8 to reduce activation memory usage, but I did not get the expected savings.
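A rough back-of-the-envelope sketch of why the saving disappears, assuming the cache holds one FP8 copy of the input plus one FP8 copy of its transpose, and ignoring scaling factors and any other metadata. The tensor shape and the helper `activation_bytes` are hypothetical and only serve the illustration; this is not TransformerEngine code.

```python
def activation_bytes(num_elements: int, bytes_per_element: int, copies: int = 1) -> int:
    """Bytes needed to keep `copies` tensors of `num_elements` elements each."""
    return num_elements * bytes_per_element * copies


# Hypothetical activation shape: (tokens, hidden)
tokens, hidden = 4096, 8192
n = tokens * hidden

bf16_bytes = activation_bytes(n, 2)                          # one BF16 tensor, 2 bytes/element
fp8_only_bytes = activation_bytes(n, 1)                      # one FP8 tensor, 1 byte/element
fp8_with_transpose_bytes = activation_bytes(n, 1, copies=2)  # FP8 tensor + cached FP8 transpose

print(f"BF16:            {bf16_bytes / 2**20:.0f} MiB")
print(f"FP8 only:        {fp8_only_bytes / 2**20:.0f} MiB")
print(f"FP8 + transpose: {fp8_with_transpose_bytes / 2**20:.0f} MiB")
```

Under these assumptions, the FP8 tensor plus its cached transpose occupies exactly as many bytes as the single BF16 tensor, which matches the behavior described above.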

How to reduce cuda memory usage of fp8? #1764
