Description
Related to #1261 and #1764, but it is not entirely clear there:
TransformerEngine could support storing weights in fp8 and drop the native-precision copy of the weights after initialization. This might seem counterintuitive in a training setting, but consider LoRA and other adapter training: most of the weights are never needed at their original precision again - you just want TransformerEngine for its efficient computation.
Only the LoRA weights are kept at a higher precision.
The VRAM usage of TransformerEngine is currently prohibitive for training a small adapter on a large transformer.
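
For a rough sense of the memory involved, here is a back-of-the-envelope sketch (the 7B parameter count, layer shapes, and rank-16 adapter are assumptions for illustration, not measurements):

```python
# Illustrative memory arithmetic for a 7B-parameter base model.
params = 7e9
bf16_weights_gb = params * 2 / 1e9   # ~14 GB if the base weights stay in bf16
fp8_weights_gb = params * 1 / 1e9    # ~7 GB if only an fp8 copy is kept

# A rank-16 LoRA adapter over, say, 224 weight matrices of size 4096x4096
# stays tiny even in bf16, so the base-weight storage dominates.
lora_params = 224 * 2 * 16 * 4096
lora_bf16_gb = lora_params * 2 / 1e9  # well under 0.1 GB
print(bf16_weights_gb, fp8_weights_gb, lora_bf16_gb)
```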
Describe alternatives you've considered
Continue to use a custom Linear layer that stores its weights in fp8, but lacks the efficient fp8 computation that TransformerEngine provides. A sketch of that workaround is below.
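
A minimal sketch of such a custom layer, assuming a recent PyTorch with `torch.float8_e4m3fn` support (the class name and initialization details are hypothetical, just to illustrate the pattern):

```python
import torch
import torch.nn as nn

class Fp8StoredLinear(nn.Module):
    """Sketch of the workaround: the frozen base weight is kept only as an
    fp8 tensor and dequantized on the fly; a bf16 LoRA adapter is trained
    on top. No fused fp8 GEMMs are used, which is exactly the efficiency
    TransformerEngine could provide."""

    def __init__(self, in_features: int, out_features: int, rank: int = 16):
        super().__init__()
        w = torch.randn(out_features, in_features) * in_features ** -0.5
        # Keep only the fp8 copy of the base weight; the full-precision
        # tensor is discarded after this cast.
        self.register_buffer("weight_fp8", w.to(torch.float8_e4m3fn))
        # Trainable LoRA factors stay in higher precision (bf16 here).
        self.lora_a = nn.Parameter((torch.randn(rank, in_features) * 0.02).to(torch.bfloat16))
        self.lora_b = nn.Parameter(torch.zeros(out_features, rank, dtype=torch.bfloat16))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Dequantize per forward pass and run a plain bf16 matmul.
        w = self.weight_fp8.to(torch.bfloat16)
        return x @ w.t() + (x @ self.lora_a.t()) @ self.lora_b.t()


# Usage: only the LoRA parameters receive gradients; the fp8 base weight is a buffer.
layer = Fp8StoredLinear(4096, 4096)
x = torch.randn(8, 4096, dtype=torch.bfloat16)
out = layer(x)
```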