Description
In the LongLoRA paper, for example, they fully train both the embedding and the norm layers while still applying LoRA to the self-attention layers. Our recipes currently set only the LoRA parameters to trainable, but it shouldn't be too hard to support passing additional trainable layers to that function from the config, e.g. similar to our usage of custom_sharded_layers. See the sketch below.
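
A rough sketch of what this could look like in the recipe. The helper name, the `extra_trainable` list, and the `lora_` parameter-name convention are illustrative assumptions, not the actual API:

```python
from typing import List

import torch.nn as nn


def set_trainable_params_with_extras(
    model: nn.Module,
    extra_trainable: List[str],  # hypothetical config field, e.g. ["tok_embeddings", "norm"]
) -> None:
    """Freeze everything except LoRA adapter params and any extra layers named in the config."""
    for name, param in model.named_parameters():
        is_lora = "lora_" in name  # assumes adapter params are named lora_a / lora_b
        is_extra = any(layer_name in name for layer_name in extra_trainable)
        param.requires_grad = is_lora or is_extra
```

From the config side this could be driven by a simple list of layer names (e.g. `extra_trainable: [tok_embeddings, norm]`), passed through to the recipe the same way custom_sharded_layers is today.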