
@rockerBOO (Contributor) commented Aug 19, 2025

LoRA-MGPO: Mitigating Double Descent in Low-Rank Adaptation via Momentum-Guided Perturbation Optimization

https://arxiv.org/abs/2502.14538

It is an update to GGPO and appears to replace it, as the authors updated their paper.

Performance dynamics:

[Screenshots of the performance-dynamics figures from the LoRA-MGPO paper (arXiv:2502.14538v2)]
```toml
network_args = [
   "mgpo_rho=0.05", # (ρ): perturbation radius
   "mgpo_beta=0.9", # (β): EMA smoothing factor for gradient magnitude normalization
]
```

or on the command line:

```
--network_args "mgpo_rho=0.05" "mgpo_beta=0.9"
```

These values are a starting point and may need tuning. For larger models, the paper suggests moving in the direction of rho = 0.01 and beta = 0.8.
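For context, here's a rough sketch of what I understand the momentum-guided perturbation to be doing; the function, state keys, and exact normalization are my paraphrase of the paper, not the code in this PR:

```python
import torch

def mgpo_perturb(weight, grad, state, rho=0.05, mu=0.9, beta=0.9, eps=1e-12):
    """Rough sketch of a momentum-guided SAM-style perturbation (illustrative)."""
    # EMA of the gradient (mu) gives a smoothed perturbation direction.
    if "momentum" not in state:
        state["momentum"] = torch.zeros_like(grad)
    state["momentum"].mul_(mu).add_(grad, alpha=1.0 - mu)

    # EMA of the gradient magnitude (beta, i.e. mgpo_beta) normalizes the scale.
    g_norm = grad.norm()
    state["grad_mag_ema"] = beta * state.get("grad_mag_ema", g_norm) + (1.0 - beta) * g_norm

    # Perturb along the momentum direction within radius rho (mgpo_rho).
    m = state["momentum"]
    return weight + rho * state["grad_mag_ema"] * m / (m.norm() + eps)
```

Roughly, `mgpo_rho` sets how far the weights get nudged and `mgpo_beta` how quickly the magnitude normalization tracks gradient changes, which would explain lowering both for larger models.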

From the paper:

> For NLU tasks, we fine-tune T5-base (Raffel et al., 2020) with a learning rate of 1×10⁻⁴, a sequence length of 128, and a batch size of 32. ρ = 0.05, μ = 0.9, β = 0.9. For the NLG tasks, we fine-tune LLaMA-2-7B (Touvron et al., 2023) with a learning rate of 2×10⁻⁵, a sequence length of 1024, and a macro batch size of 32. ρ = 0.01, μ = 0.8, β = 0.8.
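If you're training at the LLaMA-2-7B scale, mapping the paper's numbers onto this PR's args would presumably look like this (μ is the paper's momentum coefficient; I'm only mapping ρ and β here):

```toml
network_args = [
   "mgpo_rho=0.01", # paper's ρ at 7B scale
   "mgpo_beta=0.8", # paper's β at 7B scale
]
```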

Also finding this is about 10-12% faster than GGPO (and almost the same speed as regular LoRA; the difference may be indiscernible).

@rockerBOO (Contributor, Author) commented:
Same settings except:

MGPO:

- rho: 0.05
- beta: 0.9

GGPO:

- beta: 0.01
- sigma: 0.03

Trained on Flux dev2pro. Inference on Flux Krea.

[Image grid of sample outputs, four rows; columns: Krea, MGPO, GGPO, LoRA]

@rockerBOO marked this pull request as draft on August 21, 2025 at 23:58
@rockerBOO (Contributor, Author) commented:

Currently a couple of smaller issues:

  • `_grad_magnitude_ema_down` and `_grad_magnitude_ema_up` are getting saved to the file. Need to make sure they are properly ignored, or move away from storing them as `Parameter`s (see the sketch below).
  • The authors hinted at low gradients causing issues early in training, so we can add a gradient floor to the GGPO path to make it more stable; in my testing it seems to work better. Since this changes previous behavior, I have made it opt-in through a new variable (`ggpo_min_grad`). With this I also refactored the GGPO logic into a separate function to keep the forward pass cleaner.
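For the first item, the standard PyTorch way to keep running stats out of the saved file is a non-persistent buffer; a minimal sketch (names mirror my bullet above, not necessarily the final code), plus what I mean by the opt-in gradient floor:

```python
import torch
import torch.nn as nn

class MGPOLinear(nn.Module):  # illustrative container, not the PR's class
    def __init__(self, in_features, out_features):
        super().__init__()
        self.linear = nn.Linear(in_features, out_features)
        # persistent=False keeps these out of state_dict(), so they are
        # never written to the saved file, unlike nn.Parameter.
        self.register_buffer("_grad_magnitude_ema_down", torch.zeros(1), persistent=False)
        self.register_buffer("_grad_magnitude_ema_up", torch.zeros(1), persistent=False)

def floor_grad_norm(grad_norm: torch.Tensor, min_grad: float | None = None) -> torch.Tensor:
    # Opt-in floor (ggpo_min_grad) so tiny early-training gradients
    # don't collapse the perturbation scale.
    return grad_norm if min_grad is None else grad_norm.clamp_min(min_grad)
```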

