Add Muon #203

Open · wants to merge 5 commits into master

Conversation

@murrellb commented Dec 21, 2024

This adds Muon (https://kellerjordan.github.io/posts/muon/), which uses an approximate orthogonalization before the update. There isn't a publication, but it gave a key improvement in the nanoGPT training "speedrun" attempt: https://github.com/KellerJordan/modded-nanogpt

This was started by @mashu
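
For readers unfamiliar with the method, here is a minimal Julia sketch (not the code in this PR) of the Newton-Schulz iteration Muon uses to approximately orthogonalize a gradient matrix. The function name and keywords are illustrative; the coefficients are the ones used in Keller Jordan's reference muon.py.

```julia
using LinearAlgebra

# Sketch only: approximate orthogonalization via a quintic Newton-Schulz iteration.
# Expects a floating-point matrix; everything except the coefficients is illustrative.
function newton_schulz(G::AbstractMatrix{T}; steps::Integer = 5,
                       coeffs = (3.4445, -4.7750, 2.0315)) where {T}
    a, b, c = T.(coeffs)                  # coefficients from the reference muon.py
    X = G / (norm(G) + eps(T))            # normalise so the iteration converges
    transposed = size(X, 1) > size(X, 2)
    X = transposed ? Matrix(X') : X       # iterate on the wide orientation
    for _ in 1:steps
        A = X * X'
        X = a * X + (b * A + c * A * A) * X
    end
    return transposed ? Matrix(X') : X
end
```

The PR's actual implementation in src/rules.jl differs in detail; see the review comments below.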

PR Checklist

  • Tests are added
  • Documentation, if applicable

src/rules.jl (outdated)
Comment on lines 659 to 663:

```julia
X = G / (norm(G) + eps(T))
transposed = size(G, 1) > size(G, 2)
if transposed
    X = X'
end
```
A Member commented:

I really don't like re-using X for different things, and sometimes the compiler feels the same way:

Suggested change:

```diff
-X = G / (norm(G) + eps(T))
-transposed = size(G, 1) > size(G, 2)
-if transposed
-    X = X'
-end
+X1 = G / (norm(G) + eps(T))
+transposed = size(G, 1) > size(G, 2)
+X2 = transposed ? transpose(X1) : X1
```

But also, I don't like creating type instability. So even better, the inner loop could be pulled out to another function:

Suggested change:

```diff
-X = G / (norm(G) + eps(T))
-transposed = size(G, 1) > size(G, 2)
-if transposed
-    X = X'
-end
+X = G / (norm(G) + eps(T))
+if size(G, 1) <= size(G, 2)
+    _innerloop(X, a, b, c)
+else
+    transpose(_innerloop(transpose(X), a, b, c))
+end
+end
+function _innerloop(X, a, b, c)
+    for _ in 1:5
+        A = X * X'
```

Comment on lines +619 to +625
- Fallback function (`fallback`): Function to control when, in addition to 1D arrays, the fallback optimizer should be used. Will be passed the parameter array and must return a boolean.

Note: Works best with large batch sizes and may not be suitable for fine-tuning.
In the nanoGPT speedrun experiments, Muon is used for the 2D-and-higher weights of the internal layers, while AdamW is used for the 1D weights, the embeddings, and the heads.

`Optimisers.adjust!(optimiser_state, η::Real)` will set Muon's `eta` to `η` and rescale the fallback optimizer's `eta` by the same factor, preserving their ratio,
but `Optimisers.adjust!(optimiser_state, eta = η)` will only adjust Muon's learning rate (allowing you to adjust the fallback optimizer's learning rate separately).
@mcabbott (Member) commented Dec 21, 2024:

What it seems you really want is for setup to use Muon on some arrays, and AdamW on others. But instead this is rolled into this particular meta-optimisation rule, which is also called Muon. Maybe we should think about how to do that in a bit more generality?

One spelling would be `setup(OptimiserIfElse(fun, AdamW(), Muon()), model)`, which does `fun(x) ? init(AdamW(), x) : init(Muon(), x)`, with some struct `OptimiserIfElse <: AbstractRule`. But it's a "fake rule" which is digested at setup time.
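
A hedged sketch of what that first spelling could look like (illustrative only, not existing Optimisers.jl API; for simplicity it picks the real rule on every call rather than being digested away at setup time):

```julia
using Optimisers

# Illustrative sketch only: a "rule" that defers to one of two real rules per array.
struct OptimiserIfElse{F, A <: Optimisers.AbstractRule, B <: Optimisers.AbstractRule} <: Optimisers.AbstractRule
    fun::F       # predicate on the parameter array
    iftrue::A    # rule used when fun(x) is true
    iffalse::B   # rule used otherwise
end

choose(o::OptimiserIfElse, x) = o.fun(x) ? o.iftrue : o.iffalse

Optimisers.init(o::OptimiserIfElse, x::AbstractArray) = Optimisers.init(choose(o, x), x)
Optimisers.apply!(o::OptimiserIfElse, state, x, dx) = Optimisers.apply!(choose(o, x), state, x, dx)
```

With this, `setup(OptimiserIfElse(x -> ndims(x) == 1, AdamW(), Muon()), model)` would build AdamW state for 1D arrays and Muon state for everything else.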

Another would be `setup(fun::Function, model::Any)`, which is like

```julia
opt_state = setup(model) do x
    ndims(x) == 1 ? AdamW() : Muon()
end
```

I'm sure we batted around such ideas when writing this package, but nobody had a concrete need. One thing we wondered was whether this which-rule function ought to see just `x` or, for instance, the field name, or the layer's type, or what? `ndims(x) == 1` is a way of selecting biases but can't distinguish weight matrices from different layers.

Edit: a third way is just to let you handle it yourself. Too obscure, even if documented? Not sure.

```julia
opt_state = fmapstructure(model; exclude = isnumeric) do x
    rule = ndims(x) == 1 ? AdamW() : Muon()
    setup(rule, x)
end
```

@murrellb (Author) replied:

I strongly agree that this functionality would be good to have in general, but as you say it isn't quite clear what to switch on.

But in this case, when people say "Muon" they mean "Muon for the ≥2D weights, with an AdamW fallback for the 1D ones"; see https://github.com/KellerJordan/Muon/blob/master/muon.py.
And the only things they switch on are directly inferable from the tensor itself, which we have access to inside the optimiser, so we can get the same behaviour. So I think it makes sense, here, to just call this Muon. It also supports the same way as the Python version of adjusting the two eta values during warmup/cooldown (retaining their ratio), saving the user a little effort if they want to use it as-is.
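
To make that adjustment behaviour concrete, a hedged usage sketch (the `Muon(eta = ...)` keyword follows the quoted docstring and may not match the final API; `adjust!` itself is existing Optimisers.jl API):

```julia
using Optimisers  # plus the Muon rule from this PR

# A toy "model": Optimisers.setup walks any nested structure of arrays.
model = (weight = randn(Float32, 4, 4), bias = zeros(Float32, 4))

# Keyword name for Muon here is an assumption based on the quoted docstring.
state = Optimisers.setup(Muon(eta = 0.02), model)

Optimisers.adjust!(state, 0.01)        # set Muon's eta to 0.01 and rescale the fallback's eta, keeping their ratio
Optimisers.adjust!(state, eta = 0.01)  # adjust only Muon's eta; the fallback's learning rate is untouched
```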
