Add Muon #203
base: master
Conversation
src/rules.jl (outdated):
```julia
X = G / (norm(G) + eps(T))
transposed = size(G, 1) > size(G, 2)
if transposed
    X = X'
end
```
I really don't like re-using X for different things, and sometimes the compiler feels the same way:
Suggested change:

```julia
X1 = G / (norm(G) + eps(T))
transposed = size(G, 1) > size(G, 2)
X2 = transposed ? transpose(X1) : X1
```
But also, I don't like creating type instability. So even better, the inner loop could be pulled out to another function:
Suggested change:

```julia
X = G / (norm(G) + eps(T))
if size(G, 1) <= size(G, 2)
    _innerloop(X, a, b, c)
else
    transpose(_innerloop(transpose(X), a, b, c))
end
end

function _innerloop(X, a, b, c)
    for _ in 1:5
        A = X * X'
```
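The suggestion above breaks off inside `_innerloop`. For reference, a sketch of how such a helper might continue, following the quintic Newton–Schulz iteration used in the public Muon reference implementation (the coefficients `a`, `b`, `c` are assumed to be passed in, as in the signature above):

```julia
# Sketch only: quintic Newton–Schulz orthogonalization in the style of the Muon
# reference implementation. X is assumed to be "wide" (rows ≤ columns) and
# already normalized by the caller.
function _innerloop(X, a, b, c)
    for _ in 1:5
        A = X * X'             # Gram matrix of the current iterate
        B = b * A + c * A * A  # polynomial in A, applied below
        X = a * X + B * X      # push X towards the nearest semi-orthogonal matrix
    end
    return X
end
```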
> - Fallback function (`fallback`): Function to control when, in addition to 1D arrays, the fallback optimizer should be used. Will be passed the parameter array and must return a boolean.
>
> Note: Works best with large batch sizes and may not be suitable for fine-tuning.
> In nanoGPT speedrun experiments, Muon is used for the ≥ 2D weights of internal layers, and AdamW is used for the 1D weights, embeddings, and heads.
>
> `Optimisers.adjust!(optimiser_state, η::Real)` will adjust the fallback optimizer's `eta` to `η * (opt.eta / eta)` and Muon's `eta` to `η`, preserving their ratio,
> but `Optimisers.adjust!(optimiser, eta = η)` will only adjust Muon's learning rate (allowing you to adjust the fallback optimizer's learning rate separately).
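A hedged usage sketch of the adjustment behaviour described above (the `Muon(eta = 0.02)` constructor call and `model` are illustrative assumptions, not taken verbatim from this PR):

```julia
using Optimisers

state = Optimisers.setup(Muon(eta = 0.02), model)  # `model` is any Functors-compatible model

Optimisers.adjust!(state, 0.01)        # Muon's eta becomes 0.01; the fallback's eta is rescaled to keep the ratio
Optimisers.adjust!(state, eta = 0.01)  # only Muon's eta changes; the fallback's eta is left alone
```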
What it seems you really want is for `setup` to use `Muon` on some arrays, and `AdamW` on others. But instead this is rolled into this particular meta-optimisation rule, which is also called `Muon`. Maybe we should think about how to do that in a bit more generality?

One spelling would be `setup(OptimiserIfElse(fun, AdamW(), Muon()), model)`, which does `fun(x) ? init(AdamW(), x) : init(Muon(), x)`, with some struct `OptimiserIfElse <: AbstractRule`. But it's a "fake rule" which is digested at `setup` time.
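For concreteness, a purely hypothetical sketch of that fake rule (nothing here exists in Optimisers.jl; the names are illustrative):

```julia
using Optimisers

# Hypothetical "fake rule": setup would digest it per-array rather than apply it.
struct OptimiserIfElse{F, A <: Optimisers.AbstractRule, B <: Optimisers.AbstractRule} <: Optimisers.AbstractRule
    fun::F      # predicate: fun(x)::Bool decides which rule handles array x
    iftrue::A   # rule used when fun(x) is true
    iffalse::B  # rule used otherwise
end

# Conceptually, setup would call something like this for each leaf array:
choose_rule(o::OptimiserIfElse, x) = o.fun(x) ? o.iftrue : o.iffalse
```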
Another would be `setup(fun::Function, model::Any)`, which is like:
```julia
opt_state = setup(model) do x
    ndims(x) == 1 ? AdamW() : Muon()
end
```
I'm sure we batted around such ideas when writing this package, but nobody had a concrete need. One thing we wondered was whether this which-rule function ought to see just `x` or, for instance, the field name, or the layer's type, or what? `ndims(x) == 1` is a way of selecting `bias` but can't distinguish weight matrices from different layers.
Edit: a 3rd way is just to let you handle it yourself. Too obscure, even if documented? Not sure.
```julia
opt_state = fmapstructure(model; exclude = isnumeric) do x
    rule = ndims(x) == 1 ? AdamW() : Muon()
    setup(rule, x)
end
```
I strongly agree that this functionality would be good to have in general, but as you say it isn't quite clear what to switch on.
But in this case, when people say "Muon" they mean "Muon for the ≥ 2D weights with an AdamW fallback for 1D"; see https://github.com/KellerJordan/Muon/blob/master/muon.py.
And the only things they use to switch on are directly inferable from the tensor itself, which we have access to inside the optimiser, so we can get the same behavior. So I think it makes sense, here, to just call this Muon? Also, it supports the same way of adjusting the two eta values during warmup/cooldown as the Python version (retaining their ratio), saving the user a little effort if they want to use it as-is.
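Spelled out, the switching described in the docstring excerpt above reduces to a per-array predicate. A small illustrative sketch (the names here are not from the PR):

```julia
# Illustrative only: use the fallback optimizer for 1D arrays, or whenever the
# user-supplied `fallback` predicate returns true for the parameter array.
use_fallback(fallback_predicate, x) = ndims(x) == 1 || fallback_predicate(x)

use_fallback(x -> false, randn(16))    # true:  bias-like vectors go to the fallback
use_fallback(x -> false, randn(8, 8))  # false: weight matrices go to Muon
```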
This adds Muon (https://kellerjordan.github.io/posts/muon/), which uses an approximate orthogonalization before the update. There isn't a publication, but it gave a key improvement in the nanoGPT training "speedrun" attempt: https://github.com/KellerJordan/modded-nanogpt
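Roughly, the idea is to accumulate momentum and orthogonalize it before stepping. A hedged, illustrative sketch (not this PR's code), with an exact SVD standing in for the Newton–Schulz approximation discussed in the review thread; the function name and the step-size/momentum defaults are assumptions:

```julia
using LinearAlgebra

# Illustrative sketch of a Muon-style update for one weight matrix W with
# gradient G and a persistent momentum buffer buf.
function muon_like_step!(W, G, buf; eta = 0.02, mu = 0.95)
    @. buf = mu * buf + G  # momentum accumulation
    F = svd(buf)
    O = F.U * F.Vt         # nearest semi-orthogonal matrix to buf
    @. W -= eta * O
    return W
end
```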
This was started by @mashu.
PR Checklist