Implementation of AdamW differs from PyTorch #2433
Comments
We fixed this in […]. There is some ambiguity in the paper: they call […]. On the other hand, the pytorch implementation seems equal to #1612, so I think we should fix AdamW again.
Thank you for unravelling that for me, and sorry that I didn't notice those issues/PRs in the first place. A short elaboration for future reference: the paper on AdamW uncouples what it calls the “schedule multiplier” […] from […]. PyTorch only exposes two parameters, […], so I can't quite tell how important the additional control of an uncoupled […] is.
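For reference, the two parametrizations being contrasted, as I read the paper and the PyTorch docs (writing $A_t$ for Adam's direction $\hat m_t/(\sqrt{\hat v_t}+\epsilon)$): the paper's decoupled form uses a schedule multiplier $\eta_t$, a step size $\alpha$ and a weight decay $\lambda$,

$$\theta_t = \theta_{t-1} - \eta_t\,(\alpha\,A_t + \lambda\,\theta_{t-1}),$$

while PyTorch exposes only a learning rate $\gamma$ and a weight decay $\lambda$ and scales both terms by $\gamma$:

$$\theta_t = \theta_{t-1} - \gamma\,(A_t + \lambda\,\theta_{t-1}).$$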
We should adhere to pytorch's implementation for sure. Would you mind filing PRs here and in Optimisers.jl?
I don't have time to comment on this in detail now (will do so later), but the decision to diverge from PyTorch was not made lightly. IIRC it was something about how their handling and interpretation of the learning rate was unintuitive and would trip up people moving from other optimizers -> AdamW. I also didn't find their justification particularly compelling.
Ok, I did some more digging into why PyTorch decided to couple the learning rate and weight decay coefficient for their AdamW implementation. My best guess is that this comment on one of the AdamW PRs triggered changes which cascaded all the way to the ultimate AdamW PR. I don't find the point super compelling here because Flux lacks an Adam + coupled L2 norm constructor, unlike PyTorch. Moreover, changing the calculation would be a breaking change for Flux and Optimisers.jl.

Now for an argument on semantics and usability. I agree that separate scheduling alone is not enough to justify a separate learning rate and weight decay rate. The problem lies more with tweaking hyperparameters. The AdamW paper makes a big point about being able to control both independently. With both coupled as PyTorch does, you always have to remember to tweak the weight decay every time you tweak the learning rate, otherwise you will be increasing/decreasing both simultaneously. We may even have public examples of people not realizing this, e.g. fastai/fastai#1806 (funnily enough, FastAI's AdamW used to not couple the two hyperparams).

There's also a practical concern if we do introduce hyperparam scheduling (i.e. controlling […]). As such, I think the best path forward would be to add a keyword arg to the […].
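A toy illustration of the tweaking problem described above (hypothetical numbers, plain Julia, not tied to any Flux API): under the coupled convention, changing the learning rate silently changes the per-step decay as well, while the decoupled convention leaves it untouched.

```julia
# Effective per-step shrinkage applied to a weight θ under each convention (toy values).
λ = 0.01                                   # weight decay coefficient
for η in (1e-3, 5e-4)                      # "tweak the learning rate"
    coupled   = η * λ                      # PyTorch-style: decay term is η*λ*θ
    decoupled = λ                          # Flux-style: decay term is λ*θ, independent of η
    println("η = $η  coupled decay factor = $coupled  decoupled decay factor = $decoupled")
end
```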
I have to think a bit about it. Another datapoint is that optax also couples the two.
I actually opened an issue on the Optax repo about this, and they more or less said they wanted to copy PyTorch...
I think we should simply implement AdamW by copy-pasting the code from Adam. We can add the […].
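A minimal sketch of what that could look like, assuming we write the AdamW step directly from Adam's update and add a keyword to switch between the two decay conventions (the names `adamw_step!` and `couple` are illustrative, not the actual Optimisers.jl API):

```julia
# Hypothetical sketch: one AdamW step built from the Adam update, with a `couple`
# switch choosing between the PyTorch-style and the current Flux-style decay.
function adamw_step!(θ, m, v, grad, t; η = 1e-3, β1 = 0.9, β2 = 0.999,
                     λ = 1e-2, ϵ = 1e-8, couple = true)
    m .= β1 .* m .+ (1 - β1) .* grad        # first-moment estimate
    v .= β2 .* v .+ (1 - β2) .* grad .^ 2   # second-moment estimate
    mhat = m ./ (1 - β1^t)                  # bias-corrected moments
    vhat = v ./ (1 - β2^t)
    adam = mhat ./ (sqrt.(vhat) .+ ϵ)       # Adam's direction ("A" in the discussion above)
    if couple
        θ .-= η .* (adam .+ λ .* θ)         # PyTorch-style: decay scaled by η
    else
        θ .-= η .* adam .+ λ .* θ           # current Flux-style: decay independent of η
    end
    return θ
end
```

With `couple = true` the decay term is scaled by `η`, matching PyTorch; with `couple = false` it reproduces the current behaviour, where `λ` acts independently of the learning rate.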
No objections, we'd just have to make a breaking release with it. Anything else we'd want to get in said release?
Thank you for the extended discussion! Just to make sure I understand correctly (I'll try to find time to submit a PR):

[…] (where […], and where we expose to the user […]).
Now that we are having a breaking release, we should try to match pytorch's implementation by default.
Hi, thank you for developing and maintaining this awesome library and ecosystem!
I'm not entirely sure, but could it be that the documentation for the `AdamW` optimizer is a bit misleading? If I understand correctly, then its definition of […] means that it performs this update (where $-\eta A$ is Adam's update):

$$\theta \leftarrow \theta - \eta\,A - \lambda\,\theta$$

However, the paper on AdamW (which is linked to by the docs) parametrizes this differently as:

$$\theta \leftarrow \theta - \eta\,(\alpha\,A + \lambda\,\theta)$$
I.e. Flux's `eta` corresponds to the paper's $\eta\alpha$ and Flux's `decay` corresponds to the paper's $\eta\lambda$.

This is probably super unimportant (in that case, sorry for the noise) but since I just noticed this during bug hunting in an implementation of mine (which uses AdamW), I thought I'd report it.
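A quick numerical check of that correspondence (toy scalar values, hypothetical): plugging `eta = η*α` and `decay = η*λ` into the update above makes the two formulas agree.

```julia
θ, A = 0.5, 0.2                  # a single weight and Adam's direction A
η, α, λ = 0.9, 1e-3, 1e-2        # the paper's schedule multiplier, step size, weight decay
eta, decay = η * α, η * λ        # the claimed Flux equivalents
paper = θ - η * (α * A + λ * θ)  # the paper's parametrization
flux  = θ - eta * A - decay * θ  # Flux's parametrization
@assert paper ≈ flux             # both give the same updated weight
```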