Describe the bug

The AdamW implementation (see here) does not truly decouple the weight decay and learning rate parameters in line with the AdamW paper. This coupling complicates hyperparameter (HP) tuning, because tuning the learning rate also changes the effective weight decay (WD) used to train the model.

The implementation computes the update as

$$w_{t} = (1 - \eta_{\text{effective}}\, \lambda)\, w_{t-1} - \eta_{\text{effective}}\, \frac{\hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon},$$

where $\eta_{\text{effective}} = \eta_t\, \eta_{\text{max}}$, with $\eta_t$ denoting the scheduler multiplier and $\eta_{\text{max}}$ the max/base LR. This clearly couples LR and WD and is not in line with the paper, which proposes to compute the update as

$$w_{t} = (1 - \eta_t\, \lambda)\, w_{t-1} - \eta_{\text{effective}}\, \frac{\hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon},$$

i.e. the weight decay is scaled by the schedule $\eta_t$ only, not by the base LR $\eta_{\text{max}}$.

For easier and more intuitive tuning, it would be useful to enable the completely decoupled version of AdamW via the simple fix $\lambda \leftarrow (\eta_{\text{effective}} / \eta_{\text{max}})\, \lambda$, with the update $w_{t} = (1 - \lambda)\, w_{t-1} - \eta_{\text{effective}}\, \hat{m}_t / (\sqrt{\hat{v}_t} + \epsilon)$.

Note: This bug also exists in the AdamW implementations in PyTorch and Optax and has already been highlighted a few times across different papers, libraries, and blogs. More links below for reference.
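To make the coupling concrete, here is a minimal, self-contained NumPy sketch of a single parameter update under both formulations. It is purely illustrative and not the library's code (the toy momentum values and the halfway-decayed schedule are assumptions); it only shows that the coupled rule's applied decay scales with the base LR while the decoupled rule's does not:

```python
import numpy as np

def adamw_step_coupled(w, m_hat, v_hat, lr_effective, weight_decay, eps=1e-8):
    """Coupled formulation (as implemented): decay is scaled by the effective LR,
    so changing the base LR also changes the effective weight decay."""
    w = (1.0 - lr_effective * weight_decay) * w
    return w - lr_effective * m_hat / (np.sqrt(v_hat) + eps)

def adamw_step_decoupled(w, m_hat, v_hat, lr_effective, lr_max, weight_decay, eps=1e-8):
    """Fully decoupled variant proposed in the issue: decay is scaled only by the
    schedule multiplier eta_t = lr_effective / lr_max, never by the base LR."""
    eta_t = lr_effective / lr_max
    w = (1.0 - eta_t * weight_decay) * w
    return w - lr_effective * m_hat / (np.sqrt(v_hat) + eps)

# Toy numbers: same schedule position, two different base LRs.
w, m_hat, v_hat, wd = 1.0, 0.1, 0.04, 0.1
for lr_max in (1e-3, 1e-2):
    lr_eff = 0.5 * lr_max  # e.g. halfway through a linear decay schedule
    print(f"lr_max={lr_max:g}  "
          f"coupled decay={lr_eff * wd:.1e}  "
          f"decoupled decay={(lr_eff / lr_max) * wd:.1e}  "
          f"coupled w={adamw_step_coupled(w, m_hat, v_hat, lr_eff, wd):.6f}  "
          f"decoupled w={adamw_step_decoupled(w, m_hat, v_hat, lr_eff, lr_max, wd):.6f}")
```

With the coupled rule the per-step decay is $\eta_{\text{effective}}\,\lambda$ (5e-5 vs 5e-4 for the two base LRs above), whereas the decoupled rule always applies $\eta_t\,\lambda$ = 5e-2 regardless of the base LR.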
For better or for worse, I think "AdamW" now refers to the LR-coupled version. In addition to PyTorch and JAX, I see this formulation in Keras (and therefore TensorFlow), PaddlePaddle, and MXNet. If we implement an LR-decoupled variant, we should give it a new name or make it an opt-in option so we don't confuse users.
There has been a lot of discussion in other frameworks:
Allowing the user to invoke the fully decoupled version via either option (an opt-in flag or a new name) would be helpful. A couple more references on the potential utility of independent WD are below.
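In the meantime, the fix formula above also implies a user-side workaround for implementations that apply the decay as $\eta_{\text{effective}}\,\lambda$: pass $\lambda / \eta_{\text{max}}$ as the weight-decay argument, so the applied decay becomes $(\eta_{\text{effective}} / \eta_{\text{max}})\,\lambda = \eta_t\,\lambda$, i.e. the fully decoupled value. A small sketch of the arithmetic (toy numbers, no particular framework assumed):

```python
# Assumption: the optimizer applies decay as lr_effective * weight_decay (the coupled rule).
# Rescaling weight_decay by 1 / lr_max then reproduces the fully decoupled behaviour.
lr_max, weight_decay = 1e-3, 0.1

for eta_t in (1.0, 0.5, 0.1):                                # scheduler multiplier over training
    lr_effective = eta_t * lr_max
    coupled_decay = lr_effective * (weight_decay / lr_max)   # what the implementation applies
    decoupled_decay = eta_t * weight_decay                   # what the issue asks for
    assert abs(coupled_decay - decoupled_decay) < 1e-12
    print(f"eta_t={eta_t}: applied decay {coupled_decay:.3f} (independent of lr_max)")
```

The downside of this trick is that the passed weight_decay value no longer has its usual meaning, which is why an explicit opt-in or a separately named optimizer would still be the cleaner solution.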