
AdamW optimizer implemented incorrectly - weight decay does not incorporate learning rate #182

Closed as not planned
@BioTurboNick


In Optimisers.jl, AdamW is implemented as an OptimiserChain of Adam and WeightDecay:

Optimisers.jl/src/rules.jl, lines 510 to 514 at c2ae321:

AdamW(η, β = (0.9, 0.999), λ = 0.0, ϵ = 1e-8) =
  OptimiserChain(Adam(η, β, ϵ), WeightDecay(λ))
AdamW(; eta = 0.001, beta = (0.9, 0.999), lambda = 0, epsilon = 1e-8) =
  OptimiserChain(Adam(eta, beta, epsilon), WeightDecay(lambda))

WeightDecay here simply adds λ * x to the (already Adam-transformed) gradient, with no learning-rate scaling:

Optimisers.jl/src/rules.jl, lines 569 to 574 at c2ae321:

function apply!(o::WeightDecay, state, x::AbstractArray{T}, dx) where T
  λ = T(o.lambda)
  dx′ = @lazy dx + λ * x
  return state, dx′
end
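
The practical consequence is easy to see: because WeightDecay is applied after Adam has already folded in η, the decay contribution to each step is λ * x, independent of the learning rate. A minimal sketch (using a zero gradient, so only the decay term moves the parameter):

using Optimisers

x = ones(4)
g = zero(x)            # zero gradient isolates the weight-decay contribution

η, λ = 0.001, 0.1      # λ chosen only for illustration
st = Optimisers.setup(Optimisers.AdamW(η, (0.9, 0.999), λ), x)
st, x′ = Optimisers.update(st, x, g)

x .- x′                # ≈ 0.1 per element, i.e. λ, not η * λ = 1e-4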

In the AdamW algorithm, and likewise in PyTorch's implementation of it, the weight-decay term must also be multiplied by the learning rate:
[Figure: AdamW update rule (Algorithm 2) from https://arxiv.org/pdf/1711.05101 — θ_t ← θ_{t−1} − η_t (α·m̂_t / (√v̂_t + ϵ) + λ·θ_{t−1}), where the schedule multiplier η_t scales both the Adam step and the weight-decay term λ·θ_{t−1}.]
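
For reference, here is a sketch of one decoupled AdamW step following Algorithm 2 of the paper (m, v, t are the usual Adam state; α is the Adam step size and η the schedule multiplier; the default values are only for illustration):

function adamw_step(θ, g, m, v, t; α = 0.001, β = (0.9, 0.999), λ = 0.01, ϵ = 1e-8, η = 1.0)
  m = β[1] .* m .+ (1 - β[1]) .* g
  v = β[2] .* v .+ (1 - β[2]) .* g .^ 2
  m̂ = m ./ (1 - β[1]^t)
  v̂ = v ./ (1 - β[2]^t)
  # The λ .* θ term sits inside η .* (...): the decay is scaled by the
  # learning-rate schedule, unlike the current OptimiserChain formulation.
  θ = θ .- η .* (α .* m̂ ./ (sqrt.(v̂) .+ ϵ) .+ λ .* θ)
  return θ, m, v
end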

This was the source of considerable frustration for me: a model I have been porting from PyTorch misbehaved badly, and the problem traced back to this discrepancy.

The following optimiser produces the correct behavior:


using Optimisers

# Like WeightDecay, but scales the decay by the learning rate, as decoupled AdamW requires.
Optimisers.@def struct LearningWeightDecay <: Optimisers.AbstractRule
  lambda = 5e-4
  eta = 0.001
end

Optimisers.init(o::LearningWeightDecay, x::AbstractArray) = nothing

function Optimisers.apply!(o::LearningWeightDecay, state, x::AbstractArray{T}, dx) where T
  λ, η = T(o.lambda), T(o.eta)
  # Add η * λ * x (not just λ * x) to the gradient.
  dx′ = Optimisers.@lazy dx + η * λ * x
  return state, dx′
end

CorrectAdamW(η, β = (0.9, 0.999), λ = 0.0, ϵ = 1e-8) =
  Optimisers.OptimiserChain(Optimisers.Adam(η, β, ϵ), LearningWeightDecay(λ, η))
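
Usage sketch (assuming the definitions above are in scope; λ = 0.1 is again only for illustration) — with a zero gradient, the per-step shrinkage is now η * λ, matching PyTorch:

x = ones(4)
g = zero(x)

st = Optimisers.setup(CorrectAdamW(0.001, (0.9, 0.999), 0.1), x)
st, x′ = Optimisers.update(st, x, g)

x .- x′    # ≈ 0.001 * 0.1 = 1e-4 per element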
