Amazing Flux FFT quality on a 5090: (N)AdamW / Adan with fused backwards pass & CPU offloading #2187
base: sd3
Conversation
…an apply to other optimizers now too.
…orrect member variable
I've just made significant improvements to this pull request. Training quality is now even higher than the first check-in. If you're FFT training with a 5090 and 128 GB of main memory, then you should definitely try out this branch. I think you'll be glad you did.

As for the changes, I found that casting Adam's exponential averages to bf16 for storage on the CPU cost more accuracy than I'd first thought it would. So now I scale the exp_avg[_sq] values before storing them instead.

In addition to that, I followed the same scheme as the Adam8bit optimizer, in that any tensors that have 4096 or fewer elements are not converted to bf16/u16 at all, but are just stored on the CPU in f32 format. About a third of Flux's tensors are of that type. It's been suggested that tensors with small numbers of elements are more sensitive to loss of precision, and since they take up so little memory it's worth keeping them in full f32 format.

I found I can still train with batch size 5 like before, so the resulting memory profile is pretty similar. You don't need to do anything special to use the changes in this update: just fetch the code and use the same command line parameters as I gave above, and it should all happen automatically.
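To make the small-tensor rule above concrete, here is a minimal sketch of the storage decision it describes. This is illustrative only, not the branch's actual code; the function name and the plain bf16 fallback for large tensors are assumptions standing in for the branch's scaled 16-bit scheme.

```python
import torch

SMALL_TENSOR_THRESHOLD = 4096  # cutoff borrowed from the Adam8bit scheme described above

def to_cpu_storage(t: torch.Tensor) -> torch.Tensor:
    """Pick a CPU storage format for an optimizer-state tensor (illustrative sketch)."""
    if t.numel() <= SMALL_TENSOR_THRESHOLD:
        # Small tensors cost little memory and are thought to be more sensitive
        # to precision loss, so keep them in full f32.
        return t.detach().to("cpu", dtype=torch.float32)
    # Larger tensors get a reduced-precision format; bf16 here is a stand-in
    # for the branch's scaled 16-bit storage.
    return t.detach().to("cpu", dtype=torch.bfloat16)
```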
@araleza would this work with 8bit optimizers?
@RougeXAi - yes. I only implemented Adam instead of Adam8bit because I wasn't familiar enough with optimizers to get Adam8bit working. But now that I've learned more, I'm effectively borrowing more and more techniques from Adam8bit to improve this optimizer.
@araleza ahh, hopefully you'll be able to release an 8bit version so I can try it locally :). I rented a GPU to test out the one you released; results seem pretty consistent.
… the exp_avg[_sq] values
…e final Flux stages to f32.
I've now checked in another major improvement. The quality of this version probably exceeds the previous check-in by as much as that one exceeded Adafactor. 👀

The big change is that I realized that 128 GB of main memory is actually enough to store AdamW's exp_avg and exp_avg_sq arrays in f24 format, rather than bf16. Just as bf16 is the upper two bytes of f32 format, f24 is the upper three bytes, i.e. with 8 more bits of mantissa. It's not a native CUDA format, so it can only be used for storage rather than operations, but that's all the optimizer needs. This means that the exp_avg and exp_avg_sq tensors are now around 256 times more precise than they were at bf16. The quality bump is significant: I've found that learning quality, realism, and particularly image sharpness are all notably improved.

The other change I made is putting the final Flux layer in f32 format for training, rather than bf16 like the rest. The last layer seems to matter a lot to image sharpness, and putting it in f32 format doesn't take much VRAM, so I believe it's worth it.

@RougeXAi, if you liked what I'd checked in before, I think you'd very much like this new version. I'd actually appreciate getting your opinion on the new level of quality, if you have the time to try it out. No changes to the command line are required to make use of these improvements; they should just activate automatically.

Edit: Oh, one other big point: if you have …
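To make the f24 idea concrete, here is a rough sketch of how a float32 tensor can be packed to three bytes per element for CPU storage and unpacked again. It assumes a little-endian host, and the `pack_f24`/`unpack_f24` names are hypothetical; this is not the code used in the branch.

```python
import torch

def pack_f24(t: torch.Tensor) -> torch.Tensor:
    """Keep the upper 3 bytes of each float32 (sign, exponent, 15 mantissa bits)."""
    # View the f32 data as raw bytes: on a little-endian host, byte 0 is the
    # least-significant mantissa byte, so drop column 0 and keep columns 1..3.
    b = t.detach().contiguous().view(torch.uint8).reshape(-1, 4)
    return b[:, 1:].clone().cpu()            # 3 bytes per element

def unpack_f24(packed: torch.Tensor, shape) -> torch.Tensor:
    """Reinsert a zero low byte and reinterpret the result as float32."""
    b = torch.zeros(packed.shape[0], 4, dtype=torch.uint8)
    b[:, 1:] = packed
    return b.view(torch.float32).reshape(shape)

# Round-trip example: the error is bounded by the dropped low mantissa byte.
x = torch.randn(8, 4)
y = unpack_f24(pack_f24(x), x.shape)
print((x - y).abs().max())
```

Since bf16 keeps only 7 mantissa bits and f24 keeps 15, the stored values gain 8 bits of resolution, which is where the "around 256 times" figure above comes from.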
I plan to test this. This is only for fine tuning, right? This f24 format also sounds really interesting.
@FurkanGozukara, some good news for your testing: I've just added a new optimizer to this branch: a CPU-offloading version of Adan.

I only recently found out about Adan. The paper suggests that in many cases it can outperform even NAdam: https://arxiv.org/abs/2208.06677

Adan is slower and heavier than NAdam, but in a way that's useful: when you have a small fine tuning dataset, the longer you can spend on each image and the more the model can learn from it, the longer the first epoch lasts before training images start to repeat.

Adan is already available for sd-scripts (in the optionally installed D-Adaptation library), but that version doesn't support CPU offloading. And since Adan needs four optimizer parameters for each model parameter, it's much heavier than even (already heavy) AdamW, which only needs two optimizer parameters per model parameter, and which is itself still significantly heavier than Adafactor. But I've managed to find some effective compression techniques for these four parameters, and so it just fits in a system with a 5090 GPU and 128 GB of main memory, still with batch size 5. You do have to switch off Kahan summation though, going back to the default stochastic updates, as the 2 bytes per model parameter that Kahan summation needs push the CPU memory over the limit. Stochastic updates work fine though, and Adan produces very high quality Full Fine Tuning of Flux.Dev.

That was a lot of words, so here are some parameters to use to get Adan working if you're using this branch:
Make sure you're not using --kahan_summation with it.

@FurkanGozukara, yes, this branch is for Full Fine Tuning. When it comes to creating a LoRA you can already use heavy optimizers, as there are so few parameters in a LoRA to optimize. But as you know, FFT typically produces better results than LoRA does. Thanks for taking a look at my branch!
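To illustrate why Adan carries four state tensors per model parameter (versus AdamW's two), here is a rough single-tensor sketch of the Adan update rule from the paper, following the convention of the reference implementation. It is illustrative only: the function and buffer names are made up, bias correction and weight decay are omitted, and it is not the offloaded implementation in this branch.

```python
import torch

def adan_step(p, grad, state, lr=3.5e-6, betas=(0.98, 0.92, 0.99), eps=1e-8):
    """One simplified Adan update for a single parameter tensor (arXiv:2208.06677)."""
    beta1, beta2, beta3 = betas
    if not state:  # four state tensors per model parameter:
        state["exp_avg"] = torch.zeros_like(p)       # EMA of gradients
        state["exp_avg_diff"] = torch.zeros_like(p)  # EMA of gradient differences
        state["exp_avg_sq"] = torch.zeros_like(p)    # EMA of squared corrected gradients
        state["prev_grad"] = grad.clone()            # last step's gradient
    diff = grad - state["prev_grad"]
    state["exp_avg"].mul_(beta1).add_(grad, alpha=1 - beta1)
    state["exp_avg_diff"].mul_(beta2).add_(diff, alpha=1 - beta2)
    corrected = grad + beta2 * diff
    state["exp_avg_sq"].mul_(beta3).addcmul_(corrected, corrected, value=1 - beta3)
    state["prev_grad"].copy_(grad)
    denom = state["exp_avg_sq"].sqrt().add_(eps)
    update = state["exp_avg"] + beta2 * state["exp_avg_diff"]
    p.addcdiv_(update, denom, value=-lr)
```

The prev_grad buffer plus the three EMAs are what make Adan roughly twice as heavy per parameter as AdamW's exp_avg / exp_avg_sq pair.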
So I've been trying to raise the image quality bar for Flux Full Fine Tuning (not LoRA) on a consumer GPU, specifically a 5090. One area I found that was costing significant quality was the effectively forced use of the Adafactor optimizer, which sacrifices a lot of weight-update accuracy in order to reduce its GPU memory load. As a result, I started to wonder if it was possible to get AdamW working on a 5090 GPU.
Not only did I manage to get it working, but I managed to combine:
- the (N)AdamW family of optimizers
- Kahan summation
- a fused backward pass
- CPU offloading of the optimizer state

all in one optimizer, and the training quality difference between this and Adafactor is just amazing.
The key to getting it all working is CPU offloading. AdamW's two memory-hungry structures are the gradient momentum and squared gradient momentum values for each weight, but with a fused backward pass in place, these can be offloaded to the CPU after each parameter group of the model is updated, and then brought back on the next step. As a result, to use this branch, you'll need at least 128 GB of CPU memory.
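As a sketch of how a fused backward pass can interleave per-parameter optimizer steps with CPU offloading (this is not the branch's actual code; it assumes PyTorch 2.1+ for register_post_accumulate_grad_hook, and the helper name is made up):

```python
import torch

def attach_fused_offloaded_optimizer(model, lr=3.5e-6, device="cuda"):
    """Step each parameter as soon as its gradient is ready, keeping the
    optimizer state (exp_avg / exp_avg_sq) in CPU memory between steps."""
    optimizers = {}
    for p in model.parameters():
        if not p.requires_grad:
            continue
        opt = torch.optim.NAdam([p], lr=lr)  # one tiny optimizer per parameter
        optimizers[p] = opt

        def hook(param, opt=opt):
            state = opt.state[param]
            # Bring this parameter's large state tensors back to the GPU for the update.
            for k, buf in state.items():
                if torch.is_tensor(buf) and buf.numel() > 1:
                    state[k] = buf.to(device, non_blocking=True)
            opt.step()                       # update just this one parameter
            opt.zero_grad(set_to_none=True)  # free the gradient immediately
            # Push the large state tensors back out to CPU memory.
            state = opt.state[param]
            for k, buf in state.items():
                if torch.is_tensor(buf) and buf.numel() > 1:
                    state[k] = buf.to("cpu", non_blocking=True)

        p.register_post_accumulate_grad_hook(hook)
    return optimizers
```

Because only one parameter's optimizer state is ever resident in VRAM at a time, the momentum tensors can live in main memory, which is what lets an Adam-family optimizer fit alongside the full model on a single consumer GPU.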
The transfers back and forth to the CPU for the momentum tensors and the Kahan residuals make each training step pretty slow, around 1 minute per training step for batch size 5. Or 12 seconds per training image, to put it that way. But I've found I can successfully train Flux Dev from the base model all the way to fully trained in a single day, so the speed is fast enough to be practical.
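For readers unfamiliar with the "Kahan residuals" mentioned above, here is a minimal sketch (not the branch's code) of Kahan-style compensated updates for bf16 weights: the bf16 rounding error of each update is kept in a 2-byte-per-parameter residual buffer and folded back in on the next step.

```python
import torch

def kahan_bf16_update(param: torch.Tensor, update: torch.Tensor, residual: torch.Tensor):
    """Apply `update` (f32) to a bf16 `param`, tracking lost precision in `residual` (bf16)."""
    # Do the arithmetic in f32, re-adding whatever was lost to rounding last step.
    exact = param.float() + update + residual.float()
    rounded = exact.to(torch.bfloat16)
    # Store the part that bf16 rounding threw away, to be applied next time.
    residual.copy_((exact - rounded.float()).to(torch.bfloat16))
    param.copy_(rounded)
```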
In order to try this branch out, just fetch the branch and then change your command line to remove Adafactor, and instead include these parameters:
--optimizer_type NAdamOffload --kahan_summation --fused_backward_pass --train_batch_size 5 --blocks_to_swap 35 --mem_eff_save
and that's it. (Edit: Also delete any --seed parameter if you're continuing training a model you've already trained from Flux Dev base!)

(I've found batch size 5 works on my 5090, with enough VRAM left to still use my machine. Maybe batch size 1 will work on a 4090?)

You can also use AdamOffload, NAdamWOffload, etc. I'd start with NAdamOffload though, as the Nesterov NAdam variants have given me much better quality than Adam for each of my test datasets. I'd also avoid the W weight-decay variants for fine tuning, as weight decay might damage the existing model values when fine tuning, unlike when training a LoRA.

I've also found I can run this optimizer to further tune an already-trained model that was first trained with Adafactor, and I still see a big jump up in quality after just 15 minutes or so of training, so it should be pretty quick to see if this branch works for you. I'd recommend an LR of around 3.5e-6 as a starting point.