Amazing Flux FFT quality on a 5090: (N)AdamW / Adan with fused backwards pass & CPU offloading #2187
base: sd3
Conversation
…an apply to other optimizers now too.
…orrect member variable
I've just made significant improvements to this pull request. Training quality is now even higher than the first check-in. If you're FFT training with a 5090 and 128 GB of main memory, then you should definitely try out this branch. I think you'll be glad you did.

As for the changes, I found that casting Adam's exponential averages to bf16 for storage on the CPU cost more accuracy than I'd first thought it would. So now I scale the exp_avg[_sq] values before storing them instead.

In addition to that, I followed the same scheme as the Adam8bit optimizer, in that any tensors that have 4096 or fewer elements are not converted to bf16/u16 at all, but are just stored on the CPU in f32 format. About a third of Flux's tensors are of that type. It's been suggested that tensors with small numbers of elements are more sensitive to loss of precision, and since they take up so little memory it's worth keeping them in full f32 format.

I found I can still train with batch size 5 like before, so the resulting memory profile is pretty similar. You don't need to do anything special to use the changes in this update: just fetch the code and use the same command line parameters as I gave above, and it should all happen automatically.
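To make the small-tensor rule above concrete, here is a minimal sketch of the storage decision it describes. This is illustrative only, not the branch's actual code; the function name and the plain bf16 fallback for large tensors are assumptions standing in for the branch's scaled 16-bit scheme.

```python
import torch

SMALL_TENSOR_THRESHOLD = 4096  # cutoff borrowed from the Adam8bit scheme described above

def to_cpu_storage(t: torch.Tensor) -> torch.Tensor:
    """Pick a CPU storage format for an optimizer-state tensor (illustrative sketch)."""
    if t.numel() <= SMALL_TENSOR_THRESHOLD:
        # Small tensors cost little memory and are thought to be more sensitive
        # to precision loss, so keep them in full f32.
        return t.detach().to("cpu", dtype=torch.float32)
    # Larger tensors get a reduced-precision format; bf16 here is a stand-in
    # for the branch's scaled 16-bit storage.
    return t.detach().to("cpu", dtype=torch.bfloat16)
```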
@araleza would this work with 8bit optimizers?
@RougeXAi - yes. I only implemented Adam instead of Adam8bit because I wasn't familiar enough with optimizers to get Adam8bit working. But now that I've learned more, I'm effectively borrowing more and more techniques from Adam8bit to improve this optimizer.
@araleza ahh, hopefully you'll be able to release an 8bit version so I can try it locally :). I rented a GPU to test out the one you released; results seem pretty consistent.
… the exp_avg[_sq] values
…e final Flux stages to f32.
I've now checked in another major improvement. The quality of this version probably exceeds the previous check-in by as much as that one exceeded Adafactor. 👀

The big change is that I realized that 128 GB of main memory is actually enough to store AdamW's exp_avg and exp_avg_sq arrays in f24 format, rather than bf16. Just as bf16 is the upper two bytes of f32 format, f24 is the upper three bytes, i.e. with 8 more bits of mantissa. It's not a native CUDA format, so it can only be used for storage rather than operations, but that's all the optimizer needs. This means that the exp_avg and exp_avg_sq tensors are now around 256 times more precise than they were at bf16. The quality bump is significant: I've found that learning quality, realism, and particularly image sharpness are all notably improved.

The other change I made is putting the final Flux layer in f32 format for training, rather than bf16 like the rest. The last layer seems to matter a lot to image sharpness, and putting it in f32 format doesn't take much VRAM, so I believe it's worth it.

@RougeXAi, if you liked what I'd checked in before, I think you'd very much like this new version. I'd actually appreciate getting your opinion on the new level of quality, if you have the time to try it out. No changes to the command line are required to make use of these improvements; they should just activate automatically.

Edit: Oh, one other big point: if you have …
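To make the f24 idea concrete, here is a rough sketch of how a float32 tensor can be packed to three bytes per element for CPU storage and unpacked again. It assumes a little-endian host, and the `pack_f24`/`unpack_f24` names are hypothetical; this is not the code used in the branch.

```python
import torch

def pack_f24(t: torch.Tensor) -> torch.Tensor:
    """Keep the upper 3 bytes of each float32 (sign, exponent, 15 mantissa bits)."""
    # View the f32 data as raw bytes: on a little-endian host, byte 0 is the
    # least-significant mantissa byte, so drop column 0 and keep columns 1..3.
    b = t.detach().contiguous().view(torch.uint8).reshape(-1, 4)
    return b[:, 1:].clone().cpu()            # 3 bytes per element

def unpack_f24(packed: torch.Tensor, shape) -> torch.Tensor:
    """Reinsert a zero low byte and reinterpret the result as float32."""
    b = torch.zeros(packed.shape[0], 4, dtype=torch.uint8)
    b[:, 1:] = packed
    return b.view(torch.float32).reshape(shape)

# Round-trip example: the error is bounded by the dropped low mantissa byte.
x = torch.randn(8, 4)
y = unpack_f24(pack_f24(x), x.shape)
print((x - y).abs().max())
```

Since bf16 keeps only 7 mantissa bits and f24 keeps 15, the stored values gain 8 bits of resolution, which is where the "around 256 times" figure above comes from.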
I plan to test this. This is only for fine tuning, right? This f24 format also sounds really interesting.
@FurkanGozukara, some good news for your testing: I've just added a new optimizer to this branch: a CPU-offloading version of Adan.

I only recently found out about Adan. The paper suggests that in many cases it can outperform even NAdam: https://arxiv.org/abs/2208.06677

Adan is slower and heavier than NAdam, but in a way that's useful: when you have a small fine tuning dataset, the longer you can spend on each image and the more the model can learn from it, the longer the first epoch lasts before training images start to repeat.

Adan is already available for sd-scripts (in the optionally installed D-Adaptation library), but that version doesn't support CPU offloading. And since Adan needs four optimizer parameters for each model parameter, it's much heavier than even (already heavy) AdamW, which only needs two optimizer parameters per model parameter, and which is itself still significantly heavier than Adafactor. But I've managed to find some effective compression techniques for these four parameters, and so it just fits in a system with a 5090 GPU and 128 GB of main memory, still with batch size 5. You do have to switch off Kahan summation though, going back to the default stochastic updates, as the 2 bytes per model parameter that Kahan summation needs push the CPU memory over the limit. Stochastic updates work fine though, and Adan produces very high quality Full Fine Tuning of Flux.Dev.

That was a lot of words, so here are some parameters to use to get Adan working if you're using this branch:
Make sure you're not using --kahan_summation with it.

@FurkanGozukara, yes, this branch is for Full Fine Tuning. When it comes to creating a LoRA you can already use heavy optimizers, as there are so few parameters in a LoRA to optimize. But as you know, FFT typically produces better results than LoRA does. Thanks for taking a look at my branch!
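To illustrate why Adan carries four state tensors per model parameter (versus AdamW's two), here is a rough single-tensor sketch of the Adan update rule from the paper, following the convention of the reference implementation. It is illustrative only: the function and buffer names are made up, bias correction and weight decay are omitted, and it is not the offloaded implementation in this branch.

```python
import torch

def adan_step(p, grad, state, lr=3.5e-6, betas=(0.98, 0.92, 0.99), eps=1e-8):
    """One simplified Adan update for a single parameter tensor (arXiv:2208.06677)."""
    beta1, beta2, beta3 = betas
    if not state:  # four state tensors per model parameter:
        state["exp_avg"] = torch.zeros_like(p)       # EMA of gradients
        state["exp_avg_diff"] = torch.zeros_like(p)  # EMA of gradient differences
        state["exp_avg_sq"] = torch.zeros_like(p)    # EMA of squared corrected gradients
        state["prev_grad"] = grad.clone()            # last step's gradient
    diff = grad - state["prev_grad"]
    state["exp_avg"].mul_(beta1).add_(grad, alpha=1 - beta1)
    state["exp_avg_diff"].mul_(beta2).add_(diff, alpha=1 - beta2)
    corrected = grad + beta2 * diff
    state["exp_avg_sq"].mul_(beta3).addcmul_(corrected, corrected, value=1 - beta3)
    state["prev_grad"].copy_(grad)
    denom = state["exp_avg_sq"].sqrt().add_(eps)
    update = state["exp_avg"] + beta2 * state["exp_avg_diff"]
    p.addcdiv_(update, denom, value=-lr)
```

The prev_grad buffer plus the three EMAs are what make Adan roughly twice as heavy per parameter as AdamW's exp_avg / exp_avg_sq pair.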
So I've been trying to raise the image quality bar for Flux Full Fine Tuning (not LoRA) on a consumer GPU, specifically a 5090. One area I found that was costing significant quality was the effectively forced use of the Adafactor optimizer, which sacrifices a lot of weight-update accuracy in order to reduce its GPU memory load. As a result, I started to wonder if it was possible to get AdamW working on a 5090 GPU.
Not only did I manage to get it working, but I managed to combine:
- the (N)AdamW family of optimizers
- Kahan summation
- a fused backward pass
- CPU offloading of the optimizer state

all in one optimizer, and the training quality difference between this and Adafactor is just amazing.
The key to getting it all working is CPU offloading. AdamW's two memory-hungry structures are the gradient momentum and squared gradient momentum values for each weight, but with a fused backward pass in place, these can be offloaded to the CPU after each parameter group of the model is updated, and then brought back on the next step. As a result, to use this branch, you'll need at least 128 GB of CPU memory.
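As a sketch of how a fused backward pass can interleave per-parameter optimizer steps with CPU offloading (this is not the branch's actual code; it assumes PyTorch 2.1+ for register_post_accumulate_grad_hook, and the helper name is made up):

```python
import torch

def attach_fused_offloaded_optimizer(model, lr=3.5e-6, device="cuda"):
    """Step each parameter as soon as its gradient is ready, keeping the
    optimizer state (exp_avg / exp_avg_sq) in CPU memory between steps."""
    optimizers = {}
    for p in model.parameters():
        if not p.requires_grad:
            continue
        opt = torch.optim.NAdam([p], lr=lr)  # one tiny optimizer per parameter
        optimizers[p] = opt

        def hook(param, opt=opt):
            state = opt.state[param]
            # Bring this parameter's large state tensors back to the GPU for the update.
            for k, buf in state.items():
                if torch.is_tensor(buf) and buf.numel() > 1:
                    state[k] = buf.to(device, non_blocking=True)
            opt.step()                       # update just this one parameter
            opt.zero_grad(set_to_none=True)  # free the gradient immediately
            # Push the large state tensors back out to CPU memory.
            state = opt.state[param]
            for k, buf in state.items():
                if torch.is_tensor(buf) and buf.numel() > 1:
                    state[k] = buf.to("cpu", non_blocking=True)

        p.register_post_accumulate_grad_hook(hook)
    return optimizers
```

Because only one parameter's optimizer state is ever resident in VRAM at a time, the momentum tensors can live in main memory, which is what lets an Adam-family optimizer fit alongside the full model on a single consumer GPU.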
The transfers back and forth to the CPU for the momentum tensors and the Kahan residuals make each training step pretty slow, around 1 minute per training step for batch size 5. Or 12 seconds per training image, to put it that way. But I've found I can successfully train Flux Dev from the base model all the way to fully trained in a single day, so the speed is fast enough to be practical.
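For readers unfamiliar with the "Kahan residuals" mentioned above, here is a minimal sketch (not the branch's code) of Kahan-style compensated updates for bf16 weights: the bf16 rounding error of each update is kept in a 2-byte-per-parameter residual buffer and folded back in on the next step.

```python
import torch

def kahan_bf16_update(param: torch.Tensor, update: torch.Tensor, residual: torch.Tensor):
    """Apply `update` (f32) to a bf16 `param`, tracking lost precision in `residual` (bf16)."""
    # Do the arithmetic in f32, re-adding whatever was lost to rounding last step.
    exact = param.float() + update + residual.float()
    rounded = exact.to(torch.bfloat16)
    # Store the part that bf16 rounding threw away, to be applied next time.
    residual.copy_((exact - rounded.float()).to(torch.bfloat16))
    param.copy_(rounded)
```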
In order to try this branch out, just fetch the branch and then change your command line to remove Adafactor, and instead include these parameters:
--optimizer_type NAdamOffload --kahan_summation --fused_backward_pass --train_batch_size 5 --blocks_to_swap 35 --mem_eff_save
and that's it. (Edit: Also delete any --seed parameter if you're continuing training a model you've already trained from Flux Dev base!)

(I've found batch size 5 works on my 5090, with enough VRAM left to still use my machine. Maybe batch size 1 will work on a 4090?)

You can also use AdamOffload, NAdamWOffload, etc. I'd start with NAdamOffload though, as the Nesterov NAdam variants have given me much better quality than Adam for each of my test datasets. I'd also avoid the W weight-decay variants for fine tuning, as weight decay might damage the existing model values when fine tuning, unlike when training a LoRA.

I've also found I can run this optimizer to further tune an already-trained model that was first trained with Adafactor, and I still see a big jump up in quality after just 15 minutes or so of training, so it should be pretty quick to see if this branch works for you. I'd recommend an LR of around 3.5e-6 as a starting point.