written by cybermeow
The following guidelines are open to discussion and are based on our own, necessarily limited experiments. Although, with hundreds of trained checkpoints and millions of generated images, this is probably by far the largest set of fine-tuning experiments performed in the Stable Diffusion community so far, we acknowledge that it is still rather limited and that many aspects have not been investigated.
For further details you can refer to our paper: Navigating Text-To-Image Customization: From LyCORIS Fine-Tuning to Model Evaluation
Also check Resources to learn more about Stable Diffusion fine-tuning.
To be taken with a grain of salt
| | Full | LoRA | LoHa | LoKr low factor | LoKr high factor |
|---|---|---|---|---|---|
| Fidelity | ★ | ● | ▲ | ◉ | ▲ |
| Flexibility | ★ | ● | ◉ | ▲ | ● |
| Diversity | ▲ | ◉ | ★ | ● | ★ |
| Size | ▲ | ● | ● | ● | ★ |
| Training Speed (Linear) | ★ | ● | ● | ★ | ★ |
| Training Speed (Conv) | ● | ★ | ▲ | ● | ● |

★ > ◉ > ● > ▲ [> means better; for Size, smaller is better]
- If you are looking for the “best” model, you should stick to native training and curate your dataset. Note that you can perform native training with LyCORIS by setting `algo=full` in `network_args`. By doing so, you can use the full model as a LoRA!
- If space is more of a concern than quality, you probably want to use LoRA. In this case, if you find that your model “does not learn well enough” (for example, some intricate details are not correctly captured), you can try LoKr with a low factor (like 4 to 8) and full dimension (by setting the dimension to an arbitrarily large number like 100000); see the command-line sketch below.
- On the contrary, if you find that your model “learns too well” (for example, some unwanted style from the training images leaks through, or it becomes hard to change the pose), you can try LoHa, LoKr with a large factor, a lower dimension, or even IA3 if you want some fun.
- Other formats such as DyLoRA and GLoRA were not studied in these experiments, so no recommendations here.
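For concreteness, here is a minimal sketch of how the two recommendations above translate into kohya sd-scripts arguments with the LyCORIS network module. Only the network-related flags are shown; the training options, paths, and learning rate are left out, and the factor and preset values are just example choices.

```bash
# Native training exposed as a LyCORIS module ("use the full model as a LoRA")
--network_module=lycoris.kohya --network_args "algo=full"

# LoKr with a low factor and full dimension (dimension set to an arbitrarily large number)
--network_module=lycoris.kohya --network_dim=100000 \
  --network_args "algo=lokr" "factor=4" "preset=attn-mlp"
```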
- Let’s make it clear: everything depends on the dataset and the hyperparameters. The comparison approximately holds if LoRA, LoHa, and LoKr have roughly the same size and are otherwise trained with the same hyperparameters. Things may also change drastically if you adjust other things such as `scaled_weight_norm`, because the weights produced by the different methods do not have the same scale (for example, in our experiments LoRA tends to have much smaller weights than LoHa and LoKr).
- LoHa’s dampening effect provides a new explanation for several observations made in the community so far:
    - LoHa is better at changing a character’s style, especially when the character is trained on images of a single style: this is because LoHa does not learn the style of the character that much.
    - LyCORIS is worse at complex characters, as pointed out for example by https://rentry.org/59xed3: this applies especially to LoHa and LoKr with large factors, and it happens because we sacrifice fidelity in exchange for better generalization. Of course, you can counter this by adjusting the dimension and factor.
    - LoHa is better at learning multiple concepts: pollution from other concepts is observed more when concepts are learned “too well”, so this does not happen as much with LoHa, thanks to the dampening.
- Let’s dig into the third point a bit deeper. Is a more underfitted model necessarily better at generalization? This may not be the case, as illustrated for example by SVDiff (in more detail, they show that combining multiple SVDiffs is worse than combining LoRAs). Moreover, native training seems to be good at both fidelity and generalization if things are properly tuned, while an ultra-small LoKr does not seem to perform well when we switch the base model. This makes me conjecture the following: we first need to fix the model capacity, as roughly measured by the parameter count. Then there is an inherent trade-off between learning the concept better and combining learned concepts better, but you can probably improve both by increasing the model size (so here comes a third trade-off!).
- Things become more complex when we consider the effect of captions. Could it be that LoHa simply attributes the learned concept more uniformly to the different words appearing in the caption? This warrants further investigation.
- Also pay attention to the effect of alpha. Fixing alpha to 1 and fixing alpha to half of the dimension are not the same thing when you increase the dimension! With the latter you would feel the concept is better learned, while with the former it could be quite the contrary (see the worked example below). See appendix B.1 of the paper for the relation between alpha, learning rate, and initialization.
- As for whether to train the convolutional layers, the difference is much harder to see, as we do not have a systematic way to evaluate “details” or “texture”. Recall also that SD is a latent diffusion model, so it is hard to really talk about details without understanding how the VAE reacts to changes in the latent space. I would personally recommend sticking to `preset=attn-mlp` in most cases.
- Sidenote: some people complain that increasing the batch size makes results worse. Remember that you should also increase the learning rate and/or the number of training epochs accordingly when you increase the batch size in order to achieve similar results.
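To make the alpha remark concrete, here is a worked example. It assumes the common kohya-style convention in which the low-rank update is multiplied by alpha / dim; if your trainer uses a different convention, the numbers change but the idea stays the same.

```latex
% Effective scale of the low-rank update, assuming the alpha/dim convention
\[ \text{scale} = \frac{\alpha}{\text{dim}} \]
% Alpha fixed at 1: the update shrinks as dim grows, so the concept can appear less learned
\[ \alpha = 1:\quad \text{dim} = 8 \Rightarrow \text{scale} = \tfrac{1}{8}, \qquad \text{dim} = 64 \Rightarrow \text{scale} = \tfrac{1}{64} \]
% Alpha set to dim/2: the scale stays constant, so increasing dim mainly adds capacity
\[ \alpha = \tfrac{\text{dim}}{2}:\quad \text{scale} = \tfrac{1}{2} \ \text{for any dim} \]
```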
On an RTX 4090, batch size 8, 49622 steps, AdamW8bit.
The time is the total training time estimated after running for about 500 steps.
Algorithm | Preset | Dim [Factor] | Time | VRAM (MiB) | Size |
---|---|---|---|---|---|
LoRA | attn-mlp | 1 | 4hr | 16410 | 1.7 M |
LoRA | attn-mlp | 8 | 4hr | 16358 | 9.5 M |
LoRA | attn-only | 32 | 3hr25min | 15222 | 18 M |
LoRA | attn-mlp | 32 | 4hr5min | 15420 | 37 M |
LoRA | full | 32 | 4hr30min | 17350 | 75 M |
LoRA | attn-mlp | 64 | 4hr5min | 16168 | 73 M |
LoRA (kohya) | | 64 | 4hr5min | 16348 | 73 M |
LoHa | full | 4 | 4hr10min | 15960 | 9.6 M |
LoHa | attn-only | 16 | 3hr30min | 15244 | 18 M |
LoHa | attn-mlp | 16 | 4hr10min | 16876 | 37 M |
LoHa | full | 16 | 5hr40min | 16750 | 75 M |
LoHa | attn-mlp | 32 | 4hr10min | 16294 | 73 M |
LoKr | attn-mlp | 1 [-1] | 3hr50min | 16812 | 940 K |
LoKr | attn-mlp | full [-1] | 3hr45min | 16806 | 1.6 M |
LoKr | attn-mlp | 8 [4] | 3hr55min | 16010 | 2.8 M |
LoKr | attn-mlp | full [8] | 3hr40min | 16034 | 12 M |
LoKr | attn-only | full [4] | 3hr17min | 15374 | 15 M |
LoKr | attn-mlp | full [4] | 3hr45min | 16116 | 43 M |
LoKr | full | full [4] | 5hr10min | 17916 | 113 M |
IA3 | | | 3hr7min | 14954 | 596 K |
Full | attn-only | | 3hr15min | 15912 | 233 M |
Full | attn-mlp | | 3hr50min | 18668 | 673 M |
Full | full | | 5hr20min | 23008 | 1.8 G |
Full (db) | | | 4hr47min | 22118 | 2 G (with 200 MB VAE) |
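As an illustration, the row “LoKr | attn-mlp | full [8]” corresponds to a kohya sd-scripts invocation roughly along these lines. The paths, learning rate, and dataset options below are placeholders rather than the exact values used for the benchmark.

```bash
# Sketch only: paths, learning rate, and dataset options are placeholders.
accelerate launch train_network.py \
  --pretrained_model_name_or_path=/path/to/base_model.safetensors \
  --train_data_dir=/path/to/dataset \
  --output_dir=/path/to/output \
  --network_module=lycoris.kohya \
  --network_dim=100000 \
  --network_args "algo=lokr" "preset=attn-mlp" "factor=8" \
  --train_batch_size=8 \
  --optimizer_type=AdamW8bit \
  --learning_rate=1e-4 \
  --max_train_steps=49622
```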