SD3 full finetuning LR question #487
-
Try to comment out these lines.
-
The trainer for a full model uses the BitFit technique for tuning, which freezes all of the weights and trains only the model's biases. I was wondering how it would behave with SD3, and yes, you can use a much higher LR and it will cook less, which is why I made it the default. However, just this morning I changed the default example configs so that this isn't applied out of the box; it's left there commented out as an example of how a setting can be applied conditionally.
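For anyone unfamiliar with BitFit, here is a minimal PyTorch sketch of the idea; it's illustrative only, and the `transformer` variable in the usage comment is an assumption rather than SimpleTuner's actual code:

```python
import torch

def bitfit_parameters(model: torch.nn.Module):
    # Freeze every weight; leave only the bias terms trainable (BitFit).
    trainable = []
    for name, param in model.named_parameters():
        param.requires_grad = name.endswith("bias")
        if param.requires_grad:
            trainable.append(param)
    tuned = sum(p.numel() for p in trainable)
    total = sum(p.numel() for p in model.parameters())
    print(f"BitFit: training {tuned:,} of {total:,} parameters")
    return trainable

# Hypothetical usage with an SD3-style transformer and a relatively high LR:
# optimizer = torch.optim.AdamW(bitfit_parameters(transformer), lr=1e-5)
```

Because only a tiny fraction of the parameters move, the model tolerates a much larger learning rate before it starts to degrade.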
-
See some experiments here: https://wandb.ai/bghira/sd3-training?nw=nwuserbghira. Generally, with the weights and biases fully unfrozen, the model will cook no matter what LR you set. Either it's as if training does nothing to the model at all, or suddenly it's bearing down like the boulder behind Indiana Jones: it picks up all the worst parts of the dataset and then fries itself. You can tell it's frying because it degrades into square-grid nonsense and then loses depth, contrast, and prompt adherence, in that order.
-
In your tests, you haven't been using BitFit?
-
So far in my tests, SimpleTuner can overfit on a single image or a handful of images, but the learning effect drops off sharply with dozens or more images. Even with extreme training settings where overfitting should be inevitable, it never fits.
-
I've been experimenting with LoRA training on a larger dataset. Setting --max_grad_norm=0.01 seems to be beneficial for SD3 too, in addition to PixArt; it allows training at a higher learning rate for longer before cooking. Also, setting --weighting_scheme=none has helped with anatomical cohesion (I modified helpers/arguments.py to allow this). This gets rid of logit-normal timestep sampling so that earlier and later timesteps get more training; the earlier timesteps should be responsible for anatomic features. Since the SD3 base model has been trained with logit-normal sampling, it may be the root cause of all the body horror people are seeing with SD3.

I was able to reach about 30k steps with a learning rate of 2e-4 before the LoRA really started overcooking. I was training at batch size 1 just to get some quick results. I was also running with --lora_rank=128 --lora_alpha=128, but these could maybe be increased further. Maybe I'll also need to try full finetuning.
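To make the weighting-scheme point concrete, here's a rough sketch of how logit-normal sampling differs from uniform ("none") sampling of the normalized timestep; the function name and defaults below are my own illustration, not SimpleTuner's or diffusers' actual API:

```python
import torch

def sample_u(batch_size: int, scheme: str = "logit_normal",
             logit_mean: float = 0.0, logit_std: float = 1.0) -> torch.Tensor:
    # Draw the normalized timestep u in (0, 1) used to pick sigmas/timesteps.
    # "logit_normal": sigmoid(N(mean, std)) concentrates training around the
    #   middle of the noise schedule (what SD3 was trained with).
    # "none": uniform, so very early (global structure / anatomy) and very
    #   late (fine detail) timesteps get equal coverage.
    if scheme == "logit_normal":
        return torch.sigmoid(torch.randn(batch_size) * logit_std + logit_mean)
    return torch.rand(batch_size)  # scheme == "none"

# Compare how often each scheme visits the extremes of the schedule:
for scheme in ("logit_normal", "none"):
    u = sample_u(100_000, scheme)
    tail_fraction = ((u < 0.1) | (u > 0.9)).float().mean().item()
    print(f"{scheme}: fraction with u < 0.1 or u > 0.9 = {tail_fraction:.3f}")
```

With uniform sampling, roughly 20% of draws land in the outer tails of the schedule versus only a few percent under logit-normal, which is consistent with the idea that the early (structure/anatomy) timesteps see far more training signal when --weighting_scheme=none.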


-
I have been experimenting with different learning rates.
With LR 1e-6, there doesn't seem to be much change after 4k steps at a batch size of ~25.
With LR 1e-5, batch size 27, there was only mild change after 600 steps.
In the last run with LR 1e-4, batch size 27, after 1200 steps the model seemed to improve on the style, but it is still not consistent and not very well learned. Even 1800 steps don't look like enough.
LR 1e-4 for a full finetune is a huge LR that would have nuked 1.5 and SDXL. Does it make sense that even with such a huge LR, the model is learning so slowly? Is it possible that the LR is being ignored, normalized, or automatically adjusted somehow?
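As a sanity check (a generic PyTorch sketch, not anything SimpleTuner-specific; the optimizer and scheduler here are placeholders for whatever the trainer actually constructs), logging the learning rate the optimizer sees each step would rule out the LR being silently overridden:

```python
import torch

def log_effective_lr(optimizer, lr_scheduler=None, step=0):
    # Print the LR each parameter group is actually using this step.
    for i, group in enumerate(optimizer.param_groups):
        print(f"step {step}, group {i}: lr={group['lr']:.2e}")
    if lr_scheduler is not None:
        # torch LR schedulers report the most recently applied LR(s) here.
        print(f"step {step}, scheduler last_lr={lr_scheduler.get_last_lr()}")

# Dummy usage so the check is self-contained:
model = torch.nn.Linear(4, 4)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
scheduler = torch.optim.lr_scheduler.ConstantLR(optimizer, factor=1.0)
log_effective_lr(optimizer, scheduler, step=0)
```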
multidatabackend.json:
[ { "id": "all_dataset", "type": "local", "instance_data_dir": "/workspace/input/dataset", "crop": false, "crop_style": "random", "crop_aspect": "preserve", "resolution": 0.5, "resolution_type": "area", "minimum_image_size": 0.1, "maximum_image_size": 1.0, "target_downsample_size":0.55, "prepend_instance_prompt": false, "instance_prompt": null, "only_instance_prompt": false, "caption_strategy": "textfile", "cache_dir_vae": "/workspace/cache_images/", "vae_cache_clear_each_epoch": false, "probability": 1.0, "repeats": 1, "text_embeds": "alt-embed-cache", "skip_file_discovery": "", "preserve_data_backend_cache": true }, { "id": "alt-embed-cache", "dataset_type": "text_embeds", "default": true, "type": "local", "cache_dir": "/workspace/cache_text_embeds/" } ]sdxl-env.sh: