Flux LoRA training relaunch error when using Automagic Optimizer. #237

Open
AfterHAL opened this issue Dec 16, 2024 · 0 comments

This is for bugs only

Did you already ask in the discord?
No

You verified that this is a bug and not a feature request or question by asking in the discord?
Yes

Describe the bug

I've been trying the Automagic optimizer for a week, and I get this error (KeyError: 'lr_mask') when I restart a Flux LoRA training after a clean stop (Ctrl-C).
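
Repro outline, in case it helps (the config filename below is hypothetical; run.py is the entry point shown in the traceback):

  1. python run.py config/MaisonClose_L02_AutoM_GAS1.yaml
  2. Ctrl-C after at least one checkpoint (and optimizer.pt) has been saved
  3. Relaunch the same command; the resume path then raises KeyError: 'lr_mask'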

The training parameters are:

network:
  type: "lora"
  linear: 32
  linear_alpha: 32
  # (no network_kwargs params)
train:
  optimizer: "automagic"
  lr: 1.0e-5 # needed with automagic ?
  optimizer_params:
    min_lr: 1e-6
    max_lr: 1e-4
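
A quick way to confirm that the saved optimizer state is missing the key (a standalone sketch, assuming optimizer.pt is a regular torch.optim state dict saved with torch.save; the path comes from the log below):

# sketch: list the per-parameter state keys in the saved optimizer state
import torch

state_dict = torch.load(
    "output/MaisonClose_L02_AutoM_GAS1/optimizer.pt", map_location="cpu"
)
for idx, param_state in state_dict.get("state", {}).items():
    print(idx, sorted(param_state.keys()))  # 'lr_mask' expected but absent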

The error is:

#############################################
# Running job: MaisonClose_L02_AutoM_GAS1
#############################################


Running  1 process
Loading Flux model
Loading transformer
Quantizing transformer
Loading vae
Loading t5
You set `add_prefix_space`. The tokenizer needs to be converted from the slow tokenizers
Downloading shards: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:00<00:00, 3470.67it/s]
Loading checkpoint shards: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:00<00:00,  2.30it/s]
Quantizing T5
Loading clip
making pipe
preparing
create LoRA network. base dim (rank): 24, alpha: 24
neuron dropout: p=None, rank dropout: p=None, module dropout: p=None
create LoRA for Text Encoder: 0 modules.
create LoRA for U-Net: 494 modules.
enable LoRA for U-Net
#### IMPORTANT RESUMING FROM output/MaisonClose_L02_AutoM_GAS1/MaisonClose_L02_AutoM_GAS1_000000500.safetensors ####
Loading from output/MaisonClose_L02_AutoM_GAS1/MaisonClose_L02_AutoM_GAS1_000000500.safetensors
Missing keys: []
Found step 500 in metadata, starting from there
Total training paramiters: 128,876,544
Loading optimizer state from output/MaisonClose_L02_AutoM_GAS1/optimizer.pt
Updating optimizer LR from params
Dataset: MaisonCloseSet02
  -  Preprocessing image dimensions
100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 835/835 [00:43<00:00, 19.01it/s]
  -  Found 835 images
Bucket sizes for MaisonCloseSet02:
384x576: 835 files
1 buckets made
Dataset: MaisonCloseSet02
  -  Preprocessing image dimensions
100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 835/835 [00:00<00:00, 103058.70it/s]
  -  Found 835 images
Bucket sizes for MaisonCloseSet02:
576x896: 835 files
1 buckets made
Dataset: MaisonCloseSet02
  -  Preprocessing image dimensions
100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 835/835 [00:00<00:00, 114531.01it/s]
  -  Found 835 images
Bucket sizes for MaisonCloseSet02:
832x1216: 835 files
1 buckets made
MaisonClose_L02_AutoM_GAS1:   2%|█▋                                                                                                 | 500/30000 [00:00<?, ?it/s]Error running job: 'lr_mask'

========================================
Result:
 - 0 completed jobs
 - 1 failure
========================================
Traceback (most recent call last):
  File "/workspace/apps/ai-toolkit0/run.py", line 90, in <module>
    main()
  File "/workspace/apps/ai-toolkit0/run.py", line 86, in main
    raise e
  File "/workspace/apps/ai-toolkit0/run.py", line 78, in main
    job.run()
  File "/mnt/d/TODAI/apps/ai-toolkit0/jobs/ExtensionJob.py", line 22, in run
    process.run()
  File "/mnt/d/TODAI/apps/ai-toolkit0/jobs/process/BaseSDTrainProcess.py", line 1826, in run
    loss_dict = self.hook_train_loop(batch_list)
  File "/mnt/d/TODAI/apps/ai-toolkit0/extensions_built_in/sd_trainer/SDTrainer.py", line 1647, in hook_train_loop
    self.scaler.step(self.optimizer)
  File "/mnt/d/TODAI/apps/ai-toolkit0/venv/lib/python3.10/site-packages/torch/amp/grad_scaler.py", line 457, in step
    retval = self._maybe_opt_step(optimizer, optimizer_state, *args, **kwargs)
  File "/mnt/d/TODAI/apps/ai-toolkit0/venv/lib/python3.10/site-packages/torch/amp/grad_scaler.py", line 352, in _maybe_opt_step
    retval = optimizer.step(*args, **kwargs)
  File "/mnt/d/TODAI/apps/ai-toolkit0/venv/lib/python3.10/site-packages/torch/optim/lr_scheduler.py", line 137, in wrapper
    return func.__get__(opt, opt.__class__)(*args, **kwargs)
  File "/mnt/d/TODAI/apps/ai-toolkit0/venv/lib/python3.10/site-packages/torch/optim/optimizer.py", line 487, in wrapper
    out = func(*args, **kwargs)
  File "/mnt/d/TODAI/apps/ai-toolkit0/venv/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
    return func(*args, **kwargs)
  File "/mnt/d/TODAI/apps/ai-toolkit0/toolkit/optimizers/automagic.py", line 249, in step
    lr_mask = state['lr_mask'].to(torch.float32)
KeyError: 'lr_mask'
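
A possible workaround, as a sketch only (not a confirmed fix; it assumes the usual torch.optim loop variables group and p are in scope at that point in Automagic.step()), is to re-initialize the mask lazily in toolkit/optimizers/automagic.py instead of assuming it survived the reload:

# sketch: guard the line that currently raises, re-seeding the mask
# from the group's base lr when a resumed state lacks it
if 'lr_mask' not in state:
    # hypothetical re-init: per-element learning rates seeded from the
    # base lr; any per-parameter rates learned before the stop are lost
    state['lr_mask'] = torch.full_like(p, group['lr'], dtype=torch.float32)
lr_mask = state['lr_mask'].to(torch.float32)

The real bug is presumably that 'lr_mask' is not round-tripped through the optimizer.pt save/load path, so the proper fix would be to serialize and restore it rather than re-initialize it.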