Flux LoRA training relaunch error when using Automagic Optimizer. #237

Open
AfterHAL opened this issue Dec 16, 2024 · 0 comments

This is for bugs only

Did you already ask in the discord?
No

You verified that this is a bug and not a feature request or question by asking in the discord?
Yes

Describe the bug

I've been trying the Automagic optimizer for a week, and I get this error (KeyError: 'lr_mask') when I restart a Flux LoRA training after a clean stop (Ctrl-C).
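
Repro outline, in case it helps (the config filename below is hypothetical; run.py is the entry point shown in the traceback):

  1. python run.py config/MaisonClose_L02_AutoM_GAS1.yaml
  2. Ctrl-C after at least one checkpoint (and optimizer.pt) has been saved
  3. Relaunch the same command; the resume path then raises KeyError: 'lr_mask'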

The training parameters are:

network:
  type: "lora"
  linear: 32
  linear_alpha: 32
  # (no network_kwargs params)
train:
  optimizer: "automagic"
  lr: 1.0e-5 # needed with automagic ?
  optimizer_params:
    min_lr: 1e-6
    max_lr: 1e-4
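
A quick way to confirm that the saved optimizer state is missing the key (a standalone sketch, assuming optimizer.pt is a regular torch.optim state dict saved with torch.save; the path comes from the log below):

# sketch: list the per-parameter state keys in the saved optimizer state
import torch

state_dict = torch.load(
    "output/MaisonClose_L02_AutoM_GAS1/optimizer.pt", map_location="cpu"
)
for idx, param_state in state_dict.get("state", {}).items():
    print(idx, sorted(param_state.keys()))  # 'lr_mask' expected but absent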

The error is:

#############################################
# Running job: MaisonClose_L02_AutoM_GAS1
#############################################


Running  1 process
Loading Flux model
Loading transformer
Quantizing transformer
Loading vae
Loading t5
You set `add_prefix_space`. The tokenizer needs to be converted from the slow tokenizers
Downloading shards: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:00<00:00, 3470.67it/s]
Loading checkpoint shards: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:00<00:00,  2.30it/s]
Quantizing T5
Loading clip
making pipe
preparing
create LoRA network. base dim (rank): 24, alpha: 24
neuron dropout: p=None, rank dropout: p=None, module dropout: p=None
create LoRA for Text Encoder: 0 modules.
create LoRA for U-Net: 494 modules.
enable LoRA for U-Net
#### IMPORTANT RESUMING FROM output/MaisonClose_L02_AutoM_GAS1/MaisonClose_L02_AutoM_GAS1_000000500.safetensors ####
Loading from output/MaisonClose_L02_AutoM_GAS1/MaisonClose_L02_AutoM_GAS1_000000500.safetensors
Missing keys: []
Found step 500 in metadata, starting from there
Total training paramiters: 128,876,544
Loading optimizer state from output/MaisonClose_L02_AutoM_GAS1/optimizer.pt
Updating optimizer LR from params
Dataset: MaisonCloseSet02
  -  Preprocessing image dimensions
100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 835/835 [00:43<00:00, 19.01it/s]
  -  Found 835 images
Bucket sizes for MaisonCloseSet02:
384x576: 835 files
1 buckets made
Dataset: MaisonCloseSet02
  -  Preprocessing image dimensions
100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 835/835 [00:00<00:00, 103058.70it/s]
  -  Found 835 images
Bucket sizes for MaisonCloseSet02:
576x896: 835 files
1 buckets made
Dataset: MaisonCloseSet02
  -  Preprocessing image dimensions
100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 835/835 [00:00<00:00, 114531.01it/s]
  -  Found 835 images
Bucket sizes for MaisonCloseSet02:
832x1216: 835 files
1 buckets made
MaisonClose_L02_AutoM_GAS1:   2%|█▋                                                                                                 | 500/30000 [00:00<?, ?it/s]Error running job: 'lr_mask'

========================================
Result:
 - 0 completed jobs
 - 1 failure
========================================
Traceback (most recent call last):
  File "/workspace/apps/ai-toolkit0/run.py", line 90, in <module>
    main()
  File "/workspace/apps/ai-toolkit0/run.py", line 86, in main
    raise e
  File "/workspace/apps/ai-toolkit0/run.py", line 78, in main
    job.run()
  File "/mnt/d/TODAI/apps/ai-toolkit0/jobs/ExtensionJob.py", line 22, in run
    process.run()
  File "/mnt/d/TODAI/apps/ai-toolkit0/jobs/process/BaseSDTrainProcess.py", line 1826, in run
    loss_dict = self.hook_train_loop(batch_list)
  File "/mnt/d/TODAI/apps/ai-toolkit0/extensions_built_in/sd_trainer/SDTrainer.py", line 1647, in hook_train_loop
    self.scaler.step(self.optimizer)
  File "/mnt/d/TODAI/apps/ai-toolkit0/venv/lib/python3.10/site-packages/torch/amp/grad_scaler.py", line 457, in step
    retval = self._maybe_opt_step(optimizer, optimizer_state, *args, **kwargs)
  File "/mnt/d/TODAI/apps/ai-toolkit0/venv/lib/python3.10/site-packages/torch/amp/grad_scaler.py", line 352, in _maybe_opt_step
    retval = optimizer.step(*args, **kwargs)
  File "/mnt/d/TODAI/apps/ai-toolkit0/venv/lib/python3.10/site-packages/torch/optim/lr_scheduler.py", line 137, in wrapper
    return func.__get__(opt, opt.__class__)(*args, **kwargs)
  File "/mnt/d/TODAI/apps/ai-toolkit0/venv/lib/python3.10/site-packages/torch/optim/optimizer.py", line 487, in wrapper
    out = func(*args, **kwargs)
  File "/mnt/d/TODAI/apps/ai-toolkit0/venv/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
    return func(*args, **kwargs)
  File "/mnt/d/TODAI/apps/ai-toolkit0/toolkit/optimizers/automagic.py", line 249, in step
    lr_mask = state['lr_mask'].to(torch.float32)
KeyError: 'lr_mask'
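
A possible workaround, as a sketch only (not a confirmed fix; it assumes the usual torch.optim loop variables group and p are in scope at that point in Automagic.step()), is to re-initialize the mask lazily in toolkit/optimizers/automagic.py instead of assuming it survived the reload:

# sketch: guard the line that currently raises, re-seeding the mask
# from the group's base lr when a resumed state lacks it
if 'lr_mask' not in state:
    # hypothetical re-init: per-element learning rates seeded from the
    # base lr; any per-parameter rates learned before the stop are lost
    state['lr_mask'] = torch.full_like(p, group['lr'], dtype=torch.float32)
lr_mask = state['lr_mask'].to(torch.float32)

The real bug is presumably that 'lr_mask' is not round-tripped through the optimizer.pt save/load path, so the proper fix would be to serialize and restore it rather than re-initialize it.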