Replies: 6 comments 4 replies
- you need --vae_cache_preprocess
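A minimal sketch of where that flag could go, assuming the env-file style SimpleTuner config where extra CLI options are appended through TRAINER_EXTRA_ARGS (the variable name and config layout are assumptions; adjust to your version):

```bash
# config.env (sketch): TRAINER_EXTRA_ARGS and the env-file layout are
# assumptions here -- check your SimpleTuner version's config format.
# --vae_cache_preprocess pre-computes and caches the VAE latents before
# training starts, so each step reads cached latents instead of running
# every image through the VAE encoder again.
export TRAINER_EXTRA_ARGS="${TRAINER_EXTRA_ARGS} --vae_cache_preprocess"
```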
- I added
- can you try with just one GPU?
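If it helps, a minimal way to pin the test to one GPU, assuming the usual accelerate-based launch (train.py here is just a stand-in for the real entry point):

```bash
# Pin the run to a single visible GPU and launch one worker process.
# train.py is a placeholder for the actual SimpleTuner entry point
# (or whatever train.sh wraps in your setup); other flags are omitted.
CUDA_VISIBLE_DEVICES=0 accelerate launch --num_processes=1 train.py
```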
- for what it's worth, 40 seconds is a lot more than I'd expect, especially for a LoRA on a 2B model. It should be more like 3-5 seconds per step at worst, and 10 seconds per step when training from an S3 backend. Edit: make sure you're not using DoRA; it's slower.
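For scale: 4 GPUs with a batch size of 10 is 40 images per optimizer step, so 40-50 s/step works out to roughly one image per second, while 3-5 s/step would be closer to 8-13 images per second.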
- Tested with 1 GPU (A6000, 48GB): with a batch size of 10 it still takes ~40 seconds per iteration.
- I finally found out the problem: SimpleTuner has defaulted to use
Hi,
Thanks for this nice repo!
I have been trying to train a LoRA on SD3 using the multi-GPU setting. I am on 4 A6000 GPUs (48GB each), and my dataset is 1000 1024x1024 images. I set the batch size to 10 with no gradient accumulation. Each iteration takes 40-50 seconds to complete.
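For concreteness, the launch corresponds roughly to a standard 4-process accelerate invocation like the sketch below (not my exact command; flag names are the usual SimpleTuner ones and the entry point is abbreviated):

```bash
# Sketch of the 4-GPU launch described above; train.py stands in for the
# SimpleTuner entry point, and model/dataset arguments are left out.
accelerate launch --multi_gpu --num_processes=4 train.py \
  --train_batch_size=10 \
  --gradient_accumulation_steps=1 \
  --resolution=1024
```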
This training speed seems dramatically slower compared with many of the other logs I find in this repo's issues. Is this normal, or do I have something set up wrong?
Best regards and thanks in advance.