I just tested FLUX Fine Tuning on Windows (RTX 5090) and Linux (RunPod RTX 5090 and Massed Compute RTX 6000 PRO)
The thing is, on Linux the training speed is at least 25% faster than on Windows.
I am using the Adafactor optimizer and full bf16 training.
How could that be?
No block swap was used in any of the tests, since everything fits into 32 GB of VRAM.
I am using Torch 2.8 and CUDA 12.9, with exactly the same libraries on both platforms.
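
For what it's worth, here is the kind of parity check I can run on both machines to confirm the stacks really match (plain PyTorch introspection, nothing specific to the training scripts):

```python
import torch

# Print the library versions and backend flags that most often differ between installs
print("torch:", torch.__version__)
print("CUDA (torch build):", torch.version.cuda)
print("cuDNN:", torch.backends.cudnn.version())
print("GPU:", torch.cuda.get_device_name(0))
print("compute capability:", torch.cuda.get_device_capability(0))
print("TF32 matmul allowed:", torch.backends.cuda.matmul.allow_tf32)
print("cudnn.benchmark:", torch.backends.cudnn.benchmark)
```

If any of these differ between the two machines (especially the cuDNN version or the backend flags), that alone could explain a per-step speed gap.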
There wasn't this much of a difference before.
Moreover, I was getting about 8.5 seconds/it on an RTX 3090 Ti on Windows before, and now the exact same config runs at around 11 seconds/it on Windows.
How do you think we can debug this? What could be the culprit? @kohya-ss
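
One way I could narrow it down: run a raw bf16 matmul micro-benchmark outside the training script on each machine. If the ~25% gap shows up there too, it points at the driver/kernel level (Windows' WDDM driver model is a common suspect) rather than at the training code or data loading. This is just a generic PyTorch timing sketch, not tied to the actual training config:

```python
import torch

assert torch.cuda.is_available(), "CUDA GPU required"

# Large bf16 matmul -- roughly the kind of work that dominates a transformer training step
n = 8192
a = torch.randn(n, n, device="cuda", dtype=torch.bfloat16)
b = torch.randn(n, n, device="cuda", dtype=torch.bfloat16)

# Warmup so clocks and cuBLAS heuristics settle before timing
for _ in range(10):
    torch.matmul(a, b)
torch.cuda.synchronize()

start = torch.cuda.Event(enable_timing=True)
end = torch.cuda.Event(enable_timing=True)

iters = 100
start.record()
for _ in range(iters):
    torch.matmul(a, b)
end.record()
torch.cuda.synchronize()

ms_per_iter = start.elapsed_time(end) / iters  # elapsed_time() returns milliseconds
tflops = (2 * n**3) / (ms_per_iter / 1e3) / 1e12
print(f"{ms_per_iter:.3f} ms/iter, ~{tflops:.1f} TFLOPS (bf16 matmul)")
```

If this number comes out identical on both OSes while the training step does not, the next suspects would be data loading, latent/text-encoder caching, or per-step host overhead.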