-
It isn't as simple as that. Speeds vary depending on settings, and the GPU is only part of it. I speed-test by running a simple training job with batch sizes 1, 2, 4 and 6 at whatever dim I'm interested in, then divide each time by the batch size. The result is curious: batch size 4 comes out fastest per sample. You'd expect the scaling to be constant, but it isn't: batch size 6 is slower than 4 yet faster than 2, and batch size 1 is the slowest of all. What's going on? I don't know exactly, but I do see a small CPU spike after every GPU cycle. Since I can't keep everything in GPU VRAM (I only have 16 GB of it), my gradient checkpoints go to disk, and I suspect that going back and forth is where the delay comes from. The only test you can do is literally testing: you can't know what you're getting, because you don't know what the exact hardware on the other side is or what else it is doing.
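A minimal sketch of that kind of batch-size sweep (not Kohya's actual training loop; it assumes only PyTorch and a CUDA GPU, and uses a dummy MLP as a stand-in workload) could look like this:

```python
# Minimal batch-size sweep sketch: times a dummy training step and reports
# both seconds per iteration and seconds per sample.
import time
import torch
import torch.nn as nn

def time_step(model, batch_size, dim=1024, steps=20, device="cuda"):
    """Return average seconds per optimizer step for one batch size."""
    opt = torch.optim.AdamW(model.parameters(), lr=1e-4)
    x = torch.randn(batch_size, dim, device=device)
    y = torch.randn(batch_size, dim, device=device)
    # Warm-up so one-time CUDA setup does not skew the measurement.
    for _ in range(3):
        nn.functional.mse_loss(model(x), y).backward()
        opt.step()
        opt.zero_grad()
    torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(steps):
        nn.functional.mse_loss(model(x), y).backward()
        opt.step()
        opt.zero_grad()
    torch.cuda.synchronize()
    return (time.perf_counter() - start) / steps

model = nn.Sequential(
    nn.Linear(1024, 4096), nn.GELU(), nn.Linear(4096, 1024)
).to("cuda")
for bs in (1, 2, 4, 6):
    sec = time_step(model, bs)
    # Dividing by the batch size gives the per-sample cost, which is the
    # number that makes the odd scaling visible.
    print(f"batch {bs}: {sec:.3f} s/it, {sec / bs:.3f} s/sample")
```

Raw s/it always grows with batch size; it's the per-sample column that shows the non-monotonic scaling described above.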
-
Hi all,
would it be possible (or maybe it exists already) to have a small Kohya script somewhere in the tools/ folder that does nothing else than dry-run a short training job, just to report a baseline training speed on your current machine (X seconds / iteration)?
Motivation:
I use RunPod to train models, as many others do (because I only own an office laptop). I always choose a Pod with an RTX 3090, but the performance still differs a lot: on some days I get 1.5 sec/it, and sometimes it gets really bad and drops to ~4.5 sec/it. I would like to test this before setting up Kohya with all its dependencies and requirements, to see as quickly as possible whether I caught a "good" or a "bad" Pod.
The training checkpoints themselves are completely irrelevant and wouldn't even need to be saved.
What do you think?
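For illustration, a rough sketch of the kind of standalone dry-run benchmark I have in mind (this is not an existing Kohya tool; it assumes only that torch is installed, uses a small synthetic conv workload, and saves nothing to disk, so the numbers are only useful for comparing pods, not as real LoRA training speeds):

```python
# Quick GPU dry-run benchmark: times a fixed synthetic workload and prints s/it.
import time
import torch
import torch.nn as nn

def benchmark(steps=50, batch_size=2, resolution=512, device="cuda"):
    # Small conv stack as a stand-in workload; no checkpoints are written.
    model = nn.Sequential(
        nn.Conv2d(4, 128, 3, padding=1),
        nn.SiLU(),
        nn.Conv2d(128, 128, 3, padding=1),
        nn.SiLU(),
        nn.Conv2d(128, 4, 3, padding=1),
    ).to(device)
    opt = torch.optim.AdamW(model.parameters(), lr=1e-4)
    latents = torch.randn(
        batch_size, 4, resolution // 8, resolution // 8, device=device
    )

    # Warm-up steps are excluded from the timing.
    for _ in range(5):
        nn.functional.mse_loss(model(latents), latents).backward()
        opt.step()
        opt.zero_grad()
    torch.cuda.synchronize()

    start = time.perf_counter()
    for _ in range(steps):
        nn.functional.mse_loss(model(latents), latents).backward()
        opt.step()
        opt.zero_grad()
    torch.cuda.synchronize()
    return (time.perf_counter() - start) / steps

if __name__ == "__main__":
    print(f"{benchmark():.2f} s/it on {torch.cuda.get_device_name(0)}")
```

Something this small could be run right after the Pod boots, before installing the full Kohya requirements, to decide whether to keep or discard the instance.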