Reason for higher VRAM use than base library? #803
Replies: 4 comments
-
LLamaSharp itself doesn't allocate any GPU memory; that's all under the control of llama.cpp. So it seems likely this is some kind of configuration error, though it's hard to say what, since all the basic settings should have pretty sensible defaults. The setting that will consume the most GPU memory by far is the context size. By default this is unspecified, which means it will use the full context the model is trained with (the full 16k up front). I don't know how the other implementations you mentioned handle that, but my guess would be they have a smaller default context size.
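For reference, a minimal sketch of what I mean (the model path and layer count are placeholders; `ModelParams` and `LLamaWeights` are the usual LLamaSharp entry points):

```csharp
using LLama;
using LLama.Common;

// Placeholder path - substitute the actual GGUF file.
var parameters = new ModelParams("path/to/model.gguf")
{
    // Left unset, llama.cpp allocates the KV cache for the full context
    // the model was trained with, which is where most of the GPU memory
    // goes. Setting it explicitly caps that allocation.
    ContextSize = 16384,

    // Model-dependent; offloads all layers to the GPU.
    GpuLayerCount = 33
};

using var weights = LLamaWeights.LoadFromFile(parameters);
using var context = weights.CreateContext(parameters);
```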
-
I understand that, which is why I'm puzzled by the discrepancy between the results. Yes, I do set the context size. It's technically a 32K model, but I set the context size to 16K. Even with a native 8K L3 model, while it's of course no longer running out of memory, it's still taking noticeably more VRAM. If it were just a few megabytes I'd brush that off as overhead, but it's still about a gigabyte more than llama.cpp run from the command line. It's interesting to note that it only happens during inference. If the model is just loaded but not run, the size is consistent across all 3 mentioned implementations. The moment inference is run, however, Sharp's memory usage increases to ~12GB while base llama.cpp sits at exactly 11GB. I suspect it's a default setting somewhere, but it's not the context size setting alone. I hoped you could point me in the right direction. It's not that big of a deal in the grand scheme of things, since I can work with slightly stronger quantization, and LLamaSharp is pretty straightforward to use. It's just weird, you know?
-
If the difference is only during inference, it must be one of the context params: https://github.com/SciSharp/LLamaSharp/blob/master/LLama/Abstractions/IContextParams.cs. Looking through that list, the only ones that I think should significantly change GPU memory usage are:
- `TypeK` (KV cache key type)
- `TypeV` (KV cache value type)

So if possible, double check that those values are exactly the same. Another possibility, if that doesn't turn anything up, is that our binaries are currently 3 weeks old (0.13.0). You could try comparing to this version. If llama.cpp has shaved off an entire gigabyte of memory usage in the last 3 weeks, I'll be impressed (and very happy!).
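A quick sketch of what checking that would look like (assuming the `GGMLType` enum from `LLama.Native`, and that the command-line run uses llama.cpp's default f16 cache types):

```csharp
using LLama.Common;
using LLama.Native;

var parameters = new ModelParams("path/to/model.gguf")
{
    ContextSize = 16384,

    // KV cache element types. If the command-line llama.cpp run uses a
    // different cache type (e.g. via -ctk / -ctv) than the managed side,
    // the KV cache size - and therefore VRAM during inference - will differ.
    // Leaving these null falls through to llama.cpp's default (f16).
    TypeK = GGMLType.GGML_TYPE_F16,
    TypeV = GGMLType.GGML_TYPE_F16
};
```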
-
I'll check if tweaking TypeK and TypeV makes a difference. I already account for the other settings. Thanks for your help. I'll get back to you if I figure out what's happening.
I wish it had, but nope. Outside of a bunch of bug fixes, memory use has stayed the same (as far as I can tell). If you're looking for very optimized memory consumption, kobold.cpp is the one to look at. It also uses llama.cpp as a base, but they've added and changed quite a few things (I mention that casually, I'm not asking you to do the same here), shaving off roughly 10% compared to llama.cpp during inference.
-
I've been integrating LLamaSharp into a project for a bit, but I'm starting to notice that it uses a lot more VRAM than llama.cpp or kobold.cpp, particularly during inference. For instance, with 12GB of VRAM, I can normally load a Mistral 0.2 7B model in Q8_0 with 16k context, fill it, and still have 1-2GB available. With the same model, LLamaSharp immediately runs out of memory and crashes the moment inference is run, even with only about 100 tokens' worth of context out of the 16K (all settings being more or less equal: running on CUDA, with flash attention and all layers on the GPU). It's the same with L3 models.
Is there a reason why, or am I missing some hidden setting that's not documented in the, well, documentation?
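For reference, the setup is roughly the following sketch (the model filename, layer count, and the exact `FlashAttention` property are placeholders that depend on the model and the LLamaSharp version in use):

```csharp
using System;
using LLama;
using LLama.Common;

var parameters = new ModelParams("mistral-7b-instruct-v0.2.Q8_0.gguf") // placeholder filename
{
    ContextSize = 16384,   // 16k context
    GpuLayerCount = 33,    // all layers offloaded to the GPU (count is model-dependent)
    FlashAttention = true  // only if the installed LLamaSharp version exposes this flag
};

using var weights = LLamaWeights.LoadFromFile(parameters);
using var context = weights.CreateContext(parameters);
var executor = new InteractiveExecutor(context);

// Memory usage is fine after loading; the extra ~1GB only shows up here,
// once inference actually starts.
await foreach (var token in executor.InferAsync(
    "Hello!", new InferenceParams { MaxTokens = 128 }))
{
    Console.Write(token);
}
```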