Reason for higher VRAM use than base library? #803
Replies: 4 comments
-
LLamaSharp itself doesn't allocate any GPU memory; that's all under the control of llama.cpp. So it seems likely this is some kind of configuration error, though it's hard to say what, since all the basic settings should have pretty sensible defaults. The setting that will consume the most GPU memory by far is the context size. By default this is unspecified, which means it will use the full context the model is trained with (the full 16k up front). I don't know how the other implementations you mentioned handle that, but my guess would be they have a smaller default context size.
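For reference, a minimal sketch of what I mean (the model path and layer count are placeholders; `ModelParams` and `LLamaWeights` are the usual LLamaSharp entry points):

```csharp
using LLama;
using LLama.Common;

// Placeholder path - substitute the actual GGUF file.
var parameters = new ModelParams("path/to/model.gguf")
{
    // Left unset, llama.cpp allocates the KV cache for the full context
    // the model was trained with, which is where most of the GPU memory
    // goes. Setting it explicitly caps that allocation.
    ContextSize = 16384,

    // Model-dependent; offloads all layers to the GPU.
    GpuLayerCount = 33
};

using var weights = LLamaWeights.LoadFromFile(parameters);
using var context = weights.CreateContext(parameters);
```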
-
I understand that, which is why I'm puzzled by the discrepancy between the results. Yes, I do set the context size. It's technically a 32K model, but I set the context size to 16K. Even with a native 8K L3 model, while it's of course no longer running out of memory, it's still taking noticeably more VRAM. If it were just a few megabytes I'd brush that off as overhead, but it's still about a gigabyte more than llama.cpp run from the command line. It's interesting to note that it only happens during inference. If the model is just loaded but not run, the size is consistent across all 3 mentioned implementations. The moment inference is run, however, Sharp's memory usage increases to ~12GB while base llama.cpp sits at exactly 11GB. I suspect it's a default setting somewhere, but it's not the context size setting alone. I hoped you could point me in the right direction. It's not that big of a deal in the grand scheme of things, since I can work with slightly stronger quantization, and LLamaSharp is pretty straightforward to use. It's just weird, you know?
-
If the difference is only during inference, it must be one of the context params: https://github.com/SciSharp/LLamaSharp/blob/master/LLama/Abstractions/IContextParams.cs. Looking through that list, the only ones that I think should significantly change GPU memory usage are:
- `TypeK` (KV cache key type)
- `TypeV` (KV cache value type)

So if possible, double check that those values are exactly the same. Another possibility, if that doesn't turn anything up, is that our binaries are currently 3 weeks old (0.13.0). You could try comparing to this version. If llama.cpp has shaved off an entire gigabyte of memory usage in the last 3 weeks, I'll be impressed (and very happy!).
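A quick sketch of what checking that would look like (assuming the `GGMLType` enum from `LLama.Native`, and that the command-line run uses llama.cpp's default f16 cache types):

```csharp
using LLama.Common;
using LLama.Native;

var parameters = new ModelParams("path/to/model.gguf")
{
    ContextSize = 16384,

    // KV cache element types. If the command-line llama.cpp run uses a
    // different cache type (e.g. via -ctk / -ctv) than the managed side,
    // the KV cache size - and therefore VRAM during inference - will differ.
    // Leaving these null falls through to llama.cpp's default (f16).
    TypeK = GGMLType.GGML_TYPE_F16,
    TypeV = GGMLType.GGML_TYPE_F16
};
```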
-
I'll check if tweaking TypeK and TypeV makes a difference. I already account for the other settings. Thanks for your help. I'll get back to you if I figure out what's happening.
I wish it had, but nope. Outside of a bunch of bug fixes, memory use has stayed the same (as far as I can tell). If you're looking for very optimized memory consumption, kobold.cpp is the one to look at. It also uses llama.cpp as a base, but they've added and changed quite a few things (I mention that casually, I'm not asking you to do the same here), shaving off roughly 10% compared to llama.cpp during inference.
-
I've been integrating LLamaSharp into a project for a bit, but I'm starting to notice that it uses a lot more VRAM than llama.cpp or kobold.cpp, particularly during inference. For instance, with 12GB of VRAM, I can normally load a Mistral 0.2 7B model in Q8_0 with 16k context, fill it, and still have 1-2GB available. With the same model, LLamaSharp immediately runs out of memory and crashes the moment inference is run, even with only about 100 tokens' worth of context out of the 16K (all settings being more or less equal: running on CUDA, with flash attention and all layers on the GPU). It's the same with L3 models.
Is there a reason why, or am I missing some hidden setting that's not documented in the, well, documentation?
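For reference, the setup is roughly the following sketch (the model filename, layer count, and the exact `FlashAttention` property are placeholders that depend on the model and the LLamaSharp version in use):

```csharp
using System;
using LLama;
using LLama.Common;

var parameters = new ModelParams("mistral-7b-instruct-v0.2.Q8_0.gguf") // placeholder filename
{
    ContextSize = 16384,   // 16k context
    GpuLayerCount = 33,    // all layers offloaded to the GPU (count is model-dependent)
    FlashAttention = true  // only if the installed LLamaSharp version exposes this flag
};

using var weights = LLamaWeights.LoadFromFile(parameters);
using var context = weights.CreateContext(parameters);
var executor = new InteractiveExecutor(context);

// Memory usage is fine after loading; the extra ~1GB only shows up here,
// once inference actually starts.
await foreach (var token in executor.InferAsync(
    "Hello!", new InferenceParams { MaxTokens = 128 }))
{
    Console.Write(token);
}
```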