Name and Version
build: 4449 (8a1d9c2) with cc (Debian 13.3.0-11) 13.3.0 for x86_64-linux-gnu
Operating systems
Linux
Which llama.cpp modules do you know to be affected?
llama-cli
Command line
Problem description & steps to reproduce
If I load a dense model, warmup works as expected: the whole model is read into the OS cache.
However, if I load a large MoE model (e.g. DeepSeek-V3), warmup only loads a small portion of it (93 GB of 660 GB).
I tested this and made an inefficient brute-force patch to common.cpp that repeatedly calls llama_decode() during warmup.
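The patch itself is not reproduced here; purely as an illustration, a brute-force warmup loop along those lines might look like the sketch below. It assumes the llama.cpp C API as of this build (llama_decode(), llama_batch_get_one(), llama_kv_cache_clear(), llama_synchronize(), llama_n_vocab()); n_warmup_calls and the function name are made up for the example.

```cpp
// Sketch only, not the actual patch. Meant to sit in the warmup path of
// common.cpp, where `model` (llama_model *) and `lctx` (llama_context *)
// are already available; needs <random> on top of the existing includes.
#include <random>

static void warmup_bruteforce(llama_model * model, llama_context * lctx) {
    const int n_warmup_calls = 256;   // hypothetical tuning knob, see the numbers below
    const int n_vocab = llama_n_vocab(model);

    std::mt19937 rng(1234);
    std::uniform_int_distribution<llama_token> dist(0, n_vocab - 1);

    for (int i = 0; i < n_warmup_calls; ++i) {
        // Decode a single random token per call; each call routes through a
        // (hopefully different) subset of experts, paging their weights in.
        llama_token tok = dist(rng);
        llama_decode(lctx, llama_batch_get_one(&tok, 1));

        // Drop the KV cache between probes so the context stays clean.
        llama_kv_cache_clear(lctx);
    }

    llama_synchronize(lctx);
}
```

Because the tokens are random, expert coverage is only probabilistic, which matches the diminishing returns below.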
The benefit falls off sharply as the number of llama_decode() calls increases: with 256 calls roughly 540 GB of the model gets loaded, and with 1024 calls roughly 620 GB.
I think that ideally this function would detect the number of experts and route a single token through each expert via the router (this may need a function other than llama_decode() that is aware of the expert router).
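As a hedged sketch of only the detection half of that idea: the expert count can be read from GGUF metadata via the public C API, where the key is "<architecture>.expert_count" and the architecture comes from "general.architecture" (this assumes integer metadata is returned as a plain decimal string). Forcing the router to visit every expert would still need support below llama_decode() that does not exist today.

```cpp
// Sketch only: detect whether the loaded model is an MoE and, if so, how
// many experts it has, using GGUF metadata exposed by the public C API.
#include <cstdlib>
#include <string>

static int warmup_get_expert_count(const llama_model * model) {
    char arch[128] = {0};
    if (llama_model_meta_val_str(model, "general.architecture", arch, sizeof(arch)) < 0) {
        return 0;
    }

    // GGUF stores the expert count under "<architecture>.expert_count",
    // e.g. "llama.expert_count" for Mixtral or "deepseek2.expert_count"
    // for DeepSeek-V3.
    const std::string key = std::string(arch) + ".expert_count";
    char val[32] = {0};
    if (llama_model_meta_val_str(model, key.c_str(), val, sizeof(val)) < 0) {
        return 0; // dense model (or key missing): nothing expert-specific to warm up
    }
    return std::atoi(val);
}
```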
I could probably make a good PR for this with some guidance.
First Bad Commit
This has never worked, as far as I know.
Relevant log output
There is no log output for this problem; OS cache usage has to be watched with a separate tool.