Misc. bug: model warmup doesn't work correctly for MoE models #11163

Open
cpumaxx opened this issue Jan 9, 2025 · 0 comments

cpumaxx commented Jan 9, 2025

Name and Version

build: 4449 (8a1d9c2) with cc (Debian 13.3.0-11) 13.3.0 for x86_64-linux-gnu

Operating systems

Linux

Which llama.cpp modules do you know to be affected?

llama-cli

Command line

./build/bin/llama-cli -m ds3-q8.gguf -t 128 --numa distribute -c 8192 -ngl 0 --interactive-first --chat-template deepseek3

Problem description & steps to reproduce

If I load a dense model, warmup works correctly and loads the whole model into the OS cache.

However, if I load a large MoE model (e.g. DeepSeek V3), only a small portion gets loaded (93GB/660GB), presumably because the single warmup decode only touches the few experts its tokens happen to route to.
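
For context, as far as I can tell the warmup in common.cpp boils down to a single llama_decode() of one or two tokens, roughly like this (paraphrased from memory, omitting the encoder branch, so it may not match the tree exactly):

        // inside common.cpp, where <vector>, <algorithm> and llama.h are already available
        static void warmup_single_decode(llama_context * lctx, const llama_model * model, int n_batch) {
            std::vector<llama_token> tmp;
            const llama_token bos = llama_token_bos(model);
            const llama_token eos = llama_token_eos(model);
            if (bos != LLAMA_TOKEN_NULL) { tmp.push_back(bos); }
            if (eos != LLAMA_TOKEN_NULL) { tmp.push_back(eos); }
            if (tmp.empty())             { tmp.push_back(0);   }

            if (llama_model_has_decoder(model)) {
                // one decode of at most two tokens: each MoE layer only reads the
                // handful of experts these tokens route to, so most expert weights
                // are never read and never enter the page cache
                llama_decode(lctx, llama_batch_get_one(tmp.data(), std::min((int) tmp.size(), n_batch)));
            }
            llama_kv_cache_clear(lctx);
            llama_synchronize(lctx);
        }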

I tested this and made an inefficient brute-force patch to common.cpp, replacing the single llama_decode() call in the decoder warmup branch with a loop that feeds the decoder a different token on each pass:

            printf("decoding warmup tokens.");
            for (int i = 1; i < 256; i++) {
                llama_decode(lctx, llama_batch_get_one(tmp.data(), std::min(tmp.size(), (size_t) params.n_batch)));
                tmp.clear();
                tmp.push_back(i);
                printf(".");
            }
        } else { LOG_WRN("No Decoder Present. Warmup impossible"); }
        printf("\n");

The benefit falls off sharply with the number of llama_decode() calls: 256 calls get 540GB of the model loaded, and 1024 calls only get it to 620GB.
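
As a very rough baseline (my own back-of-the-envelope estimate, not a measurement): DeepSeek V3 routes each token to 8 of 256 experts per MoE layer, so if routing were uniform and independent per decode, 256 single-token decodes would already cover essentially every expert. Something like:

    #include <cmath>
    #include <cstdio>

    // naive coverage model under an assumed uniform, independent router
    // (real routers are not uniform, which is presumably why brute force plateaus)
    int main() {
        const double k = 8, E = 256; // DeepSeek V3: top-8 of 256 routed experts per MoE layer
        for (int N : {1, 256, 1024}) {
            std::printf("N = %4d decodes -> expected per-layer coverage %.4f\n",
                        N, 1.0 - std::pow(1.0 - k / E, N));
        }
        return 0;
    }

The fact that 256 decodes only reach ~540GB of 660GB suggests routing for these tiny one-token prompts is heavily clustered, which is why throwing more llama_decode() calls at it has diminishing returns.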

I think that ideally this function would detect the number of experts and route at least one token through each expert via the router (this may need a function other than llama_decode() that is aware of the expert router).
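
Something along these lines could work as a starting point, assuming the expert count can be pulled out of the GGUF metadata (the key name below and the n_passes heuristic are guesses on my part, not an existing router-aware API):

    #include <cstdlib>
    #include <vector>
    #include "llama.h"

    // sketch: decode batches of distinct tokens until every expert has had a
    // decent chance of being routed to, instead of a single one-token decode
    static void warmup_moe(llama_context * lctx, const llama_model * model, int n_batch) {
        char buf[32] = {0};
        int  n_expert = 0;
        // "deepseek2.expert_count" is an assumed key; the generic form is "<arch>.expert_count"
        if (llama_model_meta_val_str(model, "deepseek2.expert_count", buf, sizeof(buf)) > 0) {
            n_expert = atoi(buf);
        }
        if (n_expert <= 1) {
            return; // dense model: the existing single-decode warmup is enough
        }

        const int n_passes = 8 * n_expert; // arbitrary knob: a few distinct tokens per expert
        std::vector<llama_token> tmp;
        for (int i = 1; i <= n_passes; i += n_batch) {
            tmp.clear();
            for (int j = 0; j < n_batch && i + j <= n_passes; ++j) {
                tmp.push_back((llama_token) (i + j)); // arbitrary token ids, assumed in-vocab
            }
            llama_decode(lctx, llama_batch_get_one(tmp.data(), (int32_t) tmp.size()));
            llama_kv_cache_clear(lctx);
        }
    }

Decoding a batch of many distinct tokens should touch more experts per llama_decode() call than my one-token-per-call loop above, but a proper fix probably still needs something router-aware inside llama.cpp to guarantee full coverage.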

I could probably make a good PR for this with some guidance.

First Bad Commit

This has never worked correctly, as far as I know.

Relevant log output

There is no log output for this problem; you need to watch OS cache usage with an external tool to see it.